
Load-Testing LLMs Using LLMPerf | Towards Data Science


Deploying your Large Language Model (LLM) isn't necessarily the final step in productionizing your Generative AI application. An often forgotten, yet crucial part of the MLOps lifecycle is properly load testing your LLM and ensuring it is ready to withstand your expected production traffic. Load testing, at a high level, is the practice of testing your application, or in this case your model, with the traffic it would expect in a production environment to ensure that it is performant.

In the past we've covered load testing traditional ML models using open source Python tools such as Locust. Locust helps capture general performance metrics such as requests per second (RPS) and latency percentiles on a per-request basis. While this is effective with more traditional APIs and ML models, it doesn't capture the full story for LLMs.
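For context, a traditional Locust test is only a few lines of Python. The snippet below is a minimal sketch rather than anything from a real deployment: the /predict endpoint and payload are hypothetical placeholders, and the test only reports request-level metrics such as RPS and latency percentiles.

# minimal_locustfile.py -- a minimal Locust sketch; the /predict endpoint and payload are hypothetical
from locust import HttpUser, task, between

class ModelUser(HttpUser):
    # each simulated user waits 1-3 seconds between requests
    wait_time = between(1, 3)

    @task
    def predict(self):
        # Locust records request rate and latency percentiles for every call made here
        self.client.post("/predict", json={"features": [0.1, 0.2, 0.3]})

# run with: locust -f minimal_locustfile.py --host http://localhost:8080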

LLMs traditionally have a much lower RPS and higher latency than traditional ML models due to their size and larger compute requirements. Often the RPS metric does not really provide the most accurate picture either, because requests can vary greatly depending on the input to the LLM. For instance, you might have one query asking to summarize a large chunk of text and another query that only requires a one-word response.

This is why tokens are seen as a much more accurate representation of an LLM's performance. At a high level, a token is a chunk of text: every time an LLM processes your input, it "tokenizes" that input. What exactly a token maps to depends on the specific LLM you're using, but you can think of it as a word, a sequence of words, or a group of characters, in essence.

Image by Author
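To make this concrete, here is a small sketch using the Hugging Face transformers library; the GPT-2 tokenizer is used purely as an illustrative assumption, and a different model's tokenizer will split the same text differently.

# a minimal tokenization sketch; the GPT-2 tokenizer is chosen only as an example
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Load testing LLMs with LLMPerf"
tokens = tokenizer.tokenize(text)       # the text split into sub-word pieces
token_ids = tokenizer.encode(text)      # the integer IDs the model actually sees
print(tokens)
print(f"{len(token_ids)} tokens for {len(text)} characters")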

What we'll do in this article is explore how we can generate token-based metrics so that we can understand how your LLM is performing from a serving/deployment perspective. After this article you'll have an idea of how you can set up a load-testing tool specifically to benchmark different LLMs, whether you are evaluating many models, different deployment configurations, or a combination of both.

Let's get hands on! If you are more of a video-based learner, feel free to follow my corresponding YouTube video down below:

NOTE: This article assumes a basic understanding of Python, LLMs, and Amazon Bedrock/SageMaker. If you are new to Amazon Bedrock, please refer to my starter guide here. If you want to learn more about SageMaker JumpStart LLM deployments, refer to the video here.

DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.

Table of Contents

  1. LLM-Specific Metrics
  2. LLMPerf Intro
  3. Applying LLMPerf to Amazon Bedrock
  4. Additional Resources & Conclusion

LLM-Specific Metrics

As we briefly discussed in the introduction in regard to LLM hosting, token-based metrics generally provide a much better representation of how your LLM is responding to different payload sizes or types of queries (summarization vs QnA).

Traditionally we have always tracked RPS and latency, which we will still see here, but more so at a token level. Here are some of the metrics to be aware of before we get started with load testing:

  1. Time to First Token: This is the duration it takes for the first token to be generated. This is especially helpful when streaming. For instance, when using ChatGPT we start processing information as soon as the first piece of text (token) appears.
  2. Total Output Tokens Per Second: This is the total number of tokens generated per second; you can think of this as a more granular alternative to the requests per second we traditionally track.

These are the major metrics that we'll focus on, and there are a few others, such as inter-token latency, that will also be displayed as part of the load tests. Keep in mind that the parameters that also influence these metrics include the expected input and output token size. We specifically play with these parameters to get an accurate understanding of how our LLM performs in response to different generation tasks.
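To make these two metrics concrete, the sketch below measures them by hand around a streaming LiteLLM call (LiteLLM and AWS credentials are configured later in this article, and are assumed to be in place here). Counting streamed chunks is only an approximation of the output token count, so treat this as an illustration rather than LLMPerf's exact implementation.

# a rough sketch of measuring time to first token and output throughput around a
# streaming LiteLLM call; chunk count is used as a proxy for output token count
import time
from litellm import completion

start = time.perf_counter()
first_token_time = None
chunks = 0

response = completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"role": "user", "content": "Summarize what load testing is."}],
    stream=True,
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_time is None:
            first_token_time = time.perf_counter() - start  # time to first token
        chunks += 1

total_time = time.perf_counter() - start
print(f"TTFT: {first_token_time:.2f}s")
print(f"~{chunks / total_time:.1f} output chunks (approx. tokens) per second")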

Now let's take a look at a tool that enables us to toggle these parameters and display the relevant metrics we need.

LLMPerf Intro

LLMPerf is built on top of Ray, a popular distributed computing Python framework. LLMPerf specifically leverages Ray to create distributed load tests where we can simulate real-time production-level traffic.

Note that any load-testing tool is only going to be able to generate your expected amount of traffic if the client machine it runs on has enough compute power to match that load. For instance, as you scale the concurrency or throughput expected for your model, you would also want to scale the client machine(s) where you are running your load test.

Now, specifically within LLMPerf, there are a few exposed parameters that are tailored for LLM load testing, as we've discussed:

  • Model: This is the model provider and the hosted model that you're working with. For our use case it'll be Amazon Bedrock and Claude 3 Sonnet specifically.
  • LLM API: This is the API format in which the payload should be structured. We use LiteLLM, which provides a standardized payload structure across different model providers, simplifying the setup process for us, especially if we want to test different models hosted on different platforms.
  • Input Tokens: The mean input token length; you can also specify a standard deviation for this number.
  • Output Tokens: The mean output token length; you can also specify a standard deviation for this number.
  • Concurrent Requests: The number of concurrent requests for the load test to simulate.
  • Test Duration: You can control the duration of the test; this parameter is specified in seconds.

LLMPerf specifically exposes all these parameters through its token_benchmark_ray.py script, which we configure with our specific values. Let's take a look now at how we can configure this specifically for Amazon Bedrock.

Applying LLMPerf to Amazon Bedrock

Setup

For this example we'll be working in a SageMaker Classic Notebook Instance with a conda_python3 kernel on an ml.g5.12xlarge instance. Note that you want to select an instance with enough compute to generate the traffic load that you want to simulate. Ensure that you also have your AWS credentials in place for LLMPerf to access the hosted model, be it on Bedrock or SageMaker.
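To get LLMPerf itself onto the instance, a typical setup (a sketch based on the LLMPerf repository's README, so double-check there for the current steps) is to clone the repository and install it in editable mode:

%%sh
# install LLMPerf into the notebook environment; see the repository README for current instructions
git clone https://github.com/ray-project/llmperf.git
cd llmperf
pip install -e .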

LiteLLM Configuration

We first configure our LLM API structure of choice, which is LiteLLM in this case. LiteLLM supports various model providers; here we configure the completion API to work with Amazon Bedrock:

import os
from litellm import completion

os.environ["AWS_ACCESS_KEY_ID"] = "Enter your entry key ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "Enter your secret entry key"
os.environ["AWS_REGION_NAME"] = "us-east-1"

response = completion(
    mannequin="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{ "content": "Who is Roger Federer?","role": "user"}]
)
output = response.choices[0].message.content
print(output)

To work with Bedrock, we configure the model ID to point towards Claude 3 Sonnet and pass in our prompt. The neat part about LiteLLM is that the messages key has a consistent format across model providers.
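As a small illustration of that consistency (not part of our Bedrock setup), the exact same messages payload could be pointed at a different provider just by swapping the model string; the OpenAI model below is only an example and would require an OPENAI_API_KEY to actually run.

# the same messages list works across providers in LiteLLM; only the model string changes
# (the OpenAI model here is purely illustrative and needs an OPENAI_API_KEY to be set)
openai_response = completion(
    model="gpt-4o-mini",
    messages=[{"content": "Who is Roger Federer?", "role": "user"}]
)
print(openai_response.choices[0].message.content)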

Post-execution, we can focus on configuring LLMPerf for Bedrock specifically.

LLMPerf Bedrock Integration

To execute a load test with LLMPerf, we can simply use the provided token_benchmark_ray.py script and pass in the following parameters that we talked about earlier:

  • Input Tokens Mean & Standard Deviation
  • Output Tokens Mean & Standard Deviation
  • Max number of requests for the test
  • Duration of the test
  • Concurrent requests

In this case we also specify our API format to be LiteLLM, and we can execute the load test with a simple shell script like the following:

%%sh
python llmperf/token_benchmark_ray.py \
    --model bedrock/anthropic.claude-3-sonnet-20240229-v1:0 \
    --mean-input-tokens 1024 \
    --stddev-input-tokens 200 \
    --mean-output-tokens 1024 \
    --stddev-output-tokens 200 \
    --max-num-completed-requests 30 \
    --num-concurrent-requests 1 \
    --timeout 300 \
    --llm-api litellm \
    --results-dir bedrock-outputs

In this case we keep the concurrency low, but feel free to toggle this number depending on what you're expecting in production. Our test will run for 300 seconds, and once it finishes you should see an output directory with two files: one with statistics for each individual inference and one with the mean metrics across all requests for the duration of the test.

We can make this look a little neater by parsing the summary file with pandas:

import json
from pathlib import Path
import pandas as pd

# Load JSON files
individual_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_individual_responses.json")
summary_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_summary.json")

with open(individual_path, "r") as f:
    individual_data = json.load(f)

with open(summary_path, "r") as f:
    summary_data = json.load(f)

# Print summary metrics
df = pd.DataFrame(individual_data)
summary_metrics = {
    "Model": summary_data.get("model"),
    "Mean Input Tokens": summary_data.get("mean_input_tokens"),
    "Stddev Input Tokens": summary_data.get("stddev_input_tokens"),
    "Mean Output Tokens": summary_data.get("mean_output_tokens"),
    "Stddev Output Tokens": summary_data.get("stddev_output_tokens"),
    "Mean TTFT (s)": summary_data.get("results_ttft_s_mean"),
    "Mean Inter-token Latency (s)": summary_data.get("results_inter_token_latency_s_mean"),
    "Mean Output Throughput (tokens/s)": summary_data.get("results_mean_output_throughput_token_per_s"),
    "Completed Requests": summary_data.get("results_num_completed_requests"),
    "Error Rate": summary_data.get("results_error_rate")
}
print("Claude 3 Sonnet - Performance Summary:\n")
for k, v in summary_metrics.items():
    print(f"{k}: {v}")

The final load test results will look something like the following:

Screenshot by Author

As we can see, the output reflects the input parameters that we configured, along with the corresponding results: the time to first token (in seconds) and the throughput in terms of mean output tokens per second.

In a real-world use case you might use LLMPerf across many different model providers and run tests across those platforms. Used this way, the tool can help you holistically identify the right model and deployment stack for your use case at scale.

Additional Resources & Conclusion

The entire code for the sample can be found in this associated GitHub repository. If you also want to work with SageMaker endpoints, you can find a Llama JumpStart deployment load-testing sample here.

All in all, load testing and evaluation are both crucial to ensuring that your LLM is performant against your expected traffic before pushing to production. In future articles we'll cover not just the evaluation portion, but how we can create a holistic test that combines both components.

As always, thank you for reading, and feel free to leave any feedback and connect with me on LinkedIn and X.
