├── llm_bench
│   ├── requirements.txt
│   ├── locust.conf
│   ├── locust-grafana.conf
│   ├── README.md
│   ├── limericks.txt
│   ├── benchmark_suite.ipynb
│   └── load_test.py
├── .gitignore
├── README.md
└── LICENSE

/llm_bench/requirements.txt:
--------------------------------------------------------------------------------
1 | locust
2 | orjson
3 | pillow
4 | gevent
5 | transformers
6 | locust-plugins
7 | 
--------------------------------------------------------------------------------
/llm_bench/locust.conf:
--------------------------------------------------------------------------------
1 | locustfile = load_test.py
2 | headless = yes
3 | host = http://localhost:80
4 | reset-stats = yes
5 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | # Benchmark results
4 | llm_bench/results/
5 | .env
6 | env/
--------------------------------------------------------------------------------
/llm_bench/locust-grafana.conf:
--------------------------------------------------------------------------------
1 | locustfile = load_test.py
2 | host = http://localhost:80
3 | headless = true
4 | timescale = true
5 | pghost = 127.0.0.1
6 | pgport = 5432
7 | pguser = postgres
8 | pgpassword = password
9 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Benchmark / Load-testing Suite by Fireworks.ai
2 | 
3 | ## LLM benchmarking
4 | 
5 | The load test is designed to simulate continuous production load and minimize the effect of model generation behavior:
6 | * variation in generation parameters
7 | * continuous request stream with varying distribution and load levels
8 | * forced generation of an exact number of output tokens (for most providers)
9 | * specified load test duration
10 | 
11 | Supported providers and API flavors:
12 | * OpenAI API compatible endpoints:
13 |   * [Fireworks.ai](https://app.fireworks.ai) public or private deployments
14 |   * vLLM
15 |   * Anyscale Endpoints
16 |   * OpenAI
17 |   * Text Generation Inference (TGI) / HuggingFace Endpoints
18 |   * Together.ai
19 | * NVIDIA Triton server:
20 |   * Legacy HTTP endpoints (no streaming)
21 |   * LLM-focused endpoints (with or without streaming)
22 | 
23 | Captured metrics:
24 | * Overall latency
25 | * Number of generated tokens
26 | * Sustained request throughput (QPS)
27 | * Time to first token (TTFT) for streaming
28 | * Per-token latency for streaming
29 | 
30 | The metrics summary can be exported to CSV, so runs over multiple configurations can be scripted. The CSV file can be imported into Google Sheets/Excel or Jupyter for further analysis.
31 | 
32 | See the [`llm_bench`](llm_bench) folder for detailed usage.
33 | 
34 | See [`llm_bench/benchmark_suite.ipynb`](llm_bench/benchmark_suite.ipynb) for a detailed example of how to use the load test script and run different types of benchmark suites.
35 | 
--------------------------------------------------------------------------------
/llm_bench/README.md:
--------------------------------------------------------------------------------
1 | # LLM Load Test
2 | 
3 | Please refer to [`benchmark_suite.ipynb`](benchmark_suite.ipynb) for a detailed example of how to use the `load_test.py` script and run different types of benchmark suites.
4 | 
5 | ## Installation
6 | 
7 | The load test relies on the [Locust](https://locust.io/) package. Install it with pip:
8 | 
9 | ```bash
10 | pip install -r requirements.txt
11 | ```
12 | 
13 | Then run the commands described below from the enclosing directory. Locust will pick up the settings from `locust.conf` automatically.
14 | 
15 | ## Usage
16 | 
17 | The load test script exercises an LLM generation endpoint under varying load. See below for the common configuration options. Check `--help` for the full list.
18 | 
19 | ### Target
20 | 
21 | - `-H`: target endpoint URL (the part preceding `/v1/...`), e.g. `-H http://localhost` or `-H https://api.fireworks.ai/inference`. Defaults to `localhost:80`.
22 | - (optional) `-m`: model to send requests to. Can be omitted for a local test if the server has only a single model loaded.
23 | - (optional) `--provider`: provider name like `fireworks` or `openai`. APIs have slight differences that the script accounts for. If omitted, the script tries to guess based on the URI and the API response. Must be specified for non-OpenAI-compatible providers like Triton.
24 | - `-k`: API key to be passed as `Authorization: Bearer ...`.
25 | 
26 | ### Rate of requests
27 | 
28 | The script can be used in several primary modes:
29 | 
30 | 1. **Fixed concurrency**. N workers are created. Each sends a request, waits for the response, and then sends the next request. Thus, as concurrency increases, the server gets more loaded and latency grows. Usually, increasing concurrency beyond some point doesn't increase throughput and just leads to growing latency.
31 |    - `-u`: the number of concurrent workers to spawn (standard Locust argument)
32 |    - `-r`: the rate per second of spawning concurrent workers. If processing the workload takes a while (more than several seconds), it makes sense to set this value to something lower than `-u` for a gradual ramp-up that avoids request bursts.
33 |    - (optional) `--burst`: synchronizes all N workers to issue requests in one go at the specified interval. The maximum latency should be less than the period, otherwise some workers may fall behind.
34 | 
35 | 2. **Fixed QPS**. The script ensures that requests are issued at specific times so that they average out at the specified rate per second. If the target QPS is too high and the server is overloaded, it will likely drop additional requests or stall.
36 |    - `--qps`: the desired rate of requests per second. Can be a fractional number, e.g. `0.1`.
37 |    - `-u`/`-r`: need to be set to sufficiently high values to allow generating the target QPS. The script will complain if they are too low. Passing something like `-u 100 -r 100` is a good choice.
38 |    - (optional) `--qps-distribution`: specifies how to space out requests. The default is `constant`, meaning evenly spaced requests; `exponential` simulates a [Poisson traffic model](https://en.wikipedia.org/wiki/Traffic_generation_model#Poisson_traffic_model).
39 | 
40 | ### Workload
41 | 
42 | Input is read from `--dataset`, which is either:
43 | - `limerics`: the default dataset. Requires `--tokenizer` to be passed; it is used to auto-generate realistic prompts.
44 | - a `@`-prefixed path to a JSONL file, whose lines provide the contents of each request.
45 | 
46 | The number of tokens to generate is sampled on every request from a given distribution:
47 | - `-o`/`--max-tokens`: maximum number of tokens to generate. If `--max-tokens-distribution` is non-constant, this is the mean of the distribution.
48 | - `--max-tokens-distribution`: specifies the probability distribution to use.
49 | - `--max-tokens-range`: specifies "the width" of the distribution (e.g. the stddev for the `normal` distribution). The specified value `alpha` is relative to `max-tokens`. The default is 0.3, so most of the range falls within the "3 sigma" region.
50 | - `--max-tokens-cap`: specifies an upper bound at which to "truncate" the probability distribution. The lower bound is always 1 token. This allows sampling from "truncated normal" or "truncated exponential" distributions.
51 | 
52 | Based on the above settings, the following distributions are supported (see the sketch below):
53 | - `constant`: use the `--max-tokens` value on every request
54 | - `uniform`: sample from the range `[max_tokens - max_tokens * alpha, max_tokens + max_tokens * alpha]`
55 | - `normal`: sample from the Gaussian distribution `N(max_tokens, max_tokens * alpha)`
56 | - `exponential`: sample from the exponential distribution with mean `max_tokens`; `alpha` is ignored
57 | 
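To make the sampling rules concrete, here is a minimal Python sketch of drawing one per-request value under these settings. It only illustrates the distributions described above; `sample_max_tokens` is a hypothetical name, not the actual implementation in `load_test.py`.

```python
import random

def sample_max_tokens(max_tokens, distribution="constant", alpha=0.3, cap=None):
    # Illustrative sketch of the per-request sampling described above (not load_test.py itself).
    if distribution == "constant":
        value = max_tokens
    elif distribution == "uniform":
        value = random.uniform(max_tokens * (1 - alpha), max_tokens * (1 + alpha))
    elif distribution == "normal":
        value = random.gauss(max_tokens, max_tokens * alpha)
    elif distribution == "exponential":
        value = random.expovariate(1.0 / max_tokens)  # mean = max_tokens; alpha ignored
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    if cap is not None:
        value = min(value, cap)  # --max-tokens-cap truncates the distribution from above
    return max(round(value), 1)  # the lower bound is always 1 token
```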
58 | The benchmark makes a best effort to ensure that the desired `max_tokens` number is respected:
59 | - for providers that support it, it passes the `ignore_eos` or `min_tokens` parameter to avoid early stopping
60 | - the default prompt is a lengthy code generation request that usually doesn't stop early
61 | - it verifies the number of tokens actually generated and prints warnings on mismatch. Different providers use varying mechanisms for returning the number of generated tokens; for some of them, `--logprobs` might be needed in streaming mode.
62 | - optionally, `--tokenizer` can be passed to specify a Hugging Face tokenizer used to count the output tokens on the client side.
63 | 
64 | Generation options:
65 | - `--chat`: call the chat API instead of raw completions
66 | - `--stream`: stream the result back. Enabling this gives "time to first token" and "time per token" metrics
67 | - (optional) `--logprobs`: corresponds to the `logprobs` API parameter. For some providers, it's needed for output token counting in streaming mode.
68 | 
69 | ### Writing results
70 | 
71 | Locust prints out a detailed summary including quantiles of various metrics. Additionally, the script prints a summary block at the very end of the output that includes the model being tested.
72 | 
73 | When comparing multiple configurations, it's useful to aggregate results together:
74 | 
75 | - `--summary-file`: appends a line with the summary to the specified CSV file. Useful for generating a spreadsheet with perf sweep results. If the file doesn't exist, the header is written out first.
76 | - `-t`: duration (e.g. `5min`) for which to run the test (standard Locust option). It's particularly useful when scripting multiple runs. By default, the test runs without a limit until Ctrl+C is pressed.
77 | 
78 | The typical workflow is to run the benchmark several times, appending to the same CSV file. The resulting file can be imported into a spreadsheet or pandas for further analysis; see the sketch below.
79 | 
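A minimal sketch of pulling a sweep file into pandas, assuming it was written via `--summary-file results.csv` (the actual column names come from the header the script writes, so inspect them first):

```python
import pandas as pd

# Each benchmark run appends one summary line; the header is written on first use.
df = pd.read_csv("results.csv")
print(df.columns.tolist())        # check the actual column names before slicing
print(df.to_string(index=False))  # one row per run, ready for pivoting/plotting
```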
80 | ### Custom prompts
81 | 
82 | Sometimes it's necessary to replay exact prompts, for example when embedding images.
83 | The `--dataset` option can be used in this case to specify a file with the .jsonl extension, prefixed with an at sign, e.g. `@prompt.jsonl`.
84 | JSONL files are read line by line. Each line has to contain a valid JSON object, which is used to form the resulting API request.
85 | Examples:
86 | 
87 | Chat dataset (--chat option):
88 | ```
89 | {"messages": [{"role": "user", "content": "Write a poem about a cat"}], "temperature": 0.9}
90 | {"messages": [{"role": "user", "content": "Write a poem about a dog"}], "temperature": 1}
91 | ```
92 | 
93 | Non-chat dataset (--no-chat option):
94 | ```
95 | {"prompt": "One two three four"}
96 | {"prompt": "Five six seven eight"}
97 | ```
98 | 
99 | 
100 | ## Examples
101 | 
102 | Download the tokenizer for the model being benchmarked from Hugging Face:
103 | ```bash
104 | huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir /models/Meta-Llama-3-8B-Instruct --include '*.json'
105 | export TOKENIZER=/models/Meta-Llama-3-8B-Instruct
106 | ```
107 | 
108 | Maintain a fixed concurrency of 8 requests against a local deployment:
109 | 
110 | ```bash
111 | locust -u 8 -r 2 -p 512 -o 128
112 | ```
113 | 
114 | Call the streaming chat API locally with a request issued every 2 seconds. Run for 1 minute and save results to `results.csv`:
115 | 
116 | ```bash
117 | locust -t 1min -u 100 -r 100 -p 512 -o 128 --stream --chat --qps 0.5 --summary-file results.csv
118 | ```
119 | 
120 | Benchmark a Fireworks public deployment with 1 concurrent request only:
121 | 
122 | ```bash
123 | locust -u 1 -H https://api.fireworks.ai/inference -p 128 -o 200 --api-key $FIREWORKS_API_KEY --model=accounts/fireworks/models/llama-v3p1-8b-instruct
124 | ```
125 | 
126 | Benchmark a Fireworks public deployment with 1 request and 2 images (1024w x 1024h and 3084w x 1080h):
127 | 
128 | ```bash
129 | locust -u 1 -H https://api.fireworks.ai/inference -p 128 -o 200 --api-key $FIREWORKS_API_KEY --model=accounts/fireworks/models/llama-v3p1-8b-instruct --chat --prompt-images-with-resolutions 1024x1024 3084x1080
130 | ```
131 | 
132 | Benchmark an OpenAI deployment, reading prompts from a file, with 1 concurrent request:
133 | 
134 | ```bash
135 | locust --dataset '@input.jsonl' -u 1 -H https://api.openai.com -o 200 --api-key $OPENAI_API_KEY --model=gpt-3.5-turbo --chat
136 | ```
137 | 
138 | ## UI mode
139 | 
140 | Instead of relying on textual output, it's also possible to plot the results in Grafana.
141 | 
142 | ```bash
143 | pip install locust locust-plugins
144 | locust-compose up
145 | ```
146 | 
147 | This starts local Postgres and Grafana instances. Grafana is available at http://127.0.0.1:3000 (sometimes the URL doesn't get printed in the logs).
148 | 
149 | Then run the test as specified above with an additional argument:
150 | 
151 | ```bash
152 | locust --config locust-grafana.conf ...
153 | ```
154 | 
155 | This starts the load test locally and pushes results into Grafana in real time. Besides the actual requests, we push additional metrics (e.g. time per token) as separate fake requests to get stats aggregation. Make sure to exclude them from aggregation when viewing the graphs.
156 | 
157 | Other settings for Locust are in `./locust.conf`. You may start Locust in non-headless mode, but its UI is very basic and lacks advanced stats aggregation capabilities.
158 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 | 
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 | 
7 | 1. Definitions.
8 | 
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /llm_bench/limericks.txt: -------------------------------------------------------------------------------- 1 | An elderly man called Keith, 2 | Mislaid his set of false teeth. 3 | They'd been laid on a chair, 4 | He'd forgot they were there, 5 | Sat down, and was bitten beneath. 6 | 7 | There once was a man from the sticks 8 | Who loved to compose limericks 9 | But he failed at his sport 10 | They were always too short... 11 | 12 | There once was a runner named Dwight 13 | Who could speed even faster than light. 14 | He set out one day 15 | In a relative way 16 | And returned on the previous night. 17 | 18 | There was a young lady named Alice 19 | Who was known to have peed in a chalice. 20 | ‘Twas the common belief 21 | It was done for relief, 22 | And not out of protestant malice. 23 | 24 | A canner, exceedingly canny, 25 | One morning remarked to his granny, 26 | "A canner can can 27 | Anything that he can; 28 | But a canner can't can a can, can he?" 29 | 30 | At times I’m so mad that I’m hopping. 31 | My angriness sets my veins popping. 32 | I yell and I curse, 33 | With swear words diverse, 34 | But my wife does much worse: she goes shopping. 35 | 36 | A tutor who tooted a flute 37 | Tried to teach two young tooters to toot. 38 | Said the two to the tutor, 39 | “Is it harder to toot, or… 40 | To tutor two tooters to toot?” 41 | 42 | A crafty young bard named McMahon 43 | Whose poetry never would scan 44 | Once said, with a pause, 45 | “It’s probably because 46 | I’m always trying to cram as many additional syllables into the last line as I possibly can.” 47 | 48 | An elephant slept in his bunk, 49 | And in slumber his chest rose and sunk. 50 | But he snored - how he snored! 51 | All the other beasts roared, 52 | So his wife tied a knot in his trunk. 53 | 54 | There once was a man from Nantucket 55 | Who kept all his cash in a bucket 56 | His daughter, named Nan 57 | Ran away with a man 58 | And as for the bucket, Nantucket. 59 | 60 | I need a front door for my hall, 61 | The replacement I bought was too tall. 62 | So I hacked it and chopped it, 63 | And carefully lopped it, 64 | And now the dumb thing is too small. 65 | 66 | "There's a train at 4:04," said Miss Jenny. 67 | "Four tickets I'll take; have you any?" 68 | Said the man at the door, 69 | "Not four for 4:04, 70 | For four for 4:04 is too many." 71 | 72 | How to spell the potato has tried 73 | Many minds, sometimes mine, I’ll confide. 74 | Though it may have an eye, 75 | There’s no E – don’t ask why! 76 | Not until it’s been baked, boiled, or fried. 77 | 78 | I'd rather have Fingers than Toes, 79 | I'd rather have Ears than a Nose. 80 | And as for my Hair, 81 | I'm glad it's all there, 82 | I'll be awfully sad, when it goes. 83 | 84 | There was a faith-healer of Deal, 85 | Who said: "Although pain isn't real, 86 | If I sit on a pin 87 | And it punctures my skin, 88 | I dislike what I fancy I feel.' 89 | 90 | My dog is really quite hip, 91 | Except when he takes a cold dip. 
92 | He looks like a fool, 93 | When he jumps in the pool, 94 | And reminds me of a sinking ship. 95 | 96 | I'm papering walls in the loo 97 | And quite frankly I haven't a clue; 98 | For the pattern's all wrong 99 | (Or the paper's too long) 100 | And I'm stuck to the toilet with glue. 101 | 102 | There once was an old man of Esser, 103 | Whose knowledge grew lesser and lesser, 104 | It at last grew so small 105 | He knew nothing at all 106 | And now he's a college professor. 107 | 108 | As 007 walked by 109 | He heard a wee spider say, "Hi." 110 | But shaken, he shot 111 | It right there on the spot 112 | As it tried to explain, "I'm a spi..." 113 | 114 | There was a young fellow of Crete 115 | Who was so exceedingly neat. 116 | When he got out of bed 117 | He stood on his head 118 | To make sure of not soiling his feet. 119 | 120 | An amoeba named Max and his brother 121 | Were sharing a drink with each other; 122 | In the midst of their quaffing, 123 | They split themselves laughing, 124 | And each of them now is a mother. 125 | 126 | A flea and a fly in a flue 127 | Were imprisoned, so what could they do? 128 | Said the fly, “Let us flee!” 129 | “Let us fly!” said the flea 130 | So they flew through a flaw in the flue. 131 | 132 | What happens when you retire? 133 | You really don't have to inquire - 134 | No job and no phone 135 | There's no place but home, 136 | And your checkbook's about to expire! 137 | 138 | The star violinist was bowing; 139 | The quarrelsome oarsmen were rowing. 140 | But how is the sage 141 | To discern from this page: 142 | Was it piglets, or seeds, that were sowing? 143 | 144 | An oyster from Kalamazoo 145 | Confessed he was feeling quite blue. 146 | For he said, “As a rule, 147 | When the weather turns cool, 148 | I invariably get in a stew.” 149 | 150 | A painter, who lived in Great Britain, 151 | Interrupted two girls with their knitting, 152 | He said, with a sigh, 153 | "That park bench, well I, 154 | Just painted it, right where you're sitting." 155 | 156 | I once had a gerbil named Bobby, 157 | Who had an unusual hobby. 158 | He chewed on a cord, 159 | and now - oh my lord, 160 | now all that's left is a blobby. 161 | 162 | My ambition, said old Mr. King, 163 | Is to live as a bird on the wing. 164 | Then he climbed up a steeple, 165 | Which scared all the people, 166 | So they caged him and taught him to sing. 167 | 168 | Is it me or the nature of money, 169 | That's odd and particularly funny. 170 | But when I have dough, 171 | It goes quickly, you know, 172 | And seeps out of my pockets like honey. 173 | 174 | A mouse in her room woke Miss Dowd 175 | She was frightened — it must be allowed. 176 | Soon a happy thought hit her — 177 | To scare off the critter, 178 | She sat up in bed and meowed. 179 | 180 | A nifty young flapper named Jane 181 | While walking was caught in the rain. 182 | She ran - almost flew, 183 | Her complexion did too, 184 | And she reached home exceedingly plain. 185 | 186 | There was an old person of Fratton 187 | Who would go to church with his hat on. 188 | 'If I wake up,' he said, 189 | 'With a hat on my head, 190 | I will know that it hasn't been sat on.' 191 | 192 | I told him, "Get out of my place 193 | You're an utter uncultured disgrace; 194 | You're a simpleton loon. 195 | Don't you know a good tune?" 196 | Then he walloped me square in the face. 197 | 198 | No woodsman would cut a wood, would he 199 | If woods would be woodless – nor should he. 
200 | Yet no woodcutter would 201 | Cut a woody-wood wood 202 | If no woodsmen cut woody woods, would he? 203 | 204 | A forgetful old gasman named Dieter, 205 | Who went poking around his gas heater, 206 | Touched a leak with his light; 207 | He blew out of sight — 208 | And, as everyone who knows anything about poetry can tell you, he also ruined the meter. 209 | 210 | One Saturday morning at three 211 | A cheesemonger’s shop in Paree 212 | Collapsed to the ground 213 | With a thunderous sound 214 | Leaving only a pile of de brie. 215 | 216 | There was an old man of Peru, 217 | Who dreamt he was eating his shoe. 218 | He woke in the night, 219 | With a terrible fright, 220 | And found it was perfectly true. 221 | 222 | A crossword compiler named Moss 223 | Who found himself quite at a loss 224 | When asked, 'Why so blue?' 225 | Said, 'I haven't a clue 226 | I'm 2 Down to put 1 Across.' 227 | 228 | To compose a sonata today, 229 | Don't proceed in the old-fashioned way: 230 | With your toes on the keys, 231 | Bang the floor with your knees: 232 | "Oh how modern!" the critics will say. 233 | 234 | There was a young lady named Perkins, 235 | Who just simply doted on gherkins. 236 | In spite of advice, 237 | She ate so much spice, 238 | That she pickled her internal workins'. 239 | 240 | A certain young fellow named Bee-Bee 241 | Wished to wed a woman named Phoebe. 242 | "But," he said, "I must see 243 | What the clerical fee 244 | Be before Phoebe be Phoebe Bee-Bee." 245 | 246 | If you catch a chinchilla in Chile 247 | And cut off its beard, willy-nilly 248 | You can honestly say 249 | That you have just made 250 | A Chilean chinchilla's chin chilly. 251 | 252 | There once was a man from the city 253 | Stooped to pat what he thought was a kitty 254 | He gave it a pat 255 | But it wasn't a cat - 256 | They buried his clothes - what a pity! 257 | 258 | There was an old man of the Cape 259 | Who made himself garments of crepe. 260 | When asked, “Do they tear?” 261 | He replied, “Here and there, 262 | But they’re perfectly splendid for shape!” 263 | 264 | There was a young lady named Cager 265 | Who, as the result of a wager, 266 | Consented to fart 267 | The complete oboe part 268 | Of Mozart’s quartet in F major. 269 | 270 | The limerick packs laughs anatomical 271 | Into space that is quite economical. 272 | But the good ones I’ve seen 273 | So seldom are clean 274 | And the clean ones so seldom are comical. 275 | 276 | The incredible Wizard of Oz 277 | Retired from his business because 278 | Due to up-to-date science 279 | To most of his clients 280 | He wasn’t the Wizard he was. 281 | 282 | A wonderful bird is the pelican 283 | His bill holds more than his belican, 284 | He can take in his beak 285 | Enough food for a week 286 | But I’m damned if I see how the helican. 287 | 288 | There once was a lady named Ferris 289 | Whom nothing could ever embarrass. 290 | ‘Til the bath salts one day, 291 | in the tub where she lay, 292 | turned out to be Plaster of Paris. 293 | 294 | There once was a girl named Irene 295 | Who lived on distilled kerosene 296 | But she started absorbing 297 | A new hydrocarbon 298 | And since then has never benzene. 299 | 300 | Is algebra fruitless endeavor? 301 | It seems they’ve been trying forever 302 | To find x, y, and z 303 | And it’s quite clear to me: 304 | If they’ve not found them yet then they’ll never. 
305 | 306 | A rather disgruntled young Viking 307 | Found plunder was not to his liking 308 | When they yelled “All ashore,” 309 | He just threw down his oar 310 | And announced, “I’m not striking, I’m striking!” 311 | 312 | There once was a girl in the choir 313 | Whose voice rose up hoir and hoir, 314 | Till it reached such a height 315 | It went clear out of seight, 316 | And they found it next day in the spoir. 317 | 318 | There once was a fly on the wall, 319 | I wonder, why didn’t it fall? 320 | Because its feet stuck? 321 | Or was it just luck? 322 | Or does gravity miss things so small? 323 | 324 | There once was a farmer from Leeds, 325 | Who swallowed a packet of seeds. 326 | It soon came to pass, 327 | He was covered with grass, 328 | But has all the tomatoes he needs. 329 | 330 | There was once a great man in Japan 331 | Whose name on Tuesday began, 332 | It lasted through Sunday 333 | Till twilight on Monday 334 | And it sounded like stones in a can. 335 | 336 | Remember when nearly sixteen 337 | On your very first date as a teen 338 | At the movies? If yes, 339 | Then I bet you can't guess 340 | What was shown on the cinema screen. 341 | 342 | There was a young man from Dealing 343 | Who caught the bus for Ealing. 344 | It said on the door 345 | 'Don't spit on the floor' 346 | So he jumped up and spat on the ceiling. 347 | 348 | There once was a man named Muvett 349 | Who lived in the city of Lovett 350 | But his car broke down 351 | Two miles out of town 352 | And Muvett had to shove it to Lovett! 353 | 354 | There once was a beautiful nurse 355 | Who carried an ugly old purse 356 | But she tripped on the door 357 | And fell on the floor 358 | And they both went away in the hearse. 359 | 360 | A bather whose clothing was strewed 361 | By breezes that left her quite nude, 362 | Saw a man come along 363 | And, unless I am wrong, 364 | You expect this last line to be lewd! 365 | 366 | An ambitious young fellow named Matt, 367 | Tried to parachute using his hat. 368 | Folks below looked so small, 369 | As he started to fall, 370 | Then got bigger and bigger and SPLAT! 371 | 372 | There was a young lady whose chin 373 | Resembled the point of a pin 374 | So she had it made sharp 375 | And purchased a harp 376 | And played several tunes with her chin. 377 | 378 | There was once a young girl who said: “Why 379 | Can’t I look in my ear with my eye? 380 | If I put my mind to it 381 | I’m sure I can do it. 382 | You never can tell till you try.” 383 | 384 | Limericks I cannot compose, 385 | With noxious smells in my nose. 386 | But this one was easy, 387 | I only felt queasy, 388 | Because I was sniffing my toes. 389 | 390 | There was an odd fellow named Gus, 391 | When traveling he made such a fuss. 392 | He was banned from the train, 393 | Not allowed on a plane, 394 | And now travels only by bus. 395 | 396 | There once was a man from Tibet, 397 | Who couldn't find a cigarette 398 | So he smoked all his socks, 399 | and got chicken-pox, 400 | and had to go to the vet. 401 | 402 | A newspaperman named Fling, 403 | Could make "copy" from any old thing. 404 | But the copy he wrote, 405 | Of a five-dollar note, 406 | Was so good he now wears so much bling. 407 | 408 | There is a young schoolboy named Mason, 409 | Whose mom cuts his hair with a basin. 410 | When he stands in one place, 411 | With a scarf round his face, 412 | It's a mystery which way he’s facing. 413 | 414 | There was a young schoolboy of Rye, 415 | Who was baked by mistake in a pie. 
416 | To his mother’s disgust, 417 | He emerged through the crust, 418 | And exclaimed, with a yawn, where am I? 419 | 420 | A fellow jumped off a high wall, 421 | And had a most terrible fall. 422 | He went back to bed, 423 | With a bump on his head, 424 | That's why you don't jump off a wall. 425 | 426 | There was a young lady of Cork, 427 | Whose Pa made a fortune in pork. 428 | He bought for his daughter, 429 | A tutor who taught her, 430 | To balance green peas on her fork. 431 | 432 | There once was a Martian called Zed 433 | With antennae all over his head. 434 | He sent out a lot 435 | Di-di-dash-di-dot 436 | But nobody knew what he said. 437 | 438 | There once was a girl named Sam 439 | Who did not eat roast beef and ham 440 | She ate a green apple 441 | Then drank some Snapple 442 | Some say she eats like a lamb. 443 | 444 | A major, with wonderful force, 445 | Called out in Hyde Park for a horse. 446 | All the flowers looked round, 447 | But no horse could be found; 448 | So he just rhododendron, of course. 449 | 450 | A canny young fisher named Fisher 451 | Once fished from the edge of a fissure. 452 | A fish with a grin 453 | Pulled the fisherman in — 454 | Now they're fishing the fissure for Fisher. 455 | 456 | A cheerful old bear at the Zoo 457 | Could always find something to do. 458 | When it bored him, you know, 459 | To walk to and fro, 460 | He reversed it and walked fro and to. 461 | 462 | The bottle of perfume that Willie sent 463 | Was highly displeasing to Millicent; 464 | Her thanks were so cold 465 | They quarreled, I'm told, 466 | Through that silly scent Willie sent Millicent. 467 | 468 | I bought a new Hoover today, 469 | Plugged it in in the usual way, 470 | Switched it on - what a din; 471 | It sucked everything in, 472 | Now I'm homeless with no place to stay. 473 | 474 | There was a young lady named Hannah, 475 | Who slipped on a peel of banana. 476 | As she lay on her side, 477 | More stars she espied 478 | Than there are in the Star-Spangled Banner. 479 | 480 | My neighbor came over to say 481 | (Although not in a neighborly way) 482 | That he'd knock me around 483 | If I didn't curb the sound 484 | Of the classical music I play. 485 | 486 | There once was a man from Gorem 487 | Had a pair of tight pants and he wore 'em 488 | When he bowed with a grin 489 | A draft of air rushed in 490 | And he knew by the sound that he tore 'em! 491 | 492 | There was an Old Man in a tree, 493 | Who was horribly bored by a bee. 494 | When they said “Does it buzz?” 495 | He replied “Yes, it does! 496 | It’s a regular brute of a bee!” 497 | 498 | There was a young belle of old Natchez 499 | Whose garments were always in patchez. 500 | When comments arose 501 | On the state of her clothes, 502 | She replied, “When Ah itchez, Ah scratchez.” 503 | 504 | There was a young fellow from Belfast 505 | That I wanted so badly to tell fast 506 | Not to climb up the stair 507 | As the top step was air 508 | And that’s why the young fellow fell fast. 509 | 510 | There was an old girl of Genoa 511 | And I blush when I think that Iowa; 512 | She’s gone to her rest, 513 | It’s all for the best, 514 | Otherwise I would borrow Samoa. 515 | 516 | There was a dear lady of Eden, 517 | Who on apples was quite fond of feedin’; 518 | She gave one to Adam, 519 | Who said, “Thank you, Madam,” 520 | And then both skedaddled from Eden. 521 | 522 | I know an old owl named Boo, 523 | Every night he yelled Hoo, 524 | Once a kid walked by, 525 | And started to cry, 526 | And yelled I don't have a clue! 
527 | 528 | I once fell in love with a blonde, 529 | But found that she wasn't so fond. 530 | Of my pet turtle named Odle, 531 | whom I'd taught how to Yodel, 532 | So she dumped him outside in the pond. 533 | 534 | A man and his lady-love, Min, 535 | Skated out where the ice was quite thin. 536 | Had a quarrel, no doubt, 537 | For I hear they fell out, 538 | What a blessing they didn't fall in! 539 | 540 | Said the man with a wink of his eye 541 | "But I love you" and then the reply 542 | From the girl, it was heard 543 | "You are truly absurd! 544 | I have only this moment walked by!" 545 | 546 | Leah is such a great swimmer. 547 | Leah is very much slimmer. 548 | She saw a big whale. 549 | It had a big tail. 550 | She put in a pot to simmer. 551 | 552 | Liam is afraid of a wolf. 553 | Tries to talk, but only goes woof. 554 | Liam bites meat too. 555 | The wolf only chews. 556 | Liam is such a silly goof. 557 | 558 | Jack is feeling scared of a fall. 559 | He is scared of playing basketball. 560 | He wants to play game. 561 | But his shot is lame. 562 | Gives it his all and the ball falls. 563 | 564 | Anne likes to skate on the cold ice. 565 | She thinks she has some skating spice. 566 | So, the big ice breaks. 567 | Then she starts to shake. 568 | She gets out and says, “That’s not nice.” 569 | 570 | Joe took a trip on a big boat. 571 | Joe had bought a really long rope. 572 | He pulled a big fish. 573 | The fish was a pitch. 574 | Joe rowed and rowed and rowed his boat. 575 | 576 | There was a nice girl named Jaray. 577 | She rhymed up and knew she could slay. 578 | Baddie knows she’s a 10. 579 | She talked it to win. 580 | She knows she could get some green pay. 581 | 582 | My brother saw one of his coats. 583 | Then my brother saw a cool boat. 584 | So, he grabbed a book. 585 | Then he saw a hook. 586 | Then my brother started to float. 587 | 588 | Peter is so scared of big rats. 589 | The rat is meaner than all cats. 590 | The rat bites and fights. 591 | He has a tall height. 592 | So, Peter becomes friends with rats. 593 | 594 | Hello, I’m the Earth, you love me. 595 | If you look closer you will see, 596 | I am in danger, 597 | Mars is a stranger. 598 | Mars is chasing me, please send help! 599 | 600 | Azuri is scared of the door. 601 | Her mom is honking the car horn. 602 | Her dad gets so mad. 603 | He says she so bad. 604 | Azuri plays with the front door. 605 | 606 | MJ is a great ball player. 607 | He is a fearless shot taker. 608 | When he dunks, he floats. 609 | That’s why he the G.O.A.T. 610 | He is the greatest shot maker. 611 | 612 | Ant-Man has no luck, he is stuck. 613 | He’s thinking man this really sucks. 614 | He starts to get mad. 615 | After he gets sad. 616 | Then he knew he had no good luck. 617 | 618 | Robert’s scared of a fight in the night. 619 | Woke with anime in his sight. 620 | Robert might bite you, 621 | Til’ your fingers blue. 622 | Robert saw a bad fight last night. 623 | 624 | Autumn is terrified of bees. 625 | The bees like to eat and smell feet. 626 | Bees love to waste time. 627 | And eat some green limes. 628 | The bees left to hide in the trees. 629 | 630 | There was an Old Man with a beard, 631 | Who said, 'It is just as I feared! 632 | Two Owls and a Hen, 633 | Four Larks and a Wren, 634 | Have all built their nests in my beard!' 635 | 636 | There was an Old Person of Ischia, 637 | Whose conduct grew friskier and friskier; 638 | He danced hornpipes and jigs, 639 | And ate thousands of figs, 640 | That lively Old Person of Ischia. 
641 | 642 | There was an Old Man in a boat, 643 | Who said, 'I'm afloat, I'm afloat!' 644 | When they said, 'No! you ain't!' 645 | He was ready to faint, 646 | That unhappy Old Man in a boat. 647 | 648 | There was a Young Lady of Hull, 649 | Who was chased by a virulent bull; 650 | But she seized on a spade, 651 | And called out, 'Who's afraid?' 652 | Which distracted that virulent bull. 653 | 654 | There was an Old Person of Ems, 655 | Who casually fell in the Thames; 656 | And when he was found 657 | They said he was drowned, 658 | That unlucky Old Person of Ems. 659 | 660 | There was an Old Man who said, 'Hush! 661 | I perceive a young bird in this bush!' 662 | When they said, 'Is it small?' 663 | He replied, 'Not at all! 664 | It is four times as big as the bush!' 665 | 666 | There was a Young Lady of Russia, 667 | Who screamed so that no one could hush her; 668 | Her screams were extreme, 669 | No one heard such a scream, 670 | As was screamed by that lady of Russia. 671 | 672 | There was an Old Person of Ewell, 673 | Who chiefly subsisted on gruel; 674 | But to make it more nice 675 | He inserted some mice, 676 | Which refreshed that Old Person of Ewell. 677 | 678 | There was an old man in a tree, 679 | Whose whiskers were lovely to see; 680 | But the birds of the air, 681 | Pluck'd them perfectly bare, 682 | To make themselves nests on that tree. 683 | 684 | There is a Young Lady whose nose 685 | Continually prospers and grows; 686 | When it grew out of sight, 687 | She exclaimed in a fright, 688 | "Oh! Farewell to the end of my nose!" 689 | 690 | There was an Old Person of Dean, 691 | Who dined on one pea and one bean; 692 | For he said, 693 | "More than that would make me too fat," 694 | That cautious Old Person of Dean. 695 | 696 | There was an Old Person of Dover, 697 | Who rushed through a field of blue Clover; 698 | But some very large bees, 699 | Stung his nose and his knees, 700 | So he very soon went back to Dover. 701 | 702 | There was an Old Man of Peru, 703 | Who watched his wife making a stew; 704 | But once by mistake, 705 | In a stove she did bake, 706 | That unfortunate Man of Peru. 707 | 708 | There was a Young Lady whose bonnet, 709 | Came untied when the birds sate upon it; 710 | But she said: 'I don't care! 711 | All the birds in the air 712 | Are welcome to sit on my bonnet!' -------------------------------------------------------------------------------- /llm_bench/benchmark_suite.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%pip install python-dotenv seaborn matplotlib pandas numpy locust==2.18.1 orjson==3.9.10\n", 10 | "from dotenv import load_dotenv\n", 11 | "load_dotenv()\n", 12 | "import os\n", 13 | "import pandas as pd\n", 14 | "import matplotlib.pyplot as plt\n", 15 | "import seaborn as sns\n", 16 | "import datetime\n", 17 | "import json\n", 18 | "import subprocess\n", 19 | "import numpy as np\n", 20 | "import time" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "# LLM Bench Guide\n", 28 | "\n", 29 | "This guide explains how to use the benchmarking tools to evaluate LLM performance.\n", 30 | "\n", 31 | "### Metrics Collected\n", 32 | "The `load_test.py` script measures:\n", 33 | "\n", 34 | "1. Average Time to First Token\n", 35 | "2. Average Token Latency\n", 36 | "3. Average Token Count\n", 37 | "4. 
Average Total Response Time\n", 38 | "5. Request Count\n", 39 | "6. Queries Per Second (QPS)\n", 40 | "7. Latency Percentiles (p50, p90, p99, p99.9)\n", 41 | " - Time to First Token\n", 42 | " - Total Response Time\n", 43 | "\n", 44 | "### Benchmark Scenarios\n", 45 | "We will create the following benchmark suites in this notebook. Although there are many other types of tests that can be run, please refer to the README.md for more details:\n", 46 | "\n", 47 | "1. Single Model Performance Analysis\n", 48 | "2. Comparing Performance of different Models and Providers\n", 49 | "3. Token Length Impact Study" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 71, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "'''Helper functions, you can ignore this'''\n", 59 | "\n", 60 | "#Function to utilize subprocess to run the locust script\n", 61 | "def execute_subprocess(cmd):\n", 62 | " print(f\"\\nExecuting benchmark: {' '.join(cmd)}\\n\")\n", 63 | " process = subprocess.Popen(\n", 64 | " cmd,\n", 65 | " text=True,\n", 66 | " stdout=subprocess.PIPE,\n", 67 | " stderr=subprocess.STDOUT,\n", 68 | " bufsize=1,\n", 69 | " universal_newlines=True\n", 70 | " )\n", 71 | " # Display output in real-time\n", 72 | " while True:\n", 73 | " output = process.stdout.readline()\n", 74 | " if output == '' and process.poll() is not None:\n", 75 | " break\n", 76 | " if output:\n", 77 | " print(output.strip())\n", 78 | "\n", 79 | " return_code = process.poll()\n", 80 | " if return_code != 0:\n", 81 | " print(f\"Benchmark failed with return code: {return_code}\")\n", 82 | " return False\n", 83 | " return True\n", 84 | "\n", 85 | "\n", 86 | "%matplotlib inline\n", 87 | "# Optional: for higher resolution plots\n", 88 | "%config InlineBackend.figure_format = 'retina'\n", 89 | "\n", 90 | "\n", 91 | "def visualize_comparative_results(stat_result_paths, results_dir):\n", 92 | " \"\"\"\n", 93 | " Create comparative visualizations for multiple model benchmark results\n", 94 | " \n", 95 | " Args:\n", 96 | " stat_result_paths (list): List of dicts containing paths to stats files and configs\n", 97 | " [{\"path\": \"path/to/stats.csv\", \"config\": {\"provider\": \"...\", \"model\": \"...\"}}, ...]\n", 98 | " results_dir (str): Directory to save the comparative visualizations\n", 99 | " \"\"\"\n", 100 | " # Read and combine all CSV files with model information\n", 101 | " dfs = []\n", 102 | " for result in stat_result_paths:\n", 103 | " df = pd.read_csv(result['path'])\n", 104 | " df['model'] = result['config']['model']\n", 105 | " df['provider'] = result['config']['provider']\n", 106 | " dfs.append(df)\n", 107 | " \n", 108 | " # Combine all dataframes\n", 109 | " combined_df = pd.concat(dfs, ignore_index=True)\n", 110 | " \n", 111 | " # Get POST data first\n", 112 | " post_data = combined_df[combined_df['Type'] == 'POST']\n", 113 | " \n", 114 | " # Get unique models and calculate dynamic bar width\n", 115 | " models = post_data['model'].unique()\n", 116 | " num_models = len(models)\n", 117 | " bar_width = min(0.35, 0.8 / num_models) # Dynamically reduce bar width as models increase\n", 118 | " \n", 119 | " # Set style for better visualizations\n", 120 | " plt.style.use('ggplot')\n", 121 | " fig = plt.figure(figsize=(20, 15))\n", 122 | "\n", 123 | " # 1. 
Response Time Distribution Comparison\n", 124 | " plt.subplot(2, 2, 1)\n", 125 | " metrics_to_plot = ['Average Response Time', 'Median Response Time']\n", 126 | " \n", 127 | " x = np.arange(len(metrics_to_plot))\n", 128 | " # Adjust bar positions to be centered\n", 129 | " positions = np.linspace(-(bar_width * (num_models-1))/2, \n", 130 | " (bar_width * (num_models-1))/2, \n", 131 | " num_models)\n", 132 | " \n", 133 | " for i, model in enumerate(models):\n", 134 | " model_data = post_data[post_data['model'] == model][metrics_to_plot]\n", 135 | " plt.bar(x + positions[i], model_data.iloc[0], bar_width, label=model.split('/')[-1])\n", 136 | " \n", 137 | " plt.title('Response Time Comparison')\n", 138 | " plt.xlabel('Metrics')\n", 139 | " plt.ylabel('Time (ms)')\n", 140 | " plt.xticks(x, metrics_to_plot, rotation=45)\n", 141 | " plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')\n", 142 | "\n", 143 | " # 2. QPS Comparison\n", 144 | " plt.subplot(2, 2, 2)\n", 145 | " qps_data = combined_df[combined_df['Type'] == 'POST'][['model', 'Requests/s']]\n", 146 | " x = np.arange(len(models))\n", 147 | " plt.bar(x, [qps_data[qps_data['model'] == model]['Requests/s'].iloc[0] for model in models], \n", 148 | " width=0.6) # Single bars can be wider\n", 149 | " plt.title('Throughput Comparison')\n", 150 | " plt.xlabel('Model')\n", 151 | " plt.ylabel('Requests per Second')\n", 152 | " plt.xticks(x, [model.split('/')[-1] for model in models], rotation=45)\n", 153 | "\n", 154 | " # 3. Token Latency Comparison\n", 155 | " plt.subplot(2, 2, 3)\n", 156 | " token_metrics = ['latency_per_token', 'overall_latency_per_token']\n", 157 | " token_data = combined_df[\n", 158 | " (combined_df['Type'] == 'METRIC') & \n", 159 | " (combined_df['Name'].isin(token_metrics))\n", 160 | " ]\n", 161 | " \n", 162 | " x = np.arange(len(token_metrics))\n", 163 | " for i, model in enumerate(models):\n", 164 | " model_data = token_data[token_data['model'] == model]\n", 165 | " values = [\n", 166 | " model_data[model_data['Name'] == metric]['Average Response Time'].iloc[0]\n", 167 | " for metric in token_metrics\n", 168 | " ]\n", 169 | " plt.bar(x + positions[i], values, bar_width, label=model.split('/')[-1])\n", 170 | " \n", 171 | " plt.title('Token Latency Comparison')\n", 172 | " plt.xlabel('Metrics')\n", 173 | " plt.ylabel('Time (ms)')\n", 174 | " plt.xticks(x, ['Per Token', 'Overall Per Token'], rotation=45)\n", 175 | " plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')\n", 176 | "\n", 177 | " # 4. 
Percentile Distribution Comparison\n", 178 | " plt.subplot(2, 2, 4)\n", 179 | " percentiles = ['50%', '75%', '90%', '95%', '99%', '99.9%']\n", 180 | " \n", 181 | " for model in models:\n", 182 | " model_data = post_data[post_data['model'] == model]\n", 183 | " plt.plot(percentiles, model_data[percentiles].iloc[0], marker='o', label=model.split('/')[-1])\n", 184 | " \n", 185 | " plt.title('Response Time Percentiles Comparison')\n", 186 | " plt.xlabel('Percentile')\n", 187 | " plt.ylabel('Response Time (ms)')\n", 188 | " plt.xticks(rotation=45)\n", 189 | " plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')\n", 190 | "\n", 191 | " # Adjust layout to prevent overlapping\n", 192 | " plt.tight_layout()\n", 193 | " # Save the figure\n", 194 | " plt.savefig(f'{results_dir}/comparative_performance_metrics.png', \n", 195 | " bbox_inches='tight', # Ensure the legend is included in the saved figure\n", 196 | " dpi=300) # Higher resolution\n", 197 | " # Display in notebook\n", 198 | " plt.show()\n", 199 | " # Close the figure\n", 200 | " plt.close()\n", 201 | "\n", 202 | " # Generate summary statistics\n", 203 | " summary_stats = []\n", 204 | " for model in models:\n", 205 | " model_data = combined_df[combined_df['model'] == model]\n", 206 | " post_data = model_data[model_data['Type'] == 'POST'].iloc[0]\n", 207 | " token_data = model_data[model_data['Type'] == 'METRIC']\n", 208 | " \n", 209 | " summary_stats.append({\n", 210 | " \"Model\": model.split('/')[-1],\n", 211 | " \"Provider\": model_data['provider'].iloc[0],\n", 212 | " \"Average QPS\": post_data['Requests/s'],\n", 213 | " \"Average Response Time\": post_data['Average Response Time'],\n", 214 | " \"99th Percentile Latency\": post_data['99%'],\n", 215 | " \"Average Tokens per Request\": token_data[token_data['Name'] == 'num_tokens']['Average Response Time'].iloc[0]\n", 216 | " })\n", 217 | " \n", 218 | " # Print comparative summary\n", 219 | " print(\"\\nComparative Summary:\")\n", 220 | " print(\"-\" * 80)\n", 221 | " summary_df = pd.DataFrame(summary_stats)\n", 222 | " print(summary_df.to_string(index=False))\n", 223 | " \n", 224 | " # Save summary to CSV\n", 225 | " summary_df.to_csv(f'{results_dir}/comparative_summary.csv', index=False)\n", 226 | " \n", 227 | " return summary_stats" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "## Single Model and Provider Performance Analysis\n", 235 | "\n", 236 | "Evaluate performance metrics of singular model from one provider. This is the most basic benchmark that can be run from the load_test.py script." 
237 |    ] 238 |   }, 239 |   { 240 |    "cell_type": "code", 241 |    "execution_count": null, 242 |    "metadata": {}, 243 |    "outputs": [], 244 |    "source": [ 245 |     "'''\n", 246 |     "Make sure to create a .env file in the root directory and add your API keys.\n", 247 |     "For this example, we will use the Fireworks API key.\n", 248 |     "\n", 249 |     "Add the following to your .env file:\n", 250 |     "\n", 251 |     "FIREWORKS_API_KEY=<your_api_key>\n", 252 |     "\n", 253 |     "Alternatively, you can edit the following script flags for custom configurations.\n", 254 |     "'''\n", 255 |     "\n", 256 |     "\n", 257 |     "provider_name = \"fireworks\"\n", 258 |     "model_name = \"accounts/fireworks/models/llama-v3p2-3b-instruct\"\n", 259 |     "h = \"https://api.fireworks.ai/inference\" # host url\n", 260 |     "api_key = os.getenv(\"FIREWORKS_API_KEY\")\n", 261 |     "\n", 262 |     "t = \"5s\" # test duration; set to 5 seconds for now, increase (e.g. \"1m\") for more stable results\n", 263 |     "\n", 264 |     "'''\n", 265 |     "Choose ONE of the following two modes by commenting/uncommenting:\n", 266 |     "'''\n", 267 |     "# MODE 1: Fixed Queries Per Second (QPS)\n", 268 |     "# Use this mode to maintain a steady rate of requests\n", 269 |     "qps = 5  # Target requests per second\n", 270 |     "u = 100  # Number of users (keep high enough to achieve target QPS)\n", 271 |     "s = 100  # Spawn rate (keep high enough to achieve target QPS)\n", 272 |     "\n", 273 |     "# MODE 2: Fixed Concurrency\n", 274 |     "# Use this mode to maintain a steady number of concurrent requests\n", 275 |     "# Comment out Mode 1 above and uncomment below to use this mode\n", 276 |     "'''\n", 277 |     "# QPS does not need to be set for fixed concurrency mode\n", 278 |     "u = 5  # Number of concurrent workers\n", 279 |     "r = 5  # Rate of spawning new workers (workers/second). Look through README.md for more details on spawn rate\n", 280 |     "'''\n", 281 |     "\n", 282 |     "\n", 283 |     "\n", 284 |     "# Create results directory named {provider}_{model}_analysis_{TIMESTAMP}\n", 285 |     "timestamp = datetime.datetime.now().strftime(\"%Y%m%d_%H%M\")\n", 286 |     "\n", 287 |     "edited_model_name = model_name.replace(\"/\", \"_\") if provider_name != \"fireworks\" else model_name.replace(\"accounts/fireworks/models/\", \"\").replace(\"/\", \"_\")\n", 288 |     "\n", 289 |     "\n", 290 |     "\n", 291 |     "results_dir = f\"results/{provider_name}_{edited_model_name}_analysis_{timestamp}\"\n", 292 |     "os.makedirs(results_dir, exist_ok=True)\n", 293 |     "\n", 294 |     "# Construct the command\n", 295 |     "cmd = [\n", 296 |     "    \"locust\",\n", 297 |     "    \"--headless\",  # Run without web UI\n", 298 |     "    \"--only-summary\",  # Only show summary stats\n", 299 |     "    \"-H\", h,  # Host URL\n", 300 |     "    \"--provider\", provider_name,\n", 301 |     "    \"--model\", model_name,\n", 302 |     "    \"--api-key\", api_key,\n", 303 |     "    \"-t\", t,  # Test duration\n", 304 |     "    \"--html\", f\"{results_dir}/report.html\",  # Generate HTML report\n", 305 |     "    \"--csv\", f\"{results_dir}/stats\",  # Generate CSV stats\n", 306 |     "]\n", 307 |     "\n", 308 |     "# Add Mode 1 (fixed QPS) parameters; if using fixed concurrency mode instead, remove --qps below\n", 309 |     "cmd.extend([\n", 310 |     "    \"-u\", str(u),  # Number of users\n", 311 |     "    \"-r\", str(s),  # Spawn rate\n", 312 |     "    \"--qps\", str(qps)  # Target QPS\n", 313 |     "])\n", 314 |     "\n", 315 |     "# Add load_test.py as the locust file\n", 316 |     "locust_file = os.path.join(os.path.dirname(os.getcwd()), \"llm_bench\", \"load_test.py\")\n", 317 |     "cmd.extend([\"-f\", locust_file]) \n", 318 |     "\n", 319 |     "# call our helper function to execute the command\n", 320 |     "success = execute_subprocess(cmd)\n", 321 |     "\n", 322 |     
"#Visualize the results\n", 323 | "if success: \n", 324 | " time.sleep(1)\n", 325 | " stat_result_paths = [{\"path\": f'{results_dir}/stats_stats.csv', \"config\": {\"provider\": provider_name, \"model\": model_name}}]\n", 326 | " visualize_comparative_results(stat_result_paths, results_dir)" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "## Comparing different Models and Providers\n", 334 | "\n", 335 | "Evaluate performance metrics of different models and providers." 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [ 344 | "'''\n", 345 | "Edit the provider_configs list to add more providers and models.\n", 346 | "\n", 347 | "Add any needed api keys to the .env file. This example uses the Fireworks API key again.\n", 348 | "'''\n", 349 | "\n", 350 | "provider_configs = [\n", 351 | " {\"provider\": \"fireworks\", \"model\": \"accounts/fireworks/models/llama-v3p2-3b-instruct\", \"host\": \"https://api.fireworks.ai/inference\", \"api_key\": os.getenv(\"FIREWORKS_API_KEY\")},\n", 352 | " {\"provider\": \"fireworks\", \"model\": \"accounts/fireworks/models/mistral-small-24b-instruct-2501\", \"host\": \"https://api.fireworks.ai/inference\", \"api_key\": os.getenv(\"FIREWORKS_API_KEY\")},\n", 353 | " #... add more providers and models here\n", 354 | "]\n", 355 | "\n", 356 | "# some starter configs and flags\n", 357 | "t = \"5s\" #test duration, set to 1 minute for now\n", 358 | "qps = 5 # Target requests per second\n", 359 | "u = 100 # Number of users (keep high enough to achieve target QPS)\n", 360 | "s = 100 # Spawn rate (keep high enough to achieve target QPS)\n", 361 | "\n", 362 | "# Create results directory of name single_model_provider_analysis_{TIMESTAMP}\n", 363 | "timestamp = datetime.datetime.now().strftime(\"%Y%m%d_%H%M\")\n", 364 | "results_dir = f\"results/different_models_and_providers_analysis_{timestamp}\"\n", 365 | "os.makedirs(results_dir, exist_ok=True)\n", 366 | "\n", 367 | "for index, config in enumerate(provider_configs):\n", 368 | " # Construct the command\n", 369 | "\n", 370 | " edited_model_name = config[\"model\"].replace(\"/\", \"_\") if config[\"provider\"] != \"fireworks\" else config[\"model\"].replace(\"accounts/fireworks/models/\", \"\").replace(\"/\", \"_\")\n", 371 | " \n", 372 | " provider_model_path = f\"{results_dir}/{config[\"provider\"]}_{edited_model_name}_{index}\"\n", 373 | " \n", 374 | " os.makedirs(provider_model_path, exist_ok=True)\n", 375 | " cmd = [\n", 376 | " \"locust\",\n", 377 | " \"--headless\", # Run without web UI\n", 378 | " \"--only-summary\", # Only show summary stats\n", 379 | " \"-H\", config[\"host\"], # Host URL\n", 380 | " \"--provider\", config[\"provider\"],\n", 381 | " \"--model\", config[\"model\"],\n", 382 | " \"--api-key\", config[\"api_key\"],\n", 383 | " \"-t\", t, # Test duration\n", 384 | " \"--html\", f\"{provider_model_path}/report.html\", # Generate HTML report\n", 385 | " \"--csv\", f\"{provider_model_path}/stats\", # Generate CSV stats\n", 386 | " ]\n", 387 | "\n", 388 | " # Add Mode 1 (Fixed QPS) parameters if uncommented, remember to remove --qps below if using fixed concurrency mode\n", 389 | " cmd.extend([\n", 390 | " \"-u\", str(u), # Number of users\n", 391 | " \"-r\", str(s), # Spawn rate\n", 392 | " \"--qps\", str(qps) # Target QPS\n", 393 | " ])\n", 394 | "\n", 395 | " # Add load_test.py as the locust file\n", 396 | " locust_file = 
os.path.join(os.path.dirname(os.getcwd()), \"llm_bench\", \"load_test.py\")\n", 397 |     "    cmd.extend([\"-f\", locust_file]) \n", 398 |     "\n", 399 |     "    # call our helper function to execute the command\n", 400 |     "    execute_subprocess(cmd)\n", 401 |     "\n", 402 |     "# Visualize the results\n", 403 |     "stat_result_paths = []\n", 404 |     "for index, config in enumerate(provider_configs):\n", 405 |     "    edited_model_name = config[\"model\"].replace(\"/\", \"_\") if config[\"provider\"] != \"fireworks\" else config[\"model\"].replace(\"accounts/fireworks/models/\", \"\").replace(\"/\", \"_\")\n", 406 |     "    stat_result_paths.append({\"path\": f\"{results_dir}/{config['provider']}_{edited_model_name}_{index}/stats_stats.csv\", \"config\": config})\n", 407 |     "\n", 408 |     "time.sleep(1)\n", 409 |     "visualize_comparative_results(stat_result_paths, results_dir)\n", 410 |     "\n" 411 |    ] 412 |   }, 413 |   { 414 |    "cell_type": "markdown", 415 |    "metadata": {}, 416 |    "source": [ 417 |     "## Token Length Analysis\n", 418 |     "\n", 419 |     "Evaluates model performance across different output lengths by testing the same model \n", 420 |     "with varying output token limits (from short to long responses)." 421 |    ] 422 |   }, 423 |   { 424 |    "cell_type": "code", 425 |    "execution_count": null, 426 |    "metadata": {}, 427 |    "outputs": [], 428 |    "source": [ 429 |     "max_token_lengths = [30, 64, 128, 256] # --max-tokens flag\n", 430 |     "max_token_lengths_distribution = \"uniform\" # --max-tokens-distribution flag; check README.md for more details and other options\n", 431 |     "\n", 432 |     "\n", 433 |     "provider_name = \"fireworks\"\n", 434 |     "model_name = \"accounts/fireworks/models/llama-v3p2-3b-instruct\"\n", 435 |     "h = \"https://api.fireworks.ai/inference\" # host url\n", 436 |     "api_key = os.getenv(\"FIREWORKS_API_KEY\")\n", 437 |     "\n", 438 |     "t = \"5s\" # test duration; set to 5 seconds for now, increase (e.g. \"1m\") for more stable results\n", 439 |     "qps = 5  # Target requests per second\n", 440 |     "u = 100  # Number of users (keep high enough to achieve target QPS)\n", 441 |     "s = 100  # Spawn rate (keep high enough to achieve target QPS)\n", 442 |     "\n", 443 |     "# Create results directory named token_length_analysis_{TIMESTAMP}\n", 444 |     "timestamp = datetime.datetime.now().strftime(\"%Y%m%d_%H%M\")\n", 445 |     "results_dir = f\"results/token_length_analysis_{timestamp}\"\n", 446 |     "os.makedirs(results_dir, exist_ok=True)\n", 447 |     "\n", 448 |     "for index, token_length in enumerate(max_token_lengths):\n", 449 |     "    # Construct the command\n", 450 |     "\n", 451 |     "    edited_model_name = model_name.replace(\"/\", \"_\") if provider_name != \"fireworks\" else model_name.replace(\"accounts/fireworks/models/\", \"\").replace(\"/\", \"_\")\n", 452 |     "    \n", 453 |     "    token_length_path = f\"{results_dir}/{provider_name}_{edited_model_name}_{token_length}\"\n", 454 |     "    os.makedirs(f\"{token_length_path}\", exist_ok=True)\n", 455 |     "    cmd = [\n", 456 |     "        \"locust\",\n", 457 |     "        \"--headless\",  # Run without web UI\n", 458 |     "        \"--only-summary\",  # Only show summary stats\n", 459 |     "        \"-H\", h,  # Host URL\n", 460 |     "        \"--provider\", provider_name,\n", 461 |     "        \"--model\", model_name,\n", 462 |     "        \"--api-key\", api_key,\n", 463 |     "        \"-t\", t,  # Test duration\n", 464 |     "        \"--max-tokens\", str(token_length), \n", 465 |     "        \"--max-tokens-distribution\", max_token_lengths_distribution,\n", 466 |     "        \"--html\", f\"{token_length_path}/report.html\",  # Generate HTML report\n", 467 |     "        \"--csv\", f\"{token_length_path}/stats\",  # Generate CSV stats\n", 468 |     "    ]\n", 469 |     "\n", 470 |     "    # Add Mode 1 (fixed QPS) parameters; if using fixed concurrency mode instead, remove --qps below\n",
471 |     "    cmd.extend([\n", 472 |     "        \"-u\", str(u),  # Number of users\n", 473 |     "        \"-r\", str(s),  # Spawn rate\n", 474 |     "        \"--qps\", str(qps)  # Target QPS\n", 475 |     "    ])\n", 476 |     "\n", 477 |     "    # Add load_test.py as the locust file\n", 478 |     "    locust_file = os.path.join(os.path.dirname(os.getcwd()), \"llm_bench\", \"load_test.py\")\n", 479 |     "    cmd.extend([\"-f\", locust_file]) \n", 480 |     "\n", 481 |     "    # call our helper function to execute the command\n", 482 |     "    execute_subprocess(cmd)\n", 483 |     "\n", 484 |     "# Visualize the results\n", 485 |     "stat_result_paths = []\n", 486 |     "for index, token_length in enumerate(max_token_lengths):\n", 487 |     "\n", 488 |     "    edited_model_name = model_name.replace(\"/\", \"_\") if provider_name != \"fireworks\" else model_name.replace(\"accounts/fireworks/models/\", \"\").replace(\"/\", \"_\")\n", 489 |     "    stat_result_paths.append({\"path\": f\"{results_dir}/{provider_name}_{edited_model_name}_{token_length}/stats_stats.csv\", \"config\": {\"provider\": provider_name, \"model\": model_name + \"_\" + str(token_length)}})\n", 490 |     "\n", 491 |     "time.sleep(1)\n", 492 |     "visualize_comparative_results(stat_result_paths, results_dir)" 493 |    ] 494 |   } 495 |  ], 496 |  "metadata": { 497 |   "kernelspec": { 498 |    "display_name": "benchmark", 499 |    "language": "python", 500 |    "name": "python3" 501 |   }, 502 |   "language_info": { 503 |    "codemirror_mode": { 504 |     "name": "ipython", 505 |     "version": 3 506 |    }, 507 |    "file_extension": ".py", 508 |    "mimetype": "text/x-python", 509 |    "name": "python", 510 |    "nbconvert_exporter": "python", 511 |    "pygments_lexer": "ipython3", 512 |    "version": "3.12.8" 513 |   } 514 |  }, 515 |  "nbformat": 4, 516 |  "nbformat_minor": 2 517 | } 518 | -------------------------------------------------------------------------------- /llm_bench/load_test.py: -------------------------------------------------------------------------------- 1 | import abc 2 | import argparse 3 | import csv 4 | from dataclasses import dataclass 5 | from functools import partial 6 | import os 7 | import random 8 | import sys 9 | import traceback 10 | from typing import Optional 11 | from locust import HttpUser, task, events, constant_pacing 12 | import copy 13 | import json 14 | import time 15 | import orjson 16 | import base64 17 | import io 18 | import itertools 19 | from PIL import Image 20 | import transformers 21 | import re 22 | import gevent 23 | from locust.util.timespan import parse_timespan as _locust_parse_timespan 24 | 25 | try: 26 |     import locust_plugins 27 | except ImportError: 28 |     print("locust-plugins is not installed, Grafana won't work") 29 | 30 | 31 | def add_custom_metric(name, value, length_value=0): 32 |     events.request.fire( 33 |         request_type="METRIC", 34 |         name=name, 35 |         response_time=value, 36 |         response_length=length_value, 37 |         exception=None, 38 |         context=None, 39 |     ) 40 | 41 | 42 | PROMPT_CHAT_IMAGE_PLACEHOLDER = "<image>" 43 | 44 | 45 | class LimericsDataset: 46 |     _PROMPT = "\n\nTranslate the limericks above to Spanish, then re-write limericks using different styles. Do it 10 times."
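    # LimericsDataset assembles synthetic prompts from limericks.txt: a shared prefix
    # of randomly drawn limericks sized by --prompt-cache-max-len (so repeated requests
    # can exercise the server's prompt cache), plus fresh random limericks on every
    # request until the --prompt-tokens target is reached.
    # A minimal sketch of standalone usage (the tokenizer path is hypothetical):
    #   ds = LimericsDataset("limericks.txt", "meta-llama/Llama-3.2-3B-Instruct",
    #                        chat=True, num_tokens=512, common_tokens=0)
    #   prompt, prompt_tokens = next(iter(ds))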
47 | 48 |     def __init__( 49 |         self, 50 |         path: str, 51 |         tokenizer_path: str, 52 |         chat: bool, 53 |         num_tokens: int, 54 |         common_tokens: int, 55 |     ): 56 |         self._tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_path) 57 |         self._num_tokens = num_tokens 58 | 59 |         self._all_limericks = [] 60 |         with open(path, "r") as f: 61 |             text = f.read() 62 |         lims = text.split("\n\n") 63 |         for i, lim in enumerate(lims): 64 |             num_tokens = len(self._tokenizer.encode(lim)) 65 |             self._all_limericks.append((lim, num_tokens)) 66 | 67 |         self._prefix = "" 68 |         self._suffix = self._PROMPT 69 |         self._prefix_suffix_tokens = len(self._tokenizer.encode(self._PROMPT)) 70 |         while self._prefix_suffix_tokens < common_tokens: 71 |             lim, num_tokens = self._all_limericks[ 72 |                 random.randint(0, len(self._all_limericks) - 1) 73 |             ] 74 |             self._prefix += lim + "\n\n" 75 |             self._prefix_suffix_tokens += num_tokens 76 | 77 |         if chat: 78 |             empty_template_tokens = self._tokenizer.apply_chat_template( 79 |                 [{"role": "user", "content": ""}], 80 |                 tokenize=True, 81 |                 add_generation_prompt=True, 82 |             ) 83 |             self._prefix_suffix_tokens += len(empty_template_tokens) 84 | 85 |     def __next__(self): 86 |         prompt_tokens = self._prefix_suffix_tokens 87 |         prompt = self._prefix 88 |         while prompt_tokens < self._num_tokens: 89 |             lim, num_tokens = self._all_limericks[ 90 |                 random.randint(0, len(self._all_limericks) - 1) 91 |             ] 92 | 93 |             prompt += lim + "\n\n" 94 |             prompt_tokens += num_tokens 95 |         prompt += self._suffix 96 | 97 |         return prompt, prompt_tokens 98 | 99 |     def __iter__(self): 100 |         return self 101 | 102 | 103 | class JsonlDataset: 104 |     def __init__(self, path: str): 105 |         self.path = path 106 | 107 |     def __iter__(self): 108 |         return itertools.cycle(self._read_data()) 109 | 110 |     def _read_data(self): 111 |         with open(self.path, "r") as f: 112 |             for line in f: 113 |                 yield json.loads(line), 0 114 | 115 | 116 | class DatasetHolder: 117 |     _instance = None 118 | 119 |     @classmethod 120 |     def _create_dataset(cls, options: argparse.Namespace): 121 |         if options.dataset.startswith("@"): 122 |             return JsonlDataset(options.dataset[1:]) 123 |         elif options.dataset == "limerics": 124 |             assert ( 125 |                 options.tokenizer is not None 126 |             ), "--tokenizer is required for limerics dataset" 127 |             return LimericsDataset( 128 |                 path=os.path.join( 129 |                     os.path.dirname(os.path.abspath(__file__)), "limericks.txt" 130 |                 ), 131 |                 tokenizer_path=options.tokenizer, 132 |                 chat=options.chat, 133 |                 num_tokens=options.prompt_tokens, 134 |                 common_tokens=options.prompt_cache_max_len, 135 |             ) 136 |         else: 137 |             raise ValueError(f"Unknown dataset: {options.dataset}") 138 | 139 |     @classmethod 140 |     def get_instance(cls, options: argparse.Namespace): 141 |         if cls._instance is None: 142 |             cls._instance = cls._create_dataset(options) 143 |         return cls._instance 144 | 145 | 146 | class FixedQPSPacer: 147 |     _instance = None 148 | 149 |     def __init__(self, qps, distribution): 150 |         self.qps = qps 151 |         self.distribution = distribution 152 | 153 |         # It's roughly thread-safe thanks to the GIL, as the only state is `t` - good enough for a load test 154 |         def gen(): 155 |             t = time.time() 156 |             mean_wait = 1 / self.qps 157 |             while True: 158 |                 if self.distribution == "exponential": 159 |                     wait = random.expovariate(1 / mean_wait) 160 |                 elif self.distribution == "uniform": 161 |                     wait = random.uniform(0, 2 * mean_wait) 162 |                 elif self.distribution == "constant": 163 |                     wait = mean_wait 164 |                 else: 165 |                     print(f"Unknown distribution {self.distribution}") 166 |                     os._exit(1) 167 |                 t += wait 168
| yield t 169 | 170 | self.iterator = gen() 171 | 172 | @classmethod 173 | def instance(cls, qps, distribution): 174 | if cls._instance is None: 175 | cls._instance = cls(qps, distribution) 176 | else: 177 | assert cls._instance.qps == qps 178 | assert cls._instance.distribution == distribution 179 | return cls._instance 180 | 181 | def wait_time_till_next(self): 182 | t = next(self.iterator) 183 | now = time.time() 184 | if now > t: 185 | print( 186 | f"WARNING: not enough locust users to keep up with the desired QPS. Either the number of locust users is too low or the server is overloaded. Delay: {now-t:.3f}s" 187 | ) 188 | return 0 189 | return t - now 190 | 191 | 192 | class LengthSampler: 193 | def __init__(self, distribution: str, mean: int, cap: Optional[int], alpha: float): 194 | self.distribution = distribution 195 | self.mean = mean 196 | self.cap = cap 197 | self.alpha = alpha 198 | 199 | if self.distribution == "exponential": 200 | self.sample_func = lambda: int(random.expovariate(1 / self.mean)) 201 | elif self.distribution == "uniform": 202 | mx = self.mean + int(self.alpha * self.mean) 203 | if self.cap is not None: 204 | mx = min(mx, self.cap) 205 | self.sample_func = lambda: random.randint( 206 | max(1, self.mean - int(self.alpha * self.mean)), mx 207 | ) 208 | elif self.distribution == "constant": 209 | self.sample_func = lambda: self.mean 210 | elif self.distribution == "normal": 211 | self.sample_func = lambda: int( 212 | random.gauss(self.mean, self.mean * self.alpha) 213 | ) 214 | else: 215 | raise ValueError(f"Unknown distribution {self.distribution}") 216 | 217 | def sample(self) -> int: 218 | for _ in range(1000): 219 | sample = self.sample_func() 220 | if sample <= 0: 221 | continue 222 | if self.cap is not None and sample > self.cap: 223 | continue 224 | return sample 225 | else: 226 | raise ValueError( 227 | "Can't sample a value after 1000 attempts, check distribution parameters" 228 | ) 229 | 230 | def __str__(self): 231 | r = int(self.mean * self.alpha) 232 | if self.distribution == "constant": 233 | s = str(self.mean) 234 | elif self.distribution == "uniform": 235 | s = f"uniform({self.mean} +/- {r})" 236 | elif self.distribution == "normal": 237 | s = f"normal({self.mean}, {r})" 238 | elif self.distribution == "exponential": 239 | s = f"exponential({self.mean})" 240 | else: 241 | assert False 242 | if self.cap is not None: 243 | s += f" capped at {self.cap}" 244 | return s 245 | 246 | 247 | class InitTracker: 248 | users = None 249 | first_request_done = 0 250 | logging_params = None 251 | environment = None 252 | tokenizer = None 253 | deferred_run_time_seconds = None 254 | stop_scheduled = False 255 | stats_reset_done = False 256 | 257 | @classmethod 258 | def notify_init(cls, environment, logging_params): 259 | if cls.environment is None: 260 | cls.environment = environment 261 | if cls.logging_params is None: 262 | cls.logging_params = logging_params 263 | else: 264 | assert ( 265 | cls.logging_params == logging_params 266 | ), f"Inconsistent settings between workers: {cls.logging_params} != {logging_params}" 267 | 268 | @classmethod 269 | def notify_first_request(cls): 270 | cls.first_request_done += 1 271 | 272 | @classmethod 273 | def notify_spawning_complete(cls, user_count): 274 | cls.users = user_count 275 | # Start steady-state measurement exactly when all users have spawned 276 | if not cls.stats_reset_done: 277 | cls.reset_stats() 278 | cls.stats_reset_done = True 279 | # If -t/--run-time was provided, schedule test stop relative to spawn 
complete 280 |         if ( 281 |             cls.deferred_run_time_seconds is not None 282 |             and not cls.stop_scheduled 283 |             and cls.environment is not None 284 |             and cls.environment.runner is not None 285 |         ): 286 |             delay = float(cls.deferred_run_time_seconds) 287 |             print(f"Scheduling stop {delay}s after spawning complete (deferred -t)") 288 |             gevent.spawn_later(delay, cls.environment.runner.quit) 289 |             cls.stop_scheduled = True 290 | 291 |     @classmethod 292 |     def reset_stats(cls): 293 |         assert cls.environment.runner, "only local mode is supported" 294 |         print("Resetting stats after traffic reaches a steady state") 295 |         cls.environment.events.reset_stats.fire() 296 |         cls.environment.runner.stats.reset_all() 297 | 298 |     @classmethod 299 |     def load_tokenizer(cls, dir): 300 |         if not dir: 301 |             return None 302 |         if cls.tokenizer: 303 |             return cls.tokenizer 304 |         import transformers 305 | 306 |         cls.tokenizer = transformers.AutoTokenizer.from_pretrained(dir) 307 |         cls.tokenizer.add_bos_token = False 308 |         cls.tokenizer.add_eos_token = False 309 |         return cls.tokenizer 310 | 311 | 312 | events.spawning_complete.add_listener(InitTracker.notify_spawning_complete) 313 | 314 | 315 | def _parse_run_time_to_seconds(run_time_value): 316 |     """Parse Locust -t/--run-time value into seconds (float). Supports both 317 |     already-parsed numeric values and human strings like '30s', '5m', '1h30m'. 318 |     """ 319 |     if not run_time_value: 320 |         return None 321 |     # If Locust already parsed it to a number (seconds), just use it 322 |     if isinstance(run_time_value, (int, float)): 323 |         return float(run_time_value) 324 |     # Try Locust's own parser first 325 |     if _locust_parse_timespan is not None: 326 |         try: 327 |             return float(_locust_parse_timespan(run_time_value)) 328 |         except Exception: 329 |             pass 330 |     # Fallback simple parser for strings like '1h30m15s' 331 |     s = str(run_time_value).strip().lower() 332 |     total = 0.0 333 |     for value, unit in re.findall(r"(\d+)\s*([smhd])", s): 334 |         n = float(value) 335 |         if unit == "s": 336 |             total += n 337 |         elif unit == "m": 338 |             total += n * 60 339 |         elif unit == "h": 340 |             total += n * 3600 341 |         elif unit == "d": 342 |             total += n * 86400 343 |     if total == 0.0: 344 |         raise ValueError(f"Unable to parse run time value: {run_time_value}") 345 |     return total 346 | 347 | 348 | @events.init.add_listener 349 | def _defer_run_time_to_after_spawn(environment, **_kwargs): 350 |     """Capture -t/--run-time and defer it to start counting after spawn completes. 351 | 352 |     We store the desired duration, null out the original option to prevent 353 |     Locust from scheduling an early stop, and then schedule our own stop in 354 |     InitTracker.notify_spawning_complete. 
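    This way -t measures a steady-state window only, excluding the ramp-up period.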
355 | """ 356 | try: 357 | run_time_value = getattr(environment.parsed_options, "run_time", None) 358 | except Exception: 359 | run_time_value = None 360 | seconds = _parse_run_time_to_seconds(run_time_value) if run_time_value else None 361 | if seconds: 362 | # Disable Locust's default run_time handling by clearing it 363 | try: 364 | environment.parsed_options.run_time = None 365 | except Exception: 366 | pass 367 | InitTracker.deferred_run_time_seconds = seconds 368 | InitTracker.environment = environment 369 | print( 370 | f"Deferring -t/--run-time to start after spawning complete: {seconds}s" 371 | ) 372 | 373 | 374 | @dataclass 375 | class ChunkMetadata: 376 | text: str 377 | logprob_tokens: Optional[int] 378 | usage_tokens: Optional[int] 379 | prompt_usage_tokens: Optional[int] 380 | 381 | 382 | class BaseProvider(abc.ABC): 383 | DEFAULT_MODEL_NAME = None 384 | 385 | def __init__(self, model, parsed_options): 386 | self.model = model 387 | self.parsed_options = parsed_options 388 | 389 | @abc.abstractmethod 390 | def get_url(self): ... 391 | 392 | @abc.abstractmethod 393 | def format_payload(self, prompt, max_tokens, images): ... 394 | 395 | @abc.abstractmethod 396 | def parse_output_json(self, json): ... 397 | 398 | 399 | class OpenAIProvider(BaseProvider): 400 | def get_url(self): 401 | if self.parsed_options.embeddings: 402 | return "/v1/embeddings" 403 | elif self.parsed_options.chat: 404 | return "/v1/chat/completions" 405 | else: 406 | return "/v1/completions" 407 | 408 | def format_payload(self, prompt, max_tokens, images): 409 | if self.parsed_options.embeddings: 410 | data = { 411 | "model": self.model, 412 | "input": prompt, 413 | } 414 | # Add embeddings-specific parameters 415 | if self.parsed_options.return_logits is not None: 416 | data["return_logits"] = self.parsed_options.return_logits 417 | if self.parsed_options.normalize is not None: 418 | data["normalize"] = self.parsed_options.normalize 419 | return data 420 | 421 | data = { 422 | "model": self.model, 423 | "max_tokens": max_tokens, 424 | "stream": self.parsed_options.stream, 425 | "temperature": self.parsed_options.temperature, 426 | "n": self.parsed_options.n, 427 | } 428 | if self.parsed_options.top_k is not None: 429 | data["top_k"] = self.parsed_options.top_k 430 | if self.parsed_options.logprobs is not None: 431 | data["logprobs"] = self.parsed_options.logprobs 432 | if isinstance(prompt, str): 433 | if self.parsed_options.chat: 434 | if images is None: 435 | data["messages"] = [{"role": "user", "content": prompt}] 436 | else: 437 | image_urls = [] 438 | for image in images: 439 | image_urls.append( 440 | {"type": "image_url", "image_url": {"url": image}} 441 | ) 442 | data["messages"] = [ 443 | { 444 | "role": "user", 445 | "content": [{"type": "text", "text": prompt}, *image_urls], 446 | } 447 | ] 448 | else: 449 | data["prompt"] = prompt 450 | if images is not None: 451 | data["images"] = images 452 | else: 453 | assert isinstance(prompt, dict), "prompt must be a dict" 454 | for k, v in prompt.items(): 455 | data[k] = v 456 | 457 | return data 458 | 459 | def parse_output_json(self, data): 460 | if self.parsed_options.embeddings: 461 | return ChunkMetadata( 462 | text=data["data"][0]["embedding"], 463 | logprob_tokens=None, 464 | usage_tokens=None, 465 | prompt_usage_tokens=None, 466 | ) 467 | usage = data.get("usage", None) 468 | 469 | assert len(data["choices"]) == 1, f"Too many choices {len(data['choices'])}" 470 | choice = data["choices"][0] 471 | if self.parsed_options.chat: 472 | if 
self.parsed_options.stream: 473 | block = choice["delta"] 474 | else: 475 | block = choice["message"] 476 | text = (block.get("reasoning", "") or "") + (block.get("reasoning_content", "") or "") + (block.get("content", "") or "") 477 | else: 478 | text = choice["text"] 479 | 480 | logprobs = choice.get("logprobs", None) 481 | if logprobs and "tokens" in logprobs: 482 | logprob_tokens = len(logprobs["tokens"]) 483 | else: 484 | logprob_tokens = None 485 | 486 | return ChunkMetadata( 487 | text=text, 488 | logprob_tokens=logprob_tokens, 489 | usage_tokens=usage["completion_tokens"] if usage else None, 490 | prompt_usage_tokens=usage.get("prompt_tokens", None) if usage else None, 491 | ) 492 | 493 | 494 | class FireworksProvider(OpenAIProvider): 495 | def format_payload(self, prompt, max_tokens, images): 496 | data = super().format_payload(prompt, max_tokens, images) 497 | if not self.parsed_options.embeddings: 498 | data["min_tokens"] = max_tokens 499 | data["prompt_cache_max_len"] = self.parsed_options.prompt_cache_max_len 500 | return data 501 | 502 | 503 | class VllmProvider(OpenAIProvider): 504 | def format_payload(self, prompt, max_tokens, images): 505 | data = super().format_payload(prompt, max_tokens, images) 506 | data["ignore_eos"] = True 507 | return data 508 | 509 | 510 | class TogetherProvider(OpenAIProvider): 511 | def get_url(self): 512 | assert not self.parsed_options.chat, "Chat is not supported" 513 | return "/" 514 | 515 | def format_payload(self, prompt, max_tokens, images): 516 | data = super().format_payload(prompt, max_tokens, images) 517 | data["ignore_eos"] = True 518 | data["stream_tokens"] = data.pop("stream") 519 | return data 520 | 521 | def parse_output_json(self, data): 522 | if not self.parsed_options.stream: 523 | data = data["output"] 524 | return super().parse_output_json(data) 525 | 526 | 527 | class TgiProvider(BaseProvider): 528 | DEFAULT_MODEL_NAME = "" 529 | 530 | def get_url(self): 531 | assert self.parsed_options.n == 1, "n > 1 is not supported" 532 | assert not self.parsed_options.chat, "Chat is not supported" 533 | stream_suffix = "_stream" if self.parsed_options.stream else "" 534 | return f"/generate{stream_suffix}" 535 | 536 | def format_payload(self, prompt, max_tokens, images): 537 | assert isinstance(prompt, str), "prompt must be a string" 538 | assert images is None, "images are not supported" 539 | data = { 540 | "inputs": prompt, 541 | "parameters": { 542 | "max_new_tokens": max_tokens, 543 | "temperature": self.parsed_options.temperature, 544 | "top_n_tokens": self.parsed_options.logprobs, 545 | "details": self.parsed_options.logprobs is not None, 546 | }, 547 | } 548 | return data 549 | 550 | def parse_output_json(self, data): 551 | if "token" in data: 552 | # streaming chunk 553 | return ChunkMetadata( 554 | text=data["token"]["text"], 555 | logprob_tokens=1, 556 | usage_tokens=None, 557 | prompt_usage_tokens=None, 558 | ) 559 | else: 560 | # non-streaming response 561 | return ChunkMetadata( 562 | text=data["generated_text"], 563 | logprob_tokens=( 564 | len(data["details"]["tokens"]) if "details" in data else None 565 | ), 566 | usage_tokens=( 567 | data["details"]["generated_tokens"] if "details" in data else None 568 | ), 569 | prompt_usage_tokens=None, 570 | ) 571 | 572 | 573 | PROVIDER_CLASS_MAP = { 574 | "fireworks": FireworksProvider, 575 | "vllm": VllmProvider, 576 | "sglang": VllmProvider, 577 | "openai": OpenAIProvider, 578 | "together": TogetherProvider, 579 | "tgi": TgiProvider, 580 | } 581 | 582 | 583 | def 
_load_curl_like_data(text): 584 | """ 585 | Either use the passed string or load from a file if the string is `@filename` 586 | """ 587 | if text.startswith("@"): 588 | try: 589 | if text.endswith(".jsonl"): 590 | with open(text[1:], "r") as f: 591 | return [json.loads(line) for line in f] 592 | else: 593 | with open(text[1:], "r") as f: 594 | return f.read() 595 | except Exception as e: 596 | raise ValueError(f"Failed to read file {text[1:]}") from e 597 | else: 598 | return text 599 | 600 | 601 | class LLMUser(HttpUser): 602 | # no wait time, so every user creates a continuous load, sending requests as quickly as possible 603 | 604 | def on_start(self): 605 | try: 606 | self._on_start() 607 | except Exception as e: 608 | print(f"Failed to initialize: {repr(e)}") 609 | print(traceback.format_exc()) 610 | sys.exit(1) 611 | 612 | def _guess_provider(self): 613 | self.model = self.environment.parsed_options.model 614 | self.provider = self.environment.parsed_options.provider 615 | # guess based on URL 616 | if self.provider is None: 617 | if "fireworks.ai" in self.host: 618 | self.provider = "fireworks" 619 | elif "together" in self.host: 620 | self.provider = "together" 621 | elif "openai" in self.host: 622 | self.provider = "openai" 623 | 624 | if ( 625 | self.model is None 626 | and self.provider is not None 627 | and PROVIDER_CLASS_MAP[self.provider].DEFAULT_MODEL_NAME is not None 628 | ): 629 | self.model = PROVIDER_CLASS_MAP[self.provider].DEFAULT_MODEL_NAME 630 | 631 | if self.model and self.provider: 632 | return 633 | 634 | # vllm doesn't support /model/ endpoint, so iterate over all models 635 | try: 636 | resp = self.client.get("/v1/models") 637 | resp.raise_for_status() 638 | resp = resp.json() 639 | except Exception as e: 640 | raise ValueError( 641 | "Argument --model or --provider was not specified and /v1/models failed" 642 | ) from e 643 | 644 | models = resp["data"] 645 | assert len(models) > 0, "No models found in /v1/models" 646 | owned_by = None 647 | # pick the first model 648 | for m in models: 649 | if self.model is None or m["id"] == self.model: 650 | self.model = m["id"] 651 | owned_by = m["owned_by"] 652 | break 653 | if self.provider is None: 654 | if not owned_by: 655 | raise ValueError( 656 | f"Model {self.model} not found in /v1/models. 
Specify --provider explicitly" 657 |                 ) 658 |             if owned_by in PROVIDER_CLASS_MAP: 659 |                 self.provider = owned_by 660 |             else: 661 |                 raise ValueError( 662 |                     f"Can't detect provider, specify it explicitly with --provider, owned_by={owned_by}" 663 |                 ) 664 | 665 |     def _on_start(self): 666 |         self.client.headers["Content-Type"] = "application/json" 667 |         if self.environment.parsed_options.api_key: 668 |             self.client.headers["Authorization"] = ( 669 |                 "Bearer " + self.environment.parsed_options.api_key 670 |             ) 671 |         if self.environment.parsed_options.header: 672 |             for header in self.environment.parsed_options.header: 673 |                 key, val = header.split(":", 1) 674 |                 self.client.headers[key] = val 675 |         self._guess_provider() 676 |         print(f" Provider {self.provider} using model {self.model} ".center(80, "*")) 677 |         self.provider_formatter = PROVIDER_CLASS_MAP[self.provider]( 678 |             self.model, self.environment.parsed_options 679 |         ) 680 | 681 |         self.stream = self.environment.parsed_options.stream 682 | 683 |         image_resolutions = ( 684 |             self.environment.parsed_options.prompt_images_with_resolutions 685 |         ) 686 |         self.prompt_images = None 687 |         if image_resolutions: 688 |             if not self.environment.parsed_options.chat: 689 |                 # Using the regular /completions endpoint, each model has its own image placeholder 690 |                 # e.g., <|image|> for Phi, <|image_pad|> for Qwen, <image> for Llava 691 |                 # So using /completions endpoint requires a bit more work to support this 692 |                 raise AssertionError( 693 |                     "--prompt-images-with-resolutions is only supported with --chat mode." 694 |                 ) 695 |             self.prompt_images = [ 696 |                 self._create_base64_image(width, height) 697 |                 for width, height in image_resolutions 698 |             ] 699 | 700 |         self.max_tokens_sampler = LengthSampler( 701 |             distribution=self.environment.parsed_options.max_tokens_distribution, 702 |             mean=self.environment.parsed_options.max_tokens, 703 |             cap=self.environment.parsed_options.max_tokens_cap, 704 |             alpha=self.environment.parsed_options.max_tokens_range, 705 |         ) 706 |         self.temperature = self.environment.parsed_options.temperature 707 | 708 |         logging_params = { 709 |             # TODO: add some server info with git version 710 |             "provider": self.provider, 711 |             "model": self.model, 712 |             "prompt_tokens": self.environment.parsed_options.prompt_tokens,  # might be overwritten based on metric 713 |             "generation_tokens": str(self.max_tokens_sampler), 714 |             "stream": self.stream, 715 |             "temperature": self.temperature, 716 |             "logprobs": self.environment.parsed_options.logprobs, 717 |         } 718 | 719 |         if self.environment.parsed_options.top_k is not None: 720 |             logging_params["top_k"] = self.environment.parsed_options.top_k 721 | 722 |         InitTracker.notify_init(self.environment, logging_params) 723 | 724 |         if self.environment.parsed_options.qps is not None: 725 |             if self.environment.parsed_options.burst: 726 |                 raise ValueError("Burst and QPS modes are mutually exclusive") 727 |             pacer = FixedQPSPacer.instance( 728 |                 self.environment.parsed_options.qps, 729 |                 self.environment.parsed_options.qps_distribution, 730 |             ) 731 |             # it will be called by Locust after each task 732 |             self.wait_time = pacer.wait_time_till_next 733 |             self.wait() 734 |         elif self.environment.parsed_options.burst: 735 |             self.wait_time = partial( 736 |                 constant_pacing(self.environment.parsed_options.burst), self 737 |             ) 738 |         else: 739 |             # introduce initial delay to avoid all users hitting the service at the same time 740 |             time.sleep(random.random()) 741 | 742 |         self.first_done = False 743 | 744 |         dataset = 
DatasetHolder.get_instance(self.environment.parsed_options) 745 |         self.dataset = iter(dataset) 746 | 747 |     def _create_base64_image(self, width, height): 748 |         """Create a blank RGB image with the given dimensions and return it as a base64 data URI.""" 749 |         img = Image.new("RGB", (width, height)) 750 |         buffer = io.BytesIO() 751 |         img.save(buffer, format="JPEG") 752 |         img_str = base64.b64encode(buffer.getvalue()).decode("utf-8") 753 |         return f"data:image/jpeg;base64,{img_str}" 754 | 755 |     def _get_input(self): 756 |         prompt, prompt_tokens = next(self.dataset) 757 | 758 |         if self.prompt_images: 759 |             images = self.prompt_images 760 |             prompt_images_positioning = ( 761 |                 self.environment.parsed_options.prompt_images_positioning 762 |             ) 763 |             prompt = self.insert_image_placeholders( 764 |                 prompt, len(images), prompt_images_positioning 765 |             ) 766 |         else: 767 |             images = None 768 | 769 |         return prompt, prompt_tokens, images 770 | 771 |     def insert_image_placeholders(self, prompt, num_images, prompt_images_positioning): 772 |         if num_images <= 0: 773 |             return prompt 774 | 775 |         prompt_length = len(prompt) 776 |         if prompt_length == 0: 777 |             return PROMPT_CHAT_IMAGE_PLACEHOLDER * num_images 778 | 779 |         if prompt_images_positioning == "space-evenly": 780 |             """ 781 |             Insert placeholders evenly throughout the prompt. 782 |             E.g., for 3 images, a prompt "abcdefgh" is changed to "ab<image>cd<image>ef<image>gh" 783 | 784 |             Images are spaced out evenly based on character length. 785 |             This may result in a few extra tokens if the image tags are placed in the middle of tokens. 786 |             But shouldn't affect results meaningfully. 787 |             """ 788 |             # we need num_images + 1 segments to place between tags 789 |             segment_length = prompt_length / (num_images + 1) 790 |             result = "" 791 |             for i in range(num_images): 792 |                 # Move a sliding window of segment_length across the prompt 793 |                 # Truncating to ensure all segments are non-overlapping 794 |                 # If segment_end is truncated, that character will be included in the next segment 795 |                 segment_start = int(i * segment_length) 796 |                 segment_end = int((i + 1) * segment_length) 797 |                 result += ( 798 |                     prompt[segment_start:segment_end] + PROMPT_CHAT_IMAGE_PLACEHOLDER 799 |                 ) 800 | 801 |             # Final segment 802 |             result += prompt[int(num_images * segment_length) :] 803 | 804 |             return result 805 |         elif prompt_images_positioning == "end": 806 |             return prompt + PROMPT_CHAT_IMAGE_PLACEHOLDER * num_images 807 |         else: 808 |             raise ValueError( 809 |                 f"Invalid prompt images positioning: {prompt_images_positioning}" 810 |             ) 811 | 812 |     @task 813 |     def generate_text(self): 814 |         max_tokens = self.max_tokens_sampler.sample() 815 |         prompt, prompt_usage_tokens, images = self._get_input() 816 |         data = self.provider_formatter.format_payload(prompt, max_tokens, images) 817 |         t_start = time.perf_counter() 818 | 819 |         with self.client.post( 820 |             self.provider_formatter.get_url(), 821 |             data=json.dumps(data), 822 |             stream=True, 823 |             catch_response=True, 824 |         ) as response: 825 |             combined_text = "" 826 |             done_empty_chunk = False 827 |             done = False 828 |             total_usage_tokens = None 829 |             total_logprob_tokens = None 830 |             try: 831 |                 response.raise_for_status() 832 |             except Exception as e: 833 |                 raise RuntimeError(f"Error in response: {response.text}") from e 834 |             t_first_token = None 835 |             for chunk in response.iter_lines(delimiter=b"\n\n"): 836 |                 if len(chunk) == 0: 837 |                     continue  # some providers send empty lines between data chunks 838 |                 if done: 839 |                     if chunk != b"data: [DONE]": 840 |                         print(f"WARNING: Received more chunks 
after [DONE]: {chunk}") 841 |                 try: 842 |                     now = time.perf_counter() 843 |                     if self.provider_formatter.parsed_options.embeddings: 844 |                         t_first_token = now 845 |                         if self.environment.parsed_options.show_response: 846 |                             out = self.provider_formatter.parse_output_json(orjson.loads(chunk)) 847 |                             combined_text = out.text 848 |                         break 849 |                     if self.stream: 850 |                         assert chunk.startswith( 851 |                             b"data:" 852 |                         ), f"Unexpected chunk not starting with 'data': {chunk}" 853 |                         chunk = chunk[len(b"data:") :] 854 |                         if chunk.strip() == b"[DONE]": 855 |                             done = True 856 |                             continue 857 |                     if done_empty_chunk: 858 |                         print(f"WARNING: Received more chunks after the trailing last chunk: {chunk}") 859 |                     data = orjson.loads(chunk) 860 |                     if not data.get("choices"): 861 |                         done_empty_chunk = True 862 |                         continue 863 |                     out = self.provider_formatter.parse_output_json(data) 864 |                     if out.usage_tokens: 865 |                         total_usage_tokens = out.usage_tokens 866 |                     if out.prompt_usage_tokens: 867 |                         prompt_usage_tokens = out.prompt_usage_tokens 868 |                     combined_text += out.text 869 | 870 |                     # some providers (SGLang) send an empty chunk first skewing the TTFT 871 |                     if combined_text and t_first_token is None: 872 |                         t_first_token = now 873 | 874 |                     if out.logprob_tokens: 875 |                         total_logprob_tokens = ( 876 |                             total_logprob_tokens or 0 877 |                         ) + out.logprob_tokens 878 |                 except Exception as e: 879 |                     print(f"Failed to parse response: {chunk} with error {repr(e)}") 880 |                     response.failure(e) 881 |                     return 882 |             assert t_first_token is not None, "empty response received" 883 |             if ( 884 |                 (total_logprob_tokens is not None) 885 |                 and (total_usage_tokens is not None) 886 |                 and total_logprob_tokens != total_usage_tokens 887 |             ): 888 |                 print( 889 |                     f"WARNING: usage_tokens {total_usage_tokens} != logprob_tokens {total_logprob_tokens}" 890 |                 ) 891 |             if total_logprob_tokens is not None: 892 |                 num_tokens = total_logprob_tokens 893 |             else: 894 |                 num_tokens = total_usage_tokens 895 | 896 |             num_tokens = num_tokens or 0 897 |             num_chars = len(combined_text) 898 |             now = time.perf_counter() 899 |             dur_total = now - t_start 900 |             dur_generation = now - t_first_token 901 |             dur_first_token = t_first_token - t_start 902 |             print( 903 |                 f"Response received: total {dur_total*1000:.2f} ms, first token {dur_first_token*1000:.2f} ms, {num_chars} chars, {num_tokens} tokens" 904 |             ) 905 |             if self.environment.parsed_options.show_response: 906 |                 print("---") 907 |                 print(combined_text) 908 |                 print("---") 909 |             if num_chars: 910 |                 add_custom_metric( 911 |                     "latency_per_char", dur_generation / num_chars * 1000, num_chars 912 |                 ) 913 |             if self.stream: 914 |                 add_custom_metric("time_to_first_token", dur_first_token * 1000) 915 |             add_custom_metric("total_latency", dur_total * 1000) 916 |             if num_tokens: 917 |                 if num_tokens != max_tokens: 918 |                     print( 919 |                         f"WARNING: wrong number of tokens: {num_tokens}, expected {max_tokens}" 920 |                     ) 921 |                 add_custom_metric("num_tokens", num_tokens) 922 |                 add_custom_metric( 923 |                     "latency_per_token", dur_generation / num_tokens * 1000, num_tokens 924 |                 ) 925 |                 add_custom_metric( 926 |                     "overall_latency_per_token", 927 |                     dur_total / num_tokens * 1000, 928 |                     num_tokens, 929 |                 ) 930 | 931 |             if not self.provider_formatter.parsed_options.embeddings: 932 |                 prompt_tokens = prompt_usage_tokens or getattr(self, "prompt_tokenizer_tokens", None)  # tolerate the attribute being unset (e.g. JSONL dataset without server-side usage info) 933 |                 if prompt_tokens: 934 |                     add_custom_metric("prompt_tokens", prompt_tokens) 935 | 936 |         if not self.first_done: 937 |             self.first_done = True 938 |             InitTracker.notify_first_request() 939 | 940 | 941 | def parse_resolution(res_str): 942 |     """Parse a 
resolution string like '3084x1080' into a tuple of integers (width, height).""" 943 |     try: 944 |         width, height = map(int, res_str.split("x")) 945 |         return (width, height) 946 |     except (ValueError, AttributeError): 947 |         raise argparse.ArgumentTypeError( 948 |             f"Invalid resolution format: {res_str}. Expected format: WIDTHxHEIGHT (e.g. 1024x1024)" 949 |         ) 950 | 951 | 952 | @events.init_command_line_parser.add_listener 953 | def init_parser(parser): 954 |     parser.add_argument( 955 |         "--provider", 956 |         choices=list(PROVIDER_CLASS_MAP.keys()), 957 |         type=str, 958 |         help="Which flavor of API to use. If not specified, we'll try to guess based on the URL and /v1/models output", 959 |     ) 960 |     parser.add_argument( 961 |         "-d", 962 |         "--dataset", 963 |         env_var="DATASET", 964 |         type=str, 965 |         help="Either 'limerics' or '@' followed by a path to a JSONL file (e.g. @data.jsonl)", 966 |         default="limerics", 967 |     ) 968 |     parser.add_argument( 969 |         "-m", 970 |         "--model", 971 |         env_var="MODEL", 972 |         type=str, 973 |         help="The model to use for generating text. If not specified we will pick the first model from the service as returned by /v1/models", 974 |     ) 975 |     parser.add_argument( 976 |         "--tokenizer", 977 |         env_var="TOKENIZER", 978 |         type=str, 979 |         help="Specify the HF tokenizer to use for validating the output of the model. Required for the 'limerics' dataset; otherwise optional, as we rely on the 'usage' or 'logprobs' fields to get token count information", 980 |     ) 981 |     parser.add_argument( 982 |         "--chat", 983 |         action=argparse.BooleanOptionalAction, 984 |         default=True, 985 |         help="Use /v1/chat/completions API", 986 |     ) 987 |     parser.add_argument( 988 |         "--embeddings", 989 |         action=argparse.BooleanOptionalAction, 990 |         default=False, 991 |         help="Use /v1/embeddings API", 992 |     ) 993 |     parser.add_argument( 994 |         "--return-logits", 995 |         type=int, 996 |         nargs="*", 997 |         default=None, 998 |         help="For embeddings: return per-token or per-class logits. Provide specific token/class indices, or an empty list for all. Only works with certain models.", 999 |     ) 1000 |     parser.add_argument( 1001 |         "--normalize", 1002 |         action=argparse.BooleanOptionalAction, 1003 |         default=False, 1004 |         help="For embeddings: apply L2 normalization to activations when return_logits is None, or softmax to selected logits when return_logits is provided.", 1005 |     ) 1006 |     parser.add_argument( 1007 |         "-p", 1008 |         "--prompt-tokens", 1009 |         env_var="PROMPT_TOKENS", 1010 |         type=int, 1011 |         default=512, 1012 |         help="Length of the prompt in tokens. Default 512", 1013 |     ) 1014 |     parser.add_argument( 1015 |         "--prompt-images-with-resolutions", 1016 |         type=parse_resolution, 1017 |         nargs="+", 1018 |         default=[], 1019 |         help="Images to add to the prompt for vision models, defined by their resolutions in format WIDTHxHEIGHT. " 1020 |         'For example, "--prompt-images-with-resolutions 3084x1080 1024x1024" will insert 2 images ' 1021 |         "(3084 width x 1080 height and 1024 width x 1024 height) into the prompt. " 1022 |         "Images will be spaced out evenly across the prompt. " 1023 |         "Only supported with --chat mode.", 1024 |     ) 1025 |     parser.add_argument( 1026 |         "--prompt-images-positioning", 1027 |         type=str, 1028 |         choices=["space-evenly", "end"], 1029 |         default="space-evenly", 1030 |         help="How to position the images in the prompt. " 1031 |         "space-evenly: images are spaced out evenly across the prompt. E.g., 3 images in 'abcdefgh' is 'ab<image>cd<image>ef<image>gh'. " 1032 |         "end: images are added to the end of the prompt. 
E.g., 3 images in 'abcdefgh' is 'abcdefgh<image><image><image>'. " 1033 |         "Only relevant with --prompt-images-with-resolutions.", 1034 |     ) 1035 |     parser.add_argument( 1036 |         "-o", 1037 |         "--max-tokens", 1038 |         env_var="MAX_TOKENS", 1039 |         type=int, 1040 |         default=64, 1041 |         help="Max number of tokens to generate. If --max-tokens-distribution is non-constant this is going to be the mean. Defaults to 64", 1042 |     ) 1043 |     parser.add_argument( 1044 |         "--max-tokens-cap", 1045 |         env_var="MAX_TOKENS_CAP", 1046 |         type=int, 1047 |         help="If --max-tokens-distribution is non-constant, this truncates the distribution at the specified limit", 1048 |     ) 1049 |     parser.add_argument( 1050 |         "--max-tokens-distribution", 1051 |         env_var="MAX_TOKENS_DISTRIBUTION", 1052 |         type=str, 1053 |         choices=["constant", "uniform", "exponential", "normal"], 1054 |         default="constant", 1055 |         help="How to sample `max-tokens` on each request", 1056 |     ) 1057 |     parser.add_argument( 1058 |         "--max-tokens-range", 1059 |         env_var="MAX_TOKENS_RANGE", 1060 |         type=float, 1061 |         default=0.3, 1062 |         help="Specifies the width of the distribution. The specified value `alpha` is relative to `max-tokens`. For the uniform distribution we sample from [max_tokens - max_tokens * alpha, max_tokens + max_tokens * alpha]. For the normal distribution we sample from `N(max_tokens, max_tokens * alpha)`. Defaults to 0.3", 1063 |     ) 1064 |     parser.add_argument( 1065 |         "--top-k", 1066 |         env_var="TOP_K", 1067 |         type=int, 1068 |         default=None, 1069 |         help="Specifies the top-k sampling parameter.", 1070 |     ) 1071 |     parser.add_argument( 1072 |         "--stream", 1073 |         dest="stream", 1074 |         action=argparse.BooleanOptionalAction, 1075 |         default=True, 1076 |         help="Use the streaming API", 1077 |     ) 1078 |     parser.add_argument( 1079 |         "-k", 1080 |         "--api-key", 1081 |         env_var="API_KEY", 1082 |         help="Auth for the API", 1083 |     ) 1084 |     parser.add_argument( 1085 |         "--temperature", 1086 |         env_var="TEMPERATURE", 1087 |         type=float, 1088 |         default=1.0, 1089 |         help="Temperature parameter for the API", 1090 |     ) 1091 |     parser.add_argument( 1092 |         "--logprobs", 1093 |         type=int, 1094 |         default=None, 1095 |         help="Whether to ask for logprobs; it makes things slower for some providers, but it's necessary for token counts in streaming (unless it's the Fireworks API, which returns usage in streaming mode)", 1096 |     ) 1097 |     parser.add_argument( 1098 |         "--summary-file", 1099 |         type=str, 1100 |         help="Append the line with the summary to the specified CSV file. Useful for generating a spreadsheet with perf sweep results. If the file doesn't exist, writes out the header first", 1101 |     ) 1102 |     parser.add_argument( 1103 |         "--qps", 1104 |         type=float, 1105 |         default=None, 1106 |         help="Enables 'fixed QPS' mode where requests are issued at the specified rate regardless of how long the processing takes. In this case --users and --spawn-rate need to be set to a sufficiently high value (e.g. 100)", 1107 |     ) 1108 |     parser.add_argument( 1109 |         "--qps-distribution", 1110 |         type=str, 1111 |         choices=["constant", "uniform", "exponential"], 1112 |         default="constant", 1113 |         help="Must be used with --qps. Specifies how to space out requests: equally ('constant') or by sampling wait times from a distribution ('uniform' or 'exponential'). Expected QPS is going to match --qps", 1114 |     ) 1115 |     parser.add_argument( 1116 |         "--burst", 1117 |         type=float, 1118 |         default=None, 1119 |         help="Makes requests arrive in bursts every specified number of seconds. Note that the burst period has to be longer than the maximum response time. 
Size of the burst is controlled by --users. The spawn rate -r is best set to a high value", 1120 | ) 1121 | parser.add_argument( 1122 | "--show-response", 1123 | action=argparse.BooleanOptionalAction, 1124 | default=False, 1125 | help="Print the result of each generation", 1126 | ) 1127 | parser.add_argument( 1128 | "-pcml", 1129 | "--prompt-cache-max-len", 1130 | env_var="PROMPT_CACHE_MAX_LEN", 1131 | type=int, 1132 | default=0, 1133 | help="Maximum length of the prompt cache to use. Defaults to 0 (no caching).", 1134 | ) 1135 | parser.add_argument( 1136 | "--header", 1137 | action="append", 1138 | default=[], 1139 | help="Arbitrary headers to add to the inference request. Can be used multiple times. For example, --header header1:value1 --header header2:value2", 1140 | ) 1141 | parser.add_argument( 1142 | "-n", 1143 | "--n", 1144 | default=1, 1145 | type=int, 1146 | help="How many sequences to generate (makes sense to use with non-zero temperature).", 1147 | ) 1148 | 1149 | 1150 | @events.quitting.add_listener 1151 | def _(environment, **kw): 1152 | total_latency = environment.stats.entries[("total_latency", "METRIC")] 1153 | if environment.stats.total.num_failures > 0 or total_latency.num_requests == 0: 1154 | print("Test failed due to failed requests") 1155 | environment.process_exit_code = 1 1156 | return 1157 | 1158 | entries = copy.copy(InitTracker.logging_params) 1159 | if environment.parsed_options.qps is not None: 1160 | entries["concurrency"] = ( 1161 | f"QPS {environment.parsed_options.qps} {environment.parsed_options.qps_distribution}" 1162 | ) 1163 | else: 1164 | entries["concurrency"] = InitTracker.users 1165 | for metric_name in [ 1166 | "time_to_first_token", 1167 | "latency_per_token", 1168 | "overall_latency_per_token", 1169 | "num_tokens", 1170 | "total_latency", 1171 | "prompt_tokens", # might overwrite the static value based on server side tokenization 1172 | ]: 1173 | entries[metric_name] = environment.stats.entries[ 1174 | (metric_name, "METRIC") 1175 | ].avg_response_time 1176 | if not environment.parsed_options.stream: 1177 | # if there's no streaming these metrics are meaningless 1178 | entries["time_to_first_token"] = "" 1179 | entries["latency_per_token"] = "" 1180 | entries["num_requests"] = total_latency.num_requests 1181 | entries["qps"] = total_latency.total_rps 1182 | percentile_to_report = [50, 90, 95, 99, 99.9] 1183 | percentile_metrics = ["time_to_first_token", "total_latency"] 1184 | for percentile_metric in percentile_metrics: 1185 | metrics = environment.stats.entries[percentile_metric, "METRIC"] 1186 | for percentile in percentile_to_report: 1187 | name = f"P{percentile}_{percentile_metric}" 1188 | entries[name] = metrics.get_response_time_percentile(percentile / 100) 1189 | 1190 | pretty_name = lambda s: " ".join([w.capitalize() for w in s.split("_")]) 1191 | entries = {pretty_name(k): v for k, v in entries.items()} 1192 | 1193 | # print in the final event handler to make sure our output is the last one 1194 | @events.quit.add_listener 1195 | def exit_printer(**kw): 1196 | max_width = max(len(k) for k in entries.keys()) 1197 | print(" Summary ".center(80, "=")) 1198 | for k, v in entries.items(): 1199 | print(f"{k:<{max_width}}: {v}") 1200 | print("=" * 80) 1201 | 1202 | if environment.parsed_options.summary_file: 1203 | with open(environment.parsed_options.summary_file, "a") as f: 1204 | writer = csv.DictWriter(f, fieldnames=entries.keys()) 1205 | if f.tell() == 0: 1206 | writer.writeheader() 1207 | writer.writerow(entries) 1208 | 
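# A minimal example invocation (a sketch; the host, model, and parameter values are
# illustrative, matching the notebook examples above):
#   locust -f load_test.py -H https://api.fireworks.ai/inference \
#       --provider fireworks -m accounts/fireworks/models/llama-v3p2-3b-instruct \
#       -k $FIREWORKS_API_KEY -t 60s -u 100 -r 100 --qps 5 \
#       --summary-file results.csv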
--------------------------------------------------------------------------------