├── llm_bench
│   ├── requirements.txt
│   ├── locust.conf
│   ├── locust-grafana.conf
│   ├── README.md
│   ├── limericks.txt
│   ├── benchmark_suite.ipynb
│   └── load_test.py
├── .gitignore
├── README.md
└── LICENSE

/llm_bench/requirements.txt:
--------------------------------------------------------------------------------
1 | locust
2 | orjson
3 | pillow
4 | gevent
5 | transformers
6 | locust-plugins
7 | 
--------------------------------------------------------------------------------
/llm_bench/locust.conf:
--------------------------------------------------------------------------------
1 | locustfile = load_test.py
2 | headless = yes
3 | host = http://localhost:80
4 | reset-stats = yes
5 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | # Benchmark results
4 | llm_bench/results/
5 | .env
6 | env/
--------------------------------------------------------------------------------
/llm_bench/locust-grafana.conf:
--------------------------------------------------------------------------------
1 | locustfile = load_test.py
2 | host = http://localhost:80
3 | headless = true
4 | timescale = true
5 | pghost = 127.0.0.1
6 | pgport = 5432
7 | pguser = postgres
8 | pgpassword = password
9 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Benchmark / Load-testing Suite by Fireworks.ai
2 | 
3 | ## LLM benchmarking
4 | 
5 | The load test is designed to simulate continuous production load and minimize the effect of model generation behavior:
6 | * variation in generation parameters
7 | * continuous request stream with varying distribution and load levels
8 | * forced generation of an exact number of output tokens (for most providers)
9 | * specified load test duration
10 | 
11 | Supported providers and API flavors:
12 | * OpenAI API compatible endpoints:
13 |   * [Fireworks.ai](https://app.fireworks.ai) public or private deployments
14 |   * vLLM
15 |   * Anyscale Endpoints
16 |   * OpenAI
17 |   * Text Generation Inference (TGI) / HuggingFace Endpoints
18 |   * Together.ai
19 | * NVIDIA Triton server:
20 |   * Legacy HTTP endpoints (no streaming)
21 |   * LLM-focused endpoints (with or without streaming)
22 | 
23 | Captured metrics:
24 | * Overall latency
25 | * Number of generated tokens
26 | * Sustained request throughput (QPS)
27 | * Time to first token (TTFT) for streaming
28 | * Per-token latency for streaming
29 | 
30 | The metrics summary can be exported to CSV, so runs over multiple configurations can be scripted. The CSV file can be imported into Google Sheets/Excel or Jupyter for further analysis.
31 | 
32 | See the [`llm_bench`](llm_bench) folder for detailed usage.
33 | 
34 | See [`llm_bench/benchmark_suite.ipynb`](llm_bench/benchmark_suite.ipynb) for a detailed example of how to use the load test script and run different types of benchmark suites.
35 | 
--------------------------------------------------------------------------------
/llm_bench/README.md:
--------------------------------------------------------------------------------
1 | # LLM Load Test
2 | 
3 | Please refer to [`benchmark_suite.ipynb`](benchmark_suite.ipynb) for a detailed example of how to use the `load_test.py` script and run different types of benchmark suites.
4 | 
5 | ## Installation
6 | 
7 | The load test relies on the [Locust](https://locust.io/) package. Install it with pip:
8 | 
9 | ```bash
10 | pip install -r requirements.txt
11 | ```
12 | 
13 | Then run the commands described below from the enclosing directory. Locust will pick up the settings from `locust.conf` automatically.
14 | 
15 | ## Usage
16 | 
17 | The load test script exercises an LLM generation endpoint under varying load. See below for the common configuration options. Check `--help` for the full list.
18 | 
19 | ### Target
20 | 
21 | - `-H`: target endpoint URL (the part preceding `/v1/...`), e.g. `-H http://localhost` or `-H https://api.fireworks.ai/inference`. Defaults to `localhost:80`.
22 | - (optional) `-m`: model to send requests to. Can be omitted for a local test if the server has only a single model loaded.
23 | - (optional) `--provider`: provider name like `fireworks` or `openai`. APIs have slight differences that the script accounts for. If omitted, the script tries to guess based on the URI and the API response. Must be specified for non-OpenAI-compatible providers like Triton.
24 | - `-k`: API key to be passed as `Authorization: Bearer ...`.
25 | 
26 | ### Rate of requests
27 | 
28 | The script can be used in several primary modes:
29 | 
30 | 1. **Fixed concurrency**. N workers are created. Each sends a request, waits for the response, and then sends the next request. Thus, as concurrency increases, the server gets more loaded and latency grows. Usually, increasing concurrency beyond some point doesn't increase throughput and just leads to growing latency.
31 |    - `-u`: the number of concurrent workers to spawn (standard Locust argument)
32 |    - `-r`: the rate per second of spawning concurrent workers. If processing the workload takes a while (more than several seconds), it makes sense to set this value to something lower than `-u` for a gradual ramp-up that avoids request bursts.
33 |    - (optional) `--burst`: synchronizes all N workers to issue requests in one go at the specified interval. The maximum latency should be less than the period, otherwise some workers may fall behind.
34 | 
35 | 2. **Fixed QPS**. The script ensures that requests are issued at specific times so that they average out at the specified rate per second. If the target QPS is too high and the server is overloaded, it will likely drop additional requests or stall.
36 |    - `--qps`: the desired rate of requests per second. Can be a fractional number, e.g. `0.1`.
37 |    - `-u`/`-r`: need to be set to sufficiently high values to allow generating the target QPS. The script will complain if they are too low. Passing something like `-u 100 -r 100` is a good choice.
38 |    - (optional) `--qps-distribution`: specifies how to space out requests. The default is `constant`, meaning evenly spaced requests; `exponential` simulates a [Poisson traffic model](https://en.wikipedia.org/wiki/Traffic_generation_model#Poisson_traffic_model).
39 | 
40 | ### Workload
41 | 
42 | Input is read from `--dataset`, which is either:
43 | - `limerics`: the default dataset. Requires `--tokenizer` to be passed; it is used to auto-generate realistic prompts.
44 | - a `@`-prefixed path to a JSONL file, whose lines provide the contents of each request.
45 | 
46 | The number of tokens to generate is sampled on every request from a given distribution:
47 | - `-o`/`--max-tokens`: maximum number of tokens to generate. If `--max-tokens-distribution` is non-constant, this is the mean of the distribution.
48 | - `--max-tokens-distribution`: specifies the probability distribution to use.
49 | - `--max-tokens-range`: specifies "the width" of the distribution (e.g. the stddev for the `normal` distribution). The specified value `alpha` is relative to `max-tokens`. The default is 0.3, so most of the range falls within the "3 sigma" region.
50 | - `--max-tokens-cap`: specifies an upper bound at which to "truncate" the probability distribution. The lower bound is always 1 token. This allows sampling from "truncated normal" or "truncated exponential" distributions.
51 | 
52 | Based on the above settings, the following distributions are supported (see the sketch below):
53 | - `constant`: use the `--max-tokens` value on every request
54 | - `uniform`: sample from the range `[max_tokens - max_tokens * alpha, max_tokens + max_tokens * alpha]`
55 | - `normal`: sample from the Gaussian distribution `N(max_tokens, max_tokens * alpha)`
56 | - `exponential`: sample from the exponential distribution with mean `max_tokens`; `alpha` is ignored
57 | 
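To make the sampling rules concrete, here is a minimal Python sketch of drawing one per-request value under these settings. It only illustrates the distributions described above; `sample_max_tokens` is a hypothetical name, not the actual implementation in `load_test.py`.

```python
import random

def sample_max_tokens(max_tokens, distribution="constant", alpha=0.3, cap=None):
    # Illustrative sketch of the per-request sampling described above (not load_test.py itself).
    if distribution == "constant":
        value = max_tokens
    elif distribution == "uniform":
        value = random.uniform(max_tokens * (1 - alpha), max_tokens * (1 + alpha))
    elif distribution == "normal":
        value = random.gauss(max_tokens, max_tokens * alpha)
    elif distribution == "exponential":
        value = random.expovariate(1.0 / max_tokens)  # mean = max_tokens; alpha ignored
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    if cap is not None:
        value = min(value, cap)  # --max-tokens-cap truncates the distribution from above
    return max(round(value), 1)  # the lower bound is always 1 token
```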
58 | The benchmark makes a best effort to ensure that the desired `max_tokens` number is respected:
59 | - for providers that support it, it passes the `ignore_eos` or `min_tokens` parameter to avoid early stopping
60 | - the default prompt is a lengthy code generation request that usually doesn't stop early
61 | - it verifies the number of tokens actually generated and prints warnings on mismatch. Different providers use varying mechanisms for returning the number of generated tokens; for some of them, `--logprobs` might be needed in streaming mode.
62 | - optionally, `--tokenizer` can be passed to specify a Hugging Face tokenizer used to count the output tokens on the client side.
63 | 
64 | Generation options:
65 | - `--chat`: call the chat API instead of raw completions
66 | - `--stream`: stream the result back. Enabling this gives "time to first token" and "time per token" metrics
67 | - (optional) `--logprobs`: corresponds to the `logprobs` API parameter. For some providers, it's needed for output token counting in streaming mode.
68 | 
69 | ### Writing results
70 | 
71 | Locust prints out a detailed summary including quantiles of various metrics. Additionally, the script prints a summary block at the very end of the output that includes the model being tested.
72 | 
73 | When comparing multiple configurations, it's useful to aggregate results together:
74 | 
75 | - `--summary-file`: appends a line with the summary to the specified CSV file. Useful for generating a spreadsheet with perf sweep results. If the file doesn't exist, the header is written out first.
76 | - `-t`: duration (e.g. `5min`) for which to run the test (standard Locust option). It's particularly useful when scripting multiple runs. By default, the test runs without a limit until Ctrl+C is pressed.
77 | 
78 | The typical workflow is to run the benchmark several times, appending to the same CSV file. The resulting file can be imported into a spreadsheet or pandas for further analysis; see the sketch below.
79 | 
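A minimal sketch of pulling a sweep file into pandas, assuming it was written via `--summary-file results.csv` (the actual column names come from the header the script writes, so inspect them first):

```python
import pandas as pd

# Each benchmark run appends one summary line; the header is written on first use.
df = pd.read_csv("results.csv")
print(df.columns.tolist())        # check the actual column names before slicing
print(df.to_string(index=False))  # one row per run, ready for pivoting/plotting
```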
80 | ### Custom prompts
81 | 
82 | Sometimes it's necessary to replay exact prompts, for example when embedding images.
83 | The `--dataset` option can be used in this case to specify a file with the .jsonl extension, prefixed with an at sign, e.g. `@prompt.jsonl`.
84 | JSONL files are read line by line. Each line has to contain a valid JSON object, which is used to form the resulting API request.
85 | Examples:
86 | 
87 | Chat dataset (--chat option):
88 | ```
89 | {"messages": [{"role": "user", "content": "Write a poem about a cat"}], "temperature": 0.9}
90 | {"messages": [{"role": "user", "content": "Write a poem about a dog"}], "temperature": 1}
91 | ```
92 | 
93 | Non-chat dataset (--no-chat option):
94 | ```
95 | {"prompt": "One two three four"}
96 | {"prompt": "Five six seven eight"}
97 | ```
98 | 
99 | 
100 | ## Examples
101 | 
102 | Download the tokenizer for the model being benchmarked from Hugging Face:
103 | ```bash
104 | huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir /models/Meta-Llama-3-8B-Instruct --include '*.json'
105 | export TOKENIZER=/models/Meta-Llama-3-8B-Instruct
106 | ```
107 | 
108 | Maintain a fixed concurrency of 8 requests against a local deployment:
109 | 
110 | ```bash
111 | locust -u 8 -r 2 -p 512 -o 128
112 | ```
113 | 
114 | Call the streaming chat API locally with a request issued every 2 seconds. Run for 1 minute and save results to `results.csv`:
115 | 
116 | ```bash
117 | locust -t 1min -u 100 -r 100 -p 512 -o 128 --stream --chat --qps 0.5 --summary-file results.csv
118 | ```
119 | 
120 | Benchmark a Fireworks public deployment with 1 concurrent request only:
121 | 
122 | ```bash
123 | locust -u 1 -H https://api.fireworks.ai/inference -p 128 -o 200 --api-key $FIREWORKS_API_KEY --model=accounts/fireworks/models/llama-v3p1-8b-instruct
124 | ```
125 | 
126 | Benchmark a Fireworks public deployment with 1 request and 2 images (1024w x 1024h and 3084w x 1080h):
127 | 
128 | ```bash
129 | locust -u 1 -H https://api.fireworks.ai/inference -p 128 -o 200 --api-key $FIREWORKS_API_KEY --model=accounts/fireworks/models/llama-v3p1-8b-instruct --chat --prompt-images-with-resolutions 1024x1024 3084x1080
130 | ```
131 | 
132 | Benchmark an OpenAI deployment, reading prompts from a file, with 1 concurrent request:
133 | 
134 | ```bash
135 | locust --dataset '@input.jsonl' -u 1 -H https://api.openai.com -o 200 --api-key $OPENAI_API_KEY --model=gpt-3.5-turbo --chat
136 | ```
137 | 
138 | ## UI mode
139 | 
140 | Instead of relying on textual output, it's also possible to plot the results in Grafana.
141 | 
142 | ```bash
143 | pip install locust locust-plugins
144 | locust-compose up
145 | ```
146 | 
147 | This starts local Postgres and Grafana instances. Grafana is available at http://127.0.0.1:3000 (sometimes the URL doesn't get printed in the logs).
148 | 
149 | Then run the test as specified above with an additional argument:
150 | 
151 | ```bash
152 | locust --config locust-grafana.conf ...
153 | ```
154 | 
155 | This starts the load test locally and pushes results into Grafana in real time. Besides the actual requests, we push additional metrics (e.g. time per token) as separate fake requests to get stats aggregation. Make sure to exclude them from aggregation when viewing the graphs.
156 | 
157 | Other settings for Locust are in `./locust.conf`. You may start Locust in non-headless mode, but its UI is very basic and lacks advanced stats aggregation capabilities.
158 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 | 
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 | 
7 | 1. Definitions.
8 | 
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /llm_bench/limericks.txt: -------------------------------------------------------------------------------- 1 | An elderly man called Keith, 2 | Mislaid his set of false teeth. 3 | They'd been laid on a chair, 4 | He'd forgot they were there, 5 | Sat down, and was bitten beneath. 6 | 7 | There once was a man from the sticks 8 | Who loved to compose limericks 9 | But he failed at his sport 10 | They were always too short... 11 | 12 | There once was a runner named Dwight 13 | Who could speed even faster than light. 14 | He set out one day 15 | In a relative way 16 | And returned on the previous night. 17 | 18 | There was a young lady named Alice 19 | Who was known to have peed in a chalice. 20 | ‘Twas the common belief 21 | It was done for relief, 22 | And not out of protestant malice. 23 | 24 | A canner, exceedingly canny, 25 | One morning remarked to his granny, 26 | "A canner can can 27 | Anything that he can; 28 | But a canner can't can a can, can he?" 29 | 30 | At times I’m so mad that I’m hopping. 31 | My angriness sets my veins popping. 32 | I yell and I curse, 33 | With swear words diverse, 34 | But my wife does much worse: she goes shopping. 35 | 36 | A tutor who tooted a flute 37 | Tried to teach two young tooters to toot. 38 | Said the two to the tutor, 39 | “Is it harder to toot, or… 40 | To tutor two tooters to toot?” 41 | 42 | A crafty young bard named McMahon 43 | Whose poetry never would scan 44 | Once said, with a pause, 45 | “It’s probably because 46 | I’m always trying to cram as many additional syllables into the last line as I possibly can.” 47 | 48 | An elephant slept in his bunk, 49 | And in slumber his chest rose and sunk. 50 | But he snored - how he snored! 51 | All the other beasts roared, 52 | So his wife tied a knot in his trunk. 53 | 54 | There once was a man from Nantucket 55 | Who kept all his cash in a bucket 56 | His daughter, named Nan 57 | Ran away with a man 58 | And as for the bucket, Nantucket. 59 | 60 | I need a front door for my hall, 61 | The replacement I bought was too tall. 62 | So I hacked it and chopped it, 63 | And carefully lopped it, 64 | And now the dumb thing is too small. 65 | 66 | "There's a train at 4:04," said Miss Jenny. 67 | "Four tickets I'll take; have you any?" 68 | Said the man at the door, 69 | "Not four for 4:04, 70 | For four for 4:04 is too many." 71 | 72 | How to spell the potato has tried 73 | Many minds, sometimes mine, I’ll confide. 74 | Though it may have an eye, 75 | There’s no E – don’t ask why! 76 | Not until it’s been baked, boiled, or fried. 77 | 78 | I'd rather have Fingers than Toes, 79 | I'd rather have Ears than a Nose. 80 | And as for my Hair, 81 | I'm glad it's all there, 82 | I'll be awfully sad, when it goes. 83 | 84 | There was a faith-healer of Deal, 85 | Who said: "Although pain isn't real, 86 | If I sit on a pin 87 | And it punctures my skin, 88 | I dislike what I fancy I feel.' 89 | 90 | My dog is really quite hip, 91 | Except when he takes a cold dip. 
92 | He looks like a fool, 93 | When he jumps in the pool, 94 | And reminds me of a sinking ship. 95 | 96 | I'm papering walls in the loo 97 | And quite frankly I haven't a clue; 98 | For the pattern's all wrong 99 | (Or the paper's too long) 100 | And I'm stuck to the toilet with glue. 101 | 102 | There once was an old man of Esser, 103 | Whose knowledge grew lesser and lesser, 104 | It at last grew so small 105 | He knew nothing at all 106 | And now he's a college professor. 107 | 108 | As 007 walked by 109 | He heard a wee spider say, "Hi." 110 | But shaken, he shot 111 | It right there on the spot 112 | As it tried to explain, "I'm a spi..." 113 | 114 | There was a young fellow of Crete 115 | Who was so exceedingly neat. 116 | When he got out of bed 117 | He stood on his head 118 | To make sure of not soiling his feet. 119 | 120 | An amoeba named Max and his brother 121 | Were sharing a drink with each other; 122 | In the midst of their quaffing, 123 | They split themselves laughing, 124 | And each of them now is a mother. 125 | 126 | A flea and a fly in a flue 127 | Were imprisoned, so what could they do? 128 | Said the fly, “Let us flee!” 129 | “Let us fly!” said the flea 130 | So they flew through a flaw in the flue. 131 | 132 | What happens when you retire? 133 | You really don't have to inquire - 134 | No job and no phone 135 | There's no place but home, 136 | And your checkbook's about to expire! 137 | 138 | The star violinist was bowing; 139 | The quarrelsome oarsmen were rowing. 140 | But how is the sage 141 | To discern from this page: 142 | Was it piglets, or seeds, that were sowing? 143 | 144 | An oyster from Kalamazoo 145 | Confessed he was feeling quite blue. 146 | For he said, “As a rule, 147 | When the weather turns cool, 148 | I invariably get in a stew.” 149 | 150 | A painter, who lived in Great Britain, 151 | Interrupted two girls with their knitting, 152 | He said, with a sigh, 153 | "That park bench, well I, 154 | Just painted it, right where you're sitting." 155 | 156 | I once had a gerbil named Bobby, 157 | Who had an unusual hobby. 158 | He chewed on a cord, 159 | and now - oh my lord, 160 | now all that's left is a blobby. 161 | 162 | My ambition, said old Mr. King, 163 | Is to live as a bird on the wing. 164 | Then he climbed up a steeple, 165 | Which scared all the people, 166 | So they caged him and taught him to sing. 167 | 168 | Is it me or the nature of money, 169 | That's odd and particularly funny. 170 | But when I have dough, 171 | It goes quickly, you know, 172 | And seeps out of my pockets like honey. 173 | 174 | A mouse in her room woke Miss Dowd 175 | She was frightened — it must be allowed. 176 | Soon a happy thought hit her — 177 | To scare off the critter, 178 | She sat up in bed and meowed. 179 | 180 | A nifty young flapper named Jane 181 | While walking was caught in the rain. 182 | She ran - almost flew, 183 | Her complexion did too, 184 | And she reached home exceedingly plain. 185 | 186 | There was an old person of Fratton 187 | Who would go to church with his hat on. 188 | 'If I wake up,' he said, 189 | 'With a hat on my head, 190 | I will know that it hasn't been sat on.' 191 | 192 | I told him, "Get out of my place 193 | You're an utter uncultured disgrace; 194 | You're a simpleton loon. 195 | Don't you know a good tune?" 196 | Then he walloped me square in the face. 197 | 198 | No woodsman would cut a wood, would he 199 | If woods would be woodless – nor should he. 
200 | Yet no woodcutter would 201 | Cut a woody-wood wood 202 | If no woodsmen cut woody woods, would he? 203 | 204 | A forgetful old gasman named Dieter, 205 | Who went poking around his gas heater, 206 | Touched a leak with his light; 207 | He blew out of sight — 208 | And, as everyone who knows anything about poetry can tell you, he also ruined the meter. 209 | 210 | One Saturday morning at three 211 | A cheesemonger’s shop in Paree 212 | Collapsed to the ground 213 | With a thunderous sound 214 | Leaving only a pile of de brie. 215 | 216 | There was an old man of Peru, 217 | Who dreamt he was eating his shoe. 218 | He woke in the night, 219 | With a terrible fright, 220 | And found it was perfectly true. 221 | 222 | A crossword compiler named Moss 223 | Who found himself quite at a loss 224 | When asked, 'Why so blue?' 225 | Said, 'I haven't a clue 226 | I'm 2 Down to put 1 Across.' 227 | 228 | To compose a sonata today, 229 | Don't proceed in the old-fashioned way: 230 | With your toes on the keys, 231 | Bang the floor with your knees: 232 | "Oh how modern!" the critics will say. 233 | 234 | There was a young lady named Perkins, 235 | Who just simply doted on gherkins. 236 | In spite of advice, 237 | She ate so much spice, 238 | That she pickled her internal workins'. 239 | 240 | A certain young fellow named Bee-Bee 241 | Wished to wed a woman named Phoebe. 242 | "But," he said, "I must see 243 | What the clerical fee 244 | Be before Phoebe be Phoebe Bee-Bee." 245 | 246 | If you catch a chinchilla in Chile 247 | And cut off its beard, willy-nilly 248 | You can honestly say 249 | That you have just made 250 | A Chilean chinchilla's chin chilly. 251 | 252 | There once was a man from the city 253 | Stooped to pat what he thought was a kitty 254 | He gave it a pat 255 | But it wasn't a cat - 256 | They buried his clothes - what a pity! 257 | 258 | There was an old man of the Cape 259 | Who made himself garments of crepe. 260 | When asked, “Do they tear?” 261 | He replied, “Here and there, 262 | But they’re perfectly splendid for shape!” 263 | 264 | There was a young lady named Cager 265 | Who, as the result of a wager, 266 | Consented to fart 267 | The complete oboe part 268 | Of Mozart’s quartet in F major. 269 | 270 | The limerick packs laughs anatomical 271 | Into space that is quite economical. 272 | But the good ones I’ve seen 273 | So seldom are clean 274 | And the clean ones so seldom are comical. 275 | 276 | The incredible Wizard of Oz 277 | Retired from his business because 278 | Due to up-to-date science 279 | To most of his clients 280 | He wasn’t the Wizard he was. 281 | 282 | A wonderful bird is the pelican 283 | His bill holds more than his belican, 284 | He can take in his beak 285 | Enough food for a week 286 | But I’m damned if I see how the helican. 287 | 288 | There once was a lady named Ferris 289 | Whom nothing could ever embarrass. 290 | ‘Til the bath salts one day, 291 | in the tub where she lay, 292 | turned out to be Plaster of Paris. 293 | 294 | There once was a girl named Irene 295 | Who lived on distilled kerosene 296 | But she started absorbing 297 | A new hydrocarbon 298 | And since then has never benzene. 299 | 300 | Is algebra fruitless endeavor? 301 | It seems they’ve been trying forever 302 | To find x, y, and z 303 | And it’s quite clear to me: 304 | If they’ve not found them yet then they’ll never. 
305 | 306 | A rather disgruntled young Viking 307 | Found plunder was not to his liking 308 | When they yelled “All ashore,” 309 | He just threw down his oar 310 | And announced, “I’m not striking, I’m striking!” 311 | 312 | There once was a girl in the choir 313 | Whose voice rose up hoir and hoir, 314 | Till it reached such a height 315 | It went clear out of seight, 316 | And they found it next day in the spoir. 317 | 318 | There once was a fly on the wall, 319 | I wonder, why didn’t it fall? 320 | Because its feet stuck? 321 | Or was it just luck? 322 | Or does gravity miss things so small? 323 | 324 | There once was a farmer from Leeds, 325 | Who swallowed a packet of seeds. 326 | It soon came to pass, 327 | He was covered with grass, 328 | But has all the tomatoes he needs. 329 | 330 | There was once a great man in Japan 331 | Whose name on Tuesday began, 332 | It lasted through Sunday 333 | Till twilight on Monday 334 | And it sounded like stones in a can. 335 | 336 | Remember when nearly sixteen 337 | On your very first date as a teen 338 | At the movies? If yes, 339 | Then I bet you can't guess 340 | What was shown on the cinema screen. 341 | 342 | There was a young man from Dealing 343 | Who caught the bus for Ealing. 344 | It said on the door 345 | 'Don't spit on the floor' 346 | So he jumped up and spat on the ceiling. 347 | 348 | There once was a man named Muvett 349 | Who lived in the city of Lovett 350 | But his car broke down 351 | Two miles out of town 352 | And Muvett had to shove it to Lovett! 353 | 354 | There once was a beautiful nurse 355 | Who carried an ugly old purse 356 | But she tripped on the door 357 | And fell on the floor 358 | And they both went away in the hearse. 359 | 360 | A bather whose clothing was strewed 361 | By breezes that left her quite nude, 362 | Saw a man come along 363 | And, unless I am wrong, 364 | You expect this last line to be lewd! 365 | 366 | An ambitious young fellow named Matt, 367 | Tried to parachute using his hat. 368 | Folks below looked so small, 369 | As he started to fall, 370 | Then got bigger and bigger and SPLAT! 371 | 372 | There was a young lady whose chin 373 | Resembled the point of a pin 374 | So she had it made sharp 375 | And purchased a harp 376 | And played several tunes with her chin. 377 | 378 | There was once a young girl who said: “Why 379 | Can’t I look in my ear with my eye? 380 | If I put my mind to it 381 | I’m sure I can do it. 382 | You never can tell till you try.” 383 | 384 | Limericks I cannot compose, 385 | With noxious smells in my nose. 386 | But this one was easy, 387 | I only felt queasy, 388 | Because I was sniffing my toes. 389 | 390 | There was an odd fellow named Gus, 391 | When traveling he made such a fuss. 392 | He was banned from the train, 393 | Not allowed on a plane, 394 | And now travels only by bus. 395 | 396 | There once was a man from Tibet, 397 | Who couldn't find a cigarette 398 | So he smoked all his socks, 399 | and got chicken-pox, 400 | and had to go to the vet. 401 | 402 | A newspaperman named Fling, 403 | Could make "copy" from any old thing. 404 | But the copy he wrote, 405 | Of a five-dollar note, 406 | Was so good he now wears so much bling. 407 | 408 | There is a young schoolboy named Mason, 409 | Whose mom cuts his hair with a basin. 410 | When he stands in one place, 411 | With a scarf round his face, 412 | It's a mystery which way he’s facing. 413 | 414 | There was a young schoolboy of Rye, 415 | Who was baked by mistake in a pie. 
416 | To his mother’s disgust, 417 | He emerged through the crust, 418 | And exclaimed, with a yawn, where am I? 419 | 420 | A fellow jumped off a high wall, 421 | And had a most terrible fall. 422 | He went back to bed, 423 | With a bump on his head, 424 | That's why you don't jump off a wall. 425 | 426 | There was a young lady of Cork, 427 | Whose Pa made a fortune in pork. 428 | He bought for his daughter, 429 | A tutor who taught her, 430 | To balance green peas on her fork. 431 | 432 | There once was a Martian called Zed 433 | With antennae all over his head. 434 | He sent out a lot 435 | Di-di-dash-di-dot 436 | But nobody knew what he said. 437 | 438 | There once was a girl named Sam 439 | Who did not eat roast beef and ham 440 | She ate a green apple 441 | Then drank some Snapple 442 | Some say she eats like a lamb. 443 | 444 | A major, with wonderful force, 445 | Called out in Hyde Park for a horse. 446 | All the flowers looked round, 447 | But no horse could be found; 448 | So he just rhododendron, of course. 449 | 450 | A canny young fisher named Fisher 451 | Once fished from the edge of a fissure. 452 | A fish with a grin 453 | Pulled the fisherman in — 454 | Now they're fishing the fissure for Fisher. 455 | 456 | A cheerful old bear at the Zoo 457 | Could always find something to do. 458 | When it bored him, you know, 459 | To walk to and fro, 460 | He reversed it and walked fro and to. 461 | 462 | The bottle of perfume that Willie sent 463 | Was highly displeasing to Millicent; 464 | Her thanks were so cold 465 | They quarreled, I'm told, 466 | Through that silly scent Willie sent Millicent. 467 | 468 | I bought a new Hoover today, 469 | Plugged it in in the usual way, 470 | Switched it on - what a din; 471 | It sucked everything in, 472 | Now I'm homeless with no place to stay. 473 | 474 | There was a young lady named Hannah, 475 | Who slipped on a peel of banana. 476 | As she lay on her side, 477 | More stars she espied 478 | Than there are in the Star-Spangled Banner. 479 | 480 | My neighbor came over to say 481 | (Although not in a neighborly way) 482 | That he'd knock me around 483 | If I didn't curb the sound 484 | Of the classical music I play. 485 | 486 | There once was a man from Gorem 487 | Had a pair of tight pants and he wore 'em 488 | When he bowed with a grin 489 | A draft of air rushed in 490 | And he knew by the sound that he tore 'em! 491 | 492 | There was an Old Man in a tree, 493 | Who was horribly bored by a bee. 494 | When they said “Does it buzz?” 495 | He replied “Yes, it does! 496 | It’s a regular brute of a bee!” 497 | 498 | There was a young belle of old Natchez 499 | Whose garments were always in patchez. 500 | When comments arose 501 | On the state of her clothes, 502 | She replied, “When Ah itchez, Ah scratchez.” 503 | 504 | There was a young fellow from Belfast 505 | That I wanted so badly to tell fast 506 | Not to climb up the stair 507 | As the top step was air 508 | And that’s why the young fellow fell fast. 509 | 510 | There was an old girl of Genoa 511 | And I blush when I think that Iowa; 512 | She’s gone to her rest, 513 | It’s all for the best, 514 | Otherwise I would borrow Samoa. 515 | 516 | There was a dear lady of Eden, 517 | Who on apples was quite fond of feedin’; 518 | She gave one to Adam, 519 | Who said, “Thank you, Madam,” 520 | And then both skedaddled from Eden. 521 | 522 | I know an old owl named Boo, 523 | Every night he yelled Hoo, 524 | Once a kid walked by, 525 | And started to cry, 526 | And yelled I don't have a clue! 
527 | 528 | I once fell in love with a blonde, 529 | But found that she wasn't so fond. 530 | Of my pet turtle named Odle, 531 | whom I'd taught how to Yodel, 532 | So she dumped him outside in the pond. 533 | 534 | A man and his lady-love, Min, 535 | Skated out where the ice was quite thin. 536 | Had a quarrel, no doubt, 537 | For I hear they fell out, 538 | What a blessing they didn't fall in! 539 | 540 | Said the man with a wink of his eye 541 | "But I love you" and then the reply 542 | From the girl, it was heard 543 | "You are truly absurd! 544 | I have only this moment walked by!" 545 | 546 | Leah is such a great swimmer. 547 | Leah is very much slimmer. 548 | She saw a big whale. 549 | It had a big tail. 550 | She put in a pot to simmer. 551 | 552 | Liam is afraid of a wolf. 553 | Tries to talk, but only goes woof. 554 | Liam bites meat too. 555 | The wolf only chews. 556 | Liam is such a silly goof. 557 | 558 | Jack is feeling scared of a fall. 559 | He is scared of playing basketball. 560 | He wants to play game. 561 | But his shot is lame. 562 | Gives it his all and the ball falls. 563 | 564 | Anne likes to skate on the cold ice. 565 | She thinks she has some skating spice. 566 | So, the big ice breaks. 567 | Then she starts to shake. 568 | She gets out and says, “That’s not nice.” 569 | 570 | Joe took a trip on a big boat. 571 | Joe had bought a really long rope. 572 | He pulled a big fish. 573 | The fish was a pitch. 574 | Joe rowed and rowed and rowed his boat. 575 | 576 | There was a nice girl named Jaray. 577 | She rhymed up and knew she could slay. 578 | Baddie knows she’s a 10. 579 | She talked it to win. 580 | She knows she could get some green pay. 581 | 582 | My brother saw one of his coats. 583 | Then my brother saw a cool boat. 584 | So, he grabbed a book. 585 | Then he saw a hook. 586 | Then my brother started to float. 587 | 588 | Peter is so scared of big rats. 589 | The rat is meaner than all cats. 590 | The rat bites and fights. 591 | He has a tall height. 592 | So, Peter becomes friends with rats. 593 | 594 | Hello, I’m the Earth, you love me. 595 | If you look closer you will see, 596 | I am in danger, 597 | Mars is a stranger. 598 | Mars is chasing me, please send help! 599 | 600 | Azuri is scared of the door. 601 | Her mom is honking the car horn. 602 | Her dad gets so mad. 603 | He says she so bad. 604 | Azuri plays with the front door. 605 | 606 | MJ is a great ball player. 607 | He is a fearless shot taker. 608 | When he dunks, he floats. 609 | That’s why he the G.O.A.T. 610 | He is the greatest shot maker. 611 | 612 | Ant-Man has no luck, he is stuck. 613 | He’s thinking man this really sucks. 614 | He starts to get mad. 615 | After he gets sad. 616 | Then he knew he had no good luck. 617 | 618 | Robert’s scared of a fight in the night. 619 | Woke with anime in his sight. 620 | Robert might bite you, 621 | Til’ your fingers blue. 622 | Robert saw a bad fight last night. 623 | 624 | Autumn is terrified of bees. 625 | The bees like to eat and smell feet. 626 | Bees love to waste time. 627 | And eat some green limes. 628 | The bees left to hide in the trees. 629 | 630 | There was an Old Man with a beard, 631 | Who said, 'It is just as I feared! 632 | Two Owls and a Hen, 633 | Four Larks and a Wren, 634 | Have all built their nests in my beard!' 635 | 636 | There was an Old Person of Ischia, 637 | Whose conduct grew friskier and friskier; 638 | He danced hornpipes and jigs, 639 | And ate thousands of figs, 640 | That lively Old Person of Ischia. 
641 | 642 | There was an Old Man in a boat, 643 | Who said, 'I'm afloat, I'm afloat!' 644 | When they said, 'No! you ain't!' 645 | He was ready to faint, 646 | That unhappy Old Man in a boat. 647 | 648 | There was a Young Lady of Hull, 649 | Who was chased by a virulent bull; 650 | But she seized on a spade, 651 | And called out, 'Who's afraid?' 652 | Which distracted that virulent bull. 653 | 654 | There was an Old Person of Ems, 655 | Who casually fell in the Thames; 656 | And when he was found 657 | They said he was drowned, 658 | That unlucky Old Person of Ems. 659 | 660 | There was an Old Man who said, 'Hush! 661 | I perceive a young bird in this bush!' 662 | When they said, 'Is it small?' 663 | He replied, 'Not at all! 664 | It is four times as big as the bush!' 665 | 666 | There was a Young Lady of Russia, 667 | Who screamed so that no one could hush her; 668 | Her screams were extreme, 669 | No one heard such a scream, 670 | As was screamed by that lady of Russia. 671 | 672 | There was an Old Person of Ewell, 673 | Who chiefly subsisted on gruel; 674 | But to make it more nice 675 | He inserted some mice, 676 | Which refreshed that Old Person of Ewell. 677 | 678 | There was an old man in a tree, 679 | Whose whiskers were lovely to see; 680 | But the birds of the air, 681 | Pluck'd them perfectly bare, 682 | To make themselves nests on that tree. 683 | 684 | There is a Young Lady whose nose 685 | Continually prospers and grows; 686 | When it grew out of sight, 687 | She exclaimed in a fright, 688 | "Oh! Farewell to the end of my nose!" 689 | 690 | There was an Old Person of Dean, 691 | Who dined on one pea and one bean; 692 | For he said, 693 | "More than that would make me too fat," 694 | That cautious Old Person of Dean. 695 | 696 | There was an Old Person of Dover, 697 | Who rushed through a field of blue Clover; 698 | But some very large bees, 699 | Stung his nose and his knees, 700 | So he very soon went back to Dover. 701 | 702 | There was an Old Man of Peru, 703 | Who watched his wife making a stew; 704 | But once by mistake, 705 | In a stove she did bake, 706 | That unfortunate Man of Peru. 707 | 708 | There was a Young Lady whose bonnet, 709 | Came untied when the birds sate upon it; 710 | But she said: 'I don't care! 711 | All the birds in the air 712 | Are welcome to sit on my bonnet!' -------------------------------------------------------------------------------- /llm_bench/benchmark_suite.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%pip install python-dotenv seaborn matplotlib pandas numpy locust==2.18.1 orjson==3.9.10\n", 10 | "from dotenv import load_dotenv\n", 11 | "load_dotenv()\n", 12 | "import os\n", 13 | "import pandas as pd\n", 14 | "import matplotlib.pyplot as plt\n", 15 | "import seaborn as sns\n", 16 | "import datetime\n", 17 | "import json\n", 18 | "import subprocess\n", 19 | "import numpy as np\n", 20 | "import time" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "# LLM Bench Guide\n", 28 | "\n", 29 | "This guide explains how to use the benchmarking tools to evaluate LLM performance.\n", 30 | "\n", 31 | "### Metrics Collected\n", 32 | "The `load_test.py` script measures:\n", 33 | "\n", 34 | "1. Average Time to First Token\n", 35 | "2. Average Token Latency\n", 36 | "3. Average Token Count\n", 37 | "4. 
Average Total Response Time\n", 38 | "5. Request Count\n", 39 | "6. Queries Per Second (QPS)\n", 40 | "7. Latency Percentiles (p50, p90, p99, p99.9)\n", 41 | " - Time to First Token\n", 42 | " - Total Response Time\n", 43 | "\n", 44 | "### Benchmark Scenarios\n", 45 | "We will create the following benchmark suites in this notebook. Although there are many other types of tests that can be run, please refer to the README.md for more details:\n", 46 | "\n", 47 | "1. Single Model Performance Analysis\n", 48 | "2. Comparing Performance of different Models and Providers\n", 49 | "3. Token Length Impact Study" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 71, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "'''Helper functions, you can ignore this'''\n", 59 | "\n", 60 | "#Function to utilize subprocess to run the locust script\n", 61 | "def execute_subprocess(cmd):\n", 62 | " print(f\"\\nExecuting benchmark: {' '.join(cmd)}\\n\")\n", 63 | " process = subprocess.Popen(\n", 64 | " cmd,\n", 65 | " text=True,\n", 66 | " stdout=subprocess.PIPE,\n", 67 | " stderr=subprocess.STDOUT,\n", 68 | " bufsize=1,\n", 69 | " universal_newlines=True\n", 70 | " )\n", 71 | " # Display output in real-time\n", 72 | " while True:\n", 73 | " output = process.stdout.readline()\n", 74 | " if output == '' and process.poll() is not None:\n", 75 | " break\n", 76 | " if output:\n", 77 | " print(output.strip())\n", 78 | "\n", 79 | " return_code = process.poll()\n", 80 | " if return_code != 0:\n", 81 | " print(f\"Benchmark failed with return code: {return_code}\")\n", 82 | " return False\n", 83 | " return True\n", 84 | "\n", 85 | "\n", 86 | "%matplotlib inline\n", 87 | "# Optional: for higher resolution plots\n", 88 | "%config InlineBackend.figure_format = 'retina'\n", 89 | "\n", 90 | "\n", 91 | "def visualize_comparative_results(stat_result_paths, results_dir):\n", 92 | " \"\"\"\n", 93 | " Create comparative visualizations for multiple model benchmark results\n", 94 | " \n", 95 | " Args:\n", 96 | " stat_result_paths (list): List of dicts containing paths to stats files and configs\n", 97 | " [{\"path\": \"path/to/stats.csv\", \"config\": {\"provider\": \"...\", \"model\": \"...\"}}, ...]\n", 98 | " results_dir (str): Directory to save the comparative visualizations\n", 99 | " \"\"\"\n", 100 | " # Read and combine all CSV files with model information\n", 101 | " dfs = []\n", 102 | " for result in stat_result_paths:\n", 103 | " df = pd.read_csv(result['path'])\n", 104 | " df['model'] = result['config']['model']\n", 105 | " df['provider'] = result['config']['provider']\n", 106 | " dfs.append(df)\n", 107 | " \n", 108 | " # Combine all dataframes\n", 109 | " combined_df = pd.concat(dfs, ignore_index=True)\n", 110 | " \n", 111 | " # Get POST data first\n", 112 | " post_data = combined_df[combined_df['Type'] == 'POST']\n", 113 | " \n", 114 | " # Get unique models and calculate dynamic bar width\n", 115 | " models = post_data['model'].unique()\n", 116 | " num_models = len(models)\n", 117 | " bar_width = min(0.35, 0.8 / num_models) # Dynamically reduce bar width as models increase\n", 118 | " \n", 119 | " # Set style for better visualizations\n", 120 | " plt.style.use('ggplot')\n", 121 | " fig = plt.figure(figsize=(20, 15))\n", 122 | "\n", 123 | " # 1. 
Response Time Distribution Comparison\n", 124 | " plt.subplot(2, 2, 1)\n", 125 | " metrics_to_plot = ['Average Response Time', 'Median Response Time']\n", 126 | " \n", 127 | " x = np.arange(len(metrics_to_plot))\n", 128 | " # Adjust bar positions to be centered\n", 129 | " positions = np.linspace(-(bar_width * (num_models-1))/2, \n", 130 | " (bar_width * (num_models-1))/2, \n", 131 | " num_models)\n", 132 | " \n", 133 | " for i, model in enumerate(models):\n", 134 | " model_data = post_data[post_data['model'] == model][metrics_to_plot]\n", 135 | " plt.bar(x + positions[i], model_data.iloc[0], bar_width, label=model.split('/')[-1])\n", 136 | " \n", 137 | " plt.title('Response Time Comparison')\n", 138 | " plt.xlabel('Metrics')\n", 139 | " plt.ylabel('Time (ms)')\n", 140 | " plt.xticks(x, metrics_to_plot, rotation=45)\n", 141 | " plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')\n", 142 | "\n", 143 | " # 2. QPS Comparison\n", 144 | " plt.subplot(2, 2, 2)\n", 145 | " qps_data = combined_df[combined_df['Type'] == 'POST'][['model', 'Requests/s']]\n", 146 | " x = np.arange(len(models))\n", 147 | " plt.bar(x, [qps_data[qps_data['model'] == model]['Requests/s'].iloc[0] for model in models], \n", 148 | " width=0.6) # Single bars can be wider\n", 149 | " plt.title('Throughput Comparison')\n", 150 | " plt.xlabel('Model')\n", 151 | " plt.ylabel('Requests per Second')\n", 152 | " plt.xticks(x, [model.split('/')[-1] for model in models], rotation=45)\n", 153 | "\n", 154 | " # 3. Token Latency Comparison\n", 155 | " plt.subplot(2, 2, 3)\n", 156 | " token_metrics = ['latency_per_token', 'overall_latency_per_token']\n", 157 | " token_data = combined_df[\n", 158 | " (combined_df['Type'] == 'METRIC') & \n", 159 | " (combined_df['Name'].isin(token_metrics))\n", 160 | " ]\n", 161 | " \n", 162 | " x = np.arange(len(token_metrics))\n", 163 | " for i, model in enumerate(models):\n", 164 | " model_data = token_data[token_data['model'] == model]\n", 165 | " values = [\n", 166 | " model_data[model_data['Name'] == metric]['Average Response Time'].iloc[0]\n", 167 | " for metric in token_metrics\n", 168 | " ]\n", 169 | " plt.bar(x + positions[i], values, bar_width, label=model.split('/')[-1])\n", 170 | " \n", 171 | " plt.title('Token Latency Comparison')\n", 172 | " plt.xlabel('Metrics')\n", 173 | " plt.ylabel('Time (ms)')\n", 174 | " plt.xticks(x, ['Per Token', 'Overall Per Token'], rotation=45)\n", 175 | " plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')\n", 176 | "\n", 177 | " # 4. 
Percentile Distribution Comparison\n", 178 | " plt.subplot(2, 2, 4)\n", 179 | " percentiles = ['50%', '75%', '90%', '95%', '99%', '99.9%']\n", 180 | " \n", 181 | " for model in models:\n", 182 | " model_data = post_data[post_data['model'] == model]\n", 183 | " plt.plot(percentiles, model_data[percentiles].iloc[0], marker='o', label=model.split('/')[-1])\n", 184 | " \n", 185 | " plt.title('Response Time Percentiles Comparison')\n", 186 | " plt.xlabel('Percentile')\n", 187 | " plt.ylabel('Response Time (ms)')\n", 188 | " plt.xticks(rotation=45)\n", 189 | " plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')\n", 190 | "\n", 191 | " # Adjust layout to prevent overlapping\n", 192 | " plt.tight_layout()\n", 193 | " # Save the figure\n", 194 | " plt.savefig(f'{results_dir}/comparative_performance_metrics.png', \n", 195 | " bbox_inches='tight', # Ensure the legend is included in the saved figure\n", 196 | " dpi=300) # Higher resolution\n", 197 | " # Display in notebook\n", 198 | " plt.show()\n", 199 | " # Close the figure\n", 200 | " plt.close()\n", 201 | "\n", 202 | " # Generate summary statistics\n", 203 | " summary_stats = []\n", 204 | " for model in models:\n", 205 | " model_data = combined_df[combined_df['model'] == model]\n", 206 | " post_data = model_data[model_data['Type'] == 'POST'].iloc[0]\n", 207 | " token_data = model_data[model_data['Type'] == 'METRIC']\n", 208 | " \n", 209 | " summary_stats.append({\n", 210 | " \"Model\": model.split('/')[-1],\n", 211 | " \"Provider\": model_data['provider'].iloc[0],\n", 212 | " \"Average QPS\": post_data['Requests/s'],\n", 213 | " \"Average Response Time\": post_data['Average Response Time'],\n", 214 | " \"99th Percentile Latency\": post_data['99%'],\n", 215 | " \"Average Tokens per Request\": token_data[token_data['Name'] == 'num_tokens']['Average Response Time'].iloc[0]\n", 216 | " })\n", 217 | " \n", 218 | " # Print comparative summary\n", 219 | " print(\"\\nComparative Summary:\")\n", 220 | " print(\"-\" * 80)\n", 221 | " summary_df = pd.DataFrame(summary_stats)\n", 222 | " print(summary_df.to_string(index=False))\n", 223 | " \n", 224 | " # Save summary to CSV\n", 225 | " summary_df.to_csv(f'{results_dir}/comparative_summary.csv', index=False)\n", 226 | " \n", 227 | " return summary_stats" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "## Single Model and Provider Performance Analysis\n", 235 | "\n", 236 | "Evaluate performance metrics of singular model from one provider. This is the most basic benchmark that can be run from the load_test.py script." 
237 |    ] 238 |   }, 239 |   { 240 |    "cell_type": "code", 241 |    "execution_count": null, 242 |    "metadata": {}, 243 |    "outputs": [], 244 |    "source": [ 245 |     "'''\n", 246 |     "Make sure to create a .env file in the root directory and add your API keys.\n", 247 |     "For this example, we will use the Fireworks API key.\n", 248 |     "\n", 249 |     "Add the following to your .env file:\n", 250 |     "\n", 251 |     "FIREWORKS_API_KEY=<your_api_key>\n", 252 |     "\n", 253 |     "Alternatively, you can edit the following script flags for custom configurations.\n", 254 |     "'''\n", 255 |     "\n", 256 |     "\n", 257 |     "provider_name = \"fireworks\"\n", 258 |     "model_name = \"accounts/fireworks/models/llama-v3p2-3b-instruct\"\n", 259 |     "h = \"https://api.fireworks.ai/inference\" # host url\n", 260 |     "api_key = os.getenv(\"FIREWORKS_API_KEY\")\n", 261 |     "\n", 262 |     "t = \"5s\" # test duration; set to 5 seconds for now, increase (e.g. \"1m\") for more stable results\n", 263 |     "\n", 264 |     "'''\n", 265 |     "Choose ONE of the following two modes by commenting/uncommenting:\n", 266 |     "'''\n", 267 |     "# MODE 1: Fixed Queries Per Second (QPS)\n", 268 |     "# Use this mode to maintain a steady rate of requests\n", 269 |     "qps = 5  # Target requests per second\n", 270 |     "u = 100  # Number of users (keep high enough to achieve target QPS)\n", 271 |     "s = 100  # Spawn rate (keep high enough to achieve target QPS)\n", 272 |     "\n", 273 |     "# MODE 2: Fixed Concurrency\n", 274 |     "# Use this mode to maintain a steady number of concurrent requests\n", 275 |     "# Comment out Mode 1 above and uncomment below to use this mode\n", 276 |     "'''\n", 277 |     "# QPS does not need to be set for fixed concurrency mode\n", 278 |     "u = 5  # Number of concurrent workers\n", 279 |     "r = 5  # Rate of spawning new workers (workers/second). Look through README.md for more details on spawn rate\n", 280 |     "'''\n", 281 |     "\n", 282 |     "\n", 283 |     "\n", 284 |     "# Create results directory named {provider}_{model}_analysis_{TIMESTAMP}\n", 285 |     "timestamp = datetime.datetime.now().strftime(\"%Y%m%d_%H%M\")\n", 286 |     "\n", 287 |     "edited_model_name = model_name.replace(\"/\", \"_\") if provider_name != \"fireworks\" else model_name.replace(\"accounts/fireworks/models/\", \"\").replace(\"/\", \"_\")\n", 288 |     "\n", 289 |     "\n", 290 |     "\n", 291 |     "results_dir = f\"results/{provider_name}_{edited_model_name}_analysis_{timestamp}\"\n", 292 |     "os.makedirs(results_dir, exist_ok=True)\n", 293 |     "\n", 294 |     "# Construct the command\n", 295 |     "cmd = [\n", 296 |     "    \"locust\",\n", 297 |     "    \"--headless\",  # Run without web UI\n", 298 |     "    \"--only-summary\",  # Only show summary stats\n", 299 |     "    \"-H\", h,  # Host URL\n", 300 |     "    \"--provider\", provider_name,\n", 301 |     "    \"--model\", model_name,\n", 302 |     "    \"--api-key\", api_key,\n", 303 |     "    \"-t\", t,  # Test duration\n", 304 |     "    \"--html\", f\"{results_dir}/report.html\",  # Generate HTML report\n", 305 |     "    \"--csv\", f\"{results_dir}/stats\",  # Generate CSV stats\n", 306 |     "]\n", 307 |     "\n", 308 |     "# Add Mode 1 (fixed QPS) parameters; if using fixed concurrency mode instead, remove --qps below\n", 309 |     "cmd.extend([\n", 310 |     "    \"-u\", str(u),  # Number of users\n", 311 |     "    \"-r\", str(s),  # Spawn rate\n", 312 |     "    \"--qps\", str(qps)  # Target QPS\n", 313 |     "])\n", 314 |     "\n", 315 |     "# Add load_test.py as the locust file\n", 316 |     "locust_file = os.path.join(os.path.dirname(os.getcwd()), \"llm_bench\", \"load_test.py\")\n", 317 |     "cmd.extend([\"-f\", locust_file]) \n", 318 |     "\n", 319 |     "# call our helper function to execute the command\n", 320 |     "success = execute_subprocess(cmd)\n", 321 |     "\n", 322 |     
"#Visualize the results\n", 323 | "if success: \n", 324 | " time.sleep(1)\n", 325 | " stat_result_paths = [{\"path\": f'{results_dir}/stats_stats.csv', \"config\": {\"provider\": provider_name, \"model\": model_name}}]\n", 326 | " visualize_comparative_results(stat_result_paths, results_dir)" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "## Comparing different Models and Providers\n", 334 | "\n", 335 | "Evaluate performance metrics of different models and providers." 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [ 344 | "'''\n", 345 | "Edit the provider_configs list to add more providers and models.\n", 346 | "\n", 347 | "Add any needed api keys to the .env file. This example uses the Fireworks API key again.\n", 348 | "'''\n", 349 | "\n", 350 | "provider_configs = [\n", 351 | " {\"provider\": \"fireworks\", \"model\": \"accounts/fireworks/models/llama-v3p2-3b-instruct\", \"host\": \"https://api.fireworks.ai/inference\", \"api_key\": os.getenv(\"FIREWORKS_API_KEY\")},\n", 352 | " {\"provider\": \"fireworks\", \"model\": \"accounts/fireworks/models/mistral-small-24b-instruct-2501\", \"host\": \"https://api.fireworks.ai/inference\", \"api_key\": os.getenv(\"FIREWORKS_API_KEY\")},\n", 353 | " #... add more providers and models here\n", 354 | "]\n", 355 | "\n", 356 | "# some starter configs and flags\n", 357 | "t = \"5s\" #test duration, set to 1 minute for now\n", 358 | "qps = 5 # Target requests per second\n", 359 | "u = 100 # Number of users (keep high enough to achieve target QPS)\n", 360 | "s = 100 # Spawn rate (keep high enough to achieve target QPS)\n", 361 | "\n", 362 | "# Create results directory of name single_model_provider_analysis_{TIMESTAMP}\n", 363 | "timestamp = datetime.datetime.now().strftime(\"%Y%m%d_%H%M\")\n", 364 | "results_dir = f\"results/different_models_and_providers_analysis_{timestamp}\"\n", 365 | "os.makedirs(results_dir, exist_ok=True)\n", 366 | "\n", 367 | "for index, config in enumerate(provider_configs):\n", 368 | " # Construct the command\n", 369 | "\n", 370 | " edited_model_name = config[\"model\"].replace(\"/\", \"_\") if config[\"provider\"] != \"fireworks\" else config[\"model\"].replace(\"accounts/fireworks/models/\", \"\").replace(\"/\", \"_\")\n", 371 | " \n", 372 | " provider_model_path = f\"{results_dir}/{config[\"provider\"]}_{edited_model_name}_{index}\"\n", 373 | " \n", 374 | " os.makedirs(provider_model_path, exist_ok=True)\n", 375 | " cmd = [\n", 376 | " \"locust\",\n", 377 | " \"--headless\", # Run without web UI\n", 378 | " \"--only-summary\", # Only show summary stats\n", 379 | " \"-H\", config[\"host\"], # Host URL\n", 380 | " \"--provider\", config[\"provider\"],\n", 381 | " \"--model\", config[\"model\"],\n", 382 | " \"--api-key\", config[\"api_key\"],\n", 383 | " \"-t\", t, # Test duration\n", 384 | " \"--html\", f\"{provider_model_path}/report.html\", # Generate HTML report\n", 385 | " \"--csv\", f\"{provider_model_path}/stats\", # Generate CSV stats\n", 386 | " ]\n", 387 | "\n", 388 | " # Add Mode 1 (Fixed QPS) parameters if uncommented, remember to remove --qps below if using fixed concurrency mode\n", 389 | " cmd.extend([\n", 390 | " \"-u\", str(u), # Number of users\n", 391 | " \"-r\", str(s), # Spawn rate\n", 392 | " \"--qps\", str(qps) # Target QPS\n", 393 | " ])\n", 394 | "\n", 395 | " # Add load_test.py as the locust file\n", 396 | " locust_file = 
os.path.join(os.path.dirname(os.getcwd()), \"llm_bench\", \"load_test.py\")\n", 397 |     "    cmd.extend([\"-f\", locust_file]) \n", 398 |     "\n", 399 |     "    # call our helper function to execute the command\n", 400 |     "    execute_subprocess(cmd)\n", 401 |     "\n", 402 |     "# Visualize the results\n", 403 |     "stat_result_paths = []\n", 404 |     "for index, config in enumerate(provider_configs):\n", 405 |     "    edited_model_name = config[\"model\"].replace(\"/\", \"_\") if config[\"provider\"] != \"fireworks\" else config[\"model\"].replace(\"accounts/fireworks/models/\", \"\").replace(\"/\", \"_\")\n", 406 |     "    stat_result_paths.append({\"path\": f\"{results_dir}/{config['provider']}_{edited_model_name}_{index}/stats_stats.csv\", \"config\": config})\n", 407 |     "\n", 408 |     "time.sleep(1)\n", 409 |     "visualize_comparative_results(stat_result_paths, results_dir)\n", 410 |     "\n" 411 |    ] 412 |   }, 413 |   { 414 |    "cell_type": "markdown", 415 |    "metadata": {}, 416 |    "source": [ 417 |     "## Token Length Analysis\n", 418 |     "\n", 419 |     "Evaluates model performance across different output lengths by testing the same model \n", 420 |     "with varying output token limits (from short to long responses)." 421 |    ] 422 |   }, 423 |   { 424 |    "cell_type": "code", 425 |    "execution_count": null, 426 |    "metadata": {}, 427 |    "outputs": [], 428 |    "source": [ 429 |     "max_token_lengths = [30, 64, 128, 256] # --max-tokens flag\n", 430 |     "max_token_lengths_distribution = \"uniform\" # --max-tokens-distribution flag; check README.md for more details and other options\n", 431 |     "\n", 432 |     "\n", 433 |     "provider_name = \"fireworks\"\n", 434 |     "model_name = \"accounts/fireworks/models/llama-v3p2-3b-instruct\"\n", 435 |     "h = \"https://api.fireworks.ai/inference\" # host url\n", 436 |     "api_key = os.getenv(\"FIREWORKS_API_KEY\")\n", 437 |     "\n", 438 |     "t = \"5s\" # test duration; set to 5 seconds for now, increase (e.g. \"1m\") for more stable results\n", 439 |     "qps = 5  # Target requests per second\n", 440 |     "u = 100  # Number of users (keep high enough to achieve target QPS)\n", 441 |     "s = 100  # Spawn rate (keep high enough to achieve target QPS)\n", 442 |     "\n", 443 |     "# Create results directory named token_length_analysis_{TIMESTAMP}\n", 444 |     "timestamp = datetime.datetime.now().strftime(\"%Y%m%d_%H%M\")\n", 445 |     "results_dir = f\"results/token_length_analysis_{timestamp}\"\n", 446 |     "os.makedirs(results_dir, exist_ok=True)\n", 447 |     "\n", 448 |     "for index, token_length in enumerate(max_token_lengths):\n", 449 |     "    # Construct the command\n", 450 |     "\n", 451 |     "    edited_model_name = model_name.replace(\"/\", \"_\") if provider_name != \"fireworks\" else model_name.replace(\"accounts/fireworks/models/\", \"\").replace(\"/\", \"_\")\n", 452 |     "    \n", 453 |     "    token_length_path = f\"{results_dir}/{provider_name}_{edited_model_name}_{token_length}\"\n", 454 |     "    os.makedirs(f\"{token_length_path}\", exist_ok=True)\n", 455 |     "    cmd = [\n", 456 |     "        \"locust\",\n", 457 |     "        \"--headless\",  # Run without web UI\n", 458 |     "        \"--only-summary\",  # Only show summary stats\n", 459 |     "        \"-H\", h,  # Host URL\n", 460 |     "        \"--provider\", provider_name,\n", 461 |     "        \"--model\", model_name,\n", 462 |     "        \"--api-key\", api_key,\n", 463 |     "        \"-t\", t,  # Test duration\n", 464 |     "        \"--max-tokens\", str(token_length), \n", 465 |     "        \"--max-tokens-distribution\", max_token_lengths_distribution,\n", 466 |     "        \"--html\", f\"{token_length_path}/report.html\",  # Generate HTML report\n", 467 |     "        \"--csv\", f\"{token_length_path}/stats\",  # Generate CSV stats\n", 468 |     "    ]\n", 469 |     "\n", 470 |     "    # Add Mode 1 (fixed QPS) parameters; if using fixed concurrency mode instead, remove --qps below\n",
471 |     "    cmd.extend([\n", 472 |     "        \"-u\", str(u),  # Number of users\n", 473 |     "        \"-r\", str(s),  # Spawn rate\n", 474 |     "        \"--qps\", str(qps)  # Target QPS\n", 475 |     "    ])\n", 476 |     "\n", 477 |     "    # Add load_test.py as the locust file\n", 478 |     "    locust_file = os.path.join(os.path.dirname(os.getcwd()), \"llm_bench\", \"load_test.py\")\n", 479 |     "    cmd.extend([\"-f\", locust_file]) \n", 480 |     "\n", 481 |     "    # call our helper function to execute the command\n", 482 |     "    execute_subprocess(cmd)\n", 483 |     "\n", 484 |     "# Visualize the results\n", 485 |     "stat_result_paths = []\n", 486 |     "for index, token_length in enumerate(max_token_lengths):\n", 487 |     "\n", 488 |     "    edited_model_name = model_name.replace(\"/\", \"_\") if provider_name != \"fireworks\" else model_name.replace(\"accounts/fireworks/models/\", \"\").replace(\"/\", \"_\")\n", 489 |     "    stat_result_paths.append({\"path\": f\"{results_dir}/{provider_name}_{edited_model_name}_{token_length}/stats_stats.csv\", \"config\": {\"provider\": provider_name, \"model\": model_name + \"_\" + str(token_length)}})\n", 490 |     "\n", 491 |     "time.sleep(1)\n", 492 |     "visualize_comparative_results(stat_result_paths, results_dir)" 493 |    ] 494 |   } 495 |  ], 496 |  "metadata": { 497 |   "kernelspec": { 498 |    "display_name": "benchmark", 499 |    "language": "python", 500 |    "name": "python3" 501 |   }, 502 |   "language_info": { 503 |    "codemirror_mode": { 504 |     "name": "ipython", 505 |     "version": 3 506 |    }, 507 |    "file_extension": ".py", 508 |    "mimetype": "text/x-python", 509 |    "name": "python", 510 |    "nbconvert_exporter": "python", 511 |    "pygments_lexer": "ipython3", 512 |    "version": "3.12.8" 513 |   } 514 |  }, 515 |  "nbformat": 4, 516 |  "nbformat_minor": 2 517 | } 518 | -------------------------------------------------------------------------------- /llm_bench/load_test.py: -------------------------------------------------------------------------------- 1 | import abc 2 | import argparse 3 | import csv 4 | from dataclasses import dataclass 5 | from functools import partial 6 | import os 7 | import random 8 | import sys 9 | import traceback 10 | from typing import Optional 11 | from locust import HttpUser, task, events, constant_pacing 12 | import copy 13 | import json 14 | import time 15 | import orjson 16 | import base64 17 | import io 18 | import itertools 19 | from PIL import Image 20 | import transformers 21 | import re 22 | import gevent 23 | from locust.util.timespan import parse_timespan as _locust_parse_timespan 24 | 25 | try: 26 |     import locust_plugins 27 | except ImportError: 28 |     print("locust-plugins is not installed, Grafana won't work") 29 | 30 | 31 | def add_custom_metric(name, value, length_value=0): 32 |     events.request.fire( 33 |         request_type="METRIC", 34 |         name=name, 35 |         response_time=value, 36 |         response_length=length_value, 37 |         exception=None, 38 |         context=None, 39 |     ) 40 | 41 | 42 | PROMPT_CHAT_IMAGE_PLACEHOLDER = "<image>" 43 | 44 | 45 | class LimericsDataset: 46 |     _PROMPT = "\n\nTranslate the limericks above to Spanish, then re-write limericks using different styles. Do it 10 times."
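    # LimericsDataset assembles synthetic prompts from limericks.txt: a shared prefix
    # of randomly drawn limericks sized by --prompt-cache-max-len (so repeated requests
    # can exercise the server's prompt cache), plus fresh random limericks on every
    # request until the --prompt-tokens target is reached.
    # A minimal sketch of standalone usage (the tokenizer path is hypothetical):
    #   ds = LimericsDataset("limericks.txt", "meta-llama/Llama-3.2-3B-Instruct",
    #                        chat=True, num_tokens=512, common_tokens=0)
    #   prompt, prompt_tokens = next(iter(ds))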
47 | 48 |     def __init__( 49 |         self, 50 |         path: str, 51 |         tokenizer_path: str, 52 |         chat: bool, 53 |         num_tokens: int, 54 |         common_tokens: int, 55 |     ): 56 |         self._tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_path) 57 |         self._num_tokens = num_tokens 58 | 59 |         self._all_limericks = [] 60 |         with open(path, "r") as f: 61 |             text = f.read() 62 |         lims = text.split("\n\n") 63 |         for i, lim in enumerate(lims): 64 |             num_tokens = len(self._tokenizer.encode(lim)) 65 |             self._all_limericks.append((lim, num_tokens)) 66 | 67 |         self._prefix = "" 68 |         self._suffix = self._PROMPT 69 |         self._prefix_suffix_tokens = len(self._tokenizer.encode(self._PROMPT)) 70 |         while self._prefix_suffix_tokens < common_tokens: 71 |             lim, num_tokens = self._all_limericks[ 72 |                 random.randint(0, len(self._all_limericks) - 1) 73 |             ] 74 |             self._prefix += lim + "\n\n" 75 |             self._prefix_suffix_tokens += num_tokens 76 | 77 |         if chat: 78 |             empty_template_tokens = self._tokenizer.apply_chat_template( 79 |                 [{"role": "user", "content": ""}], 80 |                 tokenize=True, 81 |                 add_generation_prompt=True, 82 |             ) 83 |             self._prefix_suffix_tokens += len(empty_template_tokens) 84 | 85 |     def __next__(self): 86 |         prompt_tokens = self._prefix_suffix_tokens 87 |         prompt = self._prefix 88 |         while prompt_tokens < self._num_tokens: 89 |             lim, num_tokens = self._all_limericks[ 90 |                 random.randint(0, len(self._all_limericks) - 1) 91 |             ] 92 | 93 |             prompt += lim + "\n\n" 94 |             prompt_tokens += num_tokens 95 |         prompt += self._suffix 96 | 97 |         return prompt, prompt_tokens 98 | 99 |     def __iter__(self): 100 |         return self 101 | 102 | 103 | class JsonlDataset: 104 |     def __init__(self, path: str): 105 |         self.path = path 106 | 107 |     def __iter__(self): 108 |         return itertools.cycle(self._read_data()) 109 | 110 |     def _read_data(self): 111 |         with open(self.path, "r") as f: 112 |             for line in f: 113 |                 yield json.loads(line), 0 114 | 115 | 116 | class DatasetHolder: 117 |     _instance = None 118 | 119 |     @classmethod 120 |     def _create_dataset(cls, options: argparse.Namespace): 121 |         if options.dataset.startswith("@"): 122 |             return JsonlDataset(options.dataset[1:]) 123 |         elif options.dataset == "limerics": 124 |             assert ( 125 |                 options.tokenizer is not None 126 |             ), "--tokenizer is required for limerics dataset" 127 |             return LimericsDataset( 128 |                 path=os.path.join( 129 |                     os.path.dirname(os.path.abspath(__file__)), "limericks.txt" 130 |                 ), 131 |                 tokenizer_path=options.tokenizer, 132 |                 chat=options.chat, 133 |                 num_tokens=options.prompt_tokens, 134 |                 common_tokens=options.prompt_cache_max_len, 135 |             ) 136 |         else: 137 |             raise ValueError(f"Unknown dataset: {options.dataset}") 138 | 139 |     @classmethod 140 |     def get_instance(cls, options: argparse.Namespace): 141 |         if cls._instance is None: 142 |             cls._instance = cls._create_dataset(options) 143 |         return cls._instance 144 | 145 | 146 | class FixedQPSPacer: 147 |     _instance = None 148 | 149 |     def __init__(self, qps, distribution): 150 |         self.qps = qps 151 |         self.distribution = distribution 152 | 153 |         # It's roughly thread-safe thanks to the GIL, as the only state is `t` - good enough for a load test 154 |         def gen(): 155 |             t = time.time() 156 |             mean_wait = 1 / self.qps 157 |             while True: 158 |                 if self.distribution == "exponential": 159 |                     wait = random.expovariate(1 / mean_wait) 160 |                 elif self.distribution == "uniform": 161 |                     wait = random.uniform(0, 2 * mean_wait) 162 |                 elif self.distribution == "constant": 163 |                     wait = mean_wait 164 |                 else: 165 |                     print(f"Unknown distribution {self.distribution}") 166 |                     os._exit(1) 167 |                 t += wait 168
| yield t 169 | 170 | self.iterator = gen() 171 | 172 | @classmethod 173 | def instance(cls, qps, distribution): 174 | if cls._instance is None: 175 | cls._instance = cls(qps, distribution) 176 | else: 177 | assert cls._instance.qps == qps 178 | assert cls._instance.distribution == distribution 179 | return cls._instance 180 | 181 | def wait_time_till_next(self): 182 | t = next(self.iterator) 183 | now = time.time() 184 | if now > t: 185 | print( 186 | f"WARNING: not enough locust users to keep up with the desired QPS. Either the number of locust users is too low or the server is overloaded. Delay: {now-t:.3f}s" 187 | ) 188 | return 0 189 | return t - now 190 | 191 | 192 | class LengthSampler: 193 | def __init__(self, distribution: str, mean: int, cap: Optional[int], alpha: float): 194 | self.distribution = distribution 195 | self.mean = mean 196 | self.cap = cap 197 | self.alpha = alpha 198 | 199 | if self.distribution == "exponential": 200 | self.sample_func = lambda: int(random.expovariate(1 / self.mean)) 201 | elif self.distribution == "uniform": 202 | mx = self.mean + int(self.alpha * self.mean) 203 | if self.cap is not None: 204 | mx = min(mx, self.cap) 205 | self.sample_func = lambda: random.randint( 206 | max(1, self.mean - int(self.alpha * self.mean)), mx 207 | ) 208 | elif self.distribution == "constant": 209 | self.sample_func = lambda: self.mean 210 | elif self.distribution == "normal": 211 | self.sample_func = lambda: int( 212 | random.gauss(self.mean, self.mean * self.alpha) 213 | ) 214 | else: 215 | raise ValueError(f"Unknown distribution {self.distribution}") 216 | 217 | def sample(self) -> int: 218 | for _ in range(1000): 219 | sample = self.sample_func() 220 | if sample <= 0: 221 | continue 222 | if self.cap is not None and sample > self.cap: 223 | continue 224 | return sample 225 | else: 226 | raise ValueError( 227 | "Can't sample a value after 1000 attempts, check distribution parameters" 228 | ) 229 | 230 | def __str__(self): 231 | r = int(self.mean * self.alpha) 232 | if self.distribution == "constant": 233 | s = str(self.mean) 234 | elif self.distribution == "uniform": 235 | s = f"uniform({self.mean} +/- {r})" 236 | elif self.distribution == "normal": 237 | s = f"normal({self.mean}, {r})" 238 | elif self.distribution == "exponential": 239 | s = f"exponential({self.mean})" 240 | else: 241 | assert False 242 | if self.cap is not None: 243 | s += f" capped at {self.cap}" 244 | return s 245 | 246 | 247 | class InitTracker: 248 | users = None 249 | first_request_done = 0 250 | logging_params = None 251 | environment = None 252 | tokenizer = None 253 | deferred_run_time_seconds = None 254 | stop_scheduled = False 255 | stats_reset_done = False 256 | 257 | @classmethod 258 | def notify_init(cls, environment, logging_params): 259 | if cls.environment is None: 260 | cls.environment = environment 261 | if cls.logging_params is None: 262 | cls.logging_params = logging_params 263 | else: 264 | assert ( 265 | cls.logging_params == logging_params 266 | ), f"Inconsistent settings between workers: {cls.logging_params} != {logging_params}" 267 | 268 | @classmethod 269 | def notify_first_request(cls): 270 | cls.first_request_done += 1 271 | 272 | @classmethod 273 | def notify_spawning_complete(cls, user_count): 274 | cls.users = user_count 275 | # Start steady-state measurement exactly when all users have spawned 276 | if not cls.stats_reset_done: 277 | cls.reset_stats() 278 | cls.stats_reset_done = True 279 | # If -t/--run-time was provided, schedule test stop relative to spawn 
complete 280 |         if ( 281 |             cls.deferred_run_time_seconds is not None 282 |             and not cls.stop_scheduled 283 |             and cls.environment is not None 284 |             and cls.environment.runner is not None 285 |         ): 286 |             delay = float(cls.deferred_run_time_seconds) 287 |             print(f"Scheduling stop {delay}s after spawning complete (deferred -t)") 288 |             gevent.spawn_later(delay, cls.environment.runner.quit) 289 |             cls.stop_scheduled = True 290 | 291 |     @classmethod 292 |     def reset_stats(cls): 293 |         assert cls.environment.runner, "only local mode is supported" 294 |         print("Resetting stats after traffic reaches a steady state") 295 |         cls.environment.events.reset_stats.fire() 296 |         cls.environment.runner.stats.reset_all() 297 | 298 |     @classmethod 299 |     def load_tokenizer(cls, dir): 300 |         if not dir: 301 |             return None 302 |         if cls.tokenizer: 303 |             return cls.tokenizer 304 |         import transformers 305 | 306 |         cls.tokenizer = transformers.AutoTokenizer.from_pretrained(dir) 307 |         cls.tokenizer.add_bos_token = False 308 |         cls.tokenizer.add_eos_token = False 309 |         return cls.tokenizer 310 | 311 | 312 | events.spawning_complete.add_listener(InitTracker.notify_spawning_complete) 313 | 314 | 315 | def _parse_run_time_to_seconds(run_time_value): 316 |     """Parse Locust -t/--run-time value into seconds (float). Supports both 317 |     already-parsed numeric values and human strings like '30s', '5m', '1h30m'. 318 |     """ 319 |     if not run_time_value: 320 |         return None 321 |     # If Locust already parsed it to a number (seconds), just use it 322 |     if isinstance(run_time_value, (int, float)): 323 |         return float(run_time_value) 324 |     # Try Locust's own parser first 325 |     if _locust_parse_timespan is not None: 326 |         try: 327 |             return float(_locust_parse_timespan(run_time_value)) 328 |         except Exception: 329 |             pass 330 |     # Fallback simple parser for strings like '1h30m15s' 331 |     s = str(run_time_value).strip().lower() 332 |     total = 0.0 333 |     for value, unit in re.findall(r"(\d+)\s*([smhd])", s): 334 |         n = float(value) 335 |         if unit == "s": 336 |             total += n 337 |         elif unit == "m": 338 |             total += n * 60 339 |         elif unit == "h": 340 |             total += n * 3600 341 |         elif unit == "d": 342 |             total += n * 86400 343 |     if total == 0.0: 344 |         raise ValueError(f"Unable to parse run time value: {run_time_value}") 345 |     return total 346 | 347 | 348 | @events.init.add_listener 349 | def _defer_run_time_to_after_spawn(environment, **_kwargs): 350 |     """Capture -t/--run-time and defer it to start counting after spawn completes. 351 | 352 |     We store the desired duration, null out the original option to prevent 353 |     Locust from scheduling an early stop, and then schedule our own stop in 354 |     InitTracker.notify_spawning_complete. 
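    This way -t measures a steady-state window only, excluding the ramp-up period.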
355 | """ 356 | try: 357 | run_time_value = getattr(environment.parsed_options, "run_time", None) 358 | except Exception: 359 | run_time_value = None 360 | seconds = _parse_run_time_to_seconds(run_time_value) if run_time_value else None 361 | if seconds: 362 | # Disable Locust's default run_time handling by clearing it 363 | try: 364 | environment.parsed_options.run_time = None 365 | except Exception: 366 | pass 367 | InitTracker.deferred_run_time_seconds = seconds 368 | InitTracker.environment = environment 369 | print( 370 | f"Deferring -t/--run-time to start after spawning complete: {seconds}s" 371 | ) 372 | 373 | 374 | @dataclass 375 | class ChunkMetadata: 376 | text: str 377 | logprob_tokens: Optional[int] 378 | usage_tokens: Optional[int] 379 | prompt_usage_tokens: Optional[int] 380 | 381 | 382 | class BaseProvider(abc.ABC): 383 | DEFAULT_MODEL_NAME = None 384 | 385 | def __init__(self, model, parsed_options): 386 | self.model = model 387 | self.parsed_options = parsed_options 388 | 389 | @abc.abstractmethod 390 | def get_url(self): ... 391 | 392 | @abc.abstractmethod 393 | def format_payload(self, prompt, max_tokens, images): ... 394 | 395 | @abc.abstractmethod 396 | def parse_output_json(self, json): ... 397 | 398 | 399 | class OpenAIProvider(BaseProvider): 400 | def get_url(self): 401 | if self.parsed_options.embeddings: 402 | return "/v1/embeddings" 403 | elif self.parsed_options.chat: 404 | return "/v1/chat/completions" 405 | else: 406 | return "/v1/completions" 407 | 408 | def format_payload(self, prompt, max_tokens, images): 409 | if self.parsed_options.embeddings: 410 | data = { 411 | "model": self.model, 412 | "input": prompt, 413 | } 414 | # Add embeddings-specific parameters 415 | if self.parsed_options.return_logits is not None: 416 | data["return_logits"] = self.parsed_options.return_logits 417 | if self.parsed_options.normalize is not None: 418 | data["normalize"] = self.parsed_options.normalize 419 | return data 420 | 421 | data = { 422 | "model": self.model, 423 | "max_tokens": max_tokens, 424 | "stream": self.parsed_options.stream, 425 | "temperature": self.parsed_options.temperature, 426 | "n": self.parsed_options.n, 427 | } 428 | if self.parsed_options.top_k is not None: 429 | data["top_k"] = self.parsed_options.top_k 430 | if self.parsed_options.logprobs is not None: 431 | data["logprobs"] = self.parsed_options.logprobs 432 | if isinstance(prompt, str): 433 | if self.parsed_options.chat: 434 | if images is None: 435 | data["messages"] = [{"role": "user", "content": prompt}] 436 | else: 437 | image_urls = [] 438 | for image in images: 439 | image_urls.append( 440 | {"type": "image_url", "image_url": {"url": image}} 441 | ) 442 | data["messages"] = [ 443 | { 444 | "role": "user", 445 | "content": [{"type": "text", "text": prompt}, *image_urls], 446 | } 447 | ] 448 | else: 449 | data["prompt"] = prompt 450 | if images is not None: 451 | data["images"] = images 452 | else: 453 | assert isinstance(prompt, dict), "prompt must be a dict" 454 | for k, v in prompt.items(): 455 | data[k] = v 456 | 457 | return data 458 | 459 | def parse_output_json(self, data): 460 | if self.parsed_options.embeddings: 461 | return ChunkMetadata( 462 | text=data["data"][0]["embedding"], 463 | logprob_tokens=None, 464 | usage_tokens=None, 465 | prompt_usage_tokens=None, 466 | ) 467 | usage = data.get("usage", None) 468 | 469 | assert len(data["choices"]) == 1, f"Too many choices {len(data['choices'])}" 470 | choice = data["choices"][0] 471 | if self.parsed_options.chat: 472 | if 
self.parsed_options.stream: 473 | block = choice["delta"] 474 | else: 475 | block = choice["message"] 476 | text = (block.get("reasoning", "") or "") + (block.get("reasoning_content", "") or "") + (block.get("content", "") or "") 477 | else: 478 | text = choice["text"] 479 | 480 | logprobs = choice.get("logprobs", None) 481 | if logprobs and "tokens" in logprobs: 482 | logprob_tokens = len(logprobs["tokens"]) 483 | else: 484 | logprob_tokens = None 485 | 486 | return ChunkMetadata( 487 | text=text, 488 | logprob_tokens=logprob_tokens, 489 | usage_tokens=usage["completion_tokens"] if usage else None, 490 | prompt_usage_tokens=usage.get("prompt_tokens", None) if usage else None, 491 | ) 492 | 493 | 494 | class FireworksProvider(OpenAIProvider): 495 | def format_payload(self, prompt, max_tokens, images): 496 | data = super().format_payload(prompt, max_tokens, images) 497 | if not self.parsed_options.embeddings: 498 | data["min_tokens"] = max_tokens 499 | data["prompt_cache_max_len"] = self.parsed_options.prompt_cache_max_len 500 | return data 501 | 502 | 503 | class VllmProvider(OpenAIProvider): 504 | def format_payload(self, prompt, max_tokens, images): 505 | data = super().format_payload(prompt, max_tokens, images) 506 | data["ignore_eos"] = True 507 | return data 508 | 509 | 510 | class TogetherProvider(OpenAIProvider): 511 | def get_url(self): 512 | assert not self.parsed_options.chat, "Chat is not supported" 513 | return "/" 514 | 515 | def format_payload(self, prompt, max_tokens, images): 516 | data = super().format_payload(prompt, max_tokens, images) 517 | data["ignore_eos"] = True 518 | data["stream_tokens"] = data.pop("stream") 519 | return data 520 | 521 | def parse_output_json(self, data): 522 | if not self.parsed_options.stream: 523 | data = data["output"] 524 | return super().parse_output_json(data) 525 | 526 | 527 | class TgiProvider(BaseProvider): 528 | DEFAULT_MODEL_NAME = "" 529 | 530 | def get_url(self): 531 | assert self.parsed_options.n == 1, "n > 1 is not supported" 532 | assert not self.parsed_options.chat, "Chat is not supported" 533 | stream_suffix = "_stream" if self.parsed_options.stream else "" 534 | return f"/generate{stream_suffix}" 535 | 536 | def format_payload(self, prompt, max_tokens, images): 537 | assert isinstance(prompt, str), "prompt must be a string" 538 | assert images is None, "images are not supported" 539 | data = { 540 | "inputs": prompt, 541 | "parameters": { 542 | "max_new_tokens": max_tokens, 543 | "temperature": self.parsed_options.temperature, 544 | "top_n_tokens": self.parsed_options.logprobs, 545 | "details": self.parsed_options.logprobs is not None, 546 | }, 547 | } 548 | return data 549 | 550 | def parse_output_json(self, data): 551 | if "token" in data: 552 | # streaming chunk 553 | return ChunkMetadata( 554 | text=data["token"]["text"], 555 | logprob_tokens=1, 556 | usage_tokens=None, 557 | prompt_usage_tokens=None, 558 | ) 559 | else: 560 | # non-streaming response 561 | return ChunkMetadata( 562 | text=data["generated_text"], 563 | logprob_tokens=( 564 | len(data["details"]["tokens"]) if "details" in data else None 565 | ), 566 | usage_tokens=( 567 | data["details"]["generated_tokens"] if "details" in data else None 568 | ), 569 | prompt_usage_tokens=None, 570 | ) 571 | 572 | 573 | PROVIDER_CLASS_MAP = { 574 | "fireworks": FireworksProvider, 575 | "vllm": VllmProvider, 576 | "sglang": VllmProvider, 577 | "openai": OpenAIProvider, 578 | "together": TogetherProvider, 579 | "tgi": TgiProvider, 580 | } 581 | 582 | 583 | def 
_load_curl_like_data(text): 584 | """ 585 | Either use the passed string or load from a file if the string is `@filename` 586 | """ 587 | if text.startswith("@"): 588 | try: 589 | if text.endswith(".jsonl"): 590 | with open(text[1:], "r") as f: 591 | return [json.loads(line) for line in f] 592 | else: 593 | with open(text[1:], "r") as f: 594 | return f.read() 595 | except Exception as e: 596 | raise ValueError(f"Failed to read file {text[1:]}") from e 597 | else: 598 | return text 599 | 600 | 601 | class LLMUser(HttpUser): 602 | # no wait time, so every user creates a continuous load, sending requests as quickly as possible 603 | 604 | def on_start(self): 605 | try: 606 | self._on_start() 607 | except Exception as e: 608 | print(f"Failed to initialize: {repr(e)}") 609 | print(traceback.format_exc()) 610 | sys.exit(1) 611 | 612 | def _guess_provider(self): 613 | self.model = self.environment.parsed_options.model 614 | self.provider = self.environment.parsed_options.provider 615 | # guess based on URL 616 | if self.provider is None: 617 | if "fireworks.ai" in self.host: 618 | self.provider = "fireworks" 619 | elif "together" in self.host: 620 | self.provider = "together" 621 | elif "openai" in self.host: 622 | self.provider = "openai" 623 | 624 | if ( 625 | self.model is None 626 | and self.provider is not None 627 | and PROVIDER_CLASS_MAP[self.provider].DEFAULT_MODEL_NAME is not None 628 | ): 629 | self.model = PROVIDER_CLASS_MAP[self.provider].DEFAULT_MODEL_NAME 630 | 631 | if self.model and self.provider: 632 | return 633 | 634 | # vllm doesn't support /model/ endpoint, so iterate over all models 635 | try: 636 | resp = self.client.get("/v1/models") 637 | resp.raise_for_status() 638 | resp = resp.json() 639 | except Exception as e: 640 | raise ValueError( 641 | "Argument --model or --provider was not specified and /v1/models failed" 642 | ) from e 643 | 644 | models = resp["data"] 645 | assert len(models) > 0, "No models found in /v1/models" 646 | owned_by = None 647 | # pick the first model 648 | for m in models: 649 | if self.model is None or m["id"] == self.model: 650 | self.model = m["id"] 651 | owned_by = m["owned_by"] 652 | break 653 | if self.provider is None: 654 | if not owned_by: 655 | raise ValueError( 656 | f"Model {self.model} not found in /v1/models. 
Specify --provider explicitly" 657 |                 ) 658 |             if owned_by in PROVIDER_CLASS_MAP: 659 |                 self.provider = owned_by 660 |             else: 661 |                 raise ValueError( 662 |                     f"Can't detect provider, specify it explicitly with --provider, owned_by={owned_by}" 663 |                 ) 664 | 665 |     def _on_start(self): 666 |         self.client.headers["Content-Type"] = "application/json" 667 |         if self.environment.parsed_options.api_key: 668 |             self.client.headers["Authorization"] = ( 669 |                 "Bearer " + self.environment.parsed_options.api_key 670 |             ) 671 |         if self.environment.parsed_options.header: 672 |             for header in self.environment.parsed_options.header: 673 |                 key, val = header.split(":", 1) 674 |                 self.client.headers[key] = val 675 |         self._guess_provider() 676 |         print(f" Provider {self.provider} using model {self.model} ".center(80, "*")) 677 |         self.provider_formatter = PROVIDER_CLASS_MAP[self.provider]( 678 |             self.model, self.environment.parsed_options 679 |         ) 680 | 681 |         self.stream = self.environment.parsed_options.stream 682 | 683 |         image_resolutions = ( 684 |             self.environment.parsed_options.prompt_images_with_resolutions 685 |         ) 686 |         self.prompt_images = None 687 |         if image_resolutions: 688 |             if not self.environment.parsed_options.chat: 689 |                 # Using the regular /completions endpoint, each model has its own image placeholder 690 |                 # e.g., <|image|> for Phi, <|image_pad|> for Qwen, <image> for Llava 691 |                 # So using /completions endpoint requires a bit more work to support this 692 |                 raise AssertionError( 693 |                     "--prompt-images-with-resolutions is only supported with --chat mode." 694 |                 ) 695 |             self.prompt_images = [ 696 |                 self._create_base64_image(width, height) 697 |                 for width, height in image_resolutions 698 |             ] 699 | 700 |         self.max_tokens_sampler = LengthSampler( 701 |             distribution=self.environment.parsed_options.max_tokens_distribution, 702 |             mean=self.environment.parsed_options.max_tokens, 703 |             cap=self.environment.parsed_options.max_tokens_cap, 704 |             alpha=self.environment.parsed_options.max_tokens_range, 705 |         ) 706 |         self.temperature = self.environment.parsed_options.temperature 707 | 708 |         logging_params = { 709 |             # TODO: add some server info with git version 710 |             "provider": self.provider, 711 |             "model": self.model, 712 |             "prompt_tokens": self.environment.parsed_options.prompt_tokens,  # might be overwritten based on metric 713 |             "generation_tokens": str(self.max_tokens_sampler), 714 |             "stream": self.stream, 715 |             "temperature": self.temperature, 716 |             "logprobs": self.environment.parsed_options.logprobs, 717 |         } 718 | 719 |         if self.environment.parsed_options.top_k is not None: 720 |             logging_params["top_k"] = self.environment.parsed_options.top_k 721 | 722 |         InitTracker.notify_init(self.environment, logging_params) 723 | 724 |         if self.environment.parsed_options.qps is not None: 725 |             if self.environment.parsed_options.burst: 726 |                 raise ValueError("Burst and QPS modes are mutually exclusive") 727 |             pacer = FixedQPSPacer.instance( 728 |                 self.environment.parsed_options.qps, 729 |                 self.environment.parsed_options.qps_distribution, 730 |             ) 731 |             # it will be called by Locust after each task 732 |             self.wait_time = pacer.wait_time_till_next 733 |             self.wait() 734 |         elif self.environment.parsed_options.burst: 735 |             self.wait_time = partial( 736 |                 constant_pacing(self.environment.parsed_options.burst), self 737 |             ) 738 |         else: 739 |             # introduce initial delay to avoid all users hitting the service at the same time 740 |             time.sleep(random.random()) 741 | 742 |         self.first_done = False 743 | 744 |         dataset = 
DatasetHolder.get_instance(self.environment.parsed_options) 745 |         self.dataset = iter(dataset) 746 | 747 |     def _create_base64_image(self, width, height): 748 |         """Create a blank RGB image with the given dimensions and return it as a base64 data URI.""" 749 |         img = Image.new("RGB", (width, height)) 750 |         buffer = io.BytesIO() 751 |         img.save(buffer, format="JPEG") 752 |         img_str = base64.b64encode(buffer.getvalue()).decode("utf-8") 753 |         return f"data:image/jpeg;base64,{img_str}" 754 | 755 |     def _get_input(self): 756 |         prompt, prompt_tokens = next(self.dataset) 757 | 758 |         if self.prompt_images: 759 |             images = self.prompt_images 760 |             prompt_images_positioning = ( 761 |                 self.environment.parsed_options.prompt_images_positioning 762 |             ) 763 |             prompt = self.insert_image_placeholders( 764 |                 prompt, len(images), prompt_images_positioning 765 |             ) 766 |         else: 767 |             images = None 768 | 769 |         return prompt, prompt_tokens, images 770 | 771 |     def insert_image_placeholders(self, prompt, num_images, prompt_images_positioning): 772 |         if num_images <= 0: 773 |             return prompt 774 | 775 |         prompt_length = len(prompt) 776 |         if prompt_length == 0: 777 |             return PROMPT_CHAT_IMAGE_PLACEHOLDER * num_images 778 | 779 |         if prompt_images_positioning == "space-evenly": 780 |             """ 781 |             Insert placeholders evenly throughout the prompt. 782 |             E.g., for 3 images, a prompt "abcdefgh" is changed to "ab<image>cd<image>ef<image>gh" 783 | 784 |             Images are spaced out evenly based on character length. 785 |             This may result in a few extra tokens if the image tags are placed in the middle of tokens. 786 |             But shouldn't affect results meaningfully. 787 |             """ 788 |             # we need num_images + 1 segments to place between tags 789 |             segment_length = prompt_length / (num_images + 1) 790 |             result = "" 791 |             for i in range(num_images): 792 |                 # Move a sliding window of segment_length across the prompt 793 |                 # Truncating to ensure all segments are non-overlapping 794 |                 # If segment_end is truncated, that character will be included in the next segment 795 |                 segment_start = int(i * segment_length) 796 |                 segment_end = int((i + 1) * segment_length) 797 |                 result += ( 798 |                     prompt[segment_start:segment_end] + PROMPT_CHAT_IMAGE_PLACEHOLDER 799 |                 ) 800 | 801 |             # Final segment 802 |             result += prompt[int(num_images * segment_length) :] 803 | 804 |             return result 805 |         elif prompt_images_positioning == "end": 806 |             return prompt + PROMPT_CHAT_IMAGE_PLACEHOLDER * num_images 807 |         else: 808 |             raise ValueError( 809 |                 f"Invalid prompt images positioning: {prompt_images_positioning}" 810 |             ) 811 | 812 |     @task 813 |     def generate_text(self): 814 |         max_tokens = self.max_tokens_sampler.sample() 815 |         prompt, prompt_usage_tokens, images = self._get_input() 816 |         data = self.provider_formatter.format_payload(prompt, max_tokens, images) 817 |         t_start = time.perf_counter() 818 | 819 |         with self.client.post( 820 |             self.provider_formatter.get_url(), 821 |             data=json.dumps(data), 822 |             stream=True, 823 |             catch_response=True, 824 |         ) as response: 825 |             combined_text = "" 826 |             done_empty_chunk = False 827 |             done = False 828 |             total_usage_tokens = None 829 |             total_logprob_tokens = None 830 |             try: 831 |                 response.raise_for_status() 832 |             except Exception as e: 833 |                 raise RuntimeError(f"Error in response: {response.text}") from e 834 |             t_first_token = None 835 |             for chunk in response.iter_lines(delimiter=b"\n\n"): 836 |                 if len(chunk) == 0: 837 |                     continue  # some providers send empty lines between data chunks 838 |                 if done: 839 |                     if chunk != b"data: [DONE]": 840 |                         print(f"WARNING: Received more chunks 
after [DONE]: {chunk}") 841 |                 try: 842 |                     now = time.perf_counter() 843 |                     if self.provider_formatter.parsed_options.embeddings: 844 |                         t_first_token = now 845 |                         if self.environment.parsed_options.show_response: 846 |                             out = self.provider_formatter.parse_output_json(orjson.loads(chunk)) 847 |                             combined_text = out.text 848 |                         break 849 |                     if self.stream: 850 |                         assert chunk.startswith( 851 |                             b"data:" 852 |                         ), f"Unexpected chunk not starting with 'data': {chunk}" 853 |                         chunk = chunk[len(b"data:") :] 854 |                         if chunk.strip() == b"[DONE]": 855 |                             done = True 856 |                             continue 857 |                     if done_empty_chunk: 858 |                         print(f"WARNING: Received more chunks after the trailing last chunk: {chunk}") 859 |                     data = orjson.loads(chunk) 860 |                     if not data.get("choices"): 861 |                         done_empty_chunk = True 862 |                         continue 863 |                     out = self.provider_formatter.parse_output_json(data) 864 |                     if out.usage_tokens: 865 |                         total_usage_tokens = out.usage_tokens 866 |                     if out.prompt_usage_tokens: 867 |                         prompt_usage_tokens = out.prompt_usage_tokens 868 |                     combined_text += out.text 869 | 870 |                     # some providers (SGLang) send an empty chunk first skewing the TTFT 871 |                     if combined_text and t_first_token is None: 872 |                         t_first_token = now 873 | 874 |                     if out.logprob_tokens: 875 |                         total_logprob_tokens = ( 876 |                             total_logprob_tokens or 0 877 |                         ) + out.logprob_tokens 878 |                 except Exception as e: 879 |                     print(f"Failed to parse response: {chunk} with error {repr(e)}") 880 |                     response.failure(e) 881 |                     return 882 |             assert t_first_token is not None, "empty response received" 883 |             if ( 884 |                 (total_logprob_tokens is not None) 885 |                 and (total_usage_tokens is not None) 886 |                 and total_logprob_tokens != total_usage_tokens 887 |             ): 888 |                 print( 889 |                     f"WARNING: usage_tokens {total_usage_tokens} != logprob_tokens {total_logprob_tokens}" 890 |                 ) 891 |             if total_logprob_tokens is not None: 892 |                 num_tokens = total_logprob_tokens 893 |             else: 894 |                 num_tokens = total_usage_tokens 895 | 896 |             num_tokens = num_tokens or 0 897 |             num_chars = len(combined_text) 898 |             now = time.perf_counter() 899 |             dur_total = now - t_start 900 |             dur_generation = now - t_first_token 901 |             dur_first_token = t_first_token - t_start 902 |             print( 903 |                 f"Response received: total {dur_total*1000:.2f} ms, first token {dur_first_token*1000:.2f} ms, {num_chars} chars, {num_tokens} tokens" 904 |             ) 905 |             if self.environment.parsed_options.show_response: 906 |                 print("---") 907 |                 print(combined_text) 908 |                 print("---") 909 |             if num_chars: 910 |                 add_custom_metric( 911 |                     "latency_per_char", dur_generation / num_chars * 1000, num_chars 912 |                 ) 913 |             if self.stream: 914 |                 add_custom_metric("time_to_first_token", dur_first_token * 1000) 915 |             add_custom_metric("total_latency", dur_total * 1000) 916 |             if num_tokens: 917 |                 if num_tokens != max_tokens: 918 |                     print( 919 |                         f"WARNING: wrong number of tokens: {num_tokens}, expected {max_tokens}" 920 |                     ) 921 |                 add_custom_metric("num_tokens", num_tokens) 922 |                 add_custom_metric( 923 |                     "latency_per_token", dur_generation / num_tokens * 1000, num_tokens 924 |                 ) 925 |                 add_custom_metric( 926 |                     "overall_latency_per_token", 927 |                     dur_total / num_tokens * 1000, 928 |                     num_tokens, 929 |                 ) 930 | 931 |             if not self.provider_formatter.parsed_options.embeddings: 932 |                 prompt_tokens = prompt_usage_tokens or getattr(self, "prompt_tokenizer_tokens", None)  # tolerate the attribute being unset (e.g. JSONL dataset without server-side usage info) 933 |                 if prompt_tokens: 934 |                     add_custom_metric("prompt_tokens", prompt_tokens) 935 | 936 |         if not self.first_done: 937 |             self.first_done = True 938 |             InitTracker.notify_first_request() 939 | 940 | 941 | def parse_resolution(res_str): 942 |     """Parse a 
resolution string like '3084x1080' into a tuple of integers (width, height).""" 943 |     try: 944 |         width, height = map(int, res_str.split("x")) 945 |         return (width, height) 946 |     except (ValueError, AttributeError): 947 |         raise argparse.ArgumentTypeError( 948 |             f"Invalid resolution format: {res_str}. Expected format: WIDTHxHEIGHT (e.g. 1024x1024)" 949 |         ) 950 | 951 | 952 | @events.init_command_line_parser.add_listener 953 | def init_parser(parser): 954 |     parser.add_argument( 955 |         "--provider", 956 |         choices=list(PROVIDER_CLASS_MAP.keys()), 957 |         type=str, 958 |         help="Which flavor of API to use. If not specified, we'll try to guess based on the URL and /v1/models output", 959 |     ) 960 |     parser.add_argument( 961 |         "-d", 962 |         "--dataset", 963 |         env_var="DATASET", 964 |         type=str, 965 |         help="Either 'limerics' or '@' followed by a path to a JSONL file (e.g. @data.jsonl)", 966 |         default="limerics", 967 |     ) 968 |     parser.add_argument( 969 |         "-m", 970 |         "--model", 971 |         env_var="MODEL", 972 |         type=str, 973 |         help="The model to use for generating text. If not specified we will pick the first model from the service as returned by /v1/models", 974 |     ) 975 |     parser.add_argument( 976 |         "--tokenizer", 977 |         env_var="TOKENIZER", 978 |         type=str, 979 |         help="Specify the HF tokenizer to use for validating the output of the model. Required for the 'limerics' dataset; otherwise optional, as we rely on the 'usage' or 'logprobs' fields to get token count information", 980 |     ) 981 |     parser.add_argument( 982 |         "--chat", 983 |         action=argparse.BooleanOptionalAction, 984 |         default=True, 985 |         help="Use /v1/chat/completions API", 986 |     ) 987 |     parser.add_argument( 988 |         "--embeddings", 989 |         action=argparse.BooleanOptionalAction, 990 |         default=False, 991 |         help="Use /v1/embeddings API", 992 |     ) 993 |     parser.add_argument( 994 |         "--return-logits", 995 |         type=int, 996 |         nargs="*", 997 |         default=None, 998 |         help="For embeddings: return per-token or per-class logits. Provide specific token/class indices, or an empty list for all. Only works with certain models.", 999 |     ) 1000 |     parser.add_argument( 1001 |         "--normalize", 1002 |         action=argparse.BooleanOptionalAction, 1003 |         default=False, 1004 |         help="For embeddings: apply L2 normalization to activations when return_logits is None, or softmax to selected logits when return_logits is provided.", 1005 |     ) 1006 |     parser.add_argument( 1007 |         "-p", 1008 |         "--prompt-tokens", 1009 |         env_var="PROMPT_TOKENS", 1010 |         type=int, 1011 |         default=512, 1012 |         help="Length of the prompt in tokens. Default 512", 1013 |     ) 1014 |     parser.add_argument( 1015 |         "--prompt-images-with-resolutions", 1016 |         type=parse_resolution, 1017 |         nargs="+", 1018 |         default=[], 1019 |         help="Images to add to the prompt for vision models, defined by their resolutions in format WIDTHxHEIGHT. " 1020 |         'For example, "--prompt-images-with-resolutions 3084x1080 1024x1024" will insert 2 images ' 1021 |         "(3084 width x 1080 height and 1024 width x 1024 height) into the prompt. " 1022 |         "Images will be spaced out evenly across the prompt. " 1023 |         "Only supported with --chat mode.", 1024 |     ) 1025 |     parser.add_argument( 1026 |         "--prompt-images-positioning", 1027 |         type=str, 1028 |         choices=["space-evenly", "end"], 1029 |         default="space-evenly", 1030 |         help="How to position the images in the prompt. " 1031 |         "space-evenly: images are spaced out evenly across the prompt. E.g., 3 images in 'abcdefgh' is 'ab<image>cd<image>ef<image>gh'. " 1032 |         "end: images are added to the end of the prompt. 
E.g., 3 images in 'abcdefgh' is 'abcdefgh<image><image><image>'. " 1033 |         "Only relevant with --prompt-images-with-resolutions.", 1034 |     ) 1035 |     parser.add_argument( 1036 |         "-o", 1037 |         "--max-tokens", 1038 |         env_var="MAX_TOKENS", 1039 |         type=int, 1040 |         default=64, 1041 |         help="Max number of tokens to generate. If --max-tokens-distribution is non-constant this is going to be the mean. Defaults to 64", 1042 |     ) 1043 |     parser.add_argument( 1044 |         "--max-tokens-cap", 1045 |         env_var="MAX_TOKENS_CAP", 1046 |         type=int, 1047 |         help="If --max-tokens-distribution is non-constant, this truncates the distribution at the specified limit", 1048 |     ) 1049 |     parser.add_argument( 1050 |         "--max-tokens-distribution", 1051 |         env_var="MAX_TOKENS_DISTRIBUTION", 1052 |         type=str, 1053 |         choices=["constant", "uniform", "exponential", "normal"], 1054 |         default="constant", 1055 |         help="How to sample `max-tokens` on each request", 1056 |     ) 1057 |     parser.add_argument( 1058 |         "--max-tokens-range", 1059 |         env_var="MAX_TOKENS_RANGE", 1060 |         type=float, 1061 |         default=0.3, 1062 |         help="Specifies the width of the distribution. The specified value `alpha` is relative to `max-tokens`. For the uniform distribution we sample from [max_tokens - max_tokens * alpha, max_tokens + max_tokens * alpha]. For the normal distribution we sample from `N(max_tokens, max_tokens * alpha)`. Defaults to 0.3", 1063 |     ) 1064 |     parser.add_argument( 1065 |         "--top-k", 1066 |         env_var="TOP_K", 1067 |         type=int, 1068 |         default=None, 1069 |         help="Specifies the top-k sampling parameter.", 1070 |     ) 1071 |     parser.add_argument( 1072 |         "--stream", 1073 |         dest="stream", 1074 |         action=argparse.BooleanOptionalAction, 1075 |         default=True, 1076 |         help="Use the streaming API", 1077 |     ) 1078 |     parser.add_argument( 1079 |         "-k", 1080 |         "--api-key", 1081 |         env_var="API_KEY", 1082 |         help="Auth for the API", 1083 |     ) 1084 |     parser.add_argument( 1085 |         "--temperature", 1086 |         env_var="TEMPERATURE", 1087 |         type=float, 1088 |         default=1.0, 1089 |         help="Temperature parameter for the API", 1090 |     ) 1091 |     parser.add_argument( 1092 |         "--logprobs", 1093 |         type=int, 1094 |         default=None, 1095 |         help="Whether to ask for logprobs; it makes things slower for some providers, but it's necessary for token counts in streaming (unless it's the Fireworks API, which returns usage in streaming mode)", 1096 |     ) 1097 |     parser.add_argument( 1098 |         "--summary-file", 1099 |         type=str, 1100 |         help="Append the line with the summary to the specified CSV file. Useful for generating a spreadsheet with perf sweep results. If the file doesn't exist, writes out the header first", 1101 |     ) 1102 |     parser.add_argument( 1103 |         "--qps", 1104 |         type=float, 1105 |         default=None, 1106 |         help="Enables 'fixed QPS' mode where requests are issued at the specified rate regardless of how long the processing takes. In this case --users and --spawn-rate need to be set to a sufficiently high value (e.g. 100)", 1107 |     ) 1108 |     parser.add_argument( 1109 |         "--qps-distribution", 1110 |         type=str, 1111 |         choices=["constant", "uniform", "exponential"], 1112 |         default="constant", 1113 |         help="Must be used with --qps. Specifies how to space out requests: equally ('constant') or by sampling wait times from a distribution ('uniform' or 'exponential'). Expected QPS is going to match --qps", 1114 |     ) 1115 |     parser.add_argument( 1116 |         "--burst", 1117 |         type=float, 1118 |         default=None, 1119 |         help="Makes requests arrive in bursts every specified number of seconds. Note that the burst period has to be longer than the maximum response time. 
Size of the burst is controlled by --users. The spawn rate -r is best set to a high value", 1120 | ) 1121 | parser.add_argument( 1122 | "--show-response", 1123 | action=argparse.BooleanOptionalAction, 1124 | default=False, 1125 | help="Print the result of each generation", 1126 | ) 1127 | parser.add_argument( 1128 | "-pcml", 1129 | "--prompt-cache-max-len", 1130 | env_var="PROMPT_CACHE_MAX_LEN", 1131 | type=int, 1132 | default=0, 1133 | help="Maximum length of the prompt cache to use. Defaults to 0 (no caching).", 1134 | ) 1135 | parser.add_argument( 1136 | "--header", 1137 | action="append", 1138 | default=[], 1139 | help="Arbitrary headers to add to the inference request. Can be used multiple times. For example, --header header1:value1 --header header2:value2", 1140 | ) 1141 | parser.add_argument( 1142 | "-n", 1143 | "--n", 1144 | default=1, 1145 | type=int, 1146 | help="How many sequences to generate (makes sense to use with non-zero temperature).", 1147 | ) 1148 | 1149 | 1150 | @events.quitting.add_listener 1151 | def _(environment, **kw): 1152 | total_latency = environment.stats.entries[("total_latency", "METRIC")] 1153 | if environment.stats.total.num_failures > 0 or total_latency.num_requests == 0: 1154 | print("Test failed due to failed requests") 1155 | environment.process_exit_code = 1 1156 | return 1157 | 1158 | entries = copy.copy(InitTracker.logging_params) 1159 | if environment.parsed_options.qps is not None: 1160 | entries["concurrency"] = ( 1161 | f"QPS {environment.parsed_options.qps} {environment.parsed_options.qps_distribution}" 1162 | ) 1163 | else: 1164 | entries["concurrency"] = InitTracker.users 1165 | for metric_name in [ 1166 | "time_to_first_token", 1167 | "latency_per_token", 1168 | "overall_latency_per_token", 1169 | "num_tokens", 1170 | "total_latency", 1171 | "prompt_tokens", # might overwrite the static value based on server side tokenization 1172 | ]: 1173 | entries[metric_name] = environment.stats.entries[ 1174 | (metric_name, "METRIC") 1175 | ].avg_response_time 1176 | if not environment.parsed_options.stream: 1177 | # if there's no streaming these metrics are meaningless 1178 | entries["time_to_first_token"] = "" 1179 | entries["latency_per_token"] = "" 1180 | entries["num_requests"] = total_latency.num_requests 1181 | entries["qps"] = total_latency.total_rps 1182 | percentile_to_report = [50, 90, 95, 99, 99.9] 1183 | percentile_metrics = ["time_to_first_token", "total_latency"] 1184 | for percentile_metric in percentile_metrics: 1185 | metrics = environment.stats.entries[percentile_metric, "METRIC"] 1186 | for percentile in percentile_to_report: 1187 | name = f"P{percentile}_{percentile_metric}" 1188 | entries[name] = metrics.get_response_time_percentile(percentile / 100) 1189 | 1190 | pretty_name = lambda s: " ".join([w.capitalize() for w in s.split("_")]) 1191 | entries = {pretty_name(k): v for k, v in entries.items()} 1192 | 1193 | # print in the final event handler to make sure our output is the last one 1194 | @events.quit.add_listener 1195 | def exit_printer(**kw): 1196 | max_width = max(len(k) for k in entries.keys()) 1197 | print(" Summary ".center(80, "=")) 1198 | for k, v in entries.items(): 1199 | print(f"{k:<{max_width}}: {v}") 1200 | print("=" * 80) 1201 | 1202 | if environment.parsed_options.summary_file: 1203 | with open(environment.parsed_options.summary_file, "a") as f: 1204 | writer = csv.DictWriter(f, fieldnames=entries.keys()) 1205 | if f.tell() == 0: 1206 | writer.writeheader() 1207 | writer.writerow(entries) 1208 | 
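# A minimal example invocation (a sketch; the host, model, and parameter values are
# illustrative, matching the notebook examples above):
#   locust -f load_test.py -H https://api.fireworks.ai/inference \
#       --provider fireworks -m accounts/fireworks/models/llama-v3p2-3b-instruct \
#       -k $FIREWORKS_API_KEY -t 60s -u 100 -r 100 --qps 5 \
#       --summary-file results.csv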
--------------------------------------------------------------------------------