├── LICENSE ├── README.md ├── results ├── 3090-long-context.json └── 3090.json ├── run_benchmarks.py └── vllm_benchmark.py /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 
39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # vLLM Benchmark 2 | 3 | This repository contains scripts for benchmarking the performance of large language models (LLMs) served using vLLM. It's designed to test the scalability and performance of LLM deployments under various concurrency levels. 4 | 5 | ## Features 6 | 7 | - Benchmark LLMs with different concurrency levels 8 | - Measure key performance metrics: 9 | - Requests per second 10 | - Latency 11 | - Tokens per second 12 | - Time to first token 13 | - Easy to run with customizable parameters 14 | - Generates JSON output for further analysis or visualization 15 | 16 | ## Requirements 17 | 18 | - Python 3.7+ 19 | - `openai` Python package 20 | - `numpy` Python package 21 | 22 | ## Installation 23 | 24 | 1. Clone this repository: 25 | ``` 26 | git clone https://github.com/yourusername/vllm-benchmark.git 27 | cd vllm-benchmark 28 | ``` 29 | 30 | 2. 
Install the required packages: 31 | ``` 32 | pip install openai numpy 33 | ``` 34 | 35 | ## Usage 36 | 37 | ### Single Benchmark Run 38 | 39 | To run a single benchmark: 40 | 41 | ``` 42 | python vllm_benchmark.py --num_requests 100 --concurrency 10 --output_tokens 100 --vllm_url "http://localhost:8000/v1" --api_key "your-api-key" 43 | ``` 44 | 45 | Parameters: 46 | - `num_requests`: Total number of requests to make 47 | - `concurrency`: Number of concurrent requests 48 | - `output_tokens`: Maximum number of tokens to generate per request 49 | - `vllm_url`: URL of the vLLM server 50 | - `api_key`: API key for the vLLM server 51 | - `request_timeout`: (Optional) Timeout for each request in seconds (default: 30) 52 | 53 | ### Multiple Benchmark Runs 54 | 55 | To run multiple benchmarks with different concurrency levels: 56 | 57 | ``` 58 | python run_benchmarks.py --vllm_url "http://localhost:8000/v1" --api_key "your-api-key" 59 | ``` 60 | 61 | This script runs benchmarks at concurrency levels of 1, 10, 50, and 100 (with 10, 100, 500, and 1000 requests respectively, each limited to 100 output tokens) and saves the results to `benchmark_results.json`. Pass `--use_long_context` to use long-context prompt pairs instead of short prompts. 62 | 63 | ## Output 64 | 65 | The benchmark results are saved in JSON format, containing detailed metrics for each run, including: 66 | 67 | - Total requests and successful requests 68 | - Requests per second 69 | - Total output tokens 70 | - Latency (average, p50, p95, p99) 71 | - Tokens per second (average, p50, p95, p99) 72 | - Time to first token (average, p50, p95, p99) 73 | 74 | ## Results 75 | 76 | Please see the results directory for benchmarks on [Backprop](https://backprop.co) instances. 77 | 78 | ## Contributing 79 | 80 | Contributions to improve the benchmarking scripts or add new features are welcome! Please feel free to submit pull requests or open issues for any bugs or feature requests. 81 | 82 | ## License 83 | 84 | This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details. 
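The JSON output described in the Output section can be post-processed with a few lines of Python. The sketch below is not part of this repository: the helper names `aggregate_throughput`, `summarize`, and `load_results` are hypothetical, and it assumes only the field names that appear in the files under `results/` (`concurrency`, `total_time`, `total_output_tokens`, `requests_per_second`, `latency`). The two sample runs are excerpted from `results/3090.json`.

```python
import json


def aggregate_throughput(run):
    """Output tokens per wall-clock second, summed over all requests in one run."""
    return run["total_output_tokens"] / run["total_time"]


def summarize(results):
    """Return (concurrency, requests/s, aggregate tokens/s, p99 latency) per run."""
    rows = []
    for run in sorted(results, key=lambda r: r["concurrency"]):
        rows.append((
            run["concurrency"],
            run["requests_per_second"],
            aggregate_throughput(run),
            run["latency"]["p99"],
        ))
    return rows


def load_results(path):
    """Load a results file such as benchmark_results.json (a JSON list of runs)."""
    with open(path) as f:
        return json.load(f)


# Two runs excerpted from results/3090.json (only the fields used above).
SAMPLE = [
    {
        "concurrency": 100,
        "total_time": 73.57148337364197,
        "requests_per_second": 13.592222885073216,
        "total_output_tokens": 100000,
        "latency": {"p99": 7.760984728336334},
    },
    {
        "concurrency": 1,
        "total_time": 22.17478585243225,
        "requests_per_second": 0.4509626413778037,
        "total_output_tokens": 1000,
        "latency": {"p99": 2.2539733266830444},
    },
]

if __name__ == "__main__":
    for concurrency, rps, tps, p99 in summarize(SAMPLE):
        print(f"concurrency={concurrency:>3}  req/s={rps:6.2f}  "
              f"tokens/s={tps:8.1f}  p99 latency={p99:5.2f}s")
```

Note that the `tokens_per_second` block in the results files appears to be a per-request figure, so it falls as concurrency rises, whereas the aggregate figure computed here rises. To summarize a full run, replace `SAMPLE` with `load_results("benchmark_results.json")`.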
85 | -------------------------------------------------------------------------------- /results/3090-long-context.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "total_requests": 10, 4 | "successful_requests": 10, 5 | "concurrency": 1, 6 | "request_timeout": 30, 7 | "max_output_tokens": 100, 8 | "use_long_context": true, 9 | "total_time": 22.5137779712677, 10 | "requests_per_second": 0.44417245354209745, 11 | "total_output_tokens": 988, 12 | "latency": { 13 | "average": 2.2510788202285767, 14 | "p50": 2.278542995452881, 15 | "p95": 2.318180203437805, 16 | "p99": 2.320843529701233 17 | }, 18 | "tokens_per_second": { 19 | "average": 43.89558127108431, 20 | "p50": 43.81966578571083, 21 | "p95": 43.13739372149801, 22 | "p99": 43.087817924812605 23 | }, 24 | "time_to_first_token": { 25 | "average": 0.08543474674224853, 26 | "p50": 0.08408665657043457, 27 | "p95": 0.10620640516281124, 28 | "p99": 0.12021474599838257 29 | } 30 | }, 31 | { 32 | "total_requests": 100, 33 | "successful_requests": 100, 34 | "concurrency": 10, 35 | "request_timeout": 30, 36 | "max_output_tokens": 100, 37 | "use_long_context": true, 38 | "total_time": 32.27758765220642, 39 | "requests_per_second": 3.0981249614285917, 40 | "total_output_tokens": 9504, 41 | "latency": { 42 | "average": 3.1147828197479246, 43 | "p50": 3.259150743484497, 44 | "p95": 3.5111179471015928, 45 | "p99": 3.570436794757843 46 | }, 47 | "tokens_per_second": { 48 | "average": 30.564971179207532, 49 | "p50": 30.449958784376477, 50 | "p95": 28.06449882552865, 51 | "p99": 26.943581331477713 52 | }, 53 | "time_to_first_token": { 54 | "average": 0.18024307489395142, 55 | "p50": 0.15599143505096436, 56 | "p95": 0.393571352958679, 57 | "p99": 0.5766693615913391 58 | } 59 | }, 60 | { 61 | "total_requests": 500, 62 | "successful_requests": 500, 63 | "concurrency": 50, 64 | "request_timeout": 30, 65 | "max_output_tokens": 100, 66 | "use_long_context": true, 67 | "total_time": 
71.27928352355957, 68 | "requests_per_second": 7.0146608563305435, 69 | "total_output_tokens": 48186, 70 | "latency": { 71 | "average": 6.9842355442047115, 72 | "p50": 7.309303164482117, 73 | "p95": 7.993396949768065, 74 | "p99": 8.79437571287155 75 | }, 76 | "tokens_per_second": { 77 | "average": 13.95278024236097, 78 | "p50": 13.636708699969198, 79 | "p95": 12.126518787879712, 80 | "p99": 11.23631729586349 81 | }, 82 | "time_to_first_token": { 83 | "average": 0.3635313196182251, 84 | "p50": 0.24643683433532715, 85 | "p95": 1.49878112077713, 86 | "p99": 2.513434214591979 87 | } 88 | }, 89 | { 90 | "total_requests": 1000, 91 | "successful_requests": 1000, 92 | "concurrency": 100, 93 | "request_timeout": 30, 94 | "max_output_tokens": 100, 95 | "use_long_context": true, 96 | "total_time": 121.41526246070862, 97 | "requests_per_second": 8.236196831708959, 98 | "total_output_tokens": 96276, 99 | "latency": { 100 | "average": 11.90893905711174, 101 | "p50": 12.352873802185059, 102 | "p95": 13.750476312637321, 103 | "p99": 16.92872543096542 104 | }, 105 | "tokens_per_second": { 106 | "average": 8.239526326841927, 107 | "p50": 8.030530719440774, 108 | "p95": 6.904625438889649, 109 | "p99": 5.848174162022502 110 | }, 111 | "time_to_first_token": { 112 | "average": 0.6344853692054748, 113 | "p50": 0.278128981590271, 114 | "p95": 3.2009871602058366, 115 | "p99": 5.930156297683715 116 | } 117 | } 118 | ] 119 | -------------------------------------------------------------------------------- /results/3090.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "total_requests": 10, 4 | "successful_requests": 10, 5 | "concurrency": 1, 6 | "request_timeout": 30, 7 | "max_output_tokens": 100, 8 | "total_time": 22.17478585243225, 9 | "requests_per_second": 0.4509626413778037, 10 | "total_output_tokens": 1000, 11 | "latency": { 12 | "average": 2.2171688079833984, 13 | "p50": 2.2200329303741455, 14 | "p95": 2.2468859434127806, 15 | 
"p99": 2.2539733266830444 16 | }, 17 | "tokens_per_second": { 18 | "average": 45.10770844776175, 19 | "p50": 45.04457357732862, 20 | "p95": 44.50688834968052, 21 | "p99": 44.366376542710306 22 | }, 23 | "time_to_first_token": { 24 | "average": 0.04618570804595947, 25 | "p50": 0.042708396911621094, 26 | "p95": 0.06267501115798947, 27 | "p99": 0.07317304849624634 28 | } 29 | }, 30 | { 31 | "total_requests": 100, 32 | "successful_requests": 100, 33 | "concurrency": 10, 34 | "request_timeout": 30, 35 | "max_output_tokens": 100, 36 | "total_time": 28.231784105300903, 37 | "requests_per_second": 3.542106996391476, 38 | "total_output_tokens": 10000, 39 | "latency": { 40 | "average": 2.8115273785591124, 41 | "p50": 2.8183696269989014, 42 | "p95": 2.998655843734741, 43 | "p99": 3.0170435428619387 44 | }, 45 | "tokens_per_second": { 46 | "average": 35.60375380525657, 47 | "p50": 35.48150644359394, 48 | "p95": 33.34827893284108, 49 | "p99": 33.145030424296216 50 | }, 51 | "time_to_first_token": { 52 | "average": 0.10436686754226684, 53 | "p50": 0.08644676208496094, 54 | "p95": 0.2474665403366089, 55 | "p99": 0.26386147260665893 56 | } 57 | }, 58 | { 59 | "total_requests": 500, 60 | "successful_requests": 500, 61 | "concurrency": 50, 62 | "request_timeout": 30, 63 | "max_output_tokens": 100, 64 | "total_time": 48.70970058441162, 65 | "requests_per_second": 10.264895780533973, 66 | "total_output_tokens": 50000, 67 | "latency": { 68 | "average": 4.806947554588318, 69 | "p50": 4.882625460624695, 70 | "p95": 5.036605310440064, 71 | "p99": 5.06476114988327 72 | }, 73 | "tokens_per_second": { 74 | "average": 20.8403180022753, 75 | "p50": 20.480784628450962, 76 | "p95": 19.854642939354846, 77 | "p99": 19.74426770470382 78 | }, 79 | "time_to_first_token": { 80 | "average": 0.3985143051147461, 81 | "p50": 0.42035186290740967, 82 | "p95": 0.5230958819389343, 83 | "p99": 0.6170247411727905 84 | } 85 | }, 86 | { 87 | "total_requests": 1000, 88 | "successful_requests": 1000, 89 | 
"concurrency": 100, 90 | "request_timeout": 30, 91 | "max_output_tokens": 100, 92 | "total_time": 73.57148337364197, 93 | "requests_per_second": 13.592222885073216, 94 | "total_output_tokens": 100000, 95 | "latency": { 96 | "average": 7.219060533285141, 97 | "p50": 7.340110182762146, 98 | "p95": 7.718387544155121, 99 | "p99": 7.760984728336334 100 | }, 101 | "tokens_per_second": { 102 | "average": 13.896121631639664, 103 | "p50": 13.62377369639281, 104 | "p95": 12.956073976478255, 105 | "p99": 12.884962917095779 106 | }, 107 | "time_to_first_token": { 108 | "average": 0.610873586177826, 109 | "p50": 0.6050795316696167, 110 | "p95": 0.9738370656967161, 111 | "p99": 1.0204019618034363 112 | } 113 | } 114 | ] 115 | -------------------------------------------------------------------------------- /run_benchmarks.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import json 3 | import time 4 | import argparse 5 | from vllm_benchmark import run_benchmark 6 | 7 | async def run_all_benchmarks(vllm_url, api_key, use_long_context): 8 | configurations = [ 9 | {"num_requests": 10, "concurrency": 1, "output_tokens": 100}, 10 | {"num_requests": 100, "concurrency": 10, "output_tokens": 100}, 11 | {"num_requests": 500, "concurrency": 50, "output_tokens": 100}, 12 | {"num_requests": 1000, "concurrency": 100, "output_tokens": 100}, 13 | ] 14 | 15 | all_results = [] 16 | 17 | for config in configurations: 18 | print(f"Running benchmark with concurrency {config['concurrency']}...") 19 | results = await run_benchmark(config['num_requests'], config['concurrency'], 30, config['output_tokens'], vllm_url, api_key, use_long_context) 20 | all_results.append(results) 21 | await asyncio.sleep(5) # Non-blocking pause between runs to let the system cool down 22 | 23 | return all_results 24 | 25 | def main(): 26 | parser = argparse.ArgumentParser(description="Run vLLM benchmarks with various configurations") 27 | parser.add_argument("--vllm_url", type=str, 
required=True, help="URL of the vLLM server") 28 | parser.add_argument("--api_key", type=str, required=True, help="API key for vLLM server") 29 | parser.add_argument("--use_long_context", action="store_true", help="Use long context prompt pairs instead of short prompts") 30 | args = parser.parse_args() 31 | 32 | all_results = asyncio.run(run_all_benchmarks(args.vllm_url, args.api_key, args.use_long_context)) 33 | 34 | with open('benchmark_results.json', 'w') as f: 35 | json.dump(all_results, f, indent=2) 36 | 37 | print("Benchmark results saved to benchmark_results.json") 38 | 39 | if __name__ == "__main__": 40 | main() 41 | 42 | -------------------------------------------------------------------------------- /vllm_benchmark.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import time 3 | import numpy as np 4 | from openai import AsyncOpenAI 5 | import logging 6 | import argparse 7 | import json 8 | import random 9 | 10 | # Set up logging 11 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 12 | 13 | SHORT_PROMPTS = [ 14 | "Explain the concept of artificial intelligence in simple terms.", 15 | "What are the main causes of climate change?", 16 | "Describe the process of photosynthesis in plants.", 17 | "How does the human immune system work?", 18 | "What were the main causes of World War II?", 19 | "Explain the theory of relativity in layman's terms.", 20 | "What are the key principles of effective leadership?", 21 | "How does blockchain technology work?", 22 | "What are the main theories about the origin of the universe?", 23 | "Describe the water cycle and its importance for life on Earth.", 24 | "What are the major differences between capitalism and socialism?", 25 | "How does the human brain process and store memories?", 26 | "What are the main challenges in space exploration?", 27 | "Explain the concept of supply and demand in economics.", 28 | ] 29 | 30 | 
LONG_PROMPT_PAIRS = [ 31 | { 32 | "prompt": "Explain the concept of artificial intelligence in simple terms.", 33 | "context": "Artificial intelligence (AI) is a rapidly evolving field of computer science that aims to create intelligent machines that can perform tasks that typically require human intelligence. These tasks include visual perception, speech recognition, decision-making, and language translation. AI systems are designed to learn from experience, adjust to new inputs, and perform human-like tasks. The field of AI encompasses various subfields, including machine learning, neural networks, and deep learning, which have led to significant advancements in areas such as autonomous vehicles, virtual assistants, and recommendation systems." 34 | }, 35 | { 36 | "prompt": "What are the main causes of climate change?", 37 | "context": "Climate change is a complex global phenomenon primarily driven by human activities that release greenhouse gases into the atmosphere. The burning of fossil fuels for energy, deforestation, industrial processes, and agriculture are major contributors to the increased concentration of carbon dioxide and other heat-trapping gases. These gases form a 'blanket' around the Earth, causing the planet to warm at an unprecedented rate. The resulting changes in temperature patterns lead to more frequent and severe weather events, rising sea levels, and disruptions to ecosystems worldwide." 38 | }, 39 | { 40 | "prompt": "Describe the process of photosynthesis in plants.", 41 | "context": "Photosynthesis is a fundamental biological process that allows plants to convert light energy into chemical energy. This process occurs in the chloroplasts of plant cells, specifically in structures called thylakoids. Chlorophyll, the pigment that gives plants their green color, is crucial in capturing light energy. During photosynthesis, plants take in carbon dioxide from the air through tiny pores called stomata and water from the soil through their roots. 
Using light energy, they combine these ingredients to produce glucose and oxygen. This process not only provides energy for the plant but also releases oxygen as a byproduct, which is essential for most life on Earth." 42 | }, 43 | { 44 | "prompt": "How does the human immune system work?", 45 | "context": "The human immune system is a complex network of cells, tissues, and organs that work together to defend the body against harmful pathogens. It consists of two main parts: the innate immune system, which provides a quick, non-specific response to invaders, and the adaptive immune system, which develops targeted defenses against specific pathogens. Key components include white blood cells (such as neutrophils, macrophages, and lymphocytes), antibodies, and the complement system. The immune system has the remarkable ability to distinguish between the body's own cells and foreign invaders, allowing it to target threats while minimizing damage to healthy tissue." 46 | }, 47 | { 48 | "prompt": "What were the main causes of World War II?", 49 | "context": "World War II, which lasted from 1939 to 1945, was one of the deadliest conflicts in human history. Its origins can be traced to several complex factors. The harsh terms of the Treaty of Versailles, which ended World War I, left Germany economically devastated and resentful. This paved the way for the rise of fascism and the Nazi Party under Adolf Hitler. Aggressive expansionist policies by Nazi Germany, Fascist Italy, and Imperial Japan, combined with the policy of appeasement by Western powers, allowed these regimes to gain territory unchecked. The immediate trigger for the war in Europe was Germany's invasion of Poland in September 1939, while the attack on Pearl Harbor in 1941 brought the United States into the conflict." 
    },
    {
        "prompt": "Explain the theory of relativity in layman's terms.",
        "context": "Albert Einstein's theory of relativity, developed in the early 20th century, revolutionized our understanding of space, time, and gravity. It consists of two parts: special relativity and general relativity. Special relativity, introduced in 1905, deals with objects moving at very high speeds. It proposes that the speed of light is constant for all observers and that time and space are not absolute but relative to the observer's motion. This leads to phenomena like time dilation and length contraction. General relativity, published in 1915, extends these ideas to include gravity. Einstein proposed that massive objects curve the fabric of spacetime, and this curvature is what we experience as gravity. These theories have been consistently supported by experimental evidence and have practical applications in technologies like GPS satellites."
    },
    {
        "prompt": "What are the key principles of effective leadership?",
        "context": "Effective leadership is crucial in guiding organizations, teams, and individuals towards achieving their goals. While leadership styles may vary, several key principles are widely recognized as essential for success. These include clear communication, which ensures that vision and expectations are understood by all; integrity, which builds trust and respect; adaptability, allowing leaders to navigate changing environments; empathy, fostering strong relationships and understanding team dynamics; decision-making skills, enabling timely and informed choices; vision, providing direction and inspiration; and the ability to empower others, encouraging growth and innovation within the team. Effective leaders also demonstrate accountability, both for their own actions and those of their team, and continuously seek personal growth and learning opportunities."
    },
    {
        "prompt": "How does blockchain technology work?",
        "context": "Blockchain is a decentralized, distributed ledger technology that underlies cryptocurrencies like Bitcoin, but has potential applications far beyond digital currencies. At its core, a blockchain is a chain of blocks, each containing a list of transactions. Every block is linked to the previous one through cryptographic hashes, creating an immutable record. The key innovation of blockchain is its ability to achieve consensus in a decentralized network without requiring trust in any single entity. This is typically achieved through consensus mechanisms like Proof of Work or Proof of Stake. When a new transaction occurs, it is broadcast to a network of computers (nodes) for validation. Once validated, the transaction is combined with others to create a new block, which is then added to the chain. This process ensures transparency, security, and resistance to tampering, making blockchain suitable for various applications beyond finance, including supply chain management, voting systems, and digital identity verification."
    },
    {
        "prompt": "What are the main theories about the origin of the universe?",
        "context": "The origin of the universe has been a subject of intense scientific inquiry and philosophical debate for centuries. Currently, the most widely accepted scientific theory is the Big Bang model, which proposes that the universe began as an infinitely dense and hot singularity about 13.8 billion years ago, and has been expanding and cooling ever since. This theory is supported by observational evidence such as the cosmic microwave background radiation and the abundance of light elements in the universe. However, questions remain about what happened before the Big Bang and what caused it. Other theories include the Steady State theory, which suggests that the universe has always existed and is constantly creating new matter as it expands, though this theory has fallen out of favor due to lack of supporting evidence. More speculative ideas include the concept of a cyclic universe, where big bangs and big crunches occur in an endless cycle, and the idea of a multiverse, where our universe is just one of many existing universes."
    },
    {
        "prompt": "Describe the water cycle and its importance for life on Earth.",
        "context": "The water cycle, also known as the hydrologic cycle, is the continuous movement of water within the Earth and atmosphere. It is a complex system involving the processes of evaporation, transpiration, condensation, precipitation, and runoff. Water evaporates from the Earth's surface, primarily from oceans, lakes, and rivers, due to solar energy. Plants also release water vapor through transpiration. As this water vapor rises in the atmosphere, it cools and condenses to form clouds. Eventually, it falls back to Earth as precipitation in the form of rain, snow, or hail. Some of this water flows over the land as surface runoff, returning to bodies of water, while some seeps into the ground, replenishing groundwater reserves. This cycle is crucial for life on Earth as it redistributes water around the globe, shapes landscapes through erosion and deposition, regulates global temperatures, and provides fresh water essential for all living organisms. Understanding and protecting the water cycle is vital for managing water resources and addressing environmental challenges like climate change and water scarcity."
    },
    {
        "prompt": "What are the major differences between capitalism and socialism?",
        "context": "Capitalism and socialism are two contrasting economic and political systems that have shaped much of modern history. Capitalism is characterized by private ownership of the means of production, where individuals or corporations own businesses and property. It operates on the principles of free market competition, with prices determined by supply and demand. Profit is a key motivator in capitalist systems, and government intervention is generally limited. In contrast, socialism advocates for collective or governmental ownership and administration of the means of production and distribution of goods. It aims to create a more equitable society by reducing class distinctions and distributing resources according to need rather than ability to pay. In socialist systems, the government plays a much larger role in economic planning and the provision of social services. While pure forms of either system are rare, many countries adopt mixed economies incorporating elements of both capitalism and socialism to varying degrees."
    },
    {
        "prompt": "How does the human brain process and store memories?",
        "context": "The human brain's ability to process and store memories is a complex and fascinating process involving various regions and neural networks. When we experience something, sensory information is first processed in the relevant cortical areas (e.g., visual cortex for sight, auditory cortex for sound). This information is then integrated in the hippocampus, a seahorse-shaped structure crucial for forming new memories. The hippocampus helps bind different aspects of an experience into a cohesive memory and plays a key role in converting short-term memories into long-term ones. Long-term memories are thought to be stored through changes in synaptic connections between neurons across widespread areas of the cortex. This process, known as consolidation, can take days or even years. Different types of memories (e.g., episodic, semantic, procedural) involve different brain regions and processes. The retrieval of memories involves reactivating these neural patterns, which explains why memories can be influenced by our current state and environment. Understanding these processes is crucial for addressing memory-related disorders and developing potential therapies."
    },
    {
        "prompt": "What are the main challenges in space exploration?",
        "context": "Space exploration, while offering immense potential for scientific discovery and technological advancement, faces numerous challenges. One of the primary obstacles is the hostile environment of space itself. The vacuum of space, extreme temperatures, and harmful radiation pose significant risks to both human astronauts and sensitive equipment. Prolonged exposure to microgravity can lead to health issues for astronauts, including muscle atrophy and bone density loss. Logistical challenges are also substantial: the enormous distances involved in space travel require advanced propulsion systems and careful resource management. Launching payloads into orbit remains extremely expensive, limiting the scope and frequency of missions. Communication delays become increasingly problematic for deep space missions, necessitating a high degree of autonomy in spacecraft and rovers. Additionally, space debris orbiting Earth poses a growing threat to satellites and spacecraft. As we look towards long-term goals like establishing bases on the Moon or Mars, we face new challenges in creating sustainable habitats and managing psychological effects on crew members during extended missions. Despite these obstacles, ongoing research and technological innovations continue to push the boundaries of what's possible in space exploration."
    },
    {
        "prompt": "Explain the concept of supply and demand in economics.",
        "context": "Supply and demand is a fundamental concept in economics that describes how the price and quantity of a good or service in a market are determined through the interaction between buyers and sellers. The law of demand states that, all else being equal, as the price of a product increases, the quantity demanded by consumers decreases. This is typically represented by a downward-sloping demand curve. Conversely, the law of supply states that as the price of a product increases, the quantity that producers are willing to supply increases, represented by an upward-sloping supply curve. The point where these two curves intersect is called the equilibrium point, determining the market price and quantity. This model helps explain how prices fluctuate in response to changes in supply or demand. For instance, if demand increases while supply remains constant, prices will rise. If supply increases while demand remains constant, prices will fall. Understanding supply and demand is crucial for analyzing market behavior, predicting price changes, and formulating economic policies."
    },
    {
        "prompt": "What are the key features of a democratic government?",
        "context": "Democratic government is a system of governance based on the principle of rule by the people. While democracies can take various forms, they typically share several key features. First and foremost is the concept of free and fair elections, where citizens have the right to vote for their representatives at regular intervals. This is closely tied to the principle of political pluralism, allowing for multiple political parties and viewpoints to compete for power. The protection of individual rights and civil liberties, such as freedom of speech, press, and assembly, is another crucial aspect of democracy. Separation of powers is often implemented to prevent the concentration of power, typically dividing government into executive, legislative, and judicial branches that provide checks and balances on each other. The rule of law, ensuring that all citizens, including those in power, are equally subject to the law, is fundamental to democratic governance. Transparency and accountability in government operations, often facilitated by a free press and active civil society, help maintain democratic principles. Additionally, many democracies emphasize the protection of minority rights and the concept of majority rule with minority rights, aiming to balance the will of the majority with the fundamental rights of all citizens."
    },
    {
        "prompt": "How do vaccines work to prevent diseases?",
        "context": "Vaccines are one of the most effective tools in preventing infectious diseases, working by harnessing the body's own immune system. When a pathogen such as a virus or bacteria enters the body, the immune system responds by producing antibodies specific to that pathogen. These antibodies help neutralize or destroy the invader. Vaccines mimic this natural process by introducing a harmless form of the pathogen – either weakened, inactivated, or just a part of it – into the body. This stimulates the immune system to produce antibodies and memory cells specific to that pathogen, without causing the actual disease. If the vaccinated person later encounters the real pathogen, their immune system can quickly recognize it and mount a rapid and effective response, often preventing the disease entirely or reducing its severity. Some vaccines require multiple doses or periodic boosters to maintain immunity. The concept of herd immunity is also important in vaccination strategies: when a large portion of a population is vaccinated, it becomes difficult for the pathogen to spread, indirectly protecting those who cannot be vaccinated. Advances in vaccine technology, such as mRNA vaccines, are expanding our ability to rapidly develop vaccines for new threats."
    },
    {
        "prompt": "What are the main theories of human evolution?",
        "context": "Human evolution is the study of the biological and cultural development of our species, Homo sapiens, and our ancestors. The main scientific theory explaining human evolution is based on Darwin's theory of evolution by natural selection, adapted to incorporate modern genetic understanding. This theory proposes that humans evolved from earlier primate species over millions of years. Key ideas include the concept of common ancestry, suggesting that humans share a common ancestor with other primates, particularly the great apes. The 'Out of Africa' theory posits that modern humans originated in Africa and then migrated to other parts of the world. Fossil evidence has revealed a series of intermediate species, such as Australopithecus, Homo habilis, and Homo erectus, showing gradual changes in features like brain size, bipedalism, and tool use. Recent discoveries and genetic studies have complicated this picture, suggesting interbreeding between different human species (like Homo sapiens and Neanderthals) and the possibility of multiple migrations out of Africa. Ongoing research in paleontology, genetics, and archaeology continues to refine our understanding of human evolution, often challenging previous assumptions and revealing the complex history of our species."
    },
    {
        "prompt": "Describe the process of plate tectonics and its effects on Earth.",
        "context": "Plate tectonics is a fundamental theory in geology that explains the large-scale motions of Earth's lithosphere. The theory proposes that Earth's outer layer is divided into several large, rigid plates that move relative to one another. These plates float on the semi-fluid asthenosphere beneath them and are driven by convection currents in the mantle. Plate boundaries are classified into three types: divergent boundaries, where plates move apart and new crust is created; convergent boundaries, where plates collide, leading to subduction or mountain building; and transform boundaries, where plates slide past each other horizontally. The process of plate tectonics has profound effects on Earth's surface and internal structure. It is responsible for the formation of mountain ranges, ocean basins, and island arcs. It also plays a crucial role in the rock cycle, volcanic activity, and earthquake occurrence. Over geological time, plate tectonics has influenced climate patterns, ocean currents, and the distribution of flora and fauna across the globe. Understanding plate tectonics is essential for predicting geological hazards, explaining the distribution of natural resources, and comprehending Earth's long-term geological history."
    },
    {
        "prompt": "What are the primary causes of biodiversity loss?",
        "context": "Biodiversity loss, the decline in the variety of life forms on Earth, is a critical environmental issue with far-reaching consequences for ecosystems and human well-being. Several interconnected factors contribute to this loss. Habitat destruction and fragmentation, often due to human activities like deforestation, urbanization, and agricultural expansion, is a primary driver. Climate change is increasingly recognized as a major threat, altering ecosystems faster than many species can adapt. Overexploitation of natural resources, including overfishing and poaching, directly reduces populations of many species. Pollution, including chemical runoff, plastic waste, and air pollution, degrades habitats and harms wildlife. The introduction of invasive species, often facilitated by human activities, can disrupt local ecosystems and outcompete native species. Additionally, the spread of diseases, sometimes exacerbated by climate change and habitat stress, can devastate populations of certain species. These factors often interact and compound each other's effects, accelerating the rate of biodiversity loss. Addressing this crisis requires comprehensive conservation strategies, sustainable resource management, and global cooperation to mitigate human impacts on natural ecosystems."
    },
]

async def process_stream(stream):
    """Consume a streamed completion; return (first_token_time, total_tokens)."""
    first_token_time = None
    total_tokens = 0
    async for chunk in stream:
        # The arrival of the first chunk marks time to first token.
        if first_token_time is None:
            first_token_time = time.time()
        # Skip chunks that carry no choices (for example usage-only chunks).
        if not chunk.choices:
            continue
        # Each content-bearing chunk is counted as one token (an approximation).
        if chunk.choices[0].delta.content:
            total_tokens += 1
        if chunk.choices[0].finish_reason is not None:
            break
    return first_token_time, total_tokens

async def make_request(client, output_tokens, request_timeout, use_long_context):
    """Send one chat completion request and return its metrics, or None on failure."""
    start_time = time.time()
    if use_long_context:
        prompt_pair = random.choice(LONG_PROMPT_PAIRS)
        content = prompt_pair["context"] + "\n\n" + prompt_pair["prompt"]
    else:
        content = random.choice(SHORT_PROMPTS)

    try:
        stream = await client.chat.completions.create(
            model="NousResearch/Meta-Llama-3.1-8B-Instruct",
            messages=[
                {"role": "user", "content": content}
            ],
            max_tokens=output_tokens,
            stream=True
        )
        # request_timeout bounds stream consumption only, not the create() call above.
        first_token_time, total_tokens = await asyncio.wait_for(process_stream(stream), timeout=request_timeout)

        end_time = time.time()
        elapsed_time = end_time - start_time
        ttft = first_token_time - start_time if first_token_time else None
        tokens_per_second = total_tokens / elapsed_time if elapsed_time > 0 else 0
        return total_tokens, elapsed_time, tokens_per_second, ttft

    except asyncio.TimeoutError:
        logging.warning(f"Request timed out after {request_timeout} seconds")
        return None
    except Exception as e:
        logging.error(f"Error during request: {str(e)}")
        return None

async def worker(client, semaphore, queue, results, output_tokens, request_timeout, use_long_context):
    """Pull task ids from the queue until a None sentinel arrives."""
    while True:
        async with semaphore:
            task_id = await queue.get()
            if task_id is None:
                queue.task_done()
                break
            logging.info(f"Starting request {task_id}")
            result = await make_request(client, output_tokens, request_timeout, use_long_context)
            if result:
                results.append(result)
            else:
                logging.warning(f"Request {task_id} failed")
            queue.task_done()
            logging.info(f"Finished request {task_id}")

def calculate_percentile(values, percentile, reverse=False):
    """Return the given percentile of values, or None if values is empty.

    With reverse=True the (100 - percentile)th percentile is returned, so that
    "p95" of a higher-is-better metric such as tokens per second is the value
    that roughly 95% of requests meet or exceed.
    """
    if not values:
        return None
    if reverse:
        return np.percentile(values, 100 - percentile)
    return np.percentile(values, percentile)

async def run_benchmark(num_requests, concurrency, request_timeout, output_tokens, vllm_url, api_key, use_long_context):
    client = AsyncOpenAI(base_url=vllm_url, api_key=api_key)
    semaphore = asyncio.Semaphore(concurrency)
    queue = asyncio.Queue()
    results = []

    # Add tasks to the queue
    for i in range(num_requests):
        await queue.put(i)

    # Add sentinel values to stop workers
    for _ in range(concurrency):
        await queue.put(None)

    # Create worker tasks
    workers = [asyncio.create_task(worker(client, semaphore, queue, results, output_tokens, request_timeout, use_long_context)) for _ in range(concurrency)]

    start_time = time.time()

    # Wait for all tasks to complete
    await queue.join()
    await asyncio.gather(*workers)

    end_time = time.time()

    # Calculate metrics
    total_elapsed_time = end_time - start_time
    total_tokens = sum(tokens for tokens, _, _, _ in results if tokens is not None)
    latencies = [elapsed_time for _, elapsed_time, _, _ in results if elapsed_time is not None]
    tokens_per_second_list = [tps for _, _, tps, _ in results if tps is not None]
    ttft_list = [ttft for _, _, _, ttft in results if ttft is not None]

    successful_requests = len(results)
    requests_per_second = successful_requests / total_elapsed_time if total_elapsed_time > 0 else 0
    avg_latency = sum(latencies) / len(latencies) if latencies else 0
    avg_tokens_per_second = sum(tokens_per_second_list) / len(tokens_per_second_list) if tokens_per_second_list else 0
    avg_ttft = sum(ttft_list) / len(ttft_list) if ttft_list else 0

    # Calculate percentiles
    percentiles = [50, 95, 99]
    latency_percentiles = [calculate_percentile(latencies, p) for p in percentiles]
    tps_percentiles = [calculate_percentile(tokens_per_second_list, p, reverse=True) for p in percentiles]
    ttft_percentiles = [calculate_percentile(ttft_list, p) for p in percentiles]

    return {
        "total_requests": num_requests,
        "successful_requests": successful_requests,
        "concurrency": concurrency,
        "request_timeout": request_timeout,
        "max_output_tokens": output_tokens,
        "use_long_context": use_long_context,
        "total_time": total_elapsed_time,
        "requests_per_second": requests_per_second,
        "total_output_tokens": total_tokens,
        "latency": {
            "average": avg_latency,
            "p50": latency_percentiles[0],
            "p95": latency_percentiles[1],
            "p99": latency_percentiles[2]
        },
        "tokens_per_second": {
            "average": avg_tokens_per_second,
            "p50": tps_percentiles[0],
            "p95": tps_percentiles[1],
            "p99": tps_percentiles[2]
        },
        "time_to_first_token": {
            "average": avg_ttft,
            "p50": ttft_percentiles[0],
            "p95": ttft_percentiles[1],
            "p99": ttft_percentiles[2]
        }
    }

def print_results(results):
    print(json.dumps(results, indent=2))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Benchmark LLaMA-3 model with vLLM")
    parser.add_argument("--num_requests", type=int, required=True, help="Number of requests to make")
    parser.add_argument("--concurrency", type=int, required=True, help="Number of concurrent requests")
    parser.add_argument("--request_timeout", type=int, default=30, help="Timeout for each request in seconds (default: 30)")
    parser.add_argument("--output_tokens", type=int, default=50, help="Number of output tokens (default: 50)")
    parser.add_argument("--vllm_url", type=str, required=True, help="URL of the vLLM server")
    parser.add_argument("--api_key", type=str, required=True, help="API key for vLLM server")
    parser.add_argument("--use_long_context", action="store_true", help="Use long context prompt pairs instead of short prompts")
    args = parser.parse_args()

    results = asyncio.run(run_benchmark(args.num_requests, args.concurrency, args.request_timeout, args.output_tokens, args.vllm_url, args.api_key, args.use_long_context))
    print_results(results)
else:
    # When imported as a module, provide the run_benchmark function
    __all__ = ['run_benchmark']

--------------------------------------------------------------------------------
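The queue-and-sentinel pattern that `run_benchmark` uses to cap concurrency can be tried without a vLLM server. The sketch below is a minimal stand-in under stated assumptions: `fake_request`, `pool_worker`, and `run_pool` are hypothetical names that do not exist in this repository, and `asyncio.sleep` replaces the real streaming HTTP call.

```python
import asyncio
import time


async def fake_request(task_id: int) -> tuple:
    """Hypothetical stand-in for make_request: sleeps instead of calling a server."""
    start = time.perf_counter()
    await asyncio.sleep(0.01)
    return task_id, time.perf_counter() - start


async def pool_worker(queue: asyncio.Queue, results: list) -> None:
    """Pull task ids until a None sentinel arrives, then exit."""
    while True:
        task_id = await queue.get()
        try:
            if task_id is None:
                return
            results.append(await fake_request(task_id))
        finally:
            # Mark every item done, sentinels included, so queue.join() can return.
            queue.task_done()


async def run_pool(num_requests: int, concurrency: int) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    # Real work items first, then one sentinel per worker.
    for i in range(num_requests):
        queue.put_nowait(i)
    for _ in range(concurrency):
        queue.put_nowait(None)
    workers = [
        asyncio.create_task(pool_worker(queue, results))
        for _ in range(concurrency)
    ]
    await queue.join()
    await asyncio.gather(*workers)
    return results


if __name__ == "__main__":
    done = asyncio.run(run_pool(num_requests=10, concurrency=3))
    print(f"completed {len(done)} requests")
```

In this sketch the number of workers alone bounds the in-flight requests, so no semaphore is used; the original `worker` additionally acquires a semaphore sized to the same `concurrency` value.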