├── README.md
├── cover.png
├── requirements.txt
├── sgl
│   ├── benchmark.yaml
│   └── logs
│       ├── sgl-0.4.5.post2-deepseek-r1-200.log
│       ├── sgl-0.4.5.post2-deepseek-r1.log
│       ├── sgl-0.4.5.post3-deepseek-r1-200.log
│       ├── sgl-0.4.5.post3-deepseek-r1.log
│       ├── vllm-deepseek-r1-200.log
│       └── vllm-deepseek-r1.log
└── vllm
    ├── benchmark.yaml
    ├── logs
    │   ├── sgl-deepseek-r1.log
    │   └── vllm-deepseek-r1.log
    └── setup.sh

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# LLM Inference Engine Benchmarks

This collection of open-source LLM inference engine benchmarks provides **fair and reproducible** one-line commands to compare different inference engines on **identical hardware** across different infrastructures -- your own clouds or Kubernetes clusters.

We use [SkyPilot](https://github.com/skypilot-org/skypilot) YAML to ensure consistent and reproducible infrastructure deployment across benchmarks.

![cover](./cover.png)

## Background

When different LLM inference engines post their performance numbers [[1](https://x.com/vllm_project/status/1913513173342392596), [2](https://x.com/lmsysorg/status/1913064701313073656)], it can be confusing to see the numbers disagree. The differences can be due to different configurations or different hardware setups.

This repo aims to provide a centralized place to run these benchmarks on the same hardware, with the optimal configurations (e.g., TP, DP) that the official teams of each inference engine can set.

Disclaimer: This repo was created for learning purposes and is not affiliated with any of the inference engine teams.

## Installation

```bash
pip install -U "skypilot[nebius]"
```

Set up cloud credentials. See the [SkyPilot docs](https://docs.skypilot.co/en/latest/getting-started/installation.html).

## Versions

The versions of the inference engines are as follows:

- vLLM: 0.8.4
- SGLang: 0.4.5.post1/0.4.5.post3
- TRT-LLM: NOT SUPPORTED YET

## Benchmark from vLLM

vLLM created a [benchmark](https://github.com/simon-mo/vLLM-Benchmark/tree/main) comparing vLLM against SGLang and TRT-LLM.

To run the benchmarks:

> [!NOTE]
> The vLLM team runs the benchmarks on Nebius H200 machines, so we use `--cloud nebius` below.

### Run the benchmarks

```bash
cd ./vllm

# Run the benchmarks for vLLM
sky launch --cloud nebius -c benchmark -d benchmark.yaml \
  --env HF_TOKEN \
  --env MODEL=deepseek-r1 \
  --env ENGINE=vllm

# Run the benchmarks for SGLang
# Note: the first run of SGLang achieves about half the throughput, likely due
# to JIT code generation. In benchmark.yaml, we discard the first run and
# run the sweeps again.
sky launch --cloud nebius -c benchmark -d benchmark.yaml \
  --env HF_TOKEN \
  --env MODEL=deepseek-r1 \
  --env ENGINE=sgl

# This is not supported yet
# sky launch --cloud nebius -c benchmark benchmark.yaml \
#   --env HF_TOKEN \
#   --env MODEL=deepseek-r1 \
#   --env ENGINE=trt
```

Automatically stop the cluster after the benchmarks are done:

```bash
sky autostop benchmark
```

> [!NOTE]
> If you would like to run the benchmarks on different infrastructure, change `--cloud` to another cloud or your Kubernetes cluster: `--cloud k8s`.
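
For example, here is a minimal sketch of launching the same vLLM benchmark on an existing Kubernetes cluster (assuming the cluster has 8x H200 GPUs available and your kubeconfig is already set up for SkyPilot):

```bash
# Hypothetical variant of the vLLM command above; only `--cloud` changes.
cd ./vllm
sky launch --cloud k8s -c benchmark -d benchmark.yaml \
  --env HF_TOKEN \
  --env MODEL=deepseek-r1 \
  --env ENGINE=vllm
```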

You can also change the model to one of the following: `deepseek-r1`, `qwq-32b`, `llama-8b`, `llama-3b`, `qwen-1.5b`.

### Benchmark Results for DeepSeek-R1

- **CPU**: Intel(R) Xeon(R) Platinum 8468
- **GPU**: 8x NVIDIA H200

**Output token throughput (tok/s)**

| Input Tokens | Output Tokens | vLLM v0.8.4 | SGLang v0.4.5.post1 |
| ------------ | ------------- | ----------- | ------------------- |
| 1000         | 2000          | 1136.92     | 1041.14             |
| 5000         | 1000          | 857.13      | 821.40              |
| 10000        | 500           | 441.53      | 389.84              |
| 30000        | 100           | 37.07       | 33.94               |
| sharegpt     | sharegpt      | 1330.60     | 981.47              |

**Logs**
- vLLM logs: [vllm-deepseek-r1.log](./vllm/logs/vllm-deepseek-r1.log)
- SGLang logs: [sgl-deepseek-r1.log](./vllm/logs/sgl-deepseek-r1.log)

Logs are dumped with `sky logs benchmark > vllm/logs/$ENGINE-deepseek-r1.log`.

## Benchmark from SGLang

SGLang created a [benchmark](https://github.com/sgl-project/sglang/issues/5514) for SGLang on random inputs and outputs. This repo uses the same configurations as that benchmark.

### Run the benchmarks

```bash
cd ./sgl

# Run the benchmarks for SGLang
sky launch --cloud nebius -c benchmark benchmark.yaml \
  --env HF_TOKEN \
  --env ENGINE=sgl

# Run the benchmarks for vLLM
sky launch --cloud nebius -c benchmark benchmark.yaml \
  --env HF_TOKEN \
  --env ENGINE=vllm
```

### Benchmark Results for DeepSeek-R1

- **CPU**: Intel(R) Xeon(R) Platinum 8468
- **GPU**: 8x NVIDIA H200

**Output token throughput (tok/s)**

| Input Tokens | Output Tokens | vLLM v0.8.4 (2025-04-14) | SGLang v0.4.5.post3 (2025-04-21) |
| ------------ | ------------- | ------------------------ | -------------------------------- |
| 1000         | 2000          | 1042.17                  | 1329.14                          |
| 5000         | 1000          | 794.54                   | 951.64                           |
| 10000        | 500           | 436.08                   | 479.69                           |
| 30000        | 100           | 37.76                    | 47.38                            |

**Logs**
- vLLM logs: [vllm-deepseek-r1.log](./sgl/logs/vllm-deepseek-r1.log)
- SGLang logs: [sgl-0.4.5.post3-deepseek-r1.log](./sgl/logs/sgl-0.4.5.post3-deepseek-r1.log)

Logs are dumped with `sky logs benchmark > sgl/logs/$ENGINE-deepseek-r1.log`.

**Output token throughput (tok/s): using 200 prompts (vs 50 prompts in the official benchmark)**

| Input Tokens | Output Tokens | vLLM v0.8.4 (2025-04-14) | SGLang v0.4.5.post3 (2025-04-21) |
| ------------ | ------------- | ------------------------ | -------------------------------- |
| 1000         | 2000          | 2498.90                  | 3276.05                          |
| 5000         | 1000          | 930.93                   | 1322.31                          |
| 10000        | 500           | 341.70                   | 501.95                           |
| 30000        | 100           | 38.44                    | 47.68                            |

**Logs**
- vLLM logs: [vllm-deepseek-r1-200.log](./sgl/logs/vllm-deepseek-r1-200.log)
- SGLang logs: [sgl-0.4.5.post3-deepseek-r1-200.log](./sgl/logs/sgl-0.4.5.post3-deepseek-r1-200.log)

## Contribution

Contributions from the community are welcome, e.g., tuning the versions and configurations of the different inference engines to make the benchmarks more accurate and fair.

## Final Thoughts

Interestingly, the official benchmark results from vLLM and SGLang diverge, even on the same hardware and with the same flags.
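
As a concrete example of how sensitive the setup is (see the next paragraph), the 200-prompt sweep above can be reproduced by overriding the `NUM_PROMPTS` env of `sgl/benchmark.yaml`, which defaults to 50. A sketch:

```bash
cd ./sgl

# Same SGLang command as above, but with 200 prompts per sweep instead of 50.
sky launch --cloud nebius -c benchmark benchmark.yaml \
  --env HF_TOKEN \
  --env ENGINE=sgl \
  --env NUM_PROMPTS=200
```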

Although both benchmark scripts try to simulate real inference scenarios, the throughput numbers are very sensitive to the benchmark setup -- simply changing the number of prompts from 50 to 200 can flip the conclusion about which engine performs better.

A better benchmark is needed to provide more insight into the performance of inference engines; in the meantime, this repo offers a platform for the community to run the benchmarks in a fair and reproducible way -- same settings, same hardware, etc.
--------------------------------------------------------------------------------
/cover.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Michaelvll/llm-ie-benchmarks/5f717fed82f9bbf1d15881842bd61b1915d78b1e/cover.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
skypilot[nebius]
--------------------------------------------------------------------------------
/sgl/benchmark.yaml:
--------------------------------------------------------------------------------
resources:
  accelerators: H200:8
  disk_size: 1000
  memory: 750+

run: |
  # Get the model name from the model path, e.g. deepseek-ai/DeepSeek-R1 -> DeepSeek-R1
  MODEL_NAME=$(basename ${MODEL})
  if [ "$ENGINE" == "sgl" ]; then
    uv pip install sglang[all]==0.4.5.post3
    export SGL_ENABLE_JIT_DEEPGEMM=1
    python -m sglang.launch_server \
      --model ${MODEL} \
      --tp 8 \
      --disable-radix-cache \
      --trust-remote-code > ${ENGINE}_${MODEL_NAME}.log 2>&1 &
  elif [ "$ENGINE" == "vllm" ]; then
    uv pip install vllm==0.8.4
    VLLM_USE_FLASHINFER_SAMPLER=1 vllm serve $MODEL \
      --tensor-parallel-size 8 \
      --trust-remote-code --no-enable-prefix-caching \
      --disable-log-requests > ${ENGINE}_${MODEL_NAME}.log 2>&1 &
  fi

  until grep -q "Started server process" ${ENGINE}_${MODEL_NAME}.log; do
    sleep 5
    echo "Waiting for ${ENGINE} server to start..."
  done
  sleep 10
  echo "$ENGINE server started"

  BACKEND="$ENGINE"
  if [ "$ENGINE" == "sgl" ]; then
    BACKEND="sglang-oai"
  fi

  input_output_pairs=(
    "1000 2000"
    "5000 1000"
    "10000 500"
    "30000 100"
  )

  # Create main results directory
  mkdir -p results

  # Variable to accumulate aggregated results (markdown table rows)
  aggregated_results="| Input Tokens | Output Tokens | Output Token Throughput (tok/s) |"

  # Run the benchmark for 3 retries, as SGLang's JIT code generation needs
  # to warm up.
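  # (Explanatory comments added; behavior unchanged.) Retries 1 and 2 serve as
  # warm-up so SGLang's JIT-compiled kernels are cached; only the third retry's
  # numbers are collected into `aggregated_results` and printed at the end.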
  for retry in {1..3}; do
    echo "== Running retry ${retry} =="
    echo "== Running retry ${retry} ==" >> ${ENGINE}_${MODEL_NAME}.log
    # Create retry-specific directory
    retry_dir="results/retry-${retry}"
    mkdir -p "$retry_dir"

    # Reset the aggregated results before the last retry; only its results are reported
    if [ $retry -eq 3 ]; then
      aggregated_results="| Input Tokens | Output Tokens | Output Token Throughput (tok/s) |"
    fi

    # Iterate through input-output pairs
    for pair in "${input_output_pairs[@]}"; do
      input_len=$(echo $pair | cut -d' ' -f1)
      output_len=$(echo $pair | cut -d' ' -f2)

      # Run benchmark and save to retry-specific directory
      python -m sglang.bench_serving \
        --backend $BACKEND \
        --num-prompts $NUM_PROMPTS \
        --request-rate 10 \
        --dataset-name random \
        --random-input-len $input_len \
        --random-output-len $output_len \
        --random-range-ratio 1 | tee "${retry_dir}/${ENGINE}_${MODEL_NAME}_${input_len}_${output_len}.log"

      # Only collect results for the last retry
      if [ $retry -eq 3 ]; then
        output_token_throughput=$(grep "Output token throughput (tok/s):" "${retry_dir}/${ENGINE}_${MODEL_NAME}_${input_len}_${output_len}.log" | awk '{print $NF}')
        aggregated_results="$aggregated_results\n| $input_len | $output_len | $output_token_throughput |"
      fi
    done
  done

  echo "== Server logs for ${ENGINE} on ${MODEL} =="
  cat ${ENGINE}_${MODEL_NAME}.log
  echo "== End of server logs =="

  # Only print the results for the last retry
  echo "Benchmark results for ${ENGINE} on ${MODEL}:"
  echo -e "$aggregated_results"

envs:
  HF_TOKEN:
  MODEL: "deepseek-ai/DeepSeek-R1"
  ENGINE: "sgl"
  NUM_PROMPTS: 50
--------------------------------------------------------------------------------
/vllm/benchmark.yaml:
--------------------------------------------------------------------------------
name: vllm-benchmark

resources:
  accelerators: H200:8
  disk_size: 1000
  memory: 750+

file_mounts:
  /tmp/setup.sh: ./setup.sh

setup: |
  set -e
  # Run the common setup script
  bash -i /tmp/setup.sh

run: |
  # Log system information
  unset OMP_NUM_THREADS
  nvidia-smi
  lscpu

  cd ~/vLLM-Benchmark

  just serve ${ENGINE} ${MODEL} > ${ENGINE}_${MODEL}.log 2>&1 &

  until grep -q "Started server process" ${ENGINE}_${MODEL}.log; do
    sleep 5
    echo "Waiting for ${ENGINE} server to start..."
  done
  sleep 10
  echo "$ENGINE server started"

  just run-sweeps ${ENGINE} ${MODEL}

  if [ "$ENGINE" == "sgl" ]; then
    # SGLang has a JIT code generation overhead, so we discard the first run
    # and run the sweeps again.
    just show-results ${MODEL}
    rm results/$MODEL/$ENGINE-*.json || true
    just run-sweeps ${ENGINE} ${MODEL}
  fi

  just show-results ${MODEL}

envs:
  MODEL: "llama-8b"
  ENGINE: "vllm"
  HF_TOKEN:

# experimental:
#   config_overrides:
#     nvidia_gpus:
#       disable_ecc: true
--------------------------------------------------------------------------------
/vllm/setup.sh:
--------------------------------------------------------------------------------
#!/bin/bash
# Common setup script for all LLM inference benchmarks

set -e  # Exit on error

# Install system dependencies
echo "Installing system dependencies..."
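# (Comment added for clarity) The commands below install `git-lfs` and the
# `just` command runner; `just` drives the vLLM-Benchmark recipes invoked
# later in this script.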
sudo apt-get update -y
sudo apt-get install -y git-lfs

curl --proto '=https' --tlsv1.2 -sSf https://just.systems/install.sh | sudo bash -s -- --to /usr/local/bin || true

# Create benchmark directories
echo "Creating benchmark directories..."
mkdir -p ~/benchmark_results

# Clone the vLLM benchmark repository
echo "Cloning vLLM-Benchmark repository..."
git clone https://github.com/simon-mo/vLLM-Benchmark ~/vLLM-Benchmark || true
cd ~/vLLM-Benchmark

echo "Cloning vLLM benchmarks directory..."
just clone-vllm-benchmarks || true
uv pip install -r requirements-benchmark.txt

# Setup environment and download datasets
echo "Setting up environment and downloading datasets..."
just list-versions  # This will set up the environment
just download-dataset sharegpt

# Download model if MODEL is defined
if [ ! -z "$MODEL" ]; then
    echo "Downloading model: $MODEL"
    just download-model "$MODEL"
fi

echo "Setup complete!"
--------------------------------------------------------------------------------