├── README.md
├── cover.png
├── requirements.txt
├── sgl
│   ├── benchmark.yaml
│   └── logs
│       ├── sgl-0.4.5.post2-deepseek-r1-200.log
│       ├── sgl-0.4.5.post2-deepseek-r1.log
│       ├── sgl-0.4.5.post3-deepseek-r1-200.log
│       ├── sgl-0.4.5.post3-deepseek-r1.log
│       ├── vllm-deepseek-r1-200.log
│       └── vllm-deepseek-r1.log
└── vllm
    ├── benchmark.yaml
    ├── logs
    │   ├── sgl-deepseek-r1.log
    │   └── vllm-deepseek-r1.log
    └── setup.sh

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# LLM Inference Engine Benchmarks

This collection of open-source LLM inference engine benchmarks provides **fair and reproducible** one-line commands to compare different inference engines on **identical hardware** across different infrastructures -- your own clouds or Kubernetes clusters.

We use [SkyPilot](https://github.com/skypilot-org/skypilot) YAML to ensure consistent and reproducible infrastructure deployment across benchmarks.

![cover](./cover.png)

## Background

When different LLM inference engines post their performance numbers [[1](https://x.com/vllm_project/status/1913513173342392596), [2](https://x.com/lmsysorg/status/1913064701313073656)], it can be confusing to see the numbers disagree. The differences can be due to different configurations or different hardware setups.

This repo aims to provide a centralized place to run these benchmarks on the same hardware, with the optimal configurations (e.g., TP, DP) that the official teams of each inference engine can set.

Disclaimer: This repo was created for learning purposes and is not affiliated with any of the inference engine teams.

## Installation

```bash
pip install -U "skypilot[nebius]"
```

Set up cloud credentials. See the [SkyPilot docs](https://docs.skypilot.co/en/latest/getting-started/installation.html).

## Versions

The versions of the inference engines are as follows:

- vLLM: 0.8.4
- SGLang: 0.4.5.post1/0.4.5.post3
- TRT-LLM: NOT SUPPORTED YET

## Benchmark from vLLM

vLLM created a [benchmark](https://github.com/simon-mo/vLLM-Benchmark/tree/main) comparing vLLM against SGLang and TRT-LLM.

To run the benchmarks:

> [!NOTE]
> The vLLM team runs the benchmarks on Nebius H200 machines, so we use `--cloud nebius` below.

### Run the benchmarks

```bash
cd ./vllm

# Run the benchmarks for vLLM
sky launch --cloud nebius -c benchmark -d benchmark.yaml \
  --env HF_TOKEN \
  --env MODEL=deepseek-r1 \
  --env ENGINE=vllm

# Run the benchmarks for SGLang
# Note: the first run of SGLang achieves about half the throughput, likely due
# to JIT code generation. In benchmark.yaml, we discard the first run and
# run the sweeps again.
sky launch --cloud nebius -c benchmark -d benchmark.yaml \
  --env HF_TOKEN \
  --env MODEL=deepseek-r1 \
  --env ENGINE=sgl

# This is not supported yet
# sky launch --cloud nebius -c benchmark benchmark.yaml \
#   --env HF_TOKEN \
#   --env MODEL=deepseek-r1 \
#   --env ENGINE=trt
```

Automatically stop the cluster after the benchmarks are done:

```bash
sky autostop benchmark
```

> [!NOTE]
> If you would like to run the benchmarks on different infrastructure, change `--cloud` to another cloud or your Kubernetes cluster: `--cloud k8s`.
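
For example, here is a minimal sketch of launching the same vLLM benchmark on an existing Kubernetes cluster (assuming the cluster has 8x H200 GPUs available and your kubeconfig is already set up for SkyPilot):

```bash
# Hypothetical variant of the vLLM command above; only `--cloud` changes.
cd ./vllm
sky launch --cloud k8s -c benchmark -d benchmark.yaml \
  --env HF_TOKEN \
  --env MODEL=deepseek-r1 \
  --env ENGINE=vllm
```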

You can also change the model to one of the following: `deepseek-r1`, `qwq-32b`, `llama-8b`, `llama-3b`, `qwen-1.5b`.

### Benchmark Results for DeepSeek-R1

- **CPU**: Intel(R) Xeon(R) Platinum 8468
- **GPU**: 8x NVIDIA H200

**Output token throughput (tok/s)**

| Input Tokens | Output Tokens | vLLM v0.8.4 | SGLang v0.4.5.post1 |
| ------------ | ------------- | ----------- | ------------------- |
| 1000         | 2000          | 1136.92     | 1041.14             |
| 5000         | 1000          | 857.13      | 821.40              |
| 10000        | 500           | 441.53      | 389.84              |
| 30000        | 100           | 37.07       | 33.94               |
| sharegpt     | sharegpt      | 1330.60     | 981.47              |

**Logs**
- vLLM logs: [vllm-deepseek-r1.log](./vllm/logs/vllm-deepseek-r1.log)
- SGLang logs: [sgl-deepseek-r1.log](./vllm/logs/sgl-deepseek-r1.log)

Logs are dumped with `sky logs benchmark > vllm/logs/$ENGINE-deepseek-r1.log`.

## Benchmark from SGLang

SGLang created a [benchmark](https://github.com/sgl-project/sglang/issues/5514) for SGLang on random inputs and outputs. This repo uses the same configurations as that benchmark.

### Run the benchmarks

```bash
cd ./sgl

# Run the benchmarks for SGLang
sky launch --cloud nebius -c benchmark benchmark.yaml \
  --env HF_TOKEN \
  --env ENGINE=sgl

# Run the benchmarks for vLLM
sky launch --cloud nebius -c benchmark benchmark.yaml \
  --env HF_TOKEN \
  --env ENGINE=vllm
```

### Benchmark Results for DeepSeek-R1

- **CPU**: Intel(R) Xeon(R) Platinum 8468
- **GPU**: 8x NVIDIA H200

**Output token throughput (tok/s)**

| Input Tokens | Output Tokens | vLLM v0.8.4 (2025-04-14) | SGLang v0.4.5.post3 (2025-04-21) |
| ------------ | ------------- | ------------------------ | -------------------------------- |
| 1000         | 2000          | 1042.17                  | 1329.14                          |
| 5000         | 1000          | 794.54                   | 951.64                           |
| 10000        | 500           | 436.08                   | 479.69                           |
| 30000        | 100           | 37.76                    | 47.38                            |

**Logs**
- vLLM logs: [vllm-deepseek-r1.log](./sgl/logs/vllm-deepseek-r1.log)
- SGLang logs: [sgl-0.4.5.post3-deepseek-r1.log](./sgl/logs/sgl-0.4.5.post3-deepseek-r1.log)

Logs are dumped with `sky logs benchmark > sgl/logs/$ENGINE-deepseek-r1.log`.

**Output token throughput (tok/s): using 200 prompts (vs 50 prompts in the official benchmark)**

| Input Tokens | Output Tokens | vLLM v0.8.4 (2025-04-14) | SGLang v0.4.5.post3 (2025-04-21) |
| ------------ | ------------- | ------------------------ | -------------------------------- |
| 1000         | 2000          | 2498.90                  | 3276.05                          |
| 5000         | 1000          | 930.93                   | 1322.31                          |
| 10000        | 500           | 341.70                   | 501.95                           |
| 30000        | 100           | 38.44                    | 47.68                            |

**Logs**
- vLLM logs: [vllm-deepseek-r1-200.log](./sgl/logs/vllm-deepseek-r1-200.log)
- SGLang logs: [sgl-0.4.5.post3-deepseek-r1-200.log](./sgl/logs/sgl-0.4.5.post3-deepseek-r1-200.log)

## Contribution

Contributions from the community are welcome, e.g., tuning the versions and configurations of the different inference engines to make the benchmarks more accurate and fair.

## Final Thoughts

Interestingly, the official benchmark results from vLLM and SGLang diverge, even on the same hardware and with the same flags.
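
As a concrete example of how sensitive the setup is (see the next paragraph), the 200-prompt sweep above can be reproduced by overriding the `NUM_PROMPTS` env of `sgl/benchmark.yaml`, which defaults to 50. A sketch:

```bash
cd ./sgl

# Same SGLang command as above, but with 200 prompts per sweep instead of 50.
sky launch --cloud nebius -c benchmark benchmark.yaml \
  --env HF_TOKEN \
  --env ENGINE=sgl \
  --env NUM_PROMPTS=200
```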

Although both benchmark scripts try to simulate real inference scenarios, the throughput numbers are very sensitive to the benchmark setup -- simply changing the number of prompts from 50 to 200 can flip the conclusion about which engine performs better.

A better benchmark is needed to provide more insight into the performance of inference engines; in the meantime, this repo offers a platform for the community to run the benchmarks in a fair and reproducible way -- same settings, same hardware, etc.
--------------------------------------------------------------------------------
/cover.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Michaelvll/llm-ie-benchmarks/5f717fed82f9bbf1d15881842bd61b1915d78b1e/cover.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
skypilot[nebius]
--------------------------------------------------------------------------------
/sgl/benchmark.yaml:
--------------------------------------------------------------------------------
resources:
  accelerators: H200:8
  disk_size: 1000
  memory: 750+

run: |
  # Get the model name from the model path, e.g. deepseek-ai/DeepSeek-R1 -> DeepSeek-R1
  MODEL_NAME=$(basename ${MODEL})
  if [ "$ENGINE" == "sgl" ]; then
    uv pip install sglang[all]==0.4.5.post3
    export SGL_ENABLE_JIT_DEEPGEMM=1
    python -m sglang.launch_server \
      --model ${MODEL} \
      --tp 8 \
      --disable-radix-cache \
      --trust-remote-code > ${ENGINE}_${MODEL_NAME}.log 2>&1 &
  elif [ "$ENGINE" == "vllm" ]; then
    uv pip install vllm==0.8.4
    VLLM_USE_FLASHINFER_SAMPLER=1 vllm serve $MODEL \
      --tensor-parallel-size 8 \
      --trust-remote-code --no-enable-prefix-caching \
      --disable-log-requests > ${ENGINE}_${MODEL_NAME}.log 2>&1 &
  fi

  until grep -q "Started server process" ${ENGINE}_${MODEL_NAME}.log; do
    sleep 5
    echo "Waiting for ${ENGINE} server to start..."
  done
  sleep 10
  echo "$ENGINE server started"

  BACKEND="$ENGINE"
  if [ "$ENGINE" == "sgl" ]; then
    BACKEND="sglang-oai"
  fi

  input_output_pairs=(
    "1000 2000"
    "5000 1000"
    "10000 500"
    "30000 100"
  )

  # Create main results directory
  mkdir -p results

  # Variable to accumulate aggregated results (markdown table rows)
  aggregated_results="| Input Tokens | Output Tokens | Output Token Throughput (tok/s) |"

  # Run the benchmark for 3 retries, as SGLang's JIT code generation needs
  # to warm up.
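  # (Explanatory comments added; behavior unchanged.) Retries 1 and 2 serve as
  # warm-up so SGLang's JIT-compiled kernels are cached; only the third retry's
  # numbers are collected into `aggregated_results` and printed at the end.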
  for retry in {1..3}; do
    echo "== Running retry ${retry} =="
    echo "== Running retry ${retry} ==" >> ${ENGINE}_${MODEL_NAME}.log
    # Create retry-specific directory
    retry_dir="results/retry-${retry}"
    mkdir -p "$retry_dir"

    # Reset the aggregated results before the last retry; only its results are reported
    if [ $retry -eq 3 ]; then
      aggregated_results="| Input Tokens | Output Tokens | Output Token Throughput (tok/s) |"
    fi

    # Iterate through input-output pairs
    for pair in "${input_output_pairs[@]}"; do
      input_len=$(echo $pair | cut -d' ' -f1)
      output_len=$(echo $pair | cut -d' ' -f2)

      # Run benchmark and save to retry-specific directory
      python -m sglang.bench_serving \
        --backend $BACKEND \
        --num-prompts $NUM_PROMPTS \
        --request-rate 10 \
        --dataset-name random \
        --random-input-len $input_len \
        --random-output-len $output_len \
        --random-range-ratio 1 | tee "${retry_dir}/${ENGINE}_${MODEL_NAME}_${input_len}_${output_len}.log"

      # Only collect results for the last retry
      if [ $retry -eq 3 ]; then
        output_token_throughput=$(grep "Output token throughput (tok/s):" "${retry_dir}/${ENGINE}_${MODEL_NAME}_${input_len}_${output_len}.log" | awk '{print $NF}')
        aggregated_results="$aggregated_results\n| $input_len | $output_len | $output_token_throughput |"
      fi
    done
  done

  echo "== Server logs for ${ENGINE} on ${MODEL} =="
  cat ${ENGINE}_${MODEL_NAME}.log
  echo "== End of server logs =="

  # Only print the results for the last retry
  echo "Benchmark results for ${ENGINE} on ${MODEL}:"
  echo -e "$aggregated_results"

envs:
  HF_TOKEN:
  MODEL: "deepseek-ai/DeepSeek-R1"
  ENGINE: "sgl"
  NUM_PROMPTS: 50
--------------------------------------------------------------------------------
/vllm/benchmark.yaml:
--------------------------------------------------------------------------------
name: vllm-benchmark

resources:
  accelerators: H200:8
  disk_size: 1000
  memory: 750+

file_mounts:
  /tmp/setup.sh: ./setup.sh

setup: |
  set -e
  # Run the common setup script
  bash -i /tmp/setup.sh

run: |
  # Log system information
  unset OMP_NUM_THREADS
  nvidia-smi
  lscpu

  cd ~/vLLM-Benchmark

  just serve ${ENGINE} ${MODEL} > ${ENGINE}_${MODEL}.log 2>&1 &

  until grep -q "Started server process" ${ENGINE}_${MODEL}.log; do
    sleep 5
    echo "Waiting for ${ENGINE} server to start..."
  done
  sleep 10
  echo "$ENGINE server started"

  just run-sweeps ${ENGINE} ${MODEL}

  if [ "$ENGINE" == "sgl" ]; then
    # SGLang has a JIT code generation overhead, so we discard the first run
    # and run the sweeps again.
    just show-results ${MODEL}
    rm results/$MODEL/$ENGINE-*.json || true
    just run-sweeps ${ENGINE} ${MODEL}
  fi

  just show-results ${MODEL}

envs:
  MODEL: "llama-8b"
  ENGINE: "vllm"
  HF_TOKEN:

# experimental:
#   config_overrides:
#     nvidia_gpus:
#       disable_ecc: true
--------------------------------------------------------------------------------
/vllm/setup.sh:
--------------------------------------------------------------------------------
#!/bin/bash
# Common setup script for all LLM inference benchmarks

set -e  # Exit on error

# Install system dependencies
echo "Installing system dependencies..."
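# (Comment added for clarity) The commands below install `git-lfs` and the
# `just` command runner; `just` drives the vLLM-Benchmark recipes invoked
# later in this script.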
sudo apt-get update -y
sudo apt-get install -y git-lfs

curl --proto '=https' --tlsv1.2 -sSf https://just.systems/install.sh | sudo bash -s -- --to /usr/local/bin || true

# Create benchmark directories
echo "Creating benchmark directories..."
mkdir -p ~/benchmark_results

# Clone the vLLM benchmark repository
echo "Cloning vLLM-Benchmark repository..."
git clone https://github.com/simon-mo/vLLM-Benchmark ~/vLLM-Benchmark || true
cd ~/vLLM-Benchmark

echo "Cloning vLLM benchmarks directory..."
just clone-vllm-benchmarks || true
uv pip install -r requirements-benchmark.txt

# Setup environment and download datasets
echo "Setting up environment and downloading datasets..."
just list-versions  # This will set up the environment
just download-dataset sharegpt

# Download model if MODEL is defined
if [ ! -z "$MODEL" ]; then
    echo "Downloading model: $MODEL"
    just download-model "$MODEL"
fi

echo "Setup complete!"
--------------------------------------------------------------------------------