├── .gitignore
├── README.md
├── docs
│   ├── images
│   │   ├── Arena_120-1.png
│   │   ├── Arena_40-1.png
│   │   ├── Mixed_80_20_120-1.png
│   │   ├── Mixed_80_20_40-1.png
│   │   ├── Pubmed_120-1.png
│   │   ├── Pubmed_40-1.png
│   │   ├── a100-v-a10g-2.png
│   │   ├── a100-v-a10g-slo-grid-2.png
│   │   ├── a100-v-a10g-slo-grid.png
│   │   ├── a100-v-a10g-slo.png
│   │   ├── a100-v-a10g.png
│   │   ├── all-gpus-best-2.png
│   │   ├── all-gpus-best.png
│   │   ├── all-gpus-worst-2.png
│   │   ├── all-gpus-worst.png
│   │   ├── h100-v-a100-2.png
│   │   ├── h100-v-a100.png
│   │   ├── melange-diagram.png
│   │   ├── req-rate-comparison-2.png
│   │   └── req-rate-comparison.png
│   └── index.md
├── melange
│   ├── __init__.py
│   ├── config
│   │   └── example.json
│   ├── lib
│   │   ├── __init__.py
│   │   ├── runner.py
│   │   └── util.py
│   ├── main.py
│   ├── profiling
│   │   ├── benchmark-launcher.sh
│   │   ├── gpu-benchmark.py
│   │   └── profiling-instructions.md
│   └── solver.py
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
2 |
3 | ## About
4 | Here we provide the implementation of the Mélange solver and other related scripts used in our [paper](https://arxiv.org/pdf/2404.14527).
5 |
6 | ## Getting Started
7 | ```bash
8 | # Tested on Python 3.9.18
9 |
10 | # 1. Install the necessary dependencies
11 | pip install -r requirements.txt
12 |
13 | # See the melange/profiling/profiling-instructions.md for instructions on how to obtain the GPU information needed as the solver's input.
14 |
15 | # 2. Execute the solver with your own input configuration
16 | python -m melange.main -c melange/config/example.json
17 |
18 | # 3. By default, the solver saves the output to a JSON file named "melange_result.json" in the project root directory
19 | ```
20 |
21 |
22 | ## Explanation of Inputs and Outputs
23 | ### Inputs
24 | The solver requires a json file with the following inputs:
25 | 1. `workload_distribution`: A 2D matrix representing the distribution of input and output lengths that the LLM service expects. Each row refers to one input size, each column refers to one output size, and each cell corresponds to the proportion of requests that fall within the cell's input and output size range (i.e., a bucket). The request size boundaries between buckets can be tuned to reach a desired balance of granularity and solver complexity. For example, the input and output size ranges could be as follows:
26 | - Input/Output size: 1-25, 25-100, 100-250, ...
27 | - The cell at (0, 0) represents the proportion of requests with both input and output sizes of 1-25 tokens.
28 | - The cell at (0, 1) represents the proportion of requests with input size 1-25 tokens and output size 25-100 tokens.
29 | - And so on ...
30 | 2. `gpu_info`: A dictionary keyed by GPU name (e.g., `"A10G"`), where each GPU's entry contains the following keys:
32 | - `cost`: The hourly rental cost of the GPU.
33 | - `tputs`: A 2D matrix where each cell represents the GPU's profiled maximum throughput for requests of size equivalent to the corresponding cell in the `workload_distribution` matrix.
34 | 3. `total_request_rate`: A float value representing the total request rate of the workload.
35 | 4. `slice_factor`: An integer multiplier for the number of slices each bucket is split into.
36 |
37 | Please refer to [example.json](melange/config/example.json) for an example of the inputs, and check out our paper for more details on our methodology. We have also provided the profiling scripts we used to obtain the GPU information in the [profiling](melange/profiling) directory. See the [profiling instructions](melange/profiling/profiling-instructions.md) for more details on how to use these scripts.
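As a quick sanity check before running the solver, a script along these lines can catch malformed configs (a minimal sketch, assuming the config follows the schema above and lives at `melange/config/example.json`):
```python
import json

with open("melange/config/example.json") as f:
    cfg = json.load(f)

dist = cfg["workload_distribution"]
# The distribution holds proportions of requests, so it should sum to ~1.
assert abs(sum(sum(row) for row in dist) - 1.0) < 1e-6

for name, info in cfg["gpu_info"].items():
    tputs = info["tputs"]
    # Each GPU's throughput matrix must mirror the shape of the workload distribution.
    assert len(tputs) == len(dist), name
    assert all(len(t_row) == len(d_row) for t_row, d_row in zip(tputs, dist)), name
    assert all(t > 0 for row in tputs for t in row), f"{name}: throughputs must be positive"

print("Config OK:", list(cfg["gpu_info"]), cfg["total_request_rate"], cfg["slice_factor"])
```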
38 |
39 | ### Outputs
41 | The solver returns a dictionary containing the following:
42 | 1. The name of each GPU and the number of that GPU type to use.
43 | 2. The total cost for one hour.
44 |
45 | An example of the solver output is as follows:
46 | ```json
47 | {
48 | "A10G": 3,
49 | "A100-80GB": 1,
50 | "cost": 6.7
51 | }
52 | ```
53 | In this case, the solver recommends using 3 A10G GPUs and 1 A100-80GB GPU, which results in a total cost of $6.7/hr.
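For instance, downstream tooling could consume the result as follows (a minimal sketch, assuming the default `melange_result.json` output file):
```python
import json

with open("melange_result.json") as f:
    allocation = json.load(f)

hourly_cost = allocation.pop("cost")  # remaining keys are GPU types mapped to instance counts
for gpu, count in allocation.items():
    print(f"Provision {count} x {gpu}")
print(f"Estimated cost: ${hourly_cost:.2f}/hr")
```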
54 |
55 | ### Output Formats
56 | Melange currently supports the following output formats:
57 | * **JSON**:
58 | * Default output format.
59 | * The solver output is saved as a JSON file at the root directory with the name `melange_result.json`.
60 |
61 | ## Run with Your Own Dataset or GPU Information
62 | The toy example at [script_code](melange/main.py) and [example_config](melange/config/example.json) includes examples of the four inputs to Mélange, which should be replaced to fit your own setting.
63 |
64 | ### Workload Distribution
65 | 1. Determine the expected distribution of request sizes your LLM service expects. For example, you can use historical data of requests served by your service. In our evaluations, we used publicly available datasets (such as [Chatbot Arena](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)) to determine a reasonable distribution of request sizes.
66 | 2. Populate the `workload_distribution` based on the determined distribution. As mentioned, each row refers to a single input size, each column refers to a single output size, and each cell corresponds to the proportion of requests that fall into the given bucket. For example, a cell value of 0.1 indicates that 10% of requests are in that bucket's size range. A sketch of this bucketing step is shown below.
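The sketch below shows one way to do this bucketing; the bucket boundaries and the `build_workload_distribution` helper are illustrative, not part of Mélange:
```python
import numpy as np

# Illustrative bucket upper bounds (tokens); these match the request sizes used in
# melange/profiling/benchmark-launcher.sh, but can be tuned for your service.
INPUT_BUCKETS = [25, 100, 250, 500, 1000, 2000]
OUTPUT_BUCKETS = [25, 100, 250, 500, 1000, 2000]

def build_workload_distribution(sizes):
    """sizes: iterable of observed (input_len, output_len) pairs from your service."""
    counts = np.zeros((len(INPUT_BUCKETS), len(OUTPUT_BUCKETS)))
    for in_len, out_len in sizes:
        # Find the first bucket whose upper bound covers the length; clamp anything
        # larger than the last boundary into the last bucket.
        i = min(int(np.searchsorted(INPUT_BUCKETS, in_len)), len(INPUT_BUCKETS) - 1)
        j = min(int(np.searchsorted(OUTPUT_BUCKETS, out_len)), len(OUTPUT_BUCKETS) - 1)
        counts[i, j] += 1
    return (counts / counts.sum()).tolist()

# Example with three observed requests; each cell is a proportion, so all cells sum to 1.
print(build_workload_distribution([(20, 30), (150, 40), (800, 900)]))
```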
67 |
68 | ### GPU Information
69 | For each GPU instance of interest, provide the following information:
70 | 1. The name of the instance.
71 | 2. The hourly rental cost of the instance.
72 | 3. Results from profiling the GPU's maximum throughput (in requests/s) for requests in each bucket of `workload_distribution` (see the sketch below for one way to assemble these results).
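The sketch below shows one simple way to turn the files written by `benchmark-launcher.sh` into a `tputs` matrix; the results directory, bucket list, and the use of the highest observed throughput per bucket are assumptions on our part, so adapt them to your own profiling methodology:
```python
import re
from pathlib import Path

RESULTS_DIR = Path("/home/user/results")   # same as PATH_PREFIX in benchmark-launcher.sh
BUCKETS = [25, 100, 250, 500, 1000, 2000]  # request sizes swept by the launcher

# benchmark-launcher.sh names each file "<input_len>-<output_len>-<req_rate>-<total>.txt",
# and gpu-benchmark.py prints a line such as "Request Throughput: 3.21 requests/s".
best = {}  # (input_len, output_len) -> highest observed request throughput
for path in RESULTS_DIR.glob("*.txt"):
    in_len, out_len, _rate, _total = (int(x) for x in path.stem.split("-"))
    match = re.search(r"Request Throughput: ([\d.]+) requests/s", path.read_text())
    if match:
        key = (in_len, out_len)
        best[key] = max(best.get(key, 0.0), float(match.group(1)))

# Rows are input buckets, columns are output buckets, mirroring workload_distribution.
tputs = [[best.get((i, o), 0.0) for o in BUCKETS] for i in BUCKETS]
print(tputs)
```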
73 |
74 | ### Overall Rate and Slice Factor
75 | 1. Determine the service's overall request rate across all request sizes, and provide it as the `total_request_rate`.
76 | 2. Decide on the slice factor. The solver's output is not very sensitive to this choice; empirically, a value of 4 is sufficient for most cases. The snippet below illustrates what slicing does.
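For intuition, this is what the slice factor does inside the solver (mirroring `melange/solver.py`): each bucket's request rate is split into `slice_factor` equal slices, and the ILP can assign each slice to a different GPU type:
```python
bucket_rate = 6.0   # req/s falling into one (input, output) bucket
slice_factor = 4
slices = [bucket_rate / slice_factor] * slice_factor  # -> [1.5, 1.5, 1.5, 1.5]
```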
77 |
78 | ## For Arm-based Mac platforms
79 | We have occasionally (but not always) seen errors when using PuLP on Arm-based Macs (M1/M2/M3). If you experience this issue, it is likely because the default ILP solver used by the PuLP library is not compatible with your architecture, and the following additional steps are required.
80 | 1. Install the COIN CBC ILP solver using homebrew: `brew install coin-or-tools/coinor/cbc`
81 | 2. In [melange/solver.py](melange/solver.py), uncomment the following code to use the CBC solver. Note that your `path` may differ based on where the library was installed.
82 | ```python
83 | solver = pulp.getSolver('COIN_CMD', path='/opt/homebrew/opt/cbc/bin/cbc', msg=0)
84 | problem.solve(solver)
85 | ```
86 |
87 | ## Citation
88 | If you use Mélange in your research, please cite our [paper](https://arxiv.org/abs/2404.14527):
89 | ```
90 | @article{griggs2024m,
91 | title={M{\'e}lange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity},
92 | author={Griggs, Tyler and Liu, Xiaoxuan and Yu, Jiaxiang and Kim, Doyoung and Chiang, Wei-Lin and Cheung, Alvin and Stoica, Ion},
93 | journal={arXiv preprint arXiv:2404.14527},
94 | year={2024}
95 | }
96 | ```
97 |
--------------------------------------------------------------------------------
/docs/images/Arena_120-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/Arena_120-1.png
--------------------------------------------------------------------------------
/docs/images/Arena_40-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/Arena_40-1.png
--------------------------------------------------------------------------------
/docs/images/Mixed_80_20_120-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/Mixed_80_20_120-1.png
--------------------------------------------------------------------------------
/docs/images/Mixed_80_20_40-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/Mixed_80_20_40-1.png
--------------------------------------------------------------------------------
/docs/images/Pubmed_120-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/Pubmed_120-1.png
--------------------------------------------------------------------------------
/docs/images/Pubmed_40-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/Pubmed_40-1.png
--------------------------------------------------------------------------------
/docs/images/a100-v-a10g-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/a100-v-a10g-2.png
--------------------------------------------------------------------------------
/docs/images/a100-v-a10g-slo-grid-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/a100-v-a10g-slo-grid-2.png
--------------------------------------------------------------------------------
/docs/images/a100-v-a10g-slo-grid.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/a100-v-a10g-slo-grid.png
--------------------------------------------------------------------------------
/docs/images/a100-v-a10g-slo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/a100-v-a10g-slo.png
--------------------------------------------------------------------------------
/docs/images/a100-v-a10g.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/a100-v-a10g.png
--------------------------------------------------------------------------------
/docs/images/all-gpus-best-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/all-gpus-best-2.png
--------------------------------------------------------------------------------
/docs/images/all-gpus-best.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/all-gpus-best.png
--------------------------------------------------------------------------------
/docs/images/all-gpus-worst-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/all-gpus-worst-2.png
--------------------------------------------------------------------------------
/docs/images/all-gpus-worst.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/all-gpus-worst.png
--------------------------------------------------------------------------------
/docs/images/h100-v-a100-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/h100-v-a100-2.png
--------------------------------------------------------------------------------
/docs/images/h100-v-a100.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/h100-v-a100.png
--------------------------------------------------------------------------------
/docs/images/melange-diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/melange-diagram.png
--------------------------------------------------------------------------------
/docs/images/req-rate-comparison-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/req-rate-comparison-2.png
--------------------------------------------------------------------------------
/docs/images/req-rate-comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/docs/images/req-rate-comparison.png
--------------------------------------------------------------------------------
/docs/index.md:
--------------------------------------------------------------------------------
1 | # Exploiting Heterogeneous GPUs to Cut LLM Deployment Costs
2 |
3 | *This blog post is based on our preprint "Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity". For more details, see the [preprint on arXiv](https://arxiv.org/abs/2404.14527) and the [code on GitHub](https://github.com/tyler-griggs/melange-release).*
4 |
5 | ### TL;DR
6 | Using an optimal *mix* of GPU types in your LLM deployment can significantly cut deployment costs by exploiting differing GPU cost efficiencies across diverse LLM service scenarios.
7 |
8 | ## The High Cost of LLM Deployment
9 |
10 | Large language models (LLMs) are increasingly integrated into many online services such as search engines and virtual assistants. However, the deployment of these models is often cost-prohibitive due to the need for expensive GPU resources.
11 |
12 |
13 | Many prior works reduce deployment costs by increasing inference engine performance, but our study shifts the spotlight to choosing the most cost-effective GPU type(s) for any given LLM service.
14 |
15 | ## GPU Heterogeneity to the Rescue
16 |
17 | There is a large and growing option space of AI hardware accelerators, from NVIDIA GPUs and AMD GPUs to Google TPUs, AWS Inferentia, and more. Within these options, *higher cost does not always lead to higher performance*. We find, instead, that the cost efficiency of a given GPU is heavily influenced by three key characteristics of LLM services: request sizes, request rates, and service-level objectives (SLOs).
18 |
19 | Whereas most LLM service deployments use only a single GPU type to host model replicas, we show that a *mix* of heterogeneous GPUs, tailored to the specific characteristics of a given LLM service, can lead to significant cost savings.
20 |
21 |
22 |
23 | # Key Factors Influencing GPU Cost Efficiency
24 |
25 | In this section, we highlight the three key LLM service characteristics that influence GPU cost efficiency: **request size**, **request rate**, and **SLO**.
26 |
27 | ## Request Size
28 |
29 |
30 | To demonstrate the effect of LLM request size (input and output lengths) on GPU cost efficiency, we present three case studies. In each case, we measure the maximum generation throughput each GPU type achieves across a range of request sizes, and divide the throughput by the GPU's on-demand rental cost, resulting in a measure of cost efficiency (tokens/\\$, or T/\\$). In each plot, a tile's shade indicates which GPU is most cost effective, and the tile's value indicates the percent increase in cost efficiency relative to the less cost-efficient GPU.
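Concretely, the T/\\$ metric is just the profiled throughput normalized by the hourly rental price; a minimal sketch (the numbers are purely illustrative, not measurements):
```python
def tokens_per_dollar(tokens_per_second: float, hourly_cost_usd: float) -> float:
    # Cost efficiency: tokens generated in one hour divided by the hourly price.
    return tokens_per_second * 3600 / hourly_cost_usd

# Purely illustrative: 1,000 tokens/s on a $1.01/hr instance -> ~3.6M tokens per dollar.
print(f"{tokens_per_dollar(1000, 1.01):,.0f} tokens/$")
```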
31 |
32 | * **Llama2-7b on A100 and A10G:** In Plot 1, for small requests, A10G achieves up to 72% greater cost efficiency. Conversely, as the request size increases, A100 demonstrates 38% greater cost efficiency.
33 |
34 | * **Llama2-70b on L4, A10G, A100, and H100:** We extend the study to include L4 and H100. Plot 3 compares the best GPU to the second best, and Plot 4 compares the best to the worst. In the black boxes, only A100 and H100 are compared. Note that, across the wide spectrum of request sizes, there are regions where each of the four GPU types is the most cost effective.
35 |
36 | * **Llama2-70b on 2xH100 and 2xA100:** In Plot 2, we also examine a tensor parallel setting (2 H100s and 2 A100s) and observe similar trends. The cheaper GPU (A100) achieves higher cost efficiency for smaller requests, while the higher-end GPU (H100) excels for large request sizes.
37 |
38 | **Takeaways:** There is no universally most cost-efficient GPU for a given LLM. Instead, GPU cost efficiency is highly dependent on request sizes. Lower-end GPUs are more cost-effective for small request sizes whereas higher-end GPUs are best for large request sizes.
39 |
40 |
41 |
42 |
43 |
44 | *Plot 1: Llama2-7b on A10G, A100*
45 |
46 |
47 |
48 | *Plot 2: Llama2-70b on 2xH100, 2xA100*
49 |
50 |
51 |
52 |
53 |
54 |
55 | *Plot 3: Best GPU vs 2nd Best GPU*
56 |
57 |
58 |
59 | *Plot 4: Best GPU vs Worst GPU*
60 |
61 |
62 |
63 |
64 | ## Request Rate
65 |
66 | Consider serving Llama2-7b across varying request rates with three different GPU allocation policies: A10G-only, A100-only, or a mix of both. Plot 5 depicts the on-demand rental cost of serving a range of traffic volumes with these policies. At low rates, A10G is the cheapest choice, while A100 becomes the more economical option at higher rates. However, using a mix of A10Gs and A100s permits finer-grained scaling and consistently leads to the lowest cost.
67 |
68 | In general, at low request rates, services can save costs by right-sizing down from expensive, high-end GPUs to more affordable, lower-end GPUs. Further, even at high request rates, a mix of GPU types can be used to more closely match demand, optimizing GPU utilization and reducing resource waste.
69 |
70 | **Takeaway:** Mixing heterogeneous GPU types permits a finer-grained approach to resource scaling, which better aligns provisioned resources with workload demand.
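To make the finer-grained scaling point concrete, here is a toy calculation; the per-GPU prices and per-bucket capacities are the illustrative values from `melange/config/example.json`, not measured throughputs:
```python
import math

A10G = {"cost": 1.01, "cap": 2}    # $/hr, max req/s for one request-size bucket
A100 = {"cost": 3.67, "cap": 20}

rate = 22  # req/s of traffic to serve

a10g_only = math.ceil(rate / A10G["cap"]) * A10G["cost"]  # 11 x A10G -> $11.11/hr
a100_only = math.ceil(rate / A100["cap"]) * A100["cost"]  #  2 x A100 -> $7.34/hr
mixed = 1 * A100["cost"] + 1 * A10G["cost"]               #  1 + 1 covers 22 req/s -> $4.68/hr

print(f"A10G-only: ${a10g_only:.2f}/hr  A100-only: ${a100_only:.2f}/hr  mix: ${mixed:.2f}/hr")
```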
71 |
72 |
73 |
74 |
75 |
76 |
77 | *Plot 5: Llama2-7b on A10G and A100 across rates*
78 |
79 |
80 |
81 | *Plot 6: Llama2-7b on A10G and A100 across TPOT SLOs and request sizes*
82 |
83 |
84 |
85 | ## Service-Level Objectives (SLOs)
86 | Services typically establish latency-based service-level objectives to define the performance standards that a service must meet. High-end GPUs are essential for stringent SLOs due to their lower latency and higher throughput. However, for services with more relaxed SLOs, lower-end GPUs can be used effectively to cut costs while still meeting performance expectations.
87 |
88 | In Plot 6, we compare the cost efficiency (tokens/\\$, or T/\\$) of A10G and A100 serving Llama2-7b across a range of request rates and Time Per Output Token (TPOT) SLOs. Changing the TPOT SLO shifts the boundary in the request-size space between the regions where A10G or A100 is most cost effective, and it significantly influences the magnitude of the cost-efficiency difference between the GPUs. As a result, request size and SLO must be considered in tandem when determining cost efficiency.
89 |
90 | **Takeaway:** While strict SLOs require expensive high-performance GPUs, lower-end GPUs can be used to cut deployment costs in loose-SLO scenarios.
91 |
92 |
93 | # Mélange
94 |
95 | ![The Mélange Framework](images/melange-diagram.png)
96 | *The Mélange Framework*
97 |
98 | Building on this analysis, we introduce **Mélange**, a GPU allocation framework that derives the minimal-cost GPU allocation for a given LLM service.
99 |
100 | In Mélange, each GPU type (1a) passes through a one-time offline profiling step (2) to measure its performance across request sizes and rates. Then, given the profiling results and an LLM service definition (1b), Mélange’s objective is to choose a GPU allocation for the service workload that minimizes cost. To do so, we frame the allocation task as a cost-aware bin-packing problem, where bins are GPUs and items are slices of the workload. We formulate the problem as an integer linear program (ILP) and solve it efficiently with an off-the-shelf solver (3). Upon solution, Mélange produces the GPU allocation that can serve the LLM service at minimal cost while adhering to the service SLO (4).
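For reference, the ILP in step (3) can be sketched as follows (this mirrors the formulation implemented in `melange/solver.py`; the notation here is ours): binary variables $x_{ij}$ assign workload slice $i$ to GPU type $j$, non-negative integers $y_j$ count instances of GPU type $j$, $c_j$ is type $j$’s hourly cost, $r_i$ is slice $i$’s request rate, and $\ell_{ij}$ is the fractional load slice $i$ places on one GPU of type $j$ (the reciprocal of its profiled maximum throughput):

$$
\min_{x,\,y} \; \sum_j c_j\, y_j
\qquad \text{s.t.} \qquad
\sum_j x_{ij} = 1 \;\; \forall i,
\qquad
\sum_i x_{ij}\, \ell_{ij}\, r_i \;\le\; y_j \;\; \forall j.
$$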
101 |
102 | Mélange’s strength stems from two key properties. First, it is *heterogeneity-aware*. Mélange’s profiling and ILP formulation account for the large diversity of GPU types and LLM services, enabling efficient navigation of heterogeneous GPU types given a service specification. Second, Mélange is *flexible*. The inputs (1a, 1b) can be flexibly modified to include new generations of GPUs or alternative definitions of SLO, ensuring Mélange is effective for diverse services.
103 |
104 |
105 | # Experimental Results
106 |
107 | We evaluated Mélange's performance using various GPU types (NVIDIA L4, A10G, A100, and H100), model sizes (Llama2-7b and Llama2-70b), and TPOT SLOs (40ms, 120ms). To capture a range of service scenarios, we use three datasets in our evaluations: the Chatbot Arena [dataset](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) for short-context tasks, the PubMed [dataset](https://huggingface.co/datasets/ccdv/pubmed-summarization) for long-context tasks, and a synthetic blend of the two datasets for a mixed-context setting. We compare against baselines that use only a single GPU type. Our results indicate substantial cost reductions across diverse service settings:
108 |
109 |
110 |
111 |
112 | *Plot 7: Short-context, 120ms TPOT SLO*
113 |
114 |
115 |
116 | *Plot 8: Short-context, 40ms TPOT SLO*
117 |
118 |
119 |
120 | **Short-Context Tasks (Interactive Chats):** In Plots 7 & 8, Mélange achieves a 15-77% cost reduction (120ms SLO) and a 9-68% reduction (40ms SLO) compared to single-GPU strategies. At 1-2 req/s, H100/A100 are underutilized, making L4/A10G the economical option. As the rate increases, L4/A10G’s cost advantage shrinks because A100/H100 become better utilized, yet L4/A10G remain competitive with A100 even at higher request rates due to their T/\\$ advantage for smaller request sizes. Conversely, at a 40ms SLO, A10G/L4 show much higher relative costs due to their increased latency, requiring more instances to meet the tight deadline. Mélange adapts by allocating more L4/A10G at the 120ms SLO and more A100 at the 40ms SLO, consistently reducing overall cost.
121 |
122 |
123 |
124 |
125 | *Plot 9: Long-context, 120ms TPOT SLO*
126 |
127 |
128 |
129 | *Plot 10: Long-context, 40ms TPOT SLO*
130 |
131 |
132 |
133 | **Long-Context Tasks (Document Summarization):** In Plots 9 & 10, Mélange achieves 15-33% cost reduction (120ms SLO) and 2-22% reduction (40ms SLO). A100 generally achieves higher T/\\$ for the request sizes in PubMed, evidenced by the 120ms setting where A100-only is consistently cheaper than H100-only. However, when SLO tightens to 40ms, H100 is the clear winner due to H100’s lower inference latency. Again, Mélange adapts to these dynamics by allocating a greater share of A100s at a looser SLO, and more H100s as the SLO is tightened.
134 |
135 |
136 |
137 |
138 | *Plot 11: Mixed-context, 120ms TPOT SLO*
139 |
140 |
141 |
142 | *Plot 12: Mixed-context, 40ms TPOT SLO*
143 |
144 |
145 |
146 | **Mixed-Context Tasks (Chat with Documents):** In Plots 11 & 12, Mélange achieves a 13-51% cost reduction (120ms SLO) and a 4-51% reduction (40ms SLO). Compared to the PubMed workload, A100-only is much more cost efficient relative to H100-only on the Mixed workload because a greater portion of requests are short-context, for which A100 achieves greater T/\\$. Mélange capitalizes on this by using more A100s than H100s, and it also uses L4/A10G for small requests, enabling even further cost reduction.
147 |
148 |
149 | The results validate the core observations that request size, request rate, and SLOs jointly determine GPU cost efficiency. As any of these LLM service characteristics vary, Mélange flexibly adjusts its GPU allocation and mixes GPU types to exploit their heterogeneity. This consistently delivers the most cost efficient allocation, achieving up to a 77% cost reduction.
150 |
151 | # Get Started with Mélange
152 | Clone the [repository](https://github.com/tyler-griggs/melange-release) and run the following commands at the project root directory:
153 | ```bash
154 | # 1. Install the necessary dependencies
155 | pip install -r requirements.txt
156 |
157 | # 2. Execute the solver with an example input configuration
158 | python -m melange.main -c melange/config/example.json
159 |
160 | # 3. By default, the solver will save the output in a JSON file named "melange_result.json" at the project root directory
161 | ```
162 | To run with your own dataset or GPU information, please check out the [README](https://github.com/tyler-griggs/melange-release/blob/main/README.md). We have also provided the profiling scripts and their respective [profiling instructions](https://github.com/tyler-griggs/melange-release/blob/main/melange/profiling/profiling-instructions.md) in the repository.
163 |
164 | # Conclusion
165 |
166 | Within the large and growing option space of AI hardware accelerators, there is significant opportunity to exploit their heterogeneity to cut LLM serving costs. By allocating a mix of GPU types tailored to a given LLM service, Mélange offers an efficient solution for reducing LLM deployment costs while ensuring service quality remains uncompromised.
167 |
168 | *For more details, see the [preprint on arXiv](https://arxiv.org/abs/2404.14527).*
169 |
170 | # Citation
171 | If you use Mélange in your research, please cite our [paper](https://arxiv.org/abs/2404.14527):
172 | ```
173 | @article{griggs2024m,
174 | title={M{\'e}lange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity},
175 | author={Griggs, Tyler and Liu, Xiaoxuan and Yu, Jiaxiang and Kim, Doyoung and Chiang, Wei-Lin and Cheung, Alvin and Stoica, Ion},
176 | journal={arXiv preprint arXiv:2404.14527},
177 | year={2024}
178 | }
179 | ```
180 |
--------------------------------------------------------------------------------
/melange/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/melange/__init__.py
--------------------------------------------------------------------------------
/melange/config/example.json:
--------------------------------------------------------------------------------
1 | {
2 | "gpu_info": {
3 | "A10G": {
4 | "cost": 1.01,
5 | "tputs": [[2, 1], [5, 2]]
6 | },
7 | "A100-80GB": {
8 | "cost": 3.67,
9 | "tputs": [[20, 20], [40, 20]]
10 | }
11 | },
12 | "workload_distribution": [[0.2, 0.1], [0.5, 0.2]],
13 | "total_request_rate": 30.0,
14 | "slice_factor": 1
15 | }
--------------------------------------------------------------------------------
/melange/lib/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tyler-griggs/melange-release/d46ab43855bcdbfed4740a42058d2e269374ea55/melange/lib/__init__.py
--------------------------------------------------------------------------------
/melange/lib/runner.py:
--------------------------------------------------------------------------------
1 | from dataclasses import dataclass, field
2 | import json
3 | from typing import Dict
4 | from pathlib import Path
5 |
6 | from melange.solver import MelangeSolver, Solver
7 |
8 | PROJECT_DIR = Path(__file__).parent.parent.parent
9 |
10 | class SolverRunner:
11 | @dataclass
12 | class Config:
13 | gpu_info: dict = field(default_factory=dict)
14 | workload_distribution: list = field(default_factory=list)
15 | total_request_rate: float = 0 # units: requests per second
16 | slice_factor: int = 1
17 |
18 | def __init__(self, config_path: str):
19 | self.config: SolverRunner.Config = SolverRunner.Config(**json.load(open(config_path)))
20 | self.solver: Solver = MelangeSolver(
21 | workload_distribution=self.config.workload_distribution,
22 | total_request_rate=self.config.total_request_rate,
23 | gpu_info=self.config.gpu_info,
24 | slice_factor=self.config.slice_factor
25 | )
26 | self.execution_result = {}
27 |
28 | def run(self):
29 | self.execution_result = self.solver.run()
30 | print(f"[Melange] Recommendation: {self.execution_result}")
31 |
32 | def export(self):
33 | output_path = PROJECT_DIR / "melange_result.json"
34 | with open(output_path, "w") as f:
35 | json.dump(self.execution_result, f, indent=4)
36 |
37 | print(f"[Melange] Output saved to {output_path}")
38 |
39 |
--------------------------------------------------------------------------------
/melange/lib/util.py:
--------------------------------------------------------------------------------
1 | from typing import List
2 | import pandas as pd
3 | import os
4 | from dataclasses import dataclass, field
5 |
6 | # Convert max throughput profiling to a mapping from request size to load
7 | def tputs_to_loads_2d(max_tputs: List[List[float]]):
8 | loads = []
9 | for i in range(len(max_tputs)):
10 | loads.append([])
11 | for j in range(len(max_tputs[0])):
12 | load = 1 / max_tputs[i][j]
13 | loads[-1].append(load)
14 | return loads
15 |
16 |
--------------------------------------------------------------------------------
/melange/main.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | from melange.lib.runner import SolverRunner
3 |
4 |
5 | def main(config_path: str):
6 | runner = SolverRunner(config_path)
7 | runner.run()
8 | runner.export()
9 |
10 | if __name__ == "__main__":
11 | parser = argparse.ArgumentParser()
12 | # Input arguments
13 | parser.add_argument(
14 | "--config",
15 | "-c",
16 | type=str,
17 | default="melange/config/example.json",
18 | help="Path to the input configuration file, in json",
19 | )
20 | args = parser.parse_args()
21 |
22 | main(args.config)
23 |
--------------------------------------------------------------------------------
/melange/profiling/benchmark-launcher.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Result files will be added to 'PATH_PREFIX' directory.
4 | PATH_PREFIX='/home/user/results'
5 | TOTAL=10
6 |
7 | # TODO: Set your preferred request sizes and rates here.
8 | for input_len in 25 100 250 500 1000 2000; do
9 | for output_len in 25 100 250 500 1000 2000; do
10 | for req_rate in 1 2 4 8 16 32; do
11 | OUTPUT_FILE="${PATH_PREFIX}/${input_len}-${output_len}-${req_rate}-${TOTAL}.txt"
12 | python gpu-benchmark.py --backend=vllm --request-rate=$req_rate --num-prompts=$TOTAL --input_len $input_len --output_len $output_len > ${OUTPUT_FILE}
13 | done
14 | done
15 | done
16 |
17 | echo "Profiling finished."
--------------------------------------------------------------------------------
/melange/profiling/gpu-benchmark.py:
--------------------------------------------------------------------------------
1 | """Benchmark online serving throughput.
2 |
3 | Adapted from https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py
4 |
5 | """
6 | import argparse
7 | import asyncio
8 | import json
9 | import random
10 | import time
11 | from typing import AsyncGenerator, List, Tuple
12 |
13 | import aiohttp
14 | import numpy as np
15 | from vllm.transformers_utils.tokenizer import get_tokenizer
16 |
17 | # (prompt len, output len, request latency)
18 | REQUEST_LATENCY: List[Tuple[int, int, float]] = []
19 | # (prompt len, output len, [per-token latencies])
20 | TOKEN_LATENCY: List[Tuple[int, int, List[float]]] = []
21 | TIME_TO_FIRST_TOKEN: List[float] = []
22 | TEMPERATURE = 0.0
23 |
24 | def sample_requests(
25 | num_requests: int,
26 | config_input_len: int,
27 | config_output_len: int,
28 | ) -> List[Tuple[str, int, int]]:
29 | return [("hi " * config_input_len, config_input_len, config_output_len) for _ in range(num_requests)]
30 |
31 | async def get_request(
32 | input_requests: List[Tuple[str, int, int]],
33 | request_rate: float,
34 | ) -> AsyncGenerator[Tuple[str, int, int], None]:
35 | input_requests = iter(input_requests)
36 | for request in input_requests:
37 | yield request
38 |
39 | if request_rate == float("inf"):
40 | # If the request rate is infinity, then we don't need to wait.
41 | continue
42 | # Sample the request interval from the exponential distribution.
43 | interval = np.random.exponential(1.0 / request_rate)
44 | # The next request will be sent after the interval.
45 | await asyncio.sleep(interval)
46 |
47 |
48 | async def send_request(
49 | backend: str,
50 | api_url: str,
51 | prompt: str,
52 | prompt_len: int,
53 | output_len: int,
54 | best_of: int,
55 | use_beam_search: bool,
56 | ) -> None:
57 |
58 | headers = {"User-Agent": "Benchmark Client"}
59 | if backend == "vllm":
60 | pload = {
61 | "prompt": prompt,
62 | "n": 1,
63 | "best_of": best_of,
64 | "use_beam_search": use_beam_search,
65 | "temperature": 0.0 if use_beam_search else TEMPERATURE,
66 | "top_p": 1.0,
67 | "max_tokens": output_len,
68 | "ignore_eos": True,
69 | "stream": True,
70 | }
71 | elif backend == "tgi":
72 | assert not use_beam_search
73 | params = {
74 | "best_of": best_of,
75 | "max_new_tokens": output_len,
76 | "do_sample": True,
77 | }
78 | pload = {
79 | "inputs": prompt,
80 | "parameters": params,
81 | }
82 | else:
83 | raise ValueError(f"Unknown backend: {backend}")
84 |
85 | request_start_time = time.perf_counter()
86 | timeout = aiohttp.ClientTimeout(total=3 * 3600)
87 | async with aiohttp.ClientSession(timeout=timeout) as session:
88 | while True:
89 | async with session.post(api_url, headers=headers, json=pload) as response:
90 | chunks = []
91 | token_latencies = []
92 | previous_token_time = time.perf_counter()
93 | first = True
94 | async for chunk, _ in response.content.iter_chunks():
95 | # Stream on: Each chunk in the response is the full response so far
96 | chunks = [chunk]
97 |
98 | now_time = time.perf_counter()
99 | if first:
100 | time_to_first = now_time - previous_token_time
101 | first = False
102 | else:
103 | token_latencies.append(now_time - previous_token_time)
104 | previous_token_time = now_time
105 |
106 | # Stream off: Chunks are full response.
107 | # chunks.append(chunk)
108 | output = b"".join(chunks).decode("utf-8")
109 | output = output[:-1] # Get rid of EOF
110 | output = json.loads(output)
111 |
112 | # Re-send the request if it failed.
113 | if "error" not in output:
114 | break
115 |
116 | request_end_time = time.perf_counter()
117 | request_latency = request_end_time - request_start_time
118 | REQUEST_LATENCY.append((prompt_len, output_len, request_latency))
119 | TOKEN_LATENCY.append((prompt_len, output_len, token_latencies))
120 | TIME_TO_FIRST_TOKEN.append(time_to_first)
121 |
122 | async def benchmark(
123 | backend: str,
124 | api_url: str,
125 | input_requests: List[Tuple[str, int, int]],
126 | best_of: int,
127 | use_beam_search: bool,
128 | request_rate: float,
129 | ) -> None:
130 | tasks: List[asyncio.Task] = []
131 |
132 | async for request in get_request(input_requests, request_rate):
133 | prompt, prompt_len, output_len = request
134 | task = asyncio.create_task(send_request(backend, api_url, prompt,
135 | prompt_len, output_len,
136 | best_of, use_beam_search))
137 | tasks.append(task)
138 |
139 | await asyncio.gather(*tasks)
140 |
141 |
142 | def main(args: argparse.Namespace):
143 | print(args)
144 | random.seed(args.seed)
145 | np.random.seed(args.seed)
146 |
147 | api_url = f"http://{args.host}:{args.port}/generate"
148 | input_requests = sample_requests(args.num_prompts, args.input_len, args.output_len)
149 |
150 | benchmark_start_time = time.perf_counter()
151 | asyncio.run(benchmark(args.backend, api_url, input_requests, args.best_of,
152 | args.use_beam_search, args.request_rate))
153 | benchmark_end_time = time.perf_counter()
154 | benchmark_time = benchmark_end_time - benchmark_start_time
155 | print()
156 | print("RESULT SUMMARY")
157 | print(f"Request rate: {args.request_rate} req/s")
158 | print(f"Prompt count: {len(REQUEST_LATENCY)}")
159 | print(f"Total time: {benchmark_time:.2f} s")
160 | print(f"Request Throughput: {len(REQUEST_LATENCY) / benchmark_time:.2f} requests/s")
161 | print(f"Output Token Throughput: {sum([output for _, output, _ in REQUEST_LATENCY]) / benchmark_time:.2f} tokens/s")
162 | print()
163 |
164 | # Compute the latency statistics.
165 | avg_latency = np.mean([latency for _, _, latency in REQUEST_LATENCY])
166 | print("REQUEST LATENCIES")
167 | print(f"Avg: {avg_latency:.2f} s")
168 | print(f"50p: {np.percentile([latency for _, _, latency in REQUEST_LATENCY], 50)} s")
169 | print(f"90p: {np.percentile([latency for _, _, latency in REQUEST_LATENCY], 90)} s")
170 | print(f"99p: {np.percentile([latency for _, _, latency in REQUEST_LATENCY], 99)} s")
171 | print()
172 |
173 | print()
174 |
175 | all_token_latencies = np.array([token_latencies for _, _, token_latencies in TOKEN_LATENCY])
176 | print("TOKEN LATENCIES")
177 | print("TTFT")
178 | print(f'Avg: {np.mean(TIME_TO_FIRST_TOKEN)}')
179 | print(f'50p: {np.percentile(TIME_TO_FIRST_TOKEN, 50)}')
180 | print(f'90p: {np.percentile(TIME_TO_FIRST_TOKEN, 90)}')
181 | print(f'99p: {np.percentile(TIME_TO_FIRST_TOKEN, 99)}')
182 | print("TPOT")
183 | print(f'Avg: {np.mean(all_token_latencies)}')
184 | print(f'50p: {np.percentile(all_token_latencies, 50)}')
185 | print(f'90p: {np.percentile(all_token_latencies, 90)}')
186 | print(f'99p: {np.percentile(all_token_latencies, 99)}')
187 | print()
188 |
189 | if __name__ == "__main__":
190 | parser = argparse.ArgumentParser(
191 | description="Benchmark the online serving throughput.")
192 | parser.add_argument("--backend", type=str, default="vllm",
193 | choices=["vllm", "tgi"])
194 | parser.add_argument("--host", type=str, default="localhost")
195 | parser.add_argument("--port", type=int, default=8000)
196 | parser.add_argument("--best-of", type=int, default=1,
197 | help="Generates `best_of` sequences per prompt and "
198 | "returns the best one.")
199 | parser.add_argument("--use-beam-search", action="store_true")
200 | parser.add_argument("--num-prompts", type=int, default=1000,
201 | help="Number of prompts to process.")
202 | parser.add_argument("--request-rate", type=float, default=float("inf"),
203 | help="Number of requests per second. If this is inf, "
204 | "then all the requests are sent at time 0. "
205 | "Otherwise, we use Poisson process to synthesize "
206 | "the request arrival times.")
207 | parser.add_argument("--seed", type=int, default=0)
208 | parser.add_argument('--trust-remote-code', action='store_true',
209 | help='trust remote code from huggingface')
210 | parser.add_argument("--input_len", type=int, default=0)
211 | parser.add_argument("--output_len", type=int, default=0)
212 | args = parser.parse_args()
213 | main(args)
214 |
--------------------------------------------------------------------------------
/melange/profiling/profiling-instructions.md:
--------------------------------------------------------------------------------
1 | # GPU Profiling
2 |
3 | ## About
4 | This directory holds the code we use to profile GPUs and measure their throughput and latency. The bash script `benchmark-launcher.sh` launches multiple sequential instances of `gpu-benchmark.py`; each instance profiles a specific request size and rate.
5 |
6 | ## Launching Benchmarks
7 | First, deploy your model of choice on the GPU you wish to profile. We use [vLLM](https://github.com/vllm-project/vllm/tree/main) as our inference engine, which can be launched by following the instructions in their github repo.
8 |
9 | Once your model is up and running, modify `benchmark-launcher.sh` to configure the following parameters for the profiling:
10 | * PATH_PREFIX: the absolute path of an existing directory where the results will be saved.
11 | * TOTAL: the number of requests to be sent
12 | * input_len, output_len: the input/output length of each request to be sent
13 | * req_rate: the overall request rate
14 |
15 | Finally, simply run `bash benchmark-launcher.sh` and, upon script completion, the results will be in the configured result directory.
--------------------------------------------------------------------------------
/melange/solver.py:
--------------------------------------------------------------------------------
1 | import pulp
2 | from pulp import LpVariable, LpProblem, LpMinimize, LpInteger
3 |
4 | from melange.lib.util import tputs_to_loads_2d
5 |
6 |
7 | # base class
8 | class Solver:
9 | def __init__(self, workload_distribution: list, total_request_rate: float, gpu_info: dict):
10 | self.workload_distribution = workload_distribution
11 | self.overall_rate = total_request_rate
12 | self.gpu_info = gpu_info
13 |
14 | def run(self, logs=False):
15 | raise NotImplementedError
16 |
17 |
18 | class MelangeSolver(Solver):
19 | def __init__(self, workload_distribution: list, total_request_rate: float, gpu_info: dict, slice_factor: int):
20 | super().__init__(workload_distribution, total_request_rate, gpu_info)
21 | self.slice_factor = slice_factor
22 |
23 | def run(self, logs=False):
24 | # Multiply overall rate across distribution.
25 | request_rate_histogram = []
26 | for i in range(len(self.workload_distribution)):
27 | request_rate_histogram.append([])
28 | for j in range(len(self.workload_distribution[0])):
29 | request_rate_histogram[-1].append(
30 | self.workload_distribution[i][j] * self.overall_rate
31 | )
32 |
33 | # Convert the profiled max throughputs into mapping from request size to load
34 | for gpu in self.gpu_info:
35 | self.gpu_info[gpu]["loads"] = tputs_to_loads_2d(self.gpu_info[gpu]["tputs"])
36 |
37 | gpu_types = list(self.gpu_info.keys())
38 | cost_vector = [self.gpu_info[gpu]["cost"] for gpu in gpu_types]
39 |
40 | # Create slices, which is a single dimension.
41 | slices = []
42 | for i in range(len(request_rate_histogram)):
43 | for j in range(len(request_rate_histogram[i])):
44 | for _ in range(self.slice_factor):
45 | slices.append(request_rate_histogram[i][j] / self.slice_factor)
46 |
47 | # Create slice-to-load mapping, which is a single dimension.
48 | for gpu in gpu_types:
49 | slice_loads = []
50 | for i in range(len(self.gpu_info[gpu]["loads"])):
51 | for j in range(len(self.gpu_info[gpu]["loads"][i])):
52 | for _ in range(self.slice_factor):
53 | slice_loads.append(self.gpu_info[gpu]["loads"][i][j])
54 | assert len(slices) == len(slice_loads)
55 | self.gpu_info[gpu]["slice_loads"] = slice_loads
56 |
57 | # Decision matrix value is binary. The slice is assigned to a GPU, or it isn't.
58 | matrix_rows = len(slices)
59 | matrix_cols = len(gpu_types)
60 |
61 | # Vector value is non-negative integer of how many of each GPU type are needed
62 | vector_length = matrix_cols
63 |
64 | decision_matrix = [
65 | [
66 | LpVariable(f"x_{i}_{j}", cat=LpInteger, lowBound=0, upBound=1)
67 | for j in range(matrix_cols)
68 | ]
69 | for i in range(matrix_rows)
70 | ]
71 | decision_vector = [
72 | LpVariable(f"y_{i}", cat=LpInteger, lowBound=0)
73 | for i in range(vector_length)
74 | ]
75 |
76 | # Objective: minimize cost
77 | problem = LpProblem("GpuAllocation", LpMinimize)
78 | problem += pulp.lpSum(
79 | [decision_vector[i] * cost_vector[i] for i in range(len(decision_vector))]
80 | )
81 |
82 | # C1: Each row of decision matrix must sum to exactly 1 (ie, each slice assigned to one GPU)
83 | for i in range(len(decision_matrix)):
84 | problem += pulp.lpSum(decision_matrix[i]) == 1
85 |
86 | # C2: Load of column of decision matrix must fit in decision vector capacity
87 | for j in range(len(decision_matrix[0])):
88 | # j is idx of GPU type, i is slice
89 | problem += (
90 | pulp.lpSum(
91 | [
92 | decision_matrix[i][j]
93 | * self.gpu_info[gpu_types[j]]["slice_loads"][i]
94 | * slices[i]
95 | for i in range(len(decision_matrix))
96 | ]
97 | )
98 | <= decision_vector[j]
99 | )
100 |
101 | # Solve the problem
102 | problem.solve(pulp.PULP_CBC_CMD(msg=0))
103 |
104 | # For Arm-based Mac platforms.
105 | # solver= pulp.getSolver('COIN_CMD', path='/opt/homebrew/opt/cbc/bin/cbc', msg=0)
106 | # problem.solve(solver)
107 |
108 | # Print the results if needed
109 | if logs:
110 | print(f"Decision Matrix:")
111 | for row in decision_matrix:
112 | print([var.value() for var in row])
113 | print(f"Decision Vector:")
114 | print(f"{[var.value() for var in decision_vector]}")
115 |
116 | if pulp.LpStatus[problem.status] != "Optimal":
117 | return None
118 |
119 | solution_dict = {}
120 | for i in range(len(decision_vector)):
121 | solution_dict[gpu_types[i]] = int(decision_vector[i].value())
122 |
123 | total_cost = 0
124 | for gpu in solution_dict:
125 | total_cost += solution_dict[gpu] * self.gpu_info[gpu]["cost"]
126 | solution_dict["cost"] = total_cost
127 |
128 | return solution_dict
129 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | # used for the solver
2 | numpy==1.26.4
3 | pulp==2.8.0
4 | pandas==2.2.1
5 | ruamel.yaml==0.18.6
6 |
7 | # used for profiling
8 | vllm==0.2.7
9 | aiohttp==3.9.5
--------------------------------------------------------------------------------