├── LICENSE ├── README.md ├── kv-proxy ├── Dockerfile ├── proxy.py └── requirements.txt ├── lmcache-lmslocalbackend ├── Dockerfile ├── requirements-build.txt ├── requirements-common.txt └── requirements-cuda.txt └── vllm-lmcache ├── Dockerfile ├── example_run.sh ├── patch ├── factory.py ├── lmcache_connector.py └── parallel_state.patch ├── requirements-build.txt ├── requirements-common.txt └── requirements-cuda.txt /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # splitwise-demos 2 | This repo contains a set of demos for PD (prefill-decode) disaggregation based on vLLM and LMCache.
3 | 4 | ## Architecture 5 | 6 | ``` 7 | +-----------+ +------------------+ +-----------+ 8 | | | | | | | 9 | | | | +------------+ | | | 10 | | | | | GPU HBM | | | | 11 | | | /chat/completions | +-+----------+ | | | 12 | | +<------------------->+ | | Put | | 13 | | | set max_tokens=1 | |GPUConnector +----------->+ | 14 | | | | v | TCP/RDMA | | 15 | | | | +-+----------+ | | | 16 | | | | | CPU DRAM | | | | 17 | | | | +------------+ | | | 18 | | | | | | | 19 | | | | Prefill | | | 20 | | | +------------------+ | | 21 | /chat/completions| | | KV Cache | 22 | +--------------->+ Proxy | | Storage | 23 | max_tokens=100 | | | Backend | 24 | | | +------------------+ | | 25 | | | | | | | 26 | | | | +------------+ | | | 27 | | | | | GPU HBM | | | | 28 | | | | +-+----------+ | | | 29 | | | /chat/completions | | | | | 30 | | +<------------------->+ |GPUConnector | | | 31 | | | set max_tokens=100 | v | | | 32 | | | | +-+----------+ | Get | | 33 | | | | | CPU DRAM | +----------->+ | 34 | | | | +------------+ | TCP/RDMA | | 35 | | | | | | | 36 | | | | Decode | | | 37 | +-----------+ +------------------+ +-----------+ 38 | ``` 39 | 40 | In this architecture, the Prefill, Decode, KV Cache Storage Backend, and Proxy are four separate services, and each service can be scaled independently. The workflow is as follows: 41 | 1. When the Proxy service receives an inference request, it overrides the maximum number of generated tokens to 1 and forwards the request to the Prefill service. 42 | 2. The Prefill service runs the prefill pass over the prompt and stores the KV cache and hidden states to the KV Cache Storage Backend via TCP/RDMA, skipping entries that already exist in the remote storage backend. 43 | 3. After receiving the response from the Prefill service, the Proxy service forwards the original inference request to the Decode service. 44 | 4. The Decode service retrieves the KV cache and hidden states from the KV Cache Storage Backend and runs the decode pass, streaming the generated output tokens. 45 | 46 | ## Usage 47 | 48 | ### Configurations 49 | 50 | The example.yaml file configures how the Prefill and Decode services connect to the KV Cache Storage Backend. 51 | 52 | ``` yaml 53 | # example.yaml 54 | chunk_size: 32 # retrieve the KV cache from the remote storage backend only when the token count is larger than chunk_size; otherwise recompute 55 | local_device: "cpu" 56 | remote_url: "lm://192.168.0.1:10080" 57 | remote_serde: "naive" # naive, kivi, or cachegen; choose naive to avoid (de)serialization overhead 58 | max_local_cpu_size: 20 # GB 59 | 60 | # Whether retrieve() is pipelined or not 61 | pipelined_backend: False 62 | ``` 63 | 64 | ### Start the LMSLocalBackend 65 | Use the dummy KV cache storage backend for the proof of concept. 66 | 67 | ```bash 68 | docker run -d -p 10080:8000 ghcr.io/bd-iaas-us/lmcache-lmslocalbackend:latest 0.0.0.0 8000 "cpu" 69 | ``` 70 | 71 | ### Start the Prefill 72 | Start the Prefill service with the DeepSeek-V2-Lite model, using Docker for proof-of-concept purposes.
73 | 74 | ```bash 75 | docker run -d -p 8010:8000 --gpus="device=0" -v ./example.yaml:/app/example.yaml -v ./DeepSeek-V2-Lite:/app/DeepSeek-V2-Lite --env "VLLM_MLA_DISABLE=1" --env "LMCACHE_CONFIG_FILE=/app/example.yaml" --env "LMCACHE_USE_EXPERIMENTAL=True" ghcr.io/bd-iaas-us/vllm-lmcache:latest /app/DeepSeek-V2-Lite --port 8000 --max-model-len 8192 --trust-remote-code --enforce-eager --gpu-memory-utilization 0.9 --swap-space 0 --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_producer"}' 76 | ``` 77 | 78 | ### Start the Decode 79 | 80 | Start the Decode service with the DeepSeek-V2-Lite model, using Docker for proof-of-concept purposes. 81 | 82 | ```bash 83 | docker run -d -p 8020:8000 --gpus="device=1" -v ./example.yaml:/app/example.yaml -v ./DeepSeek-V2-Lite:/app/DeepSeek-V2-Lite --env "VLLM_MLA_DISABLE=1" --env "LMCACHE_CONFIG_FILE=/app/example.yaml" --env "LMCACHE_USE_EXPERIMENTAL=True" ghcr.io/bd-iaas-us/vllm-lmcache:latest /app/DeepSeek-V2-Lite --port 8000 --max-model-len 8192 --trust-remote-code --enforce-eager --gpu-memory-utilization 0.9 --swap-space 0 --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_consumer"}' 84 | ``` 85 | 86 | ### Start the KvProxy 87 | #### Option 1 88 | For 1P1D (a single Prefill and a single Decode instance), passing the endpoints as environment variables is ideal for a proof of concept. 89 | ```bash 90 | docker run -p 8030:8000 --env "PREFILL_ENDPOINTS=http://192.168.0.1:8010" --env "DECODE_ENDPOINTS=http://192.168.0.1:8020" ghcr.io/bd-iaas-us/kvproxy:latest --host 0.0.0.0 --port 8000 91 | ``` 92 | 93 | #### Option 2 94 | For nPmD, list multiple comma-separated endpoints in a .env file; the proxy forwards requests using a round-robin strategy. 95 | 96 | ``` bash 97 | # .env file 98 | PREFILL_ENDPOINTS=http://192.168.0.1:8010 99 | DECODE_ENDPOINTS=http://192.168.0.1:8020 100 | ``` 101 | 102 | ```bash 103 | docker run -p 8030:8000 -v ./.env:/app/.env ghcr.io/bd-iaas-us/kvproxy:latest --host 0.0.0.0 --port 8000 104 | ``` 105 | 106 | ### Sample Request 107 | ```bash 108 | curl http://127.0.0.1:8030/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/app/DeepSeek-V2-Lite", "temperature": 0.75, "messages": [ {"role": "user", "content": "San Francisco"}], "max_tokens": 200}' 109 | curl http://127.0.0.1:8030/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/app/DeepSeek-V2-Lite", "temperature": 0.75, "messages": [ {"role": "user", "content": "San Francisco is a major city in California, United States, known for its iconic landmarks, cultural diversity, and technological innovation. Key features include:\n\n1. **Golden Gate Bridge**: A famous red suspension bridge and symbol of the city.\n2. **Alcatraz Island**: A former federal prison located in San Francisco "}], "max_tokens": 200}' 110 | ``` An equivalent Python client sketch is provided at the end of this README. 111 | 112 | ## Improvement 113 | 1. Prefix caching support in the Prefill service would help reduce TTFT latency. 114 | 2. Optimizing how the Prefill service stores the KV cache and how the Decode service retrieves it would help reduce TTFT and TPOT latency.
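
## Sample Python Client

The curl commands in the Sample Request section can also be issued from Python. The snippet below is a minimal illustrative sketch and is not part of this repo: it assumes the proxy started above is listening on http://127.0.0.1:8030, that the model is served from /app/DeepSeek-V2-Lite as in the sample requests, and that the httpx package (the same HTTP client the proxy itself uses) is installed on the client machine. Setting "stream": true lets you watch tokens arrive from the Decode service through the proxy.

```python
# client_example.py -- illustrative sketch only; file name and endpoint are assumptions.
import json

import httpx

payload = {
    "model": "/app/DeepSeek-V2-Lite",
    "temperature": 0.75,
    "messages": [{"role": "user", "content": "San Francisco"}],
    "max_tokens": 200,
    # The proxy still caps the prefill request at 1 token; the decode service streams the rest.
    "stream": True,
}

with httpx.Client(timeout=None) as client:
    with client.stream(
        "POST",
        "http://127.0.0.1:8030/v1/chat/completions",
        json=payload,
    ) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            # vLLM streams OpenAI-style SSE lines: "data: {...}" followed by "data: [DONE]".
            if not line.startswith("data: "):
                continue
            data = line[len("data: "):]
            if data.strip() == "[DONE]":
                break
            delta = json.loads(data)["choices"][0]["delta"].get("content")
            if delta:
                print(delta, end="", flush=True)
print()
```

Without `"stream": true`, the Decode service returns a single JSON body that the proxy relays unchanged, matching the behavior of the curl examples above.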
115 | -------------------------------------------------------------------------------- /kv-proxy/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.12 AS python 2 | 3 | RUN mkdir /app 4 | COPY requirements.txt /app/requirements.txt 5 | COPY proxy.py /app/proxy.py 6 | 7 | WORKDIR /app 8 | RUN --mount=type=cache,target=/root/.cache/pip \ 9 | python3 -m pip install -r requirements.txt 10 | 11 | ENTRYPOINT ["uvicorn", "proxy:app"] -------------------------------------------------------------------------------- /kv-proxy/proxy.py: -------------------------------------------------------------------------------- 1 | from dotenv import load_dotenv 2 | import asyncio 3 | from fastapi import FastAPI, Request 4 | from fastapi.responses import StreamingResponse 5 | import httpx 6 | import os 7 | 8 | # Initialize the FastAPI app 9 | app = FastAPI() 10 | 11 | load_dotenv() 12 | 13 | # Base URLs for the two vLLM processes (set to the root of the API) 14 | PREFILL_BASE_URLS = os.getenv("PREFILL_ENDPOINTS", "http://localhost:8060").split(",") 15 | DECODE_BASE_URLS = os.getenv("DECODE_ENDPOINTS", "http://localhost:8070").split(",") 16 | LISTEN_PORT = int(os.getenv("LISTEN_PORT", 8080)) 17 | LISTEN_HOST = os.getenv("LISTEN_HOST", "0.0.0.0") 18 | 19 | # Initialize variables to hold the persistent clients 20 | app.state.prefill_clients = None 21 | app.state.decode_clients = None 22 | 23 | counter = 0 24 | 25 | 26 | @app.on_event("startup") 27 | async def startup_event(): 28 | """ 29 | Initialize persistent HTTPX clients for vLLM services on startup. 30 | """ 31 | app.state.decode_clients = [ 32 | httpx.AsyncClient(timeout=None, base_url=url) for url in DECODE_BASE_URLS 33 | ] 34 | app.state.prefill_clients = [ 35 | httpx.AsyncClient(timeout=None, base_url=url) for url in PREFILL_BASE_URLS 36 | ] 37 | 38 | 39 | @app.on_event("shutdown") 40 | async def shutdown_event(): 41 | """ 42 | Close the persistent HTTPX clients on shutdown. 43 | """ 44 | for prefill_client in app.state.prefill_clients: 45 | await prefill_client.aclose() 46 | 47 | for decode_client in app.state.decode_clients: 48 | await decode_client.aclose() 49 | 50 | 51 | async def send_request_to_prefill( 52 | client: httpx.AsyncClient, url_path: str, req_data: dict 53 | ): 54 | """Send the request to the prefill service with generation capped at one token. 55 | Args: 56 | client (httpx.AsyncClient): The persistent HTTPX client. 57 | url_path (str): The URL path to send the request to. 58 | req_data (dict): The JSON payload to send. 59 | Returns: 60 | Response: The response from the prefill service. 61 | """ 62 | req_data = req_data.copy() 63 | req_data["max_tokens"] = 1 64 | req_data["max_completion_tokens"] = 1 65 | response = await client.post(url_path, json=req_data) # Correct endpoint path 66 | response.raise_for_status() 67 | return response 68 | 69 | 70 | async def send_request_to_decode( 71 | client: httpx.AsyncClient, url_path: str, req_data: dict 72 | ): 73 | """ 74 | Asynchronously stream the response from a vLLM process using a persistent client. 75 | 76 | Args: 77 | client (httpx.AsyncClient): The persistent HTTPX client. 78 | url_path (str): The URL path to send the request to. 79 | req_data (dict): The JSON payload to send. 80 | 81 | Yields: 82 | bytes: Chunks of the response data.
83 | """ 84 | async with client.stream( 85 | "POST", url_path, json=req_data 86 | ) as response: # Correct endpoint path 87 | response.raise_for_status() 88 | async for chunk in response.aiter_bytes(): 89 | yield chunk 90 | 91 | 92 | @app.post("/v1/chat/completions") 93 | @app.post("/v1/completions") 94 | async def proxy_request(request: Request): 95 | global counter 96 | """ 97 | Proxy endpoint that forwards requests to two vLLM services. 98 | 99 | Args: 100 | request (Request): The incoming HTTP request. 101 | 102 | Returns: 103 | StreamingResponse: The streamed response from the second vLLM service. 104 | """ 105 | counter += 1 106 | req_data = await request.json() 107 | req_path = request.url.path 108 | try: 109 | prefill_client = app.state.prefill_clients[ 110 | counter % len(app.state.prefill_clients) 111 | ] 112 | # Send request to prefill worker, ignore the response 113 | await send_request_to_prefill(prefill_client, req_path, req_data) 114 | 115 | decode_client = app.state.decode_clients[ 116 | counter % len(app.state.decode_clients) 117 | ] 118 | 119 | # Stream response from decode worker 120 | async def generate_stream(): 121 | async for chunk in send_request_to_decode( 122 | decode_client, req_path, req_data 123 | ): 124 | yield chunk 125 | 126 | return StreamingResponse(generate_stream(), media_type="application/json") 127 | except Exception as e: 128 | print(f"Error streaming response from vLLM: {e}") 129 | raise 130 | 131 | 132 | if __name__ == "__main__": 133 | import uvicorn 134 | 135 | uvicorn.run(app, host=LISTEN_HOST, port=LISTEN_PORT) 136 | -------------------------------------------------------------------------------- /kv-proxy/requirements.txt: -------------------------------------------------------------------------------- 1 | fastapi 2 | uvicorn 3 | python-dotenv 4 | httpx -------------------------------------------------------------------------------- /lmcache-lmslocalbackend/Dockerfile: -------------------------------------------------------------------------------- 1 | # The vLLM Dockerfile is used to construct vLLM image that can be directly used 2 | # to run the OpenAI compatible server. 
3 | 4 | # Please update any changes made here to 5 | # docs/source/dev/dockerfile/dockerfile.rst and 6 | # docs/source/assets/dev/dockerfile-stages-dependency.png 7 | 8 | ARG CUDA_VERSION=12.4.1 9 | #################### BASE BUILD IMAGE #################### 10 | # prepare basic build environment 11 | FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base 12 | ARG CUDA_VERSION=12.4.1 13 | ARG PYTHON_VERSION=3.12 14 | ENV DEBIAN_FRONTEND=noninteractive 15 | 16 | # Install Python and other dependencies 17 | RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \ 18 | && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \ 19 | && apt-get update -y \ 20 | && apt-get install -y ccache software-properties-common git curl sudo \ 21 | && add-apt-repository ppa:deadsnakes/ppa \ 22 | && apt-get update -y \ 23 | && apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python${PYTHON_VERSION}-venv \ 24 | && update-alternatives --install /usr/bin/python3 python3 /usr/bin/python${PYTHON_VERSION} 1 \ 25 | && update-alternatives --set python3 /usr/bin/python${PYTHON_VERSION} \ 26 | && ln -sf /usr/bin/python${PYTHON_VERSION}-config /usr/bin/python3-config \ 27 | && curl -sS https://bootstrap.pypa.io/get-pip.py | python${PYTHON_VERSION} \ 28 | && python3 --version && python3 -m pip --version 29 | 30 | # Workaround for https://github.com/openai/triton/issues/2507 and 31 | # https://github.com/pytorch/pytorch/issues/107960 -- hopefully 32 | # this won't be needed for future versions of this docker image 33 | # or future versions of triton. 34 | RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/ 35 | 36 | WORKDIR /workspace 37 | 38 | # install build and runtime dependencies 39 | COPY requirements-common.txt requirements-common.txt 40 | COPY requirements-cuda.txt requirements-cuda.txt 41 | RUN --mount=type=cache,target=/root/.cache/pip \ 42 | python3 -m pip install -r requirements-cuda.txt 43 | 44 | 45 | # cuda arch list used by torch 46 | # can be useful for both `dev` and `test` 47 | # explicitly set the list to avoid issues with torch 2.2 48 | # see https://github.com/pytorch/pytorch/pull/123243 49 | ARG torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0+PTX' 50 | ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list} 51 | # Override the arch list for flash-attn to reduce the binary size 52 | ARG vllm_fa_cmake_gpu_arches='80-real;90-real' 53 | ENV VLLM_FA_CMAKE_GPU_ARCHES=${vllm_fa_cmake_gpu_arches} 54 | #################### BASE BUILD IMAGE #################### 55 | 56 | #################### WHEEL BUILD IMAGE #################### 57 | FROM base AS build 58 | 59 | # install build dependencies 60 | COPY requirements-build.txt requirements-build.txt 61 | 62 | # max jobs used by Ninja to build extensions 63 | ARG max_jobs=2 64 | ENV MAX_JOBS=${max_jobs} 65 | # number of threads used by nvcc 66 | ARG nvcc_threads=8 67 | ENV NVCC_THREADS=$nvcc_threads 68 | 69 | RUN --mount=type=cache,target=/root/.cache/pip \ 70 | python3 -m pip install -r requirements-build.txt 71 | 72 | ARG LMCACHE_COMMIT_ID=1 73 | 74 | # PR https://github.com/LMCache/LMCache/pull/368 is pending to merge 75 | # TODO: remove the feature branch after PR is merged 76 | ENV FEATURE_BRANCH=qian/fix-hidden-states 77 | RUN git clone --branch ${FEATURE_BRANCH} --single-branch https://github.com/bd-iaas-us/LMCache.git 78 | RUN git clone https://github.com/LMCache/torchac_cuda.git 79 | 80 | 81 | WORKDIR /workspace/LMCache 82 | RUN --mount=type=cache,target=/root/.cache/ccache 
\ 83 | --mount=type=cache,target=/root/.cache/pip \ 84 | python3 setup.py bdist_wheel --dist-dir=dist_lmcache 85 | 86 | WORKDIR /workspace/torchac_cuda 87 | RUN --mount=type=cache,target=/root/.cache/ccache \ 88 | --mount=type=cache,target=/root/.cache/pip \ 89 | python3 setup.py bdist_wheel --dist-dir=/workspace/LMCache/dist_lmcache 90 | 91 | 92 | #################### vLLM installation IMAGE #################### 93 | # Install torchac_cuda wheel into the vLLM image 94 | FROM python:3.12 AS python 95 | RUN --mount=type=bind,from=build,src=/workspace/LMCache/dist_lmcache,target=/dist_lmcache \ 96 | --mount=type=cache,target=/root/.cache/pip \ 97 | pip install /dist_lmcache/*.whl --verbose 98 | 99 | ENTRYPOINT ["python3", "-m", "lmcache.experimental.server"] -------------------------------------------------------------------------------- /lmcache-lmslocalbackend/requirements-build.txt: -------------------------------------------------------------------------------- 1 | # Should be mirrored in pyproject.toml 2 | cmake>=3.26 3 | ninja 4 | packaging 5 | setuptools>=61 6 | setuptools-scm>=8 7 | torch==2.4.0 8 | wheel 9 | jinja2 10 | -------------------------------------------------------------------------------- /lmcache-lmslocalbackend/requirements-common.txt: -------------------------------------------------------------------------------- 1 | psutil 2 | sentencepiece # Required for LLaMA tokenizer. 3 | numpy < 2.0.0 4 | requests 5 | tqdm 6 | py-cpuinfo 7 | transformers >= 4.45.0 # Required for Llama 3.2. 8 | tokenizers >= 0.19.1 # Required for Llama 3. 9 | protobuf # Required by LlamaTokenizer. 10 | fastapi < 0.113.0; python_version < '3.9' 11 | fastapi >= 0.114.1; python_version >= '3.9' 12 | aiohttp 13 | openai >= 1.40.0 # Ensure modern openai package (ensure types module present) 14 | uvicorn[standard] 15 | pydantic >= 2.9 # Required for fastapi >= 0.113.0 16 | pillow # Required for image processing 17 | prometheus_client >= 0.18.0 18 | prometheus-fastapi-instrumentator >= 7.0.0 19 | tiktoken >= 0.6.0 # Required for DBRX tokenizer 20 | lm-format-enforcer == 0.10.6 21 | outlines >= 0.0.43, < 0.1 22 | typing_extensions >= 4.10 23 | filelock >= 3.10.4 # filelock starts to support `mode` argument from 3.10.4 24 | partial-json-parser # used for parsing partial JSON outputs 25 | pyzmq 26 | msgspec 27 | gguf == 0.10.0 28 | importlib_metadata 29 | mistral_common >= 1.4.3 30 | pyyaml 31 | six>=1.16.0; python_version > '3.11' # transitive dependency of pandas that needs to be the latest version for python 3.12 32 | setuptools>=74.1.1; python_version > '3.11' # Setuptools is used by triton, we need to ensure a modern version is installed for 3.12+ so that it does not try to import distutils, which was removed in 3.12 33 | einops # Required for Qwen2-VL. 34 | -------------------------------------------------------------------------------- /lmcache-lmslocalbackend/requirements-cuda.txt: -------------------------------------------------------------------------------- 1 | # Common dependencies 2 | -r requirements-common.txt 3 | 4 | # Dependencies for NVIDIA GPUs 5 | ray >= 2.9 6 | nvidia-ml-py # for pynvml package 7 | torch == 2.4.0 8 | # These must be updated alongside torch 9 | torchvision == 0.19 # Required for phi3v processor. 
See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version 10 | xformers == 0.0.27.post2; platform_system == 'Linux' and platform_machine == 'x86_64' # Requires PyTorch 2.4.0 11 | -------------------------------------------------------------------------------- /vllm-lmcache/Dockerfile: -------------------------------------------------------------------------------- 1 | # The vLLM Dockerfile is used to construct vLLM image that can be directly used 2 | # to run the OpenAI compatible server. 3 | 4 | # Please update any changes made here to 5 | # docs/source/dev/dockerfile/dockerfile.rst and 6 | # docs/source/assets/dev/dockerfile-stages-dependency.png 7 | 8 | ARG CUDA_VERSION=12.4.1 9 | #################### BASE BUILD IMAGE #################### 10 | # prepare basic build environment 11 | FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base 12 | ARG CUDA_VERSION=12.4.1 13 | ARG PYTHON_VERSION=3.12 14 | ENV DEBIAN_FRONTEND=noninteractive 15 | 16 | # Install Python and other dependencies 17 | RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \ 18 | && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \ 19 | && apt-get update -y \ 20 | && apt-get install -y ccache software-properties-common git curl sudo \ 21 | && add-apt-repository ppa:deadsnakes/ppa \ 22 | && apt-get update -y \ 23 | && apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python${PYTHON_VERSION}-venv \ 24 | && update-alternatives --install /usr/bin/python3 python3 /usr/bin/python${PYTHON_VERSION} 1 \ 25 | && update-alternatives --set python3 /usr/bin/python${PYTHON_VERSION} \ 26 | && ln -sf /usr/bin/python${PYTHON_VERSION}-config /usr/bin/python3-config \ 27 | && curl -sS https://bootstrap.pypa.io/get-pip.py | python${PYTHON_VERSION} \ 28 | && python3 --version && python3 -m pip --version 29 | 30 | # Workaround for https://github.com/openai/triton/issues/2507 and 31 | # https://github.com/pytorch/pytorch/issues/107960 -- hopefully 32 | # this won't be needed for future versions of this docker image 33 | # or future versions of triton. 34 | RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. 
-f1,2)/compat/ 35 | 36 | WORKDIR /workspace 37 | 38 | # install build and runtime dependencies 39 | COPY requirements-common.txt requirements-common.txt 40 | COPY requirements-cuda.txt requirements-cuda.txt 41 | RUN --mount=type=cache,target=/root/.cache/pip \ 42 | python3 -m pip install -r requirements-cuda.txt 43 | 44 | 45 | # cuda arch list used by torch 46 | # can be useful for both `dev` and `test` 47 | # explicitly set the list to avoid issues with torch 2.2 48 | # see https://github.com/pytorch/pytorch/pull/123243 49 | ARG torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0+PTX' 50 | ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list} 51 | # Override the arch list for flash-attn to reduce the binary size 52 | ARG vllm_fa_cmake_gpu_arches='80-real;90-real' 53 | ENV VLLM_FA_CMAKE_GPU_ARCHES=${vllm_fa_cmake_gpu_arches} 54 | #################### BASE BUILD IMAGE #################### 55 | 56 | #################### WHEEL BUILD IMAGE #################### 57 | FROM base AS build 58 | 59 | # install build dependencies 60 | COPY requirements-build.txt requirements-build.txt 61 | 62 | # max jobs used by Ninja to build extensions 63 | ARG max_jobs=2 64 | ENV MAX_JOBS=${max_jobs} 65 | # number of threads used by nvcc 66 | ARG nvcc_threads=8 67 | ENV NVCC_THREADS=$nvcc_threads 68 | 69 | RUN --mount=type=cache,target=/root/.cache/pip \ 70 | python3 -m pip install -r requirements-build.txt 71 | 72 | ARG LMCACHE_COMMIT_ID=1 73 | 74 | # PR https://github.com/LMCache/LMCache/pull/368 is pending to merge 75 | # TODO: remove the feature branch after PR is merged 76 | ENV FEATURE_BRANCH=qian/fix-hidden-states 77 | RUN git clone --branch ${FEATURE_BRANCH} --single-branch https://github.com/bd-iaas-us/LMCache.git 78 | RUN git clone https://github.com/LMCache/torchac_cuda.git 79 | 80 | 81 | WORKDIR /workspace/LMCache 82 | RUN --mount=type=cache,target=/root/.cache/ccache \ 83 | --mount=type=cache,target=/root/.cache/pip \ 84 | python3 setup.py bdist_wheel --dist-dir=dist_lmcache 85 | 86 | WORKDIR /workspace/torchac_cuda 87 | RUN --mount=type=cache,target=/root/.cache/ccache \ 88 | --mount=type=cache,target=/root/.cache/pip \ 89 | python3 setup.py bdist_wheel --dist-dir=/workspace/LMCache/dist_lmcache 90 | 91 | 92 | #################### vLLM installation IMAGE #################### 93 | # Install torchac_cuda wheel into the vLLM image 94 | FROM vllm/vllm-openai:v0.7.3 AS vllm-openai 95 | RUN --mount=type=bind,from=build,src=/workspace/LMCache/dist_lmcache,target=/vllm-workspace/dist_lmcache \ 96 | --mount=type=cache,target=/root/.cache/pip \ 97 | pip install dist_lmcache/*.whl --verbose 98 | 99 | ENV LMCACHE_USE_EXPERIMENTAL=True 100 | 101 | # Copy lmc_connector patch into vllm 102 | COPY patch/factory.py \ 103 | /usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/ 104 | COPY patch/lmcache_connector.py \ 105 | /usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/ 106 | 107 | # Use diff if file is too large 108 | COPY patch/parallel_state.patch \ 109 | /usr/local/lib/python3.12/dist-packages/vllm/distributed/ 110 | 111 | RUN patch /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py \ 112 | /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.patch 113 | 114 | ENTRYPOINT ["vllm", "serve"] 115 | -------------------------------------------------------------------------------- /vllm-lmcache/example_run.sh: -------------------------------------------------------------------------------- 1 | IMAGE=: 2 | docker run --runtime 
nvidia --gpus all \ 3 | --env "HF_TOKEN=" \ 4 | --env "LMCACHE_USE_EXPERIMENTAL=True" \ 5 | --env "chunk_size=256" \ 6 | --env "local_cpu=True" \ 7 | --env "max_local_cpu_size=5" \ 8 | -v ~/.cache/huggingface:/root/.cache/huggingface \ 9 | --network host \ 10 | --entrypoint "/usr/local/bin/vllm" \ 11 | $IMAGE \ 12 | serve mistralai/Mistral-7B-Instruct-v0.2 --kv-transfer-config \ 13 | '{"kv_connector":"LMCacheConnector","kv_role":"kv_both"}' \ 14 | --enable-chunked-prefill false 15 | -------------------------------------------------------------------------------- /vllm-lmcache/patch/factory.py: -------------------------------------------------------------------------------- 1 | # SPDX-License-Identifier: Apache-2.0 2 | 3 | import importlib 4 | from typing import TYPE_CHECKING, Callable, Dict, Type 5 | 6 | from .base import KVConnectorBase 7 | 8 | if TYPE_CHECKING: 9 | from vllm.config import VllmConfig 10 | 11 | 12 | class KVConnectorFactory: 13 | _registry: Dict[str, Callable[[], Type[KVConnectorBase]]] = {} 14 | 15 | @classmethod 16 | def register_connector(cls, name: str, module_path: str, 17 | class_name: str) -> None: 18 | """Register a connector with a lazy-loading module and class name.""" 19 | if name in cls._registry: 20 | raise ValueError(f"Connector '{name}' is already registered.") 21 | 22 | def loader() -> Type[KVConnectorBase]: 23 | module = importlib.import_module(module_path) 24 | return getattr(module, class_name) 25 | 26 | cls._registry[name] = loader 27 | 28 | @classmethod 29 | def create_connector(cls, rank: int, local_rank: int, 30 | config: "VllmConfig") -> KVConnectorBase: 31 | connector_name = config.kv_transfer_config.kv_connector 32 | if connector_name not in cls._registry: 33 | raise ValueError(f"Unsupported connector type: {connector_name}") 34 | 35 | connector_cls = cls._registry[connector_name]() 36 | return connector_cls(rank, local_rank, config) 37 | 38 | 39 | # Register various connectors here. 40 | # The registration should not be done in each individual file, as we want to 41 | # only load the files corresponding to the current connector. 42 | KVConnectorFactory.register_connector( 43 | "PyNcclConnector", 44 | "vllm.distributed.kv_transfer.kv_connector.simple_connector", 45 | "SimpleConnector") 46 | 47 | KVConnectorFactory.register_connector( 48 | "MooncakeConnector", 49 | "vllm.distributed.kv_transfer.kv_connector.simple_connector", 50 | "SimpleConnector") 51 | 52 | KVConnectorFactory.register_connector( 53 | "LMCacheConnector", 54 | "vllm.distributed.kv_transfer.kv_connector.lmcache_connector", 55 | "LMCacheConnector") 56 | -------------------------------------------------------------------------------- /vllm-lmcache/patch/lmcache_connector.py: -------------------------------------------------------------------------------- 1 | """ 2 | Simple KV Cache Connector for Distributed Machine Learning Inference 3 | 4 | The LMCacheConnector can (1) transfer KV caches between prefill vLLM worker 5 | (KV cache producer) and decode vLLM worker (KV cache consumer) using LMCache; 6 | (2) offload and share KV caches. Only (2) is supported for now. 
7 | """ 8 | 9 | from typing import TYPE_CHECKING, List, Tuple, Union 10 | 11 | import torch 12 | from vllm.config import VllmConfig 13 | from vllm.distributed.kv_transfer.kv_connector.base import KVConnectorBase 14 | from vllm.logger import init_logger 15 | from vllm.sequence import IntermediateTensors 16 | 17 | if TYPE_CHECKING: 18 | from vllm.worker.model_runner import ModelInputForGPUWithSamplingMetadata 19 | 20 | logger = init_logger(__name__) 21 | 22 | 23 | class LMCacheConnector(KVConnectorBase): 24 | 25 | def __init__( 26 | self, 27 | rank: int, 28 | local_rank: int, 29 | config: VllmConfig, 30 | ): 31 | 32 | self.transfer_config = config.kv_transfer_config 33 | self.vllm_config = config 34 | 35 | from lmcache.integration.vllm.vllm_adapter import ( 36 | RetrieveStatus, StoreStatus, init_lmcache_engine, 37 | lmcache_retrieve_kv, lmcache_should_store, lmcache_store_kv) 38 | 39 | logger.info("Initializing LMCacheConfig under kv_transfer_config %s", 40 | self.transfer_config) 41 | 42 | # TODO (Jiayi): Find model_config, parallel_config, and cache_config 43 | self.engine = init_lmcache_engine(config.model_config, 44 | config.parallel_config, 45 | config.cache_config) 46 | 47 | self.model_config = config.model_config 48 | self.parallel_config = config.parallel_config 49 | self.cache_config = config.cache_config 50 | self.lmcache_retrieve_kv = lmcache_retrieve_kv 51 | self.lmcache_store_kv = lmcache_store_kv 52 | self.lmcache_should_store = lmcache_should_store 53 | self.store_status = StoreStatus 54 | self.retrieve_status = RetrieveStatus 55 | 56 | def recv_kv_caches_and_hidden_states( 57 | self, model_executable: torch.nn.Module, 58 | model_input: "ModelInputForGPUWithSamplingMetadata", 59 | kv_caches: List[torch.Tensor] 60 | ) -> Tuple[Union[torch.Tensor, IntermediateTensors], bool, 61 | "ModelInputForGPUWithSamplingMetadata"]: 62 | 63 | # TODO(Jiayi): This shouldn't be none for disagg prefill 64 | hidden_or_intermediate_states = None 65 | 66 | # TODO (Jiayi): Only normal prefill is supported for now 67 | retrieve_status = [self.retrieve_status.PREFILL] 68 | 69 | model_input, hidden_or_intermediate_states, bypass_model_exec = \ 70 | self.lmcache_retrieve_kv( 71 | model_executable, model_input, self.cache_config, 72 | kv_caches, retrieve_status 73 | ) 74 | 75 | return hidden_or_intermediate_states, bypass_model_exec, model_input 76 | 77 | def send_kv_caches_and_hidden_states( 78 | self, 79 | model_executable: torch.nn.Module, 80 | model_input: "ModelInputForGPUWithSamplingMetadata", 81 | kv_caches: List[torch.Tensor], 82 | hidden_or_intermediate_states: Union[torch.Tensor, 83 | IntermediateTensors], 84 | ) -> None: 85 | # TODO (Jiayi): Only normal prefill is supported for now 86 | #store_status = [self.store_status.PREFILL] * num_reqs 87 | store_status = self.lmcache_should_store(model_input) 88 | self.lmcache_store_kv( 89 | self.model_config, 90 | self.parallel_config, 91 | self.cache_config, 92 | model_executable, 93 | model_input, 94 | kv_caches, 95 | store_status, 96 | hidden_or_intermediate_states, 97 | ) 98 | 99 | def close(self): 100 | self.engine.close() 101 | -------------------------------------------------------------------------------- /vllm-lmcache/patch/parallel_state.patch: -------------------------------------------------------------------------------- 1 | --- parallel_state_old.py 2025-03-04 00:00:00 +0000 2 | +++ parallel_state.py 2025-03-04 00:00:00 +0000 3 | @@ -919,7 +919,7 @@ 4 | return 5 | 6 | if all([ 7 | - vllm_config.kv_transfer_config.need_kv_parallel_group, 
_KV_TRANSFER 8 | + vllm_config.kv_transfer_config.is_kv_transfer_instance, _KV_TRANSFER 9 | is None 10 | ]): 11 | _KV_TRANSFER = kv_transfer.KVTransferAgent( 12 | -------------------------------------------------------------------------------- /vllm-lmcache/requirements-build.txt: -------------------------------------------------------------------------------- 1 | # Should be mirrored in pyproject.toml 2 | cmake>=3.26 3 | ninja 4 | packaging 5 | setuptools>=61 6 | setuptools-scm>=8 7 | torch==2.4.0 8 | wheel 9 | jinja2 10 | -------------------------------------------------------------------------------- /vllm-lmcache/requirements-common.txt: -------------------------------------------------------------------------------- 1 | psutil 2 | sentencepiece # Required for LLaMA tokenizer. 3 | numpy < 2.0.0 4 | requests 5 | tqdm 6 | py-cpuinfo 7 | transformers >= 4.45.0 # Required for Llama 3.2. 8 | tokenizers >= 0.19.1 # Required for Llama 3. 9 | protobuf # Required by LlamaTokenizer. 10 | fastapi < 0.113.0; python_version < '3.9' 11 | fastapi >= 0.114.1; python_version >= '3.9' 12 | aiohttp 13 | openai >= 1.40.0 # Ensure modern openai package (ensure types module present) 14 | uvicorn[standard] 15 | pydantic >= 2.9 # Required for fastapi >= 0.113.0 16 | pillow # Required for image processing 17 | prometheus_client >= 0.18.0 18 | prometheus-fastapi-instrumentator >= 7.0.0 19 | tiktoken >= 0.6.0 # Required for DBRX tokenizer 20 | lm-format-enforcer == 0.10.6 21 | outlines >= 0.0.43, < 0.1 22 | typing_extensions >= 4.10 23 | filelock >= 3.10.4 # filelock starts to support `mode` argument from 3.10.4 24 | partial-json-parser # used for parsing partial JSON outputs 25 | pyzmq 26 | msgspec 27 | gguf == 0.10.0 28 | importlib_metadata 29 | mistral_common >= 1.4.3 30 | pyyaml 31 | six>=1.16.0; python_version > '3.11' # transitive dependency of pandas that needs to be the latest version for python 3.12 32 | setuptools>=74.1.1; python_version > '3.11' # Setuptools is used by triton, we need to ensure a modern version is installed for 3.12+ so that it does not try to import distutils, which was removed in 3.12 33 | einops # Required for Qwen2-VL. 34 | -------------------------------------------------------------------------------- /vllm-lmcache/requirements-cuda.txt: -------------------------------------------------------------------------------- 1 | # Common dependencies 2 | -r requirements-common.txt 3 | 4 | # Dependencies for NVIDIA GPUs 5 | ray >= 2.9 6 | nvidia-ml-py # for pynvml package 7 | torch == 2.4.0 8 | # These must be updated alongside torch 9 | torchvision == 0.19 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version 10 | xformers == 0.0.27.post2; platform_system == 'Linux' and platform_machine == 'x86_64' # Requires PyTorch 2.4.0 11 | --------------------------------------------------------------------------------