├── README.md
├── create-cluster.sh
├── vllm-client.yaml
├── vllm-deploy-lws-deepseek.yaml
├── vllm-deploy-mistral-h100.yaml
├── vllm-deploy-tpu.yaml
├── vllm-deploy.yaml
└── webapp
    ├── Dockerfile
    ├── cloudbuild.yaml
    └── src
        ├── app.py
        ├── falcon.jpeg
        └── requirements.txt

/README.md:
--------------------------------------------------------------------------------
1 | # Serving Open Source LLMs on GKE using the vLLM framework
2 |
3 | This post shows how to serve open source LLMs (Mistral 7B, Llama 2, etc.) on NVIDIA GPUs (L4 or Tesla T4, for example) running on Google Kubernetes Engine (GKE). It will help you understand the AI/ML-ready features of GKE and how to use them to serve large language models, making self-management of OSS LLMs on GKE less daunting than you may originally think.
4 |
5 |
6 | TGI and vLLM are two common frameworks that address a significant challenge in LLM serving: the slow latency of getting responses back from a model, driven primarily by the ever-increasing size of LLMs.
7 |
8 | **vLLM** is a framework designed to enhance the inference and serving speed of LLMs. It has demonstrated remarkable performance improvements compared to mainstream frameworks like Hugging Face's Transformers, primarily because of a highly innovative algorithm at its core.
9 |
10 | One key reason behind vLLM's speed during inference is its use of the **PagedAttention** technique. In traditional attention mechanisms, the computed keys and values are stored in GPU memory as a KV cache. This cache holds the attention keys and values for previous tokens, which can consume a significant amount of memory, especially for large models and long sequences, and these keys and values are stored contiguously.
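To make the PagedAttention idea above more concrete, here is a toy Python sketch of block-based KV-cache bookkeeping. This is a conceptual illustration only, not vLLM's actual implementation: each sequence is given fixed-size blocks on demand from a shared pool (much like virtual-memory paging), instead of reserving one large contiguous region up front.

```
# Toy illustration of block-based (paged) KV-cache bookkeeping.
# Conceptual sketch only -- NOT vLLM's real implementation.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                # tokens stored per block
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                           # seq_id -> number of tokens cached

    def append_token(self, seq_id: int):
        """Reserve cache space for one new token; returns (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:           # current block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())    # grab any free block; no contiguity required
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free_sequence(self, seq_id: int):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):                                  # two sequences growing token by token
    cache.append_token(seq_id=0)
    cache.append_token(seq_id=1)
print(cache.block_tables)  # {0: [7, 5], 1: [6, 4]} -- blocks need not be adjacent
```

Because blocks are allocated only as sequences actually grow and are returned to the pool as soon as a sequence finishes, far less GPU memory sits idle, which is what lets vLLM batch many more concurrent requests.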
11 |
12 | **TGI (Text Generation Inference)** is another solution aimed at increasing the speed of LLM inference. It offers high-performance text generation using tensor parallelism and dynamic batching for popular open-source LLMs like StarCoder, BLOOM, Llama, and other models.
13 |
14 | **TGI vs vLLM**
15 |
16 | - TGI does not support paged optimization.
17 |
18 | - Neither technique handles all LLM architectures.
19 |
20 | - TGI also allows quantizing and fine-tuning models, which are not supported by vLLM.
21 |
22 | - vLLM achieves better performance than TGI and the Hugging Face Transformers library, with up to 24x higher throughput compared to Hugging Face and up to 3.5x higher throughput than TGI.
23 | This post shows how to serve OSS LLMs (Mistral 7B or Llama 2) on L4 GPUs running on Google Kubernetes Engine (GKE). It will help you understand the AI/ML-ready features of GKE and how to use them to serve large language models.
24 |
25 |
26 | GKE is a fully managed service that allows you to run containerized workloads on Google Cloud. It's a great choice for running large language models and AI/ML workloads because it is easy to set up, it's secure, and it comes with AI/ML batteries included. GKE installs the latest NVIDIA GPU drivers for you in GPU-enabled node pools, and gives you autoscaling and partitioning capabilities for GPUs out of the box, so you can easily scale your workloads to the size you need while keeping costs under control.
27 |
28 |
29 | For an example of how to serve an open source model (Mistral 7B) on GKE using TGI, please refer to this Google community blog:
30 | https://medium.com/google-cloud/serving-mistral-7b-on-l4-gpus-running-on-gke-25c6041dff27
31 |
32 |
33 | ## Open source models supported
34 | Llama 2, Mistral, Falcon, and more; see the full list at https://docs.vllm.ai/en/latest/models/supported_models.html
35 |
36 | ## Prerequisites
37 |
38 | Access to a Google Cloud project with L4 GPUs available and enough quota in the region you select.
39 | A computer terminal with kubectl and the Google Cloud SDK installed. From the GCP project console you'll be working with, you may want to use the included Cloud Shell, as it already has the required tools installed.
40 | Some models, such as Llama 2, need a Hugging Face API token to download the model files.
41 | Meta access request: https://ai.meta.com/resources/models-and-libraries/llama-downloads/ (you need to register an email address to download).
42 |
43 | Go to Hugging Face and create an account with the same email you registered in the Meta request. Then find the Llama 2 model and fill out the access request: https://huggingface.co/meta-llama/Llama-2-7b. You may need to wait a few hours for the approval email before you can use Llama 2.
44 |
45 | Get a Hugging Face access token from your Hugging Face account profile settings.
46 |
47 | ## Set up the project environment
48 |
49 | From your console, select the Google Cloud region and project, checking that there's availability for L4 GPUs in the one you end up selecting. The region used in this tutorial is us-central1, where at the time of writing there was availability for L4 GPUs (alternatively, you can choose another region with a different GPU accelerator type available):
50 |
51 | ```
52 | export PROJECT_ID=
53 | export REGION=us-central1
54 | export ZONE_1=${REGION}-a # You may want to change the zone letter based on the region you selected above
55 | export ZONE_2=${REGION}-b # You may want to change the zone letter based on the region you selected above
56 | export CLUSTER_NAME=vllm-serving-cluster
57 | export NAMESPACE=vllm
58 | gcloud config set project "$PROJECT_ID"
59 | gcloud config set compute/region "$REGION"
60 | gcloud config set compute/zone "$ZONE_1"
61 | ```
62 | Then, enable the required APIs to create a GKE cluster:
63 | ```
64 | gcloud services enable compute.googleapis.com container.googleapis.com
65 | ```
66 |
67 | Also, go ahead and download the source code repo for this exercise:
68 | ```
69 | git clone https://github.com/llm-on-gke/vllm-inference.git
70 | cd vllm-inference
71 | ```
72 |
73 |
74 | In this exercise, you will be using the default compute service account to create the cluster. You need to grant it the required permissions to store metrics and logs in Cloud Monitoring, which you will be using later on:
75 |
76 | ```
77 | PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)')
78 | GCE_SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"
79 | for role in monitoring.metricWriter stackdriver.resourceMetadata.writer; do
80 |   gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:${GCE_SA} --role=roles/${role}
81 | done
82 | ```
83 | ## Create GKE Cluster and Node Pools
84 |
85 | ### Quick estimate of the GPU type and number of GPUs needed for model inference
86 | Estimate the size of a model in gigabytes by multiplying the number of parameters (in billions) by 2.
This approach is based on a simple formula: with each parameter using 16 bits (2 bytes) of memory in half precision, the memory usage in GB is approximately twice the number of parameters in billions. Therefore, a 7B parameter model, for instance, will take up approximately 14 GB of memory. We can comfortably run a 7B parameter model on an NVIDIA L4 (24 GB) and still have about 10 GB of memory remaining as a buffer for inferencing. Alternatively, you can use two Tesla T4 GPUs with 32 GB in total by sharding the model across both GPUs, but there will be some overhead from moving data between them.
87 |
88 | For models with a larger parameter count, resource requirements can be reduced through weight quantization into lower-precision bits.
89 | For example, the Llama 2 70B model may need about 140 GB of memory at the default half precision (16 bits); this can be reduced by quantizing to 8-bit (about 70 GB) or even 4-bit precision, which needs only about 35 GB of memory and can fit into 2 L4 GPUs (48 GB total).
90 | Reference: https://www.baseten.co/blog/llm-transformer-inference-guide/
91 |
92 | ### GKE Cluster
93 |
94 | Now, create a GKE cluster with a minimal default node pool, as you will be adding a node pool with L4 GPUs later on:
95 | ```
96 | gcloud container clusters create $CLUSTER_NAME \
97 |   --location "$REGION" \
98 |   --workload-pool "${PROJECT_ID}.svc.id.goog" \
99 |   --enable-image-streaming --enable-shielded-nodes \
100 |   --shielded-secure-boot --shielded-integrity-monitoring \
101 |   --enable-ip-alias \
102 |   --node-locations="$ZONE_1" \
103 |   --workload-pool="${PROJECT_ID}.svc.id.goog" \
104 |   --addons GcsFuseCsiDriver \
105 |   --no-enable-master-authorized-networks \
106 |   --machine-type n2d-standard-4 \
107 |   --num-nodes 1 --min-nodes 1 --max-nodes 5 \
108 |   --ephemeral-storage-local-ssd=count=2 \
109 |   --enable-ip-alias
110 | ```
111 |
112 | ### Node pool
113 |
114 | Create an additional node pool with Spot VMs (we use Spot to illustrate cost savings), each node with one L4 GPU:
115 | ```
116 | gcloud container node-pools create g2-standard-24 --cluster $CLUSTER_NAME \
117 |   --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
118 |   --machine-type g2-standard-8 \
119 |   --ephemeral-storage-local-ssd=count=1 \
120 |   --enable-autoscaling --enable-image-streaming \
121 |   --num-nodes=1 --min-nodes=0 --max-nodes=2 \
122 |   --shielded-secure-boot \
123 |   --shielded-integrity-monitoring \
124 |   --node-locations $ZONE_1,$ZONE_2 --region $REGION --spot
125 | ```
126 | Note how easy enabling GPUs in GKE is. Just adding the --accelerator option automatically bootstraps the nodes with the necessary drivers and configuration so your workloads can start using the GPUs attached to the cluster nodes.
If you want to try Tesla T4 GPUs instead, update the --accelerator and --machine-type parameter values, for example:
127 | --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=latest
128 | --machine-type n1-standard-8
129 |
130 | After a few minutes, check that the node pool was created correctly:
131 | ```
132 | gcloud container node-pools list --region $REGION --cluster $CLUSTER_NAME
133 | ```
134 |
135 | Also, check that the corresponding nodes in the g2-standard-24 node pool have the GPUs available:
136 | ```
137 | kubectl get nodes -o json | jq -r '.items[] | {name:.metadata.name, gpus:.status.capacity."nvidia.com/gpu"}'
138 | ```
139 | You should get a node with one GPU available, corresponding to the node pool you just created:
140 |
141 | {
142 |   "name": "vllm-serving-cluster-g2-standard-24-XXXXXX",
143 |   "gpus": "1"
144 | }
145 |
146 | Run the following commands to set up Workload Identity and IAM roles:
147 | ```
148 | kubectl create ns $NAMESPACE
149 | kubectl create serviceaccount $NAMESPACE --namespace $NAMESPACE
150 | gcloud iam service-accounts add-iam-policy-binding $GCE_SA \
151 |   --role roles/iam.workloadIdentityUser \
152 |   --member "serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${NAMESPACE}]"
153 |
154 | kubectl annotate serviceaccount $NAMESPACE \
155 |   --namespace $NAMESPACE \
156 |   iam.gke.io/gcp-service-account=$GCE_SA
157 | ```
158 |
159 | ## Deploy model to GKE cluster
160 | We're now ready to deploy the model.
161 | Save the following as vllm-deploy.yaml:
162 | ```
163 | apiVersion: apps/v1
164 | kind: Deployment
165 | metadata:
166 |   name: vllm-server
167 |   labels:
168 |     app: vllm-server
169 | spec:
170 |   replicas: 1
171 |   selector:
172 |     matchLabels:
173 |       app: vllm-inference-server
174 |   template:
175 |     metadata:
176 |       labels:
177 |         app: vllm-inference-server
178 |     spec:
179 |       volumes:
180 |       - name: cache
181 |         emptyDir: {}
182 |       - name: dshm
183 |         emptyDir:
184 |           medium: Memory
185 |       nodeSelector:
186 |         cloud.google.com/gke-accelerator: nvidia-l4
187 |       serviceAccountName: vllm
188 |       containers:
189 |       - name: vllm-inference-server
190 |         image: vllm/vllm-openai
191 |         imagePullPolicy: IfNotPresent
192 |
193 |         resources:
194 |           limits:
195 |             nvidia.com/gpu: 1
196 |         env:
197 |         - name: HUGGING_FACE_HUB_TOKEN
198 |           valueFrom:
199 |             secretKeyRef:
200 |               name: huggingface
201 |               key: HF_TOKEN
202 |         - name: TRANSFORMERS_CACHE
203 |           value: /.cache
204 |         - name: shm-size
205 |           value: 1g
206 |         command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
207 |         args: ["--model=meta-llama/Llama-2-7b-hf",
208 |                "--gpu-memory-utilization=0.95",
209 |                "--disable-log-requests",
210 |                "--trust-remote-code",
211 |                "--port=8000",
212 |                "--tensor-parallel-size=1"]
213 |         ports:
214 |         - containerPort: 8000
215 |           name: http
216 |         securityContext:
217 |           runAsUser: 1000
218 |         volumeMounts:
219 |         - mountPath: /dev/shm
220 |           name: dshm
221 |         - mountPath: /.cache
222 |           name: cache
223 |
224 | ---
225 | apiVersion: v1
226 | kind: Service
227 | metadata:
228 |   name: vllm-inference-server
229 |   annotations:
230 |     cloud.google.com/neg: '{"ingress": true}'
231 |   labels:
232 |     app: vllm-inference-server
233 | spec:
234 |   type: NodePort
235 |   ports:
236 |   - port: 8000
237 |     targetPort: http
238 |     name: http-inference-server
239 |
240 |   selector:
241 |     app: vllm-inference-server
242 |
243 | ---
244 | apiVersion: networking.k8s.io/v1
245 | kind: Ingress
246 | metadata:
247 |   name: vllm-ingress
248 |   annotations:
249 |     kubernetes.io/ingress.class: "gce"
250 |     kubernetes.io/ingress.global-static-ip-name: "ingress-vllm"
"ingress-vllm" 251 | spec: 252 | rules: 253 | - http: 254 | paths: 255 | - path: "/" 256 | pathType: Prefix 257 | backend: 258 | service: 259 | name: vllm-inference-server 260 | port: 261 | number: 8000 262 | ``` 263 | 264 | 265 | Notes: 266 | We include kubernetes resource templates for a deployment, service, and Ingress 267 | We use official container image to run the modles: vllm/vllm-openai 268 | Huggingface token setup, 269 | Execute the command to deploy inference deployment in GKE, update the HF_TOKEN values 270 | 271 | ``` 272 | gcloud container clusters get-credentials $CLUSTER_NAME $REGION 273 | export HF_TOKEN= 274 | kubectl create secret generic huggingface --from-literal="HF_TOKEN=$HF_TOKEN" -n $NAMESPACE 275 | ``` 276 | This GKE huggingface secrect is used to set the environment value in gke-deploy.yaml( need to keep the name: HUGGING_FACE_HUB_TOKEN ): 277 | env: 278 | - name: HUGGING_FACE_HUB_TOKEN 279 | valueFrom: 280 | secretKeyRef: 281 | name: huggingface 282 | key: HF_TOKEN 283 | 284 | 285 | ## vLLM model config parameters: 286 | Update vllm-deploy.yml file as described earlier, 287 | 288 | You can override the command and arguments, 289 | 290 | if you prefer to Langchain and OpenAI integration in applications, you can use this: 291 | 292 | command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] 293 | 294 | or you can use different entrypoint for native vLLM APIs: 295 | 296 | command: ["python3", "-m", "vllm.entrypoints.api_server"] 297 | 298 | To understand the vLLM model related arguments, See this doc: https://docs.vllm.ai/en/latest/models/engine_args.html 299 | 300 | You can adjust the model related parameters in args settings in gke-deploy.yaml 301 | 302 | --model=ModelNameFromHuggingFace, replace with specific models from Huggingface, meta-llama/Llama-2-7b-hf, meta-llama/Llama-2-13b-hf, mistralai/Mistral-7B-v0.1, tiiuae/falcon-7b 303 | 304 | If you use Vertex vLLM image, --model value you can be full Cloud Storage path of model files, e.g., gs://vertex-model-garden-public-us-central1/llama2/llama2-13b-hf 305 | 306 | ## Deploy the model to GKE 307 | After vllm-deploy.yaml file been updated with proper settings, execute the followin command: 308 | ``` 309 | kubectl apply -f vllm-deploy.yaml -n $NAMESPACE 310 | ``` 311 | The following GKE artifacts will be created: 312 | a. vllm-server deployment 313 | b. Ingress 314 | b. Service with endpoint of LLM APIs, routing traffic through Ingress 315 | 316 | Check all the objects you’ve just created: 317 | 318 | kubectl get all 319 | Check that the pod has been correctly scheduled in one of the nodes in the g2-standard-8 node pool that has the GPUs available: 320 | 321 | 322 | ## Tests 323 | 324 | Simplely run the following command to get the cluster ip: 325 | ``` 326 | kubectl get service/vllm-server -o jsonpath='{.spec.clusterIP}' -n $NAMESPACE 327 | ``` 328 | 329 | Then use the following curl command to test inside the Cluster(update the cluster IP first): 330 | ``` 331 | kubectl run curl --image=curlimages/curl \ 332 | -it --rm --restart=Never \ 333 | -- "$CLUSTERIP:8000/v1/models" 334 | 335 | curl http://ClusterIP:8000/v1/completions \ 336 | -H "Content-Type: application/json" \ 337 | -d '{ 338 | "model": "google/gemma-1.1-2b-it", 339 | "prompt": "San Francisco is a", 340 | "max_tokens": 250, 341 | "temperature": 0.1 342 | }' 343 | ``` 344 | 345 | ## Deploy WebApp 346 | 347 | Siince vLLM can expose different model as OpenAI style APIs, different models will be transparent applications how to access LLM models. 
348 |
349 | The sample app provided uses the LangChain VLLMOpenAI wrapper to initialize any model deployed through vLLM, with the following Python code:
350 |
351 | import gradio as gr
352 | import requests
353 | import os
354 | from langchain_community.llms import VLLMOpenAI
355 | llm_url = os.environ.get('LLM_URL')
356 | llm_name = os.environ.get('LLM_NAME')
357 | llm = VLLMOpenAI(
358 |     openai_api_key="EMPTY",
359 |     openai_api_base=f"{llm_url}/v1",
360 |     model_name=f"{llm_name}",
361 |     model_kwargs={"stop": ["."]},
362 | )
363 |
364 | Note: you don't need to run this code here.
365 |
366 | You need to build the webapp container image first so that you can deploy the webapp to the cluster.
367 | Update (but don't run) the cloudbuild.yaml file under the webapp directory, replacing the Artifact Registry repository path with your own:
368 |
369 | steps:
370 | - name: 'gcr.io/cloud-builders/docker'
371 |   args: [ 'build', '-t', 'us-east1-docker.pkg.dev/$PROJECT_ID/gke-llm/vllm-client:latest', '.' ]
372 | images:
373 | - 'us-east1-docker.pkg.dev/$PROJECT_ID/gke-llm/vllm-client:latest'
374 |
375 |
376 | Then run the build of the sample app container image using Cloud Build:
377 |
378 | ```
379 | cd webapp
380 | gcloud builds submit .
381 | ```
382 |
383 | Then update the vllm-client.yaml file:
384 | - name: gradio
385 |   image: us-east1-docker.pkg.dev/PROJECT_ID/gke-llm/vllm-client:latest
386 |   env:
387 |   - name: LLM_URL
388 |     value: "http://ClusterIP:8000"
389 |   - name: LLM_NAME
390 |     value: "meta-llama/Llama-2-7b-hf"
391 |
392 | a. image: URI, replace with your own vllm-client image
393 | b. LLM server and name settings:
394 | - name: LLM_URL
395 |   value: "http://ClusterIP:8000" (replace with the full LLM service endpoint, including port)
396 | - name: LLM_NAME
397 |   value: "meta-llama/Llama-2-7b-hf" (replace with the model you deployed earlier)
398 |
399 | Run the following commands to deploy the webapp and get its external IP:
400 | kubectl apply -f vllm-client.yaml -n $NAMESPACE
401 | kubectl get service/vllm-client-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}' -n $NAMESPACE
402 |
403 | ## Validations
404 |
405 | Go to the external IP for the webapp, http://externalIP:8080,
406 | and test a few questions from the web application.
407 |
408 | ## Cleanups
409 | Don't forget to clean up the resources created in this article once you've finished experimenting with GKE and Mistral 7B, as keeping the cluster running for a long time can incur significant costs. To clean up, you just need to delete the GKE cluster:
410 |
411 | gcloud container clusters delete $CLUSTER_NAME --region $REGION
412 |
413 | ## Conclusion
414 | This post demonstrates how deploying open source LLMs such as Mistral 7B and Llama 2 7B/13B using vLLM on GKE is flexible and straightforward.
415 | Self-managing LLM operations and AI/ML workloads on GKE, Google Cloud's flagship managed Kubernetes platform, enables deploying LLMs in production and brings MLOps one step closer to existing platform teams with expertise in those managed platforms. Also, given the resources that are consumed, and the number of potential applications using AI/ML features moving forward, having a framework that offers scalability and cost control features simplifies adoption.
416 |
417 | Don't forget to check out other GKE-related resources on AI/ML infrastructure offered by Google Cloud.
418 | 419 | 420 | -------------------------------------------------------------------------------- /create-cluster.sh: -------------------------------------------------------------------------------- 1 | # for L4 and spot node-pools 2 | export PROJECT_ID= 3 | export HF_TOKEN= 4 | 5 | export REGION=us-central1 6 | export ZONE_1=${REGION}-a # You may want to change the zone letter based on the region you selected above 7 | export ZONE_2=${REGION}-b # You may want to change the zone letter based on the region you selected above 8 | export CLUSTER_NAME=vllm-cluster 9 | export NAMESPACE=vllm 10 | gcloud config set project "$PROJECT_ID" 11 | gcloud config set compute/region "$REGION" 12 | gcloud config set compute/zone "$ZONE_1" 13 | 14 | #autopilot 15 | gcloud container clusters create-auto ${CLUSTER_NAME} \ 16 | --project=${PROJECT_ID} \ 17 | --region=${REGION} 18 | 19 | gcloud container clusters create $CLUSTER_NAME --location ${REGION} \ 20 | --workload-pool ${PROJECT_ID}.svc.id.goog \ 21 | --enable-image-streaming --enable-shielded-nodes \ 22 | --shielded-secure-boot --shielded-integrity-monitoring \ 23 | --enable-ip-alias \ 24 | --node-locations=$REGION-b \ 25 | --workload-pool=${PROJECT_ID}.svc.id.goog \ 26 | --addons GcsFuseCsiDriver \ 27 | --no-enable-master-authorized-networks \ 28 | --machine-type n2d-standard-4 \ 29 | --cluster-version 1.27.5-gke.200 \ 30 | --num-nodes 1 --min-nodes 1 --max-nodes 3 \ 31 | --ephemeral-storage-local-ssd=count=2 \ 32 | --scopes="gke-default,storage-rw" 33 | 34 | PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)') 35 | GCE_SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" 36 | for role in monitoring.metricWriter stackdriver.resourceMetadata.writer; do 37 | gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:${GCE_SA} --role=roles/${role} 38 | done 39 | 40 | gcloud container node-pools create vllm-inference-pool --cluster \ 41 | $CLUSTER_NAME --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest --machine-type g2-standard-8 \ 42 | --ephemeral-storage-local-ssd=count=1 --enable-autoscaling --enable-image-streaming --num-nodes=0 --min-nodes=0 --max-nodes=3 \ 43 | --shielded-secure-boot --shielded-integrity-monitoring --node-version=1.27.5-gke.200 --node-locations $ZONE_1,$ZONE_2 --region $REGION --spot 44 | 45 | kubectl create ns $NAMESPACE 46 | kubectl create serviceaccount $NAMESPACE --namespace $NAMESPACE 47 | gcloud iam service-accounts add-iam-policy-binding $GCE_SA \ 48 | --role roles/iam.workloadIdentityUser \ 49 | --member "serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${NAMESPACE}]" 50 | 51 | kubectl annotate serviceaccount $NAMESPACE \ 52 | --namespace $NAMESPACE \ 53 | iam.gke.io/gcp-service-account=$GCE_SA 54 | 55 | kubectl create secret generic huggingface --from-literal="HF_TOKEN=$HF_TOKEN" -n $NAMESPACE 56 | 57 | gcloud beta container clusters update $CLUSTER_NAME --update-addons=HttpLoadBalancing=ENABLED --region $REGION -------------------------------------------------------------------------------- /vllm-client.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: vllm-client 5 | spec: 6 | selector: 7 | matchLabels: 8 | app: vllm-client 9 | template: 10 | metadata: 11 | labels: 12 | app: vllm-client 13 | spec: 14 | containers: 15 | - name: gradio 16 | image: us-east1-docker.pkg.dev/rick-vertex-ai/gke-llm/vllm-client:latest 17 | env: 18 | - name: LLM_URL 
19 | value: "http://CLusterIP:8000" 20 | - name: LLM_NAME 21 | value: "meta-llama/Llama-3.1-8B-Instruct" 22 | - name: MAX_TOKENS 23 | value: "400" 24 | - name: APIGEE_HOST 25 | value: "" 26 | - name: APIKEY 27 | value: "" 28 | resources: 29 | requests: 30 | memory: "128Mi" 31 | cpu: "250m" 32 | limits: 33 | memory: "256Mi" 34 | cpu: "500m" 35 | ports: 36 | - containerPort: 7860 37 | --- 38 | apiVersion: v1 39 | kind: Service 40 | metadata: 41 | name: vllm-client-service 42 | spec: 43 | type: LoadBalancer 44 | selector: 45 | app: vllm-client 46 | ports: 47 | - port: 8080 48 | targetPort: 7860 -------------------------------------------------------------------------------- /vllm-deploy-lws-deepseek.yaml: -------------------------------------------------------------------------------- 1 | 2 | apiVersion: leaderworkerset.x-k8s.io/v1 3 | kind: LeaderWorkerSet 4 | metadata: 5 | name: vllm-dps 6 | spec: 7 | replicas: 1 8 | leaderWorkerTemplate: 9 | size: 2 10 | restartPolicy: RecreateGroupOnPodRestart 11 | leaderTemplate: 12 | metadata: 13 | labels: 14 | role: leader 15 | spec: 16 | nodeSelector: 17 | cloud.google.com/gke-accelerator: nvidia-h100-80gb 18 | containers: 19 | - name: vllm-leader 20 | image: us-east1-docker.pkg.dev/northam-ce-mlai-tpu/gke-llm/vllm-lws:latest 21 | env: 22 | - name: HUGGING_FACE_HUB_TOKEN 23 | valueFrom: 24 | secretKeyRef: 25 | name: huggingface 26 | key: HF_TOKEN 27 | command: 28 | - sh 29 | - -c 30 | - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 31 | huggingface-cli login --token $HUGGING_FACE_HUB_TOKEN ; 32 | huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir /models; 33 | python3 -m vllm.entrypoints.openai.api_server --port 8080 --model /models --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust_remote_code" 34 | resources: 35 | limits: 36 | nvidia.com/gpu: "8" 37 | ports: 38 | - containerPort: 8080 39 | readinessProbe: 40 | tcpSocket: 41 | port: 8080 42 | initialDelaySeconds: 15 43 | periodSeconds: 10 44 | volumeMounts: 45 | - mountPath: /dev/shm 46 | name: dshm 47 | - mountPath: /models 48 | name: models 49 | volumes: 50 | - name: dshm 51 | emptyDir: 52 | medium: Memory 53 | sizeLimit: 15Gi 54 | - name: models 55 | emptyDir: {} 56 | workerTemplate: 57 | spec: 58 | nodeSelector: 59 | cloud.google.com/gke-accelerator: nvidia-h100-80gb 60 | containers: 61 | - name: vllm-worker 62 | image: us-east1-docker.pkg.dev/northam-ce-mlai-tpu/gke-llm/vllm-lws:latest 63 | command: 64 | - sh 65 | - -c 66 | - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)" 67 | resources: 68 | limits: 69 | nvidia.com/gpu: "8" 70 | env: 71 | - name: HUGGING_FACE_HUB_TOKEN 72 | valueFrom: 73 | secretKeyRef: 74 | name: huggingface 75 | key: HF_TOKEN 76 | volumeMounts: 77 | - mountPath: /dev/shm 78 | name: dshm 79 | - mountPath: /models 80 | name: models 81 | volumes: 82 | - name: dshm 83 | emptyDir: 84 | medium: Memory 85 | sizeLimit: 15Gi 86 | - name: models 87 | emptyDir: {} 88 | --- 89 | apiVersion: v1 90 | kind: Service 91 | metadata: 92 | name: vllm-leader 93 | spec: 94 | ports: 95 | - name: http 96 | port: 8080 97 | protocol: TCP 98 | targetPort: 8080 99 | selector: 100 | leaderworkerset.sigs.k8s.io/name: vllm 101 | role: leader 102 | type: ClusterIP -------------------------------------------------------------------------------- /vllm-deploy-mistral-h100.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: vllm-server 5 
| labels: 6 | app: vllm-server 7 | spec: 8 | replicas: 1 9 | selector: 10 | matchLabels: 11 | app: vllm-inference-server 12 | template: 13 | metadata: 14 | labels: 15 | app: vllm-inference-server 16 | spec: 17 | volumes: 18 | - name: cache 19 | emptyDir: {} 20 | - name: dshm 21 | emptyDir: 22 | medium: Memory 23 | - name: triton 24 | emptyDir: {} 25 | nodeSelector: 26 | cloud.google.com/gke-accelerator: nvidia-h100-80gb 27 | cloud.google.com/gke-spot: "true" 28 | serviceAccountName: vllm 29 | containers: 30 | - name: vllm-inference-server 31 | image: vllm/vllm-openai 32 | imagePullPolicy: IfNotPresent 33 | securityContext: 34 | privileged: true 35 | resources: 36 | requests: 37 | cpu: "4" 38 | memory: "30Gi" 39 | ephemeral-storage: "100Gi" 40 | nvidia.com/gpu: "8" 41 | limits: 42 | cpu: "4" 43 | memory: "30Gi" 44 | ephemeral-storage: "100Gi" 45 | nvidia.com/gpu: "8" 46 | env: 47 | - name: HUGGING_FACE_HUB_TOKEN 48 | valueFrom: 49 | secretKeyRef: 50 | name: huggingface 51 | key: HF_TOKEN 52 | - name: TRANSFORMERS_CACHE 53 | value: /.cache 54 | - name: shm-size 55 | value: 5g 56 | command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] 57 | args: ["--model=mistralai/Mixtral-8x7B-v0.1", 58 | "--gpu-memory-utilization=0.95", 59 | "--disable-log-requests", 60 | "--trust-remote-code", 61 | "--port=8000", 62 | "--tensor-parallel-size=8"] 63 | ports: 64 | - containerPort: 8000 65 | name: http 66 | volumeMounts: 67 | - mountPath: /dev/shm 68 | name: dshm 69 | - mountPath: /.triton 70 | name: cache 71 | - mountPath: /.cache 72 | name: cache 73 | - mountPath: /.triton 74 | name: triton 75 | 76 | --- 77 | apiVersion: v1 78 | kind: Service 79 | metadata: 80 | name: vllm-inference-server 81 | annotations: 82 | cloud.google.com/neg: '{"ingress": true}' 83 | cloud.google.com/backend-config: '{"default": "vllm-backendconfig"}' 84 | labels: 85 | app: vllm-inference-server 86 | spec: 87 | type: NodePort 88 | ports: 89 | - port: 8000 90 | targetPort: http 91 | name: http-inference-server 92 | 93 | selector: 94 | app: vllm-inference-server 95 | 96 | --- 97 | apiVersion: cloud.google.com/v1 98 | kind: BackendConfig 99 | metadata: 100 | name: vllm-backendconfig 101 | spec: 102 | # gRPC healthchecks not supported, use http endpoint instead https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-configuration#direct_health 103 | healthCheck: 104 | checkIntervalSec: 15 105 | timeoutSec: 500 106 | healthyThreshold: 1 107 | unhealthyThreshold: 2 108 | type: HTTP # GKE Ingress controller only supports HTTP, HTTPS, or HTTP2 109 | requestPath: /health # Not a real endpoint, but should work (via prometheus metrics exporter) 110 | port: 8000 111 | --- 112 | apiVersion: networking.k8s.io/v1 113 | kind: Ingress 114 | metadata: 115 | name: vllm-ingress 116 | annotations: 117 | kubernetes.io/ingress.class: "gce" 118 | kubernetes.io/ingress.global-static-ip-name: "ingress-vllm" 119 | spec: 120 | rules: 121 | - http: 122 | paths: 123 | - path: "/" 124 | pathType: Prefix 125 | backend: 126 | service: 127 | name: vllm-inference-server 128 | port: 129 | number: 8000 130 | -------------------------------------------------------------------------------- /vllm-deploy-tpu.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: vllm-tpu 5 | spec: 6 | replicas: 1 7 | selector: 8 | matchLabels: 9 | app: vllm-tpu 10 | template: 11 | metadata: 12 | labels: 13 | app: vllm-tpu 14 | annotations: 15 | gke-gcsfuse/volumes: "true" 
16 | gke-gcsfuse/cpu-limit: "0" 17 | gke-gcsfuse/memory-limit: "0" 18 | gke-gcsfuse/ephemeral-storage-limit: "0" 19 | 20 | spec: 21 | serviceAccountName: vllm 22 | nodeSelector: 23 | cloud.google.com/gke-tpu-topology: 2x2 24 | cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice 25 | containers: 26 | - name: vllm-tpu 27 | image: docker.io/vllm/vllm-tpu:73aa7041bfee43581314e6f34e9a657137ecc092 #$REGION_NAME-docker.pkg.dev/$PROJECT_ID/vllm-tpu/vllm-tpu:latest 28 | command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] 29 | args: 30 | - --host=0.0.0.0 31 | - --port=8000 32 | - --tensor-parallel-size=4 33 | - --max-model-len=8192 # max input + output len 34 | - --model=meta-llama/Meta-Llama-3.1-70B 35 | - --download-dir=/data 36 | env: 37 | - name: HUGGING_FACE_HUB_TOKEN 38 | valueFrom: 39 | secretKeyRef: 40 | name: hf-secret 41 | key: hf_api_token 42 | - name: VLLM_XLA_CACHE_PATH 43 | value: "/data" 44 | ports: 45 | - containerPort: 8000 46 | resources: 47 | limits: 48 | google.com/tpu: 4 49 | readinessProbe: 50 | tcpSocket: 51 | port: 8000 52 | initialDelaySeconds: 15 53 | periodSeconds: 10 54 | volumeMounts: 55 | - name: gcs-fuse-csi-ephemeral 56 | mountPath: /data 57 | - name: dshm 58 | mountPath: /dev/shm 59 | volumes: 60 | - name: gke-gcsfuse-cache 61 | emptyDir: 62 | medium: Memory 63 | - name: dshm 64 | emptyDir: 65 | medium: Memory 66 | - name: gcs-fuse-csi-ephemeral 67 | csi: 68 | driver: gcsfuse.csi.storage.gke.io 69 | volumeAttributes: 70 | bucketName: rick-lllama-factory 71 | mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1" 72 | --- 73 | apiVersion: v1 74 | kind: Service 75 | metadata: 76 | name: vllm-service 77 | spec: 78 | selector: 79 | app: vllm-tpu 80 | type: LoadBalancer 81 | ports: 82 | - name: http 83 | protocol: TCP 84 | port: 8000 85 | targetPort: 8000 86 | -------------------------------------------------------------------------------- /vllm-deploy.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: vllm-server 5 | labels: 6 | app: vllm-server 7 | spec: 8 | replicas: 1 9 | selector: 10 | matchLabels: 11 | app: vllm-inference-server 12 | template: 13 | metadata: 14 | labels: 15 | app: vllm-inference-server 16 | spec: 17 | volumes: 18 | - name: cache 19 | emptyDir: {} 20 | - name: dshm 21 | emptyDir: 22 | medium: Memory 23 | nodeSelector: 24 | cloud.google.com/gke-accelerator: nvidia-l4 25 | cloud.google.com/gke-spot: "true" #autopilot spot 26 | tolerations: 27 | - key: "nvidia.com/gpu" 28 | operator: "Exists" 29 | effect: "NoSchedule" 30 | serviceAccountName: vllm 31 | containers: 32 | - name: vllm-inference-server 33 | image: vllm/vllm-openai 34 | imagePullPolicy: IfNotPresent 35 | 36 | resources: 37 | limits: 38 | nvidia.com/gpu: 1 39 | env: 40 | - name: HUGGING_FACE_HUB_TOKEN 41 | valueFrom: 42 | secretKeyRef: 43 | name: huggingface 44 | key: HF_TOKEN 45 | - name: TRANSFORMERS_CACHE 46 | value: /.cache 47 | - name: shm-size 48 | value: 1g 49 | command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] 50 | args: ["--model=meta-llama/Llama-3.1-8B-Instruct", #google/gemma-2-2b-it", 51 | "--gpu-memory-utilization=0.95", 52 | "--disable-log-requests", 53 | "--trust-remote-code", 54 | "--port=8000", 55 | "--tensor-parallel-size=1"] 56 | ports: 57 | - containerPort: 8000 58 | name: http 
59 | volumeMounts: 60 | - mountPath: /dev/shm 61 | name: dshm 62 | - mountPath: /.cache 63 | name: cache 64 | 65 | --- 66 | apiVersion: v1 67 | kind: Service 68 | metadata: 69 | name: vllm-inference-server 70 | annotations: 71 | cloud.google.com/neg: '{"ingress": true}' 72 | #cloud.google.com/backend-config: '{"default": "vllm-backendconfig"}' 73 | labels: 74 | app: vllm-inference-server 75 | spec: 76 | type: NodePort 77 | ports: 78 | - port: 8000 79 | targetPort: http 80 | name: http-inference-server 81 | 82 | selector: 83 | app: vllm-inference-server 84 | 85 | --- 86 | #apiVersion: cloud.google.com/v1 87 | #kind: BackendConfig 88 | #metadata: 89 | # name: vllm-backendconfig 90 | #spec: 91 | # gRPC healthchecks not supported, use http endpoint instead https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-configuration#direct_health 92 | # healthCheck: 93 | # checkIntervalSec: 15 94 | # timeoutSec: 15 95 | # healthyThreshold: 1 96 | # unhealthyThreshold: 2 97 | # type: HTTP # GKE Ingress controller only supports HTTP, HTTPS, or HTTP2 98 | # requestPath: /health # Not a real endpoint, but should work (via prometheus metrics exporter) 99 | # port: 8000 100 | --- 101 | apiVersion: networking.k8s.io/v1 102 | kind: Ingress 103 | metadata: 104 | name: vllm-ingress 105 | annotations: 106 | kubernetes.io/ingress.class: "gce-internal" 107 | #kubernetes.io/ingress.global-static-ip-name: "ingress-vllm" 108 | spec: 109 | rules: 110 | - http: 111 | paths: 112 | - path: "/" 113 | pathType: Prefix 114 | backend: 115 | service: 116 | name: vllm-inference-server 117 | port: 118 | number: 8000 119 | -------------------------------------------------------------------------------- /webapp/Dockerfile: -------------------------------------------------------------------------------- 1 | # python3.8 breaks with gradio 2 | FROM python:3.10 3 | 4 | # Install dependencies from requirements.txt 5 | COPY ./src/requirements.txt . 6 | RUN pip install -r requirements.txt 7 | 8 | COPY ./src /src 9 | 10 | WORKDIR /src 11 | 12 | EXPOSE 7860 13 | 14 | CMD ["python", "app.py"] -------------------------------------------------------------------------------- /webapp/cloudbuild.yaml: -------------------------------------------------------------------------------- 1 | steps: 2 | - name: 'gcr.io/cloud-builders/docker' 3 | args: [ 'build', '-t', '$LOCATION-docker.pkg.dev/$PROJECT_ID/gke-llm/vllm-client:latest', '.' 
] 4 | images: 5 | - '$LOCATION-docker.pkg.dev/$PROJECT_ID/gke-llm/vllm-client:latest' 6 | # Push the container image to Artifact Registry, get sha256 of the image 7 | -------------------------------------------------------------------------------- /webapp/src/app.py: -------------------------------------------------------------------------------- 1 | import gradio as gr 2 | import requests 3 | import os 4 | from langchain_community.llms import VLLMOpenAI 5 | import json 6 | from openai import OpenAI 7 | 8 | llm_url = os.environ.get('LLM_URL') 9 | llm_name= os.environ.get('LLM_NAME') 10 | APIGEE_HOST=os.environ.get('APIGEE_HOST') 11 | APIKEY=os.environ.get('APIKEY') 12 | llm = VLLMOpenAI( 13 | openai_api_key="EMPTY", 14 | openai_api_base=f"{llm_url}/v1", 15 | model_name=f"{llm_name}", 16 | model_kwargs={"stop": ["."]}, 17 | ) 18 | #llm=OpenAI( 19 | # api_key="EMPTY", 20 | # base_url=f"{llm_url}/v1", 21 | # model=f"{llm_name}", 22 | #) 23 | 24 | def predict(question): 25 | url = f"https://{APIGEE_HOST}/v1/products?count=100" 26 | headers = {'x-apikey': APIKEY, 'Content-type': 'application/json'} 27 | 28 | resp = requests.get(url, headers = headers) 29 | products = resp.json() # Parse JSON response 30 | 31 | # Create a more structured prompt for better results 32 | system_prompt = """You are a helpful assistant that analyzes product information. 33 | When describing products, focus on key details like: 34 | - name 35 | - categories 36 | - priceUsd 37 | - description 38 | Provide a concise and natural summary.""" 39 | 40 | user_prompt = f"""I want information about {question}. 41 | Here are the product details in JSON format: 42 | {json.dumps(products, indent=2)} 43 | 44 | Please provide a natural summary focusing on products related to {question}. 45 | If no relevant products are found, please indicate that.""" 46 | 47 | messages = [ 48 | {"role": "system", "content": system_prompt}, 49 | {"role": "user", "content": user_prompt} 50 | ] 51 | 52 | #chat_response = llm.chat.completions.create( 53 | #model=f"{llm_name}", 54 | #messages=[ 55 | # {"role": "system", "content": "You are a helpful assistant."}, 56 | # {"role": "user", "content": "Tell me a joke."}, 57 | #] 58 | #) 59 | #print("Chat response:", chat_response) 60 | 61 | response = requests.post( 62 | f"{llm_url}/v1/chat/completions", 63 | headers={"Content-Type": "application/json"}, 64 | json={ 65 | "model": llm_name, 66 | "messages": messages, 67 | "temperature": 0.7, 68 | "max_tokens": 600 # Adjust based on your needs 69 | } 70 | ) 71 | # Debug the response 72 | response_data = response.json() 73 | print("API Response:", response_data) # For debugging 74 | 75 | if response.status_code != 200: 76 | return f"Error: API returned status code {response.status_code}" 77 | 78 | # Check if response has the expected structure 79 | if "choices" not in response_data: 80 | return f"Error: Unexpected API response format: {response_data}" 81 | 82 | if not response_data["choices"]: 83 | return "Error: No completion choices returned" 84 | 85 | return response_data["choices"][0]["message"]["content"].strip() 86 | 87 | 88 | examples = [ 89 | ["Sunglass"], 90 | ["Shoes"], 91 | ["Clothes"], 92 | ] 93 | logo_html = '
Logo
' 94 | 95 | demo = gr.Interface( 96 | predict, 97 | [ gr.Textbox(label="Enter prompt:", value="Sunglass"), 98 | 99 | ], 100 | "text", 101 | examples=examples, 102 | title= llm_name+" Knowledge Bot" +logo_html 103 | ) 104 | 105 | demo.launch(server_name="0.0.0.0", server_port=7860) -------------------------------------------------------------------------------- /webapp/src/falcon.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/llm-on-gke/vllm-inference/c531a0ab396f235c76975a413b5c80712e0e0f47/webapp/src/falcon.jpeg -------------------------------------------------------------------------------- /webapp/src/requirements.txt: -------------------------------------------------------------------------------- 1 | gradio==3.37.0 2 | Flask==2.2.2 3 | requests==2.31.0 4 | langchain 5 | langchain-community 6 | vllm --------------------------------------------------------------------------------