├── README.md
├── create-cluster.sh
├── vllm-client.yaml
├── vllm-deploy-lws-deepseek.yaml
├── vllm-deploy-mistral-h100.yaml
├── vllm-deploy-tpu.yaml
├── vllm-deploy.yaml
└── webapp
    ├── Dockerfile
    ├── cloudbuild.yaml
    └── src
        ├── app.py
        ├── falcon.jpeg
        └── requirements.txt

/README.md:
--------------------------------------------------------------------------------
1 | # Serving Open Source LLMs on GKE using the vLLM framework
2 |
3 | This post shows how to serve open source LLMs (Mistral 7B, Llama 2, etc.) on NVIDIA GPUs (L4 or Tesla T4, for example) running on Google Kubernetes Engine (GKE). It will help you understand the AI/ML-ready features of GKE and how to use them to serve large language models, making self-management of OSS LLMs on GKE less daunting than you may originally think.
4 |
5 |
6 | TGI and vLLM are two common frameworks that address a significant challenge in LLM serving: the slow latency of getting responses back from a model, driven primarily by the ever-increasing size of LLMs.
7 |
8 | **vLLM** is a framework designed to enhance the inference and serving speed of LLMs. It has demonstrated remarkable performance improvements compared to mainstream frameworks like Hugging Face's Transformers, primarily because of a highly innovative algorithm at its core.
9 |
10 | One key reason behind vLLM's speed during inference is its use of the **PagedAttention** technique. In traditional attention mechanisms, the computed keys and values are stored in GPU memory as a KV cache. This cache holds the attention keys and values for previous tokens, which can consume a significant amount of memory, especially for large models and long sequences, and these keys and values are stored contiguously.
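To make the PagedAttention idea above more concrete, here is a toy Python sketch of block-based KV-cache bookkeeping. This is a conceptual illustration only, not vLLM's actual implementation: each sequence is given fixed-size blocks on demand from a shared pool (much like virtual-memory paging), instead of reserving one large contiguous region up front.

```
# Toy illustration of block-based (paged) KV-cache bookkeeping.
# Conceptual sketch only -- NOT vLLM's real implementation.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                # tokens stored per block
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                           # seq_id -> number of tokens cached

    def append_token(self, seq_id: int):
        """Reserve cache space for one new token; returns (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:           # current block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())    # grab any free block; no contiguity required
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free_sequence(self, seq_id: int):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):                                  # two sequences growing token by token
    cache.append_token(seq_id=0)
    cache.append_token(seq_id=1)
print(cache.block_tables)  # {0: [7, 5], 1: [6, 4]} -- blocks need not be adjacent
```

Because blocks are allocated only as sequences actually grow and are returned to the pool as soon as a sequence finishes, far less GPU memory sits idle, which is what lets vLLM batch many more concurrent requests.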
11 |
12 | **TGI (Text Generation Inference)** is another solution aimed at increasing the speed of LLM inference. It offers high-performance text generation using tensor parallelism and dynamic batching for popular open-source LLMs like StarCoder, BLOOM, Llama, and other models.
13 |
14 | **TGI vs vLLM**
15 |
16 | - TGI does not support paged optimization.
17 |
18 | - Neither technique handles all LLM architectures.
19 |
20 | - TGI also allows quantizing and fine-tuning models, which are not supported by vLLM.
21 |
22 | - vLLM achieves better performance than TGI and the Hugging Face Transformers library, with up to 24x higher throughput compared to Hugging Face and up to 3.5x higher throughput than TGI.
23 | This post shows how to serve OSS LLMs (Mistral 7B or Llama 2) on L4 GPUs running on Google Kubernetes Engine (GKE). It will help you understand the AI/ML-ready features of GKE and how to use them to serve large language models.
24 |
25 |
26 | GKE is a fully managed service that allows you to run containerized workloads on Google Cloud. It's a great choice for running large language models and AI/ML workloads because it is easy to set up, it's secure, and it comes with AI/ML batteries included. GKE installs the latest NVIDIA GPU drivers for you in GPU-enabled node pools, and gives you autoscaling and partitioning capabilities for GPUs out of the box, so you can easily scale your workloads to the size you need while keeping costs under control.
27 |
28 |
29 | For an example of how to serve an open source model (Mistral 7B) on GKE using TGI, please refer to this Google community blog:
30 | https://medium.com/google-cloud/serving-mistral-7b-on-l4-gpus-running-on-gke-25c6041dff27
31 |
32 |
33 | ## Open source models supported
34 | Llama 2, Mistral, Falcon, and more; see the full list at https://docs.vllm.ai/en/latest/models/supported_models.html
35 |
36 | ## Prerequisites
37 |
38 | Access to a Google Cloud project with L4 GPUs available and enough quota in the region you select.
39 | A computer terminal with kubectl and the Google Cloud SDK installed. From the GCP project console you'll be working with, you may want to use the included Cloud Shell, as it already has the required tools installed.
40 | Some models, such as Llama 2, need a Hugging Face API token to download the model files.
41 | Meta access request: https://ai.meta.com/resources/models-and-libraries/llama-downloads/ (you need to register an email address to download).
42 |
43 | Go to Hugging Face and create an account with the same email you registered in the Meta request. Then find the Llama 2 model and fill out the access request: https://huggingface.co/meta-llama/Llama-2-7b. You may need to wait a few hours for the approval email before you can use Llama 2.
44 |
45 | Get a Hugging Face access token from your Hugging Face account profile settings.
46 |
47 | ## Set up the project environment
48 |
49 | From your console, select the Google Cloud region and project, checking that there's availability for L4 GPUs in the one you end up selecting. The region used in this tutorial is us-central1, where at the time of writing there was availability for L4 GPUs (alternatively, you can choose another region with a different GPU accelerator type available):
50 |
51 | ```
52 | export PROJECT_ID=
53 | export REGION=us-central1
54 | export ZONE_1=${REGION}-a # You may want to change the zone letter based on the region you selected above
55 | export ZONE_2=${REGION}-b # You may want to change the zone letter based on the region you selected above
56 | export CLUSTER_NAME=vllm-serving-cluster
57 | export NAMESPACE=vllm
58 | gcloud config set project "$PROJECT_ID"
59 | gcloud config set compute/region "$REGION"
60 | gcloud config set compute/zone "$ZONE_1"
61 | ```
62 | Then, enable the required APIs to create a GKE cluster:
63 | ```
64 | gcloud services enable compute.googleapis.com container.googleapis.com
65 | ```
66 |
67 | Also, go ahead and download the source code repo for this exercise:
68 | ```
69 | git clone https://github.com/llm-on-gke/vllm-inference.git
70 | cd vllm-inference
71 | ```
72 |
73 |
74 | In this exercise, you will be using the default compute service account to create the cluster. You need to grant it the required permissions to store metrics and logs in Cloud Monitoring, which you will be using later on:
75 |
76 | ```
77 | PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)')
78 | GCE_SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"
79 | for role in monitoring.metricWriter stackdriver.resourceMetadata.writer; do
80 |   gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:${GCE_SA} --role=roles/${role}
81 | done
82 | ```
83 | ## Create GKE Cluster and Node Pools
84 |
85 | ### Quick estimate of the GPU type and number of GPUs needed for model inference
86 | Estimate the size of a model in gigabytes by multiplying the number of parameters (in billions) by 2.
This approach is based on a simple formula: with each parameter using 16 bits (2 bytes) of memory in half precision, the memory usage in GB is approximately twice the number of parameters in billions. Therefore, a 7B parameter model, for instance, will take up approximately 14 GB of memory. We can comfortably run a 7B parameter model on an NVIDIA L4 (24 GB) and still have about 10 GB of memory remaining as a buffer for inferencing. Alternatively, you can use two Tesla T4 GPUs with 32 GB in total by sharding the model across both GPUs, but there will be some overhead from moving data between them.
87 |
88 | For models with a larger parameter count, resource requirements can be reduced through weight quantization into lower-precision bits.
89 | For example, the Llama 2 70B model may need about 140 GB of memory at the default half precision (16 bits); this can be reduced by quantizing to 8-bit (about 70 GB) or even 4-bit precision, which needs only about 35 GB of memory and can fit into 2 L4 GPUs (48 GB total).
90 | Reference: https://www.baseten.co/blog/llm-transformer-inference-guide/
91 |
92 | ### GKE Cluster
93 |
94 | Now, create a GKE cluster with a minimal default node pool, as you will be adding a node pool with L4 GPUs later on:
95 | ```
96 | gcloud container clusters create $CLUSTER_NAME \
97 |   --location "$REGION" \
98 |   --workload-pool "${PROJECT_ID}.svc.id.goog" \
99 |   --enable-image-streaming --enable-shielded-nodes \
100 |   --shielded-secure-boot --shielded-integrity-monitoring \
101 |   --enable-ip-alias \
102 |   --node-locations="$ZONE_1" \
103 |   --workload-pool="${PROJECT_ID}.svc.id.goog" \
104 |   --addons GcsFuseCsiDriver \
105 |   --no-enable-master-authorized-networks \
106 |   --machine-type n2d-standard-4 \
107 |   --num-nodes 1 --min-nodes 1 --max-nodes 5 \
108 |   --ephemeral-storage-local-ssd=count=2 \
109 |   --enable-ip-alias
110 | ```
111 |
112 | ### Node pool
113 |
114 | Create an additional node pool with Spot VMs (we use Spot to illustrate cost savings), each node with one L4 GPU:
115 | ```
116 | gcloud container node-pools create g2-standard-24 --cluster $CLUSTER_NAME \
117 |   --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
118 |   --machine-type g2-standard-8 \
119 |   --ephemeral-storage-local-ssd=count=1 \
120 |   --enable-autoscaling --enable-image-streaming \
121 |   --num-nodes=1 --min-nodes=0 --max-nodes=2 \
122 |   --shielded-secure-boot \
123 |   --shielded-integrity-monitoring \
124 |   --node-locations $ZONE_1,$ZONE_2 --region $REGION --spot
125 | ```
126 | Note how easy enabling GPUs in GKE is. Just adding the --accelerator option automatically bootstraps the nodes with the necessary drivers and configuration so your workloads can start using the GPUs attached to the cluster nodes.
If you want to try Tesla T4 GPUs instead, update the --accelerator and --machine-type parameter values, for example:
127 | --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=latest
128 | --machine-type n1-standard-8
129 |
130 | After a few minutes, check that the node pool was created correctly:
131 | ```
132 | gcloud container node-pools list --region $REGION --cluster $CLUSTER_NAME
133 | ```
134 |
135 | Also, check that the corresponding nodes in the g2-standard-24 node pool have the GPUs available:
136 | ```
137 | kubectl get nodes -o json | jq -r '.items[] | {name:.metadata.name, gpus:.status.capacity."nvidia.com/gpu"}'
138 | ```
139 | You should get a node with one GPU available, corresponding to the node pool you just created:
140 |
141 | {
142 |   "name": "vllm-serving-cluster-g2-standard-24-XXXXXX",
143 |   "gpus": "1"
144 | }
145 |
146 | Run the following commands to set up Workload Identity and IAM roles:
147 | ```
148 | kubectl create ns $NAMESPACE
149 | kubectl create serviceaccount $NAMESPACE --namespace $NAMESPACE
150 | gcloud iam service-accounts add-iam-policy-binding $GCE_SA \
151 |   --role roles/iam.workloadIdentityUser \
152 |   --member "serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${NAMESPACE}]"
153 |
154 | kubectl annotate serviceaccount $NAMESPACE \
155 |   --namespace $NAMESPACE \
156 |   iam.gke.io/gcp-service-account=$GCE_SA
157 | ```
158 |
159 | ## Deploy model to GKE cluster
160 | We're now ready to deploy the model.
161 | Save the following as vllm-deploy.yaml:
162 | ```
163 | apiVersion: apps/v1
164 | kind: Deployment
165 | metadata:
166 |   name: vllm-server
167 |   labels:
168 |     app: vllm-server
169 | spec:
170 |   replicas: 1
171 |   selector:
172 |     matchLabels:
173 |       app: vllm-inference-server
174 |   template:
175 |     metadata:
176 |       labels:
177 |         app: vllm-inference-server
178 |     spec:
179 |       volumes:
180 |       - name: cache
181 |         emptyDir: {}
182 |       - name: dshm
183 |         emptyDir:
184 |           medium: Memory
185 |       nodeSelector:
186 |         cloud.google.com/gke-accelerator: nvidia-l4
187 |       serviceAccountName: vllm
188 |       containers:
189 |       - name: vllm-inference-server
190 |         image: vllm/vllm-openai
191 |         imagePullPolicy: IfNotPresent
192 |
193 |         resources:
194 |           limits:
195 |             nvidia.com/gpu: 1
196 |         env:
197 |         - name: HUGGING_FACE_HUB_TOKEN
198 |           valueFrom:
199 |             secretKeyRef:
200 |               name: huggingface
201 |               key: HF_TOKEN
202 |         - name: TRANSFORMERS_CACHE
203 |           value: /.cache
204 |         - name: shm-size
205 |           value: 1g
206 |         command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
207 |         args: ["--model=meta-llama/Llama-2-7b-hf",
208 |                "--gpu-memory-utilization=0.95",
209 |                "--disable-log-requests",
210 |                "--trust-remote-code",
211 |                "--port=8000",
212 |                "--tensor-parallel-size=1"]
213 |         ports:
214 |         - containerPort: 8000
215 |           name: http
216 |         securityContext:
217 |           runAsUser: 1000
218 |         volumeMounts:
219 |         - mountPath: /dev/shm
220 |           name: dshm
221 |         - mountPath: /.cache
222 |           name: cache
223 |
224 | ---
225 | apiVersion: v1
226 | kind: Service
227 | metadata:
228 |   name: vllm-inference-server
229 |   annotations:
230 |     cloud.google.com/neg: '{"ingress": true}'
231 |   labels:
232 |     app: vllm-inference-server
233 | spec:
234 |   type: NodePort
235 |   ports:
236 |   - port: 8000
237 |     targetPort: http
238 |     name: http-inference-server
239 |
240 |   selector:
241 |     app: vllm-inference-server
242 |
243 | ---
244 | apiVersion: networking.k8s.io/v1
245 | kind: Ingress
246 | metadata:
247 |   name: vllm-ingress
248 |   annotations:
249 |     kubernetes.io/ingress.class: "gce"
250 |     kubernetes.io/ingress.global-static-ip-name: "ingress-vllm"
"ingress-vllm" 251 | spec: 252 | rules: 253 | - http: 254 | paths: 255 | - path: "/" 256 | pathType: Prefix 257 | backend: 258 | service: 259 | name: vllm-inference-server 260 | port: 261 | number: 8000 262 | ``` 263 | 264 | 265 | Notes: 266 | We include kubernetes resource templates for a deployment, service, and Ingress 267 | We use official container image to run the modles: vllm/vllm-openai 268 | Huggingface token setup, 269 | Execute the command to deploy inference deployment in GKE, update the HF_TOKEN values 270 | 271 | ``` 272 | gcloud container clusters get-credentials $CLUSTER_NAME $REGION 273 | export HF_TOKEN= 274 | kubectl create secret generic huggingface --from-literal="HF_TOKEN=$HF_TOKEN" -n $NAMESPACE 275 | ``` 276 | This GKE huggingface secrect is used to set the environment value in gke-deploy.yaml( need to keep the name: HUGGING_FACE_HUB_TOKEN ): 277 | env: 278 | - name: HUGGING_FACE_HUB_TOKEN 279 | valueFrom: 280 | secretKeyRef: 281 | name: huggingface 282 | key: HF_TOKEN 283 | 284 | 285 | ## vLLM model config parameters: 286 | Update vllm-deploy.yml file as described earlier, 287 | 288 | You can override the command and arguments, 289 | 290 | if you prefer to Langchain and OpenAI integration in applications, you can use this: 291 | 292 | command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] 293 | 294 | or you can use different entrypoint for native vLLM APIs: 295 | 296 | command: ["python3", "-m", "vllm.entrypoints.api_server"] 297 | 298 | To understand the vLLM model related arguments, See this doc: https://docs.vllm.ai/en/latest/models/engine_args.html 299 | 300 | You can adjust the model related parameters in args settings in gke-deploy.yaml 301 | 302 | --model=ModelNameFromHuggingFace, replace with specific models from Huggingface, meta-llama/Llama-2-7b-hf, meta-llama/Llama-2-13b-hf, mistralai/Mistral-7B-v0.1, tiiuae/falcon-7b 303 | 304 | If you use Vertex vLLM image, --model value you can be full Cloud Storage path of model files, e.g., gs://vertex-model-garden-public-us-central1/llama2/llama2-13b-hf 305 | 306 | ## Deploy the model to GKE 307 | After vllm-deploy.yaml file been updated with proper settings, execute the followin command: 308 | ``` 309 | kubectl apply -f vllm-deploy.yaml -n $NAMESPACE 310 | ``` 311 | The following GKE artifacts will be created: 312 | a. vllm-server deployment 313 | b. Ingress 314 | b. Service with endpoint of LLM APIs, routing traffic through Ingress 315 | 316 | Check all the objects you’ve just created: 317 | 318 | kubectl get all 319 | Check that the pod has been correctly scheduled in one of the nodes in the g2-standard-8 node pool that has the GPUs available: 320 | 321 | 322 | ## Tests 323 | 324 | Simplely run the following command to get the cluster ip: 325 | ``` 326 | kubectl get service/vllm-server -o jsonpath='{.spec.clusterIP}' -n $NAMESPACE 327 | ``` 328 | 329 | Then use the following curl command to test inside the Cluster(update the cluster IP first): 330 | ``` 331 | kubectl run curl --image=curlimages/curl \ 332 | -it --rm --restart=Never \ 333 | -- "$CLUSTERIP:8000/v1/models" 334 | 335 | curl http://ClusterIP:8000/v1/completions \ 336 | -H "Content-Type: application/json" \ 337 | -d '{ 338 | "model": "google/gemma-1.1-2b-it", 339 | "prompt": "San Francisco is a", 340 | "max_tokens": 250, 341 | "temperature": 0.1 342 | }' 343 | ``` 344 | 345 | ## Deploy WebApp 346 | 347 | Siince vLLM can expose different model as OpenAI style APIs, different models will be transparent applications how to access LLM models. 
348 |
349 | The sample app provided uses the LangChain VLLMOpenAI wrapper to initialize any model deployed through vLLM, with the following Python code:
350 |
351 | import gradio as gr
352 | import requests
353 | import os
354 | from langchain_community.llms import VLLMOpenAI
355 | llm_url = os.environ.get('LLM_URL')
356 | llm_name = os.environ.get('LLM_NAME')
357 | llm = VLLMOpenAI(
358 |     openai_api_key="EMPTY",
359 |     openai_api_base=f"{llm_url}/v1",
360 |     model_name=f"{llm_name}",
361 |     model_kwargs={"stop": ["."]},
362 | )
363 |
364 | Note: you don't need to run this code here.
365 |
366 | You need to build the webapp container image first so that you can deploy the webapp to the cluster.
367 | Update (but don't run) the cloudbuild.yaml file under the webapp directory, replacing the Artifact Registry repository path with your own:
368 |
369 | steps:
370 | - name: 'gcr.io/cloud-builders/docker'
371 |   args: [ 'build', '-t', 'us-east1-docker.pkg.dev/$PROJECT_ID/gke-llm/vllm-client:latest', '.' ]
372 | images:
373 | - 'us-east1-docker.pkg.dev/$PROJECT_ID/gke-llm/vllm-client:latest'
374 |
375 |
376 | Then run the build of the sample app container image using Cloud Build:
377 |
378 | ```
379 | cd webapp
380 | gcloud builds submit .
381 | ```
382 |
383 | Then update the vllm-client.yaml file:
384 | - name: gradio
385 |   image: us-east1-docker.pkg.dev/PROJECT_ID/gke-llm/vllm-client:latest
386 |   env:
387 |   - name: LLM_URL
388 |     value: "http://ClusterIP:8000"
389 |   - name: LLM_NAME
390 |     value: "meta-llama/Llama-2-7b-hf"
391 |
392 | a. image: URI, replace with your own vllm-client image
393 | b. LLM server and name settings:
394 | - name: LLM_URL
395 |   value: "http://ClusterIP:8000" (replace with the full LLM service endpoint, including port)
396 | - name: LLM_NAME
397 |   value: "meta-llama/Llama-2-7b-hf" (replace with the model you deployed earlier)
398 |
399 | Run the following commands to deploy the webapp and get its external IP:
400 | kubectl apply -f vllm-client.yaml -n $NAMESPACE
401 | kubectl get service/vllm-client-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}' -n $NAMESPACE
402 |
403 | ## Validations
404 |
405 | Go to the external IP for the webapp, http://externalIP:8080,
406 | and test a few questions from the web application.
407 |
408 | ## Cleanups
409 | Don't forget to clean up the resources created in this article once you've finished experimenting with GKE and Mistral 7B, as keeping the cluster running for a long time can incur significant costs. To clean up, you just need to delete the GKE cluster:
410 |
411 | gcloud container clusters delete $CLUSTER_NAME --region $REGION
412 |
413 | ## Conclusion
414 | This post demonstrates how deploying open source LLMs such as Mistral 7B and Llama 2 7B/13B using vLLM on GKE is flexible and straightforward.
415 | Self-managing LLM operations and AI/ML workloads on GKE, Google Cloud's flagship managed Kubernetes platform, enables deploying LLMs in production and brings MLOps one step closer to existing platform teams with expertise in those managed platforms. Also, given the resources that are consumed, and the number of potential applications using AI/ML features moving forward, having a framework that offers scalability and cost control features simplifies adoption.
416 |
417 | Don't forget to check out other GKE-related resources on AI/ML infrastructure offered by Google Cloud.
418 | 419 | 420 | -------------------------------------------------------------------------------- /create-cluster.sh: -------------------------------------------------------------------------------- 1 | # for L4 and spot node-pools 2 | export PROJECT_ID= 3 | export HF_TOKEN= 4 | 5 | export REGION=us-central1 6 | export ZONE_1=${REGION}-a # You may want to change the zone letter based on the region you selected above 7 | export ZONE_2=${REGION}-b # You may want to change the zone letter based on the region you selected above 8 | export CLUSTER_NAME=vllm-cluster 9 | export NAMESPACE=vllm 10 | gcloud config set project "$PROJECT_ID" 11 | gcloud config set compute/region "$REGION" 12 | gcloud config set compute/zone "$ZONE_1" 13 | 14 | #autopilot 15 | gcloud container clusters create-auto ${CLUSTER_NAME} \ 16 | --project=${PROJECT_ID} \ 17 | --region=${REGION} 18 | 19 | gcloud container clusters create $CLUSTER_NAME --location ${REGION} \ 20 | --workload-pool ${PROJECT_ID}.svc.id.goog \ 21 | --enable-image-streaming --enable-shielded-nodes \ 22 | --shielded-secure-boot --shielded-integrity-monitoring \ 23 | --enable-ip-alias \ 24 | --node-locations=$REGION-b \ 25 | --workload-pool=${PROJECT_ID}.svc.id.goog \ 26 | --addons GcsFuseCsiDriver \ 27 | --no-enable-master-authorized-networks \ 28 | --machine-type n2d-standard-4 \ 29 | --cluster-version 1.27.5-gke.200 \ 30 | --num-nodes 1 --min-nodes 1 --max-nodes 3 \ 31 | --ephemeral-storage-local-ssd=count=2 \ 32 | --scopes="gke-default,storage-rw" 33 | 34 | PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)') 35 | GCE_SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" 36 | for role in monitoring.metricWriter stackdriver.resourceMetadata.writer; do 37 | gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:${GCE_SA} --role=roles/${role} 38 | done 39 | 40 | gcloud container node-pools create vllm-inference-pool --cluster \ 41 | $CLUSTER_NAME --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest --machine-type g2-standard-8 \ 42 | --ephemeral-storage-local-ssd=count=1 --enable-autoscaling --enable-image-streaming --num-nodes=0 --min-nodes=0 --max-nodes=3 \ 43 | --shielded-secure-boot --shielded-integrity-monitoring --node-version=1.27.5-gke.200 --node-locations $ZONE_1,$ZONE_2 --region $REGION --spot 44 | 45 | kubectl create ns $NAMESPACE 46 | kubectl create serviceaccount $NAMESPACE --namespace $NAMESPACE 47 | gcloud iam service-accounts add-iam-policy-binding $GCE_SA \ 48 | --role roles/iam.workloadIdentityUser \ 49 | --member "serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${NAMESPACE}]" 50 | 51 | kubectl annotate serviceaccount $NAMESPACE \ 52 | --namespace $NAMESPACE \ 53 | iam.gke.io/gcp-service-account=$GCE_SA 54 | 55 | kubectl create secret generic huggingface --from-literal="HF_TOKEN=$HF_TOKEN" -n $NAMESPACE 56 | 57 | gcloud beta container clusters update $CLUSTER_NAME --update-addons=HttpLoadBalancing=ENABLED --region $REGION -------------------------------------------------------------------------------- /vllm-client.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: vllm-client 5 | spec: 6 | selector: 7 | matchLabels: 8 | app: vllm-client 9 | template: 10 | metadata: 11 | labels: 12 | app: vllm-client 13 | spec: 14 | containers: 15 | - name: gradio 16 | image: us-east1-docker.pkg.dev/rick-vertex-ai/gke-llm/vllm-client:latest 17 | env: 18 | - name: LLM_URL 
19 | value: "http://CLusterIP:8000" 20 | - name: LLM_NAME 21 | value: "meta-llama/Llama-3.1-8B-Instruct" 22 | - name: MAX_TOKENS 23 | value: "400" 24 | - name: APIGEE_HOST 25 | value: "" 26 | - name: APIKEY 27 | value: "" 28 | resources: 29 | requests: 30 | memory: "128Mi" 31 | cpu: "250m" 32 | limits: 33 | memory: "256Mi" 34 | cpu: "500m" 35 | ports: 36 | - containerPort: 7860 37 | --- 38 | apiVersion: v1 39 | kind: Service 40 | metadata: 41 | name: vllm-client-service 42 | spec: 43 | type: LoadBalancer 44 | selector: 45 | app: vllm-client 46 | ports: 47 | - port: 8080 48 | targetPort: 7860 -------------------------------------------------------------------------------- /vllm-deploy-lws-deepseek.yaml: -------------------------------------------------------------------------------- 1 | 2 | apiVersion: leaderworkerset.x-k8s.io/v1 3 | kind: LeaderWorkerSet 4 | metadata: 5 | name: vllm-dps 6 | spec: 7 | replicas: 1 8 | leaderWorkerTemplate: 9 | size: 2 10 | restartPolicy: RecreateGroupOnPodRestart 11 | leaderTemplate: 12 | metadata: 13 | labels: 14 | role: leader 15 | spec: 16 | nodeSelector: 17 | cloud.google.com/gke-accelerator: nvidia-h100-80gb 18 | containers: 19 | - name: vllm-leader 20 | image: us-east1-docker.pkg.dev/northam-ce-mlai-tpu/gke-llm/vllm-lws:latest 21 | env: 22 | - name: HUGGING_FACE_HUB_TOKEN 23 | valueFrom: 24 | secretKeyRef: 25 | name: huggingface 26 | key: HF_TOKEN 27 | command: 28 | - sh 29 | - -c 30 | - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 31 | huggingface-cli login --token $HUGGING_FACE_HUB_TOKEN ; 32 | huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir /models; 33 | python3 -m vllm.entrypoints.openai.api_server --port 8080 --model /models --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust_remote_code" 34 | resources: 35 | limits: 36 | nvidia.com/gpu: "8" 37 | ports: 38 | - containerPort: 8080 39 | readinessProbe: 40 | tcpSocket: 41 | port: 8080 42 | initialDelaySeconds: 15 43 | periodSeconds: 10 44 | volumeMounts: 45 | - mountPath: /dev/shm 46 | name: dshm 47 | - mountPath: /models 48 | name: models 49 | volumes: 50 | - name: dshm 51 | emptyDir: 52 | medium: Memory 53 | sizeLimit: 15Gi 54 | - name: models 55 | emptyDir: {} 56 | workerTemplate: 57 | spec: 58 | nodeSelector: 59 | cloud.google.com/gke-accelerator: nvidia-h100-80gb 60 | containers: 61 | - name: vllm-worker 62 | image: us-east1-docker.pkg.dev/northam-ce-mlai-tpu/gke-llm/vllm-lws:latest 63 | command: 64 | - sh 65 | - -c 66 | - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)" 67 | resources: 68 | limits: 69 | nvidia.com/gpu: "8" 70 | env: 71 | - name: HUGGING_FACE_HUB_TOKEN 72 | valueFrom: 73 | secretKeyRef: 74 | name: huggingface 75 | key: HF_TOKEN 76 | volumeMounts: 77 | - mountPath: /dev/shm 78 | name: dshm 79 | - mountPath: /models 80 | name: models 81 | volumes: 82 | - name: dshm 83 | emptyDir: 84 | medium: Memory 85 | sizeLimit: 15Gi 86 | - name: models 87 | emptyDir: {} 88 | --- 89 | apiVersion: v1 90 | kind: Service 91 | metadata: 92 | name: vllm-leader 93 | spec: 94 | ports: 95 | - name: http 96 | port: 8080 97 | protocol: TCP 98 | targetPort: 8080 99 | selector: 100 | leaderworkerset.sigs.k8s.io/name: vllm 101 | role: leader 102 | type: ClusterIP -------------------------------------------------------------------------------- /vllm-deploy-mistral-h100.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: vllm-server 5 
| labels: 6 | app: vllm-server 7 | spec: 8 | replicas: 1 9 | selector: 10 | matchLabels: 11 | app: vllm-inference-server 12 | template: 13 | metadata: 14 | labels: 15 | app: vllm-inference-server 16 | spec: 17 | volumes: 18 | - name: cache 19 | emptyDir: {} 20 | - name: dshm 21 | emptyDir: 22 | medium: Memory 23 | - name: triton 24 | emptyDir: {} 25 | nodeSelector: 26 | cloud.google.com/gke-accelerator: nvidia-h100-80gb 27 | cloud.google.com/gke-spot: "true" 28 | serviceAccountName: vllm 29 | containers: 30 | - name: vllm-inference-server 31 | image: vllm/vllm-openai 32 | imagePullPolicy: IfNotPresent 33 | securityContext: 34 | privileged: true 35 | resources: 36 | requests: 37 | cpu: "4" 38 | memory: "30Gi" 39 | ephemeral-storage: "100Gi" 40 | nvidia.com/gpu: "8" 41 | limits: 42 | cpu: "4" 43 | memory: "30Gi" 44 | ephemeral-storage: "100Gi" 45 | nvidia.com/gpu: "8" 46 | env: 47 | - name: HUGGING_FACE_HUB_TOKEN 48 | valueFrom: 49 | secretKeyRef: 50 | name: huggingface 51 | key: HF_TOKEN 52 | - name: TRANSFORMERS_CACHE 53 | value: /.cache 54 | - name: shm-size 55 | value: 5g 56 | command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] 57 | args: ["--model=mistralai/Mixtral-8x7B-v0.1", 58 | "--gpu-memory-utilization=0.95", 59 | "--disable-log-requests", 60 | "--trust-remote-code", 61 | "--port=8000", 62 | "--tensor-parallel-size=8"] 63 | ports: 64 | - containerPort: 8000 65 | name: http 66 | volumeMounts: 67 | - mountPath: /dev/shm 68 | name: dshm 69 | - mountPath: /.triton 70 | name: cache 71 | - mountPath: /.cache 72 | name: cache 73 | - mountPath: /.triton 74 | name: triton 75 | 76 | --- 77 | apiVersion: v1 78 | kind: Service 79 | metadata: 80 | name: vllm-inference-server 81 | annotations: 82 | cloud.google.com/neg: '{"ingress": true}' 83 | cloud.google.com/backend-config: '{"default": "vllm-backendconfig"}' 84 | labels: 85 | app: vllm-inference-server 86 | spec: 87 | type: NodePort 88 | ports: 89 | - port: 8000 90 | targetPort: http 91 | name: http-inference-server 92 | 93 | selector: 94 | app: vllm-inference-server 95 | 96 | --- 97 | apiVersion: cloud.google.com/v1 98 | kind: BackendConfig 99 | metadata: 100 | name: vllm-backendconfig 101 | spec: 102 | # gRPC healthchecks not supported, use http endpoint instead https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-configuration#direct_health 103 | healthCheck: 104 | checkIntervalSec: 15 105 | timeoutSec: 500 106 | healthyThreshold: 1 107 | unhealthyThreshold: 2 108 | type: HTTP # GKE Ingress controller only supports HTTP, HTTPS, or HTTP2 109 | requestPath: /health # Not a real endpoint, but should work (via prometheus metrics exporter) 110 | port: 8000 111 | --- 112 | apiVersion: networking.k8s.io/v1 113 | kind: Ingress 114 | metadata: 115 | name: vllm-ingress 116 | annotations: 117 | kubernetes.io/ingress.class: "gce" 118 | kubernetes.io/ingress.global-static-ip-name: "ingress-vllm" 119 | spec: 120 | rules: 121 | - http: 122 | paths: 123 | - path: "/" 124 | pathType: Prefix 125 | backend: 126 | service: 127 | name: vllm-inference-server 128 | port: 129 | number: 8000 130 | -------------------------------------------------------------------------------- /vllm-deploy-tpu.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: vllm-tpu 5 | spec: 6 | replicas: 1 7 | selector: 8 | matchLabels: 9 | app: vllm-tpu 10 | template: 11 | metadata: 12 | labels: 13 | app: vllm-tpu 14 | annotations: 15 | gke-gcsfuse/volumes: "true" 
16 | gke-gcsfuse/cpu-limit: "0" 17 | gke-gcsfuse/memory-limit: "0" 18 | gke-gcsfuse/ephemeral-storage-limit: "0" 19 | 20 | spec: 21 | serviceAccountName: vllm 22 | nodeSelector: 23 | cloud.google.com/gke-tpu-topology: 2x2 24 | cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice 25 | containers: 26 | - name: vllm-tpu 27 | image: docker.io/vllm/vllm-tpu:73aa7041bfee43581314e6f34e9a657137ecc092 #$REGION_NAME-docker.pkg.dev/$PROJECT_ID/vllm-tpu/vllm-tpu:latest 28 | command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] 29 | args: 30 | - --host=0.0.0.0 31 | - --port=8000 32 | - --tensor-parallel-size=4 33 | - --max-model-len=8192 # max input + output len 34 | - --model=meta-llama/Meta-Llama-3.1-70B 35 | - --download-dir=/data 36 | env: 37 | - name: HUGGING_FACE_HUB_TOKEN 38 | valueFrom: 39 | secretKeyRef: 40 | name: hf-secret 41 | key: hf_api_token 42 | - name: VLLM_XLA_CACHE_PATH 43 | value: "/data" 44 | ports: 45 | - containerPort: 8000 46 | resources: 47 | limits: 48 | google.com/tpu: 4 49 | readinessProbe: 50 | tcpSocket: 51 | port: 8000 52 | initialDelaySeconds: 15 53 | periodSeconds: 10 54 | volumeMounts: 55 | - name: gcs-fuse-csi-ephemeral 56 | mountPath: /data 57 | - name: dshm 58 | mountPath: /dev/shm 59 | volumes: 60 | - name: gke-gcsfuse-cache 61 | emptyDir: 62 | medium: Memory 63 | - name: dshm 64 | emptyDir: 65 | medium: Memory 66 | - name: gcs-fuse-csi-ephemeral 67 | csi: 68 | driver: gcsfuse.csi.storage.gke.io 69 | volumeAttributes: 70 | bucketName: rick-lllama-factory 71 | mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1" 72 | --- 73 | apiVersion: v1 74 | kind: Service 75 | metadata: 76 | name: vllm-service 77 | spec: 78 | selector: 79 | app: vllm-tpu 80 | type: LoadBalancer 81 | ports: 82 | - name: http 83 | protocol: TCP 84 | port: 8000 85 | targetPort: 8000 86 | -------------------------------------------------------------------------------- /vllm-deploy.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: vllm-server 5 | labels: 6 | app: vllm-server 7 | spec: 8 | replicas: 1 9 | selector: 10 | matchLabels: 11 | app: vllm-inference-server 12 | template: 13 | metadata: 14 | labels: 15 | app: vllm-inference-server 16 | spec: 17 | volumes: 18 | - name: cache 19 | emptyDir: {} 20 | - name: dshm 21 | emptyDir: 22 | medium: Memory 23 | nodeSelector: 24 | cloud.google.com/gke-accelerator: nvidia-l4 25 | cloud.google.com/gke-spot: "true" #autopilot spot 26 | tolerations: 27 | - key: "nvidia.com/gpu" 28 | operator: "Exists" 29 | effect: "NoSchedule" 30 | serviceAccountName: vllm 31 | containers: 32 | - name: vllm-inference-server 33 | image: vllm/vllm-openai 34 | imagePullPolicy: IfNotPresent 35 | 36 | resources: 37 | limits: 38 | nvidia.com/gpu: 1 39 | env: 40 | - name: HUGGING_FACE_HUB_TOKEN 41 | valueFrom: 42 | secretKeyRef: 43 | name: huggingface 44 | key: HF_TOKEN 45 | - name: TRANSFORMERS_CACHE 46 | value: /.cache 47 | - name: shm-size 48 | value: 1g 49 | command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] 50 | args: ["--model=meta-llama/Llama-3.1-8B-Instruct", #google/gemma-2-2b-it", 51 | "--gpu-memory-utilization=0.95", 52 | "--disable-log-requests", 53 | "--trust-remote-code", 54 | "--port=8000", 55 | "--tensor-parallel-size=1"] 56 | ports: 57 | - containerPort: 8000 58 | name: http 
59 | volumeMounts: 60 | - mountPath: /dev/shm 61 | name: dshm 62 | - mountPath: /.cache 63 | name: cache 64 | 65 | --- 66 | apiVersion: v1 67 | kind: Service 68 | metadata: 69 | name: vllm-inference-server 70 | annotations: 71 | cloud.google.com/neg: '{"ingress": true}' 72 | #cloud.google.com/backend-config: '{"default": "vllm-backendconfig"}' 73 | labels: 74 | app: vllm-inference-server 75 | spec: 76 | type: NodePort 77 | ports: 78 | - port: 8000 79 | targetPort: http 80 | name: http-inference-server 81 | 82 | selector: 83 | app: vllm-inference-server 84 | 85 | --- 86 | #apiVersion: cloud.google.com/v1 87 | #kind: BackendConfig 88 | #metadata: 89 | # name: vllm-backendconfig 90 | #spec: 91 | # gRPC healthchecks not supported, use http endpoint instead https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-configuration#direct_health 92 | # healthCheck: 93 | # checkIntervalSec: 15 94 | # timeoutSec: 15 95 | # healthyThreshold: 1 96 | # unhealthyThreshold: 2 97 | # type: HTTP # GKE Ingress controller only supports HTTP, HTTPS, or HTTP2 98 | # requestPath: /health # Not a real endpoint, but should work (via prometheus metrics exporter) 99 | # port: 8000 100 | --- 101 | apiVersion: networking.k8s.io/v1 102 | kind: Ingress 103 | metadata: 104 | name: vllm-ingress 105 | annotations: 106 | kubernetes.io/ingress.class: "gce-internal" 107 | #kubernetes.io/ingress.global-static-ip-name: "ingress-vllm" 108 | spec: 109 | rules: 110 | - http: 111 | paths: 112 | - path: "/" 113 | pathType: Prefix 114 | backend: 115 | service: 116 | name: vllm-inference-server 117 | port: 118 | number: 8000 119 | -------------------------------------------------------------------------------- /webapp/Dockerfile: -------------------------------------------------------------------------------- 1 | # python3.8 breaks with gradio 2 | FROM python:3.10 3 | 4 | # Install dependencies from requirements.txt 5 | COPY ./src/requirements.txt . 6 | RUN pip install -r requirements.txt 7 | 8 | COPY ./src /src 9 | 10 | WORKDIR /src 11 | 12 | EXPOSE 7860 13 | 14 | CMD ["python", "app.py"] -------------------------------------------------------------------------------- /webapp/cloudbuild.yaml: -------------------------------------------------------------------------------- 1 | steps: 2 | - name: 'gcr.io/cloud-builders/docker' 3 | args: [ 'build', '-t', '$LOCATION-docker.pkg.dev/$PROJECT_ID/gke-llm/vllm-client:latest', '.' 
] 4 | images: 5 | - '$LOCATION-docker.pkg.dev/$PROJECT_ID/gke-llm/vllm-client:latest' 6 | # Push the container image to Artifact Registry, get sha256 of the image 7 | -------------------------------------------------------------------------------- /webapp/src/app.py: -------------------------------------------------------------------------------- 1 | import gradio as gr 2 | import requests 3 | import os 4 | from langchain_community.llms import VLLMOpenAI 5 | import json 6 | from openai import OpenAI 7 | 8 | llm_url = os.environ.get('LLM_URL') 9 | llm_name= os.environ.get('LLM_NAME') 10 | APIGEE_HOST=os.environ.get('APIGEE_HOST') 11 | APIKEY=os.environ.get('APIKEY') 12 | llm = VLLMOpenAI( 13 | openai_api_key="EMPTY", 14 | openai_api_base=f"{llm_url}/v1", 15 | model_name=f"{llm_name}", 16 | model_kwargs={"stop": ["."]}, 17 | ) 18 | #llm=OpenAI( 19 | # api_key="EMPTY", 20 | # base_url=f"{llm_url}/v1", 21 | # model=f"{llm_name}", 22 | #) 23 | 24 | def predict(question): 25 | url = f"https://{APIGEE_HOST}/v1/products?count=100" 26 | headers = {'x-apikey': APIKEY, 'Content-type': 'application/json'} 27 | 28 | resp = requests.get(url, headers = headers) 29 | products = resp.json() # Parse JSON response 30 | 31 | # Create a more structured prompt for better results 32 | system_prompt = """You are a helpful assistant that analyzes product information. 33 | When describing products, focus on key details like: 34 | - name 35 | - categories 36 | - priceUsd 37 | - description 38 | Provide a concise and natural summary.""" 39 | 40 | user_prompt = f"""I want information about {question}. 41 | Here are the product details in JSON format: 42 | {json.dumps(products, indent=2)} 43 | 44 | Please provide a natural summary focusing on products related to {question}. 45 | If no relevant products are found, please indicate that.""" 46 | 47 | messages = [ 48 | {"role": "system", "content": system_prompt}, 49 | {"role": "user", "content": user_prompt} 50 | ] 51 | 52 | #chat_response = llm.chat.completions.create( 53 | #model=f"{llm_name}", 54 | #messages=[ 55 | # {"role": "system", "content": "You are a helpful assistant."}, 56 | # {"role": "user", "content": "Tell me a joke."}, 57 | #] 58 | #) 59 | #print("Chat response:", chat_response) 60 | 61 | response = requests.post( 62 | f"{llm_url}/v1/chat/completions", 63 | headers={"Content-Type": "application/json"}, 64 | json={ 65 | "model": llm_name, 66 | "messages": messages, 67 | "temperature": 0.7, 68 | "max_tokens": 600 # Adjust based on your needs 69 | } 70 | ) 71 | # Debug the response 72 | response_data = response.json() 73 | print("API Response:", response_data) # For debugging 74 | 75 | if response.status_code != 200: 76 | return f"Error: API returned status code {response.status_code}" 77 | 78 | # Check if response has the expected structure 79 | if "choices" not in response_data: 80 | return f"Error: Unexpected API response format: {response_data}" 81 | 82 | if not response_data["choices"]: 83 | return "Error: No completion choices returned" 84 | 85 | return response_data["choices"][0]["message"]["content"].strip() 86 | 87 | 88 | examples = [ 89 | ["Sunglass"], 90 | ["Shoes"], 91 | ["Clothes"], 92 | ] 93 | logo_html = '
Logo
' 94 | 95 | demo = gr.Interface( 96 | predict, 97 | [ gr.Textbox(label="Enter prompt:", value="Sunglass"), 98 | 99 | ], 100 | "text", 101 | examples=examples, 102 | title= llm_name+" Knowledge Bot" +logo_html 103 | ) 104 | 105 | demo.launch(server_name="0.0.0.0", server_port=7860) -------------------------------------------------------------------------------- /webapp/src/falcon.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/llm-on-gke/vllm-inference/c531a0ab396f235c76975a413b5c80712e0e0f47/webapp/src/falcon.jpeg -------------------------------------------------------------------------------- /webapp/src/requirements.txt: -------------------------------------------------------------------------------- 1 | gradio==3.37.0 2 | Flask==2.2.2 3 | requests==2.31.0 4 | langchain 5 | langchain-community 6 | vllm --------------------------------------------------------------------------------