├── .gitignore
├── LICENSE
├── README.md
├── ingestion_lambda
│   ├── lambda_function.py
│   ├── requirements.txt
│   └── setup.sh
├── rabbitmq
│   ├── rabbitmqcluster.yaml
│   ├── setup.sh
│   └── test_queue.py
├── tei
│   ├── setup.sh
│   ├── tei-deployment.yaml
│   └── tei-service.yaml
└── worker
    ├── Dockerfile
    ├── requirements.txt
    ├── setup.sh
    ├── worker-deployment.yaml
    ├── worker-service.yaml
    └── worker.py

/.gitignore:
--------------------------------------------------------------------------------
*.zip
__pycache__
pika*
venv
--------------------------------------------------------------------------------

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 LlamaIndex

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# LlamaIndex <> AWS

This repository contains the code needed to set up and configure a complete ingestion and retrieval API, deployed to AWS.

It gives you a starting point for scaling your data ingestion to handle large volumes of data, and you'll learn a bit about AWS along the way.

The following tech stack is used:
- AWS Lambda for ingestion and retrieval with LlamaIndex
- RabbitMQ for queuing ingestion jobs
- A custom docker image for ingesting data with LlamaIndex
- Hugging Face Text Embeddings Inference (TEI) for embedding our data

## Setup

First, ensure you have an AWS account, with some quota room for G5 EC2 instances.

Once you have an account, the following dependencies are needed:
1. [AWS account signup](https://portal.aws.amazon.com/billing/signup#/start/email)
2. [Install AWS CLI](https://docs.aws.amazon.com/eks/latest/userguide/setting-up.html)
   - Used to authenticate your AWS account for CLI tools
3. [Install eksctl](https://eksctl.io/installation/)
   - Used to create `EKS` clusters easily
4. [Install kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)
   - Used to configure and debug deployments, pods, services, etc.
5. [Install Docker](https://www.docker.com/products/docker-desktop/)
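Once everything is installed, a quick sanity check along these lines confirms the tooling can see your account (just a sketch -- exact version output will vary):

```bash
# confirm the AWS CLI is authenticated against your account
aws sts get-caller-identity

# confirm the cluster and container tooling is on your PATH
eksctl version
kubectl version --client
docker --version
```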
### 1. Deploying Text Embeddings Inference

```bash
cd tei
sh setup.sh
```

This will create a cluster using eksctl with g5.xlarge nodes. You can adjust the `--nodes` argument as needed, as well as the number of replicas in the `tei-deployment.yaml` file.

Note the public URL when you run `kubectl get svc`. The URL under `EXTERNAL-IP` will be used in `./worker/worker-deployment.yaml`.

For convenience, the `setup.sh` script prints the URL for you at the end.

### 2. Deploying RabbitMQ

The setup for RabbitMQ leverages an `operator` -- a Kubernetes abstraction that manages all the resources needed to run RabbitMQ.

```
cd rabbitmq
sh setup.sh
```

RabbitMQ will be deployed on an eksctl cluster, where each node shares provisioned storage using EBS. You'll notice some extra commands in the `setup.sh` file to install the EBS add-on, as well as to grant the IAM permissions needed for provisioning the storage.

Lastly, we use the `RabbitmqCluster` custom resource, installed by the cluster operator, to easily create our cluster using mostly default configs. You can visit the [example repo]() for more complex RabbitMQ deployments.

The setup may take some time. Even after the setup script finishes, it takes a while for pods and storage to start. You can check the output of `kubectl get pods` or `kubectl describe pod <pod name>` to see the current status, or check your AWS EKS dashboard.

Note that the public URL printed at the end will be used in `./worker/worker-deployment.yaml`.

Once RabbitMQ is fully deployed, you can visit `<rabbitmq URL>:15672` and log in with the username/password "guest" to monitor its status.

### 3. Deploying the Worker

Our worker deployment will continuously consume messages from the RabbitMQ queue. It then uses our TEI deployment to embed documents and inserts them into our vector db (cloud-hosted Weaviate in this case, to simplify ingestion).

Before running anything here, you should:

- `cd` into the `worker/` folder
- modify the env vars in `worker/worker-deployment.yaml` to point to the appropriate RabbitMQ, TEI, and Weaviate credentials
- modify the pipeline and vector store setup in `worker.py` if needed
- run `docker login` if not already logged in
- run `docker build -t <image name> .`
- run `docker tag <image name>:latest <docker username>/<image name>:<version>`
- run `docker push <docker username>/<image name>:<version>`
- edit `worker-deployment.yaml` and adjust the line `image: lloganm/worker:1.4` under `containers` to point to your docker image

With these steps complete, we can simply run `sh ./setup.sh`, which will create a cluster, deploy our container, and set up a load balancer.

`kubectl get pods` will show when your pods are ready.

### 4. Configuring AWS Lambda for Ingestion

Lastly, we need to configure AWS Lambda as a public endpoint that sends data into our queue for processing.

While this can be done using the CLI, I preferred using the AWS UI for this.

First, update `ingestion_lambda/lambda_function.py` to point to the proper URL for your RabbitMQ deployment (from step 2 -- I hope you wrote that down!).

Then:

```bash
cd ingestion_lambda
sh setup.sh
```

This creates a zip file with our lambda function, as well as all the dependencies needed to run it (namely just the `pika` package).
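If you do want to stay on the command line, the console steps below can be approximated with the AWS CLI. This is only a sketch: the function name is arbitrary, and the role ARN is a placeholder for an execution role you'd create yourself.

```bash
# create the function from the zip produced by setup.sh
aws lambda create-function \
  --function-name llamaindex-ingestion \
  --runtime python3.11 \
  --handler lambda_function.lambda_handler \
  --zip-file fileb://../ingestion_lambda.zip \
  --role "arn:aws:iam::<account-id>:role/<lambda-execution-role>"

# expose a public Function URL and allow unauthenticated invocations
aws lambda create-function-url-config \
  --function-name llamaindex-ingestion \
  --auth-type NONE

aws lambda add-permission \
  --function-name llamaindex-ingestion \
  --action lambda:InvokeFunctionUrl \
  --principal "*" \
  --function-url-auth-type NONE \
  --statement-id public-url
```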
Alternatively, with our zip package we can create the lambda function from the console:

- Open the Lambda console
- Click `Create function`
- Use a Python 3.11 runtime and give the function a name
- Click `Create function` at the bottom
- In the lambda editor, click the `Upload from` button and select `.zip file` -- upload the zip file we created earlier
- Click `Deploy`!
- Your public `Function URL` will show up in the top panel, or under `Configuration`

## Ingesting your Data

Once everything is deployed, you have a fully working ETL pipeline with LlamaIndex.

You can run ingestion by sending a POST request to the `Function URL` of your lambda function:

```python
import requests
from llama_index import Document, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()

# this will also be the namespace for the vector store -- for weaviate, it
# needs to start with a capital letter and contain only alphanumeric characters
user = "Loganm"

body = {
    'user': user,
    'documents': [doc.json() for doc in documents]
}

# use the URL of our lambda function here
response = requests.post("https://vguwrj5wc4wsd5lhgbgn37itay0lmkls.lambda-url.us-east-1.on.aws/", json=body)
print(response.text)
```

## Using your Data

Once you've ingested data, querying with llama-index is a breeze. Our pipeline has automatically put the data into Weaviate by default.

```python
from llama_index import VectorStoreIndex
from llama_index.vector_stores import WeaviateVectorStore
import weaviate

auth_config = weaviate.AuthApiKey(api_key="...")
client = weaviate.Client(url="...", auth_client_secret=auth_config)
vector_store = WeaviateVectorStore(weaviate_client=client, class_prefix="Loganm")

index = VectorStoreIndex.from_vector_store(vector_store)

# from here, query as usual
query_engine = index.as_query_engine()
print(query_engine.query("What do you know about Logan?"))
```
--------------------------------------------------------------------------------

/ingestion_lambda/lambda_function.py:
--------------------------------------------------------------------------------
import json

import pika


def lambda_handler(event, context):
    # Function URL invocations wrap the POST payload as a JSON string under
    # event['body']; direct invocations pass the fields at the top level
    if 'body' in event:
        event = json.loads(event['body'])

    user = event.get('user', '')
    documents = event.get('documents', [])

    if not user or not documents:
        return {
            'statusCode': 400,
            'body': json.dumps('Missing user or documents')
        }

    credentials = pika.PlainCredentials("guest", "guest")
    parameters = pika.ConnectionParameters(
        host="a5c51e88038e34e18ac2e8fc6e6281e7-1376501245.us-east-1.elb.amazonaws.com",
        port=5672,
        credentials=credentials
    )
    connection = pika.BlockingConnection(parameters=parameters)

    channel = connection.channel()
    channel.queue_declare(queue='etl')

    # publish one message per document so the workers can process in parallel
    for document in documents:
        data = {
            'user': user,
            'documents': [document]
        }
        channel.basic_publish(
            exchange="",
            routing_key='etl',
            body=json.dumps(data)
        )

    return {
        'statusCode': 200,
        'body': json.dumps('Documents queued for ingestion')
    }
--------------------------------------------------------------------------------

/ingestion_lambda/requirements.txt:
--------------------------------------------------------------------------------
pika==1.3.2
--------------------------------------------------------------------------------

/ingestion_lambda/setup.sh:
--------------------------------------------------------------------------------
#!/bin/sh
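# Vendor the dependencies (just pika) into this directory with `-t .`, then
# zip everything so lambda_function.py sits at the archive root -- the layout
# AWS Lambda expects for a Python zip deployment.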
pip install -r requirements.txt -t .

zip -r9 ../ingestion_lambda.zip . -x "*.git*" "*setup.sh*" "*requirements.txt*" "*.zip*"
--------------------------------------------------------------------------------

/rabbitmq/rabbitmqcluster.yaml:
--------------------------------------------------------------------------------
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: production-rabbitmqcluster
spec:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1
      memory: 2Gi
  rabbitmq:
    additionalConfig: |
      log.console.level = info
      channel_max = 1700
      default_user = guest
      default_pass = guest
      default_user_tags.administrator = true
  service:
    type: LoadBalancer
--------------------------------------------------------------------------------

/rabbitmq/setup.sh:
--------------------------------------------------------------------------------
#!/bin/sh

# the zones must be listed explicitly, otherwise the deploy fails
eksctl create cluster --name mqCluster --zones us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f

sleep 5

eksctl utils associate-iam-oidc-provider --cluster=mqCluster --region us-east-1 --approve

sleep 5

eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster mqCluster \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

sleep 5

eksctl create addon --name aws-ebs-csi-driver --cluster mqCluster --service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EBS_CSI_DriverRole --force

sleep 5

kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml

sleep 5

kubectl apply -f rabbitmqcluster.yaml

sleep 5

echo "RabbitMQ URL is: $(kubectl get svc production-rabbitmqcluster -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"

echo "Note: It may take some time for pods and storage to be ready. Run 'kubectl get pods' to check status."
--------------------------------------------------------------------------------

/rabbitmq/test_queue.py:
--------------------------------------------------------------------------------
import json

import pika

from llama_index import Document

rabbitmq_url = "a3ad05b37871d4dd4a5dfbd8c573230e-623959034.us-east-1.elb.amazonaws.com"
rabbitmq_user = "guest"
rabbitmq_password = "guest"

credentials = pika.PlainCredentials(rabbitmq_user, rabbitmq_password)
parameters = pika.ConnectionParameters(
    host=rabbitmq_url,
    port=5672,
    credentials=credentials
)
connection = pika.BlockingConnection(parameters=parameters)
channel = connection.channel()
channel.queue_declare(queue='etl')

documents = [Document(text="logan")]
data = {
    'user': "Logan",  # must start with a capital letter (weaviate requirement)
    'documents': [document.json() for document in documents]
}

channel.basic_publish(exchange="", routing_key='etl', body=json.dumps(data))
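# consume the message we just published to confirm the queue round-trips;
# in the full deployment, the worker is the consumer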
def callback(ch, method, properties, body):
    print(body, flush=True)
    print("Success! Use `ctrl+c` to exit.", flush=True)

channel.basic_consume(queue='etl', on_message_callback=callback, auto_ack=True)
channel.start_consuming()
--------------------------------------------------------------------------------

/tei/setup.sh:
--------------------------------------------------------------------------------
#!/bin/sh

eksctl create cluster --name embeddings --node-type=g5.xlarge --nodes 1

sleep 5

kubectl create -f ./tei-deployment.yaml

sleep 5

kubectl create -f ./tei-service.yaml

sleep 5

echo "Embeddings URL is: http://$(kubectl get svc tei-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"
--------------------------------------------------------------------------------

/tei/tei-deployment.yaml:
--------------------------------------------------------------------------------
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tei-deployment
  labels:
    app: tei-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tei-app
  template:
    metadata:
      labels:
        app: tei-app
    spec:
      containers:
        - name: tei-app
          image: ghcr.io/huggingface/text-embeddings-inference:86-0.6
          ports:
            - containerPort: 80
          args: ["--model-id", "BAAI/bge-large-en-v1.5", "--revision", "refs/pr/5"]
--------------------------------------------------------------------------------

/tei/tei-service.yaml:
--------------------------------------------------------------------------------
---
apiVersion: v1
kind: Service
metadata:
  name: tei-service
spec:
  type: LoadBalancer
  selector:
    app: tei-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
--------------------------------------------------------------------------------

/worker/Dockerfile:
--------------------------------------------------------------------------------
FROM python:3.11-alpine

WORKDIR /app

COPY requirements.txt .

RUN pip install -r requirements.txt

COPY . .
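# port 8000 serves the FastAPI health-check endpoint defined in worker.py;
# the RabbitMQ consumer runs in a background thread and needs no port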
EXPOSE 8000

CMD ["python", "worker.py"]
--------------------------------------------------------------------------------

/worker/requirements.txt:
--------------------------------------------------------------------------------
fastapi==0.108.0
llama-index==0.9.22
pika==1.3.2
uvicorn==0.25.0
weaviate-client==3.26.0
--------------------------------------------------------------------------------

/worker/setup.sh:
--------------------------------------------------------------------------------
#!/bin/sh

eksctl create cluster --name mq-workers --zones us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f

sleep 5

kubectl create -f ./worker-deployment.yaml

sleep 5

kubectl create -f ./worker-service.yaml
--------------------------------------------------------------------------------

/worker/worker-deployment.yaml:
--------------------------------------------------------------------------------
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mq-worker-deployment
  labels:
    app: mq-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mq-worker
  template:
    metadata:
      labels:
        app: mq-worker
    spec:
      containers:
        - name: mq-worker
          image: lloganm/worker:1.4
          env:
            - name: WEAVIATE_API_KEY
              value: "<weaviate api key>"
            - name: WEAVIATE_URL
              value: "<weaviate url>"
            - name: RABBITMQ_URL
              value: "<rabbitmq url>"
            - name: RABBITMQ_USER
              value: guest
            - name: RABBITMQ_PASSWORD
              value: guest
            - name: TEI_URL
              value: "<tei url>"
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: 4Gi
              cpu: "0.25"
--------------------------------------------------------------------------------

/worker/worker-service.yaml:
--------------------------------------------------------------------------------
---
apiVersion: v1
kind: Service
metadata:
  name: mq-worker-service
spec:
  type: LoadBalancer
  selector:
    app: mq-worker
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
--------------------------------------------------------------------------------

/worker/worker.py:
--------------------------------------------------------------------------------
import json
import os
import threading

import fastapi
import pika
import uvicorn
import weaviate

from llama_index.embeddings import TextEmbeddingsInference
from llama_index.ingestion import IngestionPipeline
from llama_index.text_splitter import TokenTextSplitter
from llama_index.schema import Document
from llama_index.vector_stores import WeaviateVectorStore


app = fastapi.FastAPI()


def worker_thread():
    """Worker thread that runs the ingestion pipeline using rabbitmq."""
    weaviate_api_key = os.environ['WEAVIATE_API_KEY']
    weaviate_url = os.environ['WEAVIATE_URL']

    auth_config = weaviate.AuthApiKey(api_key=weaviate_api_key)

    rabbitmq_url = os.environ['RABBITMQ_URL']
    rabbitmq_user = os.environ['RABBITMQ_USER']
    rabbitmq_password = os.environ['RABBITMQ_PASSWORD']

    credentials = pika.PlainCredentials(rabbitmq_user, rabbitmq_password)
    parameters = pika.ConnectionParameters(
        host=rabbitmq_url,
        port=5672,
        credentials=credentials
    )
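    # each queue message carries one user's documents as serialized JSON;
    # parse them back into llama-index Document objects before ingesting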
    def callback(ch, method, properties, body):
        try:
            data = json.loads(body.decode('utf-8'))
            documents = [Document.parse_raw(d) for d in data['documents']]

            user = data['user']
            # weaviate class prefixes must start with a capital letter
            user = user[0].upper() + user[1:]

            client = weaviate.Client(url=weaviate_url, auth_client_secret=auth_config)
            vector_store = WeaviateVectorStore(weaviate_client=client, class_prefix=user)

            tei_url = os.environ['TEI_URL']

            # setup pipeline
            ingestion_pipeline = IngestionPipeline(
                transformations=[
                    TokenTextSplitter(chunk_size=512),
                    TextEmbeddingsInference(
                        base_url=tei_url,
                        embed_batch_size=10,
                        model_name="BAAI/bge-large-en-v1.5"
                    ),
                ],
                vector_store=vector_store,
            )

            # ingest data directly into the user's vector db
            ingestion_pipeline.run(documents=documents)
        except Exception as e:
            print("Error during ingestion pipeline:", e)

    # reconnect and resume consuming if the connection ever drops
    while True:
        try:
            connection = pika.BlockingConnection(parameters=parameters)
            channel = connection.channel()
            channel.queue_declare(queue='etl')
            channel.basic_consume(queue='etl', on_message_callback=callback, auto_ack=True)
            channel.start_consuming()
        except Exception as e:
            print("Error consuming from RabbitMQ:", e)


@app.get('/health')
def health():
    return {'status': 'ok'}


if __name__ == '__main__':
    # start the queue consumer in the background
    threading.Thread(target=worker_thread).start()

    # start the webserver (health checks / load balancer probes)
    uvicorn.run(app, host='0.0.0.0', port=8000)
--------------------------------------------------------------------------------