├── .gitignore
├── LICENSE
├── README.md
├── ingestion_lambda
│   ├── lambda_function.py
│   ├── requirements.txt
│   └── setup.sh
├── rabbitmq
│   ├── rabbitmqcluster.yaml
│   ├── setup.sh
│   └── test_queue.py
├── tei
│   ├── setup.sh
│   ├── tei-deployment.yaml
│   └── tei-service.yaml
└── worker
    ├── Dockerfile
    ├── requirements.txt
    ├── setup.sh
    ├── worker-deployment.yaml
    ├── worker-service.yaml
    └── worker.py

/.gitignore:
--------------------------------------------------------------------------------
*.zip
__pycache__
pika*
venv
--------------------------------------------------------------------------------

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 LlamaIndex

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# LlamaIndex <> AWS

This repository contains the code needed to set up and configure a complete ingestion and retrieval API, deployed to AWS.

It gives you a starting point for scaling your data ingestion to handle large volumes of data, and you'll learn a bit about AWS along the way.

The following tech stack is used:
- AWS Lambda for ingestion and retrieval with LlamaIndex
- RabbitMQ for queuing ingestion jobs
- A custom docker image for ingesting data with LlamaIndex
- Hugging Face Text Embeddings Inference (TEI) for embedding our data

## Setup

First, ensure you have an AWS account, with some quota room for G5 EC2 instances.

Once you have an account, the following dependencies are needed:
1. [AWS account signup](https://portal.aws.amazon.com/billing/signup#/start/email)
2. [Install AWS CLI](https://docs.aws.amazon.com/eks/latest/userguide/setting-up.html)
   - Used to authenticate your AWS account for CLI tools
3. [Install eksctl](https://eksctl.io/installation/)
   - Used to create `EKS` clusters easily
4. [Install kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)
   - Used to configure and debug deployments, pods, services, etc.
5. [Install Docker](https://www.docker.com/products/docker-desktop/)
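Once everything is installed, a quick sanity check along these lines confirms the tooling can see your account (just a sketch -- exact version output will vary):

```bash
# confirm the AWS CLI is authenticated against your account
aws sts get-caller-identity

# confirm the cluster and container tooling is on your PATH
eksctl version
kubectl version --client
docker --version
```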
### 1. Deploying Text Embeddings Inference

```bash
cd tei
sh setup.sh
```

This will create a cluster using eksctl with g5.xlarge nodes. You can adjust the `--nodes` argument as needed, as well as the number of replicas in the `tei-deployment.yaml` file.

Note the public URL when you run `kubectl get svc`. The URL under `EXTERNAL-IP` will be used in `./worker/worker-deployment.yaml`.

For convenience, the `setup.sh` script prints the URL for you at the end.

### 2. Deploying RabbitMQ

The setup for RabbitMQ leverages an `operator` -- a Kubernetes abstraction that manages all the resources needed to run RabbitMQ.

```
cd rabbitmq
sh setup.sh
```

RabbitMQ will be deployed on an eksctl cluster, where each node shares provisioned storage using EBS. You'll notice some extra commands in the `setup.sh` file to install the EBS add-on, as well as to grant the IAM permissions needed for provisioning the storage.

Lastly, we use the `RabbitmqCluster` custom resource, installed by the cluster operator, to easily create our cluster using mostly default configs. You can visit the [example repo]() for more complex RabbitMQ deployments.

The setup may take some time. Even after the setup script finishes, it takes a while for pods and storage to start. You can check the output of `kubectl get pods` or `kubectl describe pod <pod name>` to see the current status, or check your AWS EKS dashboard.

Note that the public URL printed at the end will be used in `./worker/worker-deployment.yaml`.

Once RabbitMQ is fully deployed, you can visit `<rabbitmq URL>:15672` and log in with the username/password "guest" to monitor its status.

### 3. Deploying the Worker

Our worker deployment will continuously consume messages from the RabbitMQ queue. It then uses our TEI deployment to embed documents and inserts them into our vector db (cloud-hosted Weaviate in this case, to simplify ingestion).

Before running anything here, you should:

- `cd` into the `worker/` folder
- modify the env vars in `worker/worker-deployment.yaml` to point to the appropriate RabbitMQ, TEI, and Weaviate credentials
- modify the pipeline and vector store setup in `worker.py` if needed
- run `docker login` if not already logged in
- run `docker build -t <image name> .`
- run `docker tag <image name>:latest <docker username>/<image name>:<version>`
- run `docker push <docker username>/<image name>:<version>`
- edit `worker-deployment.yaml` and adjust the line `image: lloganm/worker:1.4` under `containers` to point to your docker image

With these steps complete, we can simply run `sh ./setup.sh`, which will create a cluster, deploy our container, and set up a load balancer.

`kubectl get pods` will show when your pods are ready.

### 4. Configuring AWS Lambda for Ingestion

Lastly, we need to configure AWS Lambda as a public endpoint that sends data into our queue for processing.

While this can be done using the CLI, I preferred using the AWS UI for this.

First, update `ingestion_lambda/lambda_function.py` to point to the proper URL for your RabbitMQ deployment (from step 2 -- I hope you wrote that down!).

Then:

```bash
cd ingestion_lambda
sh setup.sh
```

This creates a zip file with our lambda function, as well as all the dependencies needed to run it (namely just the `pika` package).
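If you do want to stay on the command line, the console steps below can be approximated with the AWS CLI. This is only a sketch: the function name is arbitrary, and the role ARN is a placeholder for an execution role you'd create yourself.

```bash
# create the function from the zip produced by setup.sh
aws lambda create-function \
  --function-name llamaindex-ingestion \
  --runtime python3.11 \
  --handler lambda_function.lambda_handler \
  --zip-file fileb://../ingestion_lambda.zip \
  --role "arn:aws:iam::<account-id>:role/<lambda-execution-role>"

# expose a public Function URL and allow unauthenticated invocations
aws lambda create-function-url-config \
  --function-name llamaindex-ingestion \
  --auth-type NONE

aws lambda add-permission \
  --function-name llamaindex-ingestion \
  --action lambda:InvokeFunctionUrl \
  --principal "*" \
  --function-url-auth-type NONE \
  --statement-id public-url
```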
Alternatively, with our zip package we can create the lambda function from the console:

- Open the Lambda console
- Click `Create function`
- Use a Python 3.11 runtime and give the function a name
- Click `Create function` at the bottom
- In the lambda editor, click the `Upload from` button and select `.zip file` -- upload the zip file we created earlier
- Click `Deploy`!
- Your public `Function URL` will show up in the top panel, or under `Configuration`

## Ingesting your Data

Once everything is deployed, you have a fully working ETL pipeline with LlamaIndex.

You can run ingestion by sending a POST request to the `Function URL` of your lambda function:

```python
import requests
from llama_index import Document, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()

# this will also be the namespace for the vector store -- for weaviate, it
# needs to start with a capital letter and contain only alphanumeric characters
user = "Loganm"

body = {
    'user': user,
    'documents': [doc.json() for doc in documents]
}

# use the URL of our lambda function here
response = requests.post("https://vguwrj5wc4wsd5lhgbgn37itay0lmkls.lambda-url.us-east-1.on.aws/", json=body)
print(response.text)
```

## Using your Data

Once you've ingested data, querying with llama-index is a breeze. Our pipeline has automatically put the data into Weaviate by default.

```python
from llama_index import VectorStoreIndex
from llama_index.vector_stores import WeaviateVectorStore
import weaviate

auth_config = weaviate.AuthApiKey(api_key="...")
client = weaviate.Client(url="...", auth_client_secret=auth_config)
vector_store = WeaviateVectorStore(weaviate_client=client, class_prefix="Loganm")

index = VectorStoreIndex.from_vector_store(vector_store)

# from here, query as usual
query_engine = index.as_query_engine()
print(query_engine.query("What do you know about Logan?"))
```
--------------------------------------------------------------------------------

/ingestion_lambda/lambda_function.py:
--------------------------------------------------------------------------------
import json

import pika


def lambda_handler(event, context):
    # Function URL invocations wrap the POST payload as a JSON string under
    # event['body']; direct invocations pass the fields at the top level
    if 'body' in event:
        event = json.loads(event['body'])

    user = event.get('user', '')
    documents = event.get('documents', [])

    if not user or not documents:
        return {
            'statusCode': 400,
            'body': json.dumps('Missing user or documents')
        }

    credentials = pika.PlainCredentials("guest", "guest")
    parameters = pika.ConnectionParameters(
        host="a5c51e88038e34e18ac2e8fc6e6281e7-1376501245.us-east-1.elb.amazonaws.com",
        port=5672,
        credentials=credentials
    )
    connection = pika.BlockingConnection(parameters=parameters)

    channel = connection.channel()
    channel.queue_declare(queue='etl')

    # publish one message per document so the workers can process in parallel
    for document in documents:
        data = {
            'user': user,
            'documents': [document]
        }
        channel.basic_publish(
            exchange="",
            routing_key='etl',
            body=json.dumps(data)
        )

    return {
        'statusCode': 200,
        'body': json.dumps('Documents queued for ingestion')
    }
--------------------------------------------------------------------------------

/ingestion_lambda/requirements.txt:
--------------------------------------------------------------------------------
pika==1.3.2
--------------------------------------------------------------------------------

/ingestion_lambda/setup.sh:
--------------------------------------------------------------------------------
#!/bin/sh
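# Vendor the dependencies (just pika) into this directory with `-t .`, then
# zip everything so lambda_function.py sits at the archive root -- the layout
# AWS Lambda expects for a Python zip deployment.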
pip install -r requirements.txt -t .

zip -r9 ../ingestion_lambda.zip . -x "*.git*" "*setup.sh*" "*requirements.txt*" "*.zip*"
--------------------------------------------------------------------------------

/rabbitmq/rabbitmqcluster.yaml:
--------------------------------------------------------------------------------
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: production-rabbitmqcluster
spec:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1
      memory: 2Gi
  rabbitmq:
    additionalConfig: |
      log.console.level = info
      channel_max = 1700
      default_user = guest
      default_pass = guest
      default_user_tags.administrator = true
  service:
    type: LoadBalancer
--------------------------------------------------------------------------------

/rabbitmq/setup.sh:
--------------------------------------------------------------------------------
#!/bin/sh

# the zones must be listed explicitly, otherwise the deploy fails
eksctl create cluster --name mqCluster --zones us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f

sleep 5

eksctl utils associate-iam-oidc-provider --cluster=mqCluster --region us-east-1 --approve

sleep 5

eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster mqCluster \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

sleep 5

eksctl create addon --name aws-ebs-csi-driver --cluster mqCluster --service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EBS_CSI_DriverRole --force

sleep 5

kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml

sleep 5

kubectl apply -f rabbitmqcluster.yaml

sleep 5

echo "RabbitMQ URL is: $(kubectl get svc production-rabbitmqcluster -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"

echo "Note: It may take some time for pods and storage to be ready. Run 'kubectl get pods' to check status."
--------------------------------------------------------------------------------

/rabbitmq/test_queue.py:
--------------------------------------------------------------------------------
import json

import pika

from llama_index import Document

rabbitmq_url = "a3ad05b37871d4dd4a5dfbd8c573230e-623959034.us-east-1.elb.amazonaws.com"
rabbitmq_user = "guest"
rabbitmq_password = "guest"

credentials = pika.PlainCredentials(rabbitmq_user, rabbitmq_password)
parameters = pika.ConnectionParameters(
    host=rabbitmq_url,
    port=5672,
    credentials=credentials
)
connection = pika.BlockingConnection(parameters=parameters)
channel = connection.channel()
channel.queue_declare(queue='etl')

documents = [Document(text="logan")]
data = {
    'user': "Logan",  # must start with a capital letter (weaviate requirement)
    'documents': [document.json() for document in documents]
}

channel.basic_publish(exchange="", routing_key='etl', body=json.dumps(data))
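# consume the message we just published to confirm the queue round-trips;
# in the full deployment, the worker is the consumer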
def callback(ch, method, properties, body):
    print(body, flush=True)
    print("Success! Use `ctrl+c` to exit.", flush=True)

channel.basic_consume(queue='etl', on_message_callback=callback, auto_ack=True)
channel.start_consuming()
--------------------------------------------------------------------------------

/tei/setup.sh:
--------------------------------------------------------------------------------
#!/bin/sh

eksctl create cluster --name embeddings --node-type=g5.xlarge --nodes 1

sleep 5

kubectl create -f ./tei-deployment.yaml

sleep 5

kubectl create -f ./tei-service.yaml

sleep 5

echo "Embeddings URL is: http://$(kubectl get svc tei-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"
--------------------------------------------------------------------------------

/tei/tei-deployment.yaml:
--------------------------------------------------------------------------------
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tei-deployment
  labels:
    app: tei-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tei-app
  template:
    metadata:
      labels:
        app: tei-app
    spec:
      containers:
        - name: tei-app
          image: ghcr.io/huggingface/text-embeddings-inference:86-0.6
          ports:
            - containerPort: 80
          args: ["--model-id", "BAAI/bge-large-en-v1.5", "--revision", "refs/pr/5"]
--------------------------------------------------------------------------------

/tei/tei-service.yaml:
--------------------------------------------------------------------------------
---
apiVersion: v1
kind: Service
metadata:
  name: tei-service
spec:
  type: LoadBalancer
  selector:
    app: tei-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
--------------------------------------------------------------------------------

/worker/Dockerfile:
--------------------------------------------------------------------------------
FROM python:3.11-alpine

WORKDIR /app

COPY requirements.txt .

RUN pip install -r requirements.txt

COPY . .
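# port 8000 serves the FastAPI health-check endpoint defined in worker.py;
# the RabbitMQ consumer runs in a background thread and needs no port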
EXPOSE 8000

CMD ["python", "worker.py"]
--------------------------------------------------------------------------------

/worker/requirements.txt:
--------------------------------------------------------------------------------
fastapi==0.108.0
llama-index==0.9.22
pika==1.3.2
uvicorn==0.25.0
weaviate-client==3.26.0
--------------------------------------------------------------------------------

/worker/setup.sh:
--------------------------------------------------------------------------------
#!/bin/sh

eksctl create cluster --name mq-workers --zones us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f

sleep 5

kubectl create -f ./worker-deployment.yaml

sleep 5

kubectl create -f ./worker-service.yaml
--------------------------------------------------------------------------------

/worker/worker-deployment.yaml:
--------------------------------------------------------------------------------
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mq-worker-deployment
  labels:
    app: mq-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mq-worker
  template:
    metadata:
      labels:
        app: mq-worker
    spec:
      containers:
        - name: mq-worker
          image: lloganm/worker:1.4
          env:
            - name: WEAVIATE_API_KEY
              value: "<weaviate api key>"
            - name: WEAVIATE_URL
              value: "<weaviate url>"
            - name: RABBITMQ_URL
              value: "<rabbitmq url>"
            - name: RABBITMQ_USER
              value: guest
            - name: RABBITMQ_PASSWORD
              value: guest
            - name: TEI_URL
              value: "<tei url>"
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: 4Gi
              cpu: "0.25"
--------------------------------------------------------------------------------

/worker/worker-service.yaml:
--------------------------------------------------------------------------------
---
apiVersion: v1
kind: Service
metadata:
  name: mq-worker-service
spec:
  type: LoadBalancer
  selector:
    app: mq-worker
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
--------------------------------------------------------------------------------

/worker/worker.py:
--------------------------------------------------------------------------------
import json
import os
import threading

import fastapi
import pika
import uvicorn
import weaviate

from llama_index.embeddings import TextEmbeddingsInference
from llama_index.ingestion import IngestionPipeline
from llama_index.text_splitter import TokenTextSplitter
from llama_index.schema import Document
from llama_index.vector_stores import WeaviateVectorStore


app = fastapi.FastAPI()


def worker_thread():
    """Worker thread that runs the ingestion pipeline using rabbitmq."""
    weaviate_api_key = os.environ['WEAVIATE_API_KEY']
    weaviate_url = os.environ['WEAVIATE_URL']

    auth_config = weaviate.AuthApiKey(api_key=weaviate_api_key)

    rabbitmq_url = os.environ['RABBITMQ_URL']
    rabbitmq_user = os.environ['RABBITMQ_USER']
    rabbitmq_password = os.environ['RABBITMQ_PASSWORD']

    credentials = pika.PlainCredentials(rabbitmq_user, rabbitmq_password)
    parameters = pika.ConnectionParameters(
        host=rabbitmq_url,
        port=5672,
        credentials=credentials
    )
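    # each queue message carries one user's documents as serialized JSON;
    # parse them back into llama-index Document objects before ingesting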
    def callback(ch, method, properties, body):
        try:
            data = json.loads(body.decode('utf-8'))
            documents = [Document.parse_raw(d) for d in data['documents']]

            user = data['user']
            # weaviate class prefixes must start with a capital letter
            user = user[0].upper() + user[1:]

            client = weaviate.Client(url=weaviate_url, auth_client_secret=auth_config)
            vector_store = WeaviateVectorStore(weaviate_client=client, class_prefix=user)

            tei_url = os.environ['TEI_URL']

            # setup pipeline
            ingestion_pipeline = IngestionPipeline(
                transformations=[
                    TokenTextSplitter(chunk_size=512),
                    TextEmbeddingsInference(
                        base_url=tei_url,
                        embed_batch_size=10,
                        model_name="BAAI/bge-large-en-v1.5"
                    ),
                ],
                vector_store=vector_store,
            )

            # ingest data directly into the user's vector db
            ingestion_pipeline.run(documents=documents)
        except Exception as e:
            print("Error during ingestion pipeline:", e)

    # reconnect and resume consuming if the connection ever drops
    while True:
        try:
            connection = pika.BlockingConnection(parameters=parameters)
            channel = connection.channel()
            channel.queue_declare(queue='etl')
            channel.basic_consume(queue='etl', on_message_callback=callback, auto_ack=True)
            channel.start_consuming()
        except Exception as e:
            print("Error consuming from RabbitMQ:", e)


@app.get('/health')
def health():
    return {'status': 'ok'}


if __name__ == '__main__':
    # start the queue consumer in the background
    threading.Thread(target=worker_thread).start()

    # start the webserver (health checks / load balancer probes)
    uvicorn.run(app, host='0.0.0.0', port=8000)
--------------------------------------------------------------------------------