├── .gitignore ├── README.md ├── notebooks ├── 00-set-up-environment.ipynb ├── 01-semantic-search-rag.ipynb ├── 02-tweaking-semantic-search.ipynb ├── 03-hybrid-search.ipynb └── env.example ├── poetry.lock ├── pyproject.toml └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .env 2 | .env.prod 3 | .idea -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # rag-optimization-workshop 2 | 3 | This repository contains the materials for an RAG optimization workshop using Qdrant as a vector database. 4 | It consists of Jupyter notebooks that guide you step-by-step in various optimizations and tweaks. 5 | 6 | ## Prerequisites 7 | 8 | Please clone the repository and install all the dependencies to run the notebooks. 9 | 10 | ```bash 11 | git clone https://github.com/qdrant/workshop-rag-optimization.git 12 | ``` 13 | 14 | ### Poetry 15 | 16 | This project uses [Poetry](https://python-poetry.org/) to manage its dependencies. You can install it by following the instructions on the [official website](https://python-poetry.org/docs/#installation). 17 | Once you have it, the dependencies can be installed by running: 18 | 19 | ```bash 20 | cd workshop-rag-optimization 21 | poetry install --no-root 22 | poetry shell 23 | ``` 24 | 25 | ### Pip 26 | 27 | If you don't want to use Poetry, you can install the dependencies using pip: 28 | 29 | ```bash 30 | pip install -r requirements.txt 31 | ``` 32 | 33 | ## Running the notebooks 34 | 35 | Once all the dependencies are installed, Jupyter notebook might be started by running the following command: 36 | 37 | ```bash 38 | jupyter notebook 39 | ``` 40 | 41 | The default browser should open automatically. 42 | -------------------------------------------------------------------------------- /notebooks/00-set-up-environment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "77d074b8e62858a6", 6 | "metadata": { 7 | "collapsed": false, 8 | "jupyter": { 9 | "outputs_hidden": false 10 | } 11 | }, 12 | "source": [ 13 | "# Setting up the environment\n", 14 | "\n", 15 | "During the workshop, we will use LlamaIndex to build a RAG system, with Qdrant acting as the vector store. We can skip the indexing process, and simply start with a pre-built index, imported from a snapshot. 
However, before we start doing the actual work, we will review the underlying data and get familiar with it.\n", 16 | "\n", 17 | "## Prerequisites\n", 18 | "\n", 19 | "As usual, building up RAG requires a few components:\n", 20 | "\n", 21 | "- **Qdrant instance** - obviously, no RAG without Qdrant in the loop, either local or cloud version\n", 22 | "- **LLM** - we are going to work with OpenAI models, as they are the default of the LlamaIndex, so [please obtain an API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key)\n", 23 | "- **Embedding model** - there are plenty of models out there, but an open source [`BAAI/bge-large-en`](https://huggingface.co/BAAI/bge-large-en) is today's favorite" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "id": "5f9c5f7b903fcf2f", 29 | "metadata": { 30 | "collapsed": false, 31 | "jupyter": { 32 | "outputs_hidden": false 33 | } 34 | }, 35 | "source": [ 36 | "## Setting up Qdrant\n", 37 | "\n", 38 | "It doesn't matter if you prefer using a local Qdrant server running in a container, or our 1GB free tier cluster. Please make sure you have a running instance on hand." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "id": "6065733650176c8d", 44 | "metadata": { 45 | "collapsed": false, 46 | "jupyter": { 47 | "outputs_hidden": false 48 | } 49 | }, 50 | "source": [ 51 | "### Local Docker container\n", 52 | "\n", 53 | "If you have Docker installed on your machine, you should be able to launch an instance pretty quickly by running the following command." 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "id": "ac6874f21b4d1a6a", 60 | "metadata": { 61 | "collapsed": false, 62 | "jupyter": { 63 | "outputs_hidden": false 64 | } 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "!docker run -d -p \"6333:6333\" -p \"6334:6334\" qdrant/qdrant" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "id": "cf536ce934b91c79", 74 | "metadata": { 75 | "collapsed": false, 76 | "jupyter": { 77 | "outputs_hidden": false 78 | } 79 | }, 80 | "source": [ 81 | "### Qdrant Cloud free tier\n", 82 | "\n", 83 | "Another option is to [sign up for Qdrant Cloud](https://cloud.qdrant.io/login) and use the free tier 1GB cluster, which is available for everyone." 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "id": "96aef02fba7d6f82", 89 | "metadata": { 90 | "collapsed": false, 91 | "jupyter": { 92 | "outputs_hidden": false 93 | } 94 | }, 95 | "source": [ 96 | "## Saving configuration\n", 97 | "\n", 98 | "The last thing we need to set up before the start is to store all the secrets as environmental variables in the `.env` file. There is an `.env.example` we can use as a reference." 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "id": "723c1f816d7e4338", 104 | "metadata": { 105 | "collapsed": false, 106 | "jupyter": { 107 | "outputs_hidden": false 108 | } 109 | }, 110 | "source": [ 111 | "Once the configuration is done, let's try out the connection to our Qdrant instance." 
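
Before testing the connection, it is worth double-checking that the `.env` file defines the same keys as `notebooks/env.example`. A minimal example with placeholder values only (substitute your own cluster URL and keys; a local container typically needs no API key):

```bash
# .env file with placeholder values, replace them with your own
QDRANT_URL="https://your-cluster.cloud.qdrant.io:6333"  # or http://localhost:6333 for a local container
QDRANT_API_KEY="your-qdrant-api-key"                    # may be left empty for an unsecured local instance
OPENAI_API_KEY="your-openai-key"
```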
112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "id": "ead15566e447c4cf", 118 | "metadata": { 119 | "collapsed": false, 120 | "jupyter": { 121 | "outputs_hidden": false 122 | } 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "from dotenv import load_dotenv\n", 127 | "\n", 128 | "load_dotenv()" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "id": "805f9acbec609491", 135 | "metadata": { 136 | "collapsed": false, 137 | "jupyter": { 138 | "outputs_hidden": false 139 | } 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "from qdrant_client import QdrantClient\n", 144 | "\n", 145 | "import os\n", 146 | "\n", 147 | "client = QdrantClient(\n", 148 | " os.environ.get(\"QDRANT_URL\"), \n", 149 | " api_key=os.environ.get(\"QDRANT_API_KEY\"),\n", 150 | ")\n", 151 | "client.get_collections()" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "id": "1dc936dbdc03cc6c", 157 | "metadata": { 158 | "collapsed": false, 159 | "jupyter": { 160 | "outputs_hidden": false 161 | } 162 | }, 163 | "source": [ 164 | "## Data import\n", 165 | "\n", 166 | "RAG obviously needs data to work with. There are various challenges to overcome while indexing the documents, such as the chunking strategy. Another thing is creating the embeddings, which is usually a bottleneck of each system. Since this workshop is not about indexing, we're going to load the Qdrant collection from the snapshots I prepared beforehand." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "id": "ef095c28cf9bbc92", 173 | "metadata": { 174 | "collapsed": false, 175 | "jupyter": { 176 | "outputs_hidden": false 177 | } 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "client.recover_snapshot(\n", 182 | " collection_name=\"hacker-news\",\n", 183 | " # please do not modify the URL below\n", 184 | " location=\"https://snapshots.qdrant.io/workshop-rag-optimization/hacker-news-8895643013517159-2024-02-20-21-56-46.snapshot\",\n", 185 | " wait=False, # loading a snapshot may take some time, so let's avoid a timeout\n", 186 | ")" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "id": "c830bea3a55a87ba", 193 | "metadata": { 194 | "collapsed": false, 195 | "jupyter": { 196 | "outputs_hidden": false 197 | } 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "import time\n", 202 | "\n", 203 | "while True:\n", 204 | " collections = client.get_collections()\n", 205 | " if len(collections.collections) >= 1:\n", 206 | " break\n", 207 | " time.sleep(1.0)\n", 208 | "\n", 209 | "collections" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "id": "59b5c121411da1ac", 215 | "metadata": { 216 | "collapsed": false, 217 | "jupyter": { 218 | "outputs_hidden": false 219 | } 220 | }, 221 | "source": [ 222 | "For now, we're going to use a single collection, but once we get to the hybrid search, there'll be another one required. Let's already import it, so we don't need to worry again." 
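
Before importing the second snapshot below, an optional sanity check can confirm that the first one was restored correctly. The following is a small sketch using the standard `count` and `get_collection` calls of `qdrant-client`; the expected number of points is not documented here, so it is only printed for inspection:

```python
# Optional sanity check on the freshly restored collection.
point_count = client.count(collection_name="hacker-news", exact=True)
print(f"hacker-news contains {point_count.count} points")

# The vector configuration should match the embedding model used later on
# (BAAI/bge-large-en produces 1024-dimensional vectors).
collection_info = client.get_collection(collection_name="hacker-news")
print(collection_info.config.params.vectors)
```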
223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "id": "99a33b63e3c98195", 229 | "metadata": { 230 | "collapsed": false, 231 | "jupyter": { 232 | "outputs_hidden": false 233 | } 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "client.recover_snapshot(\n", 238 | " collection_name=\"hacker-news-hybrid\",\n", 239 | " location=\"https://snapshots.qdrant.io/workshop-rag-optimization/hacker-news-hybrid-8895643013517159-2024-02-20-21-56-54.snapshot\",\n", 240 | " wait=False,\n", 241 | ")" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "id": "5aa32c7643d30b1d", 248 | "metadata": { 249 | "collapsed": false, 250 | "jupyter": { 251 | "outputs_hidden": false 252 | } 253 | }, 254 | "outputs": [], 255 | "source": [ 256 | "while True:\n", 257 | " collections = client.get_collections()\n", 258 | " if len(collections.collections) >= 2:\n", 259 | " break\n", 260 | " time.sleep(1.0)\n", 261 | "\n", 262 | "collections" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "id": "ac78980080f7cfaa", 269 | "metadata": { 270 | "collapsed": false, 271 | "jupyter": { 272 | "outputs_hidden": false 273 | } 274 | }, 275 | "outputs": [], 276 | "source": [] 277 | } 278 | ], 279 | "metadata": { 280 | "kernelspec": { 281 | "display_name": "Python 3 (ipykernel)", 282 | "language": "python", 283 | "name": "python3" 284 | }, 285 | "language_info": { 286 | "codemirror_mode": { 287 | "name": "ipython", 288 | "version": 3 289 | }, 290 | "file_extension": ".py", 291 | "mimetype": "text/x-python", 292 | "name": "python", 293 | "nbconvert_exporter": "python", 294 | "pygments_lexer": "ipython3", 295 | "version": "3.10.12" 296 | } 297 | }, 298 | "nbformat": 4, 299 | "nbformat_minor": 5 300 | } 301 | -------------------------------------------------------------------------------- /notebooks/01-semantic-search-rag.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "# Semantic search based RAG\n", 7 | "\n", 8 | "We are going to use LlamaIndex to build a basic RAG pipeline that will use one of the open source embedding models. Then, we will consider different optimizations to either improve the performance or reduce the cost of the pipeline.\n" 9 | ], 10 | "metadata": { 11 | "collapsed": false 12 | }, 13 | "id": "fe3d1bd3cb7874cc" 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "source": [ 18 | "## Loading the configuration\n", 19 | "\n", 20 | "Before we start, all the configuration is loaded from the `.env` file we created in the previous notebook." 21 | ], 22 | "metadata": { 23 | "collapsed": false 24 | }, 25 | "id": "23e281aaf592e306" 26 | }, 27 | { 28 | "cell_type": "code", 29 | "outputs": [], 30 | "source": [ 31 | "from dotenv import load_dotenv\n", 32 | "\n", 33 | "load_dotenv()" 34 | ], 35 | "metadata": { 36 | "collapsed": true 37 | }, 38 | "id": "initial_id", 39 | "execution_count": null 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "source": [ 44 | "## Basic RAG setup\n", 45 | "\n", 46 | "We will be using one of the open source embedding models to vectorize our document (actually, the snapshots we imported in the previous notebook were generated using the same model, so we need to use it for queries as well). 
OpenAI GPT will be our LLM, and it is the default model for LlamaIndex, so there is no need to configure it explicitly.\n", 47 | "\n", 48 | "The vector index, which will act as a fast retrieval layer, is the last missing piece to build our basic semantic search RAG. Qdrant will serve that purpose, as all the documents are already there." 49 | ], 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "id": "45e4dd797cbe142a" 54 | }, 55 | { 56 | "cell_type": "code", 57 | "outputs": [], 58 | "source": [ 59 | "from llama_index import ServiceContext\n", 60 | "\n", 61 | "service_context = ServiceContext.from_defaults(\n", 62 | " embed_model=\"local:BAAI/bge-large-en\"\n", 63 | ")" 64 | ], 65 | "metadata": { 66 | "collapsed": false 67 | }, 68 | "id": "ceecec79071db759", 69 | "execution_count": null 70 | }, 71 | { 72 | "cell_type": "code", 73 | "outputs": [], 74 | "source": [ 75 | "from qdrant_client import QdrantClient\n", 76 | "from llama_index.vector_stores.qdrant import QdrantVectorStore\n", 77 | "\n", 78 | "import os\n", 79 | "\n", 80 | "client = QdrantClient(\n", 81 | " os.environ.get(\"QDRANT_URL\"), \n", 82 | " api_key=os.environ.get(\"QDRANT_API_KEY\"),\n", 83 | ")\n", 84 | "vector_store = QdrantVectorStore(\n", 85 | " client=client, \n", 86 | " collection_name=\"hacker-news\"\n", 87 | ")" 88 | ], 89 | "metadata": { 90 | "collapsed": false 91 | }, 92 | "id": "ef2c13ee4a83a508", 93 | "execution_count": null 94 | }, 95 | { 96 | "cell_type": "code", 97 | "outputs": [], 98 | "source": [ 99 | "from llama_index import VectorStoreIndex\n", 100 | "\n", 101 | "index = VectorStoreIndex.from_vector_store(\n", 102 | " vector_store=vector_store,\n", 103 | " service_context=service_context,\n", 104 | ")" 105 | ], 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "id": "2dca3d121903e207", 110 | "execution_count": null 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "source": [ 115 | "### Querying RAG\n", 116 | "\n", 117 | "LlamaIndex simplifies the querying process by providing a high-level API that abstracts the underlying complexity. We can use the `as_query_engine` method to create a query engine that will handle the entire process for us, with the default configuration." 118 | ], 119 | "metadata": { 120 | "collapsed": false 121 | }, 122 | "id": "c1f73a5b0023dae8" 123 | }, 124 | { 125 | "cell_type": "code", 126 | "outputs": [], 127 | "source": [ 128 | "query_engine = index.as_query_engine()\n", 129 | "response = query_engine.query(\"What is the best way to learn programming?\")\n", 130 | "print(response.response)" 131 | ], 132 | "metadata": { 133 | "collapsed": false 134 | }, 135 | "id": "3b10b5aec9a9aeb1", 136 | "execution_count": null 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "source": [ 141 | "Our RAG retrieves some possibly relevant documents by using the original prompt as a query, and then sends them as a part of the prompt to the LLM. It seems to be a good idea to check what were these documents, and if our LLM was not making up the answer using its internal knowledge." 
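
The next cell prints the raw text of each retrieved node. If you also want to see how strongly each node matched the query and which payload it carries, a slightly extended loop might look like the sketch below (it assumes the usual `NodeWithScore` objects returned by LlamaIndex, which expose a similarity score and the node metadata):

```python
# A more verbose inspection of the retrieved context: the similarity score plus
# the payload stored alongside each vector (e.g. the document type and date).
for i, node in enumerate(response.source_nodes):
    print(f"{i + 1}. score={node.get_score():.4f} metadata={node.metadata}")
    print(node.text, end="\n\n")
```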
142 | ], 143 | "metadata": { 144 | "collapsed": false 145 | }, 146 | "id": "f7ce6484a8eaac59" 147 | }, 148 | { 149 | "cell_type": "code", 150 | "outputs": [], 151 | "source": [ 152 | "for i, node in enumerate(response.source_nodes):\n", 153 | " print(i + 1, node.text, end=\"\\n\\n\")" 154 | ], 155 | "metadata": { 156 | "collapsed": false 157 | }, 158 | "id": "a62a2fb53edaa8c1", 159 | "execution_count": null 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "source": [ 164 | "The first tweak we can consider is to increase the number of documents fetched from our knowledge base (the default of LlamaIndex is just 2). We can do that by setting the `similarity_top_k` parameter of the `as_query_engine` method." 165 | ], 166 | "metadata": { 167 | "collapsed": false 168 | }, 169 | "id": "a392c560d622699d" 170 | }, 171 | { 172 | "cell_type": "code", 173 | "outputs": [], 174 | "source": [ 175 | "response = index \\\n", 176 | " .as_query_engine(similarity_top_k=5) \\\n", 177 | " .query(\"What is the best way to learn programming?\")\n", 178 | "print(response.response)" 179 | ], 180 | "metadata": { 181 | "collapsed": false 182 | }, 183 | "id": "e456cefe13dde93c", 184 | "execution_count": null 185 | }, 186 | { 187 | "cell_type": "code", 188 | "outputs": [], 189 | "source": [ 190 | "for i, node in enumerate(response.source_nodes):\n", 191 | " print(i + 1, node.text, end=\"\\n\\n\")" 192 | ], 193 | "metadata": { 194 | "collapsed": false 195 | }, 196 | "id": "c72f6f232be4b89", 197 | "execution_count": null 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "source": [ 202 | "## Customizing the RAG pipeline\n", 203 | "\n", 204 | "The defaults of LlamaIndex are a good starting point, but we can customize the pipeline to better fit our needs. That gives us more control over the behavior of the semantic search retriever or the way we interact with the LLM. LlamaIndex has pretty decent support for customizing the pipeline and there are three components that we need to set up:\n", 205 | "\n", 206 | "1. Retriever\n", 207 | "2. Response synthesizer\n", 208 | "3. Query engine" 209 | ], 210 | "metadata": { 211 | "collapsed": false 212 | }, 213 | "id": "a59904fb28cfe369" 214 | }, 215 | { 216 | "cell_type": "code", 217 | "outputs": [], 218 | "source": [ 219 | "from llama_index.query_engine import RetrieverQueryEngine\n", 220 | "from llama_index import get_response_synthesizer\n", 221 | "from llama_index.indices.vector_store import VectorIndexRetriever\n", 222 | "\n", 223 | "retriever = VectorIndexRetriever(\n", 224 | " index=index,\n", 225 | " similarity_top_k=5,\n", 226 | ")\n", 227 | "\n", 228 | "response_synthesizer = get_response_synthesizer()\n", 229 | "\n", 230 | "query_engine = RetrieverQueryEngine(\n", 231 | " retriever=retriever,\n", 232 | " response_synthesizer=response_synthesizer,\n", 233 | ")" 234 | ], 235 | "metadata": { 236 | "collapsed": false 237 | }, 238 | "id": "b73b609cef8b1916", 239 | "execution_count": null 240 | }, 241 | { 242 | "cell_type": "code", 243 | "outputs": [], 244 | "source": [ 245 | "response = query_engine.query(\"What is the best way to learn programming?\")\n", 246 | "print(response.response)" 247 | ], 248 | "metadata": { 249 | "collapsed": false 250 | }, 251 | "id": "59ff323f5dc019ed", 252 | "execution_count": null 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "source": [ 257 | "## Playing with response synthesizers\n", 258 | "\n", 259 | "Response synthesizers are responsible for interactions with the LLM. 
This is the component we want to control when it comes to prompts and the way we actually communicate with the language model. There are lots of parameters to tweak, and prompt engineering is a topic of its own. Thus, we won't dive deep into it here, but we can at least test out different response modes.\n", 260 | "\n", 261 | "The default one is `ResponseMode.COMPACT`, which combines retrieved text chunks into larger pieces to utilize the available context window. There are also plenty of other modes, and each may work best in a specific scenario. For example, some of the modes make a separate LLM call per retrieved text chunk, which may be beneficial in some cases, but also increases the cost of the pipeline.\n", 262 | "\n", 263 | "Let's just compare the previous response with the `ResponseMode.ACCUMULATE` and `ResponseMode.REFINE` modes. The first one creates a response for each chunk and then concatenates them, while the second one makes a separate LLM call for each chunk in an iterative manner, meaning each call uses the previous response as context." 264 | ], 265 | "metadata": { 266 | "collapsed": false 267 | }, 268 | "id": "98b489a71360ffa2" 269 | }, 270 | { 271 | "cell_type": "code", 272 | "outputs": [], 273 | "source": [ 274 | "from llama_index.response_synthesizers import ResponseMode\n", 275 | "\n", 276 | "accumulate_response_synthesizer = get_response_synthesizer(\n", 277 | " response_mode=ResponseMode.ACCUMULATE,\n", 278 | ")\n", 279 | "\n", 280 | "accumulate_query_engine = RetrieverQueryEngine(\n", 281 | " retriever=retriever,\n", 282 | " response_synthesizer=accumulate_response_synthesizer,\n", 283 | ")" 284 | ], 285 | "metadata": { 286 | "collapsed": false 287 | }, 288 | "id": "bb47addbdc849fcf", 289 | "execution_count": null 290 | }, 291 | { 292 | "cell_type": "code", 293 | "outputs": [], 294 | "source": [ 295 | "response = accumulate_query_engine.query(\"What is the best way to learn programming?\")\n", 296 | "print(response.response)" 297 | ], 298 | "metadata": { 299 | "collapsed": false 300 | }, 301 | "id": "d895eaf9f042fd22", 302 | "execution_count": null 303 | }, 304 | { 305 | "cell_type": "code", 306 | "outputs": [], 307 | "source": [ 308 | "refine_response_synthesizer = get_response_synthesizer(\n", 309 | " response_mode=ResponseMode.REFINE,\n", 310 | ")\n", 311 | "\n", 312 | "refine_query_engine = RetrieverQueryEngine(\n", 313 | " retriever=retriever,\n", 314 | " response_synthesizer=refine_response_synthesizer,\n", 315 | ")" 316 | ], 317 | "metadata": { 318 | "collapsed": false 319 | }, 320 | "id": "c82c017459bfb664", 321 | "execution_count": null 322 | }, 323 | { 324 | "cell_type": "code", 325 | "outputs": [], 326 | "source": [ 327 | "response = refine_query_engine.query(\"What is the best way to learn programming?\")\n", 328 | "print(response.response)" 329 | ], 330 | "metadata": { 331 | "collapsed": false 332 | }, 333 | "id": "e748b91db81b9336", 334 | "execution_count": null 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "source": [ 339 | "## Multitenancy\n", 340 | "\n", 341 | "Most real applications require some sort of data separation. If you collect data coming from different users or organizations, you probably don't want to mix them up in the answers. Quite a common mistake when using Qdrant is to create a separate collection for each tenant. Instead, you can use a metadata field to separate the data. This field should have a payload index created, so the filtering operations are fast. 
\n", 342 | "\n", 343 | "This is a Qdrant-specific feature, and the configuration is not done in LlamaIndex, but in Qdrant itself. However, we passed an instance of `QdrantClient` to the `QdrantVectorStore`, so we can use it to create a payload index for the metadata field.\n", 344 | "\n", 345 | "In our case, we can consider splitting the data by the type of the document. We have two types of documents in our collection: `story` and `comment`. We can use the `type` field to separate them." 346 | ], 347 | "metadata": { 348 | "collapsed": false 349 | }, 350 | "id": "235e641164884b02" 351 | }, 352 | { 353 | "cell_type": "code", 354 | "outputs": [], 355 | "source": [ 356 | "from qdrant_client import models\n", 357 | "\n", 358 | "client.create_payload_index(\n", 359 | " collection_name=\"hacker-news\",\n", 360 | " field_name=\"type\",\n", 361 | " field_schema=models.PayloadSchemaType.KEYWORD,\n", 362 | ")" 363 | ], 364 | "metadata": { 365 | "collapsed": false 366 | }, 367 | "id": "cb261d6fa91e727", 368 | "execution_count": null 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "source": [ 373 | "Using the newly created payload index, we can filter the documents by type. That's why we wanted to customize the pipeline, so we can add this filter to the retriever." 374 | ], 375 | "metadata": { 376 | "collapsed": false 377 | }, 378 | "id": "5077c72c40add0d3" 379 | }, 380 | { 381 | "cell_type": "code", 382 | "outputs": [], 383 | "source": [ 384 | "from llama_index.vector_stores import MetadataFilters, MetadataFilter\n", 385 | "\n", 386 | "filtering_retriever = VectorIndexRetriever(\n", 387 | " index=index,\n", 388 | " similarity_top_k=5,\n", 389 | " filters=MetadataFilters(\n", 390 | " filters=[\n", 391 | " MetadataFilter(key=\"type\", value=\"story\"),\n", 392 | " ]\n", 393 | " ),\n", 394 | ")\n", 395 | "\n", 396 | "filtering_query_engine = RetrieverQueryEngine(\n", 397 | " retriever=filtering_retriever,\n", 398 | " response_synthesizer=response_synthesizer,\n", 399 | ")" 400 | ], 401 | "metadata": { 402 | "collapsed": false 403 | }, 404 | "id": "448493cd6c4d1c32", 405 | "execution_count": null 406 | }, 407 | { 408 | "cell_type": "code", 409 | "outputs": [], 410 | "source": [ 411 | "response = filtering_query_engine.query(\"What is the best way to learn programming?\")\n", 412 | "print(response.response)" 413 | ], 414 | "metadata": { 415 | "collapsed": false 416 | }, 417 | "id": "b4e00178a2746d09", 418 | "execution_count": null 419 | }, 420 | { 421 | "cell_type": "code", 422 | "outputs": [], 423 | "source": [ 424 | "for i, node in enumerate(response.source_nodes):\n", 425 | " print(i + 1, node.text, end=\"\\n\\n\")" 426 | ], 427 | "metadata": { 428 | "collapsed": false 429 | }, 430 | "id": "3f05c4c83cf54647", 431 | "execution_count": null 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "source": [ 436 | "## Additional tweaks\n", 437 | "\n", 438 | "Some scenarios require different means than just semantic search. For example, if we want to prefer the most recent documents, none of the embedding models is going to capture it, since it is a cross-document relationship. 
LlamaIndex provides a way to add additional postprocessing, so we can include the additional constraints directly on the prefetched documents.\n" 439 | ], 440 | "metadata": { 441 | "collapsed": false 442 | }, 443 | "id": "88f35208132ff94e" 444 | }, 445 | { 446 | "cell_type": "code", 447 | "outputs": [], 448 | "source": [ 449 | "from llama_index.postprocessor import FixedRecencyPostprocessor\n", 450 | "\n", 451 | "prefetching_retriever = VectorIndexRetriever(\n", 452 | " index=index,\n", 453 | " similarity_top_k=25, # prefetch way more documents\n", 454 | " filters=MetadataFilters(\n", 455 | " filters=[\n", 456 | " MetadataFilter(key=\"type\", value=\"comment\"), # we want comments this time\n", 457 | " ]\n", 458 | " ),\n", 459 | ")\n", 460 | "\n", 461 | "recency_query_engine = RetrieverQueryEngine(\n", 462 | " retriever=prefetching_retriever,\n", 463 | " response_synthesizer=response_synthesizer,\n", 464 | " node_postprocessors=[\n", 465 | " FixedRecencyPostprocessor(\n", 466 | " service_context=service_context,\n", 467 | " date_key=\"date\", # date is the default key also, but make it explicit\n", 468 | " top_k=5, # leave just 20% of the prefetched documents\n", 469 | " )\n", 470 | " ]\n", 471 | ")" 472 | ], 473 | "metadata": { 474 | "collapsed": false 475 | }, 476 | "id": "924db74f9afcb13f", 477 | "execution_count": null 478 | }, 479 | { 480 | "cell_type": "code", 481 | "outputs": [], 482 | "source": [ 483 | "response = recency_query_engine.query(\"What is the best way to learn programming?\")\n", 484 | "print(response.response)" 485 | ], 486 | "metadata": { 487 | "collapsed": false 488 | }, 489 | "id": "36f4611d9c0d262a", 490 | "execution_count": null 491 | }, 492 | { 493 | "cell_type": "code", 494 | "outputs": [], 495 | "source": [ 496 | "for i, node in enumerate(response.source_nodes):\n", 497 | " print(i + 1, node.text, end=\"\\n\\n\")" 498 | ], 499 | "metadata": { 500 | "collapsed": false 501 | }, 502 | "id": "4c074064ccbfc715", 503 | "execution_count": null 504 | }, 505 | { 506 | "cell_type": "code", 507 | "outputs": [], 508 | "source": [ 509 | "from llama_index.postprocessor import EmbeddingRecencyPostprocessor\n", 510 | "\n", 511 | "embedding_recency_query_engine = RetrieverQueryEngine(\n", 512 | " retriever=prefetching_retriever,\n", 513 | " response_synthesizer=response_synthesizer,\n", 514 | " node_postprocessors=[\n", 515 | " EmbeddingRecencyPostprocessor(\n", 516 | " service_context=service_context,\n", 517 | " date_key=\"date\", # date is the default key\n", 518 | " similarity_cutoff=0.9,\n", 519 | " )\n", 520 | " ]\n", 521 | ")" 522 | ], 523 | "metadata": { 524 | "collapsed": false 525 | }, 526 | "id": "bf76ad82f0d270bb", 527 | "execution_count": null 528 | }, 529 | { 530 | "cell_type": "code", 531 | "outputs": [], 532 | "source": [ 533 | "response = embedding_recency_query_engine.query(\"What is the best way to learn programming?\")\n", 534 | "print(response.response)" 535 | ], 536 | "metadata": { 537 | "collapsed": false 538 | }, 539 | "id": "33ea0ba3f3d10b13", 540 | "execution_count": null 541 | }, 542 | { 543 | "cell_type": "code", 544 | "outputs": [], 545 | "source": [ 546 | "for i, node in enumerate(response.source_nodes):\n", 547 | " print(i + 1, node.text, end=\"\\n\\n\")" 548 | ], 549 | "metadata": { 550 | "collapsed": false 551 | }, 552 | "id": "3cbe86011e834e79", 553 | "execution_count": null 554 | }, 555 | { 556 | "cell_type": "code", 557 | "outputs": [], 558 | "source": [], 559 | "metadata": { 560 | "collapsed": false 561 | }, 562 | "id": "620de1ff4231ef30", 
563 | "execution_count": null 564 | } 565 | ], 566 | "metadata": { 567 | "kernelspec": { 568 | "display_name": "Python 3", 569 | "language": "python", 570 | "name": "python3" 571 | }, 572 | "language_info": { 573 | "codemirror_mode": { 574 | "name": "ipython", 575 | "version": 2 576 | }, 577 | "file_extension": ".py", 578 | "mimetype": "text/x-python", 579 | "name": "python", 580 | "nbconvert_exporter": "python", 581 | "pygments_lexer": "ipython2", 582 | "version": "2.7.6" 583 | } 584 | }, 585 | "nbformat": 4, 586 | "nbformat_minor": 5 587 | } 588 | -------------------------------------------------------------------------------- /notebooks/02-tweaking-semantic-search.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "# Tweaking up semantic retrieval\n", 7 | "\n", 8 | "There are various objectives we could try optimizing for when it comes to semantic retrieval. We could try to optimize the **speed** of the retrieval, the **quality** of it, or the **memory usage**. We'll review some of the techniques in all three areas." 9 | ], 10 | "metadata": { 11 | "collapsed": false 12 | }, 13 | "id": "5527397e89ce8f7f" 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "source": [ 18 | "## Loading the configuration and pipeline\n", 19 | "\n", 20 | "Again, let's start with loading the configuration, and then set up our retriever. We don't want a full RAG pipeline, as we are solely interested in the semantic search part. Improving a single component at a time should be easier to understand and debug. " 21 | ], 22 | "metadata": { 23 | "collapsed": false 24 | }, 25 | "id": "c97156bfd207c831" 26 | }, 27 | { 28 | "cell_type": "code", 29 | "outputs": [], 30 | "source": [ 31 | "from dotenv import load_dotenv\n", 32 | "\n", 33 | "load_dotenv()" 34 | ], 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | "id": "5400644c6fa94d96", 39 | "execution_count": null 40 | }, 41 | { 42 | "cell_type": "code", 43 | "outputs": [], 44 | "source": [ 45 | "from llama_index import ServiceContext\n", 46 | "\n", 47 | "service_context = ServiceContext.from_defaults(\n", 48 | " embed_model=\"local:BAAI/bge-large-en\"\n", 49 | ")" 50 | ], 51 | "metadata": { 52 | "collapsed": false 53 | }, 54 | "id": "c571d2c60524ef3d", 55 | "execution_count": null 56 | }, 57 | { 58 | "cell_type": "code", 59 | "outputs": [], 60 | "source": [ 61 | "from qdrant_client import QdrantClient\n", 62 | "from llama_index.vector_stores.qdrant import QdrantVectorStore\n", 63 | "\n", 64 | "import os\n", 65 | "\n", 66 | "client = QdrantClient(\n", 67 | " os.environ.get(\"QDRANT_URL\"), \n", 68 | " api_key=os.environ.get(\"QDRANT_API_KEY\"),\n", 69 | ")\n", 70 | "vector_store = QdrantVectorStore(\n", 71 | " client=client, \n", 72 | " collection_name=\"hacker-news\"\n", 73 | ")" 74 | ], 75 | "metadata": { 76 | "collapsed": false 77 | }, 78 | "id": "480e88c42e30d3cd", 79 | "execution_count": null 80 | }, 81 | { 82 | "cell_type": "code", 83 | "outputs": [], 84 | "source": [ 85 | "from llama_index import VectorStoreIndex\n", 86 | "\n", 87 | "index = VectorStoreIndex.from_vector_store(\n", 88 | " vector_store=vector_store,\n", 89 | " service_context=service_context,\n", 90 | ")" 91 | ], 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "id": "cb26026d9b2d8e0", 96 | "execution_count": null 97 | }, 98 | { 99 | "cell_type": "code", 100 | "outputs": [], 101 | "source": [ 102 | "from llama_index.vector_stores import MetadataFilters, MetadataFilter\n", 103 
| "from llama_index.indices.vector_store import VectorIndexRetriever\n", 104 | "\n", 105 | "retriever = VectorIndexRetriever(\n", 106 | " index=index,\n", 107 | " similarity_top_k=5,\n", 108 | " filters=MetadataFilters(\n", 109 | " filters=[\n", 110 | " MetadataFilter(key=\"type\", value=\"story\"),\n", 111 | " ]\n", 112 | " ),\n", 113 | ")" 114 | ], 115 | "metadata": { 116 | "collapsed": false 117 | }, 118 | "id": "a44471047edccc2a", 119 | "execution_count": null 120 | }, 121 | { 122 | "cell_type": "code", 123 | "outputs": [], 124 | "source": [ 125 | "nodes = retriever.retrieve(\"What is the best way to learn programming?\")\n", 126 | "for i, node in enumerate(nodes):\n", 127 | " print(i + 1, node.text, end=\"\\n\\n\")" 128 | ], 129 | "metadata": { 130 | "collapsed": false 131 | }, 132 | "id": "37fefadc69369516", 133 | "execution_count": null 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "source": [ 138 | "## Quality optimization\n", 139 | "\n", 140 | "We have implemented a basic RAG already, and we might be happy with the quality. There are a lot of aspects when it comes to measuring the quality of a semantic retrieval system, and we will not go into details here. It is usually related to the quality of the embedding model we use, and it is a topic for another day.\n", 141 | "\n", 142 | "However, all the vector databases approximate the nearest neighbor search, and this approximation comes with a cost. The cost is that the results are not always ideal. HNSW, an algorithm used in Qdrant, has some parameters to control how the internal structures are built, and these parameters can be tweaked to improve the quality of the results. This is very specific to the vector database used, thus it's configured through the Qdrant API." 143 | ], 144 | "metadata": { 145 | "collapsed": false 146 | }, 147 | "id": "1c8084e400f1ff37" 148 | }, 149 | { 150 | "cell_type": "code", 151 | "outputs": [], 152 | "source": [ 153 | "client.get_collection(collection_name=\"hacker-news\")" 154 | ], 155 | "metadata": { 156 | "collapsed": false 157 | }, 158 | "id": "18105b7cfb654cb6", 159 | "execution_count": null 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "source": [ 164 | "As for now, the most interesting part is the `hnsw_config` field. The algorithm itself is controlled by two parameters. The number of edges per node is called the `m` parameter. The larger the value, the higher the precision of the search, but the more space required. The `ef_construct` parameter is the number of neighbors to consider during the index building. Again, the larger the value, the higher the precision, but the longer the indexing time. \n", 165 | "\n", 166 | "Playing with both parameters **improves just the approximation of the exact nearest neighbors**, and a proper embedding model is still way more important. However, [this quality aspect might also be controlled, even in an automated way](https://qdrant.tech/documentation/tutorials/retrieval-quality/). For the time being, we'll simply increase both values, but won't measure the impact on the overall quality of search results." 
167 | ], 168 | "metadata": { 169 | "collapsed": false 170 | }, 171 | "id": "3a7eb35651edebd8" 172 | }, 173 | { 174 | "cell_type": "code", 175 | "outputs": [], 176 | "source": [ 177 | "from qdrant_client import models\n", 178 | "\n", 179 | "client.update_collection(\n", 180 | " collection_name=\"hacker-news\",\n", 181 | " hnsw_config=models.HnswConfigDiff(\n", 182 | " m=32,\n", 183 | " ef_construct=200,\n", 184 | " )\n", 185 | ")" 186 | ], 187 | "metadata": { 188 | "collapsed": false 189 | }, 190 | "id": "3668e7f04e38d00f", 191 | "execution_count": null 192 | }, 193 | { 194 | "cell_type": "code", 195 | "outputs": [], 196 | "source": [ 197 | "import time\n", 198 | "\n", 199 | "while True:\n", 200 | " collection = client.get_collection(\"hacker-news\")\n", 201 | " if collection.status == models.CollectionStatus.GREEN:\n", 202 | " break\n", 203 | " time.sleep(1.0)\n", 204 | " \n", 205 | "collection" 206 | ], 207 | "metadata": { 208 | "collapsed": false 209 | }, 210 | "id": "8c4d08884025a701", 211 | "execution_count": null 212 | }, 213 | { 214 | "cell_type": "code", 215 | "outputs": [], 216 | "source": [ 217 | "nodes = retriever.retrieve(\"What is the best way to learn programming?\")\n", 218 | "for i, node in enumerate(nodes):\n", 219 | " print(i + 1, node.text, end=\"\\n\\n\")" 220 | ], 221 | "metadata": { 222 | "collapsed": false 223 | }, 224 | "id": "8a2ede02bc5281ec", 225 | "execution_count": null 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "source": [ 230 | "## Memory optimization\n", 231 | "\n", 232 | "Each point in a Qdrant collection consists of up to three elements: id, vector(s), and optional payload represented by a JSON object. Vectors are indexed in an HNSW graph, and search operations may involve semantic similarity and some payload-based criteria (it's best to add payload indexes on the fields we want to use for the filtering). Ideally, all the elements should be kept in RAM so access is fast.\n", 233 | "\n", 234 | "Unfortunately, semantic search is a heavy operation in terms of memory requirements. However, some projects are implemented on a budget and can't afford machines with hundreds of gigabytes of RAM. Qdrant allows storing every single component on a disk to reduce memory usage, but that comes with a performance cost. Let's compare the efficiency of the operations with all the components in RAM and with some of them on disk." 
235 | ], 236 | "metadata": { 237 | "collapsed": false 238 | }, 239 | "id": "dd6b5c8fd6513bb5" 240 | }, 241 | { 242 | "cell_type": "code", 243 | "outputs": [], 244 | "source": [ 245 | "%%timeit -n 100 -r 5\n", 246 | "retriever.retrieve(\"What is the best way to learn programming?\")" 247 | ], 248 | "metadata": { 249 | "collapsed": false 250 | }, 251 | "id": "8d2159e4871b1dd6", 252 | "execution_count": null 253 | }, 254 | { 255 | "cell_type": "code", 256 | "outputs": [], 257 | "source": [ 258 | "client.update_collection(\n", 259 | " collection_name=\"hacker-news\",\n", 260 | " hnsw_config=models.HnswConfigDiff(\n", 261 | " on_disk=True,\n", 262 | " ),\n", 263 | " vectors_config={\n", 264 | " \"\": models.VectorParamsDiff(\n", 265 | " on_disk=True,\n", 266 | " )\n", 267 | " },\n", 268 | ")" 269 | ], 270 | "metadata": { 271 | "collapsed": false 272 | }, 273 | "id": "f1534ffc86ce1f7e", 274 | "execution_count": null 275 | }, 276 | { 277 | "cell_type": "code", 278 | "outputs": [], 279 | "source": [ 280 | "while True:\n", 281 | " collection = client.get_collection(\"hacker-news\")\n", 282 | " if collection.status == models.CollectionStatus.GREEN:\n", 283 | " break\n", 284 | " time.sleep(1.0)\n", 285 | " \n", 286 | "collection" 287 | ], 288 | "metadata": { 289 | "collapsed": false 290 | }, 291 | "id": "396074dc50565c7a", 292 | "execution_count": null 293 | }, 294 | { 295 | "cell_type": "code", 296 | "outputs": [], 297 | "source": [ 298 | "%%timeit -n 100 -r 5\n", 299 | "retriever.retrieve(\"What is the best way to learn programming?\")" 300 | ], 301 | "metadata": { 302 | "collapsed": false 303 | }, 304 | "id": "9f500d12db05fdc6", 305 | "execution_count": null 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "source": [ 310 | "## Speed optimization\n", 311 | "\n", 312 | "There are various ways of optimizing semantic search in terms of speed. The most straightforward one is to reduce both `m` and `ef_construct` parameters, as we did in the previous section. However, this comes with a cost of the quality of the results.\n", 313 | "\n", 314 | "Qdrant also provides a number of quantization techniques, and two of them are primarily used to increase speed and reduce memory at the same time:\n", 315 | "\n", 316 | "1. **Scalar Quantization** - uses `int8` instead of `float32` to store each vector dimension\n", 317 | "2. **Binary Quantization** - `bool` values are used to store each vector dimension\n", 318 | "\n", 319 | "The first one reduces the memory usage by up to 4x, while the second one by up to 32x and both increase the speed of the search. However, the quality of the search results is reduced, and Binary Quantization is not suitable for all the use cases. It only works with some specific models, usually the ones with high dimensionality." 320 | ], 321 | "metadata": { 322 | "collapsed": false 323 | }, 324 | "id": "8be9fc24e922f165" 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "source": [ 329 | "In our case, we're going to set up the binary quantization either way. From the LlamaIndex perspective, the search operations are going to be fired identically." 
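
The next cell enables binary quantization on the collection. When querying Qdrant directly, part of the quality loss can usually be recovered by oversampling and rescoring: Qdrant first fetches more candidates than requested using the compact binary vectors and then re-scores them with the original float32 vectors. A hedged sketch of such a query (again assuming a single unnamed dense vector and reusing the embedding model):

```python
from qdrant_client import models

# Sketch: search a binary-quantized collection with oversampling + rescoring.
query_vector = service_context.embed_model.get_query_embedding(
    "What is the best way to learn programming?"
)

rescored = client.search(
    collection_name="hacker-news",
    query_vector=query_vector,
    limit=5,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,      # use the quantized vectors for the first pass
            rescore=True,      # re-rank the candidates with the original vectors
            oversampling=2.0,  # fetch 2x more candidates before rescoring
        )
    ),
)
for point in rescored:
    print(point.id, point.score)
```

LlamaIndex does not need to know about any of this; it is purely a Qdrant-side knob for direct client queries.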
330 | ], 331 | "metadata": { 332 | "collapsed": false 333 | }, 334 | "id": "c1154f4172094cdf" 335 | }, 336 | { 337 | "cell_type": "code", 338 | "outputs": [], 339 | "source": [ 340 | "client.update_collection(\n", 341 | " collection_name=\"hacker-news\",\n", 342 | " quantization_config=models.BinaryQuantization(\n", 343 | " binary=models.BinaryQuantizationConfig(\n", 344 | " always_ram=True,\n", 345 | " )\n", 346 | " )\n", 347 | ")" 348 | ], 349 | "metadata": { 350 | "collapsed": false 351 | }, 352 | "id": "3e26ec00449bfb27", 353 | "execution_count": null 354 | }, 355 | { 356 | "cell_type": "code", 357 | "outputs": [], 358 | "source": [ 359 | "while True:\n", 360 | " collection = client.get_collection(\"hacker-news\")\n", 361 | " if collection.status == models.CollectionStatus.GREEN:\n", 362 | " break\n", 363 | " time.sleep(1.0)\n", 364 | " \n", 365 | "collection" 366 | ], 367 | "metadata": { 368 | "collapsed": false 369 | }, 370 | "id": "a01cc56dff3ea252", 371 | "execution_count": null 372 | }, 373 | { 374 | "cell_type": "code", 375 | "outputs": [], 376 | "source": [ 377 | "nodes = retriever.retrieve(\"What is the best way to learn programming?\")\n", 378 | "for i, node in enumerate(nodes):\n", 379 | " print(i + 1, node.text, end=\"\\n\\n\")" 380 | ], 381 | "metadata": { 382 | "collapsed": false 383 | }, 384 | "id": "6f4d07f587535769", 385 | "execution_count": null 386 | }, 387 | { 388 | "cell_type": "code", 389 | "outputs": [], 390 | "source": [ 391 | "%%timeit -n 100 -r 5\n", 392 | "retriever.retrieve(\"What is the best way to learn programming?\")" 393 | ], 394 | "metadata": { 395 | "collapsed": false 396 | }, 397 | "id": "408270c8449987c9", 398 | "execution_count": null 399 | }, 400 | { 401 | "cell_type": "code", 402 | "outputs": [], 403 | "source": [], 404 | "metadata": { 405 | "collapsed": false 406 | }, 407 | "id": "536dbd5e38b33efd", 408 | "execution_count": null 409 | } 410 | ], 411 | "metadata": { 412 | "kernelspec": { 413 | "display_name": "Python 3", 414 | "language": "python", 415 | "name": "python3" 416 | }, 417 | "language_info": { 418 | "codemirror_mode": { 419 | "name": "ipython", 420 | "version": 2 421 | }, 422 | "file_extension": ".py", 423 | "mimetype": "text/x-python", 424 | "name": "python", 425 | "nbconvert_exporter": "python", 426 | "pygments_lexer": "ipython2", 427 | "version": "2.7.6" 428 | } 429 | }, 430 | "nbformat": 4, 431 | "nbformat_minor": 5 432 | } 433 | -------------------------------------------------------------------------------- /notebooks/03-hybrid-search.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "# Hybrid Search: dense and sparse vectors\n", 7 | "\n", 8 | "LlamaIndex integration with Qdrant supports sparse embeddings as well. From the user perspective, it doesn't change much, as they interact through the same interface. Since sparse and dense vectors work best in different setups, it makes sense to combine them if we want to have the best of both worlds. 
There are, however, some parameters we can control.\n", 9 | "\n", 10 | "Let's again start with recreating our pipeline, but this time we will use the other collection that has sparse vectors as well.\n" 11 | ], 12 | "metadata": { 13 | "collapsed": false 14 | }, 15 | "id": "b3b7af9ec8fca084" 16 | }, 17 | { 18 | "cell_type": "code", 19 | "outputs": [], 20 | "source": [ 21 | "from dotenv import load_dotenv\n", 22 | "\n", 23 | "load_dotenv()" 24 | ], 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "id": "660547c536bbfa43", 29 | "execution_count": null 30 | }, 31 | { 32 | "cell_type": "code", 33 | "outputs": [], 34 | "source": [ 35 | "from llama_index import ServiceContext\n", 36 | "\n", 37 | "service_context = ServiceContext.from_defaults(\n", 38 | " embed_model=\"local:BAAI/bge-large-en\"\n", 39 | ")" 40 | ], 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "id": "44f2eda9a0c435d8", 45 | "execution_count": null 46 | }, 47 | { 48 | "cell_type": "code", 49 | "outputs": [], 50 | "source": [ 51 | "from qdrant_client import QdrantClient\n", 52 | "from llama_index.vector_stores.qdrant import QdrantVectorStore\n", 53 | "\n", 54 | "import os\n", 55 | "\n", 56 | "client = QdrantClient(\n", 57 | " os.environ.get(\"QDRANT_URL\"), \n", 58 | " api_key=os.environ.get(\"QDRANT_API_KEY\"),\n", 59 | ")\n", 60 | "vector_store_hybrid = QdrantVectorStore(\n", 61 | " client=client,\n", 62 | " collection_name=\"hacker-news-hybrid\",\n", 63 | " enable_hybrid=True,\n", 64 | " batch_size=20, # this is important for the ingestion\n", 65 | ")" 66 | ], 67 | "metadata": { 68 | "collapsed": false 69 | }, 70 | "id": "31fc9049413d2075", 71 | "execution_count": null 72 | }, 73 | { 74 | "cell_type": "code", 75 | "outputs": [], 76 | "source": [ 77 | "from llama_index import VectorStoreIndex\n", 78 | "\n", 79 | "index = VectorStoreIndex.from_vector_store(\n", 80 | " vector_store=vector_store_hybrid,\n", 81 | " service_context=service_context,\n", 82 | ")" 83 | ], 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "id": "acdcf928f564b071", 88 | "execution_count": null 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "source": [ 93 | "## Differences between sparse and dense vectors\n", 94 | "\n", 95 | "Sparse vectors are usually used in high-dimensional spaces, where the majority of the elements are zero. A single dimension represents a single word, so the dimensionality of the space is equal to the size of the vocabulary, with just a few non-zero values. \n", 96 | "\n", 97 | "There are various ways to create sparse vectors, but the most common one is to use the TF-IDF or BM25 representation. It's a simple and effective way to represent the importance of words in a document and in many cases create a solid baseline for the search.\n", 98 | "\n", 99 | "LlamaIndex uses SPLADE by default, which is based on transformers, similar to dense embedding models. **The main advantage of using sparse vectors is that they overcome the problem of vocabulary mismatch**. If a word is not present in the vocabulary of the dense embedding model, we can still represent it using the sparse vectors." 100 | ], 101 | "metadata": { 102 | "collapsed": false 103 | }, 104 | "id": "f1aa7483e19c702" 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "source": [ 109 | "## Using sparse vectors only\n", 110 | "\n", 111 | "Before we dive into the hybrid search, let's see what might be achieved by using sparse vectors alone. 
We already know the nodes retrieved with dense vectors, so it makes sense to compare the results returned by both methods." 112 | ], 113 | "metadata": { 114 | "collapsed": false 115 | }, 116 | "id": "dbd696863d4c0144" 117 | }, 118 | { 119 | "cell_type": "code", 120 | "outputs": [], 121 | "source": [ 122 | "from llama_index.vector_stores.types import VectorStoreQueryMode\n", 123 | "from llama_index.indices.vector_store import VectorIndexRetriever\n", 124 | "\n", 125 | "sparse_retriever = VectorIndexRetriever(\n", 126 | " index=index,\n", 127 | " vector_store_query_mode=VectorStoreQueryMode.SPARSE,\n", 128 | " sparse_top_k=5,\n", 129 | ")" 130 | ], 131 | "metadata": { 132 | "collapsed": false 133 | }, 134 | "id": "b724350a8f0a9fbd", 135 | "execution_count": null 136 | }, 137 | { 138 | "cell_type": "code", 139 | "outputs": [], 140 | "source": [ 141 | "nodes = sparse_retriever.retrieve(\"What is the best way to learn programming?\")\n", 142 | "for i, node in enumerate(nodes):\n", 143 | " print(i + 1, node.text, end=\"\\n\\n\")" 144 | ], 145 | "metadata": { 146 | "collapsed": false 147 | }, 148 | "id": "4762c0e9b8fb0231", 149 | "execution_count": null 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "source": [ 154 | "## Hybrid search\n", 155 | "\n", 156 | "There are some specific use cases in which we may prefer to use just the sparse vectors. However, both methods can complement each other, and we usually need to find the sweet spot. The `VectorIndexRetriever` class allows us to control the parameters of the search. We can set the `sparse_top_k` and `similarity_top_k` parameters to control the number of results returned by each method. We can also set the `alpha` parameter to control the relative importance of each method (`0.0` = sparse, `1.0` = dense vectors only)."
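
The retriever defined in the next cell sets all three parameters. Conceptually, the `alpha` weighting boils down to normalizing the two score lists and blending them with a single weight. The toy function below illustrates that idea only; it is not the exact formula LlamaIndex uses internally, and the names are made up for the example:

```python
# Toy illustration of alpha-weighted fusion of dense and sparse scores.
def fuse_scores(dense: dict, sparse: dict, alpha: float) -> dict:
    """Blend two {doc_id: score} mappings; alpha=1.0 keeps dense scores only."""
    def normalize(scores: dict) -> dict:
        if not scores:
            return {}
        low, high = min(scores.values()), max(scores.values())
        span = (high - low) or 1.0  # avoid division by zero
        return {doc_id: (score - low) / span for doc_id, score in scores.items()}

    dense_n, sparse_n = normalize(dense), normalize(sparse)
    return {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }

# Document "a" wins on the dense side, "b" on the sparse side; with alpha=0.1
# the sparse scores dominate the fused ranking.
fused = fuse_scores({"a": 0.9, "b": 0.2}, {"a": 1.0, "b": 7.5}, alpha=0.1)
print(sorted(fused.items(), key=lambda item: item[1], reverse=True))
```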
157 | ], 158 | "metadata": { 159 | "collapsed": false 160 | }, 161 | "id": "4d05f74509f13f9c" 162 | }, 163 | { 164 | "cell_type": "code", 165 | "outputs": [], 166 | "source": [ 167 | "hybrid_retriever = VectorIndexRetriever(\n", 168 | " index=index,\n", 169 | " vector_store_query_mode=VectorStoreQueryMode.HYBRID,\n", 170 | " sparse_top_k=5,\n", 171 | " similarity_top_k=5,\n", 172 | " alpha=0.1,\n", 173 | ")" 174 | ], 175 | "metadata": { 176 | "collapsed": false 177 | }, 178 | "id": "8aa7191e7ad214de", 179 | "execution_count": null 180 | }, 181 | { 182 | "cell_type": "code", 183 | "outputs": [], 184 | "source": [ 185 | "nodes = hybrid_retriever.retrieve(\"What is the best way to learn programming?\")\n", 186 | "for i, node in enumerate(nodes):\n", 187 | " print(i + 1, node.text, end=\"\\n\\n\")" 188 | ], 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "id": "ccbf546068405a4d", 193 | "execution_count": null 194 | }, 195 | { 196 | "cell_type": "code", 197 | "outputs": [], 198 | "source": [ 199 | "# We shouldn't be modifying the alpha parameter after the retriever has been created\n", 200 | "# but that's the easiest way to show the effect of the parameter\n", 201 | "hybrid_retriever._alpha = 0.9\n", 202 | "\n", 203 | "nodes = hybrid_retriever.retrieve(\"What is the best way to learn programming?\")\n", 204 | "for i, node in enumerate(nodes):\n", 205 | " print(i + 1, node.text, end=\"\\n\\n\")" 206 | ], 207 | "metadata": { 208 | "collapsed": false 209 | }, 210 | "id": "1290259a827c3f77", 211 | "execution_count": null 212 | }, 213 | { 214 | "cell_type": "code", 215 | "outputs": [], 216 | "source": [], 217 | "metadata": { 218 | "collapsed": false 219 | }, 220 | "id": "65af4ef3d86d61ac", 221 | "execution_count": null 222 | } 223 | ], 224 | "metadata": { 225 | "kernelspec": { 226 | "display_name": "Python 3", 227 | "language": "python", 228 | "name": "python3" 229 | }, 230 | "language_info": { 231 | "codemirror_mode": { 232 | "name": "ipython", 233 | "version": 2 234 | }, 235 | "file_extension": ".py", 236 | "mimetype": "text/x-python", 237 | "name": "python", 238 | "nbconvert_exporter": "python", 239 | "pygments_lexer": "ipython2", 240 | "version": "2.7.6" 241 | } 242 | }, 243 | "nbformat": 4, 244 | "nbformat_minor": 5 245 | } 246 | -------------------------------------------------------------------------------- /notebooks/env.example: -------------------------------------------------------------------------------- 1 | QDRANT_URL="https://xyx.com:6333" 2 | QDRANT_API_KEY="your-qdrant-api-key" 3 | 4 | OPENAI_API_KEY="your-openai-key" 5 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "hacker-news-workshop" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["Kacper Łukawski "] 6 | readme = "README.md" 7 | 8 | [tool.poetry.dependencies] 9 | python = "^3.10" 10 | llama-index = "^0.9.46" 11 | qdrant-client = "^1.7.3" 12 | transformers = "^4.37.2" 13 | torch = "^2.2.0" 14 | jupyter = "^1.0.0" 15 | python-dotenv = "^1.0.1" 16 | wget = "^3.2" 17 | pyarrow = "^15.0.0" 18 | 19 | 20 | [build-system] 21 | requires = ["poetry-core"] 22 | build-backend = "poetry.core.masonry.api" 23 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | aiohttp==3.9.3 2 | aiosignal==1.3.1 3 | annotated-types==0.6.0 4 | 
anyio==4.2.0 5 | argon2-cffi==23.1.0 6 | argon2-cffi-bindings==21.2.0 7 | arrow==1.3.0 8 | asttokens==2.4.1 9 | async-lru==2.0.4 10 | async-timeout==4.0.3 11 | attrs==23.2.0 12 | Babel==2.14.0 13 | beautifulsoup4==4.12.3 14 | bleach==6.1.0 15 | certifi==2024.2.2 16 | cffi==1.16.0 17 | charset-normalizer==3.3.2 18 | click==8.1.7 19 | comm==0.2.1 20 | dataclasses-json==0.6.4 21 | debugpy==1.8.1 22 | decorator==5.1.1 23 | defusedxml==0.7.1 24 | Deprecated==1.2.14 25 | dirtyjson==1.0.8 26 | distro==1.9.0 27 | exceptiongroup==1.2.0 28 | executing==2.0.1 29 | fastjsonschema==2.19.1 30 | filelock==3.13.1 31 | fqdn==1.5.1 32 | frozenlist==1.4.1 33 | fsspec==2024.2.0 34 | greenlet==3.0.3 35 | grpcio==1.60.1 36 | grpcio-tools==1.60.1 37 | h11==0.14.0 38 | h2==4.1.0 39 | hpack==4.0.0 40 | httpcore==1.0.3 41 | httpx==0.26.0 42 | huggingface-hub==0.20.3 43 | hyperframe==6.0.1 44 | idna==3.6 45 | ipykernel==6.29.2 46 | ipython==8.21.0 47 | ipywidgets==8.1.2 48 | isoduration==20.11.0 49 | jedi==0.19.1 50 | Jinja2==3.1.3 51 | joblib==1.3.2 52 | json5==0.9.14 53 | jsonpointer==2.4 54 | jsonschema==4.21.1 55 | jsonschema-specifications==2023.12.1 56 | jupyter==1.0.0 57 | jupyter-console==6.6.3 58 | jupyter-events==0.9.0 59 | jupyter-lsp==2.2.2 60 | jupyter_client==8.6.0 61 | jupyter_core==5.7.1 62 | jupyter_server==2.12.5 63 | jupyter_server_terminals==0.5.2 64 | jupyterlab==4.1.1 65 | jupyterlab_pygments==0.3.0 66 | jupyterlab_server==2.25.3 67 | jupyterlab_widgets==3.0.10 68 | llama-index==0.9.46 69 | MarkupSafe==2.1.5 70 | marshmallow==3.20.2 71 | matplotlib-inline==0.1.6 72 | mistune==3.0.2 73 | mpmath==1.3.0 74 | multidict==6.0.5 75 | mypy-extensions==1.0.0 76 | nbclient==0.9.0 77 | nbconvert==7.16.0 78 | nbformat==5.9.2 79 | nest-asyncio==1.6.0 80 | networkx==3.2.1 81 | nltk==3.8.1 82 | notebook==7.1.0 83 | notebook_shim==0.2.4 84 | numpy==1.26.4 85 | nvidia-cublas-cu12==12.1.3.1 86 | nvidia-cuda-cupti-cu12==12.1.105 87 | nvidia-cuda-nvrtc-cu12==12.1.105 88 | nvidia-cuda-runtime-cu12==12.1.105 89 | nvidia-cudnn-cu12==8.9.2.26 90 | nvidia-cufft-cu12==11.0.2.54 91 | nvidia-curand-cu12==10.3.2.106 92 | nvidia-cusolver-cu12==11.4.5.107 93 | nvidia-cusparse-cu12==12.1.0.106 94 | nvidia-nccl-cu12==2.19.3 95 | nvidia-nvjitlink-cu12==12.3.101 96 | nvidia-nvtx-cu12==12.1.105 97 | openai==1.12.0 98 | overrides==7.7.0 99 | packaging==23.2 100 | pandas==2.2.0 101 | pandocfilters==1.5.1 102 | parso==0.8.3 103 | pexpect==4.9.0 104 | pillow==10.2.0 105 | platformdirs==4.2.0 106 | portalocker==2.8.2 107 | prometheus_client==0.20.0 108 | prompt-toolkit==3.0.43 109 | protobuf==4.25.3 110 | psutil==5.9.8 111 | ptyprocess==0.7.0 112 | pure-eval==0.2.2 113 | pyarrow==15.0.0 114 | pycparser==2.21 115 | pydantic==2.6.1 116 | pydantic_core==2.16.2 117 | Pygments==2.17.2 118 | python-dateutil==2.8.2 119 | python-dotenv==1.0.1 120 | python-json-logger==2.0.7 121 | pytz==2024.1 122 | PyYAML==6.0.1 123 | pyzmq==25.1.2 124 | qdrant-client==1.7.3 125 | qtconsole==5.5.1 126 | QtPy==2.4.1 127 | referencing==0.33.0 128 | regex==2023.12.25 129 | requests==2.31.0 130 | rfc3339-validator==0.1.4 131 | rfc3986-validator==0.1.1 132 | rpds-py==0.18.0 133 | safetensors==0.4.2 134 | scikit-learn==1.4.1.post1 135 | scipy==1.12.0 136 | Send2Trash==1.8.2 137 | sentence-transformers==2.3.1 138 | sentencepiece==0.1.99 139 | six==1.16.0 140 | sniffio==1.3.0 141 | soupsieve==2.5 142 | SQLAlchemy==2.0.27 143 | stack-data==0.6.3 144 | sympy==1.12 145 | tenacity==8.2.3 146 | terminado==0.18.0 147 | threadpoolctl==3.3.0 148 | tiktoken==0.6.0 149 | 
tinycss2==1.2.1 150 | tokenizers==0.15.2 151 | tomli==2.0.1 152 | torch==2.2.0 153 | tornado==6.4 154 | tqdm==4.66.2 155 | traitlets==5.14.1 156 | transformers==4.37.2 157 | triton==2.2.0 158 | types-python-dateutil==2.8.19.20240106 159 | typing-inspect==0.9.0 160 | typing_extensions==4.9.0 161 | tzdata==2024.1 162 | uri-template==1.3.0 163 | urllib3==2.2.0 164 | wcwidth==0.2.13 165 | webcolors==1.13 166 | webencodings==0.5.1 167 | websocket-client==1.7.0 168 | wget==3.2 169 | widgetsnbextension==4.0.10 170 | wrapt==1.16.0 171 | yarl==1.9.4 172 | --------------------------------------------------------------------------------