├── .gitignore ├── README.md ├── notebooks ├── 00-set-up-environment.ipynb ├── 01-semantic-search-rag.ipynb ├── 02-tweaking-semantic-search.ipynb ├── 03-hybrid-search.ipynb └── env.example ├── poetry.lock ├── pyproject.toml └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .env 2 | .env.prod 3 | .idea -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # rag-optimization-workshop 2 | 3 | This repository contains the materials for an RAG optimization workshop using Qdrant as a vector database. 4 | It consists of Jupyter notebooks that guide you step-by-step in various optimizations and tweaks. 5 | 6 | ## Prerequisites 7 | 8 | Please clone the repository and install all the dependencies to run the notebooks. 9 | 10 | ```bash 11 | git clone https://github.com/qdrant/workshop-rag-optimization.git 12 | ``` 13 | 14 | ### Poetry 15 | 16 | This project uses [Poetry](https://python-poetry.org/) to manage its dependencies. You can install it by following the instructions on the [official website](https://python-poetry.org/docs/#installation). 17 | Once you have it, the dependencies can be installed by running: 18 | 19 | ```bash 20 | cd workshop-rag-optimization 21 | poetry install --no-root 22 | poetry shell 23 | ``` 24 | 25 | ### Pip 26 | 27 | If you don't want to use Poetry, you can install the dependencies using pip: 28 | 29 | ```bash 30 | pip install -r requirements.txt 31 | ``` 32 | 33 | ## Running the notebooks 34 | 35 | Once all the dependencies are installed, Jupyter notebook might be started by running the following command: 36 | 37 | ```bash 38 | jupyter notebook 39 | ``` 40 | 41 | The default browser should open automatically. 42 | -------------------------------------------------------------------------------- /notebooks/00-set-up-environment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "77d074b8e62858a6", 6 | "metadata": { 7 | "collapsed": false, 8 | "jupyter": { 9 | "outputs_hidden": false 10 | } 11 | }, 12 | "source": [ 13 | "# Setting up the environment\n", 14 | "\n", 15 | "During the workshop, we will use LlamaIndex to build a RAG system, with Qdrant acting as the vector store. We can skip the indexing process, and simply start with a pre-built index, imported from a snapshot. 
However, before we start doing the actual work, we will review the underlying data and get familiar with it.\n", 16 | "\n", 17 | "## Prerequisites\n", 18 | "\n", 19 | "As usual, building up RAG requires a few components:\n", 20 | "\n", 21 | "- **Qdrant instance** - obviously, no RAG without Qdrant in the loop, either local or cloud version\n", 22 | "- **LLM** - we are going to work with OpenAI models, as they are the default of the LlamaIndex, so [please obtain an API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key)\n", 23 | "- **Embedding model** - there are plenty of models out there, but an open source [`BAAI/bge-large-en`](https://huggingface.co/BAAI/bge-large-en) is today's favorite" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "id": "5f9c5f7b903fcf2f", 29 | "metadata": { 30 | "collapsed": false, 31 | "jupyter": { 32 | "outputs_hidden": false 33 | } 34 | }, 35 | "source": [ 36 | "## Setting up Qdrant\n", 37 | "\n", 38 | "It doesn't matter if you prefer using a local Qdrant server running in a container, or our 1GB free tier cluster. Please make sure you have a running instance on hand." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "id": "6065733650176c8d", 44 | "metadata": { 45 | "collapsed": false, 46 | "jupyter": { 47 | "outputs_hidden": false 48 | } 49 | }, 50 | "source": [ 51 | "### Local Docker container\n", 52 | "\n", 53 | "If you have Docker installed on your machine, you should be able to launch an instance pretty quickly by running the following command." 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "id": "ac6874f21b4d1a6a", 60 | "metadata": { 61 | "collapsed": false, 62 | "jupyter": { 63 | "outputs_hidden": false 64 | } 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "!docker run -d -p \"6333:6333\" -p \"6334:6334\" qdrant/qdrant" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "id": "cf536ce934b91c79", 74 | "metadata": { 75 | "collapsed": false, 76 | "jupyter": { 77 | "outputs_hidden": false 78 | } 79 | }, 80 | "source": [ 81 | "### Qdrant Cloud free tier\n", 82 | "\n", 83 | "Another option is to [sign up for Qdrant Cloud](https://cloud.qdrant.io/login) and use the free tier 1GB cluster, which is available for everyone." 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "id": "96aef02fba7d6f82", 89 | "metadata": { 90 | "collapsed": false, 91 | "jupyter": { 92 | "outputs_hidden": false 93 | } 94 | }, 95 | "source": [ 96 | "## Saving configuration\n", 97 | "\n", 98 | "The last thing we need to set up before the start is to store all the secrets as environmental variables in the `.env` file. There is an `.env.example` we can use as a reference." 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "id": "723c1f816d7e4338", 104 | "metadata": { 105 | "collapsed": false, 106 | "jupyter": { 107 | "outputs_hidden": false 108 | } 109 | }, 110 | "source": [ 111 | "Once the configuration is done, let's try out the connection to our Qdrant instance." 
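
Before testing the connection, it is worth double-checking that the `.env` file defines the same keys as `notebooks/env.example`. A minimal example with placeholder values only (substitute your own cluster URL and keys; a local container typically needs no API key):

```bash
# .env file with placeholder values, replace them with your own
QDRANT_URL="https://your-cluster.cloud.qdrant.io:6333"  # or http://localhost:6333 for a local container
QDRANT_API_KEY="your-qdrant-api-key"                    # may be left empty for an unsecured local instance
OPENAI_API_KEY="your-openai-key"
```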
112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "id": "ead15566e447c4cf", 118 | "metadata": { 119 | "collapsed": false, 120 | "jupyter": { 121 | "outputs_hidden": false 122 | } 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "from dotenv import load_dotenv\n", 127 | "\n", 128 | "load_dotenv()" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "id": "805f9acbec609491", 135 | "metadata": { 136 | "collapsed": false, 137 | "jupyter": { 138 | "outputs_hidden": false 139 | } 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "from qdrant_client import QdrantClient\n", 144 | "\n", 145 | "import os\n", 146 | "\n", 147 | "client = QdrantClient(\n", 148 | " os.environ.get(\"QDRANT_URL\"), \n", 149 | " api_key=os.environ.get(\"QDRANT_API_KEY\"),\n", 150 | ")\n", 151 | "client.get_collections()" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "id": "1dc936dbdc03cc6c", 157 | "metadata": { 158 | "collapsed": false, 159 | "jupyter": { 160 | "outputs_hidden": false 161 | } 162 | }, 163 | "source": [ 164 | "## Data import\n", 165 | "\n", 166 | "RAG obviously needs data to work with. There are various challenges to overcome while indexing the documents, such as the chunking strategy. Another thing is creating the embeddings, which is usually a bottleneck of each system. Since this workshop is not about indexing, we're going to load the Qdrant collection from the snapshots I prepared beforehand." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "id": "ef095c28cf9bbc92", 173 | "metadata": { 174 | "collapsed": false, 175 | "jupyter": { 176 | "outputs_hidden": false 177 | } 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "client.recover_snapshot(\n", 182 | " collection_name=\"hacker-news\",\n", 183 | " # please do not modify the URL below\n", 184 | " location=\"https://snapshots.qdrant.io/workshop-rag-optimization/hacker-news-8895643013517159-2024-02-20-21-56-46.snapshot\",\n", 185 | " wait=False, # loading a snapshot may take some time, so let's avoid a timeout\n", 186 | ")" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "id": "c830bea3a55a87ba", 193 | "metadata": { 194 | "collapsed": false, 195 | "jupyter": { 196 | "outputs_hidden": false 197 | } 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "import time\n", 202 | "\n", 203 | "while True:\n", 204 | " collections = client.get_collections()\n", 205 | " if len(collections.collections) >= 1:\n", 206 | " break\n", 207 | " time.sleep(1.0)\n", 208 | "\n", 209 | "collections" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "id": "59b5c121411da1ac", 215 | "metadata": { 216 | "collapsed": false, 217 | "jupyter": { 218 | "outputs_hidden": false 219 | } 220 | }, 221 | "source": [ 222 | "For now, we're going to use a single collection, but once we get to the hybrid search, there'll be another one required. Let's already import it, so we don't need to worry again." 
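
Before importing the second snapshot below, an optional sanity check can confirm that the first one was restored correctly. The following is a small sketch using the standard `count` and `get_collection` calls of `qdrant-client`; the expected number of points is not documented here, so it is only printed for inspection:

```python
# Optional sanity check on the freshly restored collection.
point_count = client.count(collection_name="hacker-news", exact=True)
print(f"hacker-news contains {point_count.count} points")

# The vector configuration should match the embedding model used later on
# (BAAI/bge-large-en produces 1024-dimensional vectors).
collection_info = client.get_collection(collection_name="hacker-news")
print(collection_info.config.params.vectors)
```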
223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "id": "99a33b63e3c98195", 229 | "metadata": { 230 | "collapsed": false, 231 | "jupyter": { 232 | "outputs_hidden": false 233 | } 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "client.recover_snapshot(\n", 238 | " collection_name=\"hacker-news-hybrid\",\n", 239 | " location=\"https://snapshots.qdrant.io/workshop-rag-optimization/hacker-news-hybrid-8895643013517159-2024-02-20-21-56-54.snapshot\",\n", 240 | " wait=False,\n", 241 | ")" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "id": "5aa32c7643d30b1d", 248 | "metadata": { 249 | "collapsed": false, 250 | "jupyter": { 251 | "outputs_hidden": false 252 | } 253 | }, 254 | "outputs": [], 255 | "source": [ 256 | "while True:\n", 257 | " collections = client.get_collections()\n", 258 | " if len(collections.collections) >= 2:\n", 259 | " break\n", 260 | " time.sleep(1.0)\n", 261 | "\n", 262 | "collections" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "id": "ac78980080f7cfaa", 269 | "metadata": { 270 | "collapsed": false, 271 | "jupyter": { 272 | "outputs_hidden": false 273 | } 274 | }, 275 | "outputs": [], 276 | "source": [] 277 | } 278 | ], 279 | "metadata": { 280 | "kernelspec": { 281 | "display_name": "Python 3 (ipykernel)", 282 | "language": "python", 283 | "name": "python3" 284 | }, 285 | "language_info": { 286 | "codemirror_mode": { 287 | "name": "ipython", 288 | "version": 3 289 | }, 290 | "file_extension": ".py", 291 | "mimetype": "text/x-python", 292 | "name": "python", 293 | "nbconvert_exporter": "python", 294 | "pygments_lexer": "ipython3", 295 | "version": "3.10.12" 296 | } 297 | }, 298 | "nbformat": 4, 299 | "nbformat_minor": 5 300 | } 301 | -------------------------------------------------------------------------------- /notebooks/01-semantic-search-rag.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "# Semantic search based RAG\n", 7 | "\n", 8 | "We are going to use LlamaIndex to build a basic RAG pipeline that will use one of the open source embedding models. Then, we will consider different optimizations to either improve the performance or reduce the cost of the pipeline.\n" 9 | ], 10 | "metadata": { 11 | "collapsed": false 12 | }, 13 | "id": "fe3d1bd3cb7874cc" 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "source": [ 18 | "## Loading the configuration\n", 19 | "\n", 20 | "Before we start, all the configuration is loaded from the `.env` file we created in the previous notebook." 21 | ], 22 | "metadata": { 23 | "collapsed": false 24 | }, 25 | "id": "23e281aaf592e306" 26 | }, 27 | { 28 | "cell_type": "code", 29 | "outputs": [], 30 | "source": [ 31 | "from dotenv import load_dotenv\n", 32 | "\n", 33 | "load_dotenv()" 34 | ], 35 | "metadata": { 36 | "collapsed": true 37 | }, 38 | "id": "initial_id", 39 | "execution_count": null 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "source": [ 44 | "## Basic RAG setup\n", 45 | "\n", 46 | "We will be using one of the open source embedding models to vectorize our document (actually, the snapshots we imported in the previous notebook were generated using the same model, so we need to use it for queries as well). 
OpenAI GPT will be our LLM, and it is the default model for LlamaIndex, so there is no need to configure it explicitly.\n", 47 | "\n", 48 | "The vector index, which will act as a fast retrieval layer, is the last missing piece to build our basic semantic search RAG. Qdrant will serve that purpose, as all the documents are already there." 49 | ], 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "id": "45e4dd797cbe142a" 54 | }, 55 | { 56 | "cell_type": "code", 57 | "outputs": [], 58 | "source": [ 59 | "from llama_index import ServiceContext\n", 60 | "\n", 61 | "service_context = ServiceContext.from_defaults(\n", 62 | " embed_model=\"local:BAAI/bge-large-en\"\n", 63 | ")" 64 | ], 65 | "metadata": { 66 | "collapsed": false 67 | }, 68 | "id": "ceecec79071db759", 69 | "execution_count": null 70 | }, 71 | { 72 | "cell_type": "code", 73 | "outputs": [], 74 | "source": [ 75 | "from qdrant_client import QdrantClient\n", 76 | "from llama_index.vector_stores.qdrant import QdrantVectorStore\n", 77 | "\n", 78 | "import os\n", 79 | "\n", 80 | "client = QdrantClient(\n", 81 | " os.environ.get(\"QDRANT_URL\"), \n", 82 | " api_key=os.environ.get(\"QDRANT_API_KEY\"),\n", 83 | ")\n", 84 | "vector_store = QdrantVectorStore(\n", 85 | " client=client, \n", 86 | " collection_name=\"hacker-news\"\n", 87 | ")" 88 | ], 89 | "metadata": { 90 | "collapsed": false 91 | }, 92 | "id": "ef2c13ee4a83a508", 93 | "execution_count": null 94 | }, 95 | { 96 | "cell_type": "code", 97 | "outputs": [], 98 | "source": [ 99 | "from llama_index import VectorStoreIndex\n", 100 | "\n", 101 | "index = VectorStoreIndex.from_vector_store(\n", 102 | " vector_store=vector_store,\n", 103 | " service_context=service_context,\n", 104 | ")" 105 | ], 106 | "metadata": { 107 | "collapsed": false 108 | }, 109 | "id": "2dca3d121903e207", 110 | "execution_count": null 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "source": [ 115 | "### Querying RAG\n", 116 | "\n", 117 | "LlamaIndex simplifies the querying process by providing a high-level API that abstracts the underlying complexity. We can use the `as_query_engine` method to create a query engine that will handle the entire process for us, with the default configuration." 118 | ], 119 | "metadata": { 120 | "collapsed": false 121 | }, 122 | "id": "c1f73a5b0023dae8" 123 | }, 124 | { 125 | "cell_type": "code", 126 | "outputs": [], 127 | "source": [ 128 | "query_engine = index.as_query_engine()\n", 129 | "response = query_engine.query(\"What is the best way to learn programming?\")\n", 130 | "print(response.response)" 131 | ], 132 | "metadata": { 133 | "collapsed": false 134 | }, 135 | "id": "3b10b5aec9a9aeb1", 136 | "execution_count": null 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "source": [ 141 | "Our RAG retrieves some possibly relevant documents by using the original prompt as a query, and then sends them as a part of the prompt to the LLM. It seems to be a good idea to check what were these documents, and if our LLM was not making up the answer using its internal knowledge." 
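
The next cell prints the raw text of each retrieved node. If you also want to see how strongly each node matched the query and which payload it carries, a slightly extended loop might look like the sketch below (it assumes the usual `NodeWithScore` objects returned by LlamaIndex, which expose a similarity score and the node metadata):

```python
# A more verbose inspection of the retrieved context: the similarity score plus
# the payload stored alongside each vector (e.g. the document type and date).
for i, node in enumerate(response.source_nodes):
    print(f"{i + 1}. score={node.get_score():.4f} metadata={node.metadata}")
    print(node.text, end="\n\n")
```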
142 | ], 143 | "metadata": { 144 | "collapsed": false 145 | }, 146 | "id": "f7ce6484a8eaac59" 147 | }, 148 | { 149 | "cell_type": "code", 150 | "outputs": [], 151 | "source": [ 152 | "for i, node in enumerate(response.source_nodes):\n", 153 | " print(i + 1, node.text, end=\"\\n\\n\")" 154 | ], 155 | "metadata": { 156 | "collapsed": false 157 | }, 158 | "id": "a62a2fb53edaa8c1", 159 | "execution_count": null 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "source": [ 164 | "The first tweak we can consider is to increase the number of documents fetched from our knowledge base (the default of LlamaIndex is just 2). We can do that by setting the `similarity_top_k` parameter of the `as_query_engine` method." 165 | ], 166 | "metadata": { 167 | "collapsed": false 168 | }, 169 | "id": "a392c560d622699d" 170 | }, 171 | { 172 | "cell_type": "code", 173 | "outputs": [], 174 | "source": [ 175 | "response = index \\\n", 176 | " .as_query_engine(similarity_top_k=5) \\\n", 177 | " .query(\"What is the best way to learn programming?\")\n", 178 | "print(response.response)" 179 | ], 180 | "metadata": { 181 | "collapsed": false 182 | }, 183 | "id": "e456cefe13dde93c", 184 | "execution_count": null 185 | }, 186 | { 187 | "cell_type": "code", 188 | "outputs": [], 189 | "source": [ 190 | "for i, node in enumerate(response.source_nodes):\n", 191 | " print(i + 1, node.text, end=\"\\n\\n\")" 192 | ], 193 | "metadata": { 194 | "collapsed": false 195 | }, 196 | "id": "c72f6f232be4b89", 197 | "execution_count": null 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "source": [ 202 | "## Customizing the RAG pipeline\n", 203 | "\n", 204 | "The defaults of LlamaIndex are a good starting point, but we can customize the pipeline to better fit our needs. That gives us more control over the behavior of the semantic search retriever or the way we interact with the LLM. LlamaIndex has pretty decent support for customizing the pipeline and there are three components that we need to set up:\n", 205 | "\n", 206 | "1. Retriever\n", 207 | "2. Response synthesizer\n", 208 | "3. Query engine" 209 | ], 210 | "metadata": { 211 | "collapsed": false 212 | }, 213 | "id": "a59904fb28cfe369" 214 | }, 215 | { 216 | "cell_type": "code", 217 | "outputs": [], 218 | "source": [ 219 | "from llama_index.query_engine import RetrieverQueryEngine\n", 220 | "from llama_index import get_response_synthesizer\n", 221 | "from llama_index.indices.vector_store import VectorIndexRetriever\n", 222 | "\n", 223 | "retriever = VectorIndexRetriever(\n", 224 | " index=index,\n", 225 | " similarity_top_k=5,\n", 226 | ")\n", 227 | "\n", 228 | "response_synthesizer = get_response_synthesizer()\n", 229 | "\n", 230 | "query_engine = RetrieverQueryEngine(\n", 231 | " retriever=retriever,\n", 232 | " response_synthesizer=response_synthesizer,\n", 233 | ")" 234 | ], 235 | "metadata": { 236 | "collapsed": false 237 | }, 238 | "id": "b73b609cef8b1916", 239 | "execution_count": null 240 | }, 241 | { 242 | "cell_type": "code", 243 | "outputs": [], 244 | "source": [ 245 | "response = query_engine.query(\"What is the best way to learn programming?\")\n", 246 | "print(response.response)" 247 | ], 248 | "metadata": { 249 | "collapsed": false 250 | }, 251 | "id": "59ff323f5dc019ed", 252 | "execution_count": null 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "source": [ 257 | "## Playing with response synthesizers\n", 258 | "\n", 259 | "Response synthesizers are responsible for interactions with the LLM. 
This is the component we want to control when it comes to prompts and the way we actually communicate with the language model. There are lots of parameters to tweak, and prompt engineering is a topic of its own. Thus, we won't dive deep into it here, but we can at least test out different response modes.\n", 260 | "\n", 261 | "The default one is `ResponseMode.COMPACT`, which combines retrieved text chunks into larger pieces to utilize the available context window. There are also plenty of other modes, and each may work best in a specific scenario. For example, some of the modes make a separate LLM call per retrieved text chunk, which may be beneficial in some cases, but also increases the cost of the pipeline.\n", 262 | "\n", 263 | "Let's just compare the previous response with the `ResponseMode.ACCUMULATE` and `ResponseMode.REFINE` modes. The first one creates a response for each chunk and then concatenates them, while the second one makes a separate LLM call for each chunk in an iterative manner, meaning each call uses the previous response as context." 264 | ], 265 | "metadata": { 266 | "collapsed": false 267 | }, 268 | "id": "98b489a71360ffa2" 269 | }, 270 | { 271 | "cell_type": "code", 272 | "outputs": [], 273 | "source": [ 274 | "from llama_index.response_synthesizers import ResponseMode\n", 275 | "\n", 276 | "accumulate_response_synthesizer = get_response_synthesizer(\n", 277 | " response_mode=ResponseMode.ACCUMULATE,\n", 278 | ")\n", 279 | "\n", 280 | "accumulate_query_engine = RetrieverQueryEngine(\n", 281 | " retriever=retriever,\n", 282 | " response_synthesizer=accumulate_response_synthesizer,\n", 283 | ")" 284 | ], 285 | "metadata": { 286 | "collapsed": false 287 | }, 288 | "id": "bb47addbdc849fcf", 289 | "execution_count": null 290 | }, 291 | { 292 | "cell_type": "code", 293 | "outputs": [], 294 | "source": [ 295 | "response = accumulate_query_engine.query(\"What is the best way to learn programming?\")\n", 296 | "print(response.response)" 297 | ], 298 | "metadata": { 299 | "collapsed": false 300 | }, 301 | "id": "d895eaf9f042fd22", 302 | "execution_count": null 303 | }, 304 | { 305 | "cell_type": "code", 306 | "outputs": [], 307 | "source": [ 308 | "refine_response_synthesizer = get_response_synthesizer(\n", 309 | " response_mode=ResponseMode.REFINE,\n", 310 | ")\n", 311 | "\n", 312 | "refine_query_engine = RetrieverQueryEngine(\n", 313 | " retriever=retriever,\n", 314 | " response_synthesizer=refine_response_synthesizer,\n", 315 | ")" 316 | ], 317 | "metadata": { 318 | "collapsed": false 319 | }, 320 | "id": "c82c017459bfb664", 321 | "execution_count": null 322 | }, 323 | { 324 | "cell_type": "code", 325 | "outputs": [], 326 | "source": [ 327 | "response = refine_query_engine.query(\"What is the best way to learn programming?\")\n", 328 | "print(response.response)" 329 | ], 330 | "metadata": { 331 | "collapsed": false 332 | }, 333 | "id": "e748b91db81b9336", 334 | "execution_count": null 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "source": [ 339 | "## Multitenancy\n", 340 | "\n", 341 | "Most real applications require some sort of data separation. If you collect data coming from different users or organizations, you probably don't want to mix them up in the answers. Quite a common mistake when using Qdrant is to create a separate collection for each tenant. Instead, you can use a metadata field to separate the data. This field should have a payload index created, so the filtering operations are fast. 
\n", 342 | "\n", 343 | "This is a Qdrant-specific feature, and the configuration is not done in LlamaIndex, but in Qdrant itself. However, we passed an instance of `QdrantClient` to the `QdrantVectorStore`, so we can use it to create a payload index for the metadata field.\n", 344 | "\n", 345 | "In our case, we can consider splitting the data by the type of the document. We have two types of documents in our collection: `story` and `comment`. We can use the `type` field to separate them." 346 | ], 347 | "metadata": { 348 | "collapsed": false 349 | }, 350 | "id": "235e641164884b02" 351 | }, 352 | { 353 | "cell_type": "code", 354 | "outputs": [], 355 | "source": [ 356 | "from qdrant_client import models\n", 357 | "\n", 358 | "client.create_payload_index(\n", 359 | " collection_name=\"hacker-news\",\n", 360 | " field_name=\"type\",\n", 361 | " field_schema=models.PayloadSchemaType.KEYWORD,\n", 362 | ")" 363 | ], 364 | "metadata": { 365 | "collapsed": false 366 | }, 367 | "id": "cb261d6fa91e727", 368 | "execution_count": null 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "source": [ 373 | "Using the newly created payload index, we can filter the documents by type. That's why we wanted to customize the pipeline, so we can add this filter to the retriever." 374 | ], 375 | "metadata": { 376 | "collapsed": false 377 | }, 378 | "id": "5077c72c40add0d3" 379 | }, 380 | { 381 | "cell_type": "code", 382 | "outputs": [], 383 | "source": [ 384 | "from llama_index.vector_stores import MetadataFilters, MetadataFilter\n", 385 | "\n", 386 | "filtering_retriever = VectorIndexRetriever(\n", 387 | " index=index,\n", 388 | " similarity_top_k=5,\n", 389 | " filters=MetadataFilters(\n", 390 | " filters=[\n", 391 | " MetadataFilter(key=\"type\", value=\"story\"),\n", 392 | " ]\n", 393 | " ),\n", 394 | ")\n", 395 | "\n", 396 | "filtering_query_engine = RetrieverQueryEngine(\n", 397 | " retriever=filtering_retriever,\n", 398 | " response_synthesizer=response_synthesizer,\n", 399 | ")" 400 | ], 401 | "metadata": { 402 | "collapsed": false 403 | }, 404 | "id": "448493cd6c4d1c32", 405 | "execution_count": null 406 | }, 407 | { 408 | "cell_type": "code", 409 | "outputs": [], 410 | "source": [ 411 | "response = filtering_query_engine.query(\"What is the best way to learn programming?\")\n", 412 | "print(response.response)" 413 | ], 414 | "metadata": { 415 | "collapsed": false 416 | }, 417 | "id": "b4e00178a2746d09", 418 | "execution_count": null 419 | }, 420 | { 421 | "cell_type": "code", 422 | "outputs": [], 423 | "source": [ 424 | "for i, node in enumerate(response.source_nodes):\n", 425 | " print(i + 1, node.text, end=\"\\n\\n\")" 426 | ], 427 | "metadata": { 428 | "collapsed": false 429 | }, 430 | "id": "3f05c4c83cf54647", 431 | "execution_count": null 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "source": [ 436 | "## Additional tweaks\n", 437 | "\n", 438 | "Some scenarios require different means than just semantic search. For example, if we want to prefer the most recent documents, none of the embedding models is going to capture it, since it is a cross-document relationship. 
LlamaIndex provides a way to add additional postprocessing, so we can include the additional constraints directly on the prefetched documents.\n" 439 | ], 440 | "metadata": { 441 | "collapsed": false 442 | }, 443 | "id": "88f35208132ff94e" 444 | }, 445 | { 446 | "cell_type": "code", 447 | "outputs": [], 448 | "source": [ 449 | "from llama_index.postprocessor import FixedRecencyPostprocessor\n", 450 | "\n", 451 | "prefetching_retriever = VectorIndexRetriever(\n", 452 | " index=index,\n", 453 | " similarity_top_k=25, # prefetch way more documents\n", 454 | " filters=MetadataFilters(\n", 455 | " filters=[\n", 456 | " MetadataFilter(key=\"type\", value=\"comment\"), # we want comments this time\n", 457 | " ]\n", 458 | " ),\n", 459 | ")\n", 460 | "\n", 461 | "recency_query_engine = RetrieverQueryEngine(\n", 462 | " retriever=prefetching_retriever,\n", 463 | " response_synthesizer=response_synthesizer,\n", 464 | " node_postprocessors=[\n", 465 | " FixedRecencyPostprocessor(\n", 466 | " service_context=service_context,\n", 467 | " date_key=\"date\", # date is the default key also, but make it explicit\n", 468 | " top_k=5, # leave just 20% of the prefetched documents\n", 469 | " )\n", 470 | " ]\n", 471 | ")" 472 | ], 473 | "metadata": { 474 | "collapsed": false 475 | }, 476 | "id": "924db74f9afcb13f", 477 | "execution_count": null 478 | }, 479 | { 480 | "cell_type": "code", 481 | "outputs": [], 482 | "source": [ 483 | "response = recency_query_engine.query(\"What is the best way to learn programming?\")\n", 484 | "print(response.response)" 485 | ], 486 | "metadata": { 487 | "collapsed": false 488 | }, 489 | "id": "36f4611d9c0d262a", 490 | "execution_count": null 491 | }, 492 | { 493 | "cell_type": "code", 494 | "outputs": [], 495 | "source": [ 496 | "for i, node in enumerate(response.source_nodes):\n", 497 | " print(i + 1, node.text, end=\"\\n\\n\")" 498 | ], 499 | "metadata": { 500 | "collapsed": false 501 | }, 502 | "id": "4c074064ccbfc715", 503 | "execution_count": null 504 | }, 505 | { 506 | "cell_type": "code", 507 | "outputs": [], 508 | "source": [ 509 | "from llama_index.postprocessor import EmbeddingRecencyPostprocessor\n", 510 | "\n", 511 | "embedding_recency_query_engine = RetrieverQueryEngine(\n", 512 | " retriever=prefetching_retriever,\n", 513 | " response_synthesizer=response_synthesizer,\n", 514 | " node_postprocessors=[\n", 515 | " EmbeddingRecencyPostprocessor(\n", 516 | " service_context=service_context,\n", 517 | " date_key=\"date\", # date is the default key\n", 518 | " similarity_cutoff=0.9,\n", 519 | " )\n", 520 | " ]\n", 521 | ")" 522 | ], 523 | "metadata": { 524 | "collapsed": false 525 | }, 526 | "id": "bf76ad82f0d270bb", 527 | "execution_count": null 528 | }, 529 | { 530 | "cell_type": "code", 531 | "outputs": [], 532 | "source": [ 533 | "response = embedding_recency_query_engine.query(\"What is the best way to learn programming?\")\n", 534 | "print(response.response)" 535 | ], 536 | "metadata": { 537 | "collapsed": false 538 | }, 539 | "id": "33ea0ba3f3d10b13", 540 | "execution_count": null 541 | }, 542 | { 543 | "cell_type": "code", 544 | "outputs": [], 545 | "source": [ 546 | "for i, node in enumerate(response.source_nodes):\n", 547 | " print(i + 1, node.text, end=\"\\n\\n\")" 548 | ], 549 | "metadata": { 550 | "collapsed": false 551 | }, 552 | "id": "3cbe86011e834e79", 553 | "execution_count": null 554 | }, 555 | { 556 | "cell_type": "code", 557 | "outputs": [], 558 | "source": [], 559 | "metadata": { 560 | "collapsed": false 561 | }, 562 | "id": "620de1ff4231ef30", 
563 | "execution_count": null 564 | } 565 | ], 566 | "metadata": { 567 | "kernelspec": { 568 | "display_name": "Python 3", 569 | "language": "python", 570 | "name": "python3" 571 | }, 572 | "language_info": { 573 | "codemirror_mode": { 574 | "name": "ipython", 575 | "version": 2 576 | }, 577 | "file_extension": ".py", 578 | "mimetype": "text/x-python", 579 | "name": "python", 580 | "nbconvert_exporter": "python", 581 | "pygments_lexer": "ipython2", 582 | "version": "2.7.6" 583 | } 584 | }, 585 | "nbformat": 4, 586 | "nbformat_minor": 5 587 | } 588 | -------------------------------------------------------------------------------- /notebooks/02-tweaking-semantic-search.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "# Tweaking up semantic retrieval\n", 7 | "\n", 8 | "There are various objectives we could try optimizing for when it comes to semantic retrieval. We could try to optimize the **speed** of the retrieval, the **quality** of it, or the **memory usage**. We'll review some of the techniques in all three areas." 9 | ], 10 | "metadata": { 11 | "collapsed": false 12 | }, 13 | "id": "5527397e89ce8f7f" 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "source": [ 18 | "## Loading the configuration and pipeline\n", 19 | "\n", 20 | "Again, let's start with loading the configuration, and then set up our retriever. We don't want a full RAG pipeline, as we are solely interested in the semantic search part. Improving a single component at a time should be easier to understand and debug. " 21 | ], 22 | "metadata": { 23 | "collapsed": false 24 | }, 25 | "id": "c97156bfd207c831" 26 | }, 27 | { 28 | "cell_type": "code", 29 | "outputs": [], 30 | "source": [ 31 | "from dotenv import load_dotenv\n", 32 | "\n", 33 | "load_dotenv()" 34 | ], 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | "id": "5400644c6fa94d96", 39 | "execution_count": null 40 | }, 41 | { 42 | "cell_type": "code", 43 | "outputs": [], 44 | "source": [ 45 | "from llama_index import ServiceContext\n", 46 | "\n", 47 | "service_context = ServiceContext.from_defaults(\n", 48 | " embed_model=\"local:BAAI/bge-large-en\"\n", 49 | ")" 50 | ], 51 | "metadata": { 52 | "collapsed": false 53 | }, 54 | "id": "c571d2c60524ef3d", 55 | "execution_count": null 56 | }, 57 | { 58 | "cell_type": "code", 59 | "outputs": [], 60 | "source": [ 61 | "from qdrant_client import QdrantClient\n", 62 | "from llama_index.vector_stores.qdrant import QdrantVectorStore\n", 63 | "\n", 64 | "import os\n", 65 | "\n", 66 | "client = QdrantClient(\n", 67 | " os.environ.get(\"QDRANT_URL\"), \n", 68 | " api_key=os.environ.get(\"QDRANT_API_KEY\"),\n", 69 | ")\n", 70 | "vector_store = QdrantVectorStore(\n", 71 | " client=client, \n", 72 | " collection_name=\"hacker-news\"\n", 73 | ")" 74 | ], 75 | "metadata": { 76 | "collapsed": false 77 | }, 78 | "id": "480e88c42e30d3cd", 79 | "execution_count": null 80 | }, 81 | { 82 | "cell_type": "code", 83 | "outputs": [], 84 | "source": [ 85 | "from llama_index import VectorStoreIndex\n", 86 | "\n", 87 | "index = VectorStoreIndex.from_vector_store(\n", 88 | " vector_store=vector_store,\n", 89 | " service_context=service_context,\n", 90 | ")" 91 | ], 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "id": "cb26026d9b2d8e0", 96 | "execution_count": null 97 | }, 98 | { 99 | "cell_type": "code", 100 | "outputs": [], 101 | "source": [ 102 | "from llama_index.vector_stores import MetadataFilters, MetadataFilter\n", 103 
| "from llama_index.indices.vector_store import VectorIndexRetriever\n", 104 | "\n", 105 | "retriever = VectorIndexRetriever(\n", 106 | " index=index,\n", 107 | " similarity_top_k=5,\n", 108 | " filters=MetadataFilters(\n", 109 | " filters=[\n", 110 | " MetadataFilter(key=\"type\", value=\"story\"),\n", 111 | " ]\n", 112 | " ),\n", 113 | ")" 114 | ], 115 | "metadata": { 116 | "collapsed": false 117 | }, 118 | "id": "a44471047edccc2a", 119 | "execution_count": null 120 | }, 121 | { 122 | "cell_type": "code", 123 | "outputs": [], 124 | "source": [ 125 | "nodes = retriever.retrieve(\"What is the best way to learn programming?\")\n", 126 | "for i, node in enumerate(nodes):\n", 127 | " print(i + 1, node.text, end=\"\\n\\n\")" 128 | ], 129 | "metadata": { 130 | "collapsed": false 131 | }, 132 | "id": "37fefadc69369516", 133 | "execution_count": null 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "source": [ 138 | "## Quality optimization\n", 139 | "\n", 140 | "We have implemented a basic RAG already, and we might be happy with the quality. There are a lot of aspects when it comes to measuring the quality of a semantic retrieval system, and we will not go into details here. It is usually related to the quality of the embedding model we use, and it is a topic for another day.\n", 141 | "\n", 142 | "However, all the vector databases approximate the nearest neighbor search, and this approximation comes with a cost. The cost is that the results are not always ideal. HNSW, an algorithm used in Qdrant, has some parameters to control how the internal structures are built, and these parameters can be tweaked to improve the quality of the results. This is very specific to the vector database used, thus it's configured through the Qdrant API." 143 | ], 144 | "metadata": { 145 | "collapsed": false 146 | }, 147 | "id": "1c8084e400f1ff37" 148 | }, 149 | { 150 | "cell_type": "code", 151 | "outputs": [], 152 | "source": [ 153 | "client.get_collection(collection_name=\"hacker-news\")" 154 | ], 155 | "metadata": { 156 | "collapsed": false 157 | }, 158 | "id": "18105b7cfb654cb6", 159 | "execution_count": null 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "source": [ 164 | "As for now, the most interesting part is the `hnsw_config` field. The algorithm itself is controlled by two parameters. The number of edges per node is called the `m` parameter. The larger the value, the higher the precision of the search, but the more space required. The `ef_construct` parameter is the number of neighbors to consider during the index building. Again, the larger the value, the higher the precision, but the longer the indexing time. \n", 165 | "\n", 166 | "Playing with both parameters **improves just the approximation of the exact nearest neighbors**, and a proper embedding model is still way more important. However, [this quality aspect might also be controlled, even in an automated way](https://qdrant.tech/documentation/tutorials/retrieval-quality/). For the time being, we'll simply increase both values, but won't measure the impact on the overall quality of search results." 
167 | ], 168 | "metadata": { 169 | "collapsed": false 170 | }, 171 | "id": "3a7eb35651edebd8" 172 | }, 173 | { 174 | "cell_type": "code", 175 | "outputs": [], 176 | "source": [ 177 | "from qdrant_client import models\n", 178 | "\n", 179 | "client.update_collection(\n", 180 | " collection_name=\"hacker-news\",\n", 181 | " hnsw_config=models.HnswConfigDiff(\n", 182 | " m=32,\n", 183 | " ef_construct=200,\n", 184 | " )\n", 185 | ")" 186 | ], 187 | "metadata": { 188 | "collapsed": false 189 | }, 190 | "id": "3668e7f04e38d00f", 191 | "execution_count": null 192 | }, 193 | { 194 | "cell_type": "code", 195 | "outputs": [], 196 | "source": [ 197 | "import time\n", 198 | "\n", 199 | "while True:\n", 200 | " collection = client.get_collection(\"hacker-news\")\n", 201 | " if collection.status == models.CollectionStatus.GREEN:\n", 202 | " break\n", 203 | " time.sleep(1.0)\n", 204 | " \n", 205 | "collection" 206 | ], 207 | "metadata": { 208 | "collapsed": false 209 | }, 210 | "id": "8c4d08884025a701", 211 | "execution_count": null 212 | }, 213 | { 214 | "cell_type": "code", 215 | "outputs": [], 216 | "source": [ 217 | "nodes = retriever.retrieve(\"What is the best way to learn programming?\")\n", 218 | "for i, node in enumerate(nodes):\n", 219 | " print(i + 1, node.text, end=\"\\n\\n\")" 220 | ], 221 | "metadata": { 222 | "collapsed": false 223 | }, 224 | "id": "8a2ede02bc5281ec", 225 | "execution_count": null 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "source": [ 230 | "## Memory optimization\n", 231 | "\n", 232 | "Each point in a Qdrant collection consists of up to three elements: id, vector(s), and optional payload represented by a JSON object. Vectors are indexed in an HNSW graph, and search operations may involve semantic similarity and some payload-based criteria (it's best to add payload indexes on the fields we want to use for the filtering). Ideally, all the elements should be kept in RAM so access is fast.\n", 233 | "\n", 234 | "Unfortunately, semantic search is a heavy operation in terms of memory requirements. However, some projects are implemented on a budget and can't afford machines with hundreds of gigabytes of RAM. Qdrant allows storing every single component on a disk to reduce memory usage, but that comes with a performance cost. Let's compare the efficiency of the operations with all the components in RAM and with some of them on disk." 
235 | ], 236 | "metadata": { 237 | "collapsed": false 238 | }, 239 | "id": "dd6b5c8fd6513bb5" 240 | }, 241 | { 242 | "cell_type": "code", 243 | "outputs": [], 244 | "source": [ 245 | "%%timeit -n 100 -r 5\n", 246 | "retriever.retrieve(\"What is the best way to learn programming?\")" 247 | ], 248 | "metadata": { 249 | "collapsed": false 250 | }, 251 | "id": "8d2159e4871b1dd6", 252 | "execution_count": null 253 | }, 254 | { 255 | "cell_type": "code", 256 | "outputs": [], 257 | "source": [ 258 | "client.update_collection(\n", 259 | " collection_name=\"hacker-news\",\n", 260 | " hnsw_config=models.HnswConfigDiff(\n", 261 | " on_disk=True,\n", 262 | " ),\n", 263 | " vectors_config={\n", 264 | " \"\": models.VectorParamsDiff(\n", 265 | " on_disk=True,\n", 266 | " )\n", 267 | " },\n", 268 | ")" 269 | ], 270 | "metadata": { 271 | "collapsed": false 272 | }, 273 | "id": "f1534ffc86ce1f7e", 274 | "execution_count": null 275 | }, 276 | { 277 | "cell_type": "code", 278 | "outputs": [], 279 | "source": [ 280 | "while True:\n", 281 | " collection = client.get_collection(\"hacker-news\")\n", 282 | " if collection.status == models.CollectionStatus.GREEN:\n", 283 | " break\n", 284 | " time.sleep(1.0)\n", 285 | " \n", 286 | "collection" 287 | ], 288 | "metadata": { 289 | "collapsed": false 290 | }, 291 | "id": "396074dc50565c7a", 292 | "execution_count": null 293 | }, 294 | { 295 | "cell_type": "code", 296 | "outputs": [], 297 | "source": [ 298 | "%%timeit -n 100 -r 5\n", 299 | "retriever.retrieve(\"What is the best way to learn programming?\")" 300 | ], 301 | "metadata": { 302 | "collapsed": false 303 | }, 304 | "id": "9f500d12db05fdc6", 305 | "execution_count": null 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "source": [ 310 | "## Speed optimization\n", 311 | "\n", 312 | "There are various ways of optimizing semantic search in terms of speed. The most straightforward one is to reduce both `m` and `ef_construct` parameters, as we did in the previous section. However, this comes with a cost of the quality of the results.\n", 313 | "\n", 314 | "Qdrant also provides a number of quantization techniques, and two of them are primarily used to increase speed and reduce memory at the same time:\n", 315 | "\n", 316 | "1. **Scalar Quantization** - uses `int8` instead of `float32` to store each vector dimension\n", 317 | "2. **Binary Quantization** - `bool` values are used to store each vector dimension\n", 318 | "\n", 319 | "The first one reduces the memory usage by up to 4x, while the second one by up to 32x and both increase the speed of the search. However, the quality of the search results is reduced, and Binary Quantization is not suitable for all the use cases. It only works with some specific models, usually the ones with high dimensionality." 320 | ], 321 | "metadata": { 322 | "collapsed": false 323 | }, 324 | "id": "8be9fc24e922f165" 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "source": [ 329 | "In our case, we're going to set up the binary quantization either way. From the LlamaIndex perspective, the search operations are going to be fired identically." 
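
The next cell enables binary quantization on the collection. When querying Qdrant directly, part of the quality loss can usually be recovered by oversampling and rescoring: Qdrant first fetches more candidates than requested using the compact binary vectors and then re-scores them with the original float32 vectors. A hedged sketch of such a query (again assuming a single unnamed dense vector and reusing the embedding model):

```python
from qdrant_client import models

# Sketch: search a binary-quantized collection with oversampling + rescoring.
query_vector = service_context.embed_model.get_query_embedding(
    "What is the best way to learn programming?"
)

rescored = client.search(
    collection_name="hacker-news",
    query_vector=query_vector,
    limit=5,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,      # use the quantized vectors for the first pass
            rescore=True,      # re-rank the candidates with the original vectors
            oversampling=2.0,  # fetch 2x more candidates before rescoring
        )
    ),
)
for point in rescored:
    print(point.id, point.score)
```

LlamaIndex does not need to know about any of this; it is purely a Qdrant-side knob for direct client queries.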
330 | ], 331 | "metadata": { 332 | "collapsed": false 333 | }, 334 | "id": "c1154f4172094cdf" 335 | }, 336 | { 337 | "cell_type": "code", 338 | "outputs": [], 339 | "source": [ 340 | "client.update_collection(\n", 341 | " collection_name=\"hacker-news\",\n", 342 | " quantization_config=models.BinaryQuantization(\n", 343 | " binary=models.BinaryQuantizationConfig(\n", 344 | " always_ram=True,\n", 345 | " )\n", 346 | " )\n", 347 | ")" 348 | ], 349 | "metadata": { 350 | "collapsed": false 351 | }, 352 | "id": "3e26ec00449bfb27", 353 | "execution_count": null 354 | }, 355 | { 356 | "cell_type": "code", 357 | "outputs": [], 358 | "source": [ 359 | "while True:\n", 360 | " collection = client.get_collection(\"hacker-news\")\n", 361 | " if collection.status == models.CollectionStatus.GREEN:\n", 362 | " break\n", 363 | " time.sleep(1.0)\n", 364 | " \n", 365 | "collection" 366 | ], 367 | "metadata": { 368 | "collapsed": false 369 | }, 370 | "id": "a01cc56dff3ea252", 371 | "execution_count": null 372 | }, 373 | { 374 | "cell_type": "code", 375 | "outputs": [], 376 | "source": [ 377 | "nodes = retriever.retrieve(\"What is the best way to learn programming?\")\n", 378 | "for i, node in enumerate(nodes):\n", 379 | " print(i + 1, node.text, end=\"\\n\\n\")" 380 | ], 381 | "metadata": { 382 | "collapsed": false 383 | }, 384 | "id": "6f4d07f587535769", 385 | "execution_count": null 386 | }, 387 | { 388 | "cell_type": "code", 389 | "outputs": [], 390 | "source": [ 391 | "%%timeit -n 100 -r 5\n", 392 | "retriever.retrieve(\"What is the best way to learn programming?\")" 393 | ], 394 | "metadata": { 395 | "collapsed": false 396 | }, 397 | "id": "408270c8449987c9", 398 | "execution_count": null 399 | }, 400 | { 401 | "cell_type": "code", 402 | "outputs": [], 403 | "source": [], 404 | "metadata": { 405 | "collapsed": false 406 | }, 407 | "id": "536dbd5e38b33efd", 408 | "execution_count": null 409 | } 410 | ], 411 | "metadata": { 412 | "kernelspec": { 413 | "display_name": "Python 3", 414 | "language": "python", 415 | "name": "python3" 416 | }, 417 | "language_info": { 418 | "codemirror_mode": { 419 | "name": "ipython", 420 | "version": 2 421 | }, 422 | "file_extension": ".py", 423 | "mimetype": "text/x-python", 424 | "name": "python", 425 | "nbconvert_exporter": "python", 426 | "pygments_lexer": "ipython2", 427 | "version": "2.7.6" 428 | } 429 | }, 430 | "nbformat": 4, 431 | "nbformat_minor": 5 432 | } 433 | -------------------------------------------------------------------------------- /notebooks/03-hybrid-search.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "# Hybrid Search: dense and sparse vectors\n", 7 | "\n", 8 | "LlamaIndex integration with Qdrant supports sparse embeddings as well. From the user perspective, it doesn't change much, as they interact through the same interface. Since sparse and dense vectors work best in different setups, it makes sense to combine them if we want to have the best of both worlds. 
There are, however, some parameters we can control.\n", 9 | "\n", 10 | "Let's again start with recreating our pipeline, but this time we will use the other collection that has sparse vectors as well.\n" 11 | ], 12 | "metadata": { 13 | "collapsed": false 14 | }, 15 | "id": "b3b7af9ec8fca084" 16 | }, 17 | { 18 | "cell_type": "code", 19 | "outputs": [], 20 | "source": [ 21 | "from dotenv import load_dotenv\n", 22 | "\n", 23 | "load_dotenv()" 24 | ], 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "id": "660547c536bbfa43", 29 | "execution_count": null 30 | }, 31 | { 32 | "cell_type": "code", 33 | "outputs": [], 34 | "source": [ 35 | "from llama_index import ServiceContext\n", 36 | "\n", 37 | "service_context = ServiceContext.from_defaults(\n", 38 | " embed_model=\"local:BAAI/bge-large-en\"\n", 39 | ")" 40 | ], 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "id": "44f2eda9a0c435d8", 45 | "execution_count": null 46 | }, 47 | { 48 | "cell_type": "code", 49 | "outputs": [], 50 | "source": [ 51 | "from qdrant_client import QdrantClient\n", 52 | "from llama_index.vector_stores.qdrant import QdrantVectorStore\n", 53 | "\n", 54 | "import os\n", 55 | "\n", 56 | "client = QdrantClient(\n", 57 | " os.environ.get(\"QDRANT_URL\"), \n", 58 | " api_key=os.environ.get(\"QDRANT_API_KEY\"),\n", 59 | ")\n", 60 | "vector_store_hybrid = QdrantVectorStore(\n", 61 | " client=client,\n", 62 | " collection_name=\"hacker-news-hybrid\",\n", 63 | " enable_hybrid=True,\n", 64 | " batch_size=20, # this is important for the ingestion\n", 65 | ")" 66 | ], 67 | "metadata": { 68 | "collapsed": false 69 | }, 70 | "id": "31fc9049413d2075", 71 | "execution_count": null 72 | }, 73 | { 74 | "cell_type": "code", 75 | "outputs": [], 76 | "source": [ 77 | "from llama_index import VectorStoreIndex\n", 78 | "\n", 79 | "index = VectorStoreIndex.from_vector_store(\n", 80 | " vector_store=vector_store_hybrid,\n", 81 | " service_context=service_context,\n", 82 | ")" 83 | ], 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "id": "acdcf928f564b071", 88 | "execution_count": null 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "source": [ 93 | "## Differences between sparse and dense vectors\n", 94 | "\n", 95 | "Sparse vectors are usually used in high-dimensional spaces, where the majority of the elements are zero. A single dimension represents a single word, so the dimensionality of the space is equal to the size of the vocabulary, with just a few non-zero values. \n", 96 | "\n", 97 | "There are various ways to create sparse vectors, but the most common one is to use the TF-IDF or BM25 representation. It's a simple and effective way to represent the importance of words in a document and in many cases create a solid baseline for the search.\n", 98 | "\n", 99 | "LlamaIndex uses SPLADE by default, which is based on transformers, similar to dense embedding models. **The main advantage of using sparse vectors is that they overcome the problem of vocabulary mismatch**. If a word is not present in the vocabulary of the dense embedding model, we can still represent it using the sparse vectors." 100 | ], 101 | "metadata": { 102 | "collapsed": false 103 | }, 104 | "id": "f1aa7483e19c702" 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "source": [ 109 | "## Using sparse vectors only\n", 110 | "\n", 111 | "Before we dive into the hybrid search, let's see what might be achieved by using sparse vectors alone. 
We already know the nodes retrieved with dense vectors, so it makes sense to compare the results returned by both methods." 112 | ], 113 | "metadata": { 114 | "collapsed": false 115 | }, 116 | "id": "dbd696863d4c0144" 117 | }, 118 | { 119 | "cell_type": "code", 120 | "outputs": [], 121 | "source": [ 122 | "from llama_index.vector_stores.types import VectorStoreQueryMode\n", 123 | "from llama_index.indices.vector_store import VectorIndexRetriever\n", 124 | "\n", 125 | "sparse_retriever = VectorIndexRetriever(\n", 126 | " index=index,\n", 127 | " vector_store_query_mode=VectorStoreQueryMode.SPARSE,\n", 128 | " sparse_top_k=5,\n", 129 | ")" 130 | ], 131 | "metadata": { 132 | "collapsed": false 133 | }, 134 | "id": "b724350a8f0a9fbd", 135 | "execution_count": null 136 | }, 137 | { 138 | "cell_type": "code", 139 | "outputs": [], 140 | "source": [ 141 | "nodes = sparse_retriever.retrieve(\"What is the best way to learn programming?\")\n", 142 | "for i, node in enumerate(nodes):\n", 143 | " print(i + 1, node.text, end=\"\\n\\n\")" 144 | ], 145 | "metadata": { 146 | "collapsed": false 147 | }, 148 | "id": "4762c0e9b8fb0231", 149 | "execution_count": null 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "source": [ 154 | "## Hybrid search\n", 155 | "\n", 156 | "There are some specific use cases in which we may prefer to use just the sparse vectors. However, both methods can complement each other, and we usually need to find the sweet spot. The `VectorIndexRetriever` class allows us to control the parameters of the search. We can set the `sparse_top_k` and `similarity_top_k` parameters to control the number of results returned by each method. We can also set the `alpha` parameter to control the relative importance of each method (`0.0` = sparse, `1.0` = dense vectors only)."
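
The retriever defined in the next cell sets all three parameters. Conceptually, the `alpha` weighting boils down to normalizing the two score lists and blending them with a single weight. The toy function below illustrates that idea only; it is not the exact formula LlamaIndex uses internally, and the names are made up for the example:

```python
# Toy illustration of alpha-weighted fusion of dense and sparse scores.
def fuse_scores(dense: dict, sparse: dict, alpha: float) -> dict:
    """Blend two {doc_id: score} mappings; alpha=1.0 keeps dense scores only."""
    def normalize(scores: dict) -> dict:
        if not scores:
            return {}
        low, high = min(scores.values()), max(scores.values())
        span = (high - low) or 1.0  # avoid division by zero
        return {doc_id: (score - low) / span for doc_id, score in scores.items()}

    dense_n, sparse_n = normalize(dense), normalize(sparse)
    return {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }

# Document "a" wins on the dense side, "b" on the sparse side; with alpha=0.1
# the sparse scores dominate the fused ranking.
fused = fuse_scores({"a": 0.9, "b": 0.2}, {"a": 1.0, "b": 7.5}, alpha=0.1)
print(sorted(fused.items(), key=lambda item: item[1], reverse=True))
```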
157 | ], 158 | "metadata": { 159 | "collapsed": false 160 | }, 161 | "id": "4d05f74509f13f9c" 162 | }, 163 | { 164 | "cell_type": "code", 165 | "outputs": [], 166 | "source": [ 167 | "hybrid_retriever = VectorIndexRetriever(\n", 168 | " index=index,\n", 169 | " vector_store_query_mode=VectorStoreQueryMode.HYBRID,\n", 170 | " sparse_top_k=5,\n", 171 | " similarity_top_k=5,\n", 172 | " alpha=0.1,\n", 173 | ")" 174 | ], 175 | "metadata": { 176 | "collapsed": false 177 | }, 178 | "id": "8aa7191e7ad214de", 179 | "execution_count": null 180 | }, 181 | { 182 | "cell_type": "code", 183 | "outputs": [], 184 | "source": [ 185 | "nodes = hybrid_retriever.retrieve(\"What is the best way to learn programming?\")\n", 186 | "for i, node in enumerate(nodes):\n", 187 | " print(i + 1, node.text, end=\"\\n\\n\")" 188 | ], 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "id": "ccbf546068405a4d", 193 | "execution_count": null 194 | }, 195 | { 196 | "cell_type": "code", 197 | "outputs": [], 198 | "source": [ 199 | "# We shouldn't be modifying the alpha parameter after the retriever has been created\n", 200 | "# but that's the easiest way to show the effect of the parameter\n", 201 | "hybrid_retriever._alpha = 0.9\n", 202 | "\n", 203 | "nodes = hybrid_retriever.retrieve(\"What is the best way to learn programming?\")\n", 204 | "for i, node in enumerate(nodes):\n", 205 | " print(i + 1, node.text, end=\"\\n\\n\")" 206 | ], 207 | "metadata": { 208 | "collapsed": false 209 | }, 210 | "id": "1290259a827c3f77", 211 | "execution_count": null 212 | }, 213 | { 214 | "cell_type": "code", 215 | "outputs": [], 216 | "source": [], 217 | "metadata": { 218 | "collapsed": false 219 | }, 220 | "id": "65af4ef3d86d61ac", 221 | "execution_count": null 222 | } 223 | ], 224 | "metadata": { 225 | "kernelspec": { 226 | "display_name": "Python 3", 227 | "language": "python", 228 | "name": "python3" 229 | }, 230 | "language_info": { 231 | "codemirror_mode": { 232 | "name": "ipython", 233 | "version": 2 234 | }, 235 | "file_extension": ".py", 236 | "mimetype": "text/x-python", 237 | "name": "python", 238 | "nbconvert_exporter": "python", 239 | "pygments_lexer": "ipython2", 240 | "version": "2.7.6" 241 | } 242 | }, 243 | "nbformat": 4, 244 | "nbformat_minor": 5 245 | } 246 | -------------------------------------------------------------------------------- /notebooks/env.example: -------------------------------------------------------------------------------- 1 | QDRANT_URL="https://xyx.com:6333" 2 | QDRANT_API_KEY="your-qdrant-api-key" 3 | 4 | OPENAI_API_KEY="your-openai-key" 5 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "hacker-news-workshop" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["Kacper Łukawski "] 6 | readme = "README.md" 7 | 8 | [tool.poetry.dependencies] 9 | python = "^3.10" 10 | llama-index = "^0.9.46" 11 | qdrant-client = "^1.7.3" 12 | transformers = "^4.37.2" 13 | torch = "^2.2.0" 14 | jupyter = "^1.0.0" 15 | python-dotenv = "^1.0.1" 16 | wget = "^3.2" 17 | pyarrow = "^15.0.0" 18 | 19 | 20 | [build-system] 21 | requires = ["poetry-core"] 22 | build-backend = "poetry.core.masonry.api" 23 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | aiohttp==3.9.3 2 | aiosignal==1.3.1 3 | annotated-types==0.6.0 4 | 
anyio==4.2.0 5 | argon2-cffi==23.1.0 6 | argon2-cffi-bindings==21.2.0 7 | arrow==1.3.0 8 | asttokens==2.4.1 9 | async-lru==2.0.4 10 | async-timeout==4.0.3 11 | attrs==23.2.0 12 | Babel==2.14.0 13 | beautifulsoup4==4.12.3 14 | bleach==6.1.0 15 | certifi==2024.2.2 16 | cffi==1.16.0 17 | charset-normalizer==3.3.2 18 | click==8.1.7 19 | comm==0.2.1 20 | dataclasses-json==0.6.4 21 | debugpy==1.8.1 22 | decorator==5.1.1 23 | defusedxml==0.7.1 24 | Deprecated==1.2.14 25 | dirtyjson==1.0.8 26 | distro==1.9.0 27 | exceptiongroup==1.2.0 28 | executing==2.0.1 29 | fastjsonschema==2.19.1 30 | filelock==3.13.1 31 | fqdn==1.5.1 32 | frozenlist==1.4.1 33 | fsspec==2024.2.0 34 | greenlet==3.0.3 35 | grpcio==1.60.1 36 | grpcio-tools==1.60.1 37 | h11==0.14.0 38 | h2==4.1.0 39 | hpack==4.0.0 40 | httpcore==1.0.3 41 | httpx==0.26.0 42 | huggingface-hub==0.20.3 43 | hyperframe==6.0.1 44 | idna==3.6 45 | ipykernel==6.29.2 46 | ipython==8.21.0 47 | ipywidgets==8.1.2 48 | isoduration==20.11.0 49 | jedi==0.19.1 50 | Jinja2==3.1.3 51 | joblib==1.3.2 52 | json5==0.9.14 53 | jsonpointer==2.4 54 | jsonschema==4.21.1 55 | jsonschema-specifications==2023.12.1 56 | jupyter==1.0.0 57 | jupyter-console==6.6.3 58 | jupyter-events==0.9.0 59 | jupyter-lsp==2.2.2 60 | jupyter_client==8.6.0 61 | jupyter_core==5.7.1 62 | jupyter_server==2.12.5 63 | jupyter_server_terminals==0.5.2 64 | jupyterlab==4.1.1 65 | jupyterlab_pygments==0.3.0 66 | jupyterlab_server==2.25.3 67 | jupyterlab_widgets==3.0.10 68 | llama-index==0.9.46 69 | MarkupSafe==2.1.5 70 | marshmallow==3.20.2 71 | matplotlib-inline==0.1.6 72 | mistune==3.0.2 73 | mpmath==1.3.0 74 | multidict==6.0.5 75 | mypy-extensions==1.0.0 76 | nbclient==0.9.0 77 | nbconvert==7.16.0 78 | nbformat==5.9.2 79 | nest-asyncio==1.6.0 80 | networkx==3.2.1 81 | nltk==3.8.1 82 | notebook==7.1.0 83 | notebook_shim==0.2.4 84 | numpy==1.26.4 85 | nvidia-cublas-cu12==12.1.3.1 86 | nvidia-cuda-cupti-cu12==12.1.105 87 | nvidia-cuda-nvrtc-cu12==12.1.105 88 | nvidia-cuda-runtime-cu12==12.1.105 89 | nvidia-cudnn-cu12==8.9.2.26 90 | nvidia-cufft-cu12==11.0.2.54 91 | nvidia-curand-cu12==10.3.2.106 92 | nvidia-cusolver-cu12==11.4.5.107 93 | nvidia-cusparse-cu12==12.1.0.106 94 | nvidia-nccl-cu12==2.19.3 95 | nvidia-nvjitlink-cu12==12.3.101 96 | nvidia-nvtx-cu12==12.1.105 97 | openai==1.12.0 98 | overrides==7.7.0 99 | packaging==23.2 100 | pandas==2.2.0 101 | pandocfilters==1.5.1 102 | parso==0.8.3 103 | pexpect==4.9.0 104 | pillow==10.2.0 105 | platformdirs==4.2.0 106 | portalocker==2.8.2 107 | prometheus_client==0.20.0 108 | prompt-toolkit==3.0.43 109 | protobuf==4.25.3 110 | psutil==5.9.8 111 | ptyprocess==0.7.0 112 | pure-eval==0.2.2 113 | pyarrow==15.0.0 114 | pycparser==2.21 115 | pydantic==2.6.1 116 | pydantic_core==2.16.2 117 | Pygments==2.17.2 118 | python-dateutil==2.8.2 119 | python-dotenv==1.0.1 120 | python-json-logger==2.0.7 121 | pytz==2024.1 122 | PyYAML==6.0.1 123 | pyzmq==25.1.2 124 | qdrant-client==1.7.3 125 | qtconsole==5.5.1 126 | QtPy==2.4.1 127 | referencing==0.33.0 128 | regex==2023.12.25 129 | requests==2.31.0 130 | rfc3339-validator==0.1.4 131 | rfc3986-validator==0.1.1 132 | rpds-py==0.18.0 133 | safetensors==0.4.2 134 | scikit-learn==1.4.1.post1 135 | scipy==1.12.0 136 | Send2Trash==1.8.2 137 | sentence-transformers==2.3.1 138 | sentencepiece==0.1.99 139 | six==1.16.0 140 | sniffio==1.3.0 141 | soupsieve==2.5 142 | SQLAlchemy==2.0.27 143 | stack-data==0.6.3 144 | sympy==1.12 145 | tenacity==8.2.3 146 | terminado==0.18.0 147 | threadpoolctl==3.3.0 148 | tiktoken==0.6.0 149 | 
tinycss2==1.2.1 150 | tokenizers==0.15.2 151 | tomli==2.0.1 152 | torch==2.2.0 153 | tornado==6.4 154 | tqdm==4.66.2 155 | traitlets==5.14.1 156 | transformers==4.37.2 157 | triton==2.2.0 158 | types-python-dateutil==2.8.19.20240106 159 | typing-inspect==0.9.0 160 | typing_extensions==4.9.0 161 | tzdata==2024.1 162 | uri-template==1.3.0 163 | urllib3==2.2.0 164 | wcwidth==0.2.13 165 | webcolors==1.13 166 | webencodings==0.5.1 167 | websocket-client==1.7.0 168 | wget==3.2 169 | widgetsnbextension==4.0.10 170 | wrapt==1.16.0 171 | yarl==1.9.4 172 | --------------------------------------------------------------------------------