├── LICENSE
├── README.md
└── sqlite-hybrid-search.ipynb

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024 Liam Cavanagh
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Hybrid Search using SQLite
2 |
3 | This is a simple Python-based demonstration of how to perform hybrid (vector and full-text) search using [sqlite-vec](https://github.com/asg017/sqlite-vec) and SQLite's [full-text search](https://www.sqlite.org/fts5.html) (FTS5) implementation. Although hybrid-enabled search services such as [Azure AI Search](https://azure.microsoft.com/products/ai-services/ai-search) have far greater capabilities, there are times when it is not viable to host content in the cloud, and in those cases SQLite is a great choice.
4 |
5 | *NOTE:* At this time, sqlite-vec is still under development and breaking changes are expected. Please keep this in mind when using this demonstration.
6 |
7 | The demonstration provided here uses an in-memory SQLite database; however, SQLite also allows the database to be persisted to disk. It is also important to note that sqlite-vec currently supports vector search only through full table scans, as opposed to techniques such as approximate nearest neighbor (ANN) search, although other options are under development. Regardless, the speed is impressive even with large numbers of vectors, and because a full scan compares the query against every stored vector, it returns exact nearest neighbors rather than the approximate results of ANN. In addition, sqlite-vec supports pre-filtering of rows to reduce the search time.
8 |
9 | Although this notebook provides only a Python demonstration, other languages are supported.
10 |
11 | [Notebook Demo](https://github.com/liamca/sqlite-hybrid-search/blob/main/sqlite-hybrid-search.ipynb)
12 |
13 | ## Why Hybrid Search?
14 |
15 | Hybrid search leverages the strengths of both vector search and keyword search. Vector search excels at identifying information that is conceptually similar to the search query, even when there are no direct keyword matches in the inverted index. Keyword (full-text) search, on the other hand, offers precision and allows for semantic ranking, which enhances the quality of the initial results. Certain situations, such as searching for product codes, specialized jargon, dates, and people's names, may benefit more from keyword search because of its ability to find exact matches.
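
The keyword side of this trade-off can be illustrated with a minimal sketch. It assumes the SQLite library bundled with your Python build includes FTS5, and it indexes three of the sample sentences from the notebook; the table name `docs_fts` and the helper `keyword_ids` are made up for this illustration and are not part of the notebook.

```python
import sqlite3

# Assumption: the SQLite build behind Python's sqlite3 module includes FTS5.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE VIRTUAL TABLE docs_fts USING fts5("
    "id UNINDEXED, content, tokenize='porter unicode61')"
)
db.executemany(
    "INSERT INTO docs_fts (id, content) VALUES (?, ?)",
    [
        (2, "Artificial intelligence is transforming the world."),
        (6, "Electric vehicles are becoming more popular."),
        (8, "Healthcare innovation is critical for societal well-being."),
    ],
)


def keyword_ids(query):
    """Return the ids of rows matching an FTS5 query, best match first."""
    rows = db.execute(
        "SELECT id FROM docs_fts WHERE docs_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


# "innovation" literally occurs in row 8, so keyword search can find it.
exact_hits = keyword_ids("innovation")

# No row contains the term "medical", so keyword search has nothing to rank,
# even though the healthcare sentence is conceptually related.
concept_hits = keyword_ids("medical")

print("innovation ->", exact_hits)
print("medical    ->", concept_hits)
```

A query such as "medical" is where the vector side of a hybrid search is expected to help: an embedding-based search can still rank the healthcare sentence highly, while the keyword side contributes precise hits for terms that do appear verbatim.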
16 |
17 | ## How Hybrid Search Works
18 |
19 | Although there are numerous approaches for combining the scores from vector similarity and BM25-based full-text search, this demonstration uses Reciprocal Rank Fusion (RRF).
20 | RRF is grounded in the idea of reciprocal rank, which refers to the inverse of the rank of the first relevant document in a set of search results. Each document receives a score of 1 / (k + rank) from every result list in which it appears (ranks start at 1, and k defaults to 60 in the code below), and these scores are summed across the lists. The technique thus considers the positions of items in the original rankings and assigns greater importance to items that appear higher across multiple lists. This approach enhances the overall quality and reliability of the final ranking, making it more effective for combining multiple ordered search results.
21 |
22 | Here is the code used to perform RRF:
23 |
24 | ```python
25 | def reciprocal_rank_fusion(fts_results, vec_results, k=60):
26 |     rank_dict = {}
27 |
28 |     # Process FTS results
29 |     for rank, (id,) in enumerate(fts_results):
30 |         if id not in rank_dict:
31 |             rank_dict[id] = 0
32 |         rank_dict[id] += 1 / (k + rank + 1)
33 |
34 |     # Process vector results
35 |     for rank, (rowid, distance) in enumerate(vec_results):
36 |         if rowid not in rank_dict:
37 |             rank_dict[rowid] = 0
38 |         rank_dict[rowid] += 1 / (k + rank + 1)
39 |
40 |     # Sort by RRF score
41 |     sorted_results = sorted(rank_dict.items(), key=lambda x: x[1], reverse=True)
42 |     return sorted_results
43 | ```
44 |
--------------------------------------------------------------------------------
/sqlite-hybrid-search.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "code",
5 |    "execution_count": null,
6 |    "id": "555d7d57-91de-45f5-aac4-db021e7c23ea",
7 |    "metadata": {},
8 |    "outputs": [],
9 |    "source": [
10 |     "# This is an example of performing Hybrid search using sqlite-vec and FTS"
11 |    ]
12 |   },
13 |   {
14 |    "cell_type": "code",
15 |    "execution_count": null,
16 |    "id": "44a66748-a152-43a3-9a5a-2e726bc8a01c",
17 |    "metadata": {},
18 |    "outputs": [],
19 |    "source": [
20 |     "# !pip install --upgrade pip\n",
21 |     "# !pip install sqlite-vec\n",
22 |     "# !pip install pandas\n",
23 |     "# !pip install openai\n"
24 |    ]
25 |   },
26 |   {
27 |    "cell_type": "code",
28 |    "execution_count": 1,
29 |    "id": "5c1a669a-c807-41fa-96d3-67f2e53bbc8c",
30 |    "metadata": {},
31 |    "outputs": [],
32 |    "source": [
33 |     "import sqlite3\n",
34 |     "import sqlite_vec\n",
35 |     "from typing import List\n",
36 |     "import time\n",
37 |     "import pandas as pd\n",
38 |     "from openai import AzureOpenAI, OpenAIError \n",
39 |     "import openai \n",
40 |     "import json\n",
41 |     "import numpy as np\n"
42 |    ]
43 |   },
44 |   {
45 |    "cell_type": "code",
46 |    "execution_count": 2,
47 |    "id": "a4a98670-b065-4e12-a077-223ecb521c0b",
48 |    "metadata": {},
49 |    "outputs": [],
50 |    "source": [
51 |     "top_k = 10\n",
52 |     "openai_embedding_api_base = \"https://.openai.azure.com/\"\n",
53 |     "openai_embedding_api_key = \"\"\n",
54 |     "openai_embedding_api_version = \"2024-02-15-preview\"\n",
55 |     "openai_embedding_model = \"text-embedding-ada-002\""
56 |    ]
57 |   },
58 |   {
59 |    "cell_type": "code",
60 |    "execution_count": 3,
61 |    "id": "aa9b0d4c-5517-46ff-a745-9aa1726ccf51",
62 |    "metadata": {},
63 |    "outputs": [],
64 |    "source": [
65 |     "# Function to serialize float32 list to binary format compatible with sqlite-vec \n",
66 |     "def serialize_f32(vec): \n",
67 |     "    return np.array(vec, dtype=np.float32).tobytes() \n",
68 |     "\n",
69 |     "def reciprocal_rank_fusion(fts_results, vec_results, k=60): \n",
70 |     "    rank_dict = {} \n",
71 |     " \n",
72 |     "    # Process FTS results \n",
73 |     "    for rank, (id,) in enumerate(fts_results): \n",
74 |     "        if id not in rank_dict: \n",
75 |     "            rank_dict[id] = 0 \n",
76 |     "        rank_dict[id] += 1 / (k + rank + 1) \n",
77 |     " \n",
78 |     "    # Process vector results \n",
79 |     "    for rank, (rowid, distance) in enumerate(vec_results): \n",
80 |     "        if rowid not in rank_dict: \n",
81 |     "            rank_dict[rowid] = 0 \n",
82 |     "        rank_dict[rowid] += 1 / (k + rank + 1) \n",
83 |     " \n",
84 |     "    # Sort by RRF score \n",
85 |     "    sorted_results = sorted(rank_dict.items(), key=lambda x: x[1], reverse=True) \n",
86 |     "    return sorted_results \n",
87 |     " \n",
88 |     "def or_words(input_string): \n",
89 |     "    # Split the input string into words \n",
90 |     "    words = input_string.split() \n",
91 |     " \n",
92 |     "    # Join the words with ' OR ' in between \n",
93 |     "    result = ' OR '.join(words) \n",
94 |     " \n",
95 |     "    return result\n",
96 |     "\n",
97 |     "def lookup_row(id):\n",
98 |     "    row_lookup = cur.execute(''' \n",
99 |     "        SELECT content FROM mango_lookup WHERE id = ?\n",
100 |     "    ''', (id,)).fetchall() \n",
101 |     "    content = ''\n",
102 |     "    for row in row_lookup:\n",
103 |     "        content= row[0]\n",
104 |     "        break\n",
105 |     "    return content\n",
106 |     "\n",
107 |     "# Function to generate vectors for text \n",
108 |     "def generate_embedding(text): \n",
109 |     "    max_attempts = 6 \n",
110 |     "    max_backoff = 60 \n",
111 |     "    if text is None: \n",
112 |     "        return None \n",
113 |     " \n",
114 |     "    client = AzureOpenAI( \n",
115 |     "        api_version=openai_embedding_api_version, \n",
116 |     "        azure_endpoint=openai_embedding_api_base, \n",
117 |     "        api_key=openai_embedding_api_key \n",
118 |     "    ) \n",
119 |     " \n",
120 |     "    counter = 0 \n",
121 |     "    incremental_backoff = 1 # seconds to wait on throttling - this will be incremental backoff \n",
122 |     "    while counter < max_attempts: \n",
123 |     "        try: \n",
124 |     "            response = client.embeddings.create( \n",
125 |     "                input=text, \n",
126 |     "                model=openai_embedding_model \n",
127 |     "            ) \n",
128 |     "            return json.loads(response.model_dump_json())[\"data\"][0]['embedding'] \n",
129 |     "        except OpenAIError as ex: \n",
130 |     "            if str(ex.code) == \"429\": \n",
131 |     "                print('OpenAI Throttling Error - Waiting to retry after', incremental_backoff, 'seconds...') \n",
132 |     "                incremental_backoff = min(max_backoff, incremental_backoff * 1.5) \n",
133 |     "                counter += 1 \n",
134 |     "                time.sleep(incremental_backoff) \n",
135 |     "            elif str(ex.code) == \"DeploymentNotFound\": \n",
136 |     "                print('Error: Deployment not found') \n",
137 |     "                return 'Error: Deployment not found' \n",
138 |     "            elif 'Error code: 40' in str(ex): \n",
139 |     "                print('Error: ' + str(ex)) \n",
140 |     "                return 'Error: ' + str(ex) \n",
141 |     "            elif 'Connection error' in str(ex): \n",
142 |     "                print('Error: Connection error') \n",
143 |     "                return 'Error: Connection error' \n",
144 |     "            elif str(ex.code) == \"content_filter\": \n",
145 |     "                print('Content Filter Error', ex.code) \n",
146 |     "                return \"Error: Content could not be extracted due to Azure OpenAI content filter.\" + ex.code \n",
147 |     "            else: \n",
148 |     "                print('API Error:', ex) \n",
149 |     "                print('API Error Code:', ex.code) \n",
150 |     "                incremental_backoff = min(max_backoff, incremental_backoff * 1.5) \n",
151 |     "                counter += 1 \n",
152 |     "                time.sleep(incremental_backoff) \n",
153 |     "        except Exception as ex: \n",
154 |     "            counter += 1 \n",
155 |     "            print('Error - Retry count:', counter, ex) \n",
156 |     "    return None "
157 |    ]
158 |   },
159 |   {
160 |    "cell_type": "code",
161 |    "execution_count": 4,
162 |    "id": "2f5d60e3-f030-4403-a2e1-9727919753c6",
163 |    "metadata": {},
164 |    "outputs": [
165 |     {
166 |      "name": "stdout",
167 |      "output_type": "stream",
168 |      "text": [
169 |       "sqlite_version=3.40.1, vec_version=v0.1.1\n"
170 |      ]
171 |     }
172 |    ],
173 |    "source": [
174 |     "# Create an in-memory SQLite db\n",
175 |     "db = sqlite3.connect(\":memory:\")\n",
176 |     "db.enable_load_extension(True)\n",
177 |     "sqlite_vec.load(db)\n",
178 |     "db.enable_load_extension(False)\n",
179 |     "\n",
180 |     "sqlite_version, vec_version = db.execute(\n",
181 |     "    \"select sqlite_version(), vec_version()\"\n",
182 |     ").fetchone()\n",
183 |     "print(f\"sqlite_version={sqlite_version}, vec_version={vec_version}\")\n"
184 |    ]
185 |   },
186 |   {
187 |    "cell_type": "code",
188 |    "execution_count": 5,
189 |    "id": "85c86294-6699-4de7-b1b2-5d3ff9473a3f",
190 |    "metadata": {},
191 |    "outputs": [
192 |     {
193 |      "name": "stdout",
194 |      "output_type": "stream",
195 |      "text": [
196 |       "Dims in Vector Embeddings: 1536\n"
197 |      ]
198 |     }
199 |    ],
200 |    "source": [
201 |     "test_vec = generate_embedding('The quick brown fox')\n",
202 |     "dims = len(test_vec)\n",
203 |     "print('Dims in Vector Embeddings:', dims)"
204 |    ]
205 |   },
206 |   {
207 |    "cell_type": "code",
208 |    "execution_count": 6,
209 |    "id": "c6ddf350-058d-4317-bbc8-b99845ebaf4a",
210 |    "metadata": {},
211 |    "outputs": [
212 |     {
213 |      "data": {
214 |       "text/plain": [
215 |        ""
216 |       ]
217 |      },
218 |      "execution_count": 6,
219 |      "metadata": {},
220 |      "output_type": "execute_result"
221 |     }
222 |    ],
223 |    "source": [
224 |     "cur = db.cursor()\n",
225 |     "cur.execute('CREATE VIRTUAL TABLE mango_fts USING fts5(id UNINDEXED, content, tokenize=\"porter unicode61\");')\n",
226 |     "\n",
227 |     "# sqlite-vec (vec0) tables always include a rowid column, used here as the ID\n",
228 |     "cur.execute('''CREATE VIRTUAL TABLE mango_vec USING vec0(embedding float[''' + str(dims) + '])''') \n",
229 |     "\n",
230 |     "# Create a content lookup table with an index on the ID \n",
231 |     "cur.execute('CREATE TABLE mango_lookup (id INTEGER PRIMARY KEY, content TEXT);') "
232 |    ]
233 |   },
234 |   {
235 |    "cell_type": "code",
236 |    "execution_count": 7,
237 |    "id": "7b7e4eca-43c8-43c0-a3d3-952f2305ad46",
238 |    "metadata": {},
239 |    "outputs": [],
240 |    "source": [
241 |     "# Insert some sample data into mango_fts \n",
242 |     "fts_data = [ \n",
243 |     "    (1, 'The quick brown fox jumps over the lazy dog.'), \n",
244 |     "    (2, 'Artificial intelligence is transforming the world.'), \n",
245 |     "    (3, 'Climate change is a pressing global issue.'), \n",
246 |     "    (4, 'The stock market fluctuates based on various factors.'), \n",
247 |     "    (5, 'Remote work has become more prevalent during the pandemic.'), \n",
248 |     "    (6, 'Electric vehicles are becoming more popular.'), \n",
249 |     "    (7, 'Quantum computing has the potential to revolutionize technology.'), \n",
250 |     "    (8, 'Healthcare innovation is critical for societal well-being.'), \n",
251 |     "    (9, 'Space exploration expands our understanding of the universe.'), \n",
252 |     "    (10, 'Cybersecurity threats are evolving and becoming more sophisticated.') \n",
253 |     "] \n",
254 |     " \n",
255 |     "cur.executemany(''' \n",
256 |     "    INSERT INTO mango_fts (id, content) VALUES (?, ?) \n",
257 |     "''', fts_data) \n",
258 |     "\n",
259 |     "\n",
260 |     "cur.executemany(''' \n",
261 |     "    INSERT INTO mango_lookup (id, content) VALUES (?, ?) \n",
262 |     "''', fts_data) \n",
263 |     " \n",
264 |     "\n",
265 |     "# Generate embeddings for the content and insert into mango_vec \n",
266 |     "for row in fts_data: \n",
267 |     "    id, content = row \n",
268 |     "    embedding = generate_embedding(content)\n",
269 |     "    cur.execute(''' \n",
270 |     "        INSERT INTO mango_vec (rowid, embedding) VALUES (?, ?) \n",
271 |     "    ''', (id, serialize_f32(embedding))) \n",
272 |     "\n",
273 |     "\n",
274 |     "# Commit changes \n",
275 |     "db.commit() "
276 |    ]
277 |   },
278 |   {
279 |    "cell_type": "code",
280 |    "execution_count": 10,
281 |    "id": "7ed8a288-9e8f-4524-808c-f94d6aaa807d",
282 |    "metadata": {},
283 |    "outputs": [
284 |     {
285 |      "name": "stdout",
286 |      "output_type": "stream",
287 |      "text": [
288 |       "ID: 8, Content: Healthcare innovation is critical for societal well-being., RRF Score: 0.01639344262295082\n",
289 |       "ID: 2, Content: Artificial intelligence is transforming the world., RRF Score: 0.016129032258064516\n",
290 |       "ID: 9, Content: Space exploration expands our understanding of the universe., RRF Score: 0.015873015873015872\n",
291 |       "ID: 1, Content: The quick brown fox jumps over the lazy dog., RRF Score: 0.015625\n",
292 |       "ID: 5, Content: Remote work has become more prevalent during the pandemic., RRF Score: 0.015384615384615385\n",
293 |       "ID: 10, Content: Cybersecurity threats are evolving and becoming more sophisticated., RRF Score: 0.015151515151515152\n",
294 |       "ID: 7, Content: Quantum computing has the potential to revolutionize technology., RRF Score: 0.014925373134328358\n",
295 |       "ID: 6, Content: Electric vehicles are becoming more popular., RRF Score: 0.014705882352941176\n",
296 |       "ID: 3, Content: Climate change is a pressing global issue., RRF Score: 0.014492753623188406\n",
297 |       "ID: 4, Content: The stock market fluctuates based on various factors., RRF Score: 0.014285714285714285\n"
298 |      ]
299 |     }
300 |    ],
301 |    "source": [
302 |     "# Full-text search query \n",
303 |     "# fts_search_query = \"AI\" \n",
304 |     "# fts_search_query = \"technology innovation\" \n",
305 |     "# fts_search_query = \"electricity cars\" \n",
306 |     "fts_search_query = \"medical\" \n",
307 |     "\n",
308 |     "fts_results = cur.execute(''' \n",
309 |     "    SELECT id FROM mango_fts WHERE mango_fts MATCH ? ORDER BY rank limit 5 \n",
310 |     "''', (or_words(fts_search_query),)).fetchall() \n",
311 |     " \n",
312 |     "# Vector search query \n",
313 |     "query_embedding = generate_embedding(fts_search_query) \n",
314 |     "vec_results = cur.execute(''' \n",
315 |     "    SELECT rowid, distance FROM mango_vec WHERE embedding MATCH ? and K = ? \n",
316 |     "    ORDER BY distance \n",
317 |     "''', [serialize_f32(query_embedding), top_k]).fetchall() \n",
318 |     " \n",
319 |     "# Combine results using RRF \n",
320 |     "combined_results = reciprocal_rank_fusion(fts_results, vec_results) \n",
321 |     " \n",
322 |     "# Print combined results \n",
323 |     "for id, score in combined_results: \n",
324 |     "    print(f'ID: {id}, Content: {lookup_row(id)}, RRF Score: {score}') "
325 |    ]
326 |   },
327 |   {
328 |    "cell_type": "code",
329 |    "execution_count": null,
330 |    "id": "3dd42c82-d7ab-4a60-a1dc-1deae4d3f815",
331 |    "metadata": {},
332 |    "outputs": [],
333 |    "source": [
334 |     "# Close the connection \n",
335 |     "db.close() \n"
336 |    ]
337 |   },
338 |   {
339 |    "cell_type": "code",
340 |    "execution_count": null,
341 |    "id": "cc0e5b5f-b593-4cf7-a53e-cf02a41539f5",
342 |    "metadata": {},
343 |    "outputs": [],
344 |    "source": []
345 |   }
346 |  ],
347 |  "metadata": {
348 |   "kernelspec": {
349 |    "display_name": "Python 3 (ipykernel)",
350 |    "language": "python",
351 |    "name": "python3"
352 |   },
353 |   "language_info": {
354 |    "codemirror_mode": {
355 |     "name": "ipython",
356 |     "version": 3
357 |    },
358 |    "file_extension": ".py",
359 |    "mimetype": "text/x-python",
360 |    "name": "python",
361 |    "nbconvert_exporter": "python",
362 |    "pygments_lexer": "ipython3",
363 |    "version": "3.11.9"
364 |   }
365 |  },
366 |  "nbformat": 4,
367 |  "nbformat_minor": 5
368 | }
369 |
--------------------------------------------------------------------------------