├── LICENSE ├── RAG_Experimentation_Framework_final.ipynb └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 
61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 
179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /RAG_Experimentation_Framework_final.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "machine_shape": "hm", 8 | "gpuType": "L4", 9 | "authorship_tag": "ABX9TyPKWVIVeEtb8/A7NP4NfWMT", 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | }, 16 | "language_info": { 17 | "name": "python" 18 | }, 19 | "accelerator": "GPU" 20 | }, 21 | "cells": [ 22 | { 23 | "cell_type": "markdown", 24 | "metadata": { 25 | "id": "view-in-github", 26 | "colab_type": "text" 27 | }, 28 | "source": [ 29 | "\"Open" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "source": [ 35 | "If you use this code, please cite:\n", 36 | "\n", 37 | "{\n", 38 | " title = {RAG Experimentation Framework},\n", 39 | "\n", 40 | " author = {Bill Leece},\n", 41 | "\n", 42 | " year = {2024}\n", 43 | "}" 44 | ], 45 | "metadata": { 46 | "id": "wZ0kV_UtQn5O" 47 | } 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "source": [ 52 | "#Setup" 53 | ], 54 | "metadata": { 55 | "id": "_lHNBLR-92Zk" 56 | } 57 | }, 58 | { 59 | "cell_type": "code", 60 | "source": [ 61 | "!pip install -U transformers --quiet\n", 62 | "#!pip install -U optimum --quiet\n", 63 | "!pip install -U accelerate --quiet\n", 64 | "!pip install -U bitsandbytes --quiet\n", 65 | "!pip install -U torch --quiet\n", 66 | "!pip install -U sentencepiece --quiet\n", 67 | "!pip install -U llama-index --quiet\n", 68 | "!pip install -U llama-index-llms-mistralai --quiet\n", 69 | "!pip install -U llama-index-embeddings-mistralai --quiet\n", 70 | "!pip install -U llama-index-llms-langchain --quiet\n", 71 | "!pip install -U langchain --quiet\n", 72 | "!pip install -U langchain-community --quiet\n", 73 | "!pip install -U langchain-mistralai --quiet\n", 74 | "!pip install -U langchain_huggingface --quiet\n", 75 | "!pip install -U faiss-gpu --quiet" 76 | ], 77 | "metadata": { 78 | "id": "4g_Vs7wgZW-8" 79 | }, 80 | "execution_count": null, 81 | "outputs": [] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "source": [ 86 | "import os\n", 87 | "import json\n", 88 | "import numpy as np\n", 89 | "import faiss\n", 90 | "import transformers\n", 91 
| "import torch\n", 92 | "import gc\n", 93 | "import openai\n", 94 | "import json\n", 95 | "import tiktoken\n", 96 | "import textwrap\n", 97 | "import time\n", 98 | "from google.colab import drive, userdata\n", 99 | "from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\n", 100 | "from langchain.prompts import PromptTemplate\n", 101 | "from langchain_huggingface import HuggingFacePipeline\n", 102 | "from langchain_core.output_parsers import StrOutputParser\n", 103 | "from langchain_mistralai.chat_models import ChatMistralAI\n", 104 | "from llama_index.embeddings.mistralai import MistralAIEmbedding\n", 105 | "from llama_index.core import SimpleDirectoryReader, Settings\n", 106 | "from llama_index.core.node_parser import SemanticSplitterNodeParser\n", 107 | "import time\n", 108 | "from typing import List, Dict, Tuple\n", 109 | "from contextlib import contextmanager\n", 110 | "from langchain.schema.runnable import RunnableSequence\n", 111 | "from langchain.schema.output_parser import StrOutputParser\n", 112 | "from langchain_text_splitters.markdown import MarkdownHeaderTextSplitter\n", 113 | "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", 114 | "from datetime import datetime\n", 115 | "from typing import Dict, List, Any" 116 | ], 117 | "metadata": { 118 | "id": "Ao7eaSfq-TKs" 119 | }, 120 | "execution_count": null, 121 | "outputs": [] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "source": [ 126 | "os.environ[\"HF_TOKEN\"] = userdata.get('HF_TOKEN')\n", 127 | "os.environ[\"MISTRAL_API_KEY\"] = userdata.get('MISTRAL_API_KEY')\n", 128 | "api_key = userdata.get('OPENAI_API_KEY')" 129 | ], 130 | "metadata": { 131 | "id": "YvGHY024-OXK" 132 | }, 133 | "execution_count": null, 134 | "outputs": [] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "source": [ 139 | "device = 'cuda' if torch.cuda.is_available() else 'cpu' #Use GPUs when possible" 140 | ], 141 | "metadata": { 142 | "id": "mxAHV7T_-Xlh" 143 | }, 144 | "execution_count": null, 145 | "outputs": [] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "source": [ 150 | "#Experiment Configurations" 151 | ], 152 | "metadata": { 153 | "id": "jkqEV8M_HUKG" 154 | } 155 | }, 156 | { 157 | "cell_type": "code", 158 | "source": [ 159 | "# Setup configurations\n", 160 | "MODEL_CONFIGS = {\n", 161 | " \"models\": [\n", 162 | " # {\n", 163 | " # \"name\": \"open-mixtral-8x7b\",\n", 164 | " # \"type\": \"mistral_api\",\n", 165 | " # \"tokenizer\": None, # Not needed for API models\n", 166 | " # },\n", 167 | "\n", 168 | " {\n", 169 | " \"name\": \"mistral-large-latest\",\n", 170 | " \"type\": \"mistral_api\",\n", 171 | " \"tokenizer\": None, # Not needed for API models\n", 172 | " },\n", 173 | "\n", 174 | " {\n", 175 | " \"name\": \"open-mistral-nemo\",\n", 176 | " \"type\": \"mistral_api\",\n", 177 | " \"tokenizer\": None, # Not needed for API models\n", 178 | " },\n", 179 | "# {\n", 180 | "# \"name\": \"ministral-8b-latest\",\n", 181 | "# \"type\": \"mistral_api\",\n", 182 | "# \"tokenizer\": None, # Not needed for API models\n", 183 | "# },\n", 184 | " # {\n", 185 | " # \"name\": \"meta-llama/Llama-3.1-8B-Instruct\",\n", 186 | " # \"type\": \"huggingface\",\n", 187 | " # \"tokenizer\": \"meta-llama/Llama-3.1-8B-Instruct\"\n", 188 | " # },\n", 189 | "\n", 190 | " # {\n", 191 | " # \"name\": \"wjleece/quantized-mistral-7b\",\n", 192 | " # \"type\": \"huggingface\",\n", 193 | " # \"tokenizer\": \"mistralai/Mixtral-8x7B-v0.1\", # The same tokenizer that works on the base model will work on the quantized model 
- there is no 'quantized tokenizer'\n", 194 | " # \"quantization_config\": { #Quantization config left here as a reference, but not used in the code (as we're using an already quantized model from HuggingFace)\n", 195 | " # \"load_in_4bit\": True,\n", 196 | " # \"bnb_4bit_compute_dtype\": \"float16\",\n", 197 | " # \"bnb_4bit_quant_type\": \"nf4\",\n", 198 | " # \"bnb_4bit_use_double_quant\": False\n", 199 | " # }\n", 200 | " # },\n", 201 | " {\n", 202 | " \"name\": \"wjleece/quantized-mistral-nemo-12b\",\n", 203 | " \"type\": \"huggingface\",\n", 204 | " \"tokenizer\": \"mistralai/Mistral-Nemo-Instruct-2407\", # The same tokenizer that works on the base model will work on the quantized model - there is no 'quantized tokenizer'\n", 205 | " \"quantization_config\": { #Quantization config left here as a reference, but not used in the code (as we're using an already quantized model from HuggingFace)\n", 206 | " \"load_in_4bit\": True,\n", 207 | " \"bnb_4bit_compute_dtype\": \"float16\",\n", 208 | " \"bnb_4bit_quant_type\": \"nf4\",\n", 209 | " \"bnb_4bit_use_double_quant\": False\n", 210 | " }\n", 211 | " },\n", 212 | " # {\n", 213 | " # \"name\": \"wjleece/quantized-mistral-8b\",\n", 214 | " # \"type\": \"huggingface\",\n", 215 | " # \"tokenizer\": \"mistralai/Ministral-8B-Instruct-2410\", # The same tokenizer that works on the base model will work on the quantized model - there is no 'quantized tokenizer'\n", 216 | " # \"quantization_config\": { #Quantization config left here as a reference, but not used in the code (as we're using an already quantized model from HuggingFace)\n", 217 | " # \"load_in_4bit\": True,\n", 218 | " # \"bnb_4bit_compute_dtype\": \"float16\",\n", 219 | " # \"bnb_4bit_quant_type\": \"nf4\",\n", 220 | " # \"bnb_4bit_use_double_quant\": False\n", 221 | " # }\n", 222 | " # },\n", 223 | " # {\n", 224 | " # \"name\": \"wjleece/quantized-llama-3.1-8b\",\n", 225 | " # \"type\": \"huggingface\",\n", 226 | " # \"tokenizer\": \"meta-llama/Llama-3.1-8B-Instruct\", # The same tokenizer that works on the base model will work on the quantized model - there is no 'quantized tokenizer'\n", 227 | " # \"quantization_config\": { #Quantization config left here as a reference, but not used in the code (as we're using an already quantized model from HuggingFace)\n", 228 | " # \"load_in_4bit\": True,\n", 229 | " # \"bnb_4bit_compute_dtype\": \"float16\",\n", 230 | " # \"bnb_4bit_quant_type\": \"nf4\",\n", 231 | " # \"bnb_4bit_use_double_quant\": False\n", 232 | " # }\n", 233 | " # }\n", 234 | " ]\n", 235 | "}\n", 236 | "\n", 237 | "\n", 238 | "CHUNKING_CONFIGS = {\n", 239 | " \"strategies\": [\"paragraph\", \"header\"],\n", 240 | " \"semantic_config\": {\n", 241 | " \"enabled\": True,\n", 242 | " \"thresholds\": [85, 95] if True else []\n", 243 | " },\n", 244 | " \"max_chunk_size\": 2048,\n", 245 | " \"chunk_overlap\": 100,\n", 246 | " \"min_chunk_size\": 35 #we'll ignore any chunk ~5 words or less\n", 247 | "}\n", 248 | "\n", 249 | "QUESTION_CONFIGS = {\n", 250 | " \"questions\": [\n", 251 | " \"What were cloud revenues in the most recent quarter?\",\n", 252 | " \"What were the main drivers of revenue growth in the most recent quarter?\",\n", 253 | " \"How much did YouTube ad revenues grow in the most recent quarter in APAC?\",\n", 254 | " \"Can you summarize recent key antitrust matters?\",\n", 255 | " \"Compare the revenue growth across all geographic regions and explain the main factors for each region.\",\n", 256 | " \"Summarize all mentioned risk factors related to international 
operations.\",\n", 257 | " \"What were the major changes in operating expenses across all categories and their stated reasons?\",\n", 258 | " ] #These quetsions should relate to the RAG document --> these are your 'business use cases'\n", 259 | "}\n", 260 | "\n", 261 | "FILE_CONFIGS = {\n", 262 | " \"save_directory\": '/content/drive/My Drive/AI/Model_Analysis'\n", 263 | "}" 264 | ], 265 | "metadata": { 266 | "id": "YDjgk_JhHWkj" 267 | }, 268 | "execution_count": null, 269 | "outputs": [] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "source": [ 274 | "#Load RAG Document" 275 | ], 276 | "metadata": { 277 | "id": "e_wxgOGc95sf" 278 | } 279 | }, 280 | { 281 | "cell_type": "code", 282 | "source": [ 283 | "drive.mount('/content/drive')\n", 284 | "documents = SimpleDirectoryReader(input_files=[\"/content/drive/My Drive/AI/Datasets/Google-10-q/goog-10-q-q3-2024.pdf\"]).load_data()" 285 | ], 286 | "metadata": { 287 | "id": "gS4Lemk09v9Y" 288 | }, 289 | "execution_count": null, 290 | "outputs": [] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "source": [ 295 | "#RAG Pipeline Class" 296 | ], 297 | "metadata": { 298 | "id": "MgLxma5M-bZA" 299 | } 300 | }, 301 | { 302 | "cell_type": "code", 303 | "source": [ 304 | "# Global singleton instance\n", 305 | "_GLOBAL_RAG_PIPELINE = None\n", 306 | "\n", 307 | "class RAGPipeline:\n", 308 | " def __init__(self):\n", 309 | " self.chunk_cache = {}\n", 310 | " self.embedding_cache = {}\n", 311 | " self.embedding_model = None\n", 312 | "\n", 313 | " @classmethod\n", 314 | " def get_instance(cls):\n", 315 | " \"\"\"Get or create singleton instance\"\"\"\n", 316 | " global _GLOBAL_RAG_PIPELINE\n", 317 | " if _GLOBAL_RAG_PIPELINE is None:\n", 318 | " _GLOBAL_RAG_PIPELINE = cls()\n", 319 | " return _GLOBAL_RAG_PIPELINE\n", 320 | "\n", 321 | "\n", 322 | " def initialize_embedding_model(self):\n", 323 | " \"\"\"Initialize the embedding model if not already initialized\"\"\"\n", 324 | " if self.embedding_model is None:\n", 325 | " mistral_api_key = userdata.get('MISTRAL_API_KEY')\n", 326 | " self.embedding_model = MistralAIEmbedding(\n", 327 | " model_name=\"mistral-embed\",\n", 328 | " api_key=mistral_api_key\n", 329 | " )\n", 330 | " return self.embedding_model\n", 331 | "\n", 332 | " def convert_to_markdown_headers(self, text):\n", 333 | " \"\"\"Convert document section titles to markdown headers\"\"\"\n", 334 | " import re\n", 335 | "\n", 336 | " patterns = [\n", 337 | " (r'^(?:ITEM|Section)\\s+\\d+[.:]\\s*(.+)$', '# '),\n", 338 | " (r'^\\d+\\.\\d+\\s+(.+)$', '## '),\n", 339 | " (r'^\\([a-z]\\)\\s+(.+)$', '### ')\n", 340 | " ]\n", 341 | "\n", 342 | " lines = text.split('\\n')\n", 343 | " markdown_lines = []\n", 344 | "\n", 345 | " for line in lines:\n", 346 | " line = line.strip()\n", 347 | " converted = False\n", 348 | "\n", 349 | " for pattern, header_mark in patterns:\n", 350 | " if re.match(pattern, line, re.IGNORECASE):\n", 351 | " markdown_lines.append(f\"{header_mark}{line}\")\n", 352 | " converted = True\n", 353 | " break\n", 354 | "\n", 355 | " if not converted:\n", 356 | " markdown_lines.append(line)\n", 357 | "\n", 358 | " return '\\n'.join(markdown_lines)\n", 359 | "\n", 360 | "\n", 361 | " def create_chunks(self, documents: List, threshold: int, chunk_strategy: str = \"semantic\") -> Dict:\n", 362 | " \"\"\"Create or retrieve chunks based on specified strategy\"\"\"\n", 363 | "\n", 364 | " MAX_CHUNK_SIZE = CHUNKING_CONFIGS['max_chunk_size']\n", 365 | " CHUNK_OVERLAP = CHUNKING_CONFIGS['chunk_overlap']\n", 366 | " MIN_CHUNK_SIZE = 
CHUNKING_CONFIGS['min_chunk_size']\n", 367 | "\n", 368 | "\n", 369 | " if chunk_strategy == \"semantic\":\n", 370 | " cache_key = f\"{chunk_strategy}_{threshold}\"\n", 371 | " print(f\"Using semantic cache key: {cache_key} with threshold: {threshold}\")\n", 372 | " else:\n", 373 | " cache_key = f\"{chunk_strategy}_{MAX_CHUNK_SIZE}\"\n", 374 | " print(f\"Using non-semantic cache key: {cache_key}\")\n", 375 | "\n", 376 | "\n", 377 | " if cache_key not in self.chunk_cache:\n", 378 | " print(\"\\nStarting new chunk creation:\")\n", 379 | " texts = []\n", 380 | "\n", 381 | " try:\n", 382 | " if chunk_strategy == \"semantic\":\n", 383 | " print(\"Processing semantic chunking...\")\n", 384 | " if self.embedding_model is None:\n", 385 | " print(\"Initializing embedding model\")\n", 386 | " self.initialize_embedding_model()\n", 387 | "\n", 388 | " splitter = SemanticSplitterNodeParser(\n", 389 | " buffer_size=1,\n", 390 | " breakpoint_percentile_threshold=threshold,\n", 391 | " embed_model=self.embedding_model\n", 392 | " )\n", 393 | " nodes = splitter.get_nodes_from_documents(documents)\n", 394 | " texts = [node.text for node in nodes]\n", 395 | " print(f\"Generated {len(texts)} semantic chunks\")\n", 396 | "\n", 397 | " elif chunk_strategy == \"paragraph\":\n", 398 | " print(\"Processing paragraph chunking...\")\n", 399 | " text_splitter = RecursiveCharacterTextSplitter(\n", 400 | " separators=[\"\\n\\n\", \"\\n\", \". \", \" \", \"\"],\n", 401 | " chunk_size=MAX_CHUNK_SIZE,\n", 402 | " chunk_overlap=CHUNK_OVERLAP,\n", 403 | " length_function=len\n", 404 | " )\n", 405 | "\n", 406 | " for idx, doc in enumerate(documents):\n", 407 | " print(f\"\\nProcessing document {idx + 1}/{len(documents)}\")\n", 408 | " print(f\"Document length: {len(doc.text)} characters\")\n", 409 | " doc_chunks = text_splitter.split_text(doc.text)\n", 410 | " print(f\"Initial chunks from document: {len(doc_chunks)}\")\n", 411 | " if doc_chunks:\n", 412 | " print(f\"Sample chunk lengths: {[len(c) for c in doc_chunks[:3]]}\")\n", 413 | " texts.extend(doc_chunks)\n", 414 | "\n", 415 | " elif chunk_strategy == \"header\":\n", 416 | " print(\"Processing header chunking...\")\n", 417 | " text_splitter = RecursiveCharacterTextSplitter(\n", 418 | " separators=[\"\\n\\n\", \"\\n\", \". \", \" \", \"\"],\n", 419 | " chunk_size=MAX_CHUNK_SIZE,\n", 420 | " chunk_overlap=CHUNK_OVERLAP,\n", 421 | " length_function=len\n", 422 | " )\n", 423 | "\n", 424 | " for idx, doc in enumerate(documents):\n", 425 | " print(f\"\\nProcessing document {idx + 1}/{len(documents)}\")\n", 426 | " md_text = self.convert_to_markdown_headers(doc.text)\n", 427 | " print(\"Headers identified. 
First 100 chars of markdown text:\")\n", 428 | " print(md_text[:100] + \"...\")\n", 429 | "\n", 430 | " headers_to_split_on = [\n", 431 | " (\"#\", \"Header 1\"),\n", 432 | " (\"##\", \"Header 2\"),\n", 433 | " (\"###\", \"Header 3\"),\n", 434 | " ]\n", 435 | "\n", 436 | " header_splitter = MarkdownHeaderTextSplitter(\n", 437 | " headers_to_split_on=headers_to_split_on\n", 438 | " )\n", 439 | "\n", 440 | " splits = header_splitter.split_text(md_text)\n", 441 | " print(f\"Generated {len(splits)} header sections\")\n", 442 | " if splits:\n", 443 | " print(\"Sample section lengths:\", [len(s.page_content) for s in splits[:3]])\n", 444 | "\n", 445 | " for split in splits:\n", 446 | " if len(split.page_content) > MAX_CHUNK_SIZE:\n", 447 | " print(f\"Splitting large section: {len(split.page_content)} chars\")\n", 448 | " subsections = text_splitter.split_text(split.page_content)\n", 449 | " print(f\"Created {len(subsections)} subsections\")\n", 450 | " texts.extend(subsections)\n", 451 | " else:\n", 452 | " texts.append(split.page_content)\n", 453 | "\n", 454 | " print(\"\\nCleaning and filtering chunks...\")\n", 455 | " initial_count = len(texts)\n", 456 | " cleaned_texts = []\n", 457 | " for idx, text in enumerate(texts):\n", 458 | " if not isinstance(text, str):\n", 459 | " print(f\"Warning: Non-string chunk found at index {idx}\")\n", 460 | " continue\n", 461 | "\n", 462 | " cleaned_text = text.strip()\n", 463 | " if len(cleaned_text) >= MIN_CHUNK_SIZE:\n", 464 | " cleaned_texts.append(cleaned_text)\n", 465 | " else:\n", 466 | " print(f\"Filtered out small chunk: {len(cleaned_text)} chars\")\n", 467 | "\n", 468 | " texts = cleaned_texts\n", 469 | " print(f\"Chunks after cleaning: {len(texts)} (removed {initial_count - len(texts)})\")\n", 470 | "\n", 471 | " if not texts:\n", 472 | " print(\"WARNING: No valid chunks generated!\")\n", 473 | " return {\n", 474 | " 'texts': [],\n", 475 | " 'strategy': chunk_strategy,\n", 476 | " 'chunk_stats': {\n", 477 | " 'num_chunks': 0,\n", 478 | " 'avg_chunk_size': 0,\n", 479 | " 'min_chunk_size': 0,\n", 480 | " 'max_chunk_size': 0\n", 481 | " }\n", 482 | " }\n", 483 | "\n", 484 | " # Calculate chunk statistics\n", 485 | " chunk_lengths = [len(t) for t in texts]\n", 486 | " chunk_stats = {\n", 487 | " 'num_chunks': len(texts),\n", 488 | " 'avg_chunk_size': sum(chunk_lengths)/len(texts),\n", 489 | " 'min_chunk_size': min(chunk_lengths),\n", 490 | " 'max_chunk_size': max(chunk_lengths)\n", 491 | " }\n", 492 | "\n", 493 | " print(\"\\nFinal Chunk Statistics:\")\n", 494 | " print(f\"Total chunks: {chunk_stats['num_chunks']}\")\n", 495 | " print(f\"Average chunk size: {chunk_stats['avg_chunk_size']:.2f} chars\")\n", 496 | " print(f\"Minimum chunk size: {chunk_stats['min_chunk_size']} chars\")\n", 497 | " print(f\"Maximum chunk size: {chunk_stats['max_chunk_size']} chars\")\n", 498 | "\n", 499 | " print(\"\\nSample of first chunk:\")\n", 500 | " if texts:\n", 501 | " print(texts[0][:200] + \"...\")\n", 502 | "\n", 503 | " # Store in cache\n", 504 | " self.chunk_cache[cache_key] = {\n", 505 | " 'texts': texts,\n", 506 | " 'strategy': chunk_strategy,\n", 507 | " 'chunk_stats': chunk_stats\n", 508 | " }\n", 509 | " print(f\"\\nStored chunks in cache with key: {cache_key}\")\n", 510 | "\n", 511 | " except Exception as e:\n", 512 | " print(\"\\nERROR in chunk creation:\")\n", 513 | " print(f\"Error type: {type(e).__name__}\")\n", 514 | " print(f\"Error message: {str(e)}\")\n", 515 | " import traceback\n", 516 | " print(\"\\nTraceback:\")\n", 517 | " 
print(traceback.format_exc())\n", 518 | " return {\n", 519 | " 'texts': [],\n", 520 | " 'strategy': chunk_strategy,\n", 521 | " 'chunk_stats': {\n", 522 | " 'num_chunks': 0,\n", 523 | " 'avg_chunk_size': 0,\n", 524 | " 'min_chunk_size': 0,\n", 525 | " 'max_chunk_size': 0\n", 526 | " }\n", 527 | " }\n", 528 | " else:\n", 529 | " print(f\"\\nRetrieving {len(self.chunk_cache[cache_key]['texts'])} existing chunks from cache\")\n", 530 | "\n", 531 | " result = self.chunk_cache[cache_key]\n", 532 | " print(f\"\\nFinal Output:\")\n", 533 | " print(f\"Number of chunks: {len(result['texts'])}\")\n", 534 | " print(f\"Strategy: {result['strategy']}\")\n", 535 | " print(\"=\"*50)\n", 536 | " return result\n", 537 | "\n", 538 | " def run_cosine_search(self, query: str, threshold: int, chunk_strategy: str = \"semantic\", k: int = 5) -> List[Dict]:\n", 539 | " \"\"\"Run cosine similarity search with enhanced error handling and debugging\"\"\"\n", 540 | " print(\"\\n\" + \"=\"*50)\n", 541 | " print(\"COSINE SEARCH DEBUG LOG\")\n", 542 | " print(\"=\"*50)\n", 543 | " print(f\"Query: {query}\")\n", 544 | " print(f\"Strategy: {chunk_strategy}\")\n", 545 | " print(f\"Threshold: {threshold}\")\n", 546 | " print(f\"Requested k: {k}\")\n", 547 | "\n", 548 | " if chunk_strategy == \"semantic\":\n", 549 | " cache_key = f\"{chunk_strategy}_{threshold}\"\n", 550 | " else:\n", 551 | " cache_key = f\"{chunk_strategy}_{CHUNKING_CONFIGS['max_chunk_size']}\"\n", 552 | "\n", 553 | " print(\"\\nCache Status:\")\n", 554 | " print(f\"Cache key: {cache_key}\")\n", 555 | " print(f\"Available cache keys: {list(self.chunk_cache.keys())}\")\n", 556 | " print(f\"Chunks cache hit: {cache_key in self.chunk_cache}\")\n", 557 | " print(f\"Embeddings cache hit: {cache_key in self.embedding_cache}\")\n", 558 | "\n", 559 | " # First, ensure we have chunks\n", 560 | " if cache_key not in self.chunk_cache:\n", 561 | " print(f\"\\nERROR: No chunks found in cache for {cache_key}\")\n", 562 | " print(\"This suggests chunk creation failed or wasn't called\")\n", 563 | " return []\n", 564 | "\n", 565 | " chunks_data = self.chunk_cache[cache_key]\n", 566 | " if not chunks_data['texts']:\n", 567 | " print(\"\\nERROR: Chunks list is empty\")\n", 568 | " print(\"This suggests chunk creation succeeded but produced no chunks\")\n", 569 | " return []\n", 570 | "\n", 571 | " print(f\"\\nFound {len(chunks_data['texts'])} chunks to search\")\n", 572 | " print(f\"Sample chunk (first 100 chars): {chunks_data['texts'][0][:100]}...\")\n", 573 | "\n", 574 | " try:\n", 575 | " if self.embedding_model is None:\n", 576 | " print(\"\\nInitializing embedding model\")\n", 577 | " self.initialize_embedding_model()\n", 578 | "\n", 579 | " if cache_key not in self.embedding_cache:\n", 580 | " print(\"\\nGenerating embeddings for chunks...\")\n", 581 | " chunk_embeddings = []\n", 582 | "\n", 583 | " # Process in batches\n", 584 | " batch_size = 32\n", 585 | " total_batches = (len(chunks_data['texts']) + batch_size - 1) // batch_size\n", 586 | "\n", 587 | " for i in range(0, len(chunks_data['texts']), batch_size):\n", 588 | " batch = chunks_data['texts'][i:i + batch_size]\n", 589 | " print(f\"\\nProcessing batch {i//batch_size + 1}/{total_batches}\")\n", 590 | " print(f\"Batch size: {len(batch)} chunks\")\n", 591 | "\n", 592 | " batch_embeddings = [self.embedding_model.get_text_embedding(text) for text in batch]\n", 593 | " chunk_embeddings.extend(batch_embeddings)\n", 594 | " print(f\"Total embeddings so far: {len(chunk_embeddings)}\")\n", 595 | "\n", 596 | " 
print(\"\\nConverting to numpy array...\")\n", 597 | " embeddings_array = np.array(chunk_embeddings).astype('float32')\n", 598 | " print(f\"Embeddings shape: {embeddings_array.shape}\")\n", 599 | "\n", 600 | " print(\"Normalizing embeddings...\")\n", 601 | " norms = np.linalg.norm(embeddings_array, axis=1)[:, np.newaxis]\n", 602 | " norms[norms == 0] = 1 # Prevent division by zero\n", 603 | " normalized_embeddings = embeddings_array / norms\n", 604 | "\n", 605 | " print(\"Creating FAISS index...\")\n", 606 | " dimension = embeddings_array.shape[1]\n", 607 | " index = faiss.IndexFlatIP(dimension)\n", 608 | " index.add(normalized_embeddings)\n", 609 | "\n", 610 | " self.embedding_cache[cache_key] = {\n", 611 | " 'embeddings': embeddings_array,\n", 612 | " 'index': index\n", 613 | " }\n", 614 | " print(\"Embeddings cached successfully\")\n", 615 | "\n", 616 | " print(\"\\nProcessing query...\")\n", 617 | " query_embedding = self.embedding_model.get_text_embedding(query)\n", 618 | " query_embedding = np.array([query_embedding]).astype('float32')\n", 619 | "\n", 620 | " print(\"Normalizing query embedding...\")\n", 621 | " query_norm = np.linalg.norm(query_embedding)\n", 622 | " if query_norm == 0:\n", 623 | " print(\"ERROR: Zero query vector\")\n", 624 | " return []\n", 625 | " query_normalized = query_embedding / query_norm\n", 626 | "\n", 627 | " print(f\"\\nSearching for top {k} matches...\")\n", 628 | " distances, indices = self.embedding_cache[cache_key]['index'].search(\n", 629 | " query_normalized, k\n", 630 | " )\n", 631 | "\n", 632 | " print(\"\\nFormatting results...\")\n", 633 | " results = []\n", 634 | " for score, idx in zip(distances[0], indices[0]):\n", 635 | " if idx >= 0 and idx < len(chunks_data['texts']):\n", 636 | " results.append({\n", 637 | " 'text': chunks_data['texts'][idx],\n", 638 | " 'distance': float(score),\n", 639 | " 'strategy': chunk_strategy\n", 640 | " })\n", 641 | " print(f\"\\nMatch {len(results)}:\")\n", 642 | " print(f\"Score: {float(score):.4f}\")\n", 643 | " print(f\"Text preview: {chunks_data['texts'][idx][:100]}...\")\n", 644 | "\n", 645 | " print(f\"\\nTotal matches found: {len(results)}\")\n", 646 | " print(\"=\"*50)\n", 647 | " return results\n", 648 | "\n", 649 | " except Exception as e:\n", 650 | " print(\"\\nERROR in cosine search:\")\n", 651 | " print(f\"Error type: {type(e).__name__}\")\n", 652 | " print(f\"Error message: {str(e)}\")\n", 653 | " import traceback\n", 654 | " print(\"\\nTraceback:\")\n", 655 | " print(traceback.format_exc())\n", 656 | " print(\"=\"*50)\n", 657 | " return []\n", 658 | "\n", 659 | " def generate_response(self, query: str, context_rag: list, model: Dict) -> dict:\n", 660 | " \"\"\"Generate response using provided context with source tracking\"\"\"\n", 661 | " try:\n", 662 | " if not context_rag:\n", 663 | " return {\n", 664 | " \"response_text\": \"No relevant context found.\",\n", 665 | " \"sources\": [],\n", 666 | " \"source_tracking\": {\n", 667 | " \"num_sources_provided\": 0,\n", 668 | " \"source_ids\": [],\n", 669 | " \"verification_status\": \"no_context\"\n", 670 | " },\n", 671 | " \"strategy\": None\n", 672 | " }\n", 673 | "\n", 674 | " print(\"\\n=== DEBUG: Context Chunks Passed to LLM ===\")\n", 675 | " print(f\"Query: {query}\")\n", 676 | " print(f\"Number of chunks: {len(context_rag)}\")\n", 677 | "\n", 678 | " # Generate unique IDs for each source chunk\n", 679 | " context_with_ids = []\n", 680 | " for idx, doc in enumerate(context_rag):\n", 681 | " source_id = f\"src_{idx}\"\n", 682 | " 
context_with_ids.append({\n", 683 | " \"text\": doc['text'],\n", 684 | " \"id\": source_id,\n", 685 | " \"distance\": doc.get('distance', 0)\n", 686 | " })\n", 687 | " print(f\"\\nChunk {source_id}:\")\n", 688 | " print(f\"Distance: {doc.get('distance', 'N/A')}\")\n", 689 | " print(\"Text:\", doc['text'])\n", 690 | " print(\"=\"*50)\n", 691 | "\n", 692 | " # Format context with source IDs\n", 693 | " formatted_context = \"\\n\\n\".join([\n", 694 | " f\"[{doc['id']}] {doc['text']}\"\n", 695 | " for doc in context_with_ids\n", 696 | " ])\n", 697 | "\n", 698 | " prompt = PromptTemplate(template=\"\"\"\n", 699 | " Instructions:\n", 700 | "\n", 701 | " You are a helpful assistant who answers questions strictly from the provided context.\n", 702 | " Given the context information, provide a direct and concise answer to the question: {query}\n", 703 | "\n", 704 | " Important rules:\n", 705 | " 1. Only use information present in the context\n", 706 | " 2. If you don't know or can't find the information, say \"I don't know\"\n", 707 | " 3. You must cite the source IDs [src_X] for every piece of information you use\n", 708 | " 4. Do not make assumptions or use external knowledge\n", 709 | "\n", 710 | " You must format your response as a JSON string object, starting with \"LLM_Response:\"\n", 711 | "\n", 712 | " Your answer must follow this exact format:\n", 713 | "\n", 714 | " LLM_Response:\n", 715 | " {{\n", 716 | " \"response_text\": \"Your detailed answer here with [src_X] citations inline\",\n", 717 | " \"sources\": [\n", 718 | " \"Copy and paste here the exact text segments you used, with their source IDs\"\n", 719 | " ],\n", 720 | " \"source_ids_used\": [\"List of all source IDs referenced in your answer\"]\n", 721 | " }}\n", 722 | "\n", 723 | " Context (with source IDs):\n", 724 | " ---------------\n", 725 | " {context}\n", 726 | " ---------------\n", 727 | " \"\"\")\n", 728 | "\n", 729 | " model_type = model['type']\n", 730 | " llm = model['llm']\n", 731 | "\n", 732 | " chain = prompt | llm | StrOutputParser()\n", 733 | "\n", 734 | " response = chain.invoke({\n", 735 | " \"query\": query,\n", 736 | " \"context\": formatted_context\n", 737 | " })\n", 738 | "\n", 739 | " response_text = response.split(\"LLM_Response:\")[-1].strip()\n", 740 | "\n", 741 | " try:\n", 742 | " if '{' in response_text and '}' in response_text:\n", 743 | " json_str = response_text[response_text.find('{'):response_text.rfind('}')+1]\n", 744 | " parsed_response = json.loads(json_str)\n", 745 | "\n", 746 | " # Verify source usage\n", 747 | " claimed_sources = set(parsed_response.get(\"source_ids_used\", []))\n", 748 | " available_sources = {doc[\"id\"] for doc in context_with_ids}\n", 749 | "\n", 750 | " verification_status = {\n", 751 | " \"status\": \"verified\" if claimed_sources.issubset(available_sources) else \"source_mismatch\",\n", 752 | " \"claimed_sources\": list(claimed_sources),\n", 753 | " \"available_sources\": list(available_sources),\n", 754 | " \"unauthorized_sources\": list(claimed_sources - available_sources)\n", 755 | " }\n", 756 | "\n", 757 | " return {\n", 758 | " \"response_text\": parsed_response.get(\"response_text\", response_text),\n", 759 | " \"sources\": parsed_response.get(\"sources\", []),\n", 760 | " \"source_tracking\": {\n", 761 | " \"num_sources_provided\": len(context_with_ids),\n", 762 | " \"source_ids\": [doc[\"id\"] for doc in context_with_ids],\n", 763 | " \"verification_status\": verification_status\n", 764 | " },\n", 765 | " \"strategy\": context_rag[0]['strategy'] if context_rag 
else None\n", 766 | " }\n", 767 | " else:\n", 768 | " return {\n", 769 | " \"response_text\": response_text,\n", 770 | " \"sources\": [],\n", 771 | " \"source_tracking\": {\n", 772 | " \"num_sources_provided\": len(context_with_ids),\n", 773 | " \"source_ids\": [doc[\"id\"] for doc in context_with_ids],\n", 774 | " \"verification_status\": {\n", 775 | " \"status\": \"parsing_failed\",\n", 776 | " \"error\": \"Response not in JSON format\"\n", 777 | " }\n", 778 | " },\n", 779 | " \"strategy\": context_rag[0]['strategy'] if context_rag else None\n", 780 | " }\n", 781 | "\n", 782 | " except json.JSONDecodeError:\n", 783 | " return {\n", 784 | " \"response_text\": response_text,\n", 785 | " \"sources\": [],\n", 786 | " \"source_tracking\": {\n", 787 | " \"num_sources_provided\": len(context_with_ids),\n", 788 | " \"source_ids\": [doc[\"id\"] for doc in context_with_ids],\n", 789 | " \"verification_status\": {\n", 790 | " \"status\": \"parsing_failed\",\n", 791 | " \"error\": \"JSON decode error\"\n", 792 | " }\n", 793 | " },\n", 794 | " \"strategy\": context_rag[0]['strategy'] if context_rag else None\n", 795 | " }\n", 796 | "\n", 797 | " except Exception as e:\n", 798 | " print(f\"An error occurred: {str(e)}\")\n", 799 | " return {\n", 800 | " \"response_text\": \"An error occurred while generating the response.\",\n", 801 | " \"sources\": [],\n", 802 | " \"source_tracking\": {\n", 803 | " \"num_sources_provided\": 0,\n", 804 | " \"source_ids\": [],\n", 805 | " \"verification_status\": {\n", 806 | " \"status\": \"error\",\n", 807 | " \"error\": str(e)\n", 808 | " }\n", 809 | " },\n", 810 | " \"strategy\": None\n", 811 | " }" 812 | ], 813 | "metadata": { 814 | "id": "YY5rnivk-bAh" 815 | }, 816 | "execution_count": null, 817 | "outputs": [] 818 | }, 819 | { 820 | "cell_type": "markdown", 821 | "source": [ 822 | "#ModelConfig Class" 823 | ], 824 | "metadata": { 825 | "id": "ljqi1Qg8j9F8" 826 | } 827 | }, 828 | { 829 | "cell_type": "code", 830 | "source": [ 831 | "class ModelConfig:\n", 832 | " \"\"\"Handles model configuration and management\"\"\"\n", 833 | " def __init__(self,\n", 834 | " models: List[Dict],\n", 835 | " temperature: float = 0.3):\n", 836 | " self.models = models\n", 837 | " self.temperature = temperature\n", 838 | " self.current_model = None\n", 839 | " self.current_model_name = None\n", 840 | "\n", 841 | "\n", 842 | " @contextmanager\n", 843 | " def load_model(self, model_config: Dict):\n", 844 | " \"\"\"Context manager for lazy loading and proper cleanup of models\"\"\"\n", 845 | " try:\n", 846 | " model_name = model_config[\"name\"]\n", 847 | " model_type = model_config[\"type\"]\n", 848 | "\n", 849 | " # Clear any existing model\n", 850 | " self.cleanup_current_model()\n", 851 | "\n", 852 | " if model_type == \"mistral_api\":\n", 853 | " mistral_api_key = userdata.get('MISTRAL_API_KEY')\n", 854 | " self.current_model = {\n", 855 | " 'llm': ChatMistralAI(\n", 856 | " model=model_name,\n", 857 | " temperature=self.temperature,\n", 858 | " api_key=mistral_api_key\n", 859 | " ),\n", 860 | " 'type': 'mistral_api'\n", 861 | " }\n", 862 | " else: # huggingface\n", 863 | " print(f\"Loading huggingface model: {model_name}\")\n", 864 | "\n", 865 | " # Empty CUDA cache before loading new model\n", 866 | " torch.cuda.empty_cache()\n", 867 | " gc.collect()\n", 868 | "\n", 869 | " tokenizer = AutoTokenizer.from_pretrained(\n", 870 | " pretrained_model_name_or_path=model_config[\"tokenizer\"],\n", 871 | " trust_remote_code=True,\n", 872 | " use_fast=True,\n", 873 | " 
padding_side=\"left\"\n", 874 | " )\n", 875 | "\n", 876 | " model = AutoModelForCausalLM.from_pretrained(\n", 877 | " pretrained_model_name_or_path=model_name,\n", 878 | " device_map=\"auto\",\n", 879 | " trust_remote_code=True,\n", 880 | " torch_dtype=torch.float16,\n", 881 | " use_cache=True,\n", 882 | " low_cpu_mem_usage=True,\n", 883 | " )\n", 884 | "\n", 885 | " pipe = pipeline(\n", 886 | " \"text-generation\",\n", 887 | " model=model,\n", 888 | " tokenizer=tokenizer,\n", 889 | " max_new_tokens=512,\n", 890 | " temperature=self.temperature,\n", 891 | " top_p=0.95,\n", 892 | " top_k=50,\n", 893 | " do_sample=True,\n", 894 | " device_map=\"auto\"\n", 895 | " )\n", 896 | "\n", 897 | " self.current_model = {\n", 898 | " 'llm': HuggingFacePipeline(pipeline=pipe),\n", 899 | " 'type': 'huggingface',\n", 900 | " 'model': model, # Keep reference for cleanup\n", 901 | " 'pipe': pipe # Keep reference for cleanup\n", 902 | " }\n", 903 | "\n", 904 | " self.current_model_name = model_name\n", 905 | " yield self.current_model\n", 906 | "\n", 907 | " finally:\n", 908 | " # Cleanup will happen in cleanup_current_model()\n", 909 | " pass\n", 910 | "\n", 911 | " def cleanup_current_model(self):\n", 912 | " \"\"\"Clean up the current model and free memory\"\"\"\n", 913 | " if self.current_model is not None:\n", 914 | " if self.current_model['type'] == 'huggingface':\n", 915 | " # Delete model components explicitly\n", 916 | " del self.current_model['llm']\n", 917 | " del self.current_model['model']\n", 918 | " del self.current_model['pipe']\n", 919 | "\n", 920 | " # Clear CUDA cache\n", 921 | " torch.cuda.empty_cache()\n", 922 | "\n", 923 | " # Run garbage collection\n", 924 | " gc.collect()\n", 925 | "\n", 926 | " self.current_model = None\n", 927 | " self.current_model_name = None" 928 | ], 929 | "metadata": { 930 | "id": "tCzG7OE0IiDT" 931 | }, 932 | "execution_count": null, 933 | "outputs": [] 934 | }, 935 | { 936 | "cell_type": "markdown", 937 | "source": [ 938 | "#ExperimentRunner Class" 939 | ], 940 | "metadata": { 941 | "id": "05gTul4pIW6S" 942 | } 943 | }, 944 | { 945 | "cell_type": "code", 946 | "source": [ 947 | "class ExperimentRunner:\n", 948 | " \"\"\"Handles experiment execution\"\"\"\n", 949 | " def __init__(self,\n", 950 | " model_config: ModelConfig,\n", 951 | " questions: List[str],\n", 952 | " chunk_strategies: List[str],\n", 953 | " semantic_enabled: bool = False,\n", 954 | " semantic_thresholds: List[int] = None,\n", 955 | " rag_pipeline: RAGPipeline = None):\n", 956 | " self.model_config = model_config\n", 957 | " self.questions = questions\n", 958 | " self.chunk_strategies = chunk_strategies\n", 959 | " self.semantic_enabled = semantic_enabled\n", 960 | " self.semantic_thresholds = semantic_thresholds if semantic_enabled else []\n", 961 | "\n", 962 | " # Use existing RAG pipeline or create new one\n", 963 | " global _GLOBAL_RAG_PIPELINE\n", 964 | " if rag_pipeline:\n", 965 | " self.rag_pipeline = rag_pipeline\n", 966 | " elif _GLOBAL_RAG_PIPELINE:\n", 967 | " self.rag_pipeline = _GLOBAL_RAG_PIPELINE\n", 968 | " else:\n", 969 | " print(\"Initializing new RAG pipeline\")\n", 970 | " _GLOBAL_RAG_PIPELINE = RAGPipeline()\n", 971 | " self.rag_pipeline = _GLOBAL_RAG_PIPELINE\n", 972 | "\n", 973 | " def run_experiments(self) -> Dict:\n", 974 | " results = {\n", 975 | " \"metadata\": {\n", 976 | " \"timestamp\": time.strftime(\"%Y%m%d-%H%M%S\"),\n", 977 | " \"models_tested\": [model[\"name\"] for model in self.model_config.models],\n", 978 | " \"semantic_enabled\": 
self.semantic_enabled,\n", 979 | " \"semantic_thresholds\": self.semantic_thresholds if self.semantic_enabled else [],\n", 980 | " \"chunk_strategies\": self.chunk_strategies,\n", 981 | " \"temperature\": self.model_config.temperature\n", 982 | " },\n", 983 | " \"results\": []\n", 984 | " }\n", 985 | "\n", 986 | " for model_config in self.model_config.models:\n", 987 | " model_name = model_config[\"name\"]\n", 988 | " print(f\"\\nTesting model: {model_name}\")\n", 989 | "\n", 990 | " with self.model_config.load_model(model_config) as model:\n", 991 | " for strategy in self.chunk_strategies:\n", 992 | " # Handle thresholds based on strategy type\n", 993 | " if strategy == \"semantic\" and self.semantic_enabled:\n", 994 | " thresholds_to_test = self.semantic_thresholds\n", 995 | " else:\n", 996 | " thresholds_to_test = [None]\n", 997 | "\n", 998 | " for threshold in thresholds_to_test:\n", 999 | " chunks_data = self.rag_pipeline.create_chunks(\n", 1000 | " documents,\n", 1001 | " threshold=threshold,\n", 1002 | " chunk_strategy=strategy\n", 1003 | " )\n", 1004 | "\n", 1005 | " chunk_stats = {\n", 1006 | " \"strategy\": strategy,\n", 1007 | " \"threshold\": threshold,\n", 1008 | " \"stats\": chunks_data[\"chunk_stats\"]\n", 1009 | " }\n", 1010 | "\n", 1011 | " for question in self.questions:\n", 1012 | " print(f\"Processing question: {question}\")\n", 1013 | "\n", 1014 | " context = self.rag_pipeline.run_cosine_search(\n", 1015 | " query=question,\n", 1016 | " threshold=threshold,\n", 1017 | " chunk_strategy=strategy\n", 1018 | " )\n", 1019 | "\n", 1020 | " answer = self.rag_pipeline.generate_response(\n", 1021 | " query=question,\n", 1022 | " context_rag=context,\n", 1023 | " model=model\n", 1024 | " )\n", 1025 | "\n", 1026 | " results[\"results\"].append({\n", 1027 | " \"model\": model_name,\n", 1028 | " \"threshold\": threshold if strategy == \"semantic\" else None,\n", 1029 | " \"chunk_strategy\": strategy,\n", 1030 | " \"question\": question,\n", 1031 | " \"response\": answer,\n", 1032 | " \"chunk_stats\": chunk_stats[\"stats\"]\n", 1033 | " })\n", 1034 | "\n", 1035 | " return results" 1036 | ], 1037 | "metadata": { 1038 | "id": "8hFyd9G1kC8M" 1039 | }, 1040 | "execution_count": null, 1041 | "outputs": [] 1042 | }, 1043 | { 1044 | "cell_type": "markdown", 1045 | "source": [ 1046 | "#Evaluator Class" 1047 | ], 1048 | "metadata": { 1049 | "id": "EpjD-Qz54mfu" 1050 | } 1051 | }, 1052 | { 1053 | "cell_type": "code", 1054 | "source": [ 1055 | "class ExperimentEvaluator:\n", 1056 | " \"\"\"Handles pure evaluation logic\"\"\"\n", 1057 | " def __init__(self, api_key: str):\n", 1058 | " self.client = openai.OpenAI(api_key=api_key)\n", 1059 | " self.encoder = tiktoken.encoding_for_model(\"gpt-4o\")\n", 1060 | "\n", 1061 | " def _get_baseline_answers(self, questions: List[str], source_docs: List) -> Dict[str, str]:\n", 1062 | " \"\"\"Get GPT-4o's own answers to the questions as baseline\"\"\"\n", 1063 | " print(\"\\n=== DEBUG: _get_baseline_answers ===\")\n", 1064 | " print(f\"Questions received: {questions}\")\n", 1065 | " print(f\"Number of document parts: {len(source_docs)}\")\n", 1066 | "\n", 1067 | " # Concatenate all document parts\n", 1068 | " full_document = \"\\n\\n\".join([doc.text for doc in source_docs])\n", 1069 | " print(f\"\\nFull document length: {len(full_document)} characters\")\n", 1070 | "\n", 1071 | " # Print sample from document\n", 1072 | " print(\"\\nSampling from document:\")\n", 1073 | " print(\"Start:\", full_document[:200], \"...\")\n", 1074 | " print(\"Middle:\", 
full_document[len(full_document)//2:len(full_document)//2 + 200], \"...\")\n", 1075 | " print(\"End:\", full_document[-200:], \"...\")\n", 1076 | "\n", 1077 | " baseline_prompt = f\"\"\"Source Document:\n", 1078 | " {full_document}\n", 1079 | "\n", 1080 | " Using ONLY the information from the source document above, answer these questions.\n", 1081 | " - If the exact information is found, provide it with specific numbers\n", 1082 | " - If information is not found, explicitly state that\n", 1083 | " - If there are metrics, make sure to include appropriate units\n", 1084 | "\n", 1085 | " Format your response as a valid JSON object with questions as keys and answers as values.\n", 1086 | " Keep answers concise and factual.\n", 1087 | "\n", 1088 | " Questions to answer:\n", 1089 | " {json.dumps(questions, indent=2)}\"\"\"\n", 1090 | "\n", 1091 | " try:\n", 1092 | " print(\"\\n--- Getting Baseline Answers ---\")\n", 1093 | " response = self.client.chat.completions.create(\n", 1094 | " model=\"gpt-4o\",\n", 1095 | " messages=[\n", 1096 | " {\"role\": \"system\", \"content\": \"You are a helpful assistant that provides JSON-formatted answers based on source documents.\"},\n", 1097 | " {\"role\": \"user\", \"content\": baseline_prompt}\n", 1098 | " ],\n", 1099 | " temperature=0.1\n", 1100 | " )\n", 1101 | "\n", 1102 | " content = response.choices[0].message.content\n", 1103 | " print(\"\\nRaw GPT-4 Response:\")\n", 1104 | " print(content)\n", 1105 | "\n", 1106 | " if '{' in content and '}' in content:\n", 1107 | " json_str = content[content.find('{'):content.rfind('}')+1]\n", 1108 | " baseline_answers = json.loads(json_str)\n", 1109 | " print(\"\\nParsed Baseline Answers:\")\n", 1110 | " print(baseline_answers)\n", 1111 | " return baseline_answers\n", 1112 | " print(\"\\nWarning: No JSON structure found in response\")\n", 1113 | " return {\"error\": \"No JSON structure found\", \"questions\": questions}\n", 1114 | "\n", 1115 | " except Exception as e:\n", 1116 | " print(f\"\\nError in _get_baseline_answers: {str(e)}\")\n", 1117 | " return {\"error\": str(e), \"questions\": questions}\n", 1118 | "\n", 1119 | " def evaluate_experiments(self, experiment_results: Dict, *, source_docs: List) -> Dict: # Updated signature\n", 1120 | " \"\"\"Core evaluation logic\"\"\"\n", 1121 | " try:\n", 1122 | " print(\"\\n=== DEBUG: evaluate_experiments ===\")\n", 1123 | " print(\"Getting questions...\")\n", 1124 | " questions = list(set(result[\"question\"] for result in experiment_results[\"results\"]))\n", 1125 | " print(f\"Questions extracted: {questions}\")\n", 1126 | "\n", 1127 | " print(\"\\nGetting baseline answers...\")\n", 1128 | " baseline_answers = self._get_baseline_answers(questions, source_docs) # Pass source_docs\n", 1129 | " print(f\"Baseline answers received: {baseline_answers}\")\n", 1130 | "\n", 1131 | " model_strategy_combinations = set(\n", 1132 | " (result[\"model\"],\n", 1133 | " result[\"chunk_strategy\"],\n", 1134 | " result[\"threshold\"] if result[\"chunk_strategy\"] == \"semantic\" else None)\n", 1135 | " for result in experiment_results[\"results\"]\n", 1136 | " )\n", 1137 | "\n", 1138 | " all_evaluations = []\n", 1139 | "\n", 1140 | " for model, strategy, threshold in model_strategy_combinations:\n", 1141 | " relevant_results = [r for r in experiment_results[\"results\"]\n", 1142 | " if r[\"model\"] == model and\n", 1143 | " r[\"chunk_strategy\"] == strategy and\n", 1144 | " (r[\"threshold\"] == threshold if strategy == \"semantic\" else True)]\n", 1145 | "\n", 1146 | " for result in 
relevant_results:\n", 1147 | " print(f\"\\nEvaluating response for: {result['question']}\")\n", 1148 | " baseline = baseline_answers.get(result[\"question\"], \"No baseline available\")\n", 1149 | " print(f\"Using baseline answer: {baseline}\")\n", 1150 | "\n", 1151 | " evaluation = self._evaluate_single_response(result, baseline)\n", 1152 | " all_evaluations.append(evaluation)\n", 1153 | "\n", 1154 | " return {\n", 1155 | " \"metadata\": {\n", 1156 | " \"timestamp\": datetime.now().isoformat(),\n", 1157 | " \"model_used\": \"gpt-4o\",\n", 1158 | " \"num_combinations_evaluated\": len(model_strategy_combinations),\n", 1159 | " \"num_questions_evaluated\": len(questions),\n", 1160 | " \"evaluation_status\": \"success\"\n", 1161 | " },\n", 1162 | " \"evaluations\": all_evaluations,\n", 1163 | " \"summary\": self._generate_summary(all_evaluations)\n", 1164 | " }\n", 1165 | "\n", 1166 | " except Exception as e:\n", 1167 | " print(f\"\\nCritical error in evaluate_experiments: {str(e)}\")\n", 1168 | " return self._create_default_evaluation(experiment_results)\n", 1169 | "\n", 1170 | " def _evaluate_single_response(self, result: Dict, baseline: str) -> Dict:\n", 1171 | " \"\"\"Evaluate a single response with clearer scoring criteria\"\"\"\n", 1172 | " evaluation_prompt = f\"\"\"Compare and evaluate this response. You must evaluate three separate aspects:\n", 1173 | "\n", 1174 | " 1. ACCURACY - Compare the model's answer against the baseline (ground truth)\n", 1175 | " 2. SOURCE ATTRIBUTION - Check if the model's answer matches its cited sources\n", 1176 | " 3. CONCISESNESS - Check if the model's answer is clear and direct\n", 1177 | "\n", 1178 | " Question: {result[\"question\"]}\n", 1179 | "\n", 1180 | " Baseline (Ground Truth): {baseline}\n", 1181 | "\n", 1182 | " Model Response: {result.get(\"response\", {}).get(\"response_text\", \"\")}\n", 1183 | " Sources Cited: {json.dumps(result.get(\"response\", {}).get(\"sources\", []), indent=2)}\n", 1184 | "\n", 1185 | " Scoring Criteria:\n", 1186 | "\n", 1187 | " 1. ACCURACY (0-100):\n", 1188 | " - Compare ONLY the model's answer against the baseline\n", 1189 | " - 100: Exact match with baseline (including numbers and units)\n", 1190 | " - 50: Partially correct but with some errors\n", 1191 | " - 0: Completely different from baseline or wrong\n", 1192 | "\n", 1193 | " 2. SOURCE ATTRIBUTION (0-100):\n", 1194 | " - Compare ONLY the model's answer against its cited sources\n", 1195 | " - 100: Answer exactly matches what appears in cited sources INCLUDING UNITS\n", 1196 | " - 50: Answer partially matches cited sources\n", 1197 | " - 0: Answer doesn't match cited sources or no sources cited\n", 1198 | "\n", 1199 | " Note: For large numbers, different formats are acceptable (e.g., $19,000 million = $19 billion)\n", 1200 | " BUT the units must match what appears in the source document for full attribution score.\n", 1201 | " The units in the source document are authoritative.\n", 1202 | "\n", 1203 | " 3. 
CONCISENESS (0-100):\n", 1204 | " - 100: Clear, direct answer without extra information\n", 1205 | " - 50: Contains some irrelevant information\n", 1206 | " - 0: Verbose or unclear\n", 1207 | "\n", 1208 | " Note: A response can have perfect source attribution (100) even if the answer is wrong,\n", 1209 | " as long as it accurately reflects what's in its cited sources.\n", 1210 | "\n", 1211 | " Provide your evaluation in this exact JSON format:\n", 1212 | " {{\n", 1213 | " \"model\": \"{result[\"model\"]}\",\n", 1214 | " \"chunk_strategy\": \"{result[\"chunk_strategy\"]}\",\n", 1215 | " \"threshold\": {result[\"threshold\"] if result[\"chunk_strategy\"] == \"semantic\" else \"null\"},\n", 1216 | " \"question\": \"{result[\"question\"]}\",\n", 1217 | " \"baseline_answer\": \"{baseline}\",\n", 1218 | " \"model_response\": {json.dumps(result.get(\"response\", {}), indent=2)},\n", 1219 | " \"chunk_stats\": {json.dumps(result.get(\"chunk_stats\", {}), indent=2)},\n", 1220 | " \"scores\": {{\n", 1221 | " \"accuracy\": ,\n", 1222 | " \"source_attribution\": ,\n", 1223 | " \"conciseness\": \n", 1224 | " }},\n", 1225 | " \"composite_score\": ,\n", 1226 | " \"detailed_analysis\": {{\n", 1227 | " \"accuracy_analysis\": \"Explain ONLY how the answer compares to baseline. Explicitly state if numbers match or differ.\",\n", 1228 | " \"attribution_analysis\": \"Explain ONLY how well the answer matches its cited sources, regardless of accuracy.\",\n", 1229 | " \"conciseness_analysis\": \"Explain how clear and direct the answer is\"\n", 1230 | " }}\n", 1231 | " }}\n", 1232 | "\n", 1233 | " Examples:\n", 1234 | "\n", 1235 | " Bad Response (Perfect Attribution, Wrong Answer):\n", 1236 | " - If baseline is \"$10,347M\" but model answers \"$19,921M [src_2]\" and src_2 contains \"$19,921M\"\n", 1237 | " - Accuracy: 0 (completely different from baseline)\n", 1238 | " - Attribution: 100 (perfectly matches its cited source)\n", 1239 | "\n", 1240 | " Good Response (Perfect Both):\n", 1241 | " - If baseline is \"$10,347M\" and model answers \"$10,347M [src_2]\" and src_2 contains \"$10,347M\"\n", 1242 | " - Accuracy: 100 (matches baseline)\n", 1243 | " - Attribution: 100 (matches source)\n", 1244 | " \"\"\"\n", 1245 | "\n", 1246 | " try:\n", 1247 | " response = self.client.chat.completions.create(\n", 1248 | " model=\"gpt-4o\",\n", 1249 | " messages=[\n", 1250 | " {\"role\": \"system\", \"content\": \"You are an expert at evaluating response accuracy against both baseline answers and source data.\"},\n", 1251 | " {\"role\": \"user\", \"content\": evaluation_prompt}\n", 1252 | " ],\n", 1253 | " temperature=0.7,\n", 1254 | " max_tokens=1000\n", 1255 | " )\n", 1256 | "\n", 1257 | " content = response.choices[0].message.content\n", 1258 | " if '{' in content and '}' in content:\n", 1259 | " json_str = content[content.find('{'):content.rfind('}')+1]\n", 1260 | " return json.loads(json_str)\n", 1261 | " return self._create_default_single_evaluation(result, baseline)\n", 1262 | "\n", 1263 | " except Exception as e:\n", 1264 | " print(f\"Error evaluating response: {str(e)}\")\n", 1265 | " return self._create_default_single_evaluation(result, baseline)\n", 1266 | "\n", 1267 | " def _create_default_single_evaluation(self, result: Dict, baseline: str) -> Dict:\n", 1268 | " \"\"\"Create a default evaluation for a single response when evaluation fails\"\"\"\n", 1269 | " return {\n", 1270 | " \"model\": result[\"model\"],\n", 1271 | " \"chunk_strategy\": result[\"chunk_strategy\"],\n", 1272 | " \"threshold\": result[\"threshold\"] 
if result[\"chunk_strategy\"] == \"semantic\" else None,\n", 1273 | " \"question\": result[\"question\"],\n", 1274 | " \"baseline_answer\": baseline,\n", 1275 | " \"model_response\": result.get(\"response\", {}),\n", 1276 | " \"scores\": {\n", 1277 | " \"source_accuracy\": 0,\n", 1278 | " \"source_attribution\": 0,\n", 1279 | " \"conciseness\": 0\n", 1280 | " },\n", 1281 | " \"composite_score\": 0,\n", 1282 | " \"detailed_analysis\": {\n", 1283 | " \"accuracy_analysis\": \"Evaluation failed\",\n", 1284 | " \"attribution_analysis\": \"Evaluation failed\",\n", 1285 | " \"conciseness_analysis\": \"Evaluation failed\"\n", 1286 | " }\n", 1287 | " }\n", 1288 | "\n", 1289 | " def _generate_summary(self, evaluations: List[Dict]) -> Dict:\n", 1290 | " \"\"\"Generate summary statistics from evaluations with ordered results\"\"\"\n", 1291 | " if not evaluations:\n", 1292 | " return {\n", 1293 | " \"overall_performance\": \"No evaluations available\",\n", 1294 | " \"optimal_permutation\": \"Not available\",\n", 1295 | " \"performance_analysis\": \"Evaluation process failed\",\n", 1296 | " \"chunking_statistics\": {}\n", 1297 | " }\n", 1298 | "\n", 1299 | " # Create ordered list of expected configurations\n", 1300 | " ordered_configs = []\n", 1301 | " if CHUNKING_CONFIGS[\"semantic_config\"][\"enabled\"]:\n", 1302 | " for threshold in CHUNKING_CONFIGS[\"semantic_config\"][\"thresholds\"]:\n", 1303 | " ordered_configs.append((\"semantic\", threshold))\n", 1304 | "\n", 1305 | " for strategy in [s for s in CHUNKING_CONFIGS[\"strategies\"] if s != \"semantic\"]:\n", 1306 | " ordered_configs.append((strategy, None))\n", 1307 | "\n", 1308 | " # Get unique models from evaluations\n", 1309 | " unique_models = sorted(set(eval[\"model\"] for eval in evaluations))\n", 1310 | "\n", 1311 | " # Track chunk statistics and performance scores\n", 1312 | " chunking_statistics = {}\n", 1313 | " performance_scores = {}\n", 1314 | " ordered_analysis = {}\n", 1315 | "\n", 1316 | " # Get document name from the documents list\n", 1317 | " document_name = os.path.basename(documents[0].metadata.get('file_path', 'Unknown Document'))\n", 1318 | "\n", 1319 | " # Initialize tracking for all model-strategy combinations\n", 1320 | " for model in unique_models:\n", 1321 | " for strategy, threshold in ordered_configs:\n", 1322 | " key = (model, strategy, threshold)\n", 1323 | " performance_scores[key] = {\n", 1324 | " \"count\": 0,\n", 1325 | " \"total_composite\": 0\n", 1326 | " }\n", 1327 | "\n", 1328 | " # First pass: calculate scores and collect statistics\n", 1329 | " best_score = 0\n", 1330 | " best_config = None\n", 1331 | "\n", 1332 | " for eval in evaluations:\n", 1333 | " model = eval[\"model\"]\n", 1334 | " strategy = eval[\"chunk_strategy\"]\n", 1335 | " threshold = eval[\"threshold\"] if strategy == \"semantic\" else None\n", 1336 | " key = (model, strategy, threshold)\n", 1337 | "\n", 1338 | " # Track performance scores\n", 1339 | " if key in performance_scores:\n", 1340 | " performance_scores[key][\"count\"] += 1\n", 1341 | " performance_scores[key][\"total_composite\"] += eval[\"composite_score\"]\n", 1342 | "\n", 1343 | " # Track chunk statistics (only need one entry per strategy/threshold combination)\n", 1344 | " chunk_key = (strategy, threshold)\n", 1345 | " if chunk_key not in chunking_statistics:\n", 1346 | " chunk_stats = eval.get(\"chunk_stats\", {})\n", 1347 | " if chunk_stats:\n", 1348 | " if threshold is not None:\n", 1349 | " config_str = f\"{document_name} with {strategy} chunking (threshold: 
{threshold})\"\n", 1350 | " else:\n", 1351 | " config_str = f\"{document_name} with {strategy} chunking\"\n", 1352 | "\n", 1353 | " chunking_statistics[chunk_key] = {\n", 1354 | " \"config_str\": config_str,\n", 1355 | " \"stats\": {\n", 1356 | " \"number_of_chunks\": chunk_stats.get(\"num_chunks\", \"N/A\"),\n", 1357 | " \"average_chunk_size\": round(chunk_stats.get(\"avg_chunk_size\", 0), 2),\n", 1358 | " \"min_chunk_size\": chunk_stats.get(\"min_chunk_size\", \"N/A\"),\n", 1359 | " \"max_chunk_size\": chunk_stats.get(\"max_chunk_size\", \"N/A\")\n", 1360 | " }\n", 1361 | " }\n", 1362 | "\n", 1363 | " # Second pass: create ordered performance analysis and chunk statistics\n", 1364 | " ordered_chunking_stats = {}\n", 1365 | " for strategy, threshold in ordered_configs:\n", 1366 | " # Add chunk statistics\n", 1367 | " chunk_key = (strategy, threshold)\n", 1368 | " if chunk_key in chunking_statistics:\n", 1369 | " config_str = chunking_statistics[chunk_key][\"config_str\"]\n", 1370 | " ordered_chunking_stats[config_str] = chunking_statistics[chunk_key][\"stats\"]\n", 1371 | "\n", 1372 | " # Add performance analysis for each model\n", 1373 | " for model in unique_models:\n", 1374 | " key = (model, strategy, threshold)\n", 1375 | " scores = performance_scores[key]\n", 1376 | "\n", 1377 | " if scores[\"count\"] > 0:\n", 1378 | " avg_composite = round(scores[\"total_composite\"] / scores[\"count\"], 2)\n", 1379 | "\n", 1380 | " if threshold is not None:\n", 1381 | " perf_key = f\"{model} with {strategy} chunking (threshold: {threshold})\"\n", 1382 | " else:\n", 1383 | " perf_key = f\"{model} with {strategy} chunking\"\n", 1384 | "\n", 1385 | " ordered_analysis[perf_key] = avg_composite\n", 1386 | "\n", 1387 | " if avg_composite > best_score:\n", 1388 | " best_score = avg_composite\n", 1389 | " best_config = perf_key\n", 1390 | "\n", 1391 | " # Calculate overall average score\n", 1392 | " total_score = sum(eval[\"composite_score\"] for eval in evaluations)\n", 1393 | " avg_score = round(total_score / len(evaluations), 2) if evaluations else 0\n", 1394 | "\n", 1395 | " return {\n", 1396 | " \"overall_performance\": f\"Average composite score across all evaluations: {avg_score:.2f}/100\",\n", 1397 | " \"optimal_permutation\": f\"Best performance: {best_config} (score: {best_score:.2f}/100)\",\n", 1398 | " \"performance_analysis\": ordered_analysis,\n", 1399 | " \"chunking_statistics\": ordered_chunking_stats\n", 1400 | " }\n", 1401 | "\n", 1402 | "\n", 1403 | " def _create_default_evaluation(self, experiment_results: Dict) -> Dict:\n", 1404 | " \"\"\"Create a default evaluation result when the evaluation process fails\"\"\"\n", 1405 | " return {\n", 1406 | " \"metadata\": {\n", 1407 | " \"timestamp\": datetime.now().isoformat(),\n", 1408 | " \"model_used\": \"gpt-4o\",\n", 1409 | " \"num_combinations_evaluated\": 0,\n", 1410 | " \"num_questions_evaluated\": 0,\n", 1411 | " \"evaluation_status\": \"failed\"\n", 1412 | " },\n", 1413 | " \"evaluations\": [\n", 1414 | " self._create_default_single_evaluation(result, \"Evaluation failed\")\n", 1415 | " for result in experiment_results[\"results\"]\n", 1416 | " ],\n", 1417 | " \"summary\": {\n", 1418 | " \"overall_performance\": \"Evaluation failed\",\n", 1419 | " \"optimal_permutation\": \"Not available\",\n", 1420 | " \"performance_analysis\": \"Evaluation process failed\",\n", 1421 | " \"chunking_statistics\": {}\n", 1422 | " }\n", 1423 | " }" 1424 | ], 1425 | "metadata": { 1426 | "id": "1rAK93yw4qCx" 1427 | }, 1428 | "execution_count": null, 1429 | 
"outputs": [] 1430 | }, 1431 | { 1432 | "cell_type": "markdown", 1433 | "source": [ 1434 | "#Results Manager Class" 1435 | ], 1436 | "metadata": { 1437 | "id": "lJmy7VbumFk4" 1438 | } 1439 | }, 1440 | { 1441 | "cell_type": "code", 1442 | "source": [ 1443 | "class ResultsManager:\n", 1444 | " \"\"\"Handles formatting, saving, and displaying evaluation results\"\"\"\n", 1445 | " def __init__(self, save_directory: str):\n", 1446 | " self.save_directory = save_directory\n", 1447 | " os.makedirs(save_directory, exist_ok=True)\n", 1448 | "\n", 1449 | " def format_results(self, experiment_results: Dict, evaluation_results: Dict) -> Tuple[Dict, Dict]:\n", 1450 | " \"\"\"Format experiment and evaluation results into structured output\"\"\"\n", 1451 | " print(\"\\n=== Starting Results Formatting ===\")\n", 1452 | "\n", 1453 | " # Format experiment results\n", 1454 | " formatted_experiment = {\n", 1455 | " \"metadata\": experiment_results.get(\"metadata\", {}),\n", 1456 | " \"results\": [{\n", 1457 | " \"model\": result[\"model\"],\n", 1458 | " \"chunk_strategy\": result[\"chunk_strategy\"],\n", 1459 | " \"threshold\": result[\"threshold\"],\n", 1460 | " \"question\": result[\"question\"],\n", 1461 | " \"response\": {\n", 1462 | " \"answer\": result[\"response\"].get(\"response_text\", \"\"),\n", 1463 | " \"sources\": result[\"response\"].get(\"sources\", [])\n", 1464 | " }\n", 1465 | " } for result in experiment_results[\"results\"]]\n", 1466 | " }\n", 1467 | "\n", 1468 | " # Format evaluation results with baseline answer\n", 1469 | " formatted_evaluation = {\n", 1470 | " \"metadata\": evaluation_results[\"metadata\"],\n", 1471 | " \"evaluations\": [{\n", 1472 | " \"model\": eval.get(\"model\"),\n", 1473 | " \"chunk_strategy\": eval.get(\"chunk_strategy\"),\n", 1474 | " \"threshold\": eval.get(\"threshold\"),\n", 1475 | " \"question\": eval.get(\"question\"),\n", 1476 | " \"baseline_answer\": eval.get(\"baseline_answer\", \"No baseline available\"), # Include baseline answer\n", 1477 | " \"model_response\": eval.get(\"model_response\", {}),\n", 1478 | " \"scores\": eval.get(\"scores\", {}),\n", 1479 | " \"composite_score\": eval.get(\"composite_score\"),\n", 1480 | " \"detailed_analysis\": eval.get(\"detailed_analysis\", {})\n", 1481 | " } for eval in evaluation_results.get(\"evaluations\", [])],\n", 1482 | " \"overall_summary\": evaluation_results.get(\"summary\", {})\n", 1483 | " }\n", 1484 | "\n", 1485 | " return formatted_experiment, formatted_evaluation\n", 1486 | "\n", 1487 | " def save_results(self, formatted_experiment: Dict, formatted_evaluation: Dict) -> Tuple[str, str]:\n", 1488 | " \"\"\"Save formatted results to JSON files\"\"\"\n", 1489 | " timestamp = time.strftime(\"%Y%m%d-%H%M%S\")\n", 1490 | "\n", 1491 | " experiment_file = f\"{self.save_directory}/experiment_results_{timestamp}.json\"\n", 1492 | " evaluation_file = f\"{self.save_directory}/evaluation_results_{timestamp}.json\"\n", 1493 | "\n", 1494 | " for filepath, data in [\n", 1495 | " (experiment_file, formatted_experiment),\n", 1496 | " (evaluation_file, formatted_evaluation)\n", 1497 | " ]:\n", 1498 | " with open(filepath, 'w', encoding='utf-8') as f:\n", 1499 | " json.dump(data, f, indent=2, ensure_ascii=False)\n", 1500 | "\n", 1501 | " return experiment_file, evaluation_file\n", 1502 | "\n", 1503 | " def display_results(self, evaluation_results: Dict):\n", 1504 | " \"\"\"Display evaluation results in a clear, formatted manner\"\"\"\n", 1505 | " print(\"\\n\" + \"=\"*80)\n", 1506 | " print(\"EVALUATION RESULTS\")\n", 1507 | 
" print(\"=\"*80)\n", 1508 | "\n", 1509 | " # Display metadata\n", 1510 | " metadata = evaluation_results.get(\"metadata\", {})\n", 1511 | " print(\"\\nMETADATA:\")\n", 1512 | " print(\"-\"*80)\n", 1513 | " print(f\"Timestamp: {metadata.get('timestamp', 'Not available')}\")\n", 1514 | " print(f\"Model Used: {metadata.get('model_used', 'Not available')}\")\n", 1515 | " print(f\"Combinations: {metadata.get('num_combinations_evaluated', 'Not available')}\")\n", 1516 | " print(f\"Questions: {metadata.get('num_questions_evaluated', 'Not available')}\")\n", 1517 | " print(f\"Evaluation Status: {metadata.get('evaluation_status', 'Not available')}\")\n", 1518 | "\n", 1519 | " # Display evaluations\n", 1520 | " evaluations = evaluation_results.get(\"evaluations\", [])\n", 1521 | " if evaluations:\n", 1522 | " print(\"\\nDETAILED EVALUATIONS:\")\n", 1523 | " print(\"-\"*80)\n", 1524 | " for eval in evaluations:\n", 1525 | " print(f\"\\nQuestion: {eval.get('question', 'No question provided')}\")\n", 1526 | " print(f\"Model: {eval.get('model', 'No model specified')}\")\n", 1527 | " print(f\"Strategy: {eval.get('chunk_strategy', 'No strategy specified')}\")\n", 1528 | " if eval.get('threshold'):\n", 1529 | " print(f\"Threshold: {eval.get('threshold')}\")\n", 1530 | "\n", 1531 | " # Display baseline answer\n", 1532 | " print(\"\\nBaseline Answer:\")\n", 1533 | " baseline = eval.get('baseline_answer', 'No baseline answer available')\n", 1534 | " print(textwrap.fill(str(baseline), width=80))\n", 1535 | "\n", 1536 | " print(\"\\nModel Response:\")\n", 1537 | " response = eval.get('model_response', {})\n", 1538 | " response_text = response.get('response_text', 'No response available')\n", 1539 | " if response_text:\n", 1540 | " print(textwrap.fill(str(response_text), width=80))\n", 1541 | " else:\n", 1542 | " print(\"No response available\")\n", 1543 | "\n", 1544 | " print(\"\\nSource Data:\")\n", 1545 | " sources = response.get('sources', [])\n", 1546 | " if sources:\n", 1547 | " for source in sources:\n", 1548 | " if source: # Check if source is not empty\n", 1549 | " print(textwrap.fill(str(source), width=80))\n", 1550 | " else:\n", 1551 | " print(\"No source data available\")\n", 1552 | "\n", 1553 | " print(\"\\nScores:\")\n", 1554 | " scores = eval.get('scores', {})\n", 1555 | " for metric, score in scores.items():\n", 1556 | " print(f\"- {metric.replace('_', ' ').capitalize()}: {score}/100\")\n", 1557 | " print(f\"Composite Score: {eval.get('composite_score', 0)}/100\")\n", 1558 | "\n", 1559 | " print(\"\\nDetailed Analysis:\")\n", 1560 | " analysis = eval.get('detailed_analysis', {})\n", 1561 | " for aspect, details in analysis.items():\n", 1562 | " if details: # Check if details is not empty\n", 1563 | " print(f\"\\n{aspect.replace('_', ' ').capitalize()}:\")\n", 1564 | " print(textwrap.fill(str(details), width=80))\n", 1565 | "\n", 1566 | " # Display summary\n", 1567 | " summary = evaluation_results.get(\"overall_summary\", {})\n", 1568 | " if summary:\n", 1569 | " print(\"\\nOVERALL SUMMARY:\")\n", 1570 | " print(\"-\"*80)\n", 1571 | "\n", 1572 | " if \"overall_performance\" in summary:\n", 1573 | " print(\"\\nOverall Performance:\")\n", 1574 | " print(textwrap.fill(str(summary[\"overall_performance\"]), width=80))\n", 1575 | "\n", 1576 | " if \"optimal_permutation\" in summary:\n", 1577 | " print(\"\\nOptimal Configuration:\")\n", 1578 | " print(textwrap.fill(str(summary[\"optimal_permutation\"]), width=80))\n", 1579 | "\n", 1580 | " if \"chunking_statistics\" in summary:\n", 1581 | " 
print(\"\\nChunking Statistics:\")\n", 1582 | " chunk_stats = summary[\"chunking_statistics\"]\n", 1583 | " for config, stats in chunk_stats.items():\n", 1584 | " print(f\"\\n{config}:\")\n", 1585 | " print(f\" Number of Chunks: {stats['number_of_chunks']}\")\n", 1586 | " print(f\" Average Chunk Size: {stats['average_chunk_size']}\")\n", 1587 | " print(f\" Min Chunk Size: {stats['min_chunk_size']}\")\n", 1588 | " print(f\" Max Chunk Size: {stats['max_chunk_size']}\")\n", 1589 | "\n", 1590 | " if \"performance_analysis\" in summary:\n", 1591 | " print(\"\\nPerformance Analysis:\")\n", 1592 | " analysis = summary[\"performance_analysis\"]\n", 1593 | " if isinstance(analysis, dict):\n", 1594 | " for config, score in analysis.items():\n", 1595 | " print(f\"{config}: {score:.2f}\")\n", 1596 | " else:\n", 1597 | " print(textwrap.fill(str(analysis), width=80))" 1598 | ], 1599 | "metadata": { 1600 | "id": "vnMb5d8cmKQU" 1601 | }, 1602 | "execution_count": null, 1603 | "outputs": [] 1604 | }, 1605 | { 1606 | "cell_type": "markdown", 1607 | "source": [ 1608 | "#Main" 1609 | ], 1610 | "metadata": { 1611 | "id": "koQ5ZObJC2ek" 1612 | } 1613 | }, 1614 | { 1615 | "cell_type": "code", 1616 | "execution_count": null, 1617 | "metadata": { 1618 | "id": "6qdI5iaXYsun", 1619 | "collapsed": true 1620 | }, 1621 | "outputs": [], 1622 | "source": [ 1623 | "def main():\n", 1624 | " # Initialize configurations with semantic settings from config\n", 1625 | " semantic_enabled = CHUNKING_CONFIGS[\"semantic_config\"][\"enabled\"]\n", 1626 | " semantic_thresholds = CHUNKING_CONFIGS[\"semantic_config\"][\"thresholds\"]\n", 1627 | "\n", 1628 | " # Update strategies list if semantic is enabled\n", 1629 | " strategies = CHUNKING_CONFIGS[\"strategies\"]\n", 1630 | " if semantic_enabled:\n", 1631 | " strategies = [\"semantic\"] + strategies\n", 1632 | "\n", 1633 | " model_config = ModelConfig(\n", 1634 | " models=MODEL_CONFIGS[\"models\"],\n", 1635 | " temperature=0.3\n", 1636 | " )\n", 1637 | "\n", 1638 | " # Initialize experiment runner with flexible configuration\n", 1639 | " experiment_runner = ExperimentRunner(\n", 1640 | " model_config=model_config,\n", 1641 | " questions=QUESTION_CONFIGS[\"questions\"],\n", 1642 | " chunk_strategies=strategies,\n", 1643 | " semantic_enabled=semantic_enabled,\n", 1644 | " semantic_thresholds=semantic_thresholds\n", 1645 | " )\n", 1646 | "\n", 1647 | " print(\"Starting experiment with configurations:\")\n", 1648 | " print(f\"Models: {[model['name'] for model in model_config.models]}\")\n", 1649 | " if semantic_enabled:\n", 1650 | " print(f\"Semantic thresholds: {semantic_thresholds}\")\n", 1651 | " print(f\"Chunk strategies: {strategies}\")\n", 1652 | " print(f\"Number of questions: {len(QUESTION_CONFIGS['questions'])}\")\n", 1653 | "\n", 1654 | " # Rest of the main function remains the same\n", 1655 | " experiment_results = experiment_runner.run_experiments()\n", 1656 | "\n", 1657 | " print(\"\\nInitializing GPT-4o evaluation...\")\n", 1658 | " evaluator = ExperimentEvaluator(api_key=userdata.get('OPENAI_API_KEY'))\n", 1659 | "\n", 1660 | " evaluation_results = evaluator.evaluate_experiments(\n", 1661 | " experiment_results=experiment_results,\n", 1662 | " source_docs=documents\n", 1663 | " )\n", 1664 | "\n", 1665 | " results_manager = ResultsManager(save_directory=FILE_CONFIGS['save_directory'])\n", 1666 | "\n", 1667 | " formatted_experiment, formatted_evaluation = results_manager.format_results(\n", 1668 | " experiment_results=experiment_results,\n", 1669 | " 
evaluation_results=evaluation_results\n", 1670 | " )\n", 1671 | "\n", 1672 | " experiment_file, evaluation_file = results_manager.save_results(\n", 1673 | " formatted_experiment=formatted_experiment,\n", 1674 | " formatted_evaluation=formatted_evaluation\n", 1675 | " )\n", 1676 | "\n", 1677 | " results_manager.display_results(evaluation_results=formatted_evaluation)\n", 1678 | "\n", 1679 | " print(\"\\nExperiment complete!\")\n", 1680 | " print(f\"Results saved to:\")\n", 1681 | " print(f\" Experiment results: {experiment_file}\")\n", 1682 | " print(f\" Evaluation results: {evaluation_file}\")\n", 1683 | "\n", 1684 | " torch.cuda.empty_cache()\n", 1685 | " gc.collect()\n", 1686 | "\n", 1687 | " return formatted_experiment, formatted_evaluation\n", 1688 | "\n", 1689 | "\n", 1690 | "if __name__ == \"__main__\":\n", 1691 | " results, evaluation = main()" 1692 | ] 1693 | } 1694 | ] 1695 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # rag-experimentation-framework 2 | Testing LLMs and RAG configurations at scale using an OpenAI Evaluator / Reflector 3 | 4 | [Motivation for this Project & Results Overview](https://docs.google.com/presentation/d/13QGNKmmOQhmpwAxXuc4k98_ITSvrORRJw6J7J74dx7M/edit#slide=id.g318d9a5244c_0_0) 5 | 6 | [A Systematic Framework for RAG Optimization: Data-Driven Design Through Controlled Experimentation](https://medium.com/@bill.leece/a-systematic-framework-for-rag-optimization-data-driven-design-through-controlled-experimentation-5e7d99643816) 7 | --------------------------------------------------------------------------------
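
For readers who want the shape of the pipeline without opening the notebook JSON, the sketch below condenses the `main()` flow from the final cells: build a `ModelConfig`, run every model × chunking permutation with `ExperimentRunner`, score the results against GPT-4o baseline answers with `ExperimentEvaluator`, then format, save, and print everything with `ResultsManager`. It assumes the earlier notebook cells have already defined the config dictionaries (`MODEL_CONFIGS`, `CHUNKING_CONFIGS`, `QUESTION_CONFIGS`, `FILE_CONFIGS`), loaded the source `documents`, and defined the classes; treat it as an illustration of the wiring, not a standalone script.

```python
# Condensed view of the notebook's end-to-end flow (illustrative only).
# Config dictionaries, document loading, and class definitions live in
# earlier notebook cells; this cell only shows how they are wired together.
from google.colab import userdata  # the notebook reads the API key from Colab secrets

# Chunking strategies, with semantic chunking prepended when enabled in the config
strategies = CHUNKING_CONFIGS["strategies"]
if CHUNKING_CONFIGS["semantic_config"]["enabled"]:
    strategies = ["semantic"] + strategies

model_config = ModelConfig(
    models=MODEL_CONFIGS["models"],
    temperature=0.3,
)

# Run all model / chunking permutations over the configured questions
runner = ExperimentRunner(
    model_config=model_config,
    questions=QUESTION_CONFIGS["questions"],
    chunk_strategies=strategies,
    semantic_enabled=CHUNKING_CONFIGS["semantic_config"]["enabled"],
    semantic_thresholds=CHUNKING_CONFIGS["semantic_config"]["thresholds"],
)
experiment_results = runner.run_experiments()

# Score each response against GPT-4o baseline answers derived from the source documents
evaluator = ExperimentEvaluator(api_key=userdata.get("OPENAI_API_KEY"))
evaluation_results = evaluator.evaluate_experiments(
    experiment_results=experiment_results,
    source_docs=documents,
)

# Format, persist, and display the combined results
results_manager = ResultsManager(save_directory=FILE_CONFIGS["save_directory"])
formatted_experiment, formatted_evaluation = results_manager.format_results(
    experiment_results=experiment_results,
    evaluation_results=evaluation_results,
)
results_manager.save_results(
    formatted_experiment=formatted_experiment,
    formatted_evaluation=formatted_evaluation,
)
results_manager.display_results(evaluation_results=formatted_evaluation)
```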