├── LICENSE
├── RAG_Experimentation_Framework_final.ipynb
└── README.md
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/RAG_Experimentation_Framework_final.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "provenance": [],
7 | "machine_shape": "hm",
8 | "gpuType": "L4",
9 | "authorship_tag": "ABX9TyPKWVIVeEtb8/A7NP4NfWMT",
10 | "include_colab_link": true
11 | },
12 | "kernelspec": {
13 | "name": "python3",
14 | "display_name": "Python 3"
15 | },
16 | "language_info": {
17 | "name": "python"
18 | },
19 | "accelerator": "GPU"
20 | },
21 | "cells": [
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {
25 | "id": "view-in-github",
26 | "colab_type": "text"
27 | },
28 | "source": [
29 | "
"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "source": [
35 | "If you use this code, please cite:\n",
36 | "\n",
37 | "{\n",
38 | " title = {RAG Experimentation Framework},\n",
39 | "\n",
40 | " author = {Bill Leece},\n",
41 | "\n",
42 | " year = {2024}\n",
43 | "}"
44 | ],
45 | "metadata": {
46 | "id": "wZ0kV_UtQn5O"
47 | }
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "source": [
52 | "#Setup"
53 | ],
54 | "metadata": {
55 | "id": "_lHNBLR-92Zk"
56 | }
57 | },
58 | {
59 | "cell_type": "code",
60 | "source": [
61 | "!pip install -U transformers --quiet\n",
62 | "#!pip install -U optimum --quiet\n",
63 | "!pip install -U accelerate --quiet\n",
64 | "!pip install -U bitsandbytes --quiet\n",
65 | "!pip install -U torch --quiet\n",
66 | "!pip install -U sentencepiece --quiet\n",
67 | "!pip install -U llama-index --quiet\n",
68 | "!pip install -U llama-index-llms-mistralai --quiet\n",
69 | "!pip install -U llama-index-embeddings-mistralai --quiet\n",
70 | "!pip install -U llama-index-llms-langchain --quiet\n",
71 | "!pip install -U langchain --quiet\n",
72 | "!pip install -U langchain-community --quiet\n",
73 | "!pip install -U langchain-mistralai --quiet\n",
74 | "!pip install -U langchain_huggingface --quiet\n",
75 | "!pip install -U faiss-gpu --quiet"
76 | ],
77 | "metadata": {
78 | "id": "4g_Vs7wgZW-8"
79 | },
80 | "execution_count": null,
81 | "outputs": []
82 | },
83 | {
84 | "cell_type": "code",
85 | "source": [
86 | "import os\n",
87 | "import json\n",
88 | "import numpy as np\n",
89 | "import faiss\n",
90 | "import transformers\n",
91 | "import torch\n",
92 | "import gc\n",
93 | "import openai\n",
94 | "import json\n",
95 | "import tiktoken\n",
96 | "import textwrap\n",
97 | "import time\n",
98 | "from google.colab import drive, userdata\n",
99 | "from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\n",
100 | "from langchain.prompts import PromptTemplate\n",
101 | "from langchain_huggingface import HuggingFacePipeline\n",
102 | "from langchain_core.output_parsers import StrOutputParser\n",
103 | "from langchain_mistralai.chat_models import ChatMistralAI\n",
104 | "from llama_index.embeddings.mistralai import MistralAIEmbedding\n",
105 | "from llama_index.core import SimpleDirectoryReader, Settings\n",
106 | "from llama_index.core.node_parser import SemanticSplitterNodeParser\n",
107 | "import time\n",
108 | "from typing import List, Dict, Tuple\n",
109 | "from contextlib import contextmanager\n",
110 | "from langchain.schema.runnable import RunnableSequence\n",
111 | "from langchain.schema.output_parser import StrOutputParser\n",
112 | "from langchain_text_splitters.markdown import MarkdownHeaderTextSplitter\n",
113 | "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
114 | "from datetime import datetime\n",
115 | "from typing import Dict, List, Any"
116 | ],
117 | "metadata": {
118 | "id": "Ao7eaSfq-TKs"
119 | },
120 | "execution_count": null,
121 | "outputs": []
122 | },
123 | {
124 | "cell_type": "code",
125 | "source": [
126 | "os.environ[\"HF_TOKEN\"] = userdata.get('HF_TOKEN')\n",
127 | "os.environ[\"MISTRAL_API_KEY\"] = userdata.get('MISTRAL_API_KEY')\n",
128 | "api_key = userdata.get('OPENAI_API_KEY')"
129 | ],
130 | "metadata": {
131 | "id": "YvGHY024-OXK"
132 | },
133 | "execution_count": null,
134 | "outputs": []
135 | },
136 | {
137 | "cell_type": "code",
138 | "source": [
139 | "device = 'cuda' if torch.cuda.is_available() else 'cpu' #Use GPUs when possible"
140 | ],
141 | "metadata": {
142 | "id": "mxAHV7T_-Xlh"
143 | },
144 | "execution_count": null,
145 | "outputs": []
146 | },
147 | {
148 | "cell_type": "markdown",
149 | "source": [
150 | "#Experiment Configurations"
151 | ],
152 | "metadata": {
153 | "id": "jkqEV8M_HUKG"
154 | }
155 | },
156 | {
157 | "cell_type": "code",
158 | "source": [
159 | "# Setup configurations\n",
160 | "MODEL_CONFIGS = {\n",
161 | " \"models\": [\n",
162 | " # {\n",
163 | " # \"name\": \"open-mixtral-8x7b\",\n",
164 | " # \"type\": \"mistral_api\",\n",
165 | " # \"tokenizer\": None, # Not needed for API models\n",
166 | " # },\n",
167 | "\n",
168 | " {\n",
169 | " \"name\": \"mistral-large-latest\",\n",
170 | " \"type\": \"mistral_api\",\n",
171 | " \"tokenizer\": None, # Not needed for API models\n",
172 | " },\n",
173 | "\n",
174 | " {\n",
175 | " \"name\": \"open-mistral-nemo\",\n",
176 | " \"type\": \"mistral_api\",\n",
177 | " \"tokenizer\": None, # Not needed for API models\n",
178 | " },\n",
179 | "# {\n",
180 | "# \"name\": \"ministral-8b-latest\",\n",
181 | "# \"type\": \"mistral_api\",\n",
182 | "# \"tokenizer\": None, # Not needed for API models\n",
183 | "# },\n",
184 | " # {\n",
185 | " # \"name\": \"meta-llama/Llama-3.1-8B-Instruct\",\n",
186 | " # \"type\": \"huggingface\",\n",
187 | " # \"tokenizer\": \"meta-llama/Llama-3.1-8B-Instruct\"\n",
188 | " # },\n",
189 | "\n",
190 | " # {\n",
191 | " # \"name\": \"wjleece/quantized-mistral-7b\",\n",
192 | " # \"type\": \"huggingface\",\n",
193 | " # \"tokenizer\": \"mistralai/Mixtral-8x7B-v0.1\", # The same tokenizer that works on the base model will work on the quantized model - there is no 'quantized tokenizer'\n",
194 | " # \"quantization_config\": { #Quantization config left here as a reference, but not used in the code (as we're using an already quantized model from HuggingFace)\n",
195 | " # \"load_in_4bit\": True,\n",
196 | " # \"bnb_4bit_compute_dtype\": \"float16\",\n",
197 | " # \"bnb_4bit_quant_type\": \"nf4\",\n",
198 | " # \"bnb_4bit_use_double_quant\": False\n",
199 | " # }\n",
200 | " # },\n",
201 | " {\n",
202 | " \"name\": \"wjleece/quantized-mistral-nemo-12b\",\n",
203 | " \"type\": \"huggingface\",\n",
204 | " \"tokenizer\": \"mistralai/Mistral-Nemo-Instruct-2407\", # The same tokenizer that works on the base model will work on the quantized model - there is no 'quantized tokenizer'\n",
205 | " \"quantization_config\": { #Quantization config left here as a reference, but not used in the code (as we're using an already quantized model from HuggingFace)\n",
206 | " \"load_in_4bit\": True,\n",
207 | " \"bnb_4bit_compute_dtype\": \"float16\",\n",
208 | " \"bnb_4bit_quant_type\": \"nf4\",\n",
209 | " \"bnb_4bit_use_double_quant\": False\n",
210 | " }\n",
211 | " },\n",
212 | " # {\n",
213 | " # \"name\": \"wjleece/quantized-mistral-8b\",\n",
214 | " # \"type\": \"huggingface\",\n",
215 | " # \"tokenizer\": \"mistralai/Ministral-8B-Instruct-2410\", # The same tokenizer that works on the base model will work on the quantized model - there is no 'quantized tokenizer'\n",
216 | " # \"quantization_config\": { #Quantization config left here as a reference, but not used in the code (as we're using an already quantized model from HuggingFace)\n",
217 | " # \"load_in_4bit\": True,\n",
218 | " # \"bnb_4bit_compute_dtype\": \"float16\",\n",
219 | " # \"bnb_4bit_quant_type\": \"nf4\",\n",
220 | " # \"bnb_4bit_use_double_quant\": False\n",
221 | " # }\n",
222 | " # },\n",
223 | " # {\n",
224 | " # \"name\": \"wjleece/quantized-llama-3.1-8b\",\n",
225 | " # \"type\": \"huggingface\",\n",
226 | " # \"tokenizer\": \"meta-llama/Llama-3.1-8B-Instruct\", # The same tokenizer that works on the base model will work on the quantized model - there is no 'quantized tokenizer'\n",
227 | " # \"quantization_config\": { #Quantization config left here as a reference, but not used in the code (as we're using an already quantized model from HuggingFace)\n",
228 | " # \"load_in_4bit\": True,\n",
229 | " # \"bnb_4bit_compute_dtype\": \"float16\",\n",
230 | " # \"bnb_4bit_quant_type\": \"nf4\",\n",
231 | " # \"bnb_4bit_use_double_quant\": False\n",
232 | " # }\n",
233 | " # }\n",
234 | " ]\n",
235 | "}\n",
236 | "\n",
237 | "\n",
238 | "CHUNKING_CONFIGS = {\n",
239 | " \"strategies\": [\"paragraph\", \"header\"],\n",
240 | " \"semantic_config\": {\n",
241 | " \"enabled\": True,\n",
242 | " \"thresholds\": [85, 95] if True else []\n",
243 | " },\n",
244 | " \"max_chunk_size\": 2048,\n",
245 | " \"chunk_overlap\": 100,\n",
246 | " \"min_chunk_size\": 35 #we'll ignore any chunk ~5 words or less\n",
247 | "}\n",
248 | "\n",
249 | "QUESTION_CONFIGS = {\n",
250 | " \"questions\": [\n",
251 | " \"What were cloud revenues in the most recent quarter?\",\n",
252 | " \"What were the main drivers of revenue growth in the most recent quarter?\",\n",
253 | " \"How much did YouTube ad revenues grow in the most recent quarter in APAC?\",\n",
254 | " \"Can you summarize recent key antitrust matters?\",\n",
255 | " \"Compare the revenue growth across all geographic regions and explain the main factors for each region.\",\n",
256 | " \"Summarize all mentioned risk factors related to international operations.\",\n",
257 | " \"What were the major changes in operating expenses across all categories and their stated reasons?\",\n",
258 | " ] #These quetsions should relate to the RAG document --> these are your 'business use cases'\n",
259 | "}\n",
260 | "\n",
261 | "FILE_CONFIGS = {\n",
262 | " \"save_directory\": '/content/drive/My Drive/AI/Model_Analysis'\n",
263 | "}"
264 | ],
265 | "metadata": {
266 | "id": "YDjgk_JhHWkj"
267 | },
268 | "execution_count": null,
269 | "outputs": []
270 | },
271 | {
272 | "cell_type": "markdown",
273 | "source": [
274 | "#Load RAG Document"
275 | ],
276 | "metadata": {
277 | "id": "e_wxgOGc95sf"
278 | }
279 | },
280 | {
281 | "cell_type": "code",
282 | "source": [
283 | "drive.mount('/content/drive')\n",
284 | "documents = SimpleDirectoryReader(input_files=[\"/content/drive/My Drive/AI/Datasets/Google-10-q/goog-10-q-q3-2024.pdf\"]).load_data()"
285 | ],
286 | "metadata": {
287 | "id": "gS4Lemk09v9Y"
288 | },
289 | "execution_count": null,
290 | "outputs": []
291 | },
292 | {
293 | "cell_type": "markdown",
294 | "source": [
295 | "#RAG Pipeline Class"
296 | ],
297 | "metadata": {
298 | "id": "MgLxma5M-bZA"
299 | }
300 | },
301 | {
302 | "cell_type": "code",
303 | "source": [
304 | "# Global singleton instance\n",
305 | "_GLOBAL_RAG_PIPELINE = None\n",
306 | "\n",
307 | "class RAGPipeline:\n",
308 | " def __init__(self):\n",
309 | " self.chunk_cache = {}\n",
310 | " self.embedding_cache = {}\n",
311 | " self.embedding_model = None\n",
312 | "\n",
313 | " @classmethod\n",
314 | " def get_instance(cls):\n",
315 | " \"\"\"Get or create singleton instance\"\"\"\n",
316 | " global _GLOBAL_RAG_PIPELINE\n",
317 | " if _GLOBAL_RAG_PIPELINE is None:\n",
318 | " _GLOBAL_RAG_PIPELINE = cls()\n",
319 | " return _GLOBAL_RAG_PIPELINE\n",
320 | "\n",
321 | "\n",
322 | " def initialize_embedding_model(self):\n",
323 | " \"\"\"Initialize the embedding model if not already initialized\"\"\"\n",
324 | " if self.embedding_model is None:\n",
325 | " mistral_api_key = userdata.get('MISTRAL_API_KEY')\n",
326 | " self.embedding_model = MistralAIEmbedding(\n",
327 | " model_name=\"mistral-embed\",\n",
328 | " api_key=mistral_api_key\n",
329 | " )\n",
330 | " return self.embedding_model\n",
331 | "\n",
332 | " def convert_to_markdown_headers(self, text):\n",
333 | " \"\"\"Convert document section titles to markdown headers\"\"\"\n",
334 | " import re\n",
335 | "\n",
336 | " patterns = [\n",
337 | " (r'^(?:ITEM|Section)\\s+\\d+[.:]\\s*(.+)$', '# '),\n",
338 | " (r'^\\d+\\.\\d+\\s+(.+)$', '## '),\n",
339 | " (r'^\\([a-z]\\)\\s+(.+)$', '### ')\n",
340 | " ]\n",
341 | "\n",
342 | " lines = text.split('\\n')\n",
343 | " markdown_lines = []\n",
344 | "\n",
345 | " for line in lines:\n",
346 | " line = line.strip()\n",
347 | " converted = False\n",
348 | "\n",
349 | " for pattern, header_mark in patterns:\n",
350 | " if re.match(pattern, line, re.IGNORECASE):\n",
351 | " markdown_lines.append(f\"{header_mark}{line}\")\n",
352 | " converted = True\n",
353 | " break\n",
354 | "\n",
355 | " if not converted:\n",
356 | " markdown_lines.append(line)\n",
357 | "\n",
358 | " return '\\n'.join(markdown_lines)\n",
359 | "\n",
360 | "\n",
361 | " def create_chunks(self, documents: List, threshold: int, chunk_strategy: str = \"semantic\") -> Dict:\n",
362 | " \"\"\"Create or retrieve chunks based on specified strategy\"\"\"\n",
363 | "\n",
364 | " MAX_CHUNK_SIZE = CHUNKING_CONFIGS['max_chunk_size']\n",
365 | " CHUNK_OVERLAP = CHUNKING_CONFIGS['chunk_overlap']\n",
366 | " MIN_CHUNK_SIZE = CHUNKING_CONFIGS['min_chunk_size']\n",
367 | "\n",
368 | "\n",
369 | " if chunk_strategy == \"semantic\":\n",
370 | " cache_key = f\"{chunk_strategy}_{threshold}\"\n",
371 | " print(f\"Using semantic cache key: {cache_key} with threshold: {threshold}\")\n",
372 | " else:\n",
373 | " cache_key = f\"{chunk_strategy}_{MAX_CHUNK_SIZE}\"\n",
374 | " print(f\"Using non-semantic cache key: {cache_key}\")\n",
375 | "\n",
376 | "\n",
377 | " if cache_key not in self.chunk_cache:\n",
378 | " print(\"\\nStarting new chunk creation:\")\n",
379 | " texts = []\n",
380 | "\n",
381 | " try:\n",
382 | " if chunk_strategy == \"semantic\":\n",
383 | " print(\"Processing semantic chunking...\")\n",
384 | " if self.embedding_model is None:\n",
385 | " print(\"Initializing embedding model\")\n",
386 | " self.initialize_embedding_model()\n",
387 | "\n",
388 | " splitter = SemanticSplitterNodeParser(\n",
389 | " buffer_size=1,\n",
390 | " breakpoint_percentile_threshold=threshold,\n",
391 | " embed_model=self.embedding_model\n",
392 | " )\n",
393 | " nodes = splitter.get_nodes_from_documents(documents)\n",
394 | " texts = [node.text for node in nodes]\n",
395 | " print(f\"Generated {len(texts)} semantic chunks\")\n",
396 | "\n",
397 | " elif chunk_strategy == \"paragraph\":\n",
398 | " print(\"Processing paragraph chunking...\")\n",
399 | " text_splitter = RecursiveCharacterTextSplitter(\n",
400 | " separators=[\"\\n\\n\", \"\\n\", \". \", \" \", \"\"],\n",
401 | " chunk_size=MAX_CHUNK_SIZE,\n",
402 | " chunk_overlap=CHUNK_OVERLAP,\n",
403 | " length_function=len\n",
404 | " )\n",
405 | "\n",
406 | " for idx, doc in enumerate(documents):\n",
407 | " print(f\"\\nProcessing document {idx + 1}/{len(documents)}\")\n",
408 | " print(f\"Document length: {len(doc.text)} characters\")\n",
409 | " doc_chunks = text_splitter.split_text(doc.text)\n",
410 | " print(f\"Initial chunks from document: {len(doc_chunks)}\")\n",
411 | " if doc_chunks:\n",
412 | " print(f\"Sample chunk lengths: {[len(c) for c in doc_chunks[:3]]}\")\n",
413 | " texts.extend(doc_chunks)\n",
414 | "\n",
415 | " elif chunk_strategy == \"header\":\n",
416 | " print(\"Processing header chunking...\")\n",
417 | " text_splitter = RecursiveCharacterTextSplitter(\n",
418 | " separators=[\"\\n\\n\", \"\\n\", \". \", \" \", \"\"],\n",
419 | " chunk_size=MAX_CHUNK_SIZE,\n",
420 | " chunk_overlap=CHUNK_OVERLAP,\n",
421 | " length_function=len\n",
422 | " )\n",
423 | "\n",
424 | " for idx, doc in enumerate(documents):\n",
425 | " print(f\"\\nProcessing document {idx + 1}/{len(documents)}\")\n",
426 | " md_text = self.convert_to_markdown_headers(doc.text)\n",
427 | " print(\"Headers identified. First 100 chars of markdown text:\")\n",
428 | " print(md_text[:100] + \"...\")\n",
429 | "\n",
430 | " headers_to_split_on = [\n",
431 | " (\"#\", \"Header 1\"),\n",
432 | " (\"##\", \"Header 2\"),\n",
433 | " (\"###\", \"Header 3\"),\n",
434 | " ]\n",
435 | "\n",
436 | " header_splitter = MarkdownHeaderTextSplitter(\n",
437 | " headers_to_split_on=headers_to_split_on\n",
438 | " )\n",
439 | "\n",
440 | " splits = header_splitter.split_text(md_text)\n",
441 | " print(f\"Generated {len(splits)} header sections\")\n",
442 | " if splits:\n",
443 | " print(\"Sample section lengths:\", [len(s.page_content) for s in splits[:3]])\n",
444 | "\n",
445 | " for split in splits:\n",
446 | " if len(split.page_content) > MAX_CHUNK_SIZE:\n",
447 | " print(f\"Splitting large section: {len(split.page_content)} chars\")\n",
448 | " subsections = text_splitter.split_text(split.page_content)\n",
449 | " print(f\"Created {len(subsections)} subsections\")\n",
450 | " texts.extend(subsections)\n",
451 | " else:\n",
452 | " texts.append(split.page_content)\n",
453 | "\n",
454 | " print(\"\\nCleaning and filtering chunks...\")\n",
455 | " initial_count = len(texts)\n",
456 | " cleaned_texts = []\n",
457 | " for idx, text in enumerate(texts):\n",
458 | " if not isinstance(text, str):\n",
459 | " print(f\"Warning: Non-string chunk found at index {idx}\")\n",
460 | " continue\n",
461 | "\n",
462 | " cleaned_text = text.strip()\n",
463 | " if len(cleaned_text) >= MIN_CHUNK_SIZE:\n",
464 | " cleaned_texts.append(cleaned_text)\n",
465 | " else:\n",
466 | " print(f\"Filtered out small chunk: {len(cleaned_text)} chars\")\n",
467 | "\n",
468 | " texts = cleaned_texts\n",
469 | " print(f\"Chunks after cleaning: {len(texts)} (removed {initial_count - len(texts)})\")\n",
470 | "\n",
471 | " if not texts:\n",
472 | " print(\"WARNING: No valid chunks generated!\")\n",
473 | " return {\n",
474 | " 'texts': [],\n",
475 | " 'strategy': chunk_strategy,\n",
476 | " 'chunk_stats': {\n",
477 | " 'num_chunks': 0,\n",
478 | " 'avg_chunk_size': 0,\n",
479 | " 'min_chunk_size': 0,\n",
480 | " 'max_chunk_size': 0\n",
481 | " }\n",
482 | " }\n",
483 | "\n",
484 | " # Calculate chunk statistics\n",
485 | " chunk_lengths = [len(t) for t in texts]\n",
486 | " chunk_stats = {\n",
487 | " 'num_chunks': len(texts),\n",
488 | " 'avg_chunk_size': sum(chunk_lengths)/len(texts),\n",
489 | " 'min_chunk_size': min(chunk_lengths),\n",
490 | " 'max_chunk_size': max(chunk_lengths)\n",
491 | " }\n",
492 | "\n",
493 | " print(\"\\nFinal Chunk Statistics:\")\n",
494 | " print(f\"Total chunks: {chunk_stats['num_chunks']}\")\n",
495 | " print(f\"Average chunk size: {chunk_stats['avg_chunk_size']:.2f} chars\")\n",
496 | " print(f\"Minimum chunk size: {chunk_stats['min_chunk_size']} chars\")\n",
497 | " print(f\"Maximum chunk size: {chunk_stats['max_chunk_size']} chars\")\n",
498 | "\n",
499 | " print(\"\\nSample of first chunk:\")\n",
500 | " if texts:\n",
501 | " print(texts[0][:200] + \"...\")\n",
502 | "\n",
503 | " # Store in cache\n",
504 | " self.chunk_cache[cache_key] = {\n",
505 | " 'texts': texts,\n",
506 | " 'strategy': chunk_strategy,\n",
507 | " 'chunk_stats': chunk_stats\n",
508 | " }\n",
509 | " print(f\"\\nStored chunks in cache with key: {cache_key}\")\n",
510 | "\n",
511 | " except Exception as e:\n",
512 | " print(\"\\nERROR in chunk creation:\")\n",
513 | " print(f\"Error type: {type(e).__name__}\")\n",
514 | " print(f\"Error message: {str(e)}\")\n",
515 | " import traceback\n",
516 | " print(\"\\nTraceback:\")\n",
517 | " print(traceback.format_exc())\n",
518 | " return {\n",
519 | " 'texts': [],\n",
520 | " 'strategy': chunk_strategy,\n",
521 | " 'chunk_stats': {\n",
522 | " 'num_chunks': 0,\n",
523 | " 'avg_chunk_size': 0,\n",
524 | " 'min_chunk_size': 0,\n",
525 | " 'max_chunk_size': 0\n",
526 | " }\n",
527 | " }\n",
528 | " else:\n",
529 | " print(f\"\\nRetrieving {len(self.chunk_cache[cache_key]['texts'])} existing chunks from cache\")\n",
530 | "\n",
531 | " result = self.chunk_cache[cache_key]\n",
532 | " print(f\"\\nFinal Output:\")\n",
533 | " print(f\"Number of chunks: {len(result['texts'])}\")\n",
534 | " print(f\"Strategy: {result['strategy']}\")\n",
535 | " print(\"=\"*50)\n",
536 | " return result\n",
537 | "\n",
538 | " def run_cosine_search(self, query: str, threshold: int, chunk_strategy: str = \"semantic\", k: int = 5) -> List[Dict]:\n",
539 | " \"\"\"Run cosine similarity search with enhanced error handling and debugging\"\"\"\n",
540 | " print(\"\\n\" + \"=\"*50)\n",
541 | " print(\"COSINE SEARCH DEBUG LOG\")\n",
542 | " print(\"=\"*50)\n",
543 | " print(f\"Query: {query}\")\n",
544 | " print(f\"Strategy: {chunk_strategy}\")\n",
545 | " print(f\"Threshold: {threshold}\")\n",
546 | " print(f\"Requested k: {k}\")\n",
547 | "\n",
548 | " if chunk_strategy == \"semantic\":\n",
549 | " cache_key = f\"{chunk_strategy}_{threshold}\"\n",
550 | " else:\n",
551 | " cache_key = f\"{chunk_strategy}_{CHUNKING_CONFIGS['max_chunk_size']}\"\n",
552 | "\n",
553 | " print(\"\\nCache Status:\")\n",
554 | " print(f\"Cache key: {cache_key}\")\n",
555 | " print(f\"Available cache keys: {list(self.chunk_cache.keys())}\")\n",
556 | " print(f\"Chunks cache hit: {cache_key in self.chunk_cache}\")\n",
557 | " print(f\"Embeddings cache hit: {cache_key in self.embedding_cache}\")\n",
558 | "\n",
559 | " # First, ensure we have chunks\n",
560 | " if cache_key not in self.chunk_cache:\n",
561 | " print(f\"\\nERROR: No chunks found in cache for {cache_key}\")\n",
562 | " print(\"This suggests chunk creation failed or wasn't called\")\n",
563 | " return []\n",
564 | "\n",
565 | " chunks_data = self.chunk_cache[cache_key]\n",
566 | " if not chunks_data['texts']:\n",
567 | " print(\"\\nERROR: Chunks list is empty\")\n",
568 | " print(\"This suggests chunk creation succeeded but produced no chunks\")\n",
569 | " return []\n",
570 | "\n",
571 | " print(f\"\\nFound {len(chunks_data['texts'])} chunks to search\")\n",
572 | " print(f\"Sample chunk (first 100 chars): {chunks_data['texts'][0][:100]}...\")\n",
573 | "\n",
574 | " try:\n",
575 | " if self.embedding_model is None:\n",
576 | " print(\"\\nInitializing embedding model\")\n",
577 | " self.initialize_embedding_model()\n",
578 | "\n",
579 | " if cache_key not in self.embedding_cache:\n",
580 | " print(\"\\nGenerating embeddings for chunks...\")\n",
581 | " chunk_embeddings = []\n",
582 | "\n",
583 | " # Process in batches\n",
584 | " batch_size = 32\n",
585 | " total_batches = (len(chunks_data['texts']) + batch_size - 1) // batch_size\n",
586 | "\n",
587 | " for i in range(0, len(chunks_data['texts']), batch_size):\n",
588 | " batch = chunks_data['texts'][i:i + batch_size]\n",
589 | " print(f\"\\nProcessing batch {i//batch_size + 1}/{total_batches}\")\n",
590 | " print(f\"Batch size: {len(batch)} chunks\")\n",
591 | "\n",
592 | " batch_embeddings = [self.embedding_model.get_text_embedding(text) for text in batch]\n",
593 | " chunk_embeddings.extend(batch_embeddings)\n",
594 | " print(f\"Total embeddings so far: {len(chunk_embeddings)}\")\n",
595 | "\n",
596 | " print(\"\\nConverting to numpy array...\")\n",
597 | " embeddings_array = np.array(chunk_embeddings).astype('float32')\n",
598 | " print(f\"Embeddings shape: {embeddings_array.shape}\")\n",
599 | "\n",
600 | " print(\"Normalizing embeddings...\")\n",
601 | " norms = np.linalg.norm(embeddings_array, axis=1)[:, np.newaxis]\n",
602 | " norms[norms == 0] = 1 # Prevent division by zero\n",
603 | " normalized_embeddings = embeddings_array / norms\n",
604 | "\n",
605 | " print(\"Creating FAISS index...\")\n",
606 | " dimension = embeddings_array.shape[1]\n",
607 | " index = faiss.IndexFlatIP(dimension)\n",
608 | " index.add(normalized_embeddings)\n",
609 | "\n",
610 | " self.embedding_cache[cache_key] = {\n",
611 | " 'embeddings': embeddings_array,\n",
612 | " 'index': index\n",
613 | " }\n",
614 | " print(\"Embeddings cached successfully\")\n",
615 | "\n",
616 | " print(\"\\nProcessing query...\")\n",
617 | " query_embedding = self.embedding_model.get_text_embedding(query)\n",
618 | " query_embedding = np.array([query_embedding]).astype('float32')\n",
619 | "\n",
620 | " print(\"Normalizing query embedding...\")\n",
621 | " query_norm = np.linalg.norm(query_embedding)\n",
622 | " if query_norm == 0:\n",
623 | " print(\"ERROR: Zero query vector\")\n",
624 | " return []\n",
625 | " query_normalized = query_embedding / query_norm\n",
626 | "\n",
627 | " print(f\"\\nSearching for top {k} matches...\")\n",
628 | " distances, indices = self.embedding_cache[cache_key]['index'].search(\n",
629 | " query_normalized, k\n",
630 | " )\n",
631 | "\n",
632 | " print(\"\\nFormatting results...\")\n",
633 | " results = []\n",
634 | " for score, idx in zip(distances[0], indices[0]):\n",
635 | " if idx >= 0 and idx < len(chunks_data['texts']):\n",
636 | " results.append({\n",
637 | " 'text': chunks_data['texts'][idx],\n",
638 | " 'distance': float(score),\n",
639 | " 'strategy': chunk_strategy\n",
640 | " })\n",
641 | " print(f\"\\nMatch {len(results)}:\")\n",
642 | " print(f\"Score: {float(score):.4f}\")\n",
643 | " print(f\"Text preview: {chunks_data['texts'][idx][:100]}...\")\n",
644 | "\n",
645 | " print(f\"\\nTotal matches found: {len(results)}\")\n",
646 | " print(\"=\"*50)\n",
647 | " return results\n",
648 | "\n",
649 | " except Exception as e:\n",
650 | " print(\"\\nERROR in cosine search:\")\n",
651 | " print(f\"Error type: {type(e).__name__}\")\n",
652 | " print(f\"Error message: {str(e)}\")\n",
653 | " import traceback\n",
654 | " print(\"\\nTraceback:\")\n",
655 | " print(traceback.format_exc())\n",
656 | " print(\"=\"*50)\n",
657 | " return []\n",
658 | "\n",
659 | " def generate_response(self, query: str, context_rag: list, model: Dict) -> dict:\n",
660 | " \"\"\"Generate response using provided context with source tracking\"\"\"\n",
661 | " try:\n",
662 | " if not context_rag:\n",
663 | " return {\n",
664 | " \"response_text\": \"No relevant context found.\",\n",
665 | " \"sources\": [],\n",
666 | " \"source_tracking\": {\n",
667 | " \"num_sources_provided\": 0,\n",
668 | " \"source_ids\": [],\n",
669 | " \"verification_status\": \"no_context\"\n",
670 | " },\n",
671 | " \"strategy\": None\n",
672 | " }\n",
673 | "\n",
674 | " print(\"\\n=== DEBUG: Context Chunks Passed to LLM ===\")\n",
675 | " print(f\"Query: {query}\")\n",
676 | " print(f\"Number of chunks: {len(context_rag)}\")\n",
677 | "\n",
678 | " # Generate unique IDs for each source chunk\n",
679 | " context_with_ids = []\n",
680 | " for idx, doc in enumerate(context_rag):\n",
681 | " source_id = f\"src_{idx}\"\n",
682 | " context_with_ids.append({\n",
683 | " \"text\": doc['text'],\n",
684 | " \"id\": source_id,\n",
685 | " \"distance\": doc.get('distance', 0)\n",
686 | " })\n",
687 | " print(f\"\\nChunk {source_id}:\")\n",
688 | " print(f\"Distance: {doc.get('distance', 'N/A')}\")\n",
689 | " print(\"Text:\", doc['text'])\n",
690 | " print(\"=\"*50)\n",
691 | "\n",
692 | " # Format context with source IDs\n",
693 | " formatted_context = \"\\n\\n\".join([\n",
694 | " f\"[{doc['id']}] {doc['text']}\"\n",
695 | " for doc in context_with_ids\n",
696 | " ])\n",
697 | "\n",
698 | " prompt = PromptTemplate(template=\"\"\"\n",
699 | " Instructions:\n",
700 | "\n",
701 | " You are a helpful assistant who answers questions strictly from the provided context.\n",
702 | " Given the context information, provide a direct and concise answer to the question: {query}\n",
703 | "\n",
704 | " Important rules:\n",
705 | " 1. Only use information present in the context\n",
706 | " 2. If you don't know or can't find the information, say \"I don't know\"\n",
707 | " 3. You must cite the source IDs [src_X] for every piece of information you use\n",
708 | " 4. Do not make assumptions or use external knowledge\n",
709 | "\n",
710 | " You must format your response as a JSON string object, starting with \"LLM_Response:\"\n",
711 | "\n",
712 | " Your answer must follow this exact format:\n",
713 | "\n",
714 | " LLM_Response:\n",
715 | " {{\n",
716 | " \"response_text\": \"Your detailed answer here with [src_X] citations inline\",\n",
717 | " \"sources\": [\n",
718 | " \"Copy and paste here the exact text segments you used, with their source IDs\"\n",
719 | " ],\n",
720 | " \"source_ids_used\": [\"List of all source IDs referenced in your answer\"]\n",
721 | " }}\n",
722 | "\n",
723 | " Context (with source IDs):\n",
724 | " ---------------\n",
725 | " {context}\n",
726 | " ---------------\n",
727 | " \"\"\")\n",
728 | "\n",
729 | " model_type = model['type']\n",
730 | " llm = model['llm']\n",
731 | "\n",
732 | " chain = prompt | llm | StrOutputParser()\n",
733 | "\n",
734 | " response = chain.invoke({\n",
735 | " \"query\": query,\n",
736 | " \"context\": formatted_context\n",
737 | " })\n",
738 | "\n",
739 | " response_text = response.split(\"LLM_Response:\")[-1].strip()\n",
740 | "\n",
741 | " try:\n",
742 | " if '{' in response_text and '}' in response_text:\n",
743 | " json_str = response_text[response_text.find('{'):response_text.rfind('}')+1]\n",
744 | " parsed_response = json.loads(json_str)\n",
745 | "\n",
746 | " # Verify source usage\n",
747 | " claimed_sources = set(parsed_response.get(\"source_ids_used\", []))\n",
748 | " available_sources = {doc[\"id\"] for doc in context_with_ids}\n",
749 | "\n",
750 | " verification_status = {\n",
751 | " \"status\": \"verified\" if claimed_sources.issubset(available_sources) else \"source_mismatch\",\n",
752 | " \"claimed_sources\": list(claimed_sources),\n",
753 | " \"available_sources\": list(available_sources),\n",
754 | " \"unauthorized_sources\": list(claimed_sources - available_sources)\n",
755 | " }\n",
756 | "\n",
757 | " return {\n",
758 | " \"response_text\": parsed_response.get(\"response_text\", response_text),\n",
759 | " \"sources\": parsed_response.get(\"sources\", []),\n",
760 | " \"source_tracking\": {\n",
761 | " \"num_sources_provided\": len(context_with_ids),\n",
762 | " \"source_ids\": [doc[\"id\"] for doc in context_with_ids],\n",
763 | " \"verification_status\": verification_status\n",
764 | " },\n",
765 | " \"strategy\": context_rag[0]['strategy'] if context_rag else None\n",
766 | " }\n",
767 | " else:\n",
768 | " return {\n",
769 | " \"response_text\": response_text,\n",
770 | " \"sources\": [],\n",
771 | " \"source_tracking\": {\n",
772 | " \"num_sources_provided\": len(context_with_ids),\n",
773 | " \"source_ids\": [doc[\"id\"] for doc in context_with_ids],\n",
774 | " \"verification_status\": {\n",
775 | " \"status\": \"parsing_failed\",\n",
776 | " \"error\": \"Response not in JSON format\"\n",
777 | " }\n",
778 | " },\n",
779 | " \"strategy\": context_rag[0]['strategy'] if context_rag else None\n",
780 | " }\n",
781 | "\n",
782 | " except json.JSONDecodeError:\n",
783 | " return {\n",
784 | " \"response_text\": response_text,\n",
785 | " \"sources\": [],\n",
786 | " \"source_tracking\": {\n",
787 | " \"num_sources_provided\": len(context_with_ids),\n",
788 | " \"source_ids\": [doc[\"id\"] for doc in context_with_ids],\n",
789 | " \"verification_status\": {\n",
790 | " \"status\": \"parsing_failed\",\n",
791 | " \"error\": \"JSON decode error\"\n",
792 | " }\n",
793 | " },\n",
794 | " \"strategy\": context_rag[0]['strategy'] if context_rag else None\n",
795 | " }\n",
796 | "\n",
797 | " except Exception as e:\n",
798 | " print(f\"An error occurred: {str(e)}\")\n",
799 | " return {\n",
800 | " \"response_text\": \"An error occurred while generating the response.\",\n",
801 | " \"sources\": [],\n",
802 | " \"source_tracking\": {\n",
803 | " \"num_sources_provided\": 0,\n",
804 | " \"source_ids\": [],\n",
805 | " \"verification_status\": {\n",
806 | " \"status\": \"error\",\n",
807 | " \"error\": str(e)\n",
808 | " }\n",
809 | " },\n",
810 | " \"strategy\": None\n",
811 | " }"
812 | ],
813 | "metadata": {
814 | "id": "YY5rnivk-bAh"
815 | },
816 | "execution_count": null,
817 | "outputs": []
818 | },
819 | {
820 | "cell_type": "markdown",
821 | "source": [
822 | "#ModelConfig Class"
823 | ],
824 | "metadata": {
825 | "id": "ljqi1Qg8j9F8"
826 | }
827 | },
828 | {
829 | "cell_type": "code",
830 | "source": [
831 | "class ModelConfig:\n",
832 | " \"\"\"Handles model configuration and management\"\"\"\n",
833 | " def __init__(self,\n",
834 | " models: List[Dict],\n",
835 | " temperature: float = 0.3):\n",
836 | " self.models = models\n",
837 | " self.temperature = temperature\n",
838 | " self.current_model = None\n",
839 | " self.current_model_name = None\n",
840 | "\n",
841 | "\n",
842 | " @contextmanager\n",
843 | " def load_model(self, model_config: Dict):\n",
844 | " \"\"\"Context manager for lazy loading and proper cleanup of models\"\"\"\n",
845 | " try:\n",
846 | " model_name = model_config[\"name\"]\n",
847 | " model_type = model_config[\"type\"]\n",
848 | "\n",
849 | " # Clear any existing model\n",
850 | " self.cleanup_current_model()\n",
851 | "\n",
852 | " if model_type == \"mistral_api\":\n",
853 | " mistral_api_key = userdata.get('MISTRAL_API_KEY')\n",
854 | " self.current_model = {\n",
855 | " 'llm': ChatMistralAI(\n",
856 | " model=model_name,\n",
857 | " temperature=self.temperature,\n",
858 | " api_key=mistral_api_key\n",
859 | " ),\n",
860 | " 'type': 'mistral_api'\n",
861 | " }\n",
862 | " else: # huggingface\n",
863 | " print(f\"Loading huggingface model: {model_name}\")\n",
864 | "\n",
865 | " # Empty CUDA cache before loading new model\n",
866 | " torch.cuda.empty_cache()\n",
867 | " gc.collect()\n",
868 | "\n",
869 | " tokenizer = AutoTokenizer.from_pretrained(\n",
870 | " pretrained_model_name_or_path=model_config[\"tokenizer\"],\n",
871 | " trust_remote_code=True,\n",
872 | " use_fast=True,\n",
873 | " padding_side=\"left\"\n",
874 | " )\n",
875 | "\n",
876 | " model = AutoModelForCausalLM.from_pretrained(\n",
877 | " pretrained_model_name_or_path=model_name,\n",
878 | " device_map=\"auto\",\n",
879 | " trust_remote_code=True,\n",
880 | " torch_dtype=torch.float16,\n",
881 | " use_cache=True,\n",
882 | " low_cpu_mem_usage=True,\n",
883 | " )\n",
884 | "\n",
885 | " pipe = pipeline(\n",
886 | " \"text-generation\",\n",
887 | " model=model,\n",
888 | " tokenizer=tokenizer,\n",
889 | " max_new_tokens=512,\n",
890 | " temperature=self.temperature,\n",
891 | " top_p=0.95,\n",
892 | " top_k=50,\n",
893 | " do_sample=True,\n",
894 | " device_map=\"auto\"\n",
895 | " )\n",
896 | "\n",
897 | " self.current_model = {\n",
898 | " 'llm': HuggingFacePipeline(pipeline=pipe),\n",
899 | " 'type': 'huggingface',\n",
900 | " 'model': model, # Keep reference for cleanup\n",
901 | " 'pipe': pipe # Keep reference for cleanup\n",
902 | " }\n",
903 | "\n",
904 | " self.current_model_name = model_name\n",
905 | " yield self.current_model\n",
906 | "\n",
907 | " finally:\n",
908 | " # Cleanup will happen in cleanup_current_model()\n",
909 | " pass\n",
910 | "\n",
911 | " def cleanup_current_model(self):\n",
912 | " \"\"\"Clean up the current model and free memory\"\"\"\n",
913 | " if self.current_model is not None:\n",
914 | " if self.current_model['type'] == 'huggingface':\n",
915 | " # Delete model components explicitly\n",
916 | " del self.current_model['llm']\n",
917 | " del self.current_model['model']\n",
918 | " del self.current_model['pipe']\n",
919 | "\n",
920 | " # Clear CUDA cache\n",
921 | " torch.cuda.empty_cache()\n",
922 | "\n",
923 | " # Run garbage collection\n",
924 | " gc.collect()\n",
925 | "\n",
926 | " self.current_model = None\n",
927 | " self.current_model_name = None"
928 | ],
929 | "metadata": {
930 | "id": "tCzG7OE0IiDT"
931 | },
932 | "execution_count": null,
933 | "outputs": []
934 | },
935 | {
936 | "cell_type": "markdown",
937 | "source": [
938 | "#ExperimentRunner Class"
939 | ],
940 | "metadata": {
941 | "id": "05gTul4pIW6S"
942 | }
943 | },
944 | {
945 | "cell_type": "code",
946 | "source": [
947 | "class ExperimentRunner:\n",
948 | " \"\"\"Handles experiment execution\"\"\"\n",
949 | " def __init__(self,\n",
950 | " model_config: ModelConfig,\n",
951 | " questions: List[str],\n",
952 | " chunk_strategies: List[str],\n",
953 | " semantic_enabled: bool = False,\n",
954 | " semantic_thresholds: List[int] = None,\n",
955 | " rag_pipeline: RAGPipeline = None):\n",
956 | " self.model_config = model_config\n",
957 | " self.questions = questions\n",
958 | " self.chunk_strategies = chunk_strategies\n",
959 | " self.semantic_enabled = semantic_enabled\n",
960 | " self.semantic_thresholds = semantic_thresholds if semantic_enabled else []\n",
961 | "\n",
962 | " # Use existing RAG pipeline or create new one\n",
963 | " global _GLOBAL_RAG_PIPELINE\n",
964 | " if rag_pipeline:\n",
965 | " self.rag_pipeline = rag_pipeline\n",
966 | " elif _GLOBAL_RAG_PIPELINE:\n",
967 | " self.rag_pipeline = _GLOBAL_RAG_PIPELINE\n",
968 | " else:\n",
969 | " print(\"Initializing new RAG pipeline\")\n",
970 | " _GLOBAL_RAG_PIPELINE = RAGPipeline()\n",
971 | " self.rag_pipeline = _GLOBAL_RAG_PIPELINE\n",
972 | "\n",
973 | " def run_experiments(self) -> Dict:\n",
974 | " results = {\n",
975 | " \"metadata\": {\n",
976 | " \"timestamp\": time.strftime(\"%Y%m%d-%H%M%S\"),\n",
977 | " \"models_tested\": [model[\"name\"] for model in self.model_config.models],\n",
978 | " \"semantic_enabled\": self.semantic_enabled,\n",
979 | " \"semantic_thresholds\": self.semantic_thresholds if self.semantic_enabled else [],\n",
980 | " \"chunk_strategies\": self.chunk_strategies,\n",
981 | " \"temperature\": self.model_config.temperature\n",
982 | " },\n",
983 | " \"results\": []\n",
984 | " }\n",
985 | "\n",
986 | " for model_config in self.model_config.models:\n",
987 | " model_name = model_config[\"name\"]\n",
988 | " print(f\"\\nTesting model: {model_name}\")\n",
989 | "\n",
990 | " with self.model_config.load_model(model_config) as model:\n",
991 | " for strategy in self.chunk_strategies:\n",
992 | " # Handle thresholds based on strategy type\n",
993 | " if strategy == \"semantic\" and self.semantic_enabled:\n",
994 | " thresholds_to_test = self.semantic_thresholds\n",
995 | " else:\n",
996 | " thresholds_to_test = [None]\n",
997 | "\n",
998 | " for threshold in thresholds_to_test:\n",
999 | " chunks_data = self.rag_pipeline.create_chunks(\n",
1000 | " documents,\n",
1001 | " threshold=threshold,\n",
1002 | " chunk_strategy=strategy\n",
1003 | " )\n",
1004 | "\n",
1005 | " chunk_stats = {\n",
1006 | " \"strategy\": strategy,\n",
1007 | " \"threshold\": threshold,\n",
1008 | " \"stats\": chunks_data[\"chunk_stats\"]\n",
1009 | " }\n",
1010 | "\n",
1011 | " for question in self.questions:\n",
1012 | " print(f\"Processing question: {question}\")\n",
1013 | "\n",
1014 | " context = self.rag_pipeline.run_cosine_search(\n",
1015 | " query=question,\n",
1016 | " threshold=threshold,\n",
1017 | " chunk_strategy=strategy\n",
1018 | " )\n",
1019 | "\n",
1020 | " answer = self.rag_pipeline.generate_response(\n",
1021 | " query=question,\n",
1022 | " context_rag=context,\n",
1023 | " model=model\n",
1024 | " )\n",
1025 | "\n",
1026 | " results[\"results\"].append({\n",
1027 | " \"model\": model_name,\n",
1028 | " \"threshold\": threshold if strategy == \"semantic\" else None,\n",
1029 | " \"chunk_strategy\": strategy,\n",
1030 | " \"question\": question,\n",
1031 | " \"response\": answer,\n",
1032 | " \"chunk_stats\": chunk_stats[\"stats\"]\n",
1033 | " })\n",
1034 | "\n",
1035 | " return results"
1036 | ],
1037 | "metadata": {
1038 | "id": "8hFyd9G1kC8M"
1039 | },
1040 | "execution_count": null,
1041 | "outputs": []
1042 | },
1043 | {
1044 | "cell_type": "markdown",
1045 | "source": [
1046 | "#Evaluator Class"
1047 | ],
1048 | "metadata": {
1049 | "id": "EpjD-Qz54mfu"
1050 | }
1051 | },
1052 | {
1053 | "cell_type": "code",
1054 | "source": [
1055 | "class ExperimentEvaluator:\n",
1056 | " \"\"\"Handles pure evaluation logic\"\"\"\n",
1057 | " def __init__(self, api_key: str):\n",
1058 | " self.client = openai.OpenAI(api_key=api_key)\n",
1059 | " self.encoder = tiktoken.encoding_for_model(\"gpt-4o\")\n",
1060 | "\n",
1061 | " def _get_baseline_answers(self, questions: List[str], source_docs: List) -> Dict[str, str]:\n",
1062 | " \"\"\"Get GPT-4o's own answers to the questions as baseline\"\"\"\n",
1063 | " print(\"\\n=== DEBUG: _get_baseline_answers ===\")\n",
1064 | " print(f\"Questions received: {questions}\")\n",
1065 | " print(f\"Number of document parts: {len(source_docs)}\")\n",
1066 | "\n",
1067 | " # Concatenate all document parts\n",
1068 | " full_document = \"\\n\\n\".join([doc.text for doc in source_docs])\n",
1069 | " print(f\"\\nFull document length: {len(full_document)} characters\")\n",
1070 | "\n",
1071 | " # Print sample from document\n",
1072 | " print(\"\\nSampling from document:\")\n",
1073 | " print(\"Start:\", full_document[:200], \"...\")\n",
1074 | " print(\"Middle:\", full_document[len(full_document)//2:len(full_document)//2 + 200], \"...\")\n",
1075 | " print(\"End:\", full_document[-200:], \"...\")\n",
1076 | "\n",
1077 | " baseline_prompt = f\"\"\"Source Document:\n",
1078 | " {full_document}\n",
1079 | "\n",
1080 | " Using ONLY the information from the source document above, answer these questions.\n",
1081 | " - If the exact information is found, provide it with specific numbers\n",
1082 | " - If information is not found, explicitly state that\n",
1083 | " - If there are metrics, make sure to include appropriate units\n",
1084 | "\n",
1085 | " Format your response as a valid JSON object with questions as keys and answers as values.\n",
1086 | " Keep answers concise and factual.\n",
1087 | "\n",
1088 | " Questions to answer:\n",
1089 | " {json.dumps(questions, indent=2)}\"\"\"\n",
1090 | "\n",
1091 | " try:\n",
1092 | " print(\"\\n--- Getting Baseline Answers ---\")\n",
1093 | " response = self.client.chat.completions.create(\n",
1094 | " model=\"gpt-4o\",\n",
1095 | " messages=[\n",
1096 | " {\"role\": \"system\", \"content\": \"You are a helpful assistant that provides JSON-formatted answers based on source documents.\"},\n",
1097 | " {\"role\": \"user\", \"content\": baseline_prompt}\n",
1098 | " ],\n",
1099 | " temperature=0.1\n",
1100 | " )\n",
1101 | "\n",
1102 | " content = response.choices[0].message.content\n",
1103 | " print(\"\\nRaw GPT-4 Response:\")\n",
1104 | " print(content)\n",
1105 | "\n",
1106 | " if '{' in content and '}' in content:\n",
1107 | " json_str = content[content.find('{'):content.rfind('}')+1]\n",
1108 | " baseline_answers = json.loads(json_str)\n",
1109 | " print(\"\\nParsed Baseline Answers:\")\n",
1110 | " print(baseline_answers)\n",
1111 | " return baseline_answers\n",
1112 | " print(\"\\nWarning: No JSON structure found in response\")\n",
1113 | " return {\"error\": \"No JSON structure found\", \"questions\": questions}\n",
1114 | "\n",
1115 | " except Exception as e:\n",
1116 | " print(f\"\\nError in _get_baseline_answers: {str(e)}\")\n",
1117 | " return {\"error\": str(e), \"questions\": questions}\n",
1118 | "\n",
1119 | " def evaluate_experiments(self, experiment_results: Dict, *, source_docs: List) -> Dict: # Updated signature\n",
1120 | " \"\"\"Core evaluation logic\"\"\"\n",
1121 | " try:\n",
1122 | " print(\"\\n=== DEBUG: evaluate_experiments ===\")\n",
1123 | " print(\"Getting questions...\")\n",
1124 | " questions = list(set(result[\"question\"] for result in experiment_results[\"results\"]))\n",
1125 | " print(f\"Questions extracted: {questions}\")\n",
1126 | "\n",
1127 | " print(\"\\nGetting baseline answers...\")\n",
1128 | " baseline_answers = self._get_baseline_answers(questions, source_docs) # Pass source_docs\n",
1129 | " print(f\"Baseline answers received: {baseline_answers}\")\n",
1130 | "\n",
1131 | " model_strategy_combinations = set(\n",
1132 | " (result[\"model\"],\n",
1133 | " result[\"chunk_strategy\"],\n",
1134 | " result[\"threshold\"] if result[\"chunk_strategy\"] == \"semantic\" else None)\n",
1135 | " for result in experiment_results[\"results\"]\n",
1136 | " )\n",
1137 | "\n",
1138 | " all_evaluations = []\n",
1139 | "\n",
1140 | " for model, strategy, threshold in model_strategy_combinations:\n",
1141 | " relevant_results = [r for r in experiment_results[\"results\"]\n",
1142 | " if r[\"model\"] == model and\n",
1143 | " r[\"chunk_strategy\"] == strategy and\n",
1144 | " (r[\"threshold\"] == threshold if strategy == \"semantic\" else True)]\n",
1145 | "\n",
1146 | " for result in relevant_results:\n",
1147 | " print(f\"\\nEvaluating response for: {result['question']}\")\n",
1148 | " baseline = baseline_answers.get(result[\"question\"], \"No baseline available\")\n",
1149 | " print(f\"Using baseline answer: {baseline}\")\n",
1150 | "\n",
1151 | " evaluation = self._evaluate_single_response(result, baseline)\n",
1152 | " all_evaluations.append(evaluation)\n",
1153 | "\n",
1154 | " return {\n",
1155 | " \"metadata\": {\n",
1156 | " \"timestamp\": datetime.now().isoformat(),\n",
1157 | " \"model_used\": \"gpt-4o\",\n",
1158 | " \"num_combinations_evaluated\": len(model_strategy_combinations),\n",
1159 | " \"num_questions_evaluated\": len(questions),\n",
1160 | " \"evaluation_status\": \"success\"\n",
1161 | " },\n",
1162 | " \"evaluations\": all_evaluations,\n",
1163 | " \"summary\": self._generate_summary(all_evaluations)\n",
1164 | " }\n",
1165 | "\n",
1166 | " except Exception as e:\n",
1167 | " print(f\"\\nCritical error in evaluate_experiments: {str(e)}\")\n",
1168 | " return self._create_default_evaluation(experiment_results)\n",
1169 | "\n",
1170 | " def _evaluate_single_response(self, result: Dict, baseline: str) -> Dict:\n",
1171 | " \"\"\"Evaluate a single response with clearer scoring criteria\"\"\"\n",
1172 | " evaluation_prompt = f\"\"\"Compare and evaluate this response. You must evaluate three separate aspects:\n",
1173 | "\n",
1174 | " 1. ACCURACY - Compare the model's answer against the baseline (ground truth)\n",
1175 | " 2. SOURCE ATTRIBUTION - Check if the model's answer matches its cited sources\n",
1176 | " 3. CONCISESNESS - Check if the model's answer is clear and direct\n",
1177 | "\n",
1178 | " Question: {result[\"question\"]}\n",
1179 | "\n",
1180 | " Baseline (Ground Truth): {baseline}\n",
1181 | "\n",
1182 | " Model Response: {result.get(\"response\", {}).get(\"response_text\", \"\")}\n",
1183 | " Sources Cited: {json.dumps(result.get(\"response\", {}).get(\"sources\", []), indent=2)}\n",
1184 | "\n",
1185 | " Scoring Criteria:\n",
1186 | "\n",
1187 | " 1. ACCURACY (0-100):\n",
1188 | " - Compare ONLY the model's answer against the baseline\n",
1189 | " - 100: Exact match with baseline (including numbers and units)\n",
1190 | " - 50: Partially correct but with some errors\n",
1191 | " - 0: Completely different from baseline or wrong\n",
1192 | "\n",
1193 | " 2. SOURCE ATTRIBUTION (0-100):\n",
1194 | " - Compare ONLY the model's answer against its cited sources\n",
1195 | " - 100: Answer exactly matches what appears in cited sources INCLUDING UNITS\n",
1196 | " - 50: Answer partially matches cited sources\n",
1197 | " - 0: Answer doesn't match cited sources or no sources cited\n",
1198 | "\n",
1199 | " Note: For large numbers, different formats are acceptable (e.g., $19,000 million = $19 billion)\n",
1200 | " BUT the units must match what appears in the source document for full attribution score.\n",
1201 | " The units in the source document are authoritative.\n",
1202 | "\n",
1203 | " 3. CONCISENESS (0-100):\n",
1204 | " - 100: Clear, direct answer without extra information\n",
1205 | " - 50: Contains some irrelevant information\n",
1206 | " - 0: Verbose or unclear\n",
1207 | "\n",
1208 | " Note: A response can have perfect source attribution (100) even if the answer is wrong,\n",
1209 | " as long as it accurately reflects what's in its cited sources.\n",
1210 | "\n",
1211 | " Provide your evaluation in this exact JSON format:\n",
1212 | " {{\n",
1213 | " \"model\": \"{result[\"model\"]}\",\n",
1214 | " \"chunk_strategy\": \"{result[\"chunk_strategy\"]}\",\n",
1215 | " \"threshold\": {result[\"threshold\"] if result[\"chunk_strategy\"] == \"semantic\" else \"null\"},\n",
1216 | " \"question\": \"{result[\"question\"]}\",\n",
1217 | " \"baseline_answer\": \"{baseline}\",\n",
1218 | " \"model_response\": {json.dumps(result.get(\"response\", {}), indent=2)},\n",
1219 | " \"chunk_stats\": {json.dumps(result.get(\"chunk_stats\", {}), indent=2)},\n",
1220 | " \"scores\": {{\n",
1221 | " \"accuracy\": ,\n",
1222 | " \"source_attribution\": ,\n",
1223 | " \"conciseness\": \n",
1224 | " }},\n",
1225 | " \"composite_score\": ,\n",
1226 | " \"detailed_analysis\": {{\n",
1227 | " \"accuracy_analysis\": \"Explain ONLY how the answer compares to baseline. Explicitly state if numbers match or differ.\",\n",
1228 | " \"attribution_analysis\": \"Explain ONLY how well the answer matches its cited sources, regardless of accuracy.\",\n",
1229 | " \"conciseness_analysis\": \"Explain how clear and direct the answer is\"\n",
1230 | " }}\n",
1231 | " }}\n",
1232 | "\n",
1233 | " Examples:\n",
1234 | "\n",
1235 | " Bad Response (Perfect Attribution, Wrong Answer):\n",
1236 | " - If baseline is \"$10,347M\" but model answers \"$19,921M [src_2]\" and src_2 contains \"$19,921M\"\n",
1237 | " - Accuracy: 0 (completely different from baseline)\n",
1238 | " - Attribution: 100 (perfectly matches its cited source)\n",
1239 | "\n",
1240 | " Good Response (Perfect Both):\n",
1241 | " - If baseline is \"$10,347M\" and model answers \"$10,347M [src_2]\" and src_2 contains \"$10,347M\"\n",
1242 | " - Accuracy: 100 (matches baseline)\n",
1243 | " - Attribution: 100 (matches source)\n",
1244 | " \"\"\"\n",
1245 | "\n",
1246 | " try:\n",
1247 | " response = self.client.chat.completions.create(\n",
1248 | " model=\"gpt-4o\",\n",
1249 | " messages=[\n",
1250 | " {\"role\": \"system\", \"content\": \"You are an expert at evaluating response accuracy against both baseline answers and source data.\"},\n",
1251 | " {\"role\": \"user\", \"content\": evaluation_prompt}\n",
1252 | " ],\n",
1253 | " temperature=0.7,\n",
1254 | " max_tokens=1000\n",
1255 | " )\n",
1256 | "\n",
1257 | " content = response.choices[0].message.content\n",
1258 | " if '{' in content and '}' in content:\n",
1259 | " json_str = content[content.find('{'):content.rfind('}')+1]\n",
1260 | " return json.loads(json_str)\n",
1261 | " return self._create_default_single_evaluation(result, baseline)\n",
1262 | "\n",
1263 | " except Exception as e:\n",
1264 | " print(f\"Error evaluating response: {str(e)}\")\n",
1265 | " return self._create_default_single_evaluation(result, baseline)\n",
1266 | "\n",
1267 | " def _create_default_single_evaluation(self, result: Dict, baseline: str) -> Dict:\n",
1268 | " \"\"\"Create a default evaluation for a single response when evaluation fails\"\"\"\n",
1269 | " return {\n",
1270 | " \"model\": result[\"model\"],\n",
1271 | " \"chunk_strategy\": result[\"chunk_strategy\"],\n",
1272 | " \"threshold\": result[\"threshold\"] if result[\"chunk_strategy\"] == \"semantic\" else None,\n",
1273 | " \"question\": result[\"question\"],\n",
1274 | " \"baseline_answer\": baseline,\n",
1275 | " \"model_response\": result.get(\"response\", {}),\n",
1276 | " \"scores\": {\n",
1277 | " \"source_accuracy\": 0,\n",
1278 | " \"source_attribution\": 0,\n",
1279 | " \"conciseness\": 0\n",
1280 | " },\n",
1281 | " \"composite_score\": 0,\n",
1282 | " \"detailed_analysis\": {\n",
1283 | " \"accuracy_analysis\": \"Evaluation failed\",\n",
1284 | " \"attribution_analysis\": \"Evaluation failed\",\n",
1285 | " \"conciseness_analysis\": \"Evaluation failed\"\n",
1286 | " }\n",
1287 | " }\n",
1288 | "\n",
1289 | " def _generate_summary(self, evaluations: List[Dict]) -> Dict:\n",
1290 | " \"\"\"Generate summary statistics from evaluations with ordered results\"\"\"\n",
1291 | " if not evaluations:\n",
1292 | " return {\n",
1293 | " \"overall_performance\": \"No evaluations available\",\n",
1294 | " \"optimal_permutation\": \"Not available\",\n",
1295 | " \"performance_analysis\": \"Evaluation process failed\",\n",
1296 | " \"chunking_statistics\": {}\n",
1297 | " }\n",
1298 | "\n",
1299 | " # Create ordered list of expected configurations\n",
1300 | " ordered_configs = []\n",
1301 | " if CHUNKING_CONFIGS[\"semantic_config\"][\"enabled\"]:\n",
1302 | " for threshold in CHUNKING_CONFIGS[\"semantic_config\"][\"thresholds\"]:\n",
1303 | " ordered_configs.append((\"semantic\", threshold))\n",
1304 | "\n",
1305 | " for strategy in [s for s in CHUNKING_CONFIGS[\"strategies\"] if s != \"semantic\"]:\n",
1306 | " ordered_configs.append((strategy, None))\n",
1307 | "\n",
1308 | " # Get unique models from evaluations\n",
1309 | " unique_models = sorted(set(eval[\"model\"] for eval in evaluations))\n",
1310 | "\n",
1311 | " # Track chunk statistics and performance scores\n",
1312 | " chunking_statistics = {}\n",
1313 | " performance_scores = {}\n",
1314 | " ordered_analysis = {}\n",
1315 | "\n",
1316 | " # Get document name from the documents list\n",
1317 | " document_name = os.path.basename(documents[0].metadata.get('file_path', 'Unknown Document'))\n",
1318 | "\n",
1319 | " # Initialize tracking for all model-strategy combinations\n",
1320 | " for model in unique_models:\n",
1321 | " for strategy, threshold in ordered_configs:\n",
1322 | " key = (model, strategy, threshold)\n",
1323 | " performance_scores[key] = {\n",
1324 | " \"count\": 0,\n",
1325 | " \"total_composite\": 0\n",
1326 | " }\n",
1327 | "\n",
1328 | " # First pass: calculate scores and collect statistics\n",
1329 | " best_score = 0\n",
1330 | " best_config = None\n",
1331 | "\n",
1332 | " for eval in evaluations:\n",
1333 | " model = eval[\"model\"]\n",
1334 | " strategy = eval[\"chunk_strategy\"]\n",
1335 | " threshold = eval[\"threshold\"] if strategy == \"semantic\" else None\n",
1336 | " key = (model, strategy, threshold)\n",
1337 | "\n",
1338 | " # Track performance scores\n",
1339 | " if key in performance_scores:\n",
1340 | " performance_scores[key][\"count\"] += 1\n",
1341 | " performance_scores[key][\"total_composite\"] += eval[\"composite_score\"]\n",
1342 | "\n",
1343 | " # Track chunk statistics (only need one entry per strategy/threshold combination)\n",
1344 | " chunk_key = (strategy, threshold)\n",
1345 | " if chunk_key not in chunking_statistics:\n",
1346 | " chunk_stats = eval.get(\"chunk_stats\", {})\n",
1347 | " if chunk_stats:\n",
1348 | " if threshold is not None:\n",
1349 | " config_str = f\"{document_name} with {strategy} chunking (threshold: {threshold})\"\n",
1350 | " else:\n",
1351 | " config_str = f\"{document_name} with {strategy} chunking\"\n",
1352 | "\n",
1353 | " chunking_statistics[chunk_key] = {\n",
1354 | " \"config_str\": config_str,\n",
1355 | " \"stats\": {\n",
1356 | " \"number_of_chunks\": chunk_stats.get(\"num_chunks\", \"N/A\"),\n",
1357 | " \"average_chunk_size\": round(chunk_stats.get(\"avg_chunk_size\", 0), 2),\n",
1358 | " \"min_chunk_size\": chunk_stats.get(\"min_chunk_size\", \"N/A\"),\n",
1359 | " \"max_chunk_size\": chunk_stats.get(\"max_chunk_size\", \"N/A\")\n",
1360 | " }\n",
1361 | " }\n",
1362 | "\n",
1363 | " # Second pass: create ordered performance analysis and chunk statistics\n",
1364 | " ordered_chunking_stats = {}\n",
1365 | " for strategy, threshold in ordered_configs:\n",
1366 | " # Add chunk statistics\n",
1367 | " chunk_key = (strategy, threshold)\n",
1368 | " if chunk_key in chunking_statistics:\n",
1369 | " config_str = chunking_statistics[chunk_key][\"config_str\"]\n",
1370 | " ordered_chunking_stats[config_str] = chunking_statistics[chunk_key][\"stats\"]\n",
1371 | "\n",
1372 | " # Add performance analysis for each model\n",
1373 | " for model in unique_models:\n",
1374 | " key = (model, strategy, threshold)\n",
1375 | " scores = performance_scores[key]\n",
1376 | "\n",
1377 | " if scores[\"count\"] > 0:\n",
1378 | " avg_composite = round(scores[\"total_composite\"] / scores[\"count\"], 2)\n",
1379 | "\n",
1380 | " if threshold is not None:\n",
1381 | " perf_key = f\"{model} with {strategy} chunking (threshold: {threshold})\"\n",
1382 | " else:\n",
1383 | " perf_key = f\"{model} with {strategy} chunking\"\n",
1384 | "\n",
1385 | " ordered_analysis[perf_key] = avg_composite\n",
1386 | "\n",
1387 | " if avg_composite > best_score:\n",
1388 | " best_score = avg_composite\n",
1389 | " best_config = perf_key\n",
1390 | "\n",
1391 | " # Calculate overall average score\n",
1392 | " total_score = sum(eval[\"composite_score\"] for eval in evaluations)\n",
1393 | " avg_score = round(total_score / len(evaluations), 2) if evaluations else 0\n",
1394 | "\n",
1395 | " return {\n",
1396 | " \"overall_performance\": f\"Average composite score across all evaluations: {avg_score:.2f}/100\",\n",
1397 | " \"optimal_permutation\": f\"Best performance: {best_config} (score: {best_score:.2f}/100)\",\n",
1398 | " \"performance_analysis\": ordered_analysis,\n",
1399 | " \"chunking_statistics\": ordered_chunking_stats\n",
1400 | " }\n",
1401 | "\n",
1402 | "\n",
1403 | " def _create_default_evaluation(self, experiment_results: Dict) -> Dict:\n",
1404 | " \"\"\"Create a default evaluation result when the evaluation process fails\"\"\"\n",
1405 | " return {\n",
1406 | " \"metadata\": {\n",
1407 | " \"timestamp\": datetime.now().isoformat(),\n",
1408 | " \"model_used\": \"gpt-4o\",\n",
1409 | " \"num_combinations_evaluated\": 0,\n",
1410 | " \"num_questions_evaluated\": 0,\n",
1411 | " \"evaluation_status\": \"failed\"\n",
1412 | " },\n",
1413 | " \"evaluations\": [\n",
1414 | " self._create_default_single_evaluation(result, \"Evaluation failed\")\n",
1415 | " for result in experiment_results[\"results\"]\n",
1416 | " ],\n",
1417 | " \"summary\": {\n",
1418 | " \"overall_performance\": \"Evaluation failed\",\n",
1419 | " \"optimal_permutation\": \"Not available\",\n",
1420 | " \"performance_analysis\": \"Evaluation process failed\",\n",
1421 | " \"chunking_statistics\": {}\n",
1422 | " }\n",
1423 | " }"
1424 | ],
1425 | "metadata": {
1426 | "id": "1rAK93yw4qCx"
1427 | },
1428 | "execution_count": null,
1429 | "outputs": []
1430 | },
1431 | {
1432 | "cell_type": "markdown",
1433 | "source": [
1434 | "#Results Manager Class"
1435 | ],
1436 | "metadata": {
1437 | "id": "lJmy7VbumFk4"
1438 | }
1439 | },
1440 | {
1441 | "cell_type": "code",
1442 | "source": [
1443 | "class ResultsManager:\n",
1444 | " \"\"\"Handles formatting, saving, and displaying evaluation results\"\"\"\n",
1445 | " def __init__(self, save_directory: str):\n",
1446 | " self.save_directory = save_directory\n",
1447 | " os.makedirs(save_directory, exist_ok=True)\n",
1448 | "\n",
1449 | " def format_results(self, experiment_results: Dict, evaluation_results: Dict) -> Tuple[Dict, Dict]:\n",
1450 | " \"\"\"Format experiment and evaluation results into structured output\"\"\"\n",
1451 | " print(\"\\n=== Starting Results Formatting ===\")\n",
1452 | "\n",
1453 | " # Format experiment results\n",
1454 | " formatted_experiment = {\n",
1455 | " \"metadata\": experiment_results.get(\"metadata\", {}),\n",
1456 | " \"results\": [{\n",
1457 | " \"model\": result[\"model\"],\n",
1458 | " \"chunk_strategy\": result[\"chunk_strategy\"],\n",
1459 | " \"threshold\": result[\"threshold\"],\n",
1460 | " \"question\": result[\"question\"],\n",
1461 | " \"response\": {\n",
1462 | " \"answer\": result[\"response\"].get(\"response_text\", \"\"),\n",
1463 | " \"sources\": result[\"response\"].get(\"sources\", [])\n",
1464 | " }\n",
1465 | " } for result in experiment_results[\"results\"]]\n",
1466 | " }\n",
1467 | "\n",
1468 | " # Format evaluation results with baseline answer\n",
1469 | " formatted_evaluation = {\n",
1470 | " \"metadata\": evaluation_results[\"metadata\"],\n",
1471 | " \"evaluations\": [{\n",
1472 | " \"model\": eval.get(\"model\"),\n",
1473 | " \"chunk_strategy\": eval.get(\"chunk_strategy\"),\n",
1474 | " \"threshold\": eval.get(\"threshold\"),\n",
1475 | " \"question\": eval.get(\"question\"),\n",
1476 | " \"baseline_answer\": eval.get(\"baseline_answer\", \"No baseline available\"), # Include baseline answer\n",
1477 | " \"model_response\": eval.get(\"model_response\", {}),\n",
1478 | " \"scores\": eval.get(\"scores\", {}),\n",
1479 | " \"composite_score\": eval.get(\"composite_score\"),\n",
1480 | " \"detailed_analysis\": eval.get(\"detailed_analysis\", {})\n",
1481 | " } for eval in evaluation_results.get(\"evaluations\", [])],\n",
1482 | " \"overall_summary\": evaluation_results.get(\"summary\", {})\n",
1483 | " }\n",
1484 | "\n",
1485 | " return formatted_experiment, formatted_evaluation\n",
1486 | "\n",
1487 | " def save_results(self, formatted_experiment: Dict, formatted_evaluation: Dict) -> Tuple[str, str]:\n",
1488 | " \"\"\"Save formatted results to JSON files\"\"\"\n",
1489 | " timestamp = time.strftime(\"%Y%m%d-%H%M%S\")\n",
1490 | "\n",
1491 | " experiment_file = f\"{self.save_directory}/experiment_results_{timestamp}.json\"\n",
1492 | " evaluation_file = f\"{self.save_directory}/evaluation_results_{timestamp}.json\"\n",
1493 | "\n",
1494 | " for filepath, data in [\n",
1495 | " (experiment_file, formatted_experiment),\n",
1496 | " (evaluation_file, formatted_evaluation)\n",
1497 | " ]:\n",
1498 | " with open(filepath, 'w', encoding='utf-8') as f:\n",
1499 | " json.dump(data, f, indent=2, ensure_ascii=False)\n",
1500 | "\n",
1501 | " return experiment_file, evaluation_file\n",
1502 | "\n",
1503 | " def display_results(self, evaluation_results: Dict):\n",
1504 | " \"\"\"Display evaluation results in a clear, formatted manner\"\"\"\n",
1505 | " print(\"\\n\" + \"=\"*80)\n",
1506 | " print(\"EVALUATION RESULTS\")\n",
1507 | " print(\"=\"*80)\n",
1508 | "\n",
1509 | " # Display metadata\n",
1510 | " metadata = evaluation_results.get(\"metadata\", {})\n",
1511 | " print(\"\\nMETADATA:\")\n",
1512 | " print(\"-\"*80)\n",
1513 | " print(f\"Timestamp: {metadata.get('timestamp', 'Not available')}\")\n",
1514 | " print(f\"Model Used: {metadata.get('model_used', 'Not available')}\")\n",
1515 | " print(f\"Combinations: {metadata.get('num_combinations_evaluated', 'Not available')}\")\n",
1516 | " print(f\"Questions: {metadata.get('num_questions_evaluated', 'Not available')}\")\n",
1517 | " print(f\"Evaluation Status: {metadata.get('evaluation_status', 'Not available')}\")\n",
1518 | "\n",
1519 | " # Display evaluations\n",
1520 | " evaluations = evaluation_results.get(\"evaluations\", [])\n",
1521 | " if evaluations:\n",
1522 | " print(\"\\nDETAILED EVALUATIONS:\")\n",
1523 | " print(\"-\"*80)\n",
1524 | " for eval in evaluations:\n",
1525 | " print(f\"\\nQuestion: {eval.get('question', 'No question provided')}\")\n",
1526 | " print(f\"Model: {eval.get('model', 'No model specified')}\")\n",
1527 | " print(f\"Strategy: {eval.get('chunk_strategy', 'No strategy specified')}\")\n",
1528 | " if eval.get('threshold'):\n",
1529 | " print(f\"Threshold: {eval.get('threshold')}\")\n",
1530 | "\n",
1531 | " # Display baseline answer\n",
1532 | " print(\"\\nBaseline Answer:\")\n",
1533 | " baseline = eval.get('baseline_answer', 'No baseline answer available')\n",
1534 | " print(textwrap.fill(str(baseline), width=80))\n",
1535 | "\n",
1536 | " print(\"\\nModel Response:\")\n",
1537 | " response = eval.get('model_response', {})\n",
1538 | " response_text = response.get('response_text', 'No response available')\n",
1539 | " if response_text:\n",
1540 | " print(textwrap.fill(str(response_text), width=80))\n",
1541 | " else:\n",
1542 | " print(\"No response available\")\n",
1543 | "\n",
1544 | " print(\"\\nSource Data:\")\n",
1545 | " sources = response.get('sources', [])\n",
1546 | " if sources:\n",
1547 | " for source in sources:\n",
1548 | " if source: # Check if source is not empty\n",
1549 | " print(textwrap.fill(str(source), width=80))\n",
1550 | " else:\n",
1551 | " print(\"No source data available\")\n",
1552 | "\n",
1553 | " print(\"\\nScores:\")\n",
1554 | " scores = eval.get('scores', {})\n",
1555 | " for metric, score in scores.items():\n",
1556 | " print(f\"- {metric.replace('_', ' ').capitalize()}: {score}/100\")\n",
1557 | " print(f\"Composite Score: {eval.get('composite_score', 0)}/100\")\n",
1558 | "\n",
1559 | " print(\"\\nDetailed Analysis:\")\n",
1560 | " analysis = eval.get('detailed_analysis', {})\n",
1561 | " for aspect, details in analysis.items():\n",
1562 | " if details: # Check if details is not empty\n",
1563 | " print(f\"\\n{aspect.replace('_', ' ').capitalize()}:\")\n",
1564 | " print(textwrap.fill(str(details), width=80))\n",
1565 | "\n",
1566 | " # Display summary\n",
1567 | " summary = evaluation_results.get(\"overall_summary\", {})\n",
1568 | " if summary:\n",
1569 | " print(\"\\nOVERALL SUMMARY:\")\n",
1570 | " print(\"-\"*80)\n",
1571 | "\n",
1572 | " if \"overall_performance\" in summary:\n",
1573 | " print(\"\\nOverall Performance:\")\n",
1574 | " print(textwrap.fill(str(summary[\"overall_performance\"]), width=80))\n",
1575 | "\n",
1576 | " if \"optimal_permutation\" in summary:\n",
1577 | " print(\"\\nOptimal Configuration:\")\n",
1578 | " print(textwrap.fill(str(summary[\"optimal_permutation\"]), width=80))\n",
1579 | "\n",
1580 | " if \"chunking_statistics\" in summary:\n",
1581 | " print(\"\\nChunking Statistics:\")\n",
1582 | " chunk_stats = summary[\"chunking_statistics\"]\n",
1583 | " for config, stats in chunk_stats.items():\n",
1584 | " print(f\"\\n{config}:\")\n",
1585 | " print(f\" Number of Chunks: {stats['number_of_chunks']}\")\n",
1586 | " print(f\" Average Chunk Size: {stats['average_chunk_size']}\")\n",
1587 | " print(f\" Min Chunk Size: {stats['min_chunk_size']}\")\n",
1588 | " print(f\" Max Chunk Size: {stats['max_chunk_size']}\")\n",
1589 | "\n",
1590 | " if \"performance_analysis\" in summary:\n",
1591 | " print(\"\\nPerformance Analysis:\")\n",
1592 | " analysis = summary[\"performance_analysis\"]\n",
1593 | " if isinstance(analysis, dict):\n",
1594 | " for config, score in analysis.items():\n",
1595 | " print(f\"{config}: {score:.2f}\")\n",
1596 | " else:\n",
1597 | " print(textwrap.fill(str(analysis), width=80))"
1598 | ],
1599 | "metadata": {
1600 | "id": "vnMb5d8cmKQU"
1601 | },
1602 | "execution_count": null,
1603 | "outputs": []
1604 | },
1605 | {
1606 | "cell_type": "markdown",
1607 | "source": [
1608 | "#Main"
1609 | ],
1610 | "metadata": {
1611 | "id": "koQ5ZObJC2ek"
1612 | }
1613 | },
1614 | {
1615 | "cell_type": "code",
1616 | "execution_count": null,
1617 | "metadata": {
1618 | "id": "6qdI5iaXYsun",
1619 | "collapsed": true
1620 | },
1621 | "outputs": [],
1622 | "source": [
1623 | "def main():\n",
1624 | " # Initialize configurations with semantic settings from config\n",
1625 | " semantic_enabled = CHUNKING_CONFIGS[\"semantic_config\"][\"enabled\"]\n",
1626 | " semantic_thresholds = CHUNKING_CONFIGS[\"semantic_config\"][\"thresholds\"]\n",
1627 | "\n",
1628 | " # Update strategies list if semantic is enabled\n",
1629 | " strategies = CHUNKING_CONFIGS[\"strategies\"]\n",
1630 | " if semantic_enabled:\n",
1631 | " strategies = [\"semantic\"] + strategies\n",
1632 | "\n",
1633 | " model_config = ModelConfig(\n",
1634 | " models=MODEL_CONFIGS[\"models\"],\n",
1635 | " temperature=0.3\n",
1636 | " )\n",
1637 | "\n",
1638 | " # Initialize experiment runner with flexible configuration\n",
1639 | " experiment_runner = ExperimentRunner(\n",
1640 | " model_config=model_config,\n",
1641 | " questions=QUESTION_CONFIGS[\"questions\"],\n",
1642 | " chunk_strategies=strategies,\n",
1643 | " semantic_enabled=semantic_enabled,\n",
1644 | " semantic_thresholds=semantic_thresholds\n",
1645 | " )\n",
1646 | "\n",
1647 | " print(\"Starting experiment with configurations:\")\n",
1648 | " print(f\"Models: {[model['name'] for model in model_config.models]}\")\n",
1649 | " if semantic_enabled:\n",
1650 | " print(f\"Semantic thresholds: {semantic_thresholds}\")\n",
1651 | " print(f\"Chunk strategies: {strategies}\")\n",
1652 | " print(f\"Number of questions: {len(QUESTION_CONFIGS['questions'])}\")\n",
1653 | "\n",
1654 | " # Rest of the main function remains the same\n",
1655 | " experiment_results = experiment_runner.run_experiments()\n",
1656 | "\n",
1657 | " print(\"\\nInitializing GPT-4o evaluation...\")\n",
1658 | " evaluator = ExperimentEvaluator(api_key=userdata.get('OPENAI_API_KEY'))\n",
1659 | "\n",
1660 | " evaluation_results = evaluator.evaluate_experiments(\n",
1661 | " experiment_results=experiment_results,\n",
1662 | " source_docs=documents\n",
1663 | " )\n",
1664 | "\n",
1665 | " results_manager = ResultsManager(save_directory=FILE_CONFIGS['save_directory'])\n",
1666 | "\n",
1667 | " formatted_experiment, formatted_evaluation = results_manager.format_results(\n",
1668 | " experiment_results=experiment_results,\n",
1669 | " evaluation_results=evaluation_results\n",
1670 | " )\n",
1671 | "\n",
1672 | " experiment_file, evaluation_file = results_manager.save_results(\n",
1673 | " formatted_experiment=formatted_experiment,\n",
1674 | " formatted_evaluation=formatted_evaluation\n",
1675 | " )\n",
1676 | "\n",
1677 | " results_manager.display_results(evaluation_results=formatted_evaluation)\n",
1678 | "\n",
1679 | " print(\"\\nExperiment complete!\")\n",
1680 | " print(f\"Results saved to:\")\n",
1681 | " print(f\" Experiment results: {experiment_file}\")\n",
1682 | " print(f\" Evaluation results: {evaluation_file}\")\n",
1683 | "\n",
1684 | " torch.cuda.empty_cache()\n",
1685 | " gc.collect()\n",
1686 | "\n",
1687 | " return formatted_experiment, formatted_evaluation\n",
1688 | "\n",
1689 | "\n",
1690 | "if __name__ == \"__main__\":\n",
1691 | " results, evaluation = main()"
1692 | ]
1693 | }
1694 | ]
1695 | }
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # rag-experimentation-framework
2 | Testing LLMs and RAG configurations at scale using an OpenAI Evaluator / Reflector
3 |
4 | [Motivation for this Project & Results Overview](https://docs.google.com/presentation/d/13QGNKmmOQhmpwAxXuc4k98_ITSvrORRJw6J7J74dx7M/edit#slide=id.g318d9a5244c_0_0)
5 |
6 | [A Systematic Framework for RAG Optimization: Data-Driven Design Through Controlled Experimentation](https://medium.com/@bill.leece/a-systematic-framework-for-rag-optimization-data-driven-design-through-controlled-experimentation-5e7d99643816)
7 |
--------------------------------------------------------------------------------