├── README.md
├── LICENSE
├── 03_llama_index_multi_doc_agent.ipynb
├── 04_langchain_per_user_retrieval.ipynb
└── 01_semi_structured_data.ipynb


/README.md:
--------------------------------------------------------------------------------
 1 | # Advanced RAG
 2 | 
 3 | **What is RAG?**
 4 | 
 5 | Retrieval-augmented generation (RAG) is a natural language processing (NLP) technique that combines the capabilities of retrieval-based and generative AI models.
 6 | 
 7 | **What is Naive RAG?**
 8 | 
 9 | Naive RAG usually refers to splitting documents into chunks, embedding them, and retrieving the chunks that are most semantically similar to the user's question.
10 | 
11 | It is simple, but its overall performance is often poor.
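
As a rough illustration of the naive pipeline described above, here is a minimal sketch in plain Python. It is not taken from the notebooks in this repository. The helper names (`split_into_chunks`, `embed`, `cosine_similarity`, `retrieve`) are hypothetical, and the bag-of-words "embedding" is only a stand-in for a real embedding model, chosen so that the example runs without any external dependency.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str) -> list[str]:
    """Split a document into sentence-sized chunks (assumed chunking strategy)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts, standing in for a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks ranked by similarity to the question."""
    question_vector = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(question_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


document = (
    "La Liga is the top football league in Spain. "
    "The Premier League is the top football league in England. "
    "Serie A is the top football league in Italy."
)
chunks = split_into_chunks(document)
print(retrieve("Which league is played in England?", chunks))
```

In a real application the retrieved chunks would then be passed to a language model as context. The sketch stops at retrieval because that is the step where the naive approach tends to fall short.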
12 | 
13 | **That's why we need Advanced RAG.**
14 | 
15 | In this tutorial series (Advanced RAG), we will learn techniques and best practices for RAG application development that can improve the quality of a RAG application.
16 | 
17 | These techniques are crucial to the success of a RAG application.
18 | 
19 | ## Episodes
20 | 
21 | 1. [RAG on Semi-structured data](./01_semi_structured_data.ipynb)
22 | 2. [Multi-Modal RAG](./02_multi_modal.ipynb)
23 | 3. [Multi-Document RAG with LlamaIndex](./03_llama_index_multi_doc_agent.ipynb)
24 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2023 sugarforever
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/03_llama_index_multi_doc_agent.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "provenance": [],
  7 |       "authorship_tag": "ABX9TyPUZj1zzKjIILyTHPOtMUgD",
  8 |       "include_colab_link": true
  9 |     },
 10 |     "kernelspec": {
 11 |       "name": "python3",
 12 |       "display_name": "Python 3"
 13 |     },
 14 |     "language_info": {
 15 |       "name": "python"
 16 |     }
 17 |   },
 18 |   "cells": [
 19 |     {
 20 |       "cell_type": "markdown",
 21 |       "metadata": {
 22 |         "id": "view-in-github",
 23 |         "colab_type": "text"
 24 |       },
 25 |       "source": [
 26 |         "<a href=\"https://colab.research.google.com/github/sugarforever/Advanced-RAG/blob/main/03_llama_index_multi_doc_agent.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 27 |       ]
 28 |     },
 29 |     {
 30 |       "cell_type": "markdown",
 31 |       "source": [
 32 |         "# Llama-Index Multi-Document Agent\n",
 33 |         "\n",
 34 |         "It's straightforward to build a RAG application on a single document. But when it comes to multiple documents, it's a different story.\n",
 35 |         "\n",
 36 |         "In this tutorial, we will learn how to use LlamaIndex to build a RAG application that can answer user questions that require querying across multiple documents.\n",
 37 |         "\n",
 38 |         "The core LlamaIndex components required in this tutorial are as follows:\n",
 39 |         "\n",
 40 |         "- VectorStoreIndex\n",
 41 |         "- SummaryIndex\n",
 42 |         "- ObjectIndex\n",
 43 |         "- QueryEngineTool\n",
 44 |         "- OpenAIAgent\n",
 45 |         "- FnRetrieverOpenAIAgent\n"
 46 |       ],
 47 |       "metadata": {
 48 |         "id": "wXoO6jXcAH3x"
 49 |       }
 50 |     },
 51 |     {
 52 |       "cell_type": "code",
 53 |       "execution_count": 1,
 54 |       "metadata": {
 55 |         "colab": {
 56 |           "base_uri": "https://localhost:8080/"
 57 |         },
 58 |         "id": "zYRlFZOL_sJx",
 59 |         "outputId": "4d21b91a-9793-4ff6-e3b2-859edb71ea38"
 60 |       },
 61 |       "outputs": [
 62 |         {
 63 |           "output_type": "stream",
 64 |           "name": "stdout",
 65 |           "text": [
 74 |             "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
 75 |             "llmx 0.0.15a0 requires cohere, which is not installed.\n"
 77 |           ]
 78 |         }
 79 |       ],
 80 |       "source": [
 81 |         "! pip install llama-index -q -U"
 82 |       ]
 83 |     },
 84 |     {
 85 |       "cell_type": "code",
 86 |       "source": [
 87 |         "import os\n",
 88 |         "import openai\n",
 89 |         "\n",
 90 |         "os.environ[\"OPENAI_API_KEY\"] = \"your valid openai api key\"\n",
 91 |         "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"
 92 |       ],
 93 |       "metadata": {
 94 |         "id": "8Bw2iRQjEk8z"
 95 |       },
 96 |       "execution_count": 2,
 97 |       "outputs": []
 98 |     },
 99 |     {
100 |       "cell_type": "markdown",
101 |       "source": [
102 |         "## Example Code"
103 |       ],
104 |       "metadata": {
105 |         "id": "MumturibbfBD"
106 |       }
107 |     },
108 |     {
109 |       "cell_type": "code",
110 |       "source": [
111 |         "from llama_index import (\n",
112 |         "    VectorStoreIndex,\n",
113 |         "    SummaryIndex,\n",
114 |         "    SimpleDirectoryReader,\n",
115 |         "    ServiceContext,\n",
116 |         ")\n",
117 |         "from llama_index.tools import QueryEngineTool, ToolMetadata\n",
118 |         "from llama_index.llms import OpenAI"
119 |       ],
120 |       "metadata": {
121 |         "id": "2586El8ibhjc"
122 |       },
123 |       "execution_count": 3,
124 |       "outputs": []
125 |     },
126 |     {
127 |       "cell_type": "code",
128 |       "source": [
129 |         "wiki_titles = [\n",
130 |         "    \"Serie A\",\n",
131 |         "    \"Premier League\",\n",
132 |         "    \"Bundesliga\",\n",
133 |         "    \"La Liga\",\n",
134 |         "    \"Ligue 1\"\n",
135 |         "]"
136 |       ],
137 |       "metadata": {
138 |         "id": "W9BtxyvNnEl1"
139 |       },
140 |       "execution_count": 4,
141 |       "outputs": []
142 |     },
143 |     {
144 |       "cell_type": "code",
145 |       "source": [
146 |         "from pathlib import Path\n",
147 |         "\n",
148 |         "import requests\n",
149 |         "\n",
150 |         "for title in wiki_titles:\n",
151 |         "    response = requests.get(\n",
152 |         "        \"https://en.wikipedia.org/w/api.php\",\n",
153 |         "        params={\n",
154 |         "            \"action\": \"query\",\n",
155 |         "            \"format\": \"json\",\n",
156 |         "            \"titles\": title,\n",
157 |         "            \"prop\": \"extracts\",\n",
158 |         "            \"explaintext\": True,\n",
159 |         "        },\n",
160 |         "    ).json()\n",
161 |         "    page = next(iter(response[\"query\"][\"pages\"].values()))\n",
162 |         "    wiki_text = page[\"extract\"]\n",
163 |         "\n",
164 |         "    # Create the data directory if it does not already exist\n",
165 |         "    data_path = Path(\"data\")\n",
166 |         "    data_path.mkdir(exist_ok=True)\n",
167 |         "\n",
168 |         "    with open(data_path / f\"{title}.txt\", \"w\", encoding=\"utf-8\") as fp:\n",
169 |         "        fp.write(wiki_text)"
170 |       ],
171 |       "metadata": {
172 |         "id": "HVmr7ayEnG-6"
173 |       },
174 |       "execution_count": 5,
175 |       "outputs": []
176 |     },
177 |     {
178 |       "cell_type": "code",
179 |       "source": [
180 |         "leagues_docs = {}\n",
181 |         "for wiki_title in wiki_titles:\n",
182 |         "    leagues_docs[wiki_title] = SimpleDirectoryReader(\n",
183 |         "        input_files=[f\"data/{wiki_title}.txt\"]\n",
184 |         "    ).load_data()"
185 |       ],
186 |       "metadata": {
187 |         "id": "_pC4oWUhnKVB"
188 |       },
189 |       "execution_count": 6,
190 |       "outputs": []
191 |     },
192 |     {
193 |       "cell_type": "code",
194 |       "source": [
195 |         "llm = OpenAI(temperature=0, model=\"gpt-3.5-turbo\")\n",
196 |         "service_context = ServiceContext.from_defaults(llm=llm)"
197 |       ],
198 |       "metadata": {
199 |         "colab": {
200 |           "base_uri": "https://localhost:8080/"
201 |         },
202 |         "id": "aZZYj1NWnL9m",
203 |         "outputId": "61a67bb1-2e77-4af3-ce62-9ff3d6201a7b"
204 |       },
205 |       "execution_count": 7,
206 |       "outputs": [
207 |         {
208 |           "output_type": "stream",
209 |           "name": "stderr",
210 |           "text": [
211 |             "[nltk_data] Downloading package punkt to /tmp/llama_index...\n",
212 |             "[nltk_data]   Unzipping tokenizers/punkt.zip.\n"
213 |           ]
214 |         }
215 |       ]
216 |     },
217 |     {
218 |       "cell_type": "code",
219 |       "source": [
220 |         "from llama_index.agent import OpenAIAgent\n",
221 |         "from llama_index import load_index_from_storage, StorageContext\n",
222 |         "from llama_index.node_parser import SentenceSplitter\n",
223 |         "import os\n",
224 |         "\n",
225 |         "node_parser = SentenceSplitter()\n",
226 |         "\n",
227 |         "# Build agents dictionary\n",
228 |         "agents = {}\n",
229 |         "query_engines = {}\n",
230 |         "\n",
231 |         "# this is for the baseline\n",
232 |         "all_nodes = []\n",
233 |         "\n",
234 |         "for idx, wiki_title in enumerate(wiki_titles):\n",
235 |         "    nodes = node_parser.get_nodes_from_documents(leagues_docs[wiki_title])\n",
236 |         "    all_nodes.extend(nodes)\n",
237 |         "\n",
238 |         "    if not os.path.exists(f\"./data/{wiki_title}\"):\n",
239 |         "        # build vector index\n",
240 |         "        vector_index = VectorStoreIndex(nodes, service_context=service_context)\n",
241 |         "        vector_index.storage_context.persist(\n",
242 |         "            persist_dir=f\"./data/{wiki_title}\"\n",
243 |         "        )\n",
244 |         "    else:\n",
245 |         "        vector_index = load_index_from_storage(\n",
246 |         "            StorageContext.from_defaults(persist_dir=f\"./data/{wiki_title}\"),\n",
247 |         "            service_context=service_context,\n",
248 |         "        )\n",
249 |         "\n",
250 |         "    # build summary index\n",
251 |         "    summary_index = SummaryIndex(nodes, service_context=service_context)\n",
252 |         "    # define query engines\n",
253 |         "    vector_query_engine = vector_index.as_query_engine()\n",
254 |         "    summary_query_engine = summary_index.as_query_engine()\n",
255 |         "\n",
256 |         "    # define tools\n",
257 |         "    query_engine_tools = [\n",
258 |         "        QueryEngineTool(\n",
259 |         "            query_engine=vector_query_engine,\n",
260 |         "            metadata=ToolMetadata(\n",
261 |         "                name=\"vector_tool\",\n",
262 |         "                description=(\n",
263 |         "                    \"Useful for questions related to specific aspects of\"\n",
264 |         "                    f\" {wiki_title} (e.g. the history, teams \"\n",
265 |         "                    \"and performance in EU, or more).\"\n",
266 |         "                ),\n",
267 |         "            ),\n",
268 |         "        ),\n",
269 |         "        QueryEngineTool(\n",
270 |         "            query_engine=summary_query_engine,\n",
271 |         "            metadata=ToolMetadata(\n",
272 |         "                name=\"summary_tool\",\n",
273 |         "                description=(\n",
274 |         "                    \"Useful for any requests that require a holistic summary\"\n",
275 |         "                    f\" of EVERYTHING about {wiki_title}. For questions about\"\n",
276 |         "                    \" more specific sections, please use the vector_tool.\"\n",
277 |         "                ),\n",
278 |         "            ),\n",
279 |         "        ),\n",
280 |         "    ]\n",
281 |         "\n",
282 |         "    # build agent\n",
283 |         "    function_llm = OpenAI(model=\"gpt-4\")\n",
284 |         "    agent = OpenAIAgent.from_tools(\n",
285 |         "        query_engine_tools,\n",
286 |         "        llm=function_llm,\n",
287 |         "        verbose=True,\n",
288 |         "        system_prompt=f\"\"\"\\\n",
289 |         "You are a specialized agent designed to answer queries about {wiki_title}.\n",
290 |         "You must ALWAYS use at least one of the tools provided when answering a question; do NOT rely on prior knowledge.\\\n",
291 |         "\"\"\",\n",
292 |         "    )\n",
293 |         "\n",
294 |         "    agents[wiki_title] = agent\n",
295 |         "    query_engines[wiki_title] = vector_index.as_query_engine(\n",
296 |         "        similarity_top_k=2\n",
297 |         "    )"
298 |       ],
299 |       "metadata": {
300 |         "id": "BLZ1EV80nQOF"
301 |       },
302 |       "execution_count": 8,
303 |       "outputs": []
304 |     },
305 |     {
306 |       "cell_type": "code",
307 |       "source": [
308 |         "# define tool for each document agent\n",
309 |         "all_tools = []\n",
310 |         "for wiki_title in wiki_titles:\n",
311 |         "    wiki_summary = (\n",
312 |         "        f\"This content contains Wikipedia articles about {wiki_title}. Use\"\n",
313 |         "        f\" this tool if you want to answer any questions about {wiki_title}.\\n\"\n",
314 |         "    )\n",
315 |         "    doc_tool = QueryEngineTool(\n",
316 |         "        query_engine=agents[wiki_title],\n",
317 |         "        metadata=ToolMetadata(\n",
318 |         "            name=f\"tool_{wiki_title.replace(' ', '_')}\",\n",
319 |         "            description=wiki_summary,\n",
320 |         "        ),\n",
321 |         "    )\n",
322 |         "    all_tools.append(doc_tool)"
323 |       ],
324 |       "metadata": {
325 |         "id": "Mm_T_N1SnXXn"
326 |       },
327 |       "execution_count": 12,
328 |       "outputs": []
329 |     },
330 |     {
331 |       "cell_type": "code",
332 |       "source": [
333 |         "# define an \"object\" index and retriever over these tools\n",
334 |         "from llama_index import VectorStoreIndex\n",
335 |         "from llama_index.objects import ObjectIndex, SimpleToolNodeMapping\n",
336 |         "\n",
337 |         "tool_mapping = SimpleToolNodeMapping.from_objects(all_tools)\n",
338 |         "obj_index = ObjectIndex.from_objects(\n",
339 |         "    all_tools,\n",
340 |         "    tool_mapping,\n",
341 |         "    VectorStoreIndex,\n",
342 |         ")"
343 |       ],
344 |       "metadata": {
345 |         "id": "T5Vskzn9nY42"
346 |       },
347 |       "execution_count": 10,
348 |       "outputs": []
349 |     },
350 |     {
351 |       "cell_type": "code",
352 |       "source": [
353 |         "from llama_index.agent import FnRetrieverOpenAIAgent\n",
354 |         "\n",
355 |         "top_agent = FnRetrieverOpenAIAgent.from_retriever(\n",
356 |         "    obj_index.as_retriever(similarity_top_k=3),\n",
357 |         "    system_prompt=\"\"\" \\\n",
358 |         "You are an agent designed to answer queries about the European top football leagues.\n",
359 |         "Please always use the tools provided to answer a question. Do not rely on prior knowledge.\\\n",
360 |         "\n",
361 |         "\"\"\",\n",
362 |         "    verbose=True,\n",
363 |         ")"
364 |       ],
365 |       "metadata": {
366 |         "id": "jBsCUADsnaZt"
367 |       },
368 |       "execution_count": 11,
369 |       "outputs": []
370 |     },
371 |     {
372 |       "cell_type": "code",
373 |       "source": [
374 |         "response = top_agent.query(\"Tell me about the history and UCL performance of La Liga\")"
375 |       ],
376 |       "metadata": {
377 |         "colab": {
378 |           "base_uri": "https://localhost:8080/"
379 |         },
380 |         "id": "UpragtnandB7",
381 |         "outputId": "7247b8fd-a9e7-467d-cbb1-14405c8fd256"
382 |       },
383 |       "execution_count": 13,
384 |       "outputs": [
385 |         {
386 |           "output_type": "stream",
387 |           "name": "stdout",
388 |           "text": [
389 |             "STARTING TURN 1\n",
390 |             "---------------\n",
391 |             "\n",
392 |             "=== Calling Function ===\n",
393 |             "Calling function: tool_La_Liga with args: {\n",
394 |             "  \"input\": \"history\"\n",
395 |             "}\n",
396 |             "STARTING TURN 1\n",
397 |             "---------------\n",
398 |             "\n",
399 |             "=== Calling Function ===\n",
400 |             "Calling function: summary_tool with args: {\n",
401 |             "  \"input\": \"history\"\n",
402 |             "}\n",
403 |             "Got output: During the 1930s, Athletic Club was the dominant team in La Liga, winning several titles. However, in the 1940s, Atlético Madrid, Barcelona, and Valencia emerged as strong clubs. Atlético Madrid won two titles during this decade, while Barcelona and Valencia each won multiple titles. In the 1950s, FC Barcelona continued their success, winning back-to-back La Liga titles and several other trophies. Real Madrid also emerged as a dominant force during this decade, winning three La Liga titles and dominating the newly created European Cup. Real Madrid's dominance continued into the 1960s and 1970s, winning 14 La Liga titles during this period. However, Atlético Madrid and other clubs like Valencia and Barcelona also had successful seasons during this time. In the 1980s, Real Madrid continued to dominate La Liga, winning five consecutive titles. However, the Basque clubs of Real Sociedad and Athletic Club disrupted their monopoly by winning titles in the early 1980s. Barcelona also won a title in 1985.\n",
404 |             "========================\n",
405 |             "\n",
406 |             "STARTING TURN 2\n",
407 |             "---------------\n",
408 |             "\n",
409 |             "Got output: La Liga, also known as the Primera Division, has a rich history dating back to the 1930s. During this time, Athletic Club was the dominant team, winning several titles. However, the 1940s saw the rise of Atlético Madrid, Barcelona, and Valencia as strong contenders, with each club winning multiple titles.\n",
410 |             "\n",
411 |             "The 1950s were marked by the continued success of FC Barcelona, which won back-to-back La Liga titles and several other trophies. This decade also saw the emergence of Real Madrid as a dominant force, winning three La Liga titles and dominating the newly created European Cup.\n",
412 |             "\n",
413 |             "Real Madrid's dominance continued into the 1960s and 1970s, during which they won 14 La Liga titles. However, Atlético Madrid, Valencia, and Barcelona also had successful seasons during this period.\n",
414 |             "\n",
415 |             "The 1980s saw Real Madrid continue to dominate La Liga, winning five consecutive titles. However, the Basque clubs of Real Sociedad and Athletic Club disrupted their monopoly by winning titles in the early 1980s. Barcelona also won a title in 1985.\n",
416 |             "========================\n",
417 |             "\n",
418 |             "STARTING TURN 2\n",
419 |             "---------------\n",
420 |             "\n",
421 |             "=== Calling Function ===\n",
422 |             "Calling function: tool_La_Liga with args: {\n",
423 |             "  \"input\": \"UCL performance\"\n",
424 |             "}\n",
425 |             "STARTING TURN 1\n",
426 |             "---------------\n",
427 |             "\n",
428 |             "=== Calling Function ===\n",
429 |             "Calling function: vector_tool with args: {\n",
430 |             "  \"input\": \"UCL performance\"\n",
431 |             "}\n",
432 |             "Got output: Real Madrid, Barcelona, and Atlético Madrid have been successful clubs in European football, with multiple Champions League titles between them. Other Spanish clubs like Sevilla and Valencia have also won international trophies. La Liga has had clubs win both the Champions League and UEFA Cup in the same season, and has had multiple seasons where Spanish clubs achieved this feat. Additionally, La Liga has been successful in terms of the number of teams qualifying for the Champions League group stage.\n",
433 |             "========================\n",
434 |             "\n",
435 |             "STARTING TURN 2\n",
436 |             "---------------\n",
437 |             "\n",
438 |             "Got output: La Liga teams have had a significant impact in the UEFA Champions League (UCL). Real Madrid, Barcelona, and Atlético Madrid are particularly successful, with multiple UCL titles to their names. Other Spanish clubs like Sevilla and Valencia have also won international trophies. There have been instances where La Liga clubs have won both the Champions League and UEFA Cup in the same season. Furthermore, La Liga consistently sends a significant number of teams to the Champions League group stage, demonstrating the league's overall strength and competitiveness.\n",
439 |             "========================\n",
440 |             "\n",
441 |             "STARTING TURN 3\n",
442 |             "---------------\n",
443 |             "\n"
444 |           ]
445 |         }
446 |       ]
447 |     },
448 |     {
449 |       "cell_type": "code",
450 |       "source": [
451 |         "print(response)"
452 |       ],
453 |       "metadata": {
454 |         "colab": {
455 |           "base_uri": "https://localhost:8080/"
456 |         },
457 |         "id": "cVYo_fUvnerF",
458 |         "outputId": "5c1225b0-daad-4a60-b8a9-4ed266fc18d3"
459 |       },
460 |       "execution_count": 14,
461 |       "outputs": [
462 |         {
463 |           "output_type": "stream",
464 |           "name": "stdout",
465 |           "text": [
466 |             "La Liga has a rich history dating back to the 1930s. During the early years, Athletic Club was the dominant team, but in the 1940s, Atlético Madrid, Barcelona, and Valencia emerged as strong contenders. The 1950s saw the rise of FC Barcelona and the dominance of Real Madrid, which continued into the 1960s and 1970s.\n",
467 |             "\n",
468 |             "In terms of UEFA Champions League (UCL) performance, La Liga teams have had a significant impact. Real Madrid, Barcelona, and Atlético Madrid have been particularly successful, with multiple UCL titles to their names. Other Spanish clubs like Sevilla and Valencia have also won international trophies. La Liga consistently sends a significant number of teams to the Champions League group stage, showcasing the league's overall strength and competitiveness.\n"
469 |           ]
470 |         }
471 |       ]
472 |     },
473 |     {
474 |       "cell_type": "code",
475 |       "source": [
476 |         "response = top_agent.query(\n",
477 |         "    \"Please compare Premier League and La Liga in terms of history and UCL performance\"\n",
478 |         ")"
479 |       ],
480 |       "metadata": {
481 |         "colab": {
482 |           "base_uri": "https://localhost:8080/"
483 |         },
484 |         "id": "W1j9Wt__njJR",
485 |         "outputId": "2895e72c-858d-4ccf-bedf-1d90949e4aea"
486 |       },
487 |       "execution_count": 15,
488 |       "outputs": [
489 |         {
490 |           "output_type": "stream",
491 |           "name": "stdout",
492 |           "text": [
493 |             "STARTING TURN 1\n",
494 |             "---------------\n",
495 |             "\n",
496 |             "=== Calling Function ===\n",
497 |             "Calling function: tool_Premier_League with args: {\n",
498 |             "\"input\": \"history\"\n",
499 |             "}\n",
500 |             "STARTING TURN 1\n",
501 |             "---------------\n",
502 |             "\n",
503 |             "=== Calling Function ===\n",
504 |             "Calling function: summary_tool with args: {\n",
505 |             "  \"input\": \"history\"\n",
506 |             "}\n",
507 |             "Got output: The Premier League has a rich history since its establishment in 1992 as the FA Premier League. It operates independently from the English Football League and follows a promotion and relegation system with the EFL. The league's inaugural season took place in 1992-1993 with the participation of 22 clubs, and Manchester United emerged as the first champions. Throughout the years, clubs like Manchester United, Arsenal, Chelsea, Liverpool, and Manchester City have had a significant impact on individual matches. In the 2000s, the \"Big Four\" clubs emerged, which later expanded to the \"Big Six\" with the inclusion of Tottenham Hotspur and Manchester City. A notable moment in the league's history was Leicester City's remarkable title victory in the 2015-2016 season. The Premier League is renowned for its financial strength and global influence, particularly with the \"Big Six\" clubs generating substantial revenue. There have been ongoing discussions and proposals for potential changes in the league structure in recent times.\n",
508 |             "========================\n",
509 |             "\n",
510 |             "STARTING TURN 2\n",
511 |             "---------------\n",
512 |             "\n",
513 |             "Got output: The Premier League, established in 1992 as the FA Premier League, operates independently from the English Football League and follows a promotion and relegation system with the EFL. The inaugural season took place in 1992-1993 with 22 clubs, and Manchester United emerged as the first champions. Throughout the years, clubs like Manchester United, Arsenal, Chelsea, Liverpool, and Manchester City have had a significant impact on individual matches. In the 2000s, the \"Big Four\" clubs emerged, which later expanded to the \"Big Six\" with the inclusion of Tottenham Hotspur and Manchester City. A notable moment in the league's history was Leicester City's remarkable title victory in the 2015-2016 season. The Premier League is renowned for its financial strength and global influence, particularly with the \"Big Six\" clubs generating substantial revenue. There have been ongoing discussions and proposals for potential changes in the league structure in recent times.\n",
514 |             "========================\n",
515 |             "\n",
516 |             "STARTING TURN 2\n",
517 |             "---------------\n",
518 |             "\n",
519 |             "=== Calling Function ===\n",
520 |             "Calling function: tool_La_Liga with args: {\n",
521 |             "\"input\": \"history\"\n",
522 |             "}\n",
523 |             "STARTING TURN 1\n",
524 |             "---------------\n",
525 |             "\n",
526 |             "=== Calling Function ===\n",
527 |             "Calling function: summary_tool with args: {\n",
528 |             "  \"input\": \"history\"\n",
529 |             "}\n",
530 |             "Got output: In the early years of La Liga, Athletic Club was the most successful team, winning the title four times in the 1930s. However, in the 1940s, Atlético Madrid, Barcelona, and Valencia emerged as strong clubs, with Atlético Madrid winning two titles during that decade. The 1950s saw FC Barcelona's continued success, winning back-to-back La Liga titles and several other trophies. Real Madrid also emerged as a dominant force, winning three La Liga titles and dominating the newly created European Cup. Real Madrid's dominance continued into the 1960s and 1970s, winning 14 La Liga titles during that period. However, in the 1980s, the Basque clubs of Real Sociedad and Athletic Club disrupted Real Madrid's monopoly, with Real Sociedad winning two consecutive La Liga titles and Athletic Club also winning two titles. Barcelona also secured their tenth La Liga title in 1985.\n",
531 |             "========================\n",
532 |             "\n",
533 |             "STARTING TURN 2\n",
534 |             "---------------\n",
535 |             "\n",
536 |             "Got output: La Liga, also known as the Primera Division, has a rich history dating back to its inception in 1929. In the early years, Athletic Club was the most successful team, winning the title four times in the 1930s. However, the 1940s saw the rise of Atlético Madrid, Barcelona, and Valencia as strong contenders, with Atlético Madrid winning two titles during that decade.\n",
537 |             "\n",
538 |             "The 1950s were marked by FC Barcelona's continued success, as they clinched back-to-back La Liga titles and several other trophies. Real Madrid also emerged as a dominant force during this period, winning three La Liga titles and making a strong showing in the newly created European Cup.\n",
539 |             "\n",
540 |             "Real Madrid's dominance extended into the 1960s and 1970s, during which they secured 14 La Liga titles. However, the 1980s saw a disruption in Real Madrid's monopoly as the Basque clubs of Real Sociedad and Athletic Club came to the fore. Real Sociedad won two consecutive La Liga titles, and Athletic Club also won two titles. Barcelona also secured their tenth La Liga title in 1985.\n",
541 |             "========================\n",
542 |             "\n",
543 |             "STARTING TURN 3\n",
544 |             "---------------\n",
545 |             "\n",
546 |             "=== Calling Function ===\n",
547 |             "Calling function: tool_Premier_League with args: {\n",
548 |             "\"input\": \"UCL performance\"\n",
549 |             "}\n",
550 |             "STARTING TURN 1\n",
551 |             "---------------\n",
552 |             "\n",
553 |             "=== Calling Function ===\n",
554 |             "Calling function: vector_tool with args: {\n",
555 |             "  \"input\": \"UCL performance\"\n",
556 |             "}\n",
557 |             "Got output: English clubs have won a total of 15 UEFA Champions League titles and have been runners-up 11 times. This makes them the second-most successful country in the history of the competition, behind Spanish clubs.\n",
558 |             "========================\n",
559 |             "\n",
560 |             "STARTING TURN 2\n",
561 |             "---------------\n",
562 |             "\n",
563 |             "Got output: English clubs have had a significant impact in the UEFA Champions League. They have won a total of 15 titles and have been runners-up 11 times. This makes England the second-most successful country in the history of the competition, only surpassed by Spain.\n",
564 |             "========================\n",
565 |             "\n",
566 |             "STARTING TURN 4\n",
567 |             "---------------\n",
568 |             "\n",
569 |             "=== Calling Function ===\n",
570 |             "Calling function: tool_La_Liga with args: {\n",
571 |             "\"input\": \"UCL performance\"\n",
572 |             "}\n",
573 |             "STARTING TURN 1\n",
574 |             "---------------\n",
575 |             "\n",
576 |             "=== Calling Function ===\n",
577 |             "Calling function: vector_tool with args: {\n",
578 |             "  \"input\": \"UCL performance\"\n",
579 |             "}\n",
580 |             "Got output: Real Madrid, Barcelona, and Atlético Madrid have been successful clubs in European football, with multiple Champions League titles between them. Other Spanish clubs like Sevilla and Valencia have also won international trophies. La Liga became the first league to have its clubs win both the Champions League and UEFA Cup in the same season, and this feat was repeated multiple times. Additionally, La Liga has had a strong presence in the Champions League group stage, with multiple teams qualifying.\n",
581 |             "========================\n",
582 |             "\n",
583 |             "STARTING TURN 2\n",
584 |             "---------------\n",
585 |             "\n",
586 |             "Got output: La Liga teams have shown remarkable performance in the UEFA Champions League (UCL). Real Madrid, Barcelona, and Atlético Madrid are among the most successful clubs in European football, with multiple Champions League titles to their names. \n",
587 |             "\n",
588 |             "Real Madrid, in particular, has been a dominant force in the UCL, holding the record for the most titles won. Barcelona has also had its share of success, with several titles under its belt. Atlético Madrid, while not as successful as the other two in terms of titles won, has consistently performed well, reaching the finals on several occasions.\n",
589 |             "\n",
590 |             "Other Spanish clubs like Sevilla and Valencia have also made their mark in international competitions, winning trophies in other UEFA competitions.\n",
591 |             "\n",
592 |             "In fact, La Liga became the first league to have its clubs win both the Champions League and UEFA Cup in the same season, a feat that has been repeated multiple times. This speaks volumes about the strength and depth of the league.\n",
593 |             "\n",
594 |             "Moreover, La Liga has consistently had a strong presence in the Champions League group stage, with multiple teams qualifying each year. This consistent performance in the UCL underscores the competitiveness and quality of La Liga.\n",
595 |             "========================\n",
596 |             "\n",
597 |             "STARTING TURN 5\n",
598 |             "---------------\n",
599 |             "\n"
600 |           ]
601 |         }
602 |       ]
603 |     },
604 |     {
605 |       "cell_type": "code",
606 |       "source": [
607 |         "print(response)"
608 |       ],
609 |       "metadata": {
610 |         "colab": {
611 |           "base_uri": "https://localhost:8080/"
612 |         },
613 |         "id": "LoozAkAhnlMn",
614 |         "outputId": "62c5919c-0c3e-4a51-8638-e515510ba79c"
615 |       },
616 |       "execution_count": 16,
617 |       "outputs": [
618 |         {
619 |           "output_type": "stream",
620 |           "name": "stdout",
621 |           "text": [
622 |             "In terms of history, the Premier League was established in 1992 and has since become one of the most popular and financially lucrative football leagues in the world. It is known for its competitive nature and has been dominated by clubs like Manchester United, Arsenal, Chelsea, Liverpool, and Manchester City. The league has also seen remarkable moments, such as Leicester City's title victory in the 2015-2016 season.\n",
623 |             "\n",
624 |             "On the other hand, La Liga has a longer history, dating back to 1929. It has been home to some of the most successful and iconic clubs in football, including Real Madrid and Barcelona. The league has witnessed periods of dominance from various clubs and has seen intense rivalries between teams. Real Madrid has been particularly successful in the Champions League, winning numerous titles.\n",
625 |             "\n",
626 |             "In terms of UEFA Champions League (UCL) performance, both leagues have had strong showings. English clubs have won a total of 15 UCL titles, making England the second-most successful country in the competition's history. Spanish clubs, especially Real Madrid and Barcelona, have also been highly successful in the UCL. Real Madrid holds the record for the most UCL titles won, while Barcelona has also had multiple triumphs in the competition. Other Spanish clubs like Atlético Madrid, Sevilla, and Valencia have also made their mark in European competitions.\n",
627 |             "\n",
628 |             "Overall, both the Premier League and La Liga have rich histories and have produced successful teams in the UEFA Champions League. The Premier League is known for its competitiveness and financial strength, while La Liga has a longer history and has been home to some of the most iconic clubs in football.\n"
629 |           ]
630 |         }
631 |       ]
632 |     }
633 |   ]
634 | }


--------------------------------------------------------------------------------
/04_langchain_per_user_retrieval.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "provenance": [],
  7 |       "authorship_tag": "ABX9TyNWGtpJM0YGGc8RTGjYQlzJ",
  8 |       "include_colab_link": true
  9 |     },
 10 |     "kernelspec": {
 11 |       "name": "python3",
 12 |       "display_name": "Python 3"
 13 |     },
 14 |     "language_info": {
 15 |       "name": "python"
 16 |     }
 17 |   },
 18 |   "cells": [
 19 |     {
 20 |       "cell_type": "markdown",
 21 |       "metadata": {
 22 |         "id": "view-in-github",
 23 |         "colab_type": "text"
 24 |       },
 25 |       "source": [
 26 |         "<a href=\"https://colab.research.google.com/github/sugarforever/Advanced-RAG/blob/main/04_langchain_per_user_retrieval.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 27 |       ]
 28 |     },
 29 |     {
 30 |       "cell_type": "code",
 31 |       "execution_count": null,
 32 |       "metadata": {
 33 |         "id": "6ygznopTYdrQ"
 34 |       },
 35 |       "outputs": [],
 36 |       "source": [
 37 |         "!pip install openai langchain pinecone-client pypdf tiktoken -q -U"
 38 |       ]
 39 |     },
 40 |     {
 41 |       "cell_type": "code",
 42 |       "source": [
 43 |         "import os\n",
 44 |         "\n",
 45 |         "from google.colab import userdata\n",
 46 |         "\n",
 47 |         "# Secrets come from Colab's user-data store; the index settings are plain constants.\n",
 48 |         "PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')\n",
 49 |         "PINECONE_ENVIRONMENT = \"us-west4-gcp\"\n",
 50 |         "PINECONE_INDEX = \"multi-users\"\n",
 51 |         "\n",
 52 |         "OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')\n",
 53 |         "# LangChain's OpenAI integrations read the key from this environment variable.\n",
 54 |         "os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY"
 55 |       ],
 56 |       "metadata": {
 57 |         "id": "ipfSp3H4bLpK"
 58 |       },
 59 |       "execution_count": null,
 60 |       "outputs": []
 61 |     },
 62 |     {
 63 |       "cell_type": "code",
 64 |       "source": [
 65 |         "import pinecone\n",
 66 |         "from langchain.embeddings.openai import OpenAIEmbeddings\n",
 67 |         "from langchain.vectorstores import Pinecone"
 68 |       ],
 69 |       "metadata": {
 70 |         "id": "BvMjqf7ZZNpn"
 71 |       },
 72 |       "execution_count": 27,
 73 |       "outputs": []
 74 |     },
 75 |     {
 76 |       "cell_type": "code",
 77 |       "source": [
 78 |         "pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)\n",
 79 |         "index = pinecone.Index(PINECONE_INDEX)\n",
 80 |         "\n",
 81 |         "embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)\n",
 82 |         "vectorstore = Pinecone(index, embeddings, \"text\")  # \"text\" is the metadata key that stores each chunk's raw text"
 83 |       ],
 84 |       "metadata": {
 85 |         "id": "brRWntkRZuEj"
 86 |       },
 87 |       "execution_count": 28,
 88 |       "outputs": []
 89 |     },
 90 |     {
 91 |       "cell_type": "code",
 92 |       "source": [
 93 |         "!wget -O uniswap-v3.pdf https://uniswap.org/whitepaper-v3.pdf"
 94 |       ],
 95 |       "metadata": {
 96 |         "colab": {
 97 |           "base_uri": "https://localhost:8080/"
 98 |         },
 99 |         "id": "MTiI8aH7abGP",
100 |         "outputId": "8c665539-8908-4cbc-ad87-2af816fdd390"
101 |       },
102 |       "execution_count": 29,
103 |       "outputs": [
104 |         {
105 |           "output_type": "stream",
106 |           "name": "stdout",
107 |           "text": [
108 |             "--2023-12-27 23:45:39--  https://uniswap.org/whitepaper-v3.pdf\n",
109 |             "Resolving uniswap.org (uniswap.org)... 104.18.22.54, 104.18.23.54, 2606:4700::6812:1736, ...\n",
110 |             "Connecting to uniswap.org (uniswap.org)|104.18.22.54|:443... connected.\n",
111 |             "HTTP request sent, awaiting response... 200 OK\n",
112 |             "Length: 1500865 (1.4M) [application/pdf]\n",
113 |             "Saving to: ‘uniswap-v3.pdf’\n",
114 |             "\n",
115 |             "\runiswap-v3.pdf        0%[                    ]       0  --.-KB/s               \runiswap-v3.pdf      100%[===================>]   1.43M  --.-KB/s    in 0.01s   \n",
116 |             "\n",
117 |             "2023-12-27 23:45:39 (137 MB/s) - ‘uniswap-v3.pdf’ saved [1500865/1500865]\n",
118 |             "\n"
119 |           ]
120 |         }
121 |       ]
122 |     },
123 |     {
124 |       "cell_type": "code",
125 |       "source": [
126 |         "!wget -O electronic-health-records.pdf https://med.stanford.edu/content/dam/sm/ehr/documents/SM-EHR-White-Papers_v12.pdf"
127 |       ],
128 |       "metadata": {
129 |         "colab": {
130 |           "base_uri": "https://localhost:8080/"
131 |         },
132 |         "id": "NvuEc1u4bEiZ",
133 |         "outputId": "15a37cd2-a087-4151-d218-19e68e2ee37d"
134 |       },
135 |       "execution_count": 30,
136 |       "outputs": [
137 |         {
138 |           "output_type": "stream",
139 |           "name": "stdout",
140 |           "text": [
141 |             "--2023-12-27 23:45:42--  https://med.stanford.edu/content/dam/sm/ehr/documents/SM-EHR-White-Papers_v12.pdf\n",
142 |             "Resolving med.stanford.edu (med.stanford.edu)... 34.117.178.225\n",
143 |             "Connecting to med.stanford.edu (med.stanford.edu)|34.117.178.225|:443... connected.\n",
144 |             "HTTP request sent, awaiting response... 200 OK\n",
145 |             "Length: 814908 (796K) [application/pdf]\n",
146 |             "Saving to: ‘electronic-health-records.pdf’\n",
147 |             "\n",
148 |             "electronic-health-r 100%[===================>] 795.81K  3.94MB/s    in 0.2s    \n",
149 |             "\n",
150 |             "2023-12-27 23:45:42 (3.94 MB/s) - ‘electronic-health-records.pdf’ saved [814908/814908]\n",
151 |             "\n"
152 |           ]
153 |         }
154 |       ]
155 |     },
156 |     {
157 |       "cell_type": "code",
158 |       "source": [
159 |         "USER_1 = \"Alex\"\n",
160 |         "USER_2 = \"Lucy\""
161 |       ],
162 |       "metadata": {
163 |         "id": "jHzEOVi8bb82"
164 |       },
165 |       "execution_count": 31,
166 |       "outputs": []
167 |     },
168 |     {
169 |       "cell_type": "markdown",
170 |       "source": [
171 |         "Lucy will build a chatbot over the Uniswap v3 whitepaper, **uniswap-v3.pdf**.\n",
172 |         "\n",
173 |         "Alex will build one over the electronic health records report, **electronic-health-records.pdf**."
174 |       ],
175 |       "metadata": {
176 |         "id": "9Yv-FN-nbjRm"
177 |       }
178 |     },
179 |     {
180 |       "cell_type": "code",
181 |       "source": [
182 |         "from langchain.document_loaders import PyPDFLoader\n",
183 |         "\n",
184 |         "loader = PyPDFLoader(\"uniswap-v3.pdf\")\n",
185 |         "documents = loader.load_and_split()\n",
186 |         "\n",
187 |         "vectorstore.add_documents(documents, namespace=USER_2)"
188 |       ],
189 |       "metadata": {
190 |         "colab": {
191 |           "base_uri": "https://localhost:8080/"
192 |         },
193 |         "id": "ID2hhSYTbbMH",
194 |         "outputId": "c0f9ec65-17ef-4f45-86a3-6437ad13a288"
195 |       },
196 |       "execution_count": 32,
197 |       "outputs": [
198 |         {
199 |           "output_type": "execute_result",
200 |           "data": {
201 |             "text/plain": [
202 |               "['d215b4cc-1334-4da9-8cb4-49509c941276',\n",
203 |               " 'd4a9a6c3-2928-45b6-be31-416994b0822a',\n",
204 |               " 'f6b23825-0047-4301-bb4f-cd1bd7bec624',\n",
205 |               " '9eeb44ab-e896-43eb-ae62-c771f1d7a17c',\n",
206 |               " '49c32905-1feb-4553-a95d-87d03c0693c8',\n",
207 |               " '1226e79d-e955-4503-9fa3-c8ee29595b99',\n",
208 |               " '58332bcd-dd8e-45e1-9003-14cf3e966831',\n",
209 |               " 'c15280eb-a408-4c2f-99db-117918231c95',\n",
210 |               " 'df6279b2-33d3-462f-8b5a-55eb53774565',\n",
211 |               " 'ea4260de-c956-4b1a-8db7-9ca45da5dbcf',\n",
212 |               " '3d24b427-f3ef-48e8-9479-dc1456d41a9f',\n",
213 |               " '8372c6e2-77bd-44d9-8f84-56f938188a7a',\n",
214 |               " '07151d70-cfa7-4ced-a4be-f37d9958603e',\n",
215 |               " '1c5f29f6-6e0a-4409-8aa9-893ccd3baf3e']"
216 |             ]
217 |           },
218 |           "metadata": {},
219 |           "execution_count": 32
220 |         }
221 |       ]
222 |     },
223 |     {
224 |       "cell_type": "code",
225 |       "source": [
226 |         "loader = PyPDFLoader(\"electronic-health-records.pdf\")\n",
227 |         "documents = loader.load_and_split()\n",
228 |         "\n",
229 |         "vectorstore.add_documents(documents, namespace=USER_1)"
230 |       ],
231 |       "metadata": {
232 |         "colab": {
233 |           "base_uri": "https://localhost:8080/"
234 |         },
235 |         "id": "AZ8fcjlzcyg_",
236 |         "outputId": "093166d4-5471-4388-8e5b-0216c2e30a5a"
237 |       },
238 |       "execution_count": 33,
239 |       "outputs": [
240 |         {
241 |           "output_type": "execute_result",
242 |           "data": {
243 |             "text/plain": [
244 |               "['a63b10f9-80dd-414b-bdfa-4935a516e269',\n",
245 |               " '63b5819a-e975-4fb9-beef-a978f3a2a4ac',\n",
246 |               " 'e8662ce7-f2b9-4fd3-92a5-a714233bee49',\n",
247 |               " '46ca36e2-3b5d-46a2-a0ac-ec4221f26334',\n",
248 |               " 'e76b404f-d7b1-48ec-aae3-5f8b0f9fca2a',\n",
249 |               " 'd2bc4d7d-1958-4f83-bfed-95f545f8b9cc',\n",
250 |               " '055e37fa-979c-443b-851f-d98df790ec5d',\n",
251 |               " 'c15a4322-3e35-4ae3-b96b-f568068a2e29',\n",
252 |               " '36433b35-e0b2-4371-a9b9-69aae8740cba',\n",
253 |               " '925b3caf-9ddc-4152-8cd8-72977a076cc3',\n",
254 |               " '0ef3f844-171d-4a57-b816-f8c997c89bfc',\n",
255 |               " 'b915f8d3-d012-4961-b5f9-bd62fc093315',\n",
256 |               " 'd10101f8-79c2-4b4f-92b7-e1dec20a5fc0',\n",
257 |               " '16130b73-fe42-4c06-b428-4d9484f2d961',\n",
258 |               " 'f3a8e510-c61e-4e93-bcc1-f69316605d2a',\n",
259 |               " 'de09243b-a51f-4aac-ad23-d8ff13fd6dac',\n",
260 |               " '6cffaec1-64f5-4389-8367-93dc6719dce3',\n",
261 |               " '433cbc43-8dd3-429e-8da3-f39521d92ad8',\n",
262 |               " '5d0d1691-4324-40ff-b546-29a1383c1e9d',\n",
263 |               " '302f928e-455b-4d79-a2d6-ff7cd4e800db',\n",
264 |               " '49ad14dc-0789-4e26-9bc2-d466fa07d829']"
265 |             ]
266 |           },
267 |           "metadata": {},
268 |           "execution_count": 33
269 |         }
270 |       ]
271 |     },
272 |     {
273 |       "cell_type": "code",
274 |       "source": [
275 |         "question = \"What is uniswap?\""
276 |       ],
277 |       "metadata": {
278 |         "id": "5prPmGaIdJje"
279 |       },
280 |       "execution_count": 34,
281 |       "outputs": []
282 |     },
283 |     {
284 |       "cell_type": "code",
285 |       "source": [
286 |         "from langchain.schema.vectorstore import VectorStoreRetriever\n",
287 |         "from langchain.llms import OpenAI\n",
288 |         "from langchain.memory import ConversationBufferMemory\n",
289 |         "from langchain.chains.conversational_retrieval.prompts import QA_PROMPT\n",
290 |         "from langchain.chains.question_answering import load_qa_chain\n",
291 |         "from langchain.chains import ConversationalRetrievalChain"
292 |       ],
293 |       "metadata": {
294 |         "id": "Iun__di5c-vr"
295 |       },
296 |       "execution_count": 35,
297 |       "outputs": []
298 |     },
299 |     {
300 |       "cell_type": "markdown",
301 |       "source": [
302 |         "USER_1 (Alex) stored **electronic-health-records.pdf** in the `Alex` namespace.\n",
303 |         "\n",
304 |         "USER_2 (Lucy) stored **uniswap-v3.pdf** in the `Lucy` namespace.\n"
305 |       ],
306 |       "metadata": {
307 |         "id": "5hf7GC36b3Oj"
308 |       }
309 |     },
310 |     {
311 |       "cell_type": "code",
312 |       "source": [
313 |         "retriever = vectorstore.as_retriever(search_type=\"similarity_score_threshold\", search_kwargs={\"namespace\": USER_1, \"score_threshold\": 0.9})\n",
314 |         "\n",
315 |         "relevant_documents = retriever.get_relevant_documents(question)\n",
316 |         "relevant_documents"
317 |       ],
318 |       "metadata": {
319 |         "colab": {
320 |           "base_uri": "https://localhost:8080/"
321 |         },
322 |         "id": "SycEh_Y0iO4o",
323 |         "outputId": "63344113-7bc8-409f-973f-9eed1c913808"
324 |       },
325 |       "execution_count": 39,
326 |       "outputs": [
327 |         {
328 |           "output_type": "stream",
329 |           "name": "stderr",
330 |           "text": [
331 |             "/usr/local/lib/python3.10/dist-packages/langchain_core/vectorstores.py:325: UserWarning: No relevant docs were retrieved using the relevance score threshold 0.9\n",
332 |             "  warnings.warn(\n"
333 |           ]
334 |         },
335 |         {
336 |           "output_type": "execute_result",
337 |           "data": {
338 |             "text/plain": [
339 |               "[]"
340 |             ]
341 |           },
342 |           "metadata": {},
343 |           "execution_count": 39
344 |         }
345 |       ]
346 |     },
347 |     {
348 |       "cell_type": "code",
349 |       "source": [
350 |         "retriever = vectorstore.as_retriever(search_type=\"similarity_score_threshold\", search_kwargs={\"namespace\": USER_2, \"score_threshold\": 0.9})\n",
351 |         "\n",
352 |         "relevant_documents = retriever.get_relevant_documents(question)\n",
353 |         "relevant_documents"
354 |       ],
355 |       "metadata": {
356 |         "colab": {
357 |           "base_uri": "https://localhost:8080/"
358 |         },
359 |         "id": "TqifrjaHa1Fk",
360 |         "outputId": "32d86a35-162e-4fc7-ae5c-3fa50494a837"
361 |       },
362 |       "execution_count": 40,
363 |       "outputs": [
364 |         {
365 |           "output_type": "execute_result",
366 |           "data": {
367 |             "text/plain": [
368 |               "[Document(page_content='Uniswap v3 Core\\nMarch 2021\\nHayden Adams\\nhayden@uniswap.orgNoah Zinsmeister\\nnoah@uniswap.orgMoody Salem\\nmoody@uniswap.org\\nRiver Keefer\\nriver@uniswap.orgDan Robinson\\ndan@paradigm.xyz\\nABSTRACT\\nUniswap v3 is a noncustodial automated market maker imple-\\nmented for the Ethereum Virtual Machine. In comparison to earlier\\nversions of the protocol, Uniswap v3 provides increased capital\\nefficiency and fine-tuned control to liquidity providers, improves\\nthe accuracy and convenience of the price oracle, and has a more\\nflexible fee structure.\\n1 INTRODUCTION\\nAutomated market makers (AMMs) are agents that pool liquidity\\nand make it available to traders according to an algorithm [ 5]. Con-\\nstant function market makers (CFMMs), a broad class of AMMs of\\nwhich Uniswap is a member, have seen widespread use in the con-\\ntext of decentralized finance, where they are typically implemented\\nas smart contracts that trade tokens on a permissionless blockchain\\n[2].\\nCFMMs as they are implemented today are often capital inef-\\nficient. In the constant product market maker formula used by\\nUniswap v1 andv2, only a fraction of the assets in the pool are\\navailable at a given price. This is inefficient, particularly when\\nassets are expected to trade close to a particular price at all times.\\nPrior attempts to address this capital efficiency issue, such as\\nCurve [ 3] and YieldSpace [ 4], have involved building pools that use\\ndifferent functions to describe the relation between reserves. 
This\\nrequires all liquidity providers in a given pool to adhere to a single\\nformula, and could result in liquidity fragmentation if liquidity\\nproviders want to provide liquidity within different price ranges.In this paper, we present Uniswap v3 , a novel AMM that gives\\nliquidity providers more control over the price ranges in which\\ntheir capital is used, with limited effect on liquidity fragmentation\\nand gas inefficiency. This design does not depend on any shared\\nassumption about the price behavior of the tokens. Uniswap v3\\nis based on the same constant product reserves curve as earlier\\nversions [1], but offers several significant new features:\\n•Concentrated Liquidity : Liquidity providers (LPs) are given\\nthe ability to concentrate their liquidity by “bounding\" it\\nwithin an arbitrary price range. This improves the pool’s\\ncapital efficiency and allows LPs to approximate their pre-\\nferred reserves curve, while still being efficiently aggregated\\nwith the rest of the pool. We describe this feature in section\\n2 and its implementation in Section 6.\\n•Flexible Fees : The swap fee is no longer locked at 0.30%.\\nRather, the fee tier for each pool (of which there can be\\nmultiple per asset pair) is set on initialization (Section 3.1).\\nThe initially supported fee tiers are 0.05%,0.30%, and 1%.\\nUNIgovernance is able to add additional values to this set.\\n•Protocol Fee Governance :UNIgovernance has more flexibility\\nin setting the fraction of swap fees collected by the protocol\\n(Section 6.2.2).\\n•Improved Price Oracle :Uniswap v3 provides a way for users\\nto query recent price accumulator values, thus avoiding the\\nneed to checkpoint the accumulator value at the exact be-\\nginning and end of the period for which a TWAP is being\\nmeasured. (Section 5.1).\\n1', metadata={'page': 0.0, 'source': 'uniswap-v3.pdf'}),\n",
369 |               " Document(page_content='initializable ticks), 0.30%(with a tick spacing of 60, approximately\\n0.60%between initializable ticks), and 1%(with a tick spacing of\\n200, approximately 2.02%between ticks.\\nFinally, UNI governance has the power to transfer ownership to\\nanother address.\\n5 ORACLE UPGRADES\\nUniswap v3 includes three significant changes to the time-weighted\\naverage price (TWAP) oracle that was introduced by Uniswap v2.\\nMost significantly, Uniswap v3 removes the need for users of\\nthe oracle to track previous values of the accumulator externally.\\nUniswap v2 requires users to checkpoint the accumulator value\\nat both the beginning and end of the time period for which they\\n2Specifically, the owner will be initialized to the Timelock contract from UNI gover-\\nnance, 0x1a9c8182c09f50c8318d769245bea52c32be35bc.\\n3', metadata={'page': 2.0, 'source': 'uniswap-v3.pdf'}),\n",
370 |               " Document(page_content='Uniswap v3 Core\\n0 ∞\\nPriceLiquidity\\n(I)Uniswap v2𝑝𝑎 𝑝𝑏\\nPriceLiquidity\\n(II)A single position on [𝑝𝑎, 𝑝𝑏]PriceLiquidity\\n(III) A collection of custom positions\\nFigure 3: Example Liquidity Distributions\\n3 ARCHITECTURAL CHANGES\\nUniswap v3 makes a number of architectural changes, some of\\nwhich are necessitated by the inclusion of concentrated liquidity,\\nand some of which are independent improvements.\\n3.1 Multiple Pools Per Pair\\nInUniswap v1 andv2, every pair of tokens corresponds to a single\\nliquidity pool, which applies a uniform fee of 0.30%to all swaps.\\nWhile this default fee tier historically worked well enough for many\\ntokens, it is likely too high for some pools (such as pools between\\ntwo stablecoins), and too low for others (such as pools that include\\nhighly volatile or rarely traded tokens).\\nUniswap v3 introduces multiple pools for each pair of tokens,\\neach with a different swap fee. All pools are created by the same\\nfactory contract. The factory contract initially allows pools to be\\ncreated at three fee tiers: 0.05%,0.30%, and 1%. Additional fee tiers\\ncan be enabled by UNI governance.\\n3.2 Non-Fungible Liquidity\\n3.2.1 Non-Compounding Fees. Fees earned in earlier versions were\\ncontinuously deposited in the pool as liquidity. This meant that\\nliquidity in the pool would grow over time, even without explicit\\ndeposits, and that fee earnings compounded.\\nInUniswap v3 , due to the non-fungible nature of positions, this\\nis no longer possible. Instead, fee earnings are stored separately\\nand held as the tokens in which the fees are paid (see Section 6.2.2).\\n3.2.2 Removal of Native Liquidity Tokens. InUniswap v1 andv2,\\nthe pool contract is also an ERC-20 token contract, whose tokens\\nrepresent liquidity held in the pool. 
While this is convenient, it\\nactually sits uneasily with the Uniswap v2 philosophy that any-\\nthing that does not need to be in the core contracts should be in the\\nperiphery, and blessing one “canonical\" ERC-20 implementation\\ndiscourages the creation of improved ERC-20 token wrappers. Ar-\\nguably, the ERC-20 token implementation should have been in the\\nperiphery, as a wrapper on a single liquidity position in the core\\ncontract.\\nThe changes made in Uniswap v3 force this issue by making\\ncompletely fungible liquidity tokens impossible. Due to the custom\\nliquidity provision feature, fees are now collected and held by thepool as individual tokens, rather than automatically reinvested as\\nliquidity in the pool.\\nAs a result, in v3, the pool contract does not implement the\\nERC-20 standard. Anyone can create an ERC-20 token contract in\\nthe periphery that makes a liquidity position more fungible, but\\nit will have to have additional logic to handle distribution of, or\\nreinvestment of, collected fees. Alternatively, anyone could create\\na periphery contract that wraps an individual liquidity position\\n(including collected fees) in an ERC-721 non-fungible token.\\n4 GOVERNANCE\\nThe factory has an owner , which is initially controlled by UNI\\ntokenholders.2The owner does not have the ability to halt the\\noperation of any of the core contracts.\\nAs in Uniswap v2 ,Uniswap v3 has a protocol fee that can be\\nturned on by UNI governance. In Uniswap v3 , UNI governance has\\nmore flexibility in choosing the fraction of swap fees that go to the\\nprotocol, and is able to choose any fraction1\\n𝑁where 4≤𝑁≤10,\\nor0. This parameter can be set on a per-pool basis.\\nUNI governance also has the ability to add additional fee tiers.\\nWhen it adds a new fee tier, it can also define the tickSpacing\\n(see Section 6.1) corresponding to that fee tier. Once a fee tier is\\nadded to the factory, it cannot be removed (and the tickSpacing\\ncannot be changed). 
The initial fee tiers and tick spacings supported\\nare0.05%(with a tick spacing of 10, approximately 0.10%between\\ninitializable ticks), 0.30%(with a tick spacing of 60, approximately\\n0.60%between initializable ticks), and 1%(with a tick spacing of\\n200, approximately 2.02%between ticks.', metadata={'page': 2.0, 'source': 'uniswap-v3.pdf'}),\n",
371 |               " Document(page_content='Hayden Adams, Noah Zinsmeister, Moody Salem, River Keefer, and Dan Robinson\\n•Liquidity Oracle : The contracts expose a time-weighted av-\\nerage liquidity oracle (Section 5.3).\\nTheUniswap v2 core contracts are non-upgradeable by de-\\nsign, so Uniswap v3 is implemented as an entirely new set of\\ncontracts, available here. The Uniswap v3 core contracts are also\\nnon-upgradeable, with some parameters controlled by governance\\nas described in Section 4.\\n2 CONCENTRATED LIQUIDITY\\nThe defining idea of Uniswap v3 is that of concentrated liquidity :\\nliquidity bounded within some price range.\\nIn earlier versions, liquidity was distributed uniformly along the\\n𝑥·𝑦=𝑘 (2.1)\\nreserves curve, where 𝑥and𝑦are the respective reserves of two\\nassets XandY, and 𝑘is a constant [ 1]. In other words, earlier ver-\\nsions were designed to provide liquidity across the entire price\\nrange(0,∞). This is simple to implement and allows liquidity to\\nbe efficiently aggregated, but means that much of the assets held in\\na pool are never touched.\\nHaving considered this, it seems reasonable to allow LPs to\\nconcentrate their liquidity to smaller price ranges than (0,∞). We\\ncall liquidity concentrated to a finite range a position . A position\\nonly needs to maintain enough reserves to support trading within\\nits range, and therefore can act like a constant product pool with\\nlarger reserves (we call these the virtual reserves ) within that range.\\n𝑎𝑏\\n𝑐𝑦real𝑥real\\nXReservesYReservesvirtual reserves\\nFigure 1: Simulation of Virtual Liquidity\\nSpecifically, a position only needs to hold enough of asset Xto\\ncover price movement to its upper bound, because upwards price\\nmovement1corresponds to depletion of the Xreserves. Similarly,\\nit only needs to hold enough of asset Yto cover price movement\\nto its lower bound. Fig. 
1 depicts this relationship for a position on\\na range[𝑝𝑎, 𝑝𝑏]and a current price 𝑝𝑐∈[𝑝𝑎, 𝑝𝑏].𝑥realand𝑦real\\ndenote the position’s real reserves.\\nWhen the price exits a position’s range, the position’s liquidity\\nis no longer active, and no longer earns fees. At that point, its\\n1We take asset Yto be the unit of account, which corresponds to token1 in our\\nimplementation.liquidity is composed entirely of a single asset, because the reserves\\nof the other asset must have been entirely depleted. If the price ever\\nreenters the range, the liquidity becomes active again.\\nThe amount of liquidity provided can be measured by the value\\n𝐿, which is equal to√\\n𝑘. The real reserves of a position are described\\nby the curve:\\n(𝑥+𝐿√𝑝𝑏)(𝑦+𝐿√𝑝𝑎)=𝐿2(2.2)\\nThis curve is a translation of formula 2.1 such that the position is\\nsolvent exactly within its range (Fig. 2).\\n𝑎𝑏\\nXReservesYReservesvirtual reserves (2.1)\\nreal reserves (2.2)\\nFigure 2: Real Reserves\\nLiquidity providers are free to create as many positions as they\\nsee fit, each on its own price range. In this way, LPs can approximate\\nany desired distribution of liquidity on the price space (see Fig. 3\\nfor a few examples). Moreover, this serves as a mechanism to let\\nthe market decide where liquidity should be allocated. Rational LPs\\ncan reduce their capital costs by concentrating their liquidity in\\na narrow band around the current price, and adding or removing\\ntokens as the price moves to keep their liquidity active.\\n2.1 Range Orders\\nPositions on very small ranges act similarly to limit orders—if the\\nrange is crossed, the position flips from being composed entirely\\nof one asset, to being composed entirely of the other asset (plus\\naccrued fees). 
There are two differences between this range order\\nand a traditional limit order:\\n•There is a limit to how narrow a position’s range can be.\\nWhile the price is within that range, the limit order might\\nbe partially executed.\\n•When the position has been crossed, it needs to be with-\\ndrawn. If it is not, and the price crosses back across that\\nrange, the position will be traded back, effectively reversing\\nthe trade.\\n2', metadata={'page': 1.0, 'source': 'uniswap-v3.pdf'})]"
372 |             ]
373 |           },
374 |           "metadata": {},
375 |           "execution_count": 40
376 |         }
377 |       ]
378 |     },
379 |     {
380 |       "cell_type": "code",
381 |       "source": [
382 |         "retriever = vectorstore.as_retriever(search_type=\"similarity_score_threshold\", search_kwargs={\"namespace\": USER_1, \"score_threshold\": .9})\n",
383 |         "\n",
384 |         "relevant_documents = retriever.get_relevant_documents(\"what are the short-term steps to help streamline billing?\")\n",
385 |         "relevant_documents"
386 |       ],
387 |       "metadata": {
388 |         "colab": {
389 |           "base_uri": "https://localhost:8080/"
390 |         },
391 |         "id": "1Ngz6ohfcklM",
392 |         "outputId": "0a302a2d-3c4d-4fa9-8222-52c1a462d8ee"
393 |       },
394 |       "execution_count": 51,
395 |       "outputs": [
396 |         {
397 |           "output_type": "execute_result",
398 |           "data": {
399 |             "text/plain": [
400 |               "[Document(page_content='Despite the rapid adoption of EHRs, \\nthere has not been a commensurate reduction in fragmentation and automation in the realm of billing.\\nThis is ironic considering that the \\nconventional wisdom among physicians is that billing is the primary purpose of EHRs in the first place. Standards for Application Programming Interfaces (APIs), which allow software from different devices to exchange information securely and efficiently, may help alleviate some inefficiencies in billing. (We’ll discuss APIs further below.) For instance, an API may give a payer access directly to the information it needs to process a claim, rather than having to exchange faxes with the doctor’s office. The change in recent years from transaction-oriented reimbursement schemes, in which health care providers are paid for, and have to document, each act they perform, to those that emphasize the value of health care services to the patient, has lowered the volume of forms and documentation somewhat. Fewer practices are required to document things that they’ve done to get reimbursement.Physicians: Junk the fax machine. Every doctor’s office ought to embrace electronic communications. “For the 30 to 40 percent of American practices that are unwilling to give up the fax machine and are unwilling to receive electronic payments, perhaps there has to be a culture shift,” Gilliland told symposium attendees. Federal and private payers may be able to help in this regard by providing incentives. Even simple changes such as physician practices shifting to a payer’s portal website, rather than insisting on using faxes and paper payments, would create significant efficiencies. \\nPhysicians must start accepting \\nelectronic payments. Many doctors’ offices are reluctant to allow payers to transfer funds electronically for fear that they would also be able to make withdrawals. Payers need to support physician practices. 
Providing in-kind support to providers in exchange for sharing clinical and claims information would help practices adopt technology. For instance, payers could provide a dashboard back to providers so that they can get analytical insights about their utilization and costs. This would have the added benefit of helping them make the transition to value-based billing.  \\nCreate common standards across \\npayers. If health care insurers and other payers agree on a common set of data and formats, they would greatly reduce the bureaucratic burden on small practices who now must fill out so many different forms. \\nStreamline pre-authorization. \\nDecreasing the hassle and time to process claim pre-authorizations would reduce inefficiencies and \\n2.3.Future of EHR\\n4.Gilliland identified some \\npractical, short-term steps  \\nto help streamline billing: \\n1.\\n5.', metadata={'page': 10.0, 'source': 'electronic-health-records.pdf'}),\n",
401 |               " Document(page_content='to use Epic’s analytics tools to measure \\nproductivity, to find the best workflow that increases the quality of patient care. During the measles outbreak, IT met with the hospital’s lawyers, quality-control personnel, nurses, and infectious disease experts to determine how best to modify EHRs to make them more useful in managing the crisis. The changes took a matter of hours to implement. \\nCentral to the success of Great Hudson \\nValley’s program is making analytics data available to physicians. Since the clinicians helped narrow down the list of priorities, it is important that they are able to access data they need to make appropriate decisions.\\nFor physicians and health care \\norganizations that handle Medicare and Medicaid patients, the federal government is beginning to move away from some of its more onerous requirements for documenting the patient-doctor interaction.\\nIts recent “Patients Over Paperwork”\\ninitiative, announced in June 2018, would consolidate some Medicare fee structures for outpatientvisits, reduce clerical tasks associatedwith coding and billing administration,and allow doctors and otherpractitioners to focus on documentedchanges since the patient’s last visitrather than re-documenting information. The Centers for Medicare and Medicaid Services (CMS) estimates that the new regulations would save nearly 30,000 hours spent on billing administration. If these changes are adopted, they might relieve some of the burden on physicians in the next few years. \\nInefficiencies in Billing \\nand Reimbursement \\nPhysicians and patients alike have their \\nfavorite anecdotes about the problems of the U.S. health care system. Many of these stories center on the process of billing and reimbursement. 
Terry Gilliland, MD, Senior Vice President and Chief Health Officer at Blue Shield of California, told the Stanford Medicine symposium that overhead for billing transactions accounts for 6 percent of a payer’s costs. One practice reported that it takes a trained registered nurse 45 minutes on average to get insurance pre-authorization for a CT scan. Another practice reported that, over the course of dealing with all of their different payers and care organizations, they are required to fill out 200 different forms. In general, health care billing in the U.S. is characterized by a pervasive fear of technology and inefficiency.\\n“This is an extremely fragmented \\nmanual process that is not benefitting many people, ” says Gilliland. \\nOne problem is a lack of automation \\nof manual processes. About a third of physician practices insist on doing business with paper forms and fax machines. A physician’s office might fill out a claims form on paper and fax it to the payer, who would pay someone to transcribe it into their system. Then the payer will identify information that is missing in the form but is needed to process a claim. The form then gets faxed back to the doctor’s  office, which addends it and faxes it back. The payer then has to pay someone to transcribe the information into the system. These kinds of inefficiencies drive up the cost of transactions. Future of EHR\\nInefficiencies\\nIn Billing and\\nReimbursement\\nOne practice reports it takes a \\ntrained nurse 45 minutes, on \\naverage, to get insurance\\npre-authorization for a CT scan.\\nAnother practice reports that, \\nover the course of dealing with all \\nof their different payers and care \\norganizations, they are required \\nto fill out 200 different forms.', metadata={'page': 9.0, 'source': 'electronic-health-records.pdf'}),\n",
402 |               " Document(page_content='For Medical Practices:  \\nInvest in adequate EHR training when \\nonboarding physicians and bring them up to speed when incremental changes are made;\\nEnlist physicians to help prioritize \\nEHR development tasks and to design clinical workflows that take advantage of EHR capabilities (e.g., the Sprint team model);\\nTailor the size and makeup of \\nphysician development teams, taking into account the clinical resources available;\\nDeliver EHR development projects \\nsoon after physicians ask for them;\\nEstablish an EHR governance process \\nthat gives the clinical organization nimbleness in responding to health emergencies and crisis scenarios;\\nMake analytics data available to \\nphysicians—presented in a way that is intuitive at the point of care;\\nShift non-essential EHR data entry \\nto ancillary staff. In the near term, consider increasing the number of medical assistants to act as “digital scribes” (though this option is expensive). In the long term, seek automated solutions to eliminate manual EHR documentation;Re-evaluate your organization’s interpretation of privacy rules;\\nCreate opportunities for patients \\nto digitally maintain their records (providing family history, medical history, medications, health monitoring data, etc.);\\nJunk the fax machine (if you still \\nhave one) and embrace electronic communications;\\nStart accepting electronic payments, \\nif you don’t already.\\nFor Payers: \\nEHRs are a reflection of the current fee-for-service payment paradigm. 
Commit to value-based care and provide adequate support to physicians under this model, including greater reimbursement for preventive care services and the use of digital health to engage patients;\\nCreate common standards for billing \\nand quality reporting across payers;\\nStreamline pre-authorization \\nprocedures;\\nMake claims data more accessible to \\nphysicians to enable a longitudinal view of their patients.For Regulators:  \\nAffirm commitment to value-based care and moving away from requiring literal documentation of patient-doctor interactions;\\nCreate more flexibility around who \\nneeds to enter data into the EHR, as many tasks do not require the expertise of a highly trained clinician;\\nClarify information-blocking rules to \\nencourage open APIs and eliminate perverse incentives to hoard information.A Summary  \\nof Action PointsFuture of EHR', metadata={'page': 18.0, 'source': 'electronic-health-records.pdf'})]"
403 |             ]
404 |           },
405 |           "metadata": {},
406 |           "execution_count": 51
407 |         }
408 |       ]
409 |     },
410 |     {
411 |       "cell_type": "code",
412 |       "source": [
413 |         "retriever = vectorstore.as_retriever(search_type=\"similarity_score_threshold\", search_kwargs={\"namespace\": USER_2, \"score_threshold\": .9})\n",
414 |         "\n",
415 |         "relevant_documents = retriever.get_relevant_documents(\"what are the short-term steps to help streamline billing?\")\n",
416 |         "relevant_documents"
417 |       ],
418 |       "metadata": {
419 |         "colab": {
420 |           "base_uri": "https://localhost:8080/"
421 |         },
422 |         "id": "G0FpbA_TclgC",
423 |         "outputId": "9018957d-b1f0-4207-f986-95a8370b58df"
424 |       },
425 |       "execution_count": 52,
426 |       "outputs": [
427 |         {
428 |           "output_type": "stream",
429 |           "name": "stderr",
430 |           "text": [
431 |             "/usr/local/lib/python3.10/dist-packages/langchain_core/vectorstores.py:325: UserWarning: No relevant docs were retrieved using the relevance score threshold 0.9\n",
432 |             "  warnings.warn(\n"
433 |           ]
434 |         },
435 |         {
436 |           "output_type": "execute_result",
437 |           "data": {
438 |             "text/plain": [
439 |               "[]"
440 |             ]
441 |           },
442 |           "metadata": {},
443 |           "execution_count": 52
444 |         }
445 |       ]
446 |     },
447 |     {
448 |       "cell_type": "code",
449 |       "source": [
450 |         "from langchain.chat_models import ChatOpenAI\n",
451 |         "from langchain.embeddings import OpenAIEmbeddings\n",
452 |         "from langchain.prompts import ChatPromptTemplate\n",
453 |         "from langchain_core.output_parsers import StrOutputParser\n",
454 |         "from langchain_core.runnables import RunnablePassthrough\n",
455 |         "\n",
456 |         "def ask_question_by_user(question, user):\n",
457 |         "  \"\"\"Print an answer to `question` grounded only in documents from the `user` namespace.\"\"\"\n",
458 |         "  retriever = vectorstore.as_retriever(search_type=\"similarity_score_threshold\", search_kwargs={\"namespace\": user, \"score_threshold\": 0.9})\n",
459 |         "  template = \"\"\"Answer the question based only on the following context which is wrapped by tags [CONTEXT][/CONTEXT]:\n",
460 |         "  [CONTEXT]{context}[/CONTEXT]\n",
461 |         "\n",
462 |         "  If you don't know the answer, or the context is empty, just say that you don't know, don't try to make up an answer.\n",
463 |         "\n",
464 |         "  Question: {question}\n",
465 |         "  \"\"\"\n",
466 |         "  prompt = ChatPromptTemplate.from_template(template)\n",
467 |         "  model = ChatOpenAI()\n",
468 |         "\n",
469 |         "  retrieval_chain = (\n",
470 |         "      {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
471 |         "      | prompt\n",
472 |         "      | model\n",
473 |         "      | StrOutputParser()\n",
474 |         "  )\n",
475 |         "\n",
476 |         "  response = retrieval_chain.invoke(question)\n",
477 |         "  print(response)"
478 |       ],
479 |       "metadata": {
480 |         "id": "051wIoVHd_Mu"
481 |       },
482 |       "execution_count": 53,
483 |       "outputs": []
484 |     },
485 |     {
486 |       "cell_type": "code",
487 |       "source": [
488 |         "ask_question_by_user(\"what are the short-term steps to help streamline billing?\", USER_1)"
489 |       ],
490 |       "metadata": {
491 |         "colab": {
492 |           "base_uri": "https://localhost:8080/"
493 |         },
494 |         "id": "W_mCk-wne_Jt",
495 |         "outputId": "4868a404-9009-4f99-fa43-c19eb9181bf3"
496 |       },
497 |       "execution_count": 61,
498 |       "outputs": [
499 |         {
500 |           "output_type": "stream",
501 |           "name": "stdout",
502 |           "text": [
503 |             "The short-term steps to help streamline billing include:\n",
504 |             "1. Encouraging physician practices to shift to a payer's portal website instead of using faxes and paper payments.\n",
505 |             "2. Encouraging physicians to start accepting electronic payments.\n",
506 |             "3. Providing in-kind support to providers in exchange for sharing clinical and claims information.\n",
507 |             "4. Creating common standards across payers to reduce the bureaucratic burden on small practices.\n",
508 |             "5. Streamlining pre-authorization procedures.\n"
509 |           ]
510 |         }
511 |       ]
512 |     },
513 |     {
514 |       "cell_type": "code",
515 |       "source": [
516 |         "ask_question_by_user(\"what are the short-term steps to help streamline billing?\", USER_2)"
517 |       ],
518 |       "metadata": {
519 |         "colab": {
520 |           "base_uri": "https://localhost:8080/"
521 |         },
522 |         "id": "dAqbP3DrfEEL",
523 |         "outputId": "66346e01-459f-4854-a934-2280b79046f9"
524 |       },
525 |       "execution_count": 62,
526 |       "outputs": [
527 |         {
528 |           "output_type": "stream",
529 |           "name": "stderr",
530 |           "text": [
531 |             "/usr/local/lib/python3.10/dist-packages/langchain_core/vectorstores.py:325: UserWarning: No relevant docs were retrieved using the relevance score threshold 0.9\n",
532 |             "  warnings.warn(\n"
533 |           ]
534 |         },
535 |         {
536 |           "output_type": "stream",
537 |           "name": "stdout",
538 |           "text": [
539 |             "I don't know.\n"
540 |           ]
541 |         }
542 |       ]
543 |     },
544 |     {
545 |       "cell_type": "code",
546 |       "source": [
547 |         "ask_question_by_user(\"what is uniswap?\", USER_1)"
548 |       ],
549 |       "metadata": {
550 |         "colab": {
551 |           "base_uri": "https://localhost:8080/"
552 |         },
553 |         "id": "mhp_AAFQfE3H",
554 |         "outputId": "f31ee1e3-bd64-4f1f-aefc-792abaf9eab1"
555 |       },
556 |       "execution_count": 63,
557 |       "outputs": [
558 |         {
559 |           "output_type": "stream",
560 |           "name": "stderr",
561 |           "text": [
562 |             "/usr/local/lib/python3.10/dist-packages/langchain_core/vectorstores.py:325: UserWarning: No relevant docs were retrieved using the relevance score threshold 0.9\n",
563 |             "  warnings.warn(\n"
564 |           ]
565 |         },
566 |         {
567 |           "output_type": "stream",
568 |           "name": "stdout",
569 |           "text": [
570 |             "I don't know.\n"
571 |           ]
572 |         }
573 |       ]
574 |     },
575 |     {
576 |       "cell_type": "code",
577 |       "source": [
578 |         "ask_question_by_user(\"what is uniswap?\", USER_2)"
579 |       ],
580 |       "metadata": {
581 |         "colab": {
582 |           "base_uri": "https://localhost:8080/"
583 |         },
584 |         "id": "8nr4zBYFfFp6",
585 |         "outputId": "747b4836-35d4-4a5b-a892-1402b3200bad"
586 |       },
587 |       "execution_count": 64,
588 |       "outputs": [
589 |         {
590 |           "output_type": "stream",
591 |           "name": "stdout",
592 |           "text": [
593 |             "Based on the provided context, Uniswap is a noncustodial automated market maker (AMM) implemented for the Ethereum Virtual Machine. It is a protocol that pools liquidity and makes it available to traders according to an algorithm. Uniswap v3, the version described in the context, offers increased capital efficiency, fine-tuned control to liquidity providers, an improved price oracle, and a more flexible fee structure compared to earlier versions.\n"
594 |           ]
595 |         }
596 |       ]
597 |     }
598 |   ]
599 | }
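
The notebook above isolates each user's documents by passing a `namespace` and a `score_threshold` to the retriever, so a question only matches documents in that user's namespace and weak matches are dropped. The sketch below illustrates that idea with a hypothetical in-memory store. `NamespacedStore`, its methods, and the two-dimensional vectors are all assumptions made for illustration; this is not the vector store or the embedding model the notebook uses, and raw cosine similarity here only stands in for the normalized relevance score a real retriever reports.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class NamespacedStore:
    """Hypothetical toy store: one list of (vector, text) pairs per namespace."""

    def __init__(self):
        self._items = defaultdict(list)

    def add(self, namespace, vector, text):
        self._items[namespace].append((vector, text))

    def search(self, namespace, query, score_threshold=0.9):
        # Only the requested namespace is scanned, so other users' data is never scored.
        scored = [(cosine(query, v), t) for v, t in self._items.get(namespace, [])]
        # Drop anything below the threshold, then return the best matches first.
        hits = [(s, t) for s, t in scored if s >= score_threshold]
        return [t for s, t in sorted(hits, key=lambda p: p[0], reverse=True)]


store = NamespacedStore()
store.add("user_1", [1.0, 0.0], "billing notes")   # stand-in for the EHR document
store.add("user_2", [0.0, 1.0], "uniswap notes")   # stand-in for the Uniswap paper

billing_query = [0.99, 0.05]  # made-up query vector close to the billing document
user_1_hits = store.search("user_1", billing_query)
user_2_hits = store.search("user_2", billing_query)
print(user_1_hits, user_2_hits)
```

As in the notebook runs, the same question finds a match in one user's namespace and returns an empty list in the other, which is what lets the prompt fall back to "I don't know."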


--------------------------------------------------------------------------------
/01_semi_structured_data.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |   "nbformat": 4,
   3 |   "nbformat_minor": 0,
   4 |   "metadata": {
   5 |     "colab": {
   6 |       "provenance": [],
   7 |       "gpuType": "T4",
   8 |       "authorship_tag": "ABX9TyPnIDleZ4upjO9LLlSfEb5e",
   9 |       "include_colab_link": true
  10 |     },
  11 |     "kernelspec": {
  12 |       "name": "python3",
  13 |       "display_name": "Python 3"
  14 |     },
  15 |     "language_info": {
  16 |       "name": "python"
  17 |     },
  18 |     "widgets": {
  19 |       "application/vnd.jupyter.widget-state+json": {
  20 |         "ac0fab4eb8e543c1a0f5b037e0235815": {
  21 |           "model_module": "@jupyter-widgets/controls",
  22 |           "model_name": "HBoxModel",
  23 |           "model_module_version": "1.5.0",
  24 |           "state": {
  25 |             "_dom_classes": [],
  26 |             "_model_module": "@jupyter-widgets/controls",
  27 |             "_model_module_version": "1.5.0",
  28 |             "_model_name": "HBoxModel",
  29 |             "_view_count": null,
  30 |             "_view_module": "@jupyter-widgets/controls",
  31 |             "_view_module_version": "1.5.0",
  32 |             "_view_name": "HBoxView",
  33 |             "box_style": "",
  34 |             "children": [
  35 |               "IPY_MODEL_3649dec7e417482091fc2f86939e0c16",
  36 |               "IPY_MODEL_aa165def4267499fa5bd66c7a23aa83d",
  37 |               "IPY_MODEL_4ee55dbff1674ba9bdb0614692dc64e9"
  38 |             ],
  39 |             "layout": "IPY_MODEL_54445f16b2e5439dbea3d25c7d690933"
  40 |           }
  41 |         },
  42 |         "3649dec7e417482091fc2f86939e0c16": {
  43 |           "model_module": "@jupyter-widgets/controls",
  44 |           "model_name": "HTMLModel",
  45 |           "model_module_version": "1.5.0",
  46 |           "state": {
  47 |             "_dom_classes": [],
  48 |             "_model_module": "@jupyter-widgets/controls",
  49 |             "_model_module_version": "1.5.0",
  50 |             "_model_name": "HTMLModel",
  51 |             "_view_count": null,
  52 |             "_view_module": "@jupyter-widgets/controls",
  53 |             "_view_module_version": "1.5.0",
  54 |             "_view_name": "HTMLView",
  55 |             "description": "",
  56 |             "description_tooltip": null,
  57 |             "layout": "IPY_MODEL_b240ed924ce94639b137552549ffd29e",
  58 |             "placeholder": "​",
  59 |             "style": "IPY_MODEL_c0f954fb5893438fbc5306ba55d44c20",
  60 |             "value": "yolox_l0.05.onnx: 100%"
  61 |           }
  62 |         },
  63 |         "aa165def4267499fa5bd66c7a23aa83d": {
  64 |           "model_module": "@jupyter-widgets/controls",
  65 |           "model_name": "FloatProgressModel",
  66 |           "model_module_version": "1.5.0",
  67 |           "state": {
  68 |             "_dom_classes": [],
  69 |             "_model_module": "@jupyter-widgets/controls",
  70 |             "_model_module_version": "1.5.0",
  71 |             "_model_name": "FloatProgressModel",
  72 |             "_view_count": null,
  73 |             "_view_module": "@jupyter-widgets/controls",
  74 |             "_view_module_version": "1.5.0",
  75 |             "_view_name": "ProgressView",
  76 |             "bar_style": "success",
  77 |             "description": "",
  78 |             "description_tooltip": null,
  79 |             "layout": "IPY_MODEL_cf8f1809505a436c949d936e74f93094",
  80 |             "max": 216625723,
  81 |             "min": 0,
  82 |             "orientation": "horizontal",
  83 |             "style": "IPY_MODEL_7639470b337e47b0b7a60d32c8770744",
  84 |             "value": 216625723
  85 |           }
  86 |         },
  87 |         "4ee55dbff1674ba9bdb0614692dc64e9": {
  88 |           "model_module": "@jupyter-widgets/controls",
  89 |           "model_name": "HTMLModel",
  90 |           "model_module_version": "1.5.0",
  91 |           "state": {
  92 |             "_dom_classes": [],
  93 |             "_model_module": "@jupyter-widgets/controls",
  94 |             "_model_module_version": "1.5.0",
  95 |             "_model_name": "HTMLModel",
  96 |             "_view_count": null,
  97 |             "_view_module": "@jupyter-widgets/controls",
  98 |             "_view_module_version": "1.5.0",
  99 |             "_view_name": "HTMLView",
 100 |             "description": "",
 101 |             "description_tooltip": null,
 102 |             "layout": "IPY_MODEL_17913e64f28b4a2daf9f9d21654c6372",
 103 |             "placeholder": "​",
 104 |             "style": "IPY_MODEL_ef27bad8255f447da09541fc10f80ad8",
 105 |             "value": " 217M/217M [00:01&lt;00:00, 160MB/s]"
 106 |           }
 107 |         },
 108 |         "54445f16b2e5439dbea3d25c7d690933": {
 109 |           "model_module": "@jupyter-widgets/base",
 110 |           "model_name": "LayoutModel",
 111 |           "model_module_version": "1.2.0",
 112 |           "state": {
 113 |             "_model_module": "@jupyter-widgets/base",
 114 |             "_model_module_version": "1.2.0",
 115 |             "_model_name": "LayoutModel",
 116 |             "_view_count": null,
 117 |             "_view_module": "@jupyter-widgets/base",
 118 |             "_view_module_version": "1.2.0",
 119 |             "_view_name": "LayoutView",
 120 |             "align_content": null,
 121 |             "align_items": null,
 122 |             "align_self": null,
 123 |             "border": null,
 124 |             "bottom": null,
 125 |             "display": null,
 126 |             "flex": null,
 127 |             "flex_flow": null,
 128 |             "grid_area": null,
 129 |             "grid_auto_columns": null,
 130 |             "grid_auto_flow": null,
 131 |             "grid_auto_rows": null,
 132 |             "grid_column": null,
 133 |             "grid_gap": null,
 134 |             "grid_row": null,
 135 |             "grid_template_areas": null,
 136 |             "grid_template_columns": null,
 137 |             "grid_template_rows": null,
 138 |             "height": null,
 139 |             "justify_content": null,
 140 |             "justify_items": null,
 141 |             "left": null,
 142 |             "margin": null,
 143 |             "max_height": null,
 144 |             "max_width": null,
 145 |             "min_height": null,
 146 |             "min_width": null,
 147 |             "object_fit": null,
 148 |             "object_position": null,
 149 |             "order": null,
 150 |             "overflow": null,
 151 |             "overflow_x": null,
 152 |             "overflow_y": null,
 153 |             "padding": null,
 154 |             "right": null,
 155 |             "top": null,
 156 |             "visibility": null,
 157 |             "width": null
 158 |           }
 159 |         },
 160 |         "b240ed924ce94639b137552549ffd29e": {
 161 |           "model_module": "@jupyter-widgets/base",
 162 |           "model_name": "LayoutModel",
 163 |           "model_module_version": "1.2.0",
 164 |           "state": {
 165 |             "_model_module": "@jupyter-widgets/base",
 166 |             "_model_module_version": "1.2.0",
 167 |             "_model_name": "LayoutModel",
 168 |             "_view_count": null,
 169 |             "_view_module": "@jupyter-widgets/base",
 170 |             "_view_module_version": "1.2.0",
 171 |             "_view_name": "LayoutView",
 172 |             "align_content": null,
 173 |             "align_items": null,
 174 |             "align_self": null,
 175 |             "border": null,
 176 |             "bottom": null,
 177 |             "display": null,
 178 |             "flex": null,
 179 |             "flex_flow": null,
 180 |             "grid_area": null,
 181 |             "grid_auto_columns": null,
 182 |             "grid_auto_flow": null,
 183 |             "grid_auto_rows": null,
 184 |             "grid_column": null,
 185 |             "grid_gap": null,
 186 |             "grid_row": null,
 187 |             "grid_template_areas": null,
 188 |             "grid_template_columns": null,
 189 |             "grid_template_rows": null,
 190 |             "height": null,
 191 |             "justify_content": null,
 192 |             "justify_items": null,
 193 |             "left": null,
 194 |             "margin": null,
 195 |             "max_height": null,
 196 |             "max_width": null,
 197 |             "min_height": null,
 198 |             "min_width": null,
 199 |             "object_fit": null,
 200 |             "object_position": null,
 201 |             "order": null,
 202 |             "overflow": null,
 203 |             "overflow_x": null,
 204 |             "overflow_y": null,
 205 |             "padding": null,
 206 |             "right": null,
 207 |             "top": null,
 208 |             "visibility": null,
 209 |             "width": null
 210 |           }
 211 |         },
 212 |         "c0f954fb5893438fbc5306ba55d44c20": {
 213 |           "model_module": "@jupyter-widgets/controls",
 214 |           "model_name": "DescriptionStyleModel",
 215 |           "model_module_version": "1.5.0",
 216 |           "state": {
 217 |             "_model_module": "@jupyter-widgets/controls",
 218 |             "_model_module_version": "1.5.0",
 219 |             "_model_name": "DescriptionStyleModel",
 220 |             "_view_count": null,
 221 |             "_view_module": "@jupyter-widgets/base",
 222 |             "_view_module_version": "1.2.0",
 223 |             "_view_name": "StyleView",
 224 |             "description_width": ""
 225 |           }
 226 |         },
 227 |         "cf8f1809505a436c949d936e74f93094": {
 228 |           "model_module": "@jupyter-widgets/base",
 229 |           "model_name": "LayoutModel",
 230 |           "model_module_version": "1.2.0",
 231 |           "state": {
 232 |             "_model_module": "@jupyter-widgets/base",
 233 |             "_model_module_version": "1.2.0",
 234 |             "_model_name": "LayoutModel",
 235 |             "_view_count": null,
 236 |             "_view_module": "@jupyter-widgets/base",
 237 |             "_view_module_version": "1.2.0",
 238 |             "_view_name": "LayoutView",
 239 |             "align_content": null,
 240 |             "align_items": null,
 241 |             "align_self": null,
 242 |             "border": null,
 243 |             "bottom": null,
 244 |             "display": null,
 245 |             "flex": null,
 246 |             "flex_flow": null,
 247 |             "grid_area": null,
 248 |             "grid_auto_columns": null,
 249 |             "grid_auto_flow": null,
 250 |             "grid_auto_rows": null,
 251 |             "grid_column": null,
 252 |             "grid_gap": null,
 253 |             "grid_row": null,
 254 |             "grid_template_areas": null,
 255 |             "grid_template_columns": null,
 256 |             "grid_template_rows": null,
 257 |             "height": null,
 258 |             "justify_content": null,
 259 |             "justify_items": null,
 260 |             "left": null,
 261 |             "margin": null,
 262 |             "max_height": null,
 263 |             "max_width": null,
 264 |             "min_height": null,
 265 |             "min_width": null,
 266 |             "object_fit": null,
 267 |             "object_position": null,
 268 |             "order": null,
 269 |             "overflow": null,
 270 |             "overflow_x": null,
 271 |             "overflow_y": null,
 272 |             "padding": null,
 273 |             "right": null,
 274 |             "top": null,
 275 |             "visibility": null,
 276 |             "width": null
 277 |           }
 278 |         },
 279 |         "7639470b337e47b0b7a60d32c8770744": {
 280 |           "model_module": "@jupyter-widgets/controls",
 281 |           "model_name": "ProgressStyleModel",
 282 |           "model_module_version": "1.5.0",
 283 |           "state": {
 284 |             "_model_module": "@jupyter-widgets/controls",
 285 |             "_model_module_version": "1.5.0",
 286 |             "_model_name": "ProgressStyleModel",
 287 |             "_view_count": null,
 288 |             "_view_module": "@jupyter-widgets/base",
 289 |             "_view_module_version": "1.2.0",
 290 |             "_view_name": "StyleView",
 291 |             "bar_color": null,
 292 |             "description_width": ""
 293 |           }
 294 |         },
 295 |         "17913e64f28b4a2daf9f9d21654c6372": {
 296 |           "model_module": "@jupyter-widgets/base",
 297 |           "model_name": "LayoutModel",
 298 |           "model_module_version": "1.2.0",
 299 |           "state": {
 300 |             "_model_module": "@jupyter-widgets/base",
 301 |             "_model_module_version": "1.2.0",
 302 |             "_model_name": "LayoutModel",
 303 |             "_view_count": null,
 304 |             "_view_module": "@jupyter-widgets/base",
 305 |             "_view_module_version": "1.2.0",
 306 |             "_view_name": "LayoutView",
 307 |             "align_content": null,
 308 |             "align_items": null,
 309 |             "align_self": null,
 310 |             "border": null,
 311 |             "bottom": null,
 312 |             "display": null,
 313 |             "flex": null,
 314 |             "flex_flow": null,
 315 |             "grid_area": null,
 316 |             "grid_auto_columns": null,
 317 |             "grid_auto_flow": null,
 318 |             "grid_auto_rows": null,
 319 |             "grid_column": null,
 320 |             "grid_gap": null,
 321 |             "grid_row": null,
 322 |             "grid_template_areas": null,
 323 |             "grid_template_columns": null,
 324 |             "grid_template_rows": null,
 325 |             "height": null,
 326 |             "justify_content": null,
 327 |             "justify_items": null,
 328 |             "left": null,
 329 |             "margin": null,
 330 |             "max_height": null,
 331 |             "max_width": null,
 332 |             "min_height": null,
 333 |             "min_width": null,
 334 |             "object_fit": null,
 335 |             "object_position": null,
 336 |             "order": null,
 337 |             "overflow": null,
 338 |             "overflow_x": null,
 339 |             "overflow_y": null,
 340 |             "padding": null,
 341 |             "right": null,
 342 |             "top": null,
 343 |             "visibility": null,
 344 |             "width": null
 345 |           }
 346 |         },
 347 |         "ef27bad8255f447da09541fc10f80ad8": {
 348 |           "model_module": "@jupyter-widgets/controls",
 349 |           "model_name": "DescriptionStyleModel",
 350 |           "model_module_version": "1.5.0",
 351 |           "state": {
 352 |             "_model_module": "@jupyter-widgets/controls",
 353 |             "_model_module_version": "1.5.0",
 354 |             "_model_name": "DescriptionStyleModel",
 355 |             "_view_count": null,
 356 |             "_view_module": "@jupyter-widgets/base",
 357 |             "_view_module_version": "1.2.0",
 358 |             "_view_name": "StyleView",
 359 |             "description_width": ""
 360 |           }
 361 |         },
 362 |         "8ee21ab783c543c0944e8732faaebb4a": {
 363 |           "model_module": "@jupyter-widgets/controls",
 364 |           "model_name": "HBoxModel",
 365 |           "model_module_version": "1.5.0",
 366 |           "state": {
 367 |             "_dom_classes": [],
 368 |             "_model_module": "@jupyter-widgets/controls",
 369 |             "_model_module_version": "1.5.0",
 370 |             "_model_name": "HBoxModel",
 371 |             "_view_count": null,
 372 |             "_view_module": "@jupyter-widgets/controls",
 373 |             "_view_module_version": "1.5.0",
 374 |             "_view_name": "HBoxView",
 375 |             "box_style": "",
 376 |             "children": [
 377 |               "IPY_MODEL_2024899dae44406a9e187c5d770f873f",
 378 |               "IPY_MODEL_2174f17cbb124ab2bf1a32765c0d5a0c",
 379 |               "IPY_MODEL_330603dd11944ad9a267378e3a17a573"
 380 |             ],
 381 |             "layout": "IPY_MODEL_674ae68fb32d40b48152f29f77636138"
 382 |           }
 383 |         },
 384 |         "2024899dae44406a9e187c5d770f873f": {
 385 |           "model_module": "@jupyter-widgets/controls",
 386 |           "model_name": "HTMLModel",
 387 |           "model_module_version": "1.5.0",
 388 |           "state": {
 389 |             "_dom_classes": [],
 390 |             "_model_module": "@jupyter-widgets/controls",
 391 |             "_model_module_version": "1.5.0",
 392 |             "_model_name": "HTMLModel",
 393 |             "_view_count": null,
 394 |             "_view_module": "@jupyter-widgets/controls",
 395 |             "_view_module_version": "1.5.0",
 396 |             "_view_name": "HTMLView",
 397 |             "description": "",
 398 |             "description_tooltip": null,
 399 |             "layout": "IPY_MODEL_c1cd34e355a540f58cebe58990bee350",
 400 |             "placeholder": "​",
 401 |             "style": "IPY_MODEL_2248e9a15d694e7f94b6d817a0bceb90",
 402 |             "value": "config.json: 100%"
 403 |           }
 404 |         },
 405 |         "2174f17cbb124ab2bf1a32765c0d5a0c": {
 406 |           "model_module": "@jupyter-widgets/controls",
 407 |           "model_name": "FloatProgressModel",
 408 |           "model_module_version": "1.5.0",
 409 |           "state": {
 410 |             "_dom_classes": [],
 411 |             "_model_module": "@jupyter-widgets/controls",
 412 |             "_model_module_version": "1.5.0",
 413 |             "_model_name": "FloatProgressModel",
 414 |             "_view_count": null,
 415 |             "_view_module": "@jupyter-widgets/controls",
 416 |             "_view_module_version": "1.5.0",
 417 |             "_view_name": "ProgressView",
 418 |             "bar_style": "success",
 419 |             "description": "",
 420 |             "description_tooltip": null,
 421 |             "layout": "IPY_MODEL_2477901c30b44069ac5db5cf4e8924f0",
 422 |             "max": 1469,
 423 |             "min": 0,
 424 |             "orientation": "horizontal",
 425 |             "style": "IPY_MODEL_f2d4643e36634b668eb2d749d74e7780",
 426 |             "value": 1469
 427 |           }
 428 |         },
 429 |         "330603dd11944ad9a267378e3a17a573": {
 430 |           "model_module": "@jupyter-widgets/controls",
 431 |           "model_name": "HTMLModel",
 432 |           "model_module_version": "1.5.0",
 433 |           "state": {
 434 |             "_dom_classes": [],
 435 |             "_model_module": "@jupyter-widgets/controls",
 436 |             "_model_module_version": "1.5.0",
 437 |             "_model_name": "HTMLModel",
 438 |             "_view_count": null,
 439 |             "_view_module": "@jupyter-widgets/controls",
 440 |             "_view_module_version": "1.5.0",
 441 |             "_view_name": "HTMLView",
 442 |             "description": "",
 443 |             "description_tooltip": null,
 444 |             "layout": "IPY_MODEL_f9a39e14cb0140e8ac09cfb400344804",
 445 |             "placeholder": "​",
 446 |             "style": "IPY_MODEL_9e0916f00d374a858a3edb397a63af8d",
 447 |             "value": " 1.47k/1.47k [00:00&lt;00:00, 71.8kB/s]"
 448 |           }
 449 |         },
 450 |         "674ae68fb32d40b48152f29f77636138": {
 451 |           "model_module": "@jupyter-widgets/base",
 452 |           "model_name": "LayoutModel",
 453 |           "model_module_version": "1.2.0",
 454 |           "state": {
 455 |             "_model_module": "@jupyter-widgets/base",
 456 |             "_model_module_version": "1.2.0",
 457 |             "_model_name": "LayoutModel",
 458 |             "_view_count": null,
 459 |             "_view_module": "@jupyter-widgets/base",
 460 |             "_view_module_version": "1.2.0",
 461 |             "_view_name": "LayoutView",
 462 |             "align_content": null,
 463 |             "align_items": null,
 464 |             "align_self": null,
 465 |             "border": null,
 466 |             "bottom": null,
 467 |             "display": null,
 468 |             "flex": null,
 469 |             "flex_flow": null,
 470 |             "grid_area": null,
 471 |             "grid_auto_columns": null,
 472 |             "grid_auto_flow": null,
 473 |             "grid_auto_rows": null,
 474 |             "grid_column": null,
 475 |             "grid_gap": null,
 476 |             "grid_row": null,
 477 |             "grid_template_areas": null,
 478 |             "grid_template_columns": null,
 479 |             "grid_template_rows": null,
 480 |             "height": null,
 481 |             "justify_content": null,
 482 |             "justify_items": null,
 483 |             "left": null,
 484 |             "margin": null,
 485 |             "max_height": null,
 486 |             "max_width": null,
 487 |             "min_height": null,
 488 |             "min_width": null,
 489 |             "object_fit": null,
 490 |             "object_position": null,
 491 |             "order": null,
 492 |             "overflow": null,
 493 |             "overflow_x": null,
 494 |             "overflow_y": null,
 495 |             "padding": null,
 496 |             "right": null,
 497 |             "top": null,
 498 |             "visibility": null,
 499 |             "width": null
 500 |           }
 501 |         },
 502 |         "c1cd34e355a540f58cebe58990bee350": {
 503 |           "model_module": "@jupyter-widgets/base",
 504 |           "model_name": "LayoutModel",
 505 |           "model_module_version": "1.2.0",
 506 |           "state": {
 507 |             "_model_module": "@jupyter-widgets/base",
 508 |             "_model_module_version": "1.2.0",
 509 |             "_model_name": "LayoutModel",
 510 |             "_view_count": null,
 511 |             "_view_module": "@jupyter-widgets/base",
 512 |             "_view_module_version": "1.2.0",
 513 |             "_view_name": "LayoutView",
 514 |             "align_content": null,
 515 |             "align_items": null,
 516 |             "align_self": null,
 517 |             "border": null,
 518 |             "bottom": null,
 519 |             "display": null,
 520 |             "flex": null,
 521 |             "flex_flow": null,
 522 |             "grid_area": null,
 523 |             "grid_auto_columns": null,
 524 |             "grid_auto_flow": null,
 525 |             "grid_auto_rows": null,
 526 |             "grid_column": null,
 527 |             "grid_gap": null,
 528 |             "grid_row": null,
 529 |             "grid_template_areas": null,
 530 |             "grid_template_columns": null,
 531 |             "grid_template_rows": null,
 532 |             "height": null,
 533 |             "justify_content": null,
 534 |             "justify_items": null,
 535 |             "left": null,
 536 |             "margin": null,
 537 |             "max_height": null,
 538 |             "max_width": null,
 539 |             "min_height": null,
 540 |             "min_width": null,
 541 |             "object_fit": null,
 542 |             "object_position": null,
 543 |             "order": null,
 544 |             "overflow": null,
 545 |             "overflow_x": null,
 546 |             "overflow_y": null,
 547 |             "padding": null,
 548 |             "right": null,
 549 |             "top": null,
 550 |             "visibility": null,
 551 |             "width": null
 552 |           }
 553 |         },
 554 |         "2248e9a15d694e7f94b6d817a0bceb90": {
 555 |           "model_module": "@jupyter-widgets/controls",
 556 |           "model_name": "DescriptionStyleModel",
 557 |           "model_module_version": "1.5.0",
 558 |           "state": {
 559 |             "_model_module": "@jupyter-widgets/controls",
 560 |             "_model_module_version": "1.5.0",
 561 |             "_model_name": "DescriptionStyleModel",
 562 |             "_view_count": null,
 563 |             "_view_module": "@jupyter-widgets/base",
 564 |             "_view_module_version": "1.2.0",
 565 |             "_view_name": "StyleView",
 566 |             "description_width": ""
 567 |           }
 568 |         },
 569 |         "2477901c30b44069ac5db5cf4e8924f0": {
 570 |           "model_module": "@jupyter-widgets/base",
 571 |           "model_name": "LayoutModel",
 572 |           "model_module_version": "1.2.0",
 573 |           "state": {
 574 |             "_model_module": "@jupyter-widgets/base",
 575 |             "_model_module_version": "1.2.0",
 576 |             "_model_name": "LayoutModel",
 577 |             "_view_count": null,
 578 |             "_view_module": "@jupyter-widgets/base",
 579 |             "_view_module_version": "1.2.0",
 580 |             "_view_name": "LayoutView",
 581 |             "align_content": null,
 582 |             "align_items": null,
 583 |             "align_self": null,
 584 |             "border": null,
 585 |             "bottom": null,
 586 |             "display": null,
 587 |             "flex": null,
 588 |             "flex_flow": null,
 589 |             "grid_area": null,
 590 |             "grid_auto_columns": null,
 591 |             "grid_auto_flow": null,
 592 |             "grid_auto_rows": null,
 593 |             "grid_column": null,
 594 |             "grid_gap": null,
 595 |             "grid_row": null,
 596 |             "grid_template_areas": null,
 597 |             "grid_template_columns": null,
 598 |             "grid_template_rows": null,
 599 |             "height": null,
 600 |             "justify_content": null,
 601 |             "justify_items": null,
 602 |             "left": null,
 603 |             "margin": null,
 604 |             "max_height": null,
 605 |             "max_width": null,
 606 |             "min_height": null,
 607 |             "min_width": null,
 608 |             "object_fit": null,
 609 |             "object_position": null,
 610 |             "order": null,
 611 |             "overflow": null,
 612 |             "overflow_x": null,
 613 |             "overflow_y": null,
 614 |             "padding": null,
 615 |             "right": null,
 616 |             "top": null,
 617 |             "visibility": null,
 618 |             "width": null
 619 |           }
 620 |         },
 621 |         "f2d4643e36634b668eb2d749d74e7780": {
 622 |           "model_module": "@jupyter-widgets/controls",
 623 |           "model_name": "ProgressStyleModel",
 624 |           "model_module_version": "1.5.0",
 625 |           "state": {
 626 |             "_model_module": "@jupyter-widgets/controls",
 627 |             "_model_module_version": "1.5.0",
 628 |             "_model_name": "ProgressStyleModel",
 629 |             "_view_count": null,
 630 |             "_view_module": "@jupyter-widgets/base",
 631 |             "_view_module_version": "1.2.0",
 632 |             "_view_name": "StyleView",
 633 |             "bar_color": null,
 634 |             "description_width": ""
 635 |           }
 636 |         },
 637 |         "f9a39e14cb0140e8ac09cfb400344804": {
 638 |           "model_module": "@jupyter-widgets/base",
 639 |           "model_name": "LayoutModel",
 640 |           "model_module_version": "1.2.0",
 641 |           "state": {
 642 |             "_model_module": "@jupyter-widgets/base",
 643 |             "_model_module_version": "1.2.0",
 644 |             "_model_name": "LayoutModel",
 645 |             "_view_count": null,
 646 |             "_view_module": "@jupyter-widgets/base",
 647 |             "_view_module_version": "1.2.0",
 648 |             "_view_name": "LayoutView",
 649 |             "align_content": null,
 650 |             "align_items": null,
 651 |             "align_self": null,
 652 |             "border": null,
 653 |             "bottom": null,
 654 |             "display": null,
 655 |             "flex": null,
 656 |             "flex_flow": null,
 657 |             "grid_area": null,
 658 |             "grid_auto_columns": null,
 659 |             "grid_auto_flow": null,
 660 |             "grid_auto_rows": null,
 661 |             "grid_column": null,
 662 |             "grid_gap": null,
 663 |             "grid_row": null,
 664 |             "grid_template_areas": null,
 665 |             "grid_template_columns": null,
 666 |             "grid_template_rows": null,
 667 |             "height": null,
 668 |             "justify_content": null,
 669 |             "justify_items": null,
 670 |             "left": null,
 671 |             "margin": null,
 672 |             "max_height": null,
 673 |             "max_width": null,
 674 |             "min_height": null,
 675 |             "min_width": null,
 676 |             "object_fit": null,
 677 |             "object_position": null,
 678 |             "order": null,
 679 |             "overflow": null,
 680 |             "overflow_x": null,
 681 |             "overflow_y": null,
 682 |             "padding": null,
 683 |             "right": null,
 684 |             "top": null,
 685 |             "visibility": null,
 686 |             "width": null
 687 |           }
 688 |         },
 689 |         "9e0916f00d374a858a3edb397a63af8d": {
 690 |           "model_module": "@jupyter-widgets/controls",
 691 |           "model_name": "DescriptionStyleModel",
 692 |           "model_module_version": "1.5.0",
 693 |           "state": {
 694 |             "_model_module": "@jupyter-widgets/controls",
 695 |             "_model_module_version": "1.5.0",
 696 |             "_model_name": "DescriptionStyleModel",
 697 |             "_view_count": null,
 698 |             "_view_module": "@jupyter-widgets/base",
 699 |             "_view_module_version": "1.2.0",
 700 |             "_view_name": "StyleView",
 701 |             "description_width": ""
 702 |           }
 703 |         },
 704 |         "5b720610ad684d5abe3ca6f3acdf6606": {
 705 |           "model_module": "@jupyter-widgets/controls",
 706 |           "model_name": "HBoxModel",
 707 |           "model_module_version": "1.5.0",
 708 |           "state": {
 709 |             "_dom_classes": [],
 710 |             "_model_module": "@jupyter-widgets/controls",
 711 |             "_model_module_version": "1.5.0",
 712 |             "_model_name": "HBoxModel",
 713 |             "_view_count": null,
 714 |             "_view_module": "@jupyter-widgets/controls",
 715 |             "_view_module_version": "1.5.0",
 716 |             "_view_name": "HBoxView",
 717 |             "box_style": "",
 718 |             "children": [
 719 |               "IPY_MODEL_127c410899934087904875f527a2dbba",
 720 |               "IPY_MODEL_2f9d37688cfa46059f152c0e6b080904",
 721 |               "IPY_MODEL_cd3f489a29c8448c9a01f595690c0e48"
 722 |             ],
 723 |             "layout": "IPY_MODEL_f9056446dd594a2aaec5396995a5f065"
 724 |           }
 725 |         },
 726 |         "127c410899934087904875f527a2dbba": {
 727 |           "model_module": "@jupyter-widgets/controls",
 728 |           "model_name": "HTMLModel",
 729 |           "model_module_version": "1.5.0",
 730 |           "state": {
 731 |             "_dom_classes": [],
 732 |             "_model_module": "@jupyter-widgets/controls",
 733 |             "_model_module_version": "1.5.0",
 734 |             "_model_name": "HTMLModel",
 735 |             "_view_count": null,
 736 |             "_view_module": "@jupyter-widgets/controls",
 737 |             "_view_module_version": "1.5.0",
 738 |             "_view_name": "HTMLView",
 739 |             "description": "",
 740 |             "description_tooltip": null,
 741 |             "layout": "IPY_MODEL_42efb45824894d808c9e31ef7554a852",
 742 |             "placeholder": "​",
 743 |             "style": "IPY_MODEL_181f2590b9464b929bf9b4e3b5c88b5c",
 744 |             "value": "model.safetensors: 100%"
 745 |           }
 746 |         },
 747 |         "2f9d37688cfa46059f152c0e6b080904": {
 748 |           "model_module": "@jupyter-widgets/controls",
 749 |           "model_name": "FloatProgressModel",
 750 |           "model_module_version": "1.5.0",
 751 |           "state": {
 752 |             "_dom_classes": [],
 753 |             "_model_module": "@jupyter-widgets/controls",
 754 |             "_model_module_version": "1.5.0",
 755 |             "_model_name": "FloatProgressModel",
 756 |             "_view_count": null,
 757 |             "_view_module": "@jupyter-widgets/controls",
 758 |             "_view_module_version": "1.5.0",
 759 |             "_view_name": "ProgressView",
 760 |             "bar_style": "success",
 761 |             "description": "",
 762 |             "description_tooltip": null,
 763 |             "layout": "IPY_MODEL_4f40596f290146d8b53dc3dda50a09e7",
 764 |             "max": 115434268,
 765 |             "min": 0,
 766 |             "orientation": "horizontal",
 767 |             "style": "IPY_MODEL_9869010a1cb949f7ae1feeef095fc05e",
 768 |             "value": 115434268
 769 |           }
 770 |         },
 771 |         "cd3f489a29c8448c9a01f595690c0e48": {
 772 |           "model_module": "@jupyter-widgets/controls",
 773 |           "model_name": "HTMLModel",
 774 |           "model_module_version": "1.5.0",
 775 |           "state": {
 776 |             "_dom_classes": [],
 777 |             "_model_module": "@jupyter-widgets/controls",
 778 |             "_model_module_version": "1.5.0",
 779 |             "_model_name": "HTMLModel",
 780 |             "_view_count": null,
 781 |             "_view_module": "@jupyter-widgets/controls",
 782 |             "_view_module_version": "1.5.0",
 783 |             "_view_name": "HTMLView",
 784 |             "description": "",
 785 |             "description_tooltip": null,
 786 |             "layout": "IPY_MODEL_1cc0b1752a704fa0b8687d0411539055",
 787 |             "placeholder": "​",
 788 |             "style": "IPY_MODEL_089092011cf341379b9be17a052830f1",
 789 |             "value": " 115M/115M [00:00&lt;00:00, 187MB/s]"
 790 |           }
 791 |         },
 792 |         "f9056446dd594a2aaec5396995a5f065": {
 793 |           "model_module": "@jupyter-widgets/base",
 794 |           "model_name": "LayoutModel",
 795 |           "model_module_version": "1.2.0",
 796 |           "state": {
 797 |             "_model_module": "@jupyter-widgets/base",
 798 |             "_model_module_version": "1.2.0",
 799 |             "_model_name": "LayoutModel",
 800 |             "_view_count": null,
 801 |             "_view_module": "@jupyter-widgets/base",
 802 |             "_view_module_version": "1.2.0",
 803 |             "_view_name": "LayoutView",
 804 |             "align_content": null,
 805 |             "align_items": null,
 806 |             "align_self": null,
 807 |             "border": null,
 808 |             "bottom": null,
 809 |             "display": null,
 810 |             "flex": null,
 811 |             "flex_flow": null,
 812 |             "grid_area": null,
 813 |             "grid_auto_columns": null,
 814 |             "grid_auto_flow": null,
 815 |             "grid_auto_rows": null,
 816 |             "grid_column": null,
 817 |             "grid_gap": null,
 818 |             "grid_row": null,
 819 |             "grid_template_areas": null,
 820 |             "grid_template_columns": null,
 821 |             "grid_template_rows": null,
 822 |             "height": null,
 823 |             "justify_content": null,
 824 |             "justify_items": null,
 825 |             "left": null,
 826 |             "margin": null,
 827 |             "max_height": null,
 828 |             "max_width": null,
 829 |             "min_height": null,
 830 |             "min_width": null,
 831 |             "object_fit": null,
 832 |             "object_position": null,
 833 |             "order": null,
 834 |             "overflow": null,
 835 |             "overflow_x": null,
 836 |             "overflow_y": null,
 837 |             "padding": null,
 838 |             "right": null,
 839 |             "top": null,
 840 |             "visibility": null,
 841 |             "width": null
 842 |           }
 843 |         },
 844 |         "42efb45824894d808c9e31ef7554a852": {
 845 |           "model_module": "@jupyter-widgets/base",
 846 |           "model_name": "LayoutModel",
 847 |           "model_module_version": "1.2.0",
 848 |           "state": {
 849 |             "_model_module": "@jupyter-widgets/base",
 850 |             "_model_module_version": "1.2.0",
 851 |             "_model_name": "LayoutModel",
 852 |             "_view_count": null,
 853 |             "_view_module": "@jupyter-widgets/base",
 854 |             "_view_module_version": "1.2.0",
 855 |             "_view_name": "LayoutView",
 856 |             "align_content": null,
 857 |             "align_items": null,
 858 |             "align_self": null,
 859 |             "border": null,
 860 |             "bottom": null,
 861 |             "display": null,
 862 |             "flex": null,
 863 |             "flex_flow": null,
 864 |             "grid_area": null,
 865 |             "grid_auto_columns": null,
 866 |             "grid_auto_flow": null,
 867 |             "grid_auto_rows": null,
 868 |             "grid_column": null,
 869 |             "grid_gap": null,
 870 |             "grid_row": null,
 871 |             "grid_template_areas": null,
 872 |             "grid_template_columns": null,
 873 |             "grid_template_rows": null,
 874 |             "height": null,
 875 |             "justify_content": null,
 876 |             "justify_items": null,
 877 |             "left": null,
 878 |             "margin": null,
 879 |             "max_height": null,
 880 |             "max_width": null,
 881 |             "min_height": null,
 882 |             "min_width": null,
 883 |             "object_fit": null,
 884 |             "object_position": null,
 885 |             "order": null,
 886 |             "overflow": null,
 887 |             "overflow_x": null,
 888 |             "overflow_y": null,
 889 |             "padding": null,
 890 |             "right": null,
 891 |             "top": null,
 892 |             "visibility": null,
 893 |             "width": null
 894 |           }
 895 |         },
 896 |         "181f2590b9464b929bf9b4e3b5c88b5c": {
 897 |           "model_module": "@jupyter-widgets/controls",
 898 |           "model_name": "DescriptionStyleModel",
 899 |           "model_module_version": "1.5.0",
 900 |           "state": {
 901 |             "_model_module": "@jupyter-widgets/controls",
 902 |             "_model_module_version": "1.5.0",
 903 |             "_model_name": "DescriptionStyleModel",
 904 |             "_view_count": null,
 905 |             "_view_module": "@jupyter-widgets/base",
 906 |             "_view_module_version": "1.2.0",
 907 |             "_view_name": "StyleView",
 908 |             "description_width": ""
 909 |           }
 910 |         },
 911 |         "4f40596f290146d8b53dc3dda50a09e7": {
 912 |           "model_module": "@jupyter-widgets/base",
 913 |           "model_name": "LayoutModel",
 914 |           "model_module_version": "1.2.0",
 915 |           "state": {
 916 |             "_model_module": "@jupyter-widgets/base",
 917 |             "_model_module_version": "1.2.0",
 918 |             "_model_name": "LayoutModel",
 919 |             "_view_count": null,
 920 |             "_view_module": "@jupyter-widgets/base",
 921 |             "_view_module_version": "1.2.0",
 922 |             "_view_name": "LayoutView",
 923 |             "align_content": null,
 924 |             "align_items": null,
 925 |             "align_self": null,
 926 |             "border": null,
 927 |             "bottom": null,
 928 |             "display": null,
 929 |             "flex": null,
 930 |             "flex_flow": null,
 931 |             "grid_area": null,
 932 |             "grid_auto_columns": null,
 933 |             "grid_auto_flow": null,
 934 |             "grid_auto_rows": null,
 935 |             "grid_column": null,
 936 |             "grid_gap": null,
 937 |             "grid_row": null,
 938 |             "grid_template_areas": null,
 939 |             "grid_template_columns": null,
 940 |             "grid_template_rows": null,
 941 |             "height": null,
 942 |             "justify_content": null,
 943 |             "justify_items": null,
 944 |             "left": null,
 945 |             "margin": null,
 946 |             "max_height": null,
 947 |             "max_width": null,
 948 |             "min_height": null,
 949 |             "min_width": null,
 950 |             "object_fit": null,
 951 |             "object_position": null,
 952 |             "order": null,
 953 |             "overflow": null,
 954 |             "overflow_x": null,
 955 |             "overflow_y": null,
 956 |             "padding": null,
 957 |             "right": null,
 958 |             "top": null,
 959 |             "visibility": null,
 960 |             "width": null
 961 |           }
 962 |         },
 963 |         "9869010a1cb949f7ae1feeef095fc05e": {
 964 |           "model_module": "@jupyter-widgets/controls",
 965 |           "model_name": "ProgressStyleModel",
 966 |           "model_module_version": "1.5.0",
 967 |           "state": {
 968 |             "_model_module": "@jupyter-widgets/controls",
 969 |             "_model_module_version": "1.5.0",
 970 |             "_model_name": "ProgressStyleModel",
 971 |             "_view_count": null,
 972 |             "_view_module": "@jupyter-widgets/base",
 973 |             "_view_module_version": "1.2.0",
 974 |             "_view_name": "StyleView",
 975 |             "bar_color": null,
 976 |             "description_width": ""
 977 |           }
 978 |         },
 979 |         "1cc0b1752a704fa0b8687d0411539055": {
 980 |           "model_module": "@jupyter-widgets/base",
 981 |           "model_name": "LayoutModel",
 982 |           "model_module_version": "1.2.0",
 983 |           "state": {
 984 |             "_model_module": "@jupyter-widgets/base",
 985 |             "_model_module_version": "1.2.0",
 986 |             "_model_name": "LayoutModel",
 987 |             "_view_count": null,
 988 |             "_view_module": "@jupyter-widgets/base",
 989 |             "_view_module_version": "1.2.0",
 990 |             "_view_name": "LayoutView",
 991 |             "align_content": null,
 992 |             "align_items": null,
 993 |             "align_self": null,
 994 |             "border": null,
 995 |             "bottom": null,
 996 |             "display": null,
 997 |             "flex": null,
 998 |             "flex_flow": null,
 999 |             "grid_area": null,
1000 |             "grid_auto_columns": null,
1001 |             "grid_auto_flow": null,
1002 |             "grid_auto_rows": null,
1003 |             "grid_column": null,
1004 |             "grid_gap": null,
1005 |             "grid_row": null,
1006 |             "grid_template_areas": null,
1007 |             "grid_template_columns": null,
1008 |             "grid_template_rows": null,
1009 |             "height": null,
1010 |             "justify_content": null,
1011 |             "justify_items": null,
1012 |             "left": null,
1013 |             "margin": null,
1014 |             "max_height": null,
1015 |             "max_width": null,
1016 |             "min_height": null,
1017 |             "min_width": null,
1018 |             "object_fit": null,
1019 |             "object_position": null,
1020 |             "order": null,
1021 |             "overflow": null,
1022 |             "overflow_x": null,
1023 |             "overflow_y": null,
1024 |             "padding": null,
1025 |             "right": null,
1026 |             "top": null,
1027 |             "visibility": null,
1028 |             "width": null
1029 |           }
1030 |         },
1031 |         "089092011cf341379b9be17a052830f1": {
1032 |           "model_module": "@jupyter-widgets/controls",
1033 |           "model_name": "DescriptionStyleModel",
1034 |           "model_module_version": "1.5.0",
1035 |           "state": {
1036 |             "_model_module": "@jupyter-widgets/controls",
1037 |             "_model_module_version": "1.5.0",
1038 |             "_model_name": "DescriptionStyleModel",
1039 |             "_view_count": null,
1040 |             "_view_module": "@jupyter-widgets/base",
1041 |             "_view_module_version": "1.2.0",
1042 |             "_view_name": "StyleView",
1043 |             "description_width": ""
1044 |           }
1045 |         },
1046 |         "f4321c31f68a458382b2b8b969993831": {
1047 |           "model_module": "@jupyter-widgets/controls",
1048 |           "model_name": "HBoxModel",
1049 |           "model_module_version": "1.5.0",
1050 |           "state": {
1051 |             "_dom_classes": [],
1052 |             "_model_module": "@jupyter-widgets/controls",
1053 |             "_model_module_version": "1.5.0",
1054 |             "_model_name": "HBoxModel",
1055 |             "_view_count": null,
1056 |             "_view_module": "@jupyter-widgets/controls",
1057 |             "_view_module_version": "1.5.0",
1058 |             "_view_name": "HBoxView",
1059 |             "box_style": "",
1060 |             "children": [
1061 |               "IPY_MODEL_e8ed92210b654289a7732b2e76f8dcad",
1062 |               "IPY_MODEL_7428d3cc85f148fc8659b2f7e5e1afe6",
1063 |               "IPY_MODEL_c34b003c7d524da0bb39e8c8cfe63d44"
1064 |             ],
1065 |             "layout": "IPY_MODEL_8fd5fbbf855c43428ff5253adda6a4b3"
1066 |           }
1067 |         },
1068 |         "e8ed92210b654289a7732b2e76f8dcad": {
1069 |           "model_module": "@jupyter-widgets/controls",
1070 |           "model_name": "HTMLModel",
1071 |           "model_module_version": "1.5.0",
1072 |           "state": {
1073 |             "_dom_classes": [],
1074 |             "_model_module": "@jupyter-widgets/controls",
1075 |             "_model_module_version": "1.5.0",
1076 |             "_model_name": "HTMLModel",
1077 |             "_view_count": null,
1078 |             "_view_module": "@jupyter-widgets/controls",
1079 |             "_view_module_version": "1.5.0",
1080 |             "_view_name": "HTMLView",
1081 |             "description": "",
1082 |             "description_tooltip": null,
1083 |             "layout": "IPY_MODEL_f623dbd8c9b5462bab46250eb7d492e6",
1084 |             "placeholder": "​",
1085 |             "style": "IPY_MODEL_bb8ab670115c46378f448a7ca93726e2",
1086 |             "value": "model.safetensors: 100%"
1087 |           }
1088 |         },
1089 |         "7428d3cc85f148fc8659b2f7e5e1afe6": {
1090 |           "model_module": "@jupyter-widgets/controls",
1091 |           "model_name": "FloatProgressModel",
1092 |           "model_module_version": "1.5.0",
1093 |           "state": {
1094 |             "_dom_classes": [],
1095 |             "_model_module": "@jupyter-widgets/controls",
1096 |             "_model_module_version": "1.5.0",
1097 |             "_model_name": "FloatProgressModel",
1098 |             "_view_count": null,
1099 |             "_view_module": "@jupyter-widgets/controls",
1100 |             "_view_module_version": "1.5.0",
1101 |             "_view_name": "ProgressView",
1102 |             "bar_style": "success",
1103 |             "description": "",
1104 |             "description_tooltip": null,
1105 |             "layout": "IPY_MODEL_dc74c156b5234141934265b417a5ffcb",
1106 |             "max": 46807446,
1107 |             "min": 0,
1108 |             "orientation": "horizontal",
1109 |             "style": "IPY_MODEL_d18a0ce1bd7149d388d12cdd86c3390c",
1110 |             "value": 46807446
1111 |           }
1112 |         },
1113 |         "c34b003c7d524da0bb39e8c8cfe63d44": {
1114 |           "model_module": "@jupyter-widgets/controls",
1115 |           "model_name": "HTMLModel",
1116 |           "model_module_version": "1.5.0",
1117 |           "state": {
1118 |             "_dom_classes": [],
1119 |             "_model_module": "@jupyter-widgets/controls",
1120 |             "_model_module_version": "1.5.0",
1121 |             "_model_name": "HTMLModel",
1122 |             "_view_count": null,
1123 |             "_view_module": "@jupyter-widgets/controls",
1124 |             "_view_module_version": "1.5.0",
1125 |             "_view_name": "HTMLView",
1126 |             "description": "",
1127 |             "description_tooltip": null,
1128 |             "layout": "IPY_MODEL_fb00943087d0426fad871d9d6ed68a2a",
1129 |             "placeholder": "​",
1130 |             "style": "IPY_MODEL_4e25db89a38944f1bf63427969d90caa",
1131 |             "value": " 46.8M/46.8M [00:00&lt;00:00, 177MB/s]"
1132 |           }
1133 |         },
1134 |         "8fd5fbbf855c43428ff5253adda6a4b3": {
1135 |           "model_module": "@jupyter-widgets/base",
1136 |           "model_name": "LayoutModel",
1137 |           "model_module_version": "1.2.0",
1138 |           "state": {
1139 |             "_model_module": "@jupyter-widgets/base",
1140 |             "_model_module_version": "1.2.0",
1141 |             "_model_name": "LayoutModel",
1142 |             "_view_count": null,
1143 |             "_view_module": "@jupyter-widgets/base",
1144 |             "_view_module_version": "1.2.0",
1145 |             "_view_name": "LayoutView",
1146 |             "align_content": null,
1147 |             "align_items": null,
1148 |             "align_self": null,
1149 |             "border": null,
1150 |             "bottom": null,
1151 |             "display": null,
1152 |             "flex": null,
1153 |             "flex_flow": null,
1154 |             "grid_area": null,
1155 |             "grid_auto_columns": null,
1156 |             "grid_auto_flow": null,
1157 |             "grid_auto_rows": null,
1158 |             "grid_column": null,
1159 |             "grid_gap": null,
1160 |             "grid_row": null,
1161 |             "grid_template_areas": null,
1162 |             "grid_template_columns": null,
1163 |             "grid_template_rows": null,
1164 |             "height": null,
1165 |             "justify_content": null,
1166 |             "justify_items": null,
1167 |             "left": null,
1168 |             "margin": null,
1169 |             "max_height": null,
1170 |             "max_width": null,
1171 |             "min_height": null,
1172 |             "min_width": null,
1173 |             "object_fit": null,
1174 |             "object_position": null,
1175 |             "order": null,
1176 |             "overflow": null,
1177 |             "overflow_x": null,
1178 |             "overflow_y": null,
1179 |             "padding": null,
1180 |             "right": null,
1181 |             "top": null,
1182 |             "visibility": null,
1183 |             "width": null
1184 |           }
1185 |         },
1186 |         "f623dbd8c9b5462bab46250eb7d492e6": {
1187 |           "model_module": "@jupyter-widgets/base",
1188 |           "model_name": "LayoutModel",
1189 |           "model_module_version": "1.2.0",
1190 |           "state": {
1191 |             "_model_module": "@jupyter-widgets/base",
1192 |             "_model_module_version": "1.2.0",
1193 |             "_model_name": "LayoutModel",
1194 |             "_view_count": null,
1195 |             "_view_module": "@jupyter-widgets/base",
1196 |             "_view_module_version": "1.2.0",
1197 |             "_view_name": "LayoutView",
1198 |             "align_content": null,
1199 |             "align_items": null,
1200 |             "align_self": null,
1201 |             "border": null,
1202 |             "bottom": null,
1203 |             "display": null,
1204 |             "flex": null,
1205 |             "flex_flow": null,
1206 |             "grid_area": null,
1207 |             "grid_auto_columns": null,
1208 |             "grid_auto_flow": null,
1209 |             "grid_auto_rows": null,
1210 |             "grid_column": null,
1211 |             "grid_gap": null,
1212 |             "grid_row": null,
1213 |             "grid_template_areas": null,
1214 |             "grid_template_columns": null,
1215 |             "grid_template_rows": null,
1216 |             "height": null,
1217 |             "justify_content": null,
1218 |             "justify_items": null,
1219 |             "left": null,
1220 |             "margin": null,
1221 |             "max_height": null,
1222 |             "max_width": null,
1223 |             "min_height": null,
1224 |             "min_width": null,
1225 |             "object_fit": null,
1226 |             "object_position": null,
1227 |             "order": null,
1228 |             "overflow": null,
1229 |             "overflow_x": null,
1230 |             "overflow_y": null,
1231 |             "padding": null,
1232 |             "right": null,
1233 |             "top": null,
1234 |             "visibility": null,
1235 |             "width": null
1236 |           }
1237 |         },
1238 |         "bb8ab670115c46378f448a7ca93726e2": {
1239 |           "model_module": "@jupyter-widgets/controls",
1240 |           "model_name": "DescriptionStyleModel",
1241 |           "model_module_version": "1.5.0",
1242 |           "state": {
1243 |             "_model_module": "@jupyter-widgets/controls",
1244 |             "_model_module_version": "1.5.0",
1245 |             "_model_name": "DescriptionStyleModel",
1246 |             "_view_count": null,
1247 |             "_view_module": "@jupyter-widgets/base",
1248 |             "_view_module_version": "1.2.0",
1249 |             "_view_name": "StyleView",
1250 |             "description_width": ""
1251 |           }
1252 |         },
1253 |         "dc74c156b5234141934265b417a5ffcb": {
1254 |           "model_module": "@jupyter-widgets/base",
1255 |           "model_name": "LayoutModel",
1256 |           "model_module_version": "1.2.0",
1257 |           "state": {
1258 |             "_model_module": "@jupyter-widgets/base",
1259 |             "_model_module_version": "1.2.0",
1260 |             "_model_name": "LayoutModel",
1261 |             "_view_count": null,
1262 |             "_view_module": "@jupyter-widgets/base",
1263 |             "_view_module_version": "1.2.0",
1264 |             "_view_name": "LayoutView",
1265 |             "align_content": null,
1266 |             "align_items": null,
1267 |             "align_self": null,
1268 |             "border": null,
1269 |             "bottom": null,
1270 |             "display": null,
1271 |             "flex": null,
1272 |             "flex_flow": null,
1273 |             "grid_area": null,
1274 |             "grid_auto_columns": null,
1275 |             "grid_auto_flow": null,
1276 |             "grid_auto_rows": null,
1277 |             "grid_column": null,
1278 |             "grid_gap": null,
1279 |             "grid_row": null,
1280 |             "grid_template_areas": null,
1281 |             "grid_template_columns": null,
1282 |             "grid_template_rows": null,
1283 |             "height": null,
1284 |             "justify_content": null,
1285 |             "justify_items": null,
1286 |             "left": null,
1287 |             "margin": null,
1288 |             "max_height": null,
1289 |             "max_width": null,
1290 |             "min_height": null,
1291 |             "min_width": null,
1292 |             "object_fit": null,
1293 |             "object_position": null,
1294 |             "order": null,
1295 |             "overflow": null,
1296 |             "overflow_x": null,
1297 |             "overflow_y": null,
1298 |             "padding": null,
1299 |             "right": null,
1300 |             "top": null,
1301 |             "visibility": null,
1302 |             "width": null
1303 |           }
1304 |         },
1305 |         "d18a0ce1bd7149d388d12cdd86c3390c": {
1306 |           "model_module": "@jupyter-widgets/controls",
1307 |           "model_name": "ProgressStyleModel",
1308 |           "model_module_version": "1.5.0",
1309 |           "state": {
1310 |             "_model_module": "@jupyter-widgets/controls",
1311 |             "_model_module_version": "1.5.0",
1312 |             "_model_name": "ProgressStyleModel",
1313 |             "_view_count": null,
1314 |             "_view_module": "@jupyter-widgets/base",
1315 |             "_view_module_version": "1.2.0",
1316 |             "_view_name": "StyleView",
1317 |             "bar_color": null,
1318 |             "description_width": ""
1319 |           }
1320 |         },
1321 |         "fb00943087d0426fad871d9d6ed68a2a": {
1322 |           "model_module": "@jupyter-widgets/base",
1323 |           "model_name": "LayoutModel",
1324 |           "model_module_version": "1.2.0",
1325 |           "state": {
1326 |             "_model_module": "@jupyter-widgets/base",
1327 |             "_model_module_version": "1.2.0",
1328 |             "_model_name": "LayoutModel",
1329 |             "_view_count": null,
1330 |             "_view_module": "@jupyter-widgets/base",
1331 |             "_view_module_version": "1.2.0",
1332 |             "_view_name": "LayoutView",
1333 |             "align_content": null,
1334 |             "align_items": null,
1335 |             "align_self": null,
1336 |             "border": null,
1337 |             "bottom": null,
1338 |             "display": null,
1339 |             "flex": null,
1340 |             "flex_flow": null,
1341 |             "grid_area": null,
1342 |             "grid_auto_columns": null,
1343 |             "grid_auto_flow": null,
1344 |             "grid_auto_rows": null,
1345 |             "grid_column": null,
1346 |             "grid_gap": null,
1347 |             "grid_row": null,
1348 |             "grid_template_areas": null,
1349 |             "grid_template_columns": null,
1350 |             "grid_template_rows": null,
1351 |             "height": null,
1352 |             "justify_content": null,
1353 |             "justify_items": null,
1354 |             "left": null,
1355 |             "margin": null,
1356 |             "max_height": null,
1357 |             "max_width": null,
1358 |             "min_height": null,
1359 |             "min_width": null,
1360 |             "object_fit": null,
1361 |             "object_position": null,
1362 |             "order": null,
1363 |             "overflow": null,
1364 |             "overflow_x": null,
1365 |             "overflow_y": null,
1366 |             "padding": null,
1367 |             "right": null,
1368 |             "top": null,
1369 |             "visibility": null,
1370 |             "width": null
1371 |           }
1372 |         },
1373 |         "4e25db89a38944f1bf63427969d90caa": {
1374 |           "model_module": "@jupyter-widgets/controls",
1375 |           "model_name": "DescriptionStyleModel",
1376 |           "model_module_version": "1.5.0",
1377 |           "state": {
1378 |             "_model_module": "@jupyter-widgets/controls",
1379 |             "_model_module_version": "1.5.0",
1380 |             "_model_name": "DescriptionStyleModel",
1381 |             "_view_count": null,
1382 |             "_view_module": "@jupyter-widgets/base",
1383 |             "_view_module_version": "1.2.0",
1384 |             "_view_name": "StyleView",
1385 |             "description_width": ""
1386 |           }
1387 |         }
1388 |       }
1389 |     },
1390 |     "accelerator": "GPU"
1391 |   },
1392 |   "cells": [
1393 |     {
1394 |       "cell_type": "markdown",
1395 |       "metadata": {
1396 |         "id": "view-in-github",
1397 |         "colab_type": "text"
1398 |       },
1399 |       "source": [
1400 |         "<a href=\"https://colab.research.google.com/github/sugarforever/Advanced-RAG/blob/main/01_semi_structured_data.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
1401 |       ]
1402 |     },
1403 |     {
1404 |       "cell_type": "markdown",
1405 |       "source": [
1406 |         "# Advanced RAG - 01. RAG on Semi-structured data\n",
1407 |         "\n",
1408 |         "**What is RAG?**\n",
1409 |         "\n",
1410 |         "Retrieval augmented generation (RAG) is a natural language processing (NLP) technique that combines the capabilities of retrieval-based and generative AI models.\n",
1411 |         "\n",
1412 |         "**What is Naive RAG?**\n",
1413 |         "\n",
1414 |         "Naive RAG often refers to splitting documents into chunks, embedding them, and retrieving chunks based on semantic similarity search to a user question.\n",
1415 |         "\n",
1416 |         "It's simple, but its overall performance is often poor.\n",
1417 |         "\n",
1418 |         "**That's why we need Advanced RAG.**\n",
1419 |         "\n",
1420 |         "In this tutorial series (**Advanced RAG**), we will learn techniques and best practices in RAG application development that can improve the quality of a RAG application.\n",
1421 |         "\n",
1422 |         "These techniques are crucial to the success of a RAG application.\n",
1423 |         "\n",
1424 |         "## 01. RAG on Semi-structured data\n",
1425 |         "\n",
1426 |         "### Introduction\n",
1427 |         "\n",
1428 |         "#### ✏️ What is Structured Data?\n",
1429 |         "\n",
1430 |         "Structured data is organized information with a predefined format, typically stored in rows and columns, making it easy to search and analyze.\n",
1431 |         "\n",
1432 |         "#### ✏️ What is Unstructured Data?\n",
1433 |         "\n",
1434 |         "Unstructured data is information that lacks a specific format or organization, often in the form of text, images, or multimedia, making it challenging to analyze without specialized techniques.\n",
1435 |         "\n",
1436 |         "#### ✏️ What is Semi-structured Data?\n",
1437 |         "\n",
1438 |         "Semi-structured data is a mix of the two: it combines structured elements, such as tables, with unstructured content, such as free text.\n",
1439 |         "\n",
1440 |         "Semi-structured data is challenging for RAG to process because:\n",
1441 |         "\n",
1442 |         "1. Text splitting may break up tables\n",
1443 |         "2. Tables and images are challenging for embedding and semantic search\n",
1444 |         "\n",
1445 |         "A typical example of semi-structured data is a PDF document that contains text, tables, images, and so on.\n",
1446 |         "\n",
1447 |         "In this tutorial, let's use the following components to showcase how to build RAG on top of semi-structured data:\n",
1448 |         "\n",
1449 |         "1. ✂️ [unstructured](https://github.com/Unstructured-IO/unstructured)\n",
1450 |         "  \n",
1451 |         "  Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.\n",
1452 |         "\n",
1453 |         "  We will use it to parse PDF documents and extract different types of elements separately, such as text, tables, and images.\n",
1454 |         "\n",
1455 |         "2. 🦜 [LangChain](https://github.com/langchain-ai/langchain)\n",
1456 |         "\n",
1457 |         "3. 🗂 [Chromadb](https://github.com/chroma-core/chroma)\n",
1458 |         "\n",
1459 |         "  Vector data storage\n",
1460 |         "\n",
1461 |         "The PDF document we use in this example is the [NVIDIA Statement of Changes](https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/381953f9-934e-4cc8-b099-144910676bad.pdf). It's a small PDF file containing several tables, which makes it a good example for quick data processing and a clear demonstration."
1462 |       ],
1463 |       "metadata": {
1464 |         "id": "AeO6Vg59nrPP"
1465 |       }
1466 |     },
1467 |     {
1468 |       "cell_type": "markdown",
1469 |       "source": [
1470 |         "### Prepare Environment"
1471 |       ],
1472 |       "metadata": {
1473 |         "id": "K-e_9LiVuJeQ"
1474 |       }
1475 |     },
1476 |     {
1477 |       "cell_type": "markdown",
1478 |       "source": [
1479 |         "Let's install the necessary Python packages."
1480 |       ],
1481 |       "metadata": {
1482 |         "id": "wVP2xyg3uh1r"
1483 |       }
1484 |     },
1485 |     {
1486 |       "cell_type": "code",
1487 |       "execution_count": 18,
1488 |       "metadata": {
1489 |         "colab": {
1490 |           "base_uri": "https://localhost:8080/"
1491 |         },
1492 |         "id": "kaIyvvQuB4uu",
1493 |         "outputId": "a1a32063-5712-45fc-853b-e6140c4e2542"
1494 |       },
1495 |       "outputs": [
1496 |         {
1497 |           "output_type": "stream",
1498 |           "name": "stdout",
1499 |           "text": [
1500 |             "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 MB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
1501 |             "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
1502 |             "llmx 0.0.15a0 requires cohere, which is not installed.\u001b[0m\u001b[31m\n",
1503 |             "\u001b[0m"
1504 |           ]
1505 |         }
1506 |       ],
1507 |       "source": [
1508 |         "!pip install langchain \"unstructured[all-docs]\" pydantic lxml openai chromadb tiktoken -q -U"
1509 |       ]
1510 |     },
1511 |     {
1512 |       "cell_type": "markdown",
1513 |       "source": [
1514 |         "Download the PDF file and save it as `statement_of_changes.pdf`."
1515 |       ],
1516 |       "metadata": {
1517 |         "id": "O7276PRIuq4k"
1518 |       }
1519 |     },
1520 |     {
1521 |       "cell_type": "code",
1522 |       "source": [
1523 |         "!wget -O statement_of_changes.pdf https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/381953f9-934e-4cc8-b099-144910676bad.pdf"
1524 |       ],
1525 |       "metadata": {
1526 |         "colab": {
1527 |           "base_uri": "https://localhost:8080/"
1528 |         },
1529 |         "id": "BInAbAKNDP55",
1530 |         "outputId": "5476a09a-9d04-47d7-d8ff-688c397c49c0"
1531 |       },
1532 |       "execution_count": 2,
1533 |       "outputs": [
1534 |         {
1535 |           "output_type": "stream",
1536 |           "name": "stdout",
1537 |           "text": [
1538 |             "--2023-11-20 21:19:43--  https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/381953f9-934e-4cc8-b099-144910676bad.pdf\n",
1539 |             "Resolving d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)... 52.84.122.100, 52.84.122.47, 52.84.122.58, ...\n",
1540 |             "Connecting to d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)|52.84.122.100|:443... connected.\n",
1541 |             "HTTP request sent, awaiting response... 200 OK\n",
1542 |             "Length: 119037 (116K) [application/pdf]\n",
1543 |             "Saving to: ‘statement_of_changes.pdf’\n",
1544 |             "\n",
1545 |             "statement_of_change 100%[===================>] 116.25K  --.-KB/s    in 0.07s   \n",
1546 |             "\n",
1547 |             "2023-11-20 21:19:44 (1.52 MB/s) - ‘statement_of_changes.pdf’ saved [119037/119037]\n",
1548 |             "\n"
1549 |           ]
1550 |         }
1551 |       ]
1552 |     },
1553 |     {
1554 |       "cell_type": "markdown",
1555 |       "source": [
1556 |         "Install required platform packages:\n",
1557 |         "\n",
1558 |         "- poppler-utils\n",
1559 |         "  \n",
1560 |         "  A collection of command-line utilities built on Poppler's library API, used to manage PDF files and extract their contents\n",
1561 |         "\n",
1562 |         "- tesseract-ocr\n",
1563 |         "\n",
1564 |         "  Optical character recognition engine"
1565 |       ],
1566 |       "metadata": {
1567 |         "id": "SOSkdNt9ux_Z"
1568 |       }
1569 |     },
1570 |     {
1571 |       "cell_type": "code",
1572 |       "source": [
1573 |         "!apt-get install poppler-utils tesseract-ocr"
1574 |       ],
1575 |       "metadata": {
1576 |         "colab": {
1577 |           "base_uri": "https://localhost:8080/"
1578 |         },
1579 |         "id": "ER_h_kAPatZO",
1580 |         "outputId": "7c229fc7-d8dd-461f-ae80-88fb79f640d9"
1581 |       },
1582 |       "execution_count": 3,
1583 |       "outputs": [
1584 |         {
1585 |           "output_type": "stream",
1586 |           "name": "stdout",
1587 |           "text": [
1588 |             "Reading package lists... Done\n",
1589 |             "Building dependency tree... Done\n",
1590 |             "Reading state information... Done\n",
1591 |             "The following additional packages will be installed:\n",
1592 |             "  tesseract-ocr-eng tesseract-ocr-osd\n",
1593 |             "The following NEW packages will be installed:\n",
1594 |             "  poppler-utils tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd\n",
1595 |             "0 upgraded, 4 newly installed, 0 to remove and 9 not upgraded.\n",
1596 |             "Need to get 5,002 kB of archives.\n",
1597 |             "After this operation, 16.3 MB of additional disk space will be used.\n",
1598 |             "Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.2 [186 kB]\n",
1599 |             "Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]\n",
1600 |             "Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]\n",
1601 |             "Get:4 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]\n",
1602 |             "Fetched 5,002 kB in 1s (3,337 kB/s)\n",
1603 |             "Selecting previously unselected package poppler-utils.\n",
1604 |             "(Reading database ... 120880 files and directories currently installed.)\n",
1605 |             "Preparing to unpack .../poppler-utils_22.02.0-2ubuntu0.2_amd64.deb ...\n",
1606 |             "Unpacking poppler-utils (22.02.0-2ubuntu0.2) ...\n",
1607 |             "Selecting previously unselected package tesseract-ocr-eng.\n",
1608 |             "Preparing to unpack .../tesseract-ocr-eng_1%3a4.00~git30-7274cfa-1.1_all.deb ...\n",
1609 |             "Unpacking tesseract-ocr-eng (1:4.00~git30-7274cfa-1.1) ...\n",
1610 |             "Selecting previously unselected package tesseract-ocr-osd.\n",
1611 |             "Preparing to unpack .../tesseract-ocr-osd_1%3a4.00~git30-7274cfa-1.1_all.deb ...\n",
1612 |             "Unpacking tesseract-ocr-osd (1:4.00~git30-7274cfa-1.1) ...\n",
1613 |             "Selecting previously unselected package tesseract-ocr.\n",
1614 |             "Preparing to unpack .../tesseract-ocr_4.1.1-2.1build1_amd64.deb ...\n",
1615 |             "Unpacking tesseract-ocr (4.1.1-2.1build1) ...\n",
1616 |             "Setting up tesseract-ocr-eng (1:4.00~git30-7274cfa-1.1) ...\n",
1617 |             "Setting up tesseract-ocr-osd (1:4.00~git30-7274cfa-1.1) ...\n",
1618 |             "Setting up poppler-utils (22.02.0-2ubuntu0.2) ...\n",
1619 |             "Setting up tesseract-ocr (4.1.1-2.1build1) ...\n",
1620 |             "Processing triggers for man-db (2.10.2-1) ...\n"
1621 |           ]
1622 |         }
1623 |       ]
1624 |     },
1625 |     {
1626 |       "cell_type": "code",
1627 |       "source": [
1628 |         "import os\n",
1629 |         "\n",
1630 |         "os.environ[\"OPENAI_API_KEY\"] = \"Your Valid OpenAI API Key\""
1631 |       ],
1632 |       "metadata": {
1633 |         "id": "o2lSAfBCxt4w"
1634 |       },
1635 |       "execution_count": null,
1636 |       "outputs": []
1637 |     },
1638 |     {
1639 |       "cell_type": "markdown",
1640 |       "source": [
1641 |         "### Coding"
1642 |       ],
1643 |       "metadata": {
1644 |         "id": "S336PokNwlE-"
1645 |       }
1646 |     },
1647 |     {
1648 |       "cell_type": "markdown",
1649 |       "source": [
1650 |         "1. Use the `unstructured` library to partition the PDF document into different types of elements."
1651 |       ],
1652 |       "metadata": {
1653 |         "id": "sGONjGYXw0ot"
1654 |       }
1655 |     },
1656 |     {
1657 |       "cell_type": "code",
1658 |       "source": [
1659 |         "from typing import Any\n",
1660 |         "\n",
1661 |         "from pydantic import BaseModel\n",
1662 |         "from unstructured.partition.pdf import partition_pdf\n",
1663 |         "\n",
1664 |         "raw_pdf_elements = partition_pdf(\n",
1665 |         "    filename=\"statement_of_changes.pdf\",\n",
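1666 |         "    extract_images_in_pdf=False,  # do not extract embedded images\n",
1667 |         "    infer_table_structure=True,  # detect tables and keep their structure\n",
1668 |         "    chunking_strategy=\"by_title\",  # group elements into chunks by section title\n",
1669 |         "    max_characters=4000,  # hard limit on chunk size\n",
1670 |         "    new_after_n_chars=3800,  # soft limit: start a new chunk after this many characters\n",
1671 |         "    combine_text_under_n_chars=2000,  # merge small sections shorter than this\n",
1672 |         "    image_output_dir_path=\".\",  # where images would be saved if extraction were enabled\n",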
1673 |         ")"
1674 |       ],
1675 |       "metadata": {
1676 |         "colab": {
1677 |           "base_uri": "https://localhost:8080/",
1678 |           "height": 312,
1679 |           "referenced_widgets": [
1680 |             "ac0fab4eb8e543c1a0f5b037e0235815",
1681 |             "3649dec7e417482091fc2f86939e0c16",
1682 |             "aa165def4267499fa5bd66c7a23aa83d",
1683 |             "4ee55dbff1674ba9bdb0614692dc64e9",
1684 |             "54445f16b2e5439dbea3d25c7d690933",
1685 |             "b240ed924ce94639b137552549ffd29e",
1686 |             "c0f954fb5893438fbc5306ba55d44c20",
1687 |             "cf8f1809505a436c949d936e74f93094",
1688 |             "7639470b337e47b0b7a60d32c8770744",
1689 |             "17913e64f28b4a2daf9f9d21654c6372",
1690 |             "ef27bad8255f447da09541fc10f80ad8",
1691 |             "8ee21ab783c543c0944e8732faaebb4a",
1692 |             "2024899dae44406a9e187c5d770f873f",
1693 |             "2174f17cbb124ab2bf1a32765c0d5a0c",
1694 |             "330603dd11944ad9a267378e3a17a573",
1695 |             "674ae68fb32d40b48152f29f77636138",
1696 |             "c1cd34e355a540f58cebe58990bee350",
1697 |             "2248e9a15d694e7f94b6d817a0bceb90",
1698 |             "2477901c30b44069ac5db5cf4e8924f0",
1699 |             "f2d4643e36634b668eb2d749d74e7780",
1700 |             "f9a39e14cb0140e8ac09cfb400344804",
1701 |             "9e0916f00d374a858a3edb397a63af8d",
1702 |             "5b720610ad684d5abe3ca6f3acdf6606",
1703 |             "127c410899934087904875f527a2dbba",
1704 |             "2f9d37688cfa46059f152c0e6b080904",
1705 |             "cd3f489a29c8448c9a01f595690c0e48",
1706 |             "f9056446dd594a2aaec5396995a5f065",
1707 |             "42efb45824894d808c9e31ef7554a852",
1708 |             "181f2590b9464b929bf9b4e3b5c88b5c",
1709 |             "4f40596f290146d8b53dc3dda50a09e7",
1710 |             "9869010a1cb949f7ae1feeef095fc05e",
1711 |             "1cc0b1752a704fa0b8687d0411539055",
1712 |             "089092011cf341379b9be17a052830f1",
1713 |             "f4321c31f68a458382b2b8b969993831",
1714 |             "e8ed92210b654289a7732b2e76f8dcad",
1715 |             "7428d3cc85f148fc8659b2f7e5e1afe6",
1716 |             "c34b003c7d524da0bb39e8c8cfe63d44",
1717 |             "8fd5fbbf855c43428ff5253adda6a4b3",
1718 |             "f623dbd8c9b5462bab46250eb7d492e6",
1719 |             "bb8ab670115c46378f448a7ca93726e2",
1720 |             "dc74c156b5234141934265b417a5ffcb",
1721 |             "d18a0ce1bd7149d388d12cdd86c3390c",
1722 |             "fb00943087d0426fad871d9d6ed68a2a",
1723 |             "4e25db89a38944f1bf63427969d90caa"
1724 |           ]
1725 |         },
1726 |         "id": "nFpMgbR6Csj6",
1727 |         "outputId": "2782369b-ab1e-4677-a542-e1080744c1eb"
1728 |       },
1729 |       "execution_count": 4,
1730 |       "outputs": [
1731 |         {
1732 |           "output_type": "stream",
1733 |           "name": "stderr",
1734 |           "text": [
1735 |             "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
1736 |             "[nltk_data]   Unzipping tokenizers/punkt.zip.\n",
1737 |             "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
1738 |             "[nltk_data]     /root/nltk_data...\n",
1739 |             "[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.\n"
1740 |           ]
1741 |         },
1742 |         {
1743 |           "output_type": "display_data",
1744 |           "data": {
1745 |             "text/plain": [
1746 |               "yolox_l0.05.onnx:   0%|          | 0.00/217M [00:00<?, ?B/s]"
1747 |             ],
1748 |             "application/vnd.jupyter.widget-view+json": {
1749 |               "version_major": 2,
1750 |               "version_minor": 0,
1751 |               "model_id": "ac0fab4eb8e543c1a0f5b037e0235815"
1752 |             }
1753 |           },
1754 |           "metadata": {}
1755 |         },
1756 |         {
1757 |           "output_type": "display_data",
1758 |           "data": {
1759 |             "text/plain": [
1760 |               "config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]"
1761 |             ],
1762 |             "application/vnd.jupyter.widget-view+json": {
1763 |               "version_major": 2,
1764 |               "version_minor": 0,
1765 |               "model_id": "8ee21ab783c543c0944e8732faaebb4a"
1766 |             }
1767 |           },
1768 |           "metadata": {}
1769 |         },
1770 |         {
1771 |           "output_type": "display_data",
1772 |           "data": {
1773 |             "text/plain": [
1774 |               "model.safetensors:   0%|          | 0.00/115M [00:00<?, ?B/s]"
1775 |             ],
1776 |             "application/vnd.jupyter.widget-view+json": {
1777 |               "version_major": 2,
1778 |               "version_minor": 0,
1779 |               "model_id": "5b720610ad684d5abe3ca6f3acdf6606"
1780 |             }
1781 |           },
1782 |           "metadata": {}
1783 |         },
1784 |         {
1785 |           "output_type": "display_data",
1786 |           "data": {
1787 |             "text/plain": [
1788 |               "model.safetensors:   0%|          | 0.00/46.8M [00:00<?, ?B/s]"
1789 |             ],
1790 |             "application/vnd.jupyter.widget-view+json": {
1791 |               "version_major": 2,
1792 |               "version_minor": 0,
1793 |               "model_id": "f4321c31f68a458382b2b8b969993831"
1794 |             }
1795 |           },
1796 |           "metadata": {}
1797 |         },
1798 |         {
1799 |           "output_type": "stream",
1800 |           "name": "stderr",
1801 |           "text": [
1802 |             "Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked']\n",
1803 |             "- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
1804 |             "- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
1805 |           ]
1806 |         }
1807 |       ]
1808 |     },
1809 |     {
1810 |       "cell_type": "markdown",
1811 |       "source": [
1812 |         "2. Categorize the elements"
1813 |       ],
1814 |       "metadata": {
1815 |         "id": "cv8Tw7ETw90G"
1816 |       }
1817 |     },
1818 |     {
1819 |       "cell_type": "code",
1820 |       "source": [
1821 |         "category_counts = {}\n",
1822 |         "# Tally the extracted elements by their unstructured class name\n",
1823 |         "for element in raw_pdf_elements:\n",
1824 |         "    category = str(type(element))\n",
1825 |         "    if category in category_counts:\n",
1826 |         "        category_counts[category] += 1\n",
1827 |         "    else:\n",
1828 |         "        category_counts[category] = 1\n",
1829 |         "\n",
1830 |         "unique_categories = set(category_counts.keys())\n",
1831 |         "category_counts"
1832 |       ],
1833 |       "metadata": {
1834 |         "id": "0ofgTIuNDkLo",
1835 |         "colab": {
1836 |           "base_uri": "https://localhost:8080/"
1837 |         },
1838 |         "outputId": "06c8b468-a500-461e-f1d3-0acfb8be6af7"
1839 |       },
1840 |       "execution_count": 30,
1841 |       "outputs": [
1842 |         {
1843 |           "output_type": "execute_result",
1844 |           "data": {
1845 |             "text/plain": [
1846 |               "{\"<class 'unstructured.documents.elements.CompositeElement'>\": 5,\n",
1847 |               " \"<class 'unstructured.documents.elements.Table'>\": 4}"
1848 |             ]
1849 |           },
1850 |           "metadata": {},
1851 |           "execution_count": 30
1852 |         }
1853 |       ]
1854 |     },
1855 |     {
1856 |       "cell_type": "code",
1857 |       "source": [
1858 |         "class Element(BaseModel):\n",
1859 |         "    type: str\n",
1860 |         "    text: Any\n",
1861 |         "\n",
1862 |         "table_elements = []\n",
1863 |         "text_elements = []\n",
1864 |         "for element in raw_pdf_elements:\n",
1865 |         "    if \"unstructured.documents.elements.Table\" in str(type(element)):\n",
1866 |         "        table_elements.append(Element(type=\"table\", text=str(element)))\n",
1867 |         "    elif \"unstructured.documents.elements.CompositeElement\" in str(type(element)):\n",
1868 |         "        text_elements.append(Element(type=\"text\", text=str(element)))"
1869 |       ],
1870 |       "metadata": {
1871 |         "id": "alMtzOivbJXo"
1872 |       },
1873 |       "execution_count": 31,
1874 |       "outputs": []
1875 |     },
1876 |     {
1877 |       "cell_type": "code",
1878 |       "source": [
1879 |         "print(len(table_elements))\n",
1880 |         "print(len(text_elements))"
1881 |       ],
1882 |       "metadata": {
1883 |         "colab": {
1884 |           "base_uri": "https://localhost:8080/"
1885 |         },
1886 |         "id": "4rGCd4O5xlSh",
1887 |         "outputId": "acd62e39-97cb-4400-f1d9-452eeafcd541"
1888 |       },
1889 |       "execution_count": 32,
1890 |       "outputs": [
1891 |         {
1892 |           "output_type": "stream",
1893 |           "name": "stdout",
1894 |           "text": [
1895 |             "4\n",
1896 |             "5\n"
1897 |           ]
1898 |         }
1899 |       ]
1900 |     },
1901 |     {
1902 |       "cell_type": "code",
1903 |       "source": [
1904 |         "table_elements[0]"
1905 |       ],
1906 |       "metadata": {
1907 |         "colab": {
1908 |           "base_uri": "https://localhost:8080/"
1909 |         },
1910 |         "id": "ld91YQP3nKTd",
1911 |         "outputId": "a71da61e-39e6-4c4a-bcf4-1cbde6c0e26a"
1912 |       },
1913 |       "execution_count": 33,
1914 |       "outputs": [
1915 |         {
1916 |           "output_type": "execute_result",
1917 |           "data": {
1918 |             "text/plain": [
1919 |               "Element(type='table', text='1. Name and Address of Reporting Person = \\\\Drell Persis 2. Issuer Name and Ticker or Trading Symbol INVIDIA CORP [ NVDA ] 5. Relationship of Reporting Person(s) to Issuer (Check all applicable) 3. Date of Earliest Transaction (MM/DD/YYYY) |_X_Director 10% Owner (es) (Fis) (Middle) Officer (give title below) Other (specify below) C/O NVIDIA CORPORATION, 2788 10/6/2023 SAN TOMAS EXPRESSWAY (Street) 4. If Amendment, Date Original Filed (MM/DD/YYYY) 6. Individual or Joint/Group Filing (Check Applicable Line) SANTA CLARA, CA 95051 | X_ Form filed by One Reporting Person - |__ Form filed by More than One Reporting Person (City) (State) (Zip)')"
1920 |             ]
1921 |           },
1922 |           "metadata": {},
1923 |           "execution_count": 33
1924 |         }
1925 |       ]
1926 |     },
1927 |     {
1928 |       "cell_type": "code",
1929 |       "source": [
1930 |         "table_elements[2]"
1931 |       ],
1932 |       "metadata": {
1933 |         "colab": {
1934 |           "base_uri": "https://localhost:8080/"
1935 |         },
1936 |         "id": "DHThHvBWnT87",
1937 |         "outputId": "154f0173-8cf3-4e7f-ef90-ee3a1befd2d2"
1938 |       },
1939 |       "execution_count": 34,
1940 |       "outputs": [
1941 |         {
1942 |           "output_type": "execute_result",
1943 |           "data": {
1944 |             "text/plain": [
1945 |               "Element(type='table', text='1. Title of Derivate |2. 3. Trans. 3A. Deemed |4. Trans. Code |5. Number of 6. Date Exercisable 7. Title and Amount of 8. Price of ]9. Number of | 10. Security Conversion | Date Execution |(Instr, 8) Derivative Securities |and Expiration Date _| Securities Underlying Derivative }derivative | Ownership] (Instr. 3) or Exercise Date, if any Acquired (A) or Derivative Security Security |Securities |Formof Price of Disposed of (D) (Instr. 3 and 4) (Instr. 5) |Beneficially | Derivative | Derivative (Instr. 3, 4 and 5) Owned Security: | Security Following —_| Direct (D) - Reported _| or Indirect Date Expiration] 1... | Amount or Number of Transaction(s)] (1) (Instr. coe |v | (a) (D) | Exercisable|Date Shares (instr. 4) 4)')"
1946 |             ]
1947 |           },
1948 |           "metadata": {},
1949 |           "execution_count": 34
1950 |         }
1951 |       ]
1952 |     },
1953 |     {
1954 |       "cell_type": "code",
1955 |       "source": [
1956 |         "from langchain.chat_models import ChatOpenAI\n",
1957 |         "from langchain.prompts import ChatPromptTemplate\n",
1958 |         "from langchain.schema.output_parser import StrOutputParser"
1959 |       ],
1960 |       "metadata": {
1961 |         "id": "he7fQp5qbLnu"
1962 |       },
1963 |       "execution_count": 35,
1964 |       "outputs": []
1965 |     },
1966 |     {
1967 |       "cell_type": "markdown",
1968 |       "source": [
1969 |         "3. Build a summarization chain with LangChain"
1970 |       ],
1971 |       "metadata": {
1972 |         "id": "1hBf_lftySxt"
1973 |       }
1974 |     },
1975 |     {
1976 |       "cell_type": "code",
1977 |       "source": [
1978 |         "prompt_text = \"\"\"\n",
1979 |         "  You are responsible for concisely summarizing the following table or text chunk:\n",
1980 |         "\n",
1981 |         "  {element}\n",
1982 |         "\"\"\"\n",
1983 |         "prompt = ChatPromptTemplate.from_template(prompt_text)\n",
1984 |         "summarize_chain = {\"element\": lambda x: x} | prompt | ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo\") | StrOutputParser()"
1985 |       ],
1986 |       "metadata": {
1987 |         "id": "uDQYbnKDbM7C"
1988 |       },
1989 |       "execution_count": 36,
1990 |       "outputs": []
1991 |     },
1992 |     {
1993 |       "cell_type": "markdown",
1994 |       "source": [
1995 |         "4. Summarize each text and table element"
1996 |       ],
1997 |       "metadata": {
1998 |         "id": "TikuTLiKyXy3"
1999 |       }
2000 |     },
2001 |     {
2002 |       "cell_type": "code",
2003 |       "source": [
2004 |         "tables = [i.text for i in table_elements]\n",
2005 |         "table_summaries = summarize_chain.batch(tables, {\"max_concurrency\": 5})\n",
2006 |         "\n",
2007 |         "texts = [i.text for i in text_elements]\n",
2008 |         "text_summaries = summarize_chain.batch(texts, {\"max_concurrency\": 5})"
2009 |       ],
2010 |       "metadata": {
2011 |         "id": "SMqqogGDbOk_"
2012 |       },
2013 |       "execution_count": 12,
2014 |       "outputs": []
2015 |     },
2016 |     {
2017 |       "cell_type": "markdown",
2018 |       "source": [
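2019 |         "5. Use LangChain's `MultiVectorRetriever` to associate the table and text summaries with the original chunks in a parent-child relationship: the summaries are embedded in the vector store for search, and the matching original chunks are returned from the docstore as context."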
2020 |       ],
2021 |       "metadata": {
2022 |         "id": "PskDZuy7ydfV"
2023 |       }
2024 |     },
2025 |     {
2026 |       "cell_type": "code",
2027 |       "source": [
2028 |         "import uuid\n",
2029 |         "\n",
2030 |         "from langchain.embeddings import OpenAIEmbeddings\n",
2031 |         "from langchain.retrievers.multi_vector import MultiVectorRetriever\n",
2032 |         "from langchain.schema.document import Document\n",
2033 |         "from langchain.storage import InMemoryStore\n",
2034 |         "from langchain.vectorstores import Chroma\n",
2035 |         "\n",
2036 |         "id_key = \"doc_id\"\n",
2037 |         "\n",
2038 |         "# The retriever (empty to start)\n",
2039 |         "retriever = MultiVectorRetriever(\n",
2040 |         "    vectorstore=Chroma(collection_name=\"summaries\", embedding_function=OpenAIEmbeddings()),\n",
2041 |         "    docstore=InMemoryStore(),\n",
2042 |         "    id_key=id_key,\n",
2043 |         ")\n",
2044 |         "\n",
2045 |         "# Add texts\n",
2046 |         "doc_ids = [str(uuid.uuid4()) for _ in texts]\n",
2047 |         "summary_texts = [\n",
2048 |         "    Document(page_content=s, metadata={id_key: doc_ids[i]})\n",
2049 |         "    for i, s in enumerate(text_summaries)\n",
2050 |         "]\n",
2051 |         "retriever.vectorstore.add_documents(summary_texts)\n",
2052 |         "retriever.docstore.mset(list(zip(doc_ids, texts)))\n",
2053 |         "\n",
2054 |         "# Add tables\n",
2055 |         "table_ids = [str(uuid.uuid4()) for _ in tables]\n",
2056 |         "summary_tables = [\n",
2057 |         "    Document(page_content=s, metadata={id_key: table_ids[i]})\n",
2058 |         "    for i, s in enumerate(table_summaries)\n",
2059 |         "]\n",
2060 |         "retriever.vectorstore.add_documents(summary_tables)\n",
2061 |         "retriever.docstore.mset(list(zip(table_ids, tables)))"
2062 |       ],
2063 |       "metadata": {
2064 |         "id": "-deJeuO4bRSZ"
2065 |       },
2066 |       "execution_count": 19,
2067 |       "outputs": []
2068 |     },
2069 |     {
2070 |       "cell_type": "code",
2071 |       "source": [
2072 |         "from langchain.schema.runnable import RunnablePassthrough\n",
2073 |         "\n",
2074 |         "template = \"\"\"Answer the question based only on the following context, which can include text and tables:\n",
2075 |         "{context}\n",
2076 |         "Question: {question}\n",
2077 |         "\"\"\"\n",
2078 |         "prompt = ChatPromptTemplate.from_template(template)\n",
2079 |         "\n",
2080 |         "# RAG pipeline\n",
2081 |         "chain = (\n",
2082 |         "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
2083 |         "    | prompt\n",
2084 |         "    | ChatOpenAI(temperature=0, model=\"gpt-4\")\n",
2085 |         "    | StrOutputParser()\n",
2086 |         ")"
2087 |       ],
2088 |       "metadata": {
2089 |         "id": "NuZzXm_kbTK7"
2090 |       },
2091 |       "execution_count": 37,
2092 |       "outputs": []
2093 |     },
2094 |     {
2095 |       "cell_type": "code",
2096 |       "source": [
2097 |         "chain.invoke(\"How many stocks were disposed? Who is the beneficial owner?\")"
2098 |       ],
2099 |       "metadata": {
2100 |         "id": "wAEyfvmabU3u",
2101 |         "colab": {
2102 |           "base_uri": "https://localhost:8080/",
2103 |           "height": 36
2104 |         },
2105 |         "outputId": "8ba7f132-32c0-455c-c6b5-453a40d4f60b"
2106 |       },
2107 |       "execution_count": 38,
2108 |       "outputs": [
2109 |         {
2110 |           "output_type": "execute_result",
2111 |           "data": {
2112 |             "text/plain": [
2113 |               "'2300 stocks were disposed. The beneficial owner is the Welch-Drell 2009 Revocable Trust.'"
2114 |             ],
2115 |             "application/vnd.google.colaboratory.intrinsic+json": {
2116 |               "type": "string"
2117 |             }
2118 |           },
2119 |           "metadata": {},
2120 |           "execution_count": 38
2121 |         }
2122 |       ]
2123 |     },
2124 |     {
2125 |       "cell_type": "markdown",
2126 |       "source": [
2127 |         "6. Experiment with GPT-3.5\n",
2128 |         "\n",
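2129 |         "It looks like it doesn't perform as well as GPT-4: with the same retriever and prompt, it reports that the context does not contain the answer."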
2130 |       ],
2131 |       "metadata": {
2132 |         "id": "Jbr8tZjuzGZf"
2133 |       }
2134 |     },
2135 |     {
2136 |       "cell_type": "code",
2137 |       "source": [
2138 |         "# RAG pipeline\n",
2139 |         "chain = (\n",
2140 |         "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
2141 |         "    | prompt\n",
2142 |         "    | ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo\")\n",
2143 |         "    | StrOutputParser()\n",
2144 |         ")\n",
2145 |         "chain.invoke(\"How many stocks were disposed? Who is the beneficial owner?\")"
2146 |       ],
2147 |       "metadata": {
2148 |         "colab": {
2149 |           "base_uri": "https://localhost:8080/",
2150 |           "height": 54
2151 |         },
2152 |         "id": "YWnJywytzPNt",
2153 |         "outputId": "453ac524-4528-4875-981d-b016b84ef72b"
2154 |       },
2155 |       "execution_count": 39,
2156 |       "outputs": [
2157 |         {
2158 |           "output_type": "execute_result",
2159 |           "data": {
2160 |             "text/plain": [
2161 |               "'Based on the given context, it is not possible to determine how many stocks were disposed or who the beneficial owner is. The context does not provide any specific information about the disposal of stocks or the identification of the beneficial owner.'"
2162 |             ],
2163 |             "application/vnd.google.colaboratory.intrinsic+json": {
2164 |               "type": "string"
2165 |             }
2166 |           },
2167 |           "metadata": {},
2168 |           "execution_count": 39
2169 |         }
2170 |       ]
2171 |     }
2172 |   ]
2173 | }


--------------------------------------------------------------------------------