├── README.md
├── rag.pdf
├── autogen_with_function_calls.ipynb
└── autogen_rag_agent.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # AutoGen Tutorials
2 | Tutorials covering AutoGen, from the basics to real-world use cases
3 |
--------------------------------------------------------------------------------
/rag.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sugarforever/AutoGen-Tutorials/main/rag.pdf
--------------------------------------------------------------------------------
/autogen_with_function_calls.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "id": "view-in-github",
7 | "colab_type": "text"
8 | },
9 | "source": [
10 | "
"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "source": [
16 | "# AutoGen with Function Calls\n",
17 | "\n",
18 | "Let's see how to use AssistantAgent and UserProxyAgent to make function calls with the new feature of OpenAI models.\n",
19 | "\n",
20 | "In this notebook, we will instruct the AssistantAgent to use provided function as tool to search news with Google Serp API."
21 | ],
22 | "metadata": {
23 | "id": "39_S0tWcB-oh"
24 | }
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": null,
29 | "metadata": {
30 | "execution": {
31 | "iopub.execute_input": "2023-02-13T23:40:52.317406Z",
32 | "iopub.status.busy": "2023-02-13T23:40:52.316561Z",
33 | "iopub.status.idle": "2023-02-13T23:40:52.321193Z",
34 | "shell.execute_reply": "2023-02-13T23:40:52.320628Z"
35 | },
36 | "id": "1VRZnGGGgkhl"
37 | },
38 | "outputs": [],
39 | "source": [
40 | "%pip install pyautogen~=0.1.0 docker google-search-results -q -U"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "source": [
46 | "from serpapi import GoogleSearch\n",
47 | "\n",
48 | "def search_google_news(keyword):\n",
49 | " print(f\"Keyword: {keyword}\")\n",
50 | " search = GoogleSearch({\n",
51 | " \"q\": keyword,\n",
52 | " \"tbm\": \"nws\",\n",
53 | " \"api_key\": open(\"serp_api_key.txt\").read().strip()\n",
54 | " })\n",
55 | " result = search.get_dict()\n",
56 | " return [item['link'] for item in result['news_results']]\n"
57 | ],
58 | "metadata": {
59 | "id": "YbaB37ztBXgQ"
60 | },
61 | "execution_count": null,
62 | "outputs": []
63 | },
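{
"cell_type": "markdown",
"source": [
"Before wiring `search_google_news` into an agent, you can sanity-check it on its own. This assumes `serp_api_key.txt` exists in the working directory and contains a valid SerpApi key:\n",
"\n",
"```python\n",
"links = search_google_news(\"AutoGen\")\n",
"print(links[:3])  # peek at the first few result URLs\n",
"```"
],
"metadata": {}
},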
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {
67 | "id": "AvkcIBUpgkhl"
68 | },
69 | "source": [
70 | "## Set your API Endpoint\n",
71 | "\n",
72 | "The [`config_list_from_json`](https://microsoft.github.io/autogen/docs/reference/oai/openai_utils#config_list_from_json) function loads a list of configurations from an environment variable or a json file.\n"
73 | ]
74 | },
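{
"cell_type": "markdown",
"source": [
"If you don't already have an `OAI_CONFIG_LIST` file, one way to provide the configuration is through the environment variable of the same name. A minimal sketch (the `sk-...` value is a placeholder for your real OpenAI API key):\n",
"\n",
"```python\n",
"import json\n",
"import os\n",
"\n",
"# One entry per model/endpoint you want config_list_from_json to pick up.\n",
"os.environ[\"OAI_CONFIG_LIST\"] = json.dumps([\n",
"    {\"model\": \"gpt-4\", \"api_key\": \"sk-...\"}\n",
"])\n",
"```"
],
"metadata": {}
},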
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {
79 | "id": "ZsXuHf1fgkhl"
80 | },
81 | "outputs": [],
82 | "source": [
83 | "import autogen\n",
84 | "\n",
85 | "config_list = autogen.config_list_from_json(\n",
86 | " \"OAI_CONFIG_LIST\",\n",
87 | " filter_dict={\n",
88 | " \"model\": [\"gpt-4\"],\n",
89 | " },\n",
90 | ")\n",
91 | "\n",
92 | "llm_config={\n",
93 | " \"request_timeout\": 600,\n",
94 | " \"seed\": 42,\n",
95 | " \"config_list\": config_list,\n",
96 | " \"temperature\": 0,\n",
97 | " \"functions\": [\n",
98 | " {\n",
99 | " \"name\": \"search_google_news\",\n",
100 | " \"description\": \"Search google news by keyword\",\n",
101 | " \"parameters\": {\n",
102 | " \"type\": \"object\",\n",
103 | " \"properties\": {\n",
104 | " \"keyword\": {\n",
105 | " \"type\": \"string\",\n",
106 | " \"description\": \"The keyword that's used to search google news\",\n",
107 | " }\n",
108 | " },\n",
109 | " \"required\": [\"keyword\"],\n",
110 | " },\n",
111 | " }\n",
112 | " ],\n",
113 | "}"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {
119 | "id": "-G6my50Lgkhm"
120 | },
121 | "source": [
122 | "## Construct Agents\n",
123 | "\n",
124 | "We construct the assistant agent and the user proxy agent. We specify `human_input_mode` as \"TERMINATE\" in the user proxy agent, which will ask for human feedback when it receives a \"TERMINATE\" signal from the assistant agent."
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {
131 | "id": "eg9g65hOgkhm"
132 | },
133 | "outputs": [],
134 | "source": [
135 | "# create an AssistantAgent instance named \"assistant\"\n",
136 | "assistant = autogen.AssistantAgent(\n",
137 | " name=\"assistant\",\n",
138 | " llm_config=llm_config,\n",
139 | ")\n",
140 | "# create a UserProxyAgent instance named \"user_proxy\"\n",
141 | "user_proxy = autogen.UserProxyAgent(\n",
142 | " name=\"user_proxy\",\n",
143 | " human_input_mode=\"NEVER\",\n",
144 | " max_consecutive_auto_reply=10,\n",
145 | " code_execution_config={\"work_dir\": \".\"},\n",
146 | " llm_config=llm_config,\n",
147 | " system_message=\"\"\"When a link is provided, you should ask the assistant for fetching the content. Reply TERMINATE if the task has been solved at full satisfaction.\n",
148 | "Otherwise, reply CONTINUE, or the reason why the task is not solved yet.\"\"\",\n",
149 | " function_map={\"search_google_news\": search_google_news}\n",
150 | ")"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {
157 | "id": "aCdAqig3gkhn",
158 | "outputId": "3c5f7234-56ff-401a-e31c-40221d263b6a",
159 | "colab": {
160 | "base_uri": "https://localhost:8080/"
161 | }
162 | },
163 | "outputs": [
164 | {
165 | "output_type": "stream",
166 | "name": "stdout",
167 | "text": [
168 | "user_proxy (to assistant):\n",
169 | "\n",
170 | "\n",
171 | "Search google news in the topic of \"Hamas\", and write a news based on them.\n",
172 | "You should avoid bullet points. \n",
173 | "Instead, use more expressive words to form a news that's like one written by a well recognized journalist.\n",
174 | "\n",
175 | "Start the work now.\n",
176 | "\n",
177 | "\n",
178 | "--------------------------------------------------------------------------------\n",
179 | "assistant (to user_proxy):\n",
180 | "\n",
181 | "***** Suggested function Call: search_google_news *****\n",
182 | "Arguments: \n",
183 | "\n",
184 | "{\n",
185 | " \"keyword\": \"Hamas\"\n",
186 | "}\n",
187 | "*******************************************************\n",
188 | "\n",
189 | "--------------------------------------------------------------------------------\n",
190 | "\n",
191 | ">>>>>>>> EXECUTING FUNCTION search_google_news...\n",
192 | "Keyword: Hamas\n",
193 | "user_proxy (to assistant):\n",
194 | "\n",
195 | "***** Response from calling function \"search_google_news\" *****\n",
196 | "['https://www.wsj.com/livecoverage/israel-hamas-gaza-rockets-attack-palestinians', 'https://apnews.com/article/israel-palestinians-gaza-hamas-rockets-airstrikes-tel-aviv-11fb98655c256d54ecb5329284fc37d2', 'https://www.bbc.com/news/world-middle-east-67043563', 'https://www.foxnews.com/live-news/hamas-attack-israel-war', 'https://www.nytimes.com/live/2023/10/07/world/israel-gaza-attack', 'https://www.cnn.com/2023/10/07/middleeast/israel-hamas-rocket-attack-what-we-know-intl/index.html', 'https://www.usatoday.com/story/news/world/2023/10/07/israel-netanyahu-says-at-war-as-hamas-palestinian-militants-launch-surprise-attack/71098144007/', 'https://www.nbcnews.com/news/world/live-blog/live-updates-hamas-israel-gaza-attack-rockets-gunmen-palestinian-rcna119316', 'https://abcnews.go.com/International/rockets-fired-gaza-tel-aviv-jerusalem-hamas-militants/story?id=103803865']\n",
197 | "***************************************************************\n",
198 | "\n",
199 | "--------------------------------------------------------------------------------\n",
200 | "assistant (to user_proxy):\n",
201 | "\n",
202 | "Based on the recent news articles, here is a news story about the current situation with Hamas:\n",
203 | "\n",
204 | "In a sudden escalation of tensions, Hamas, the Palestinian militant group, has launched a surprise attack on Israel. The situation has rapidly evolved into a state of war, as stated by Israel's Prime Minister Netanyahu. The attack has seen rockets being fired from Gaza towards Tel Aviv and Jerusalem, causing widespread panic and chaos in these cities.\n",
205 | "\n",
206 | "The Wall Street Journal, AP News, and BBC have all reported on the ongoing rocket attacks and airstrikes in the region. The situation is tense, with both sides showing no signs of backing down. The Israeli Defense Forces (IDF) have responded with their own set of airstrikes, targeting key Hamas locations in Gaza.\n",
207 | "\n",
208 | "Fox News and The New York Times have provided live coverage of the events, detailing the unfolding situation on the ground. The reports indicate that the situation is volatile, with the potential for further escalation.\n",
209 | "\n",
210 | "CNN and USA Today have highlighted the international implications of this conflict. The world watches anxiously as the situation unfolds, with calls for peace and de-escalation coming from various international bodies.\n",
211 | "\n",
212 | "NBC News and ABC News have provided live updates on the situation, with reports of rockets being fired and gunmen on the streets. The Palestinian militants have claimed responsibility for the attacks, stating that they are in response to Israeli actions in the region.\n",
213 | "\n",
214 | "As the situation continues to evolve, the world watches with bated breath, hoping for a peaceful resolution to this sudden escalation of violence.\n",
215 | "\n",
216 | "TERMINATE\n",
217 | "\n",
218 | "--------------------------------------------------------------------------------\n",
219 | "user_proxy (to assistant):\n",
220 | "\n",
221 | "TERMINATE\n",
222 | "\n",
223 | "--------------------------------------------------------------------------------\n"
224 | ]
225 | }
226 | ],
227 | "source": [
228 | "# the assistant receives a message from the user, which contains the task description\n",
229 | "user_proxy.initiate_chat(\n",
230 | " assistant,\n",
231 | " message=\"\"\"\n",
232 | "Search google news in the topic of \"Hamas\", and write a news based on them.\n",
233 | "You should avoid bullet points.\n",
234 | "Instead, use more expressive words to form a news that's like one written by a well recognized journalist.\n",
235 | "\n",
236 | "Start the work now.\n",
237 | "\"\"\"\n",
238 | ")"
239 | ]
240 | }
241 | ],
242 | "metadata": {
243 | "kernelspec": {
244 | "display_name": "Python 3",
245 | "language": "python",
246 | "name": "python3"
247 | },
248 | "language_info": {
249 | "codemirror_mode": {
250 | "name": "ipython",
251 | "version": 3
252 | },
253 | "file_extension": ".py",
254 | "mimetype": "text/x-python",
255 | "name": "python",
256 | "nbconvert_exporter": "python",
257 | "pygments_lexer": "ipython3",
258 | "version": "3.9.17"
259 | },
260 | "vscode": {
261 | "interpreter": {
262 | "hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1"
263 | }
264 | },
265 | "colab": {
266 | "provenance": [],
267 | "include_colab_link": true
268 | }
269 | },
270 | "nbformat": 4,
271 | "nbformat_minor": 0
272 | }
273 |
--------------------------------------------------------------------------------
/autogen_rag_agent.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "id": "view-in-github",
7 | "colab_type": "text"
8 | },
9 | "source": [
10 | "
"
11 | ]
12 | },
22 | {
23 | "cell_type": "markdown",
24 | "source": [
25 | "# AutoGen Agents with Retrieval Augmented Generation\n",
26 | "\n",
27 | "**`AutoGen`** is a versatile framework that facilitates the creation of LLM applications by employing multiple agents capable of interacting with one another to tackle tasks. These AutoGen agents can be tailored to specific needs, engage in conversations, and seamlessly integrate human participation. They are adaptable to different operation modes that encompass the utilization of LLMs, human inputs, and various tools.\n",
28 | "\n",
29 | "**`RAG`** stands for `Retrieval Augmented Generation`, a natural language processing (NLP) technique that combines two essential components: **retrieval** and **generation**.\n",
30 | "\n",
31 | "The previous tutorial [AutoGen + LangChain = Super AI Agents](https://github.com/sugarforever/LangChain-Advanced/blob/main/Integrations/AutoGen/autogen_langchain_uniswap_ai_agent.ipynb) introduced how to build an AutoGen application that can execute tasks requiring specific documents knowledge. This is a typical RAG use case, aka. document based chatbot.\n",
32 | "\n",
33 | "The latest **AutoGen** version already supports RAG natively with the feature package `retrievechat`.\n",
34 | "\n",
35 | "In this tutorial, we are going to rebuild the same feature demonstrated in the previous tutorial. We will utilize `AutoGen` `retrievechat` feature.\n",
36 | "\n",
37 | "This tutorial is inspired by the [Blog - Retrieval-Augmented Generation (RAG) Applications with AutoGen](https://microsoft.github.io/autogen/blog/2023/10/18/RetrieveChat) of [Li Jiang](https://github.com/thinkall).\n",
38 | "\n",
39 | "Credits go to Li Jiang! 🙌\n",
40 | "\n",
41 | "Let's roll! 🚴🏻♀️ 🚴🏻 🚴🏻♂️"
42 | ],
43 | "metadata": {
44 | "id": "39_S0tWcB-oh"
45 | }
46 | },
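{
"cell_type": "markdown",
"source": [
"In schematic terms, RAG wraps two steps around every LLM call. The sketch below is illustrative only; `vector_store` and `llm` are hypothetical stand-ins, not part of any specific library:\n",
"\n",
"```python\n",
"def rag_answer(question):\n",
"    # 1. Retrieval: find the stored chunks most similar to the question.\n",
"    chunks = vector_store.similarity_search(question, k=3)\n",
"    # 2. Generation: let the LLM answer grounded in those chunks.\n",
"    prompt = f\"Context: {chunks}\\n\\nQuestion: {question}\"\n",
"    return llm(prompt)\n",
"```"
],
"metadata": {}
},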
47 | {
48 | "cell_type": "markdown",
49 | "source": [
50 | "## Use Case\n",
51 | "\n",
52 | "\n",
53 | "\n",
54 | "In this tutorial, I will create the retrieval augmented agents with the following document:\n",
55 | "\n",
56 | "[RETRIEVAL AUGMENTED GENERATION AND REPRESENTATIVE\n",
57 | "VECTOR SUMMARIZATION FOR LARGE UNSTRUCTURED\n",
58 | "TEXTUAL DATA IN MEDICAL EDUCATION](https://arxiv.org/pdf/2308.00479.pdf)\n",
59 | "\n",
60 | "You should be able to see the agents are able to perform retrieval augmented generation based on the document above and answer question."
61 | ],
62 | "metadata": {
63 | "id": "NtvjgnBZZjUL"
64 | }
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "source": [
69 | "### Environment Preparation"
70 | ],
71 | "metadata": {
72 | "id": "z73o7bmtb5LH"
73 | }
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {
79 | "execution": {
80 | "iopub.execute_input": "2023-02-13T23:40:52.317406Z",
81 | "iopub.status.busy": "2023-02-13T23:40:52.316561Z",
82 | "iopub.status.idle": "2023-02-13T23:40:52.321193Z",
83 | "shell.execute_reply": "2023-02-13T23:40:52.320628Z"
84 | },
85 | "id": "1VRZnGGGgkhl"
86 | },
87 | "outputs": [],
88 | "source": [
89 | "%pip install pyautogen[retrievechat] langchain \"chromadb<0.4.15\" -q"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "source": [
95 | "import autogen\n",
96 | "\n",
97 | "config_list = autogen.config_list_from_json(\n",
98 | " \"OAI_CONFIG_LIST\",\n",
99 | " filter_dict={\n",
100 | " \"model\": [\"gpt-4\"],\n",
101 | " },\n",
102 | ")"
103 | ],
104 | "metadata": {
105 | "id": "HZ7w_A3nXU8-"
106 | },
107 | "execution_count": null,
108 | "outputs": []
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "source": [
113 | "### Steps"
114 | ],
115 | "metadata": {
116 | "id": "BA48TH6Hc_3c"
117 | }
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "source": [
122 | "#### 1. Configure Embedding Function\n",
123 | "\n",
124 | "We will use OpenAI embedding function."
125 | ],
126 | "metadata": {
127 | "id": "rCrCnRC7cdC-"
128 | }
129 | },
130 | {
131 | "cell_type": "code",
132 | "source": [
133 | "from chromadb.utils import embedding_functions\n",
134 | "\n",
135 | "openai_embedding_function = embedding_functions.OpenAIEmbeddingFunction(api_key = config_list[0][\"api_key\"])"
136 | ],
137 | "metadata": {
138 | "id": "c636oZ2zHNpr"
139 | },
140 | "execution_count": null,
141 | "outputs": []
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "source": [
146 | "#### 2. Configure Text Splitter\n",
147 | "\n",
148 | "LangChain has done a great job in text splitting, so we will use its components."
149 | ],
150 | "metadata": {
151 | "id": "PxFsXiHVciOo"
152 | }
153 | },
154 | {
155 | "cell_type": "code",
156 | "source": [
157 | "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
158 | "\n",
159 | "text_splitter = RecursiveCharacterTextSplitter(separators=[\"\\n\\n\", \"\\n\", \"\\r\", \"\\t\"])"
160 | ],
161 | "metadata": {
162 | "id": "6eRvVjJITKfR"
163 | },
164 | "execution_count": null,
165 | "outputs": []
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "source": [
170 | "#### 3. Configure Vector Store\n",
171 | "\n",
172 | "By default, the AutoGen retrieval augmented agents use `chromadb` as vector store.\n",
173 | "\n",
174 | "Developers can configure preferred vector store by extending the class `RetrieveUserProxyAgent` and overriding function `retrieve_docs`.\n",
175 | "\n",
176 | "AutoGen also supports simple configuration items to customize Chromadb storage.\n",
177 | "\n",
178 | "In this demo, we will specify the collection name by `retreive_config` item `collection_name`. You should be able to see it in step 4.\n"
179 | ],
180 | "metadata": {
181 | "id": "O7VPXVI_coX4"
182 | }
183 | },
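{
"cell_type": "markdown",
"source": [
"As a rough illustration of that extension point, a custom agent could look like the sketch below. This is a minimal sketch, not a definitive implementation: check the exact `retrieve_docs` signature and the expected shape of `self._results` against your installed AutoGen version, and note that `my_vector_store` is a hypothetical stand-in for whatever store you prefer:\n",
"\n",
"```python\n",
"from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent\n",
"\n",
"class CustomVectorStoreAgent(RetrieveUserProxyAgent):\n",
"    def retrieve_docs(self, problem, n_results=20, search_string=\"\"):\n",
"        # Query your own vector store instead of the default chromadb.\n",
"        docs = my_vector_store.similarity_search(problem, k=n_results)  # hypothetical store\n",
"        # Store the results in the structure the agent consumes downstream.\n",
"        self._results = {\n",
"            \"ids\": [[str(i) for i in range(len(docs))]],\n",
"            \"documents\": [[d.page_content for d in docs]],\n",
"        }\n",
"```"
],
"metadata": {}
},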
184 | {
185 | "cell_type": "markdown",
186 | "source": [
187 | "#### 4. Create Retrieval Augmented Agents"
188 | ],
189 | "metadata": {
190 | "id": "Wu7gjAv-c4uP"
191 | }
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": null,
196 | "metadata": {
197 | "id": "ZsXuHf1fgkhl"
198 | },
199 | "outputs": [],
200 | "source": [
201 | "from autogen.agentchat.contrib.retrieve_assistant_agent import RetrieveAssistantAgent\n",
202 | "from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent\n",
203 | "\n",
204 | "llm_config = {\n",
205 | " \"request_timeout\": 600,\n",
206 | " \"config_list\": config_list,\n",
207 | " \"temperature\": 0\n",
208 | "}\n",
209 | "\n",
210 | "assistant = RetrieveAssistantAgent(\n",
211 | " name=\"assistant\",\n",
212 | " system_message=\"You are a helpful assistant.\",\n",
213 | " llm_config=llm_config,\n",
214 | ")\n",
215 | "\n",
216 | "rag_agent = RetrieveUserProxyAgent(\n",
217 | " human_input_mode=\"NEVER\",\n",
218 | " retrieve_config={\n",
219 | " \"task\": \"qa\",\n",
220 | " \"docs_path\": \"./rag.pdf\",\n",
221 | " \"collection_name\": \"rag_collection\",\n",
222 | " \"embedding_function\": openai_embedding_function,\n",
223 | " \"custom_text_split_function\": text_splitter.split_text,\n",
224 | " },\n",
225 | ")\n",
226 | "\n"
227 | ]
228 | },
229 | {
230 | "cell_type": "markdown",
231 | "source": [
232 | "### It's time to start a chat with the RAG agent."
233 | ],
234 | "metadata": {
235 | "id": "37cRtpqLdLSZ"
236 | }
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": null,
241 | "metadata": {
242 | "id": "aCdAqig3gkhn",
243 | "outputId": "d4463f2e-7dcc-4fa1-8552-b9dab3291afd",
244 | "colab": {
245 | "base_uri": "https://localhost:8080/"
246 | }
247 | },
248 | "outputs": [
249 | {
250 | "output_type": "stream",
251 | "name": "stderr",
252 | "text": [
253 | "INFO:autogen.retrieve_utils:Found 3 chunks.\n"
254 | ]
255 | },
256 | {
257 | "output_type": "stream",
258 | "name": "stdout",
259 | "text": [
260 | "Trying to create collection.\n",
261 | "doc_ids: [['doc_1', 'doc_2']]\n",
262 | "Adding doc_id doc_1 to context.\n",
263 | "Adding doc_id doc_2 to context.\n",
264 | "RetrieveChatAgent (to assistant):\n",
265 | "\n",
266 | "You're a retrieve augmented chatbot. You answer user's questions based on your own knowledge and the\n",
267 | "context provided by the user.\n",
268 | "If you can't answer the question with or without the current context, you should reply exactly `UPDATE CONTEXT`.\n",
269 | "You must give as short an answer as possible.\n",
270 | "\n",
271 | "User's question is: What is the workflow in docGPT?\n",
272 | "\n",
273 | "Context is: is initially stored in memory as a Document and then the Document is split into text chunks using recursive character text\n",
274 | "splitting. The chunks are then embedded in a 1536-dimensional vector space using OpenAI text-embedding-ada-002\n",
275 | "embedding engine and stored in Facebook AI Similarity Search ( FAISS ) vector database [ 3]. The FAISS vector database\n",
276 | "is used to find the k-most similar chunks to a given query at the query time. The original query, combined with the\n",
277 | "retrieved chunks is compiled into a prompt and passed to the LLM for generating the answer. The model workflow and\n",
278 | "the steps excecuted at build-time and at query-time are depicted in Figure 1.\n",
279 | "2.2 Summarization\n",
280 | "2.2.1 Representative Vector\n",
281 | "A maximum affordable token limit ( T) is initially defined. The target is to select knumber of chunks which is the\n",
282 | "maximum number of chunks that can be selected without their total token size exceeding T, from the vector database of\n",
283 | "nchunks. Assuming each chunk has an average token size of s,kis simply obtained by optimizing for max k≤n,k×s≤Tk\n",
284 | "Once kis calculated, the chunks in the high-dimension vector space are quantized using the k-means clustering\n",
285 | "algorithm. k-number of clusters which minimize the within-cluster sum of squared Euclidian distances from the\n",
286 | "corresponding centroids are created within the vector space. Since the distribution of chunks in the vector space is based\n",
287 | "on their contextual similarity, we assume each cluster captures different aspects/semantics of the original document.\n",
288 | "After this quantization, one representative chunk, which is the closest to the corresponding centroid from each cluster is\n",
289 | "extracted.\n",
290 | "Let vectors be an n×dmatrix, where nis the number of chunks and dis the dimensionality of the embeddings. Let\n",
291 | "centroids [i]denote the centroid of i-th cluster. The Euclidian distance between the m-th data point and the centroid of\n",
292 | "thei-th cluster is calculated by\n",
293 | "distance m=vuutdX\n",
294 | "j=1(vectors [m][j]−centroids [i][j])2 (1)\n",
295 | "and stored in the distances [i]array.\n",
296 | "Next, the index of the data point that has the minimum distance to the centroid of the i-th cluster, denoted as\n",
297 | "closest _index i, is calculated by:\n",
298 | "closest _index i= argmin distances i (2)\n",
299 | "Finally, the closest _index ivalues for all clusters are stored in a list.\n",
300 | "closest _indices = [closest _index 0, closest _index 1, . . . , closest _index k−1]\n",
301 | "Once the indices of the representative chunks are identified, they are stored in a separate representative Document list.\n",
302 | "2.2.2 Keyword generation and mapping\n",
303 | "Even though the token size of the representative text chunks obtained with this method is relatively smaller compared to\n",
304 | "the original document, it can still be too large to fit into the LLM’s context window. For example, docGPT uses a default\n",
305 | "maximum affordable token limit of 10,000 tokens. The default LLM that docGPT calls is OpenAI gpt-3.5-turbo ,\n",
306 | "which has a maximum context window of 4k. Therefore, the representative Document list undergoes an intermediate\n",
307 | "step of extractive summarization (keyword generation and mapping) for each chunk in it.\n",
308 | "Three keywords are generated for each representative chunk and the keywords are distributed among all the other\n",
309 | "members of the same cluster. A word cloud is generated from all the keywords from the n chunks. The relative\n",
310 | "frequency that each keyword appears is reflected by the size of the word in the word cloud.\n",
311 | "2AI for Medicine\n",
312 | "Figure 1: Retrieval workflow of docGPT\n",
313 | "3AI for Medicine\n",
314 | "Figure 2: Word cloud for Kumar and Clark Clinical Medicine\n",
315 | "The original high-dimensional vectors are then reduced to two dimensions using t-distributed Stochastic Neighbor\n",
316 | "Embedding (t-SNE) and plotted on a scatter plot with colors corresponding to their clusters and keywords. This enables\n",
317 | "the identification of the distribution of the contents of the entire document at a glance.\n",
318 | "Finally, the mapped summaries are used to create a final abstractive summary and to create a list of key points from the\n",
319 | "documents.\n",
320 | "3 Evaluation\n",
321 | "3.1 Retrieval\n",
322 | "Finally, the mapped summaries are used to create a final abstractive summary and to create a list of key points from the\n",
323 | "documents.\n",
324 | "3 Evaluation\n",
325 | "3.1 Retrieval\n",
326 | "Responses for queries on clinical medicine and general pharmacology from LLM without a non-parametric knowledge-\n",
327 | "base and a RAG model with vector databases built from Kumar and Clark’s Clinical Medicine 10th Edition and British\n",
328 | "National Formulary 82 were compared and checked for accuracy [ 6,7]. The responses generated by chatGPT and\n",
329 | "RAG implementation in docGPT are tabulated with the excerpts from the reference books in the supplementary material\n",
330 | "named Queries. For each query, docGPT generated more targetted and accurate answers, while chtGPT answers were\n",
331 | "more generic.\n",
332 | "3.2 Summarization\n",
333 | "Performances of RVS and retrieve-stuff-summarize methods were tested using Kumar and Clark Clinical Medicine 10th\n",
334 | "edition and BNF 82 ebooks. Results are summarized in supplementary material named Summaries.\n",
335 | "Kumar and Clark had 1508 pages with 13024 text chunks, each having an average of 789 tokens. The maximum\n",
336 | "affordable token limit was set at 15,000 and 19 representative chunks were selected. The word cloud and the t-SNE\n",
337 | "visualization are depicted in Figures 2 and 3.\n",
338 | "BNF had 1805 pages with 7278 text chunks with an average token size of 486. The model chose 10 representative\n",
339 | "chunks under the constraints of 5000 maximum affordable tokens. The word cloud and t-SNE are depicted in Figures 4\n",
340 | "and 5.\n",
341 | "4 Implementation\n",
342 | "RVS is implemented in docGPT , a document intelligence program written in Python using langchain framework and\n",
343 | "the source is available at https://github.com/ssm123ssm/docGPT-pharm .\n",
344 | "5 Conclusion\n",
345 | "Clinical medicine is a knowledge-intensive domain. Both clinicians and medical students would benefit from efficient\n",
346 | "methods for retrieving information quickly from large knowledgebases. We believe the proposed retrieval augmented\n",
347 | "4AI for Medicine\n",
348 | "Figure 3: t-SNE visualization of clustering of chunks in embedding space for Kumar and Clark Clinical Medicine\n",
349 | "Figure 4: Word cloud BNF 82\n",
350 | "Figure 5: t-SNE visualization of clustering of chunks in embedding space for BNF\n",
351 | "5AI for Medicine\n",
352 | "generation workflow and representative vector summarization for large documents would be of help in this context.\n",
353 | "Even though the workflow was tested on medical reference books and use cases related to medical education, the\n",
354 | "concept of RAG and RVS can be adopted by other domains as well.\n",
355 | "References\n",
356 | "[1]Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales,\n",
357 | "Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly,\n",
358 | "Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Dale Webster, Greg S. Corrado,\n",
359 | "Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher\n",
360 | "Semturs, Alan Karthikesalingam, and Vivek Natarajan. Large language models encode clinical knowledge.\n",
361 | "[2]Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler,\n",
362 | "Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation\n",
363 | "for knowledge-intensive NLP tasks.\n",
364 | "[3] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs.\n",
365 | "[4] Harrison Chase. LangChain. original-date: 2022-10-17T02:58:36Z.\n",
366 | "[5]R. Smith. An overview of the tesseract OCR engine. In Ninth International Conference on Document Analysis and\n",
367 | "Recognition (ICDAR 2007) Vol 2 , pages 629–633. IEEE. ISSN: 1520-5363.\n",
368 | "[6] Kumar and clark’s clinical medicine, 10th edition - 9780702078682.\n",
369 | "[7] British national formulary | BNF publications.\n",
370 | "6\n",
371 | "\n",
372 | "\n",
373 | "\n",
374 | "--------------------------------------------------------------------------------\n",
375 | "assistant (to RetrieveChatAgent):\n",
376 | "\n",
377 | "The workflow in docGPT involves storing a document in memory, splitting it into text chunks, and embedding these chunks into a vector space using OpenAI's text-embedding engine. These embedded chunks are stored in a FAISS vector database. At query time, the k-most similar chunks to a given query are retrieved from the database. The original query and retrieved chunks are compiled into a prompt and passed to the Language Model for generating the answer. The workflow also includes a summarization process where representative chunks are selected, keywords are generated, and a final abstractive summary is created.\n",
378 | "\n",
379 | "--------------------------------------------------------------------------------\n"
380 | ]
381 | }
382 | ],
383 | "source": [
384 | "assistant.reset()\n",
385 | "rag_agent.initiate_chat(assistant, problem=\"What is the workflow in docGPT?\", n_results=2)"
386 | ]
387 | }
388 | ],
389 | "metadata": {
390 | "kernelspec": {
391 | "display_name": "Python 3",
392 | "language": "python",
393 | "name": "python3"
394 | },
395 | "language_info": {
396 | "codemirror_mode": {
397 | "name": "ipython",
398 | "version": 3
399 | },
400 | "file_extension": ".py",
401 | "mimetype": "text/x-python",
402 | "name": "python",
403 | "nbconvert_exporter": "python",
404 | "pygments_lexer": "ipython3",
405 | "version": "3.9.17"
406 | },
407 | "vscode": {
408 | "interpreter": {
409 | "hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1"
410 | }
411 | },
412 | "colab": {
413 | "provenance": [],
414 | "include_colab_link": true
415 | }
416 | },
417 | "nbformat": 4,
418 | "nbformat_minor": 0
419 | }
--------------------------------------------------------------------------------