├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── Notebook └── generating-synthetic-datasets-for-evaluating-retrieval-augmented-generation-systems.ipynb └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | .python-version 3 | .pdf 4 | .DS_Store 5 | Notebooks/synthetic_dataset_generation/ -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. 
As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT No Attribution 2 | 3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so. 10 | 11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 17 | 18 | -------------------------------------------------------------------------------- /Notebook/generating-synthetic-datasets-for-evaluating-retrieval-augmented-generation-systems.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "2c142cfb-78f5-4126-8f67-b1735786473b", 6 | "metadata": {}, 7 | "source": [ 8 | "# Generating Synthetic Datasets for Evaluating Retrieval Augmented Generation Systems\n", 9 | "\n", 10 | "\n", 11 | "As Retrieval Augmented Generation (RAG) systems become more prevalent, evaluating their performance is essential to ensure quality and performance. However, collecting real-world data for evaluation can be costly and time-consuming, especially in the early stages of a project. To address this challenge of data scarcity, synthetic dataset generation provides a practical solution for generating datasets that mimic real human interactions, enabling efficient and scalable evaluation of RAG systems. By leveraging large language models and knowledge retrieval context, the proposed approach ensures that the synthetic datasets are diverse, realistic, and representative of real-world scenarios. 
This solution is relevant for developers and researchers working on RAG systems, as it streamlines the evaluation process and accelerates the iterative development cycle, ultimately leading to better-performing AI systems. The process of generating synthetic datasets is integrated in open source tools like [RAGAS](https://docs.ragas.io/en/stable/) and will be outlined in this notebook. \n", 12 | "\n", 13 | "In this notebook you will be guided through generating a synthetic dataset for a QA-RAG application using Anthropic Claude via the Bedrock API, Python and Langchain. The notebook consists of the following chapters: \n", 14 | "\n", 15 | "1. [Set-up of the environment](#1.-Set-up-of-the-environment)\n", 16 | "2. [Loading and preparing context data](#2.-Loading-and-Preparing-Data)\n", 17 | "3. [Initial Question Generation](#3.-Initial-Question-Generation)\n", 18 | "4. [Answer Generation](#4.-Answer-Generation)\n", 19 | "5. [Extracting Relevant Context](#5.-Extracting-Relevant-Context)\n", 20 | "6. [Evolving Questions to fit End-User behaviour](#6.-Evolving-Questions-to-fit-end-users-behaviour)\n", 21 | "7. [Automated Dataset Generation](#7.-Automated-Dataset-Generation)\n", 22 | "8. [Assessing the questions quality using Critique Agents](#8.-Assessing-the-questions-quality-using-Critique-Agents)\n", 23 | "\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "id": "13666ed4-1d44-43d3-a59e-b0db38cf4520", 29 | "metadata": {}, 30 | "source": [ 31 | "## 1. Set-up of the environment\n", 32 | "Let's start by installing the required libraries." 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "id": "24b2f16b-e59a-4155-92c1-dde7eee6f2f7", 39 | "metadata": { 40 | "tags": [] 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "%pip install -q langchain==0.1.10 boto3 pypdf pandas tqdm" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "id": "a65bb4a4-bc7e-4db7-9cce-2062494cfb95", 50 | "metadata": {}, 51 | "source": [ 52 | "## 2. Loading and Preparing Data\n", 53 | "\n", 54 | "For this lab you will use a fictitious use case where you want to build a chatbot to answer questions about Amazon shareholder letters. A typical technique to build such a chatbot is Retrieval-Augmented Generation (RAG). While this lab focuses on dataset generation, let's start with a quick RAG primer for some background context. \n", 55 | "\n", 56 | "#### What is RAG?\n", 57 | "\n", 58 | "Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response. Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, all without the need to retrain the model. \n", 59 | "\n", 60 | "Now in order to build a synthetic evaluation dataset for such a question answering RAG system, raw data from the knowledge source is used to derive possible user questions. For our use case you will use PDF files of shareholder letters with text information loaded from the internet to serve as the knowledge base. 
In production grade RAG implementations, the knowledge retriever may leverage a database that supports vector searches to dynamically look up relevant documents that serve as the knowledge source.\n", 61 | "\n", 62 | "In our case let's start by downloading the shareholder letters." 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "id": "89b8a5d5-d361-4dc2-bcb4-c0ae01456e67", 69 | "metadata": { 70 | "tags": [] 71 | }, 72 | "outputs": [], 73 | "source": [ 74 | "# Import necessary libraries for downloading files\n", 75 | "from urllib.request import urlretrieve \n", 76 | "import os\n", 77 | "\n", 78 | "# Create folder to store downloaded files\n", 79 | "# Use descriptive folder name relating to data\n", 80 | "folder_name = \"synthetic_dataset_generation\" \n", 81 | "\n", 82 | "# Check if folder already exists, if yes do nothing\n", 83 | "# If no, create the folder\n", 84 | "if os.path.exists(folder_name):\n", 85 | " pass \n", 86 | "else:\n", 87 | " os.mkdir(folder_name)\n", 88 | "\n", 89 | "# List of URLs of files to download\n", 90 | "files = [\n", 91 | " 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',\n", 92 | " 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',\n", 93 | " 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf', \n", 94 | " 'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf'\n", 95 | "]\n", 96 | "\n", 97 | "# Iterate through list of URLs \n", 98 | "for url in files:\n", 99 | "\n", 100 | " # Get file name from URL to use as local file name\n", 101 | " file_path = os.path.join(\"synthetic_dataset_generation\", url.rpartition(\"/\")[2]) \n", 102 | " urlretrieve(url, file_path)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "id": "1c10bbcc-47b5-4ba0-a0b6-e38f4bd79abc", 108 | "metadata": {}, 109 | "source": [ 110 | "Now that the context data in the form of the shareholder letters has been downloaded, you will load the PDF documents from the created directory and split them into smaller text chunks using a recursive character text splitter from the Langchain library. The RecursiveCharacterTextSplitter divides the text into chunks of a specified size while allowing for overlap to prevent cutting sentences in half. When setting the chunk size, make sure it fits into the context window of your LLM and feel free to experiment with different chunk sizes."
111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "id": "00371a36-ceef-44fe-828a-95489bf7bae0", 117 | "metadata": { 118 | "tags": [] 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter\n", 123 | "from langchain.document_loaders.pdf import PyPDFLoader, PyPDFDirectoryLoader\n", 124 | "\n", 125 | "\n", 126 | "# Load PDF documents from directory\n", 127 | "loader = PyPDFDirectoryLoader(\"./synthetic_dataset_generation/\") \n", 128 | "documents = loader.load()\n", 129 | "\n", 130 | "# Use recursive character splitter, works better for this PDF data set\n", 131 | "text_splitter = RecursiveCharacterTextSplitter(\n", 132 | "\n", 133 | " # Split documents into small chunks\n", 134 | " chunk_size = 1500, \n", 135 | "\n", 136 | " # Overlap chunks to reduce cutting sentences in half\n", 137 | " chunk_overlap = 100,\n", 138 | " separators=[\"\\n\\n\", \"\\n\", \".\", \" \", \"\"],\n", 139 | "\n", 140 | ")\n", 141 | "\n", 142 | "\n", 143 | "# Split loaded documents into chunks\n", 144 | "docs = text_splitter.split_documents(documents)" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "id": "949c9c3a-915c-495b-8fac-8b1cfe97e4b3", 150 | "metadata": {}, 151 | "source": [ 152 | "Let's have a look at the size of our data." 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "id": "831439b9-f018-4f2c-91e1-0bd4ec58b8f9", 159 | "metadata": { 160 | "tags": [] 161 | }, 162 | "outputs": [], 163 | "source": [ 164 | "# Print metadata of the loaded documents\n", 165 | "avg_doc_length = lambda documents: sum([len(doc.page_content) for doc in documents])//len(documents)\n", 166 | "avg_char_count_pre = avg_doc_length(documents)\n", 167 | "avg_char_count_post = avg_doc_length(docs)\n", 168 | "print(f'Average length among {len(documents)} pages loaded is {avg_char_count_pre} characters.')\n", 169 | "print(f'After the split you have {len(docs)} chunks.')\n", 170 | "print(f'Average length among {len(docs)} chunks is {avg_char_count_post} characters.')" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "id": "f96c6c15-1985-4e96-9391-6f1bbfc4b9a6", 176 | "metadata": {}, 177 | "source": [ 178 | "> **_NOTE:_** As Amazon Bedrock will be used for generating the synthetic data, you will now connect to the Bedrock API. 
Further in the Lab you will use the Langchain library to communicate with the Amazon Bedrock API" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "id": "d381cc96-fcf5-4510-bfff-63e0a4b38fa3", 185 | "metadata": { 186 | "tags": [] 187 | }, 188 | "outputs": [], 189 | "source": [ 190 | "# Set up Amazon Bedrock as LLM supplier for synthetic dataset creation\n", 191 | "import json\n", 192 | "import os\n", 193 | "import sys\n", 194 | "import boto3\n", 195 | "\n", 196 | "# set up a Bedrock-runtime client for inferencing large language models\n", 197 | "boto3_bedrock = boto3.client('bedrock-runtime')\n", 198 | "\n", 199 | "# set up a Bedrock client for performing administrative API calls\n", 200 | "boto3_bedrock_admin = boto3.client('bedrock')\n" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "id": "1eff3dab", 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "# Model Selection\n", 211 | "# Choosing claude 3 Haiku due to cost and performance efficiency\n", 212 | "claude_3_haiku = \"anthropic.claude-3-haiku-20240307-v1:0\"" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "id": "4fed135f-987c-4204-a9a1-0140397b9a8e", 219 | "metadata": { 220 | "tags": [] 221 | }, 222 | "outputs": [], 223 | "source": [ 224 | "# Set-up langchain LLM for implementing the synthetic dataset generation logic\n", 225 | "from langchain.llms.bedrock import Bedrock\n", 226 | "from langchain_community.chat_models import BedrockChat\n", 227 | "\n", 228 | "# for each model provider there are different parameters to define when inferencing against the model\n", 229 | "inference_modifier = {\n", 230 | " \"max_tokens\": 4096,\n", 231 | " \"temperature\": 0.5,\n", 232 | " }\n", 233 | " \n", 234 | "\n", 235 | "llm = BedrockChat(model_id = claude_3_haiku,\n", 236 | " client = boto3_bedrock, \n", 237 | " model_kwargs = inference_modifier \n", 238 | " )" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "id": "32aaba95-a7fa-4c09-81f1-90f826246109", 244 | "metadata": {}, 245 | "source": [ 246 | "## 3. Initial Question Generation\n", 247 | "\n", 248 | "As a first step you generate sample questions. You can use each of the generated chunks to generate synthetic questions that a real chatbot user might ask. You will prompt the LLM to analyze a chunk of shareholder letter data and generate a relevant question based on the information presented in the context. Below is a sample prompt to generate a question given a specific context. Note you are hardcoding to generate a single question for simplicity, of course you can also ask the LLM to generate multiple questions with a single prompt." 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "id": "4457b697-468a-44b2-a1c8-ba92ddbd4632", 255 | "metadata": { 256 | "tags": [] 257 | }, 258 | "outputs": [], 259 | "source": [ 260 | "from langchain.prompts import PromptTemplate\n", 261 | "\n", 262 | "# Create a prompt template to generate a question a end-user could have about a given context\n", 263 | "initial_question_prompt_template = PromptTemplate(\n", 264 | " input_variables=[\"context\"],\n", 265 | " template=\"\"\"\n", 266 | " \n", 267 | " Here is some context:\n", 268 | " \n", 269 | " {context}\n", 270 | " \n", 271 | "\n", 272 | " Your task is to generate 1 question that can be answered using the provided context, following these rules:\n", 273 | "\n", 274 | " \n", 275 | " 1. 
The question should make sense to humans even when read without the given context.\n", 276 | " 2. The question should be fully answered from the given context.\n", 277 | " 3. The question should be framed from a part of context that contains important information. It can also be from tables, code, etc.\n", 278 | " 4. The answer to the question should not contain any links.\n", 279 | " 5. The question should be of moderate difficulty.\n", 280 | " 6. The question must be reasonable and must be understandable and answerable by humans.\n", 281 | " 7. Do not use phrases like 'provided context', etc. in the question.\n", 282 | " 8. Avoid framing questions using the word \"and\" that can be decomposed into more than one question.\n", 283 | " 9. The question should not contain more than 10 words; make use of abbreviations wherever possible.\n", 284 | " \n", 285 | "\n", 286 | " To generate the question, first identify the most important or relevant part of the context. Then frame a question around that part that satisfies all the rules above.\n", 287 | "\n", 288 | " Output only the generated question with a \"?\" at the end, no other text or characters.\n", 289 | " \n", 290 | " \n", 291 | " \"\"\")\n", 292 | "\n", 293 | "def generate_question(doc, llm):\n", 294 | "\n", 295 | " # Pass in values to the input variables\n", 296 | " initial_question_prompt = initial_question_prompt_template.format(context=doc)\n", 297 | " \n", 298 | " initial_question = llm.invoke(initial_question_prompt)\n", 299 | " \n", 300 | " return initial_question" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "id": "122d6c90-18f0-4ae3-8843-37b31bf9db7b", 307 | "metadata": { 308 | "tags": [] 309 | }, 310 | "outputs": [], 311 | "source": [ 312 | "# generate a question based on a given context\n", 313 | "question = generate_question(docs[1], llm)\n", 314 | "print(f\"Initial question: {question.content}\")" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "id": "466639dd", 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "print(question.content)" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "id": "7b5e8a8e-c8a1-487a-8d12-3ce024f67a69", 330 | "metadata": {}, 331 | "source": [ 332 | "## 4. Answer Generation\n", 333 | "To use the questions for evaluation you need to generate a reference answer for each of the questions to test against. Let's do this using the following prompt template:" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "id": "72ca873f-ddd2-4b51-987d-5524d198c56c", 340 | "metadata": { 341 | "tags": [] 342 | }, 343 | "outputs": [], 344 | "source": [ 345 | "# Create a prompt template that takes into consideration the question and generates an answer\n", 346 | "answer_prompt_template = PromptTemplate(\n", 347 | " input_variables=[\"context\", \"question\"],\n", 348 | " template=\"\"\"\n", 349 | " \n", 350 | " \n", 351 | " You are an experienced QA Engineer for building large language model applications.\n", 352 | " It is your task to generate an answer to the following question {question} only based on the {context}\n", 353 | " The output should be only the answer generated from the context.\n", 354 | "\n", 355 | " \n", 356 | " 1. Only use the given context as a source for generating the answer.\n", 357 | " 2. Be as precise as possible with answering the question.\n", 358 | " 3. 
Be concise in answering the question and only answer the question at hand rather than adding extra information.\n", 359 | " \n", 360 | "\n", 361 | " Only output the generated answer as a sentence. No extra characters.\n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " Assistant:\"\"\")\n", 366 | "\n", 367 | "def generate_answer(question: str, doc, llm):\n", 368 | " \n", 369 | " answer_prompt = answer_prompt_template.format(question = question, context=doc)\n", 370 | " \n", 371 | " answer = llm.invoke(answer_prompt)\n", 372 | " \n", 373 | " return answer" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": null, 379 | "id": "ffceef3c-9f53-4143-b5c9-0b80f3e6d182", 380 | "metadata": { 381 | "tags": [] 382 | }, 383 | "outputs": [], 384 | "source": [ 385 | "answer = generate_answer(question, docs[1], llm)\n", 386 | "print(f\"Initial question: {question.content}\")\n", 387 | "print(\"---\")\n", 388 | "print(f\"Reference Answer: {answer.content}\")" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "id": "da3ddc9b-f409-4655-a943-361939fa4946", 394 | "metadata": { 395 | "tags": [] 396 | }, 397 | "source": [ 398 | "## 5. Extracting Relevant Context\n", 399 | "To make the dataset verifiable you use the following prompt to extract the relevant sentences from the given context to answer the generated question. Knowing the relevant sentences you can easily check whether the question and answer are correct. " 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "id": "c64d1f19-d1af-4317-b9cb-e4142700939d", 406 | "metadata": { 407 | "tags": [] 408 | }, 409 | "outputs": [], 410 | "source": [ 411 | "# To check whether an answer was correctly formulated by the large language model you get the relevant text passages from the documents used for answering the questions.\n", 412 | "source_prompt_template = PromptTemplate(\n", 413 | " input_variables=[\"context\", \"question\"],\n", 414 | " template=\"\"\"Human:\n", 415 | " \n", 416 | " Here is the context:\n", 417 | " \n", 418 | " {context}\n", 419 | " \n", 420 | "\n", 421 | " Your task is to extract the relevant sentences from the given context that can potentially help answer the following question. You are not allowed to make any changes to the sentences from the context.\n", 422 | "\n", 423 | " \n", 424 | " {question}\n", 425 | " \n", 426 | "\n", 427 | " Output only the relevant sentences you found, one sentence per line, without any extra characters or explanations.\n", 428 | " \n", 429 | " Assistant:\"\"\")\n", 430 | "\n", 431 | "def generate_source(question: str, doc, llm):\n", 432 | " \n", 433 | " source_prompt = source_prompt_template.format(question = question, context=doc)\n", 434 | " \n", 435 | " source = llm.invoke(source_prompt)\n", 436 | " \n", 437 | " return source" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "id": "008571ba-6cde-40a0-833b-0348d54e9a1e", 444 | "metadata": { 445 | "tags": [] 446 | }, 447 | "outputs": [], 448 | "source": [ 449 | "source_sentence = generate_source(question, docs[1], llm)\n", 450 | "print(f\"Initial question: {question.content}\")\n", 451 | "print(\"---\")\n", 452 | "print(f\"Reference Answer: {answer.content}\")\n", 453 | "print(\"---\")\n", 454 | "print(f\"Source Sentence: {source_sentence.content}\")" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "id": "f45065c6-7387-42fe-878d-884f6d297499", 460 | "metadata": {}, 461 | "source": [ 462 | "## 6. 
Evolving Questions to fit end-users behaviour\n", 463 | "When generating question & answer pairs from the same prompt for the whole dataset, the questions can turn out repetitive and similar in form, and thus do not mimic real end-user behavior. In this section you evolve the generated question, for example by making it shorter and more precise. The prompt for evolving questions heavily depends on your use case, and thus it must reflect your end users, for instance by setting the rules accordingly or by providing examples." 464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "execution_count": null, 469 | "id": "2ea6c408-6395-4d8a-b217-a1a45bae02e5", 470 | "metadata": { 471 | "tags": [] 472 | }, 473 | "outputs": [], 474 | "source": [ 475 | "# To generate a more versatile testing dataset you vary the questions to see how your RAG system performs against differently formulated questions\n", 476 | "question_compress_prompt_template = PromptTemplate(\n", 477 | " input_variables=[\"question\"],\n", 478 | " template=\"\"\"\n", 479 | " \n", 480 | " You are an experienced linguistics expert for building testsets for large language model applications.\n", 481 | "\n", 482 | " It is your task to rewrite the following question in a more indirect and compressed form, following these rules:\n", 483 | "\n", 484 | " \n", 485 | " 1. Make the question more indirect\n", 486 | " 2. Make the question shorter\n", 487 | " 3. Use abbreviations if possible\n", 488 | " \n", 489 | "\n", 490 | " \n", 491 | " {question}\n", 492 | " \n", 493 | "\n", 494 | " Your output should only be the rewritten question with a question mark \"?\" at the end. Do not provide any other explanation or text.\n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \"\"\")\n", 499 | "\n", 500 | "\n", 501 | "def compress_question(question): \n", 502 | " # Pass in values to the input variables\n", 503 | " question_compress_prompt = question_compress_prompt_template.format(question=question)\n", 504 | " \n", 505 | " question_compressed = llm.invoke(question_compress_prompt)\n", 506 | " \n", 507 | " return question_compressed" 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": null, 513 | "id": "d5b05eb8-d9da-42b9-8b00-e69f9fc79907", 514 | "metadata": { 515 | "tags": [] 516 | }, 517 | "outputs": [], 518 | "source": [ 519 | "compressed_question = compress_question(question)\n", 520 | "print(f\"Initial question: {question.content}\")\n", 521 | "print(\"---\")\n", 522 | "print(f\"Reference Answer: {answer.content}\")\n", 523 | "print(\"---\")\n", 524 | "print(f\"Source Sentence: {source_sentence.content}\")\n", 525 | "print(\"---\")\n", 526 | "print(f\"Compressed Question: {compressed_question.content}\")\n" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "id": "50011c1a-ff3a-4ecd-b715-c69db89905a8", 532 | "metadata": {}, 533 | "source": [ 534 | "## 7. Automated Dataset Generation\n", 535 | "To scale the process of the dataset generation you iterate over all chunks of your context, generate questions, answers, relevant sentences and evolutions for each chunk and save them to a pandas dataframe. 
" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "id": "d7198b44-886e-4538-a23d-d9c642c4632a", 542 | "metadata": {}, 543 | "outputs": [], 544 | "source": [ 545 | "# Using only a subset of the loaded documents for lightweight testing.\n", 546 | "# For generating data for every document just set docs_subset = docs\n", 547 | "docs_subset = docs[0:5]" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": null, 553 | "id": "33ac8015-3787-4d26-a024-3da2d2a189a4", 554 | "metadata": { 555 | "tags": [] 556 | }, 557 | "outputs": [], 558 | "source": [ 559 | "from langchain_core.documents.base import Document\n", 560 | "\n", 561 | "def generate_qa_dataset_doc(doc: Document, llm, dataset, doc_number):\n", 562 | " \"\"\"A function to create a test dataset of questions for a given Document(Langchain Document type)\"\"\"\n", 563 | " \n", 564 | " # generate the initial question for the RAG testdataset\n", 565 | " question = generate_question(doc, llm)\n", 566 | " dataset.at[doc_number, \"question\"] = question.content\n", 567 | " \n", 568 | " # generate compressed question to variate the dataset\n", 569 | " compressed_question = compress_question(question)\n", 570 | " dataset.at[doc_number, \"question_compressed\"] = compressed_question.content\n", 571 | " \n", 572 | " \n", 573 | " answer = generate_answer(question, doc, llm)\n", 574 | " dataset.at[doc_number, \"reference_answer\"] = answer.content\n", 575 | " \n", 576 | " source_sentence = generate_source(question, doc, llm)\n", 577 | " dataset.at[doc_number, \"source_sentence\"] = source_sentence.content\n", 578 | " \n", 579 | " source_raw = doc\n", 580 | " dataset.at[doc_number, \"source_raw\"] = source_raw.page_content\n", 581 | " \n", 582 | " source_document = doc.metadata[\"source\"]\n", 583 | " dataset.at[doc_number, \"source_document\"] = source_document\n", 584 | " \n", 585 | " \n", 586 | " return dataset\n", 587 | " " 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "execution_count": null, 593 | "id": "f5c30fb8-aa99-4c6a-8c91-f449c89bb103", 594 | "metadata": { 595 | "tags": [] 596 | }, 597 | "outputs": [], 598 | "source": [ 599 | "# create a dataset class that in the end can be used to generate the dataset\n", 600 | "import pandas as pd\n", 601 | "import time\n", 602 | "\n", 603 | "dataset = pd.DataFrame(columns=[\"question\", \"question_compressed\", \"reference_answer\", \"source_sentence\",\"source_raw\",\"source_document\" ]) " 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": null, 609 | "id": "07c5d096-421d-4c06-bb03-2f57df9710c5", 610 | "metadata": { 611 | "tags": [] 612 | }, 613 | "outputs": [], 614 | "source": [ 615 | "from langchain_core.documents.base import Document\n", 616 | "from tqdm import tqdm\n", 617 | "\n", 618 | "def generate_dataset(documents: Document,llm, dataset):\n", 619 | "\n", 620 | " print(f\"start generating dataset from {len(documents)} docuements\")\n", 621 | " print(\"---\")\n", 622 | " generation_time_start = time.time()\n", 623 | " \n", 624 | " for doc in tqdm(range(len(documents))):\n", 625 | " q_generation_time_start = time.time()\n", 626 | " dataset = generate_qa_dataset_doc(doc = documents[doc], llm = llm, dataset = dataset, doc_number = doc)\n", 627 | " q_generation_time_end = time.time()\n", 628 | " total_elapsed_time_generation = q_generation_time_end - q_generation_time_start\n", 629 | "\n", 630 | "\n", 631 | " print(f\"Finished creating evaluation data for chunk {doc+1}\")\n", 632 | " print(f\"Generation 
time for doc: {total_elapsed_time_generation}\")\n", 633 | " print(\"---\")\n", 634 | " \n", 635 | " generation_time_end = time.time()\n", 636 | " total_elapsed_time= generation_time_end - generation_time_start\n", 637 | " print(f\"Generation time for all docs: {total_elapsed_time}\")\n", 638 | " \n", 639 | " return dataset" 640 | ] 641 | }, 642 | { 643 | "cell_type": "code", 644 | "execution_count": null, 645 | "id": "38ac63dc-b676-477b-8bc7-9aa750732f8b", 646 | "metadata": { 647 | "tags": [] 648 | }, 649 | "outputs": [], 650 | "source": [ 651 | "dataset_df = generate_dataset(docs_subset, llm, dataset)\n", 652 | "\n", 653 | "num_questions_generated = dataset_df.shape[0]\n", 654 | "print(f\"Generated a total of {num_questions_generated} questions.\")" 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "execution_count": null, 660 | "id": "56b80a38-1b5c-47a1-8e4c-a357f4d9dd8c", 661 | "metadata": { 662 | "tags": [] 663 | }, 664 | "outputs": [], 665 | "source": [ 666 | "# display the first rows of the generated dataset\n", 667 | "dataset_df.head()" 668 | ] 669 | }, 670 | { 671 | "cell_type": "markdown", 672 | "id": "5499b380-7680-40c5-a25e-34edfd0916f6", 673 | "metadata": {}, 674 | "source": [ 675 | "## 8. Assessing the questions quality using Critique Agents\n", 676 | "Critique agents are a technique used in natural language processing (NLP) to evaluate the quality and suitability of questions in a dataset for a particular task or application. In this case, the critique agents are employed to assess whether the questions in a dataset are valid for a Retrieval-Augmented Generation (RAG) system, which is a type of language model that combines information retrieval and generation capabilities.\n", 677 | "\n", 678 | "The two main metrics evaluated by the critique agents are relevance and groundedness.\n", 679 | "\n", 680 | "**Relevance**\n", 681 | "\n", 682 | "Relevance measures how useful and applicable a question is for a specific domain or context. In the context of financial and business analysis, the relevance prompt evaluates questions based on the following criteria:\n", 683 | "\n", 684 | "- Is the question directly relevant to the work of financial and business analysts on Wall Street?\n", 685 | "- Does the question address a practical problem or use case that analysts might encounter?\n", 686 | "- Is the question clear and well-defined, avoiding ambiguity or vagueness?\n", 687 | "- Does the question require a substantive answer that demonstrates understanding of financial topics?\n", 688 | "- Would answering the question provide insights or knowledge that could be applied to real-world company evaluation tasks?\n", 689 | "\n", 690 | "The relevance score ranges from 1 to 5, with a higher score indicating greater relevance and usefulness for financial and business analysts.\n", 691 | "\n", 692 | "**Groundedness**\n", 693 | "\n", 694 | "Groundedness measures how well a question can be answered based on the provided context or information. The groundedness prompt evaluates questions based on the following criteria:\n", 695 | "\n", 696 | "- Can the question be answered using only the information provided in the given context?\n", 697 | "- Does the context provide little, some, substantial, or all the information needed to answer the question?\n", 698 | "\n", 699 | "The groundedness score also ranges from 1 to 5, with the following interpretations:\n", 700 | "\n", 701 | "1. The question cannot be answered at all based on the given context.\n", 702 | "2. 
The context provides very little relevant information to answer the question.\n", 703 | "3. The context provides some relevant information to partially answer the question.\n", 704 | "4. The context provides substantial information to answer most aspects of the question.\n", 705 | "5. The context provides all the information needed to fully and unambiguously answer the question.\n", 706 | "\n", 707 | "By evaluating both relevance and groundedness, the critique agents can help identify questions in the dataset that are well-suited for the RAG system, as well as those that may need to be revised, removed, or supplemented with additional context or information." 708 | ] 709 | }, 710 | { 711 | "cell_type": "code", 712 | "execution_count": null, 713 | "id": "616a5af5-2e80-49df-b223-907729b1b707", 714 | "metadata": { 715 | "tags": [] 716 | }, 717 | "outputs": [], 718 | "source": [ 719 | "groundedness_check_prompt_template = PromptTemplate(\n", 720 | " input_variables=[\"context\",\"question\"],\n", 721 | " template=\"\"\"\n", 722 | " \n", 723 | " You will be given a context and a question related to that context.\n", 724 | "\n", 725 | " Your task is to provide an evaluation of how well the given question can be answered using only the information provided in the context. Rate this on a scale from 1 to 5, where:\n", 726 | "\n", 727 | " 1 = The question cannot be answered at all based on the given context\n", 728 | " 2 = The context provides very little relevant information to answer the question\n", 729 | " 3 = The context provides some relevant information to partially answer the question \n", 730 | " 4 = The context provides substantial information to answer most aspects of the question\n", 731 | " 5 = The context provides all the information needed to fully and unambiguously answer the question\n", 732 | "\n", 733 | " First, read through the provided context carefully:\n", 734 | "\n", 735 | " \n", 736 | " {context}\n", 737 | " \n", 738 | "\n", 739 | " Then read the question:\n", 740 | "\n", 741 | " \n", 742 | " {question}\n", 743 | " \n", 744 | "\n", 745 | " Evaluate how well you think the question can be answered using only the context information. Provide your reasoning first in a <reasoning> section, explaining what relevant or missing information from the context led you to your evaluation score in only one sentence.\n", 746 | "\n", 747 | " Provide your evaluation in the following format:\n", 748 | "\n", 749 | " <rating>(Your rating from 1 to 5)</rating>\n", 750 | " \n", 751 | " <reasoning>(Your evaluation and reasoning for the rating)</reasoning>\n", 752 | "\n", 753 | "\n", 754 | " \n", 755 | " \n", 756 | " \"\"\")\n", 757 | "\n", 758 | "relevance_check_prompt_template = PromptTemplate(\n", 759 | " input_variables=[\"question\"],\n", 760 | " template=\"\"\"\n", 761 | " \n", 762 | " You will be given a question related to Amazon Shareholder letters. Your task is to evaluate how useful this question would be for a financial and business analyst working on Wall Street.\n", 763 | "\n", 764 | " To evaluate the usefulness of the question, consider the following criteria:\n", 765 | "\n", 766 | " 1. Relevance: Is the question directly relevant to your work? Questions that are too broad or unrelated to this domain should receive a lower rating.\n", 767 | "\n", 768 | " 2. Practicality: Does the question address a practical problem or use case that analysts might encounter? Theoretical or overly academic questions may be less useful.\n", 769 | "\n", 770 | " 3. Clarity: Is the question clear and well-defined? 
Ambiguous or vague questions are less useful.\n", 771 | "\n", 772 | " 4. Depth: Does the question require a substantive answer that demonstrates understanding of financial topics? Surface-level questions may be less useful.\n", 773 | "\n", 774 | " 5. Applicability: Would answering this question provide insights or knowledge that could be applied to real-world company evaluation tasks? Questions with limited applicability should receive a lower rating.\n", 775 | "\n", 776 | " Provide your evaluation in the following format:\n", 777 | "\n", 778 | " <rating>(Your rating from 1 to 5)</rating>\n", 779 | " \n", 780 | " <reasoning>(Your evaluation and reasoning for the rating)</reasoning>\n", 781 | "\n", 782 | " Here is an example: \n", 783 | " <reasoning>The question is very relevant to the persona because it asks about financial information of a company</reasoning>\n", 784 | " <rating>5</rating>\n", 785 | "\n", 786 | " Here is the question:\n", 787 | "\n", 788 | " {question}\n", 789 | " \n", 790 | " \"\"\")" 791 | ] 792 | }, 793 | { 794 | "cell_type": "code", 795 | "execution_count": null, 796 | "id": "5d69431e-6b5a-46a6-8166-cbaf2fb6a663", 797 | "metadata": { 798 | "tags": [] 799 | }, 800 | "outputs": [], 801 | "source": [ 802 | "def generate_groundedness_check(question, source_raw): \n", 803 | " # Pass in values to the input variables\n", 804 | " groundedness_prompt = groundedness_check_prompt_template.format(question=question, context=source_raw)\n", 805 | " \n", 806 | " groundedness_rating = llm.invoke(groundedness_prompt)\n", 807 | " \n", 808 | " return groundedness_rating\n", 809 | "\n", 810 | "def generate_relevance_check(question): \n", 811 | " # Pass in values to the input variables\n", 812 | " relevance_prompt = relevance_check_prompt_template.format(question=question)\n", 813 | " \n", 814 | " relevance_rating = llm.invoke(relevance_prompt)\n", 815 | " \n", 816 | " return relevance_rating" 817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "execution_count": null, 822 | "id": "941864ce-49c2-4800-aff5-203b24ee6816", 823 | "metadata": { 824 | "tags": [] 825 | }, 826 | "outputs": [], 827 | "source": [ 828 | "# Evaluating one of the generated questions for groundedness and relevance\n", 829 | "groundedness_rating = generate_groundedness_check(dataset_df.question[0], dataset_df.source_raw[0])\n", 830 | "relevance_rating = generate_relevance_check(dataset_df.question[0])\n", 831 | "\n", 832 | "print(\"Groundedness Score:\")\n", 833 | "print(groundedness_rating.content)\n", 834 | "\n", 835 | "print(\"---\")\n", 836 | "\n", 837 | "print(\"Relevance Score:\")\n", 838 | "print(relevance_rating.content)\n" 839 | ] 840 | }, 841 | { 842 | "cell_type": "code", 843 | "execution_count": null, 844 | "id": "2e18dc75-4424-4112-b1dc-dca4dc3c41d9", 845 | "metadata": { 846 | "tags": [] 847 | }, 848 | "outputs": [], 849 | "source": [ 850 | "import re\n", 851 | "# Helper functions to extract the rating and reasoning values from the string response of the LLM Critique Agents.\n", 852 | "def extract_rating(text):\n", 853 | " pattern = r'<rating>(.*?)</rating>'\n", 854 | " match = re.search(pattern, text)\n", 855 | " if match:\n", 856 | " rating = match.group(1)\n", 857 | " return rating\n", 858 | " else:\n", 859 | " return None\n", 860 | " \n", 861 | "def extract_reasoning(text):\n", 862 | " pattern = r'<reasoning>(.*?)</reasoning>'\n", 863 | " match = re.search(pattern, text)\n", 864 | " if match:\n", 865 | " reasoning = match.group(1)\n", 866 | " return reasoning\n", 867 | " else:\n", 868 | " return None" 869 | ] 870 | }, 871 | { 872 | "cell_type": "code", 873 | "execution_count": null, 874 | "id": "374e2112", 875 | "metadata": {}, 876 | "outputs": 
[], 877 | "source": [ 878 | "def evaluate_dataset(dataset):\n", 879 | " for index, row in dataset.iterrows():\n", 880 | "\n", 881 | " question = row['question']\n", 882 | " source_raw = row['source_raw']\n", 883 | "\n", 884 | " # Generate groundedness check\n", 885 | " groundedness_check = generate_groundedness_check(question, source_raw)\n", 886 | " groundedness_score = extract_rating(groundedness_check.content)\n", 887 | " groundedness_score_reasoning = extract_reasoning(groundedness_check.content)\n", 888 | "\n", 889 | " dataset.at[index, 'groundedness_score'] = groundedness_score\n", 890 | " dataset.at[index, 'groundedness_score_reasoning'] = groundedness_score_reasoning\n", 891 | "\n", 892 | " # Generate relevance check\n", 893 | " relevance_check = generate_relevance_check(question)\n", 894 | " relevancy_score = extract_rating(relevance_check.content)\n", 895 | " relevancy_score_reasoning = extract_reasoning(relevance_check.content)\n", 896 | "\n", 897 | " dataset.at[index, 'relevancy_score'] = relevancy_score\n", 898 | " dataset.at[index, 'relevancy_score_reasoning'] = relevancy_score_reasoning\n", 899 | "\n", 900 | " return dataset" 901 | ] 902 | }, 903 | { 904 | "cell_type": "markdown", 905 | "id": "dcc1e46c", 906 | "metadata": {}, 907 | "source": [ 908 | "Now that the concept of critique agents has been established, including the prompts for the groundedness and relevance scores, you iterate over the generated dataset and assign each question a score. Depending on your needs you can eliminate questions with a score beneath a certain threshold from the dataset." 909 | ] 910 | }, 911 | { 912 | "cell_type": "code", 913 | "execution_count": null, 914 | "id": "ec660fbe", 915 | "metadata": {}, 916 | "outputs": [], 917 | "source": [ 918 | "dataset_evaluated = evaluate_dataset(dataset_df)" 919 | ] 920 | }, 921 | { 922 | "cell_type": "code", 923 | "execution_count": null, 924 | "id": "07cd698d", 925 | "metadata": {}, 926 | "outputs": [], 927 | "source": [ 928 | "dataset_evaluated.head()" 929 | ] 930 | }, 931 | { 932 | "cell_type": "markdown", 933 | "id": "497224ae", 934 | "metadata": {}, 935 | "source": [ 936 | "### Conclusion\n", 937 | "\n", 938 | "Generating synthetic datasets is a powerful technique for evaluating retrieval augmented generation (RAG) systems, particularly in the early stages of development when real-world data is scarce or difficult to obtain. By leveraging large language models and knowledge retrieval context, this approach enables the creation of diverse, realistic, and representative datasets that mimic real human interactions.\n", 939 | "\n", 940 | "Throughout this notebook, you have explored the process of generating a synthetic dataset for a QA-RAG application using Anthropic's Claude via the Bedrock API, Python, and Langchain. You covered essential steps, including setting up the environment, loading and preparing context data, initial question generation, answer generation, extracting relevant context, evolving questions to fit end-user behavior, automated dataset generation, and assessing question quality.\n", 941 | "\n", 942 | "While this approach offers numerous benefits, it is essential to acknowledge its limitations. First, the quality of the synthetic dataset heavily relies on the performance and capabilities of the underlying language model and knowledge retrieval system. Biases and limitations present in these components may be reflected in the generated dataset. 
Additionally, capturing the full complexity and nuances of real-world interactions can be challenging, as synthetic datasets may not account for all edge cases or unexpected scenarios.\n", 943 | "\n", 944 | "Despite these limitations, generating synthetic datasets remains a valuable tool for accelerating the development and evaluation of RAG systems. By streamlining the evaluation process and enabling iterative development cycles, this approach can contribute to the creation of better-performing AI systems.\n", 945 | "\n", 946 | "We encourage developers, researchers, and enthusiasts to explore the open-source tools like RAGAS mentioned in this notebook and experiment with generating synthetic datasets for their own RAG applications. Hands-on experience with this technique can provide valuable insights and contribute to the advancement of RAG systems in various domains.\n", 947 | "\n", 948 | "Remember, synthetic dataset generation is not a silver bullet, but rather a powerful tool that should be used in conjunction with other evaluation techniques and real-world data when available. By embracing this approach and continuously improving upon it, you can accelerate the development of more robust and capable RAG systems, ultimately enhancing the user experience and unlocking new possibilities in natural language processing." 949 | ] 950 | } 951 | ], 952 | "metadata": { 953 | "availableInstances": [ 954 | { 955 | "_defaultOrder": 0, 956 | "_isFastLaunch": true, 957 | "category": "General purpose", 958 | "gpuNum": 0, 959 | "hideHardwareSpecs": false, 960 | "memoryGiB": 4, 961 | "name": "ml.t3.medium", 962 | "vcpuNum": 2 963 | }, 964 | { 965 | "_defaultOrder": 1, 966 | "_isFastLaunch": false, 967 | "category": "General purpose", 968 | "gpuNum": 0, 969 | "hideHardwareSpecs": false, 970 | "memoryGiB": 8, 971 | "name": "ml.t3.large", 972 | "vcpuNum": 2 973 | }, 974 | { 975 | "_defaultOrder": 2, 976 | "_isFastLaunch": false, 977 | "category": "General purpose", 978 | "gpuNum": 0, 979 | "hideHardwareSpecs": false, 980 | "memoryGiB": 16, 981 | "name": "ml.t3.xlarge", 982 | "vcpuNum": 4 983 | }, 984 | { 985 | "_defaultOrder": 3, 986 | "_isFastLaunch": false, 987 | "category": "General purpose", 988 | "gpuNum": 0, 989 | "hideHardwareSpecs": false, 990 | "memoryGiB": 32, 991 | "name": "ml.t3.2xlarge", 992 | "vcpuNum": 8 993 | }, 994 | { 995 | "_defaultOrder": 4, 996 | "_isFastLaunch": true, 997 | "category": "General purpose", 998 | "gpuNum": 0, 999 | "hideHardwareSpecs": false, 1000 | "memoryGiB": 8, 1001 | "name": "ml.m5.large", 1002 | "vcpuNum": 2 1003 | }, 1004 | { 1005 | "_defaultOrder": 5, 1006 | "_isFastLaunch": false, 1007 | "category": "General purpose", 1008 | "gpuNum": 0, 1009 | "hideHardwareSpecs": false, 1010 | "memoryGiB": 16, 1011 | "name": "ml.m5.xlarge", 1012 | "vcpuNum": 4 1013 | }, 1014 | { 1015 | "_defaultOrder": 6, 1016 | "_isFastLaunch": false, 1017 | "category": "General purpose", 1018 | "gpuNum": 0, 1019 | "hideHardwareSpecs": false, 1020 | "memoryGiB": 32, 1021 | "name": "ml.m5.2xlarge", 1022 | "vcpuNum": 8 1023 | }, 1024 | { 1025 | "_defaultOrder": 7, 1026 | "_isFastLaunch": false, 1027 | "category": "General purpose", 1028 | "gpuNum": 0, 1029 | "hideHardwareSpecs": false, 1030 | "memoryGiB": 64, 1031 | "name": "ml.m5.4xlarge", 1032 | "vcpuNum": 16 1033 | }, 1034 | { 1035 | "_defaultOrder": 8, 1036 | "_isFastLaunch": false, 1037 | "category": "General purpose", 1038 | "gpuNum": 0, 1039 | "hideHardwareSpecs": false, 1040 | "memoryGiB": 128, 1041 | "name": "ml.m5.8xlarge", 1042 | 
"vcpuNum": 32 1043 | }, 1044 | { 1045 | "_defaultOrder": 9, 1046 | "_isFastLaunch": false, 1047 | "category": "General purpose", 1048 | "gpuNum": 0, 1049 | "hideHardwareSpecs": false, 1050 | "memoryGiB": 192, 1051 | "name": "ml.m5.12xlarge", 1052 | "vcpuNum": 48 1053 | }, 1054 | { 1055 | "_defaultOrder": 10, 1056 | "_isFastLaunch": false, 1057 | "category": "General purpose", 1058 | "gpuNum": 0, 1059 | "hideHardwareSpecs": false, 1060 | "memoryGiB": 256, 1061 | "name": "ml.m5.16xlarge", 1062 | "vcpuNum": 64 1063 | }, 1064 | { 1065 | "_defaultOrder": 11, 1066 | "_isFastLaunch": false, 1067 | "category": "General purpose", 1068 | "gpuNum": 0, 1069 | "hideHardwareSpecs": false, 1070 | "memoryGiB": 384, 1071 | "name": "ml.m5.24xlarge", 1072 | "vcpuNum": 96 1073 | }, 1074 | { 1075 | "_defaultOrder": 12, 1076 | "_isFastLaunch": false, 1077 | "category": "General purpose", 1078 | "gpuNum": 0, 1079 | "hideHardwareSpecs": false, 1080 | "memoryGiB": 8, 1081 | "name": "ml.m5d.large", 1082 | "vcpuNum": 2 1083 | }, 1084 | { 1085 | "_defaultOrder": 13, 1086 | "_isFastLaunch": false, 1087 | "category": "General purpose", 1088 | "gpuNum": 0, 1089 | "hideHardwareSpecs": false, 1090 | "memoryGiB": 16, 1091 | "name": "ml.m5d.xlarge", 1092 | "vcpuNum": 4 1093 | }, 1094 | { 1095 | "_defaultOrder": 14, 1096 | "_isFastLaunch": false, 1097 | "category": "General purpose", 1098 | "gpuNum": 0, 1099 | "hideHardwareSpecs": false, 1100 | "memoryGiB": 32, 1101 | "name": "ml.m5d.2xlarge", 1102 | "vcpuNum": 8 1103 | }, 1104 | { 1105 | "_defaultOrder": 15, 1106 | "_isFastLaunch": false, 1107 | "category": "General purpose", 1108 | "gpuNum": 0, 1109 | "hideHardwareSpecs": false, 1110 | "memoryGiB": 64, 1111 | "name": "ml.m5d.4xlarge", 1112 | "vcpuNum": 16 1113 | }, 1114 | { 1115 | "_defaultOrder": 16, 1116 | "_isFastLaunch": false, 1117 | "category": "General purpose", 1118 | "gpuNum": 0, 1119 | "hideHardwareSpecs": false, 1120 | "memoryGiB": 128, 1121 | "name": "ml.m5d.8xlarge", 1122 | "vcpuNum": 32 1123 | }, 1124 | { 1125 | "_defaultOrder": 17, 1126 | "_isFastLaunch": false, 1127 | "category": "General purpose", 1128 | "gpuNum": 0, 1129 | "hideHardwareSpecs": false, 1130 | "memoryGiB": 192, 1131 | "name": "ml.m5d.12xlarge", 1132 | "vcpuNum": 48 1133 | }, 1134 | { 1135 | "_defaultOrder": 18, 1136 | "_isFastLaunch": false, 1137 | "category": "General purpose", 1138 | "gpuNum": 0, 1139 | "hideHardwareSpecs": false, 1140 | "memoryGiB": 256, 1141 | "name": "ml.m5d.16xlarge", 1142 | "vcpuNum": 64 1143 | }, 1144 | { 1145 | "_defaultOrder": 19, 1146 | "_isFastLaunch": false, 1147 | "category": "General purpose", 1148 | "gpuNum": 0, 1149 | "hideHardwareSpecs": false, 1150 | "memoryGiB": 384, 1151 | "name": "ml.m5d.24xlarge", 1152 | "vcpuNum": 96 1153 | }, 1154 | { 1155 | "_defaultOrder": 20, 1156 | "_isFastLaunch": false, 1157 | "category": "General purpose", 1158 | "gpuNum": 0, 1159 | "hideHardwareSpecs": true, 1160 | "memoryGiB": 0, 1161 | "name": "ml.geospatial.interactive", 1162 | "supportedImageNames": [ 1163 | "sagemaker-geospatial-v1-0" 1164 | ], 1165 | "vcpuNum": 0 1166 | }, 1167 | { 1168 | "_defaultOrder": 21, 1169 | "_isFastLaunch": true, 1170 | "category": "Compute optimized", 1171 | "gpuNum": 0, 1172 | "hideHardwareSpecs": false, 1173 | "memoryGiB": 4, 1174 | "name": "ml.c5.large", 1175 | "vcpuNum": 2 1176 | }, 1177 | { 1178 | "_defaultOrder": 22, 1179 | "_isFastLaunch": false, 1180 | "category": "Compute optimized", 1181 | "gpuNum": 0, 1182 | "hideHardwareSpecs": false, 1183 | "memoryGiB": 8, 1184 | "name": 
"ml.c5.xlarge", 1185 | "vcpuNum": 4 1186 | }, 1187 | { 1188 | "_defaultOrder": 23, 1189 | "_isFastLaunch": false, 1190 | "category": "Compute optimized", 1191 | "gpuNum": 0, 1192 | "hideHardwareSpecs": false, 1193 | "memoryGiB": 16, 1194 | "name": "ml.c5.2xlarge", 1195 | "vcpuNum": 8 1196 | }, 1197 | { 1198 | "_defaultOrder": 24, 1199 | "_isFastLaunch": false, 1200 | "category": "Compute optimized", 1201 | "gpuNum": 0, 1202 | "hideHardwareSpecs": false, 1203 | "memoryGiB": 32, 1204 | "name": "ml.c5.4xlarge", 1205 | "vcpuNum": 16 1206 | }, 1207 | { 1208 | "_defaultOrder": 25, 1209 | "_isFastLaunch": false, 1210 | "category": "Compute optimized", 1211 | "gpuNum": 0, 1212 | "hideHardwareSpecs": false, 1213 | "memoryGiB": 72, 1214 | "name": "ml.c5.9xlarge", 1215 | "vcpuNum": 36 1216 | }, 1217 | { 1218 | "_defaultOrder": 26, 1219 | "_isFastLaunch": false, 1220 | "category": "Compute optimized", 1221 | "gpuNum": 0, 1222 | "hideHardwareSpecs": false, 1223 | "memoryGiB": 96, 1224 | "name": "ml.c5.12xlarge", 1225 | "vcpuNum": 48 1226 | }, 1227 | { 1228 | "_defaultOrder": 27, 1229 | "_isFastLaunch": false, 1230 | "category": "Compute optimized", 1231 | "gpuNum": 0, 1232 | "hideHardwareSpecs": false, 1233 | "memoryGiB": 144, 1234 | "name": "ml.c5.18xlarge", 1235 | "vcpuNum": 72 1236 | }, 1237 | { 1238 | "_defaultOrder": 28, 1239 | "_isFastLaunch": false, 1240 | "category": "Compute optimized", 1241 | "gpuNum": 0, 1242 | "hideHardwareSpecs": false, 1243 | "memoryGiB": 192, 1244 | "name": "ml.c5.24xlarge", 1245 | "vcpuNum": 96 1246 | }, 1247 | { 1248 | "_defaultOrder": 29, 1249 | "_isFastLaunch": true, 1250 | "category": "Accelerated computing", 1251 | "gpuNum": 1, 1252 | "hideHardwareSpecs": false, 1253 | "memoryGiB": 16, 1254 | "name": "ml.g4dn.xlarge", 1255 | "vcpuNum": 4 1256 | }, 1257 | { 1258 | "_defaultOrder": 30, 1259 | "_isFastLaunch": false, 1260 | "category": "Accelerated computing", 1261 | "gpuNum": 1, 1262 | "hideHardwareSpecs": false, 1263 | "memoryGiB": 32, 1264 | "name": "ml.g4dn.2xlarge", 1265 | "vcpuNum": 8 1266 | }, 1267 | { 1268 | "_defaultOrder": 31, 1269 | "_isFastLaunch": false, 1270 | "category": "Accelerated computing", 1271 | "gpuNum": 1, 1272 | "hideHardwareSpecs": false, 1273 | "memoryGiB": 64, 1274 | "name": "ml.g4dn.4xlarge", 1275 | "vcpuNum": 16 1276 | }, 1277 | { 1278 | "_defaultOrder": 32, 1279 | "_isFastLaunch": false, 1280 | "category": "Accelerated computing", 1281 | "gpuNum": 1, 1282 | "hideHardwareSpecs": false, 1283 | "memoryGiB": 128, 1284 | "name": "ml.g4dn.8xlarge", 1285 | "vcpuNum": 32 1286 | }, 1287 | { 1288 | "_defaultOrder": 33, 1289 | "_isFastLaunch": false, 1290 | "category": "Accelerated computing", 1291 | "gpuNum": 4, 1292 | "hideHardwareSpecs": false, 1293 | "memoryGiB": 192, 1294 | "name": "ml.g4dn.12xlarge", 1295 | "vcpuNum": 48 1296 | }, 1297 | { 1298 | "_defaultOrder": 34, 1299 | "_isFastLaunch": false, 1300 | "category": "Accelerated computing", 1301 | "gpuNum": 1, 1302 | "hideHardwareSpecs": false, 1303 | "memoryGiB": 256, 1304 | "name": "ml.g4dn.16xlarge", 1305 | "vcpuNum": 64 1306 | }, 1307 | { 1308 | "_defaultOrder": 35, 1309 | "_isFastLaunch": false, 1310 | "category": "Accelerated computing", 1311 | "gpuNum": 1, 1312 | "hideHardwareSpecs": false, 1313 | "memoryGiB": 61, 1314 | "name": "ml.p3.2xlarge", 1315 | "vcpuNum": 8 1316 | }, 1317 | { 1318 | "_defaultOrder": 36, 1319 | "_isFastLaunch": false, 1320 | "category": "Accelerated computing", 1321 | "gpuNum": 4, 1322 | "hideHardwareSpecs": false, 1323 | "memoryGiB": 244, 1324 | "name": 
"ml.p3.8xlarge", 1325 | "vcpuNum": 32 1326 | }, 1327 | { 1328 | "_defaultOrder": 37, 1329 | "_isFastLaunch": false, 1330 | "category": "Accelerated computing", 1331 | "gpuNum": 8, 1332 | "hideHardwareSpecs": false, 1333 | "memoryGiB": 488, 1334 | "name": "ml.p3.16xlarge", 1335 | "vcpuNum": 64 1336 | }, 1337 | { 1338 | "_defaultOrder": 38, 1339 | "_isFastLaunch": false, 1340 | "category": "Accelerated computing", 1341 | "gpuNum": 8, 1342 | "hideHardwareSpecs": false, 1343 | "memoryGiB": 768, 1344 | "name": "ml.p3dn.24xlarge", 1345 | "vcpuNum": 96 1346 | }, 1347 | { 1348 | "_defaultOrder": 39, 1349 | "_isFastLaunch": false, 1350 | "category": "Memory Optimized", 1351 | "gpuNum": 0, 1352 | "hideHardwareSpecs": false, 1353 | "memoryGiB": 16, 1354 | "name": "ml.r5.large", 1355 | "vcpuNum": 2 1356 | }, 1357 | { 1358 | "_defaultOrder": 40, 1359 | "_isFastLaunch": false, 1360 | "category": "Memory Optimized", 1361 | "gpuNum": 0, 1362 | "hideHardwareSpecs": false, 1363 | "memoryGiB": 32, 1364 | "name": "ml.r5.xlarge", 1365 | "vcpuNum": 4 1366 | }, 1367 | { 1368 | "_defaultOrder": 41, 1369 | "_isFastLaunch": false, 1370 | "category": "Memory Optimized", 1371 | "gpuNum": 0, 1372 | "hideHardwareSpecs": false, 1373 | "memoryGiB": 64, 1374 | "name": "ml.r5.2xlarge", 1375 | "vcpuNum": 8 1376 | }, 1377 | { 1378 | "_defaultOrder": 42, 1379 | "_isFastLaunch": false, 1380 | "category": "Memory Optimized", 1381 | "gpuNum": 0, 1382 | "hideHardwareSpecs": false, 1383 | "memoryGiB": 128, 1384 | "name": "ml.r5.4xlarge", 1385 | "vcpuNum": 16 1386 | }, 1387 | { 1388 | "_defaultOrder": 43, 1389 | "_isFastLaunch": false, 1390 | "category": "Memory Optimized", 1391 | "gpuNum": 0, 1392 | "hideHardwareSpecs": false, 1393 | "memoryGiB": 256, 1394 | "name": "ml.r5.8xlarge", 1395 | "vcpuNum": 32 1396 | }, 1397 | { 1398 | "_defaultOrder": 44, 1399 | "_isFastLaunch": false, 1400 | "category": "Memory Optimized", 1401 | "gpuNum": 0, 1402 | "hideHardwareSpecs": false, 1403 | "memoryGiB": 384, 1404 | "name": "ml.r5.12xlarge", 1405 | "vcpuNum": 48 1406 | }, 1407 | { 1408 | "_defaultOrder": 45, 1409 | "_isFastLaunch": false, 1410 | "category": "Memory Optimized", 1411 | "gpuNum": 0, 1412 | "hideHardwareSpecs": false, 1413 | "memoryGiB": 512, 1414 | "name": "ml.r5.16xlarge", 1415 | "vcpuNum": 64 1416 | }, 1417 | { 1418 | "_defaultOrder": 46, 1419 | "_isFastLaunch": false, 1420 | "category": "Memory Optimized", 1421 | "gpuNum": 0, 1422 | "hideHardwareSpecs": false, 1423 | "memoryGiB": 768, 1424 | "name": "ml.r5.24xlarge", 1425 | "vcpuNum": 96 1426 | }, 1427 | { 1428 | "_defaultOrder": 47, 1429 | "_isFastLaunch": false, 1430 | "category": "Accelerated computing", 1431 | "gpuNum": 1, 1432 | "hideHardwareSpecs": false, 1433 | "memoryGiB": 16, 1434 | "name": "ml.g5.xlarge", 1435 | "vcpuNum": 4 1436 | }, 1437 | { 1438 | "_defaultOrder": 48, 1439 | "_isFastLaunch": false, 1440 | "category": "Accelerated computing", 1441 | "gpuNum": 1, 1442 | "hideHardwareSpecs": false, 1443 | "memoryGiB": 32, 1444 | "name": "ml.g5.2xlarge", 1445 | "vcpuNum": 8 1446 | }, 1447 | { 1448 | "_defaultOrder": 49, 1449 | "_isFastLaunch": false, 1450 | "category": "Accelerated computing", 1451 | "gpuNum": 1, 1452 | "hideHardwareSpecs": false, 1453 | "memoryGiB": 64, 1454 | "name": "ml.g5.4xlarge", 1455 | "vcpuNum": 16 1456 | }, 1457 | { 1458 | "_defaultOrder": 50, 1459 | "_isFastLaunch": false, 1460 | "category": "Accelerated computing", 1461 | "gpuNum": 1, 1462 | "hideHardwareSpecs": false, 1463 | "memoryGiB": 128, 1464 | "name": "ml.g5.8xlarge", 1465 | 
"vcpuNum": 32 1466 | }, 1467 | { 1468 | "_defaultOrder": 51, 1469 | "_isFastLaunch": false, 1470 | "category": "Accelerated computing", 1471 | "gpuNum": 1, 1472 | "hideHardwareSpecs": false, 1473 | "memoryGiB": 256, 1474 | "name": "ml.g5.16xlarge", 1475 | "vcpuNum": 64 1476 | }, 1477 | { 1478 | "_defaultOrder": 52, 1479 | "_isFastLaunch": false, 1480 | "category": "Accelerated computing", 1481 | "gpuNum": 4, 1482 | "hideHardwareSpecs": false, 1483 | "memoryGiB": 192, 1484 | "name": "ml.g5.12xlarge", 1485 | "vcpuNum": 48 1486 | }, 1487 | { 1488 | "_defaultOrder": 53, 1489 | "_isFastLaunch": false, 1490 | "category": "Accelerated computing", 1491 | "gpuNum": 4, 1492 | "hideHardwareSpecs": false, 1493 | "memoryGiB": 384, 1494 | "name": "ml.g5.24xlarge", 1495 | "vcpuNum": 96 1496 | }, 1497 | { 1498 | "_defaultOrder": 54, 1499 | "_isFastLaunch": false, 1500 | "category": "Accelerated computing", 1501 | "gpuNum": 8, 1502 | "hideHardwareSpecs": false, 1503 | "memoryGiB": 768, 1504 | "name": "ml.g5.48xlarge", 1505 | "vcpuNum": 192 1506 | }, 1507 | { 1508 | "_defaultOrder": 55, 1509 | "_isFastLaunch": false, 1510 | "category": "Accelerated computing", 1511 | "gpuNum": 8, 1512 | "hideHardwareSpecs": false, 1513 | "memoryGiB": 1152, 1514 | "name": "ml.p4d.24xlarge", 1515 | "vcpuNum": 96 1516 | }, 1517 | { 1518 | "_defaultOrder": 56, 1519 | "_isFastLaunch": false, 1520 | "category": "Accelerated computing", 1521 | "gpuNum": 8, 1522 | "hideHardwareSpecs": false, 1523 | "memoryGiB": 1152, 1524 | "name": "ml.p4de.24xlarge", 1525 | "vcpuNum": 96 1526 | }, 1527 | { 1528 | "_defaultOrder": 57, 1529 | "_isFastLaunch": false, 1530 | "category": "Accelerated computing", 1531 | "gpuNum": 0, 1532 | "hideHardwareSpecs": false, 1533 | "memoryGiB": 32, 1534 | "name": "ml.trn1.2xlarge", 1535 | "vcpuNum": 8 1536 | }, 1537 | { 1538 | "_defaultOrder": 58, 1539 | "_isFastLaunch": false, 1540 | "category": "Accelerated computing", 1541 | "gpuNum": 0, 1542 | "hideHardwareSpecs": false, 1543 | "memoryGiB": 512, 1544 | "name": "ml.trn1.32xlarge", 1545 | "vcpuNum": 128 1546 | }, 1547 | { 1548 | "_defaultOrder": 59, 1549 | "_isFastLaunch": false, 1550 | "category": "Accelerated computing", 1551 | "gpuNum": 0, 1552 | "hideHardwareSpecs": false, 1553 | "memoryGiB": 512, 1554 | "name": "ml.trn1n.32xlarge", 1555 | "vcpuNum": 128 1556 | } 1557 | ], 1558 | "instance_type": "ml.t3.medium", 1559 | "kernelspec": { 1560 | "display_name": "Python 3 (ipykernel)", 1561 | "language": "python", 1562 | "name": "python3" 1563 | }, 1564 | "language_info": { 1565 | "codemirror_mode": { 1566 | "name": "ipython", 1567 | "version": 3 1568 | }, 1569 | "file_extension": ".py", 1570 | "mimetype": "text/x-python", 1571 | "name": "python", 1572 | "nbconvert_exporter": "python", 1573 | "pygments_lexer": "ipython3", 1574 | "version": "3.11.6" 1575 | } 1576 | }, 1577 | "nbformat": 4, 1578 | "nbformat_minor": 5 1579 | } 1580 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Generating Synthetic Datasets for Evaluating Retrieval Augmented Generation Systems 2 | 3 | As Retrieval Augmented Generation (RAG) systems become more prevalent, evaluating their performance is essential to ensure quality and performance. However, collecting real-world data for evaluation can be costly and time-consuming, especially in the early stages of a project. 
This lab provides a hands-on guide to leveraging large language models and knowledge retrieval context to generate synthetic evaluation datasets that mimic real human interactions. It covers setting up an end-to-end workflow using Python and the Amazon Bedrock API. 4 | 5 | By leveraging large language models and knowledge retrieval context, the proposed approach ensures that the synthetic datasets are diverse, realistic, and representative of real-world scenarios. This solution is relevant for developers and researchers working on RAG systems, as it streamlines the evaluation process and accelerates the iterative development cycle, ultimately leading to better-performing AI systems. 6 | 7 | The notebook guides you through generating a synthetic dataset for a QA-RAG application using the Bedrock API, Python and LangChain. The notebook consists of the following chapters: 8 | 9 | 1. Set-up of the environment 10 | 2. Loading and preparing context data 11 | 3. Initial Question Generation 12 | 4. Answer Generation 13 | 5. Extracting Relevant Context 14 | 6. Evolving Questions to fit end-user behaviour 15 | 7. Automated Dataset Generation 16 | 8. Assessing the Questions' quality 17 | 18 | 19 | ## Getting started 20 | 21 | ### Choose a notebook environment 22 | 23 | This lab is presented as a **Python notebook**, which you can run from the environment of your choice: 24 | 25 | - [SageMaker Studio](https://aws.amazon.com/sagemaker/studio/) is a web-based integrated development environment (IDE) for machine learning. To get started quickly, refer to the [instructions for domain quick setup](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html). 26 | - [SageMaker Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/howitworks-create-ws.html) is a machine learning (ML) compute instance running the Jupyter Notebook App. 27 | - To use your existing (local or other) notebook environment, make sure it has [credentials for calling AWS](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). 28 | 29 | 30 | ### Enable AWS IAM permissions for Bedrock 31 | 32 | The AWS identity you assume from your notebook environment (the [*Studio/notebook Execution Role*](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) when using SageMaker, or a role or IAM user for self-managed notebooks) must have sufficient [AWS IAM permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies.html) to call the Amazon Bedrock service. 33 | 34 | To grant Bedrock access to your identity: 35 | 36 | - Open the [AWS IAM Console](https://us-east-1.console.aws.amazon.com/iam/home?#) 37 | - Find your [Role](https://us-east-1.console.aws.amazon.com/iamv2/home?#/roles) (if using SageMaker or otherwise assuming an IAM Role), or else [User](https://us-east-1.console.aws.amazon.com/iamv2/home?#/users) 38 | - Select *Add Permissions > Create Inline Policy* to attach new inline permissions, open the *JSON* editor and paste in the below example policy: 39 | 40 | ``` 41 | { 42 | "Version": "2012-10-17", 43 | "Statement": { 44 | "Sid": "AllowInference", 45 | "Effect": "Allow", 46 | "Action": [ 47 | "bedrock:InvokeModel" 48 | ], 49 | "Resource": "arn:aws:bedrock:*::foundation-model/*" 50 | } 51 | } 52 | ``` 53 | 54 | > ℹ️ **Note:** With Amazon SageMaker, your notebook execution role is typically *separate* from the user or role that you log in to the AWS Console with. If you want to explore the AWS Console for Amazon Bedrock, you also need to grant permissions to your Console user/role. You can run the notebooks anywhere, as long as you have access to the Amazon Bedrock service and appropriate credentials.
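Once the policy is attached, a quick way to verify access from your notebook is to call `InvokeModel` directly with `boto3`. The snippet below is only an illustrative sketch, not part of the lab notebook: it assumes `boto3` is installed, that the Claude 3 Sonnet model is enabled for your account in the Bedrock console, and that `us-east-1` is your Bedrock region — adjust both as needed.

```python
# Illustrative sanity check (not part of the lab notebook): confirms that the
# identity used by this environment can call bedrock:InvokeModel.
# Assumptions: boto3 is installed, Claude 3 Sonnet access is enabled for your
# account, and us-east-1 is the region where you use Bedrock -- adjust as needed.
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    contentType="application/json",
    accept="application/json",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 50,
        "messages": [{"role": "user", "content": "Reply with the single word: pong"}],
    }),
)

# The response body is a streaming payload containing the model's JSON output.
print(json.loads(response["body"].read())["content"][0]["text"])
```

If the call fails with an access-denied error, re-check the inline policy above and the model access settings in the Bedrock console.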
55 | 56 | For more information on the fine-grained action and resource permissions in Bedrock, check out the Bedrock [Developer Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html). 57 | 58 | ### Clone and use the notebooks 59 | 60 | > ℹ️ **Note:** In SageMaker Studio, you can open a "System Terminal" to run these commands by clicking *File > New > Terminal* 61 | 62 | Once your notebook environment is set up, clone this workshop repository into it. 63 | 64 | ```sh 65 | git clone https://github.com/aws-samples/generating-synthetic-datasets-for-evaluating-retrieval-augmented-generation-systems.git 66 | cd generating-synthetic-datasets-for-evaluating-retrieval-augmented-generation-systems/Notebook 67 | ``` 68 | 69 | 70 | You're now ready to explore the lab notebook! You will be guided through connecting the notebook to Amazon Bedrock for large language model access. 71 | 72 | 73 | ## Contributing 74 | 75 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 76 | 77 | ## License 78 | This lab is licensed under the MIT-0 License. 79 | 80 | --------------------------------------------------------------------------------