├── .gitignore
├── 1_raft_datagen.ipynb
├── 2_raft_finetuning.ipynb
├── 3_raft_evaluation.ipynb
├── README.md
├── data
    ├── banking_test.jsonl
    ├── banking_train.jsonl
    ├── banking_validation.jsonl
    └── better_banking_guide_en.pdf
├── requirements.txt
├── sample.env
└── static
    ├── ft_checkpoints.png
    ├── ft_deploy.png
    ├── ft_metrics.png
    ├── ft_monitor.png
    └── raft.png


/.gitignore:
--------------------------------------------------------------------------------
 1 | data/training_data/vampires_eval_batch_run.json
 2 | data/training_data/vampires_eval_batch_run.jsonl
 3 | data/training_data/vampires_eval.json
 4 | data/training_data/vampires_test_with_metadata.json
 5 | data/training_data/vampires_test.jsonl
 6 | data/training_data/vampires_train.jsonl
 7 | data/training_data/vampires.jsonl
 8 | .env
 9 | data/tax_guide/irs_form_p17.pdf
10 | test.md
11 | data/training_data/banking_test_with_metadata.json
12 | data/training_data/banking_test.jsonl
13 | data/training_data/banking_train.jsonl
14 | data/training_data/banking_validation.jsonl
15 | 


--------------------------------------------------------------------------------
/1_raft_datagen.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# 1. RAFT training data generation using GPT-4o\n",
  8 |     "\n",
  9 |     "In this notebook, we'll synthesize some training data that will eventually be used to fine-tune a GPT-4o mini model in order to adapt it to a set of document(s) and specific domain. **This step is critical as the quality of your training data will greatly influence the performance of your fine-tuned model.**\n",
 10 |     "\n",
 11 |     "For RAFT, these are the different steps to prepare the training dataset:\n",
 12 |     "- Collect Domain-Specific Documents: Gather documents relevant to the domain you want to specialize the LLM in (e.g., medical documents for PubMed, legal documents, API documentation for software).\n",
 13 |     "- Chunk the file into Documents\n",
 14 |     "- For each Document chunk, generate a set of Questions that can be answered from the Document\n",
 15 |     "- For each Document-Question pair, create a list of documents using:\n",
 16 |     "    - **Golden Document (D*)**: Document that contains the answer to the question.\n",
 17 |     "    - **Distractor Documents (Dk)**: Documents that do not contain relevant information.\n",
 18 |     "- Question-Answer-Document Triplets: From each **Document-Question** pair, generate a factual **Answer** based on the Golden Document.\n",
 19 |     "\n",
 20 |     "Curating a good training dataset often involves manual work and review by SMEs. That said, we can use an LLM to help us generate an initial set of training examples that can be vetted and further refined by SMEs.\n"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "markdown",
 25 |    "metadata": {},
 26 |    "source": [
 27 |     "### 0. Pre-requisites\n",
 28 |     "\n",
 29 |     "For this hands-on workshop, all you need is access to an Azure subscription and the ability to create Azure OpenAI resources and deployments. \n",
 30 |     "\n",
 31 |     "0. Install poppler for PDF processing\n",
 32 |     "\n",
 33 |     "- on Linux run `sudo apt-get install -y poppler-utils`\n",
 34 |     "- on Mac run `brew install poppler`\n",
 35 |     "- on Windows run `conda install -c conda-forge poppler`\n",
 36 |     "\n",
 37 |     "1. Create a code environment and install the necessary packages\n",
 38 |     "\n",
 39 |     "```shell\n",
 40 |     "conda create -n raft python=3.11\n",
 41 |     "\n",
 42 |     "conda activate raft\n",
 43 |     "\n",
 44 |     "pip install -r requirements.txt\n",
 45 |     "```\n",
 46 |     "\n",
 47 |     "2. Create a GPT-4o deployment\n",
 48 |     "3. Create a GPT-4o mini deployment\n",
 49 |     "4. Create an Azure OpenAI resource in North Central US or Sweden Central (regions where gpt-4o-mini fine tuning is supported)\n",
 50 |     "5. Create a `.env` file based on the [sample.env](./sample.env) file in this repository to store your credentials and important environment variables. Paste your AOAI endpoints, keys and deployment names, name the file `.env`"
 51 |    ]
 52 |   },
 53 |   {
 54 |    "cell_type": "markdown",
 55 |    "metadata": {},
 56 |    "source": [
 57 |     "**Import libraries**"
 58 |    ]
 59 |   },
 60 |   {
 61 |    "cell_type": "code",
 62 |    "execution_count": null,
 63 |    "metadata": {},
 64 |    "outputs": [],
 65 |    "source": [
 66 |     "from openai import AzureOpenAI\n",
 67 |     "from langchain_experimental.text_splitter import SemanticChunker\n",
 68 |     "from dotenv import load_dotenv\n",
 69 |     "from io import BytesIO\n",
 70 |     "import base64\n",
 71 |     "from typing import Literal, Any\n",
 72 |     "import os\n",
 73 |     "from math import ceil\n",
 74 |     "import random\n",
 75 |     "from tqdm import tqdm\n",
 76 |     "from pdf2image import convert_from_path\n",
 77 |     "\n",
 78 |     "load_dotenv()\n",
 79 |     "\n",
 80 |     "generator_client = AzureOpenAI(\n",
 81 |     "    azure_endpoint=os.getenv(\"AOAI_GPT4o_ENDPOINT\"),\n",
 82 |     "    api_version=\"2024-02-01\",\n",
 83 |     "    api_key=os.getenv(\"AOAI_GPT4o_API_KEY\")\n",
 84 |     ")\n",
 85 |     "\n",
 86 |     "gpt4o_deployment = os.getenv(\"AOAI_GPT4o_DEPLOYMENT\")\n"
 87 |    ]
 88 |   },
 89 |   {
 90 |    "cell_type": "markdown",
 91 |    "metadata": {},
 92 |    "source": [
 93 |     "### 1. Loading and chunking domain-specific documents\n",
 94 |     "\n",
 95 |     "For Retrieval Augmented Fine Tuning, we need to generate Question-Documents-Answer triplets. The first step is to create document chunks based on our domain-specific documents we want to specialize our model on.\n",
 96 |     "\n",
 97 |     "For this workshop, we will use the publicly available [BMO Better Banking Guide](./data/better_banking_guide_en.pdf)\n",
 98 |     "\n",
 99 |     "**Take a minute to browse through the PDF and become more familiar with its content**\n",
100 |     "\n",
101 |     "The guide contains information about various banking accounts offered by BMO as well as how-to guides on e.g how to access Account statements etc.\n",
102 |     "\n",
103 |     "The document is in PDF format and contains a number of tables and charts, we will use GPT-4o to convert the pages content to markdown. Credit to Liam Cavanagh for the inspiration.\n",
104 |     "\n",
105 |     "**a. First we'll need to convert the document pages to images encoded in base64**"
106 |    ]
107 |   },
108 |   {
109 |    "cell_type": "code",
110 |    "execution_count": null,
111 |    "metadata": {},
112 |    "outputs": [],
113 |    "source": [
114 |     "def pdf_to_base64_urls(pdf_path):\n",
115 |     "  \"\"\"Converts each page of a PDF to a base64 encoded URL starting with 'data:image/jpeg'.\n",
116 |     "\n",
117 |     "  Parameters:\n",
118 |     "    pdf_path: Path to the PDF file.\n",
119 |     "\n",
120 |     "  Returns:\n",
121 |     "    A list of base64 encoded image URLs, one for each page.\n",
122 |     "  \"\"\"\n",
123 |     "\n",
124 |     "  images = convert_from_path(pdf_path)\n",
125 |     "  base64_urls = []\n",
126 |     "\n",
127 |     "  for image in images:\n",
128 |     "    img_byte_arr = BytesIO()\n",
129 |     "    image.save(img_byte_arr, format=\"JPEG\")\n",
130 |     "    img_byte_arr.seek(0)\n",
131 |     "    base64_encoded = base64.b64encode(img_byte_arr.read()).decode('utf-8')\n",
132 |     "    base64_url = f\"data:image/jpeg;base64,{base64_encoded}\"\n",
133 |     "    base64_urls.append(base64_url)\n",
134 |     "\n",
135 |     "  return base64_urls\n",
136 |     "\n",
137 |     "pdf_path = './data/better_banking_guide_en.pdf'\n",
138 |     "image_data = pdf_to_base64_urls(pdf_path)"
139 |    ]
140 |   },
141 |   {
142 |    "cell_type": "markdown",
143 |    "metadata": {},
144 |    "source": [
145 |     "**b. Now we can call GPT-4o to convert our images to markdown**"
146 |    ]
147 |   },
148 |   {
149 |    "cell_type": "code",
150 |    "execution_count": null,
151 |    "metadata": {},
152 |    "outputs": [],
153 |    "source": [
154 |     "def gpt_image_to_markdown(image_data, client):\n",
155 |     "    \"\"\"\n",
156 |     "    Converts an image to markdown using GPT-4o.\n",
157 |     "    \n",
158 |     "    Parameters:\n",
159 |     "    image_data: A base64 encoded image.\n",
160 |     "    client: An AzureOpenAI client\n",
161 |     "\n",
162 |     "    Returns:\n",
163 |     "    A list of base64 encoded image URLs, one for each page.\n",
164 |     "    \"\"\"\n",
165 |     "    messages = [\n",
166 |     "        {\"role\":\"system\", \"content\":\"\"\"You are an AI image assistant capable of extracting text from images\n",
167 |     "         Given an image, you must extract the any visible text on the image and return it in Markdown.\n",
168 |     "         You must keep the original layout and formatting of the text as much as possible in Markdown format.\n",
169 |     "         Pay attention to the text size and use headers, subheaders, bold, italic, tables etc where necessary.\"\"\"},\n",
170 |     "         {\"role\":\"user\", \"content\":[{\n",
171 |     "                \"type\":\"image_url\",\n",
172 |     "                \"image_url\":{\n",
173 |     "                    \"url\":image_data\n",
174 |     "                    }\n",
175 |     "         }]}\n",
176 |     "    ]\n",
177 |     "    try:\n",
178 |     "        response = client.chat.completions.create(\n",
179 |     "            model=gpt4o_deployment,\n",
180 |     "            messages=messages\n",
181 |     "        )\n",
182 |     "        \n",
183 |     "        return response.choices[0].message.content\n",
184 |     "    \n",
185 |     "    except Exception as e:\n",
186 |     "        print(f\"Error: {e}\")\n",
187 |     "        return None\n",
188 |     "\n",
189 |     "\n",
190 |     "markdown_doc = \"\"\n",
191 |     "\n",
192 |     "with tqdm(total=len(image_data), desc=\"Converting images to Markdown\") as pbar:\n",
193 |     "    for img_data in image_data:\n",
194 |     "        result = gpt_image_to_markdown(img_data, generator_client)\n",
195 |     "        markdown_doc += \"\\n\" + result\n",
196 |     "        pbar.update(1)\n"
197 |    ]
198 |   },
199 |   {
200 |    "cell_type": "markdown",
201 |    "metadata": {},
202 |    "source": [
203 |     "**c. Let's chunk the Markdown using Langchain**\n",
204 |     "\n",
205 |     "We first use Langchain's `MarkdownHeaderTextSplitter` to split the document based on headers and then further split the chunks using `RecursiveCharacterTextSplitter` with a chunk size of 1024. Finally, we filter out any chunks that are too short to contain any valuable information."
206 |    ]
207 |   },
208 |   {
209 |    "cell_type": "code",
210 |    "execution_count": null,
211 |    "metadata": {},
212 |    "outputs": [],
213 |    "source": [
214 |     "from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter\n",
215 |     "import re\n",
216 |     "\n",
217 |     "def remove_special_characters(string):\n",
218 |     "    \"\"\"\n",
219 |     "    Remove special characters from a string.\n",
220 |     "    \n",
221 |     "    Parameters:  \n",
222 |     "    string (str): The input string from which special characters need to be removed.  \n",
223 |     "  \n",
224 |     "    Returns:  \n",
225 |     "    str: A new string with special characters removed.\n",
226 |     "    \"\"\"\n",
227 |     "    return re.sub(r'[^a-zA-Z0-9\\s]', '', string)\n",
228 |     "\n",
229 |     "headers_to_split_on = [\n",
230 |     "    (\"#\", \"Header 1\"),\n",
231 |     "    (\"##\", \"Header 2\")    \n",
232 |     "]\n",
233 |     "\n",
234 |     "markdown_splitter = MarkdownHeaderTextSplitter(\n",
235 |     "    headers_to_split_on=headers_to_split_on,\n",
236 |     "    strip_headers=False\n",
237 |     "    )\n",
238 |     "\n",
239 |     "markdown_doc_splits = markdown_splitter.split_text(markdown_doc)\n",
240 |     "print(f\"Number of chunks after markdown split: {len(markdown_doc_splits)}\")\n",
241 |     "\n",
242 |     "chunk_size = 1024\n",
243 |     "chunk_overlap = 50\n",
244 |     "text_splitter = RecursiveCharacterTextSplitter(\n",
245 |     "    chunk_size=chunk_size, \n",
246 |     "    chunk_overlap=chunk_overlap,\n",
247 |     ")\n",
248 |     "\n",
249 |     "\n",
250 |     "chunked_document = text_splitter.split_documents(markdown_doc_splits)\n",
251 |     "\n",
252 |     "print(f\"Number of chunks after markdown + recursive split: {len(chunked_document)}\")\n",
253 |     "\n",
254 |     "chunks = [chunk.page_content for chunk in chunked_document if len(remove_special_characters(chunk.page_content))>100]\n",
255 |     "\n",
256 |     "print(f\"Number of chunks after filtering out empty: {len(chunks)}\")\n"
257 |    ]
258 |   },
259 |   {
260 |    "cell_type": "markdown",
261 |    "metadata": {},
262 |    "source": [
263 |     "### 2. Generate training data from the chunked documents\n",
264 |     "\n",
265 |     "We define 2 main functions to generate our Question-Document-Answer triplets from our chunked document\n",
266 |     "\n",
267 |     "1. `generate_instructions_gen()`: This function generates a list of questions based on an input document chunk\n",
268 |     "2. `generate_label()`: This function generates an Answer based on a Question-Document chunk pair\n",
269 |     "\n",
270 |     "**a. First, lets look at the `generate_instructions_gen()` function on a sample**"
271 |    ]
272 |   },
273 |   {
274 |    "cell_type": "code",
275 |    "execution_count": null,
276 |    "metadata": {},
277 |    "outputs": [],
278 |    "source": [
279 |     "def strip_str(s: str) -> str:\n",
280 |     "    \"\"\"\n",
281 |     "    Helper function for helping format strings returned by GPT-4o.\n",
282 |     "    \n",
283 |     "    Parameters:  \n",
284 |     "    s (str): The input string to be formatted.  \n",
285 |     "  \n",
286 |     "    Returns:  \n",
287 |     "    str: A formatted string \n",
288 |     "    \"\"\"\n",
289 |     "    l, r = 0, len(s)-1\n",
290 |     "    beg_found = False\n",
291 |     "    for i in range(len(s)):\n",
292 |     "        if s[i].isalpha():\n",
293 |     "            if not beg_found:\n",
294 |     "                l = i\n",
295 |     "                beg_found = True\n",
296 |     "            else:\n",
297 |     "                r = i \n",
298 |     "    r += 2\n",
299 |     "    return s[l:min(r, len(s))]\n",
300 |     "\n",
301 |     "def generate_instructions_gen(client: AzureOpenAI, chunk: Any, x: int = 5, model: str = None) -> list[str]:\n",
302 |     "    \"\"\"\n",
303 |     "    Generates a list of questions or use cases based on a provided chunk of context using an Azure OpenAI model.  \n",
304 |     "\n",
305 |     "    Parameters:  \n",
306 |     "    client (AzureOpenAI): An instance of the Azure OpenAI client used to communicate with the OpenAI API.  \n",
307 |     "    chunk (Any): The context or chunk of text based on which the questions are to be generated.  \n",
308 |     "    x (int, optional): The number of questions to generate. Default is 5.  \n",
309 |     "    model (str, optional): The specific model to use for generating the questions. Default is None, which uses the default model configured in the client.  \n",
310 |     "  \n",
311 |     "    Returns:  \n",
312 |     "    list[str]: A list of generated questions.\n",
313 |     "    \"\"\"\n",
314 |     "    response = client.chat.completions.create(\n",
315 |     "        model=model,\n",
316 |     "        messages=[\n",
317 |     "            {\"role\": \"system\", \"content\": \"You are a synthetic question-answer pair generator. Given a chunk of context about some topic(s), generate exactly %s example questions a user could ask and would be answered using information from the chunk. For example, if the given context was a Wikipedia paragraph about the United States, an example question could be 'How many states are in the United States?'\" % (x)},\n",
318 |     "            {\"role\": \"system\", \"content\": \"The questions should be able to be answered in a few words or less. Include only the questions in your response.\"},\n",
319 |     "            {\"role\": \"user\", \"content\": str(chunk)}\n",
320 |     "        ]\n",
321 |     "    )\n",
322 |     "\n",
323 |     "    queries = response.choices[0].message.content.split('\\n')\n",
324 |     "    queries = [strip_str(q) for q in queries]\n",
325 |     "    queries = [q for q in queries if any(c.isalpha() for c in q)]\n",
326 |     "    return queries[:int(x)]"
327 |    ]
328 |   },
329 |   {
330 |    "cell_type": "markdown",
331 |    "metadata": {},
332 |    "source": [
333 |     "Let's visualize an example picked randomly from our Document chunks"
334 |    ]
335 |   },
336 |   {
337 |    "cell_type": "code",
338 |    "execution_count": null,
339 |    "metadata": {},
340 |    "outputs": [],
341 |    "source": [
342 |     "sample_index = random.randint(0, len(chunks)-1)\n",
343 |     "chunk = chunks[sample_index]\n",
344 |     "\n",
345 |     "queries = generate_instructions_gen(generator_client, chunk, x=5, model=gpt4o_deployment)"
346 |    ]
347 |   },
348 |   {
349 |    "cell_type": "code",
350 |    "execution_count": null,
351 |    "metadata": {},
352 |    "outputs": [],
353 |    "source": [
354 |     "print(chunk)"
355 |    ]
356 |   },
357 |   {
358 |    "cell_type": "code",
359 |    "execution_count": null,
360 |    "metadata": {},
361 |    "outputs": [],
362 |    "source": [
363 |     "queries\n"
364 |    ]
365 |   },
366 |   {
367 |    "cell_type": "markdown",
368 |    "metadata": {},
369 |    "source": [
370 |     "**b. Generating questions, answers and adding distractor documents** "
371 |    ]
372 |   },
373 |   {
374 |    "cell_type": "code",
375 |    "execution_count": null,
376 |    "metadata": {},
377 |    "outputs": [],
378 |    "source": [
379 |     "from datasets import Dataset, load_dataset\n",
380 |     "import random\n",
381 |     "from typing import Any\n",
382 |     "\n",
383 |     "def encode_question_gen(question: str, chunk: Any) -> list[str]:\n",
384 |     "    \"\"\"\n",
385 |     "    Encode multiple prompt instructions into a single string for the general case (`pdf`, `json`, or `txt`).\n",
386 |     "\n",
387 |     "    Parameters:  \n",
388 |     "    question (str): The question to be answered.  \n",
389 |     "    chunk (Any): The context or chunk of text that provides the information needed to answer the question.  \n",
390 |     "  \n",
391 |     "    Returns:  \n",
392 |     "    list[str]: A list of messages formatted for the language model API, including system and user roles.  \n",
393 |     "    \"\"\"\n",
394 |     "    \n",
395 |     "    prompts = []\n",
396 |     "        \n",
397 |     "    prompt = \"\"\"\n",
398 |     "        Question: {question}\\n Context: {context}\\n\n",
399 |     "        Answer this question using the information given in the context above and no prior knowledge. Here is things to pay attention to: \n",
400 |     "        - First provide step-by-step reasoning on how to answer the question. \n",
401 |     "        - In the reasoning, if you need to copy paste some sentences from the context, include them in ##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and ##end_quote## are not directly copy paste from the context. \n",
402 |     "        - End your response with final answer in the form <ANSWER>: $answer, the answer should be given in a joyful and friendly tone.\n",
403 |     "        - If the answer cannot be found in the context, say \"I'm sorry, I cannot answer this question as I'm missing the required information\"\n",
404 |     "        You MUST begin your final answer with the tag \"<ANSWER>:\".\n",
405 |     "    \"\"\".format(question=question, context=str(chunk))\n",
406 |     "    prompts.append({\"role\": \"system\", \"content\": \"You are a helpful question answerer who can provide an answer given a question and relevant context.\"})\n",
407 |     "    prompts.append({\"role\": \"user\", \"content\": prompt})\n",
408 |     "    return prompts\n",
409 |     "\n",
410 |     "def generate_label(client: AzureOpenAI, question: str, context: Any, model: str = None) -> str | None:\n",
411 |     "    \"\"\"\n",
412 |     "    Generates the label / answer to `question` using `context` and GPT-4o.\n",
413 |     "\n",
414 |     "    Parameters:  \n",
415 |     "    client (AzureOpenAI): An instance of the Azure OpenAI client used to communicate with the OpenAI API.  \n",
416 |     "    question (str): The question to be answered.  \n",
417 |     "    context (Any): The context or chunk of text that provides the information needed to answer the question.  \n",
418 |     "    model (str, optional): The specific model to use for generating the answer. Default is None, which uses the default model configured in the client.  \n",
419 |     "  \n",
420 |     "    Returns:  \n",
421 |     "    str | None: The generated answer from the language model, or None if no answer was generated.\n",
422 |     "    \"\"\"\n",
423 |     "    question = encode_question_gen(question, context)\n",
424 |     "    response = client.chat.completions.create(\n",
425 |     "        model=model,\n",
426 |     "        messages=question,\n",
427 |     "        n=1,\n",
428 |     "        temperature=0\n",
429 |     "    )\n",
430 |     "    response = response.choices[0].message.content\n",
431 |     "    return response\n",
432 |     "\n",
433 |     "def add_chunk_to_dataset(\n",
434 |     "    client: AzureOpenAI,\n",
435 |     "    chunks: list[str], \n",
436 |     "    chunk: str, \n",
437 |     "    x: int = 5, \n",
438 |     "    num_distract: int = 3, \n",
439 |     "    p: float = 0.8,\n",
440 |     "    model: str = None\n",
441 |     ") -> None:\n",
442 |     "    \"\"\"\n",
443 |     "    Given a chunk, create {Q, A, D} triplets and add them to the dataset.\n",
444 |     "\n",
445 |     "     Parameters:  \n",
446 |     "    client (AzureOpenAI): An instance of the Azure OpenAI client used to communicate with the OpenAI API.  \n",
447 |     "    chunks (list[str]): A list of chunks of text from which distractor documents can be sampled.  \n",
448 |     "    chunk (str): The chunk of text to use as the primary context for generating questions and answers.  \n",
449 |     "    x (int, optional): The number of questions to generate for the given chunk. Default is 5.  \n",
450 |     "    num_distract (int, optional): The number of distractor documents to include with each question. Default is 3.  \n",
451 |     "    p (float, optional): The probability of including the oracle (original) document as part of the context. Default is 0.8.  \n",
452 |     "    model (str, optional): The specific model to use for generating questions and answers. Default is None, which uses the default model configured in the client. \n",
453 |     "    \"\"\"\n",
454 |     "    global ds\n",
455 |     "    global errors\n",
456 |     "    i = chunks.index(chunk)\n",
457 |     "    try:\n",
458 |     "        qs = generate_instructions_gen(client, chunk, x, model)\n",
459 |     "    except Exception as e:\n",
460 |     "        errors.append(e)\n",
461 |     "        return None\n",
462 |     "    for q in qs:\n",
463 |     "        datapt = {\n",
464 |     "            \"id\": None,\n",
465 |     "            \"type\": None,\n",
466 |     "            \"question\": None,\n",
467 |     "            \"context\": None,\n",
468 |     "            \"oracle_context\": None,\n",
469 |     "            \"cot_answer\": None\n",
470 |     "        }\n",
471 |     "\n",
472 |     "        datapt[\"id\"] = f\"seed_task_{i}\"\n",
473 |     "        datapt[\"type\"] = \"general\"\n",
474 |     "        datapt[\"question\"] = q\n",
475 |     "\n",
476 |     "        # add num_distract distractor docs\n",
477 |     "        docs = [chunk]\n",
478 |     "        indices = list(range(0, len(chunks)))\n",
479 |     "        indices.remove(i)\n",
480 |     "        for j in random.sample(indices, num_distract):\n",
481 |     "            docs.append(chunks[j])\n",
482 |     "        \n",
483 |     "        # decides whether to add oracle document\n",
484 |     "        oracle = random.uniform(0, 1) < p\n",
485 |     "        if not oracle:\n",
486 |     "            docs[0] = chunks[random.sample(indices, 1)[0]]\n",
487 |     "        random.shuffle(docs)\n",
488 |     "\n",
489 |     "        d = {\n",
490 |     "            \"title\": [],\n",
491 |     "            \"sentences\": []\n",
492 |     "        }\n",
493 |     "\n",
494 |     "        d[\"title\"].append([\"placeholder_title\"]*(num_distract+1))\n",
495 |     "        d[\"sentences\"].append(docs)\n",
496 |     "        datapt[\"context\"] = d\n",
497 |     "        datapt[\"oracle_context\"] = chunk\n",
498 |     "\n",
499 |     "        # add answer to q\n",
500 |     "        try:\n",
501 |     "            datapt[\"cot_answer\"] = generate_label(client, q, chunk, model=model)\n",
502 |     "        except Exception as e:\n",
503 |     "            errors.append(e)\n",
504 |     "            continue\n",
505 |     "\n",
506 |     "        # construct model instruction \n",
507 |     "        context = \"\"\n",
508 |     "        for doc in docs:\n",
509 |     "            context += \"<DOCUMENT>\" + str(doc) + \"</DOCUMENT>\\n\"\n",
510 |     "        context += q\n",
511 |     "        datapt[\"instruction\"] = context\n",
512 |     "\n",
513 |     "        # add to dataset\n",
514 |     "        if not ds:\n",
515 |     "            # init ds\n",
516 |     "            datapt[\"id\"] = [datapt[\"id\"]]\n",
517 |     "            datapt[\"type\"] = [datapt[\"type\"]]\n",
518 |     "            datapt[\"question\"] = [datapt[\"question\"]]\n",
519 |     "            datapt[\"context\"] = [datapt[\"context\"]]\n",
520 |     "            datapt[\"oracle_context\"] = [datapt[\"oracle_context\"]]\n",
521 |     "            datapt[\"cot_answer\"] = [datapt[\"cot_answer\"]]\n",
522 |     "            datapt[\"instruction\"] = [datapt[\"instruction\"]]\n",
523 |     "            ds = Dataset.from_dict(datapt)\n",
524 |     "        else:\n",
525 |     "            ds = ds.add_item(datapt)"
526 |    ]
527 |   },
528 |   {
529 |    "cell_type": "markdown",
530 |    "metadata": {},
531 |    "source": [
532 |     "**Let's execute this function a in multi-threaded way to speed up the process**"
533 |    ]
534 |   },
535 |   {
536 |    "cell_type": "code",
537 |    "execution_count": null,
538 |    "metadata": {},
539 |    "outputs": [],
540 |    "source": [
541 |     "from tqdm import tqdm\n",
542 |     "import concurrent.futures\n",
543 |     "\n",
544 |     "errors = []\n",
545 |     "ds = Dataset.from_dict({})\n",
546 |     "\n",
547 |     "\n",
548 |     "def process_chunk(chunk):\n",
549 |     "    add_chunk_to_dataset(generator_client, chunks, chunk, 5, 3, model=gpt4o_deployment)\n",
550 |     "\n",
551 |     "# Create a ThreadPoolExecutor with the desired number of workers\n",
552 |     "with concurrent.futures.ThreadPoolExecutor() as executor:\n",
553 |     "    # Submit the tasks to the executor and store the Future objects\n",
554 |     "    futures = [executor.submit(process_chunk, chunk) for chunk in chunks]\n",
555 |     "\n",
556 |     "    # Use tqdm to create a progress bar\n",
557 |     "    with tqdm(total=len(chunks), desc=\"Processing chunks\") as pbar:\n",
558 |     "        # Iterate over the completed futures as they become available\n",
559 |     "        for future in concurrent.futures.as_completed(futures):\n",
560 |     "            # Get the result of the completed future\n",
561 |     "            result = future.result()\n",
562 |     "            # Update the progress bar\n",
563 |     "            pbar.update(1)\n",
564 |     "\n",
565 |     "# Print any errors that occurred during processing\n",
566 |     "print(f'Number of processing errors: {errors}/{len(chunks)}')"
567 |    ]
568 |   },
569 |   {
570 |    "cell_type": "code",
571 |    "execution_count": null,
572 |    "metadata": {},
573 |    "outputs": [],
574 |    "source": [
575 |     "training_df = ds.to_pandas()\n",
576 |     "\n",
577 |     "print(f'{training_df.shape[0]} rows and {training_df.shape[1]} columns in the training dataset')"
578 |    ]
579 |   },
580 |   {
581 |    "cell_type": "code",
582 |    "execution_count": null,
583 |    "metadata": {},
584 |    "outputs": [],
585 |    "source": [
586 |     "# Previewing the generated data\n",
587 |     "\n",
588 |     "training_df.head(2)"
589 |    ]
590 |   },
591 |   {
592 |    "cell_type": "markdown",
593 |    "metadata": {},
594 |    "source": [
595 |     "**c. Formatting the data in chat format for fine tuning with Azure OpenAI**\n",
596 |     "\n",
597 |     "The conversational chat format is required to fine-tune gpt-4o-mini"
598 |    ]
599 |   },
600 |   {
601 |    "cell_type": "code",
602 |    "execution_count": null,
603 |    "metadata": {},
604 |    "outputs": [],
605 |    "source": [
606 |     "training_df[\"messages\"] = training_df.apply(lambda x: [\n",
607 |     "                                                     {\"role\":\"user\", \"content\":x['instruction']},\n",
608 |     "                                                     {\"role\":\"assistant\", \"content\":x['cot_answer']}\n",
609 |     "                                                     ], axis=1)"
610 |    ]
611 |   },
612 |   {
613 |    "cell_type": "code",
614 |    "execution_count": null,
615 |    "metadata": {},
616 |    "outputs": [],
617 |    "source": [
618 |     "training_df.messages.values[12]"
619 |    ]
620 |   },
621 |   {
622 |    "cell_type": "code",
623 |    "execution_count": null,
624 |    "metadata": {},
625 |    "outputs": [],
626 |    "source": [
627 |     "training_df.dropna(subset=['cot_answer'], inplace=True)"
628 |    ]
629 |   },
630 |   {
631 |    "cell_type": "markdown",
632 |    "metadata": {},
633 |    "source": [
634 |     "### 2. Spitting our data into training and test sets\n",
635 |     "\n",
636 |     "Splitting your data into training, validation and testing sets when fine-tuning a large language model (LLM) is crucial for ensuring the model's performance and generalization capabilities. The training set is used to teach the model, allowing it to learn patterns from the data. The validation set is used to track performance metrics during the training to avoid underfitting / overfitting. However, to objectively evaluate how well the model has learned and to ensure it can generalize to unseen data, a separate testing set is necessary. \n",
637 |     "\n",
638 |     "We will use this test set in order to measure the improvement of performance we get from using RAFT over RAG with gpt-4o-mini"
639 |    ]
640 |   },
641 |   {
642 |    "cell_type": "code",
643 |    "execution_count": null,
644 |    "metadata": {},
645 |    "outputs": [],
646 |    "source": [
647 |     "import numpy as np \n",
648 |     "\n",
649 |     "train_df, validate_df, test_df = np.split(\n",
650 |     "    training_df.sample(frac=1, random_state=42), \n",
651 |     "                       [int(.8*len(training_df)), int(.9*len(training_df))]\n",
652 |     "                       )\n",
653 |     "\n",
654 |     "print(f\"Train: {train_df.shape[0]}, Validate: {validate_df.shape[0]}, Test: {test_df.shape[0]}\")"
655 |    ]
656 |   },
657 |   {
658 |    "cell_type": "code",
659 |    "execution_count": null,
660 |    "metadata": {},
661 |    "outputs": [],
662 |    "source": [
663 |     "\n",
664 |     "if not os.path.exists(\"./data/training_data\"):\n",
665 |     "    os.makedirs(\"./data/training_data\")\n",
666 |     "\n",
667 |     "train_df[['messages']].to_json(\"./data/training_data/banking_train.jsonl\", orient=\"records\", lines=True)\n",
668 |     "test_df.to_json(\"./data/training_data/banking_test.jsonl\", orient=\"records\", lines=True)\n",
669 |     "validate_df[['messages']].to_json(\"./data/training_data/banking_validation.jsonl\", orient=\"records\", lines=True)"
670 |    ]
671 |   },
672 |   {
673 |    "cell_type": "markdown",
674 |    "metadata": {},
675 |    "source": [
676 |     "#### Congrats! We now have a labelled training dataset and a test dataset to evaluate our model's performance. Now go to the [finetuning notebook](./raft_finetuning.ipynb)"
677 |    ]
678 |   }
679 |  ],
680 |  "metadata": {
681 |   "kernelspec": {
682 |    "display_name": "openai",
683 |    "language": "python",
684 |    "name": "python3"
685 |   },
686 |   "language_info": {
687 |    "codemirror_mode": {
688 |     "name": "ipython",
689 |     "version": 3
690 |    },
691 |    "file_extension": ".py",
692 |    "mimetype": "text/x-python",
693 |    "name": "python",
694 |    "nbconvert_exporter": "python",
695 |    "pygments_lexer": "ipython3",
696 |    "version": "3.11.9"
697 |   }
698 |  },
699 |  "nbformat": 4,
700 |  "nbformat_minor": 2
701 | }
702 | 


--------------------------------------------------------------------------------
/2_raft_finetuning.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# 2. RAFT GPT-4o-mini fine tuning\n",
  8 |     "\n",
  9 |     "Now that we've created labelled training data, we can fine tune our model using the Supervised Fine Tuning technique. Azure OpenAI uses LoRA to fine tune models efficiently. **LoRA (Low-Rank Adaptation)** finetuning of a Large Language Model is a technique used to adapt pre-trained language models to specific tasks efficiently and with fewer computational resources.  \n",
 10 |     "\n",
 11 |     " Instead of adjusting all the model parameters, LoRA introduces a small number of additional parameters (low-rank matrices) that modify the model's behavior. These new parameters are trained while keeping the original model's parameters mostly unchanged. This way, the model can learn the new task without the need for extensive computational resources or time.\n",
 12 |     "\n",
 13 |     " Azure OpenAI lets developers customize OpenAI models with their own data and easily deploy their custom model using an easy to use and affordable managed service.\n",
 14 |     "\n",
 15 |     " While Fine Tuning can be a complex process, Azure OpenAI abstracts away a lot of the complexity to make fine tuning accessible to any developer."
 16 |    ]
 17 |   },
 18 |   {
 19 |    "cell_type": "markdown",
 20 |    "metadata": {},
 21 |    "source": [
 22 |     "### 0. How much will this cost?\n",
 23 |     "\n",
 24 |     "Fine tuning pricing on Azure OpenAI makes fine tuning experiment cost very predictable. Training pricing is based on the number of tokens you're training your model on. Therefore, it is very easy to predict and manage the cost of your finetuning experiments.\n",
 25 |     "\n",
 26 |     "For GPT-4o mini, training price is $0.003300 per 1K tokens.\n",
 27 |     "\n",
 28 |     "So to estimate the cost of our fine tuning job we can use the following formula\n",
 29 |     "\n",
 30 |     "`(Training cost per 1K input tokens / 1K) * number of tokens in input file * number of epochs trained`\n",
 31 |     "\n",
 32 |     "**epoch:** a complete iteration through a dataset during the training process of a process, \n",
 33 |     "\n",
 34 |     "1. If the number of epochs is too low: Your model might be underfitted, which means it could perform poorly because it hasn't learned enough from the training data. In essence, it may not have had enough iterations to effectively learn and adjust its parameters (e.g., weights and biases).\n",
 35 |     "\n",
 36 |     "2. If the number of epochs is too high: There's a risk of overfitting, where the model becomes too specialized in the training data and performs poorly on unseen data (examples that weren’t in your training dataset).\n",
 37 |     "\n",
 38 |     "the number of epochs is a parameter of the fine tuning job, usually 3 epochs is a reasonable number\n",
 39 |     "\n",
 40 |     "**Let's explore our dataset and estimate our fine tuning costs.**"
 41 |    ]
 42 |   },
 43 |   {
 44 |    "cell_type": "code",
 45 |    "execution_count": null,
 46 |    "metadata": {},
 47 |    "outputs": [],
 48 |    "source": [
 49 |     "from dotenv import load_dotenv\n",
 50 |     "import os\n",
 51 |     "import tiktoken\n",
 52 |     "import json\n",
 53 |     "from openai import AzureOpenAI\n",
 54 |     "\n",
 55 |     "# loading environment variables\n",
 56 |     "load_dotenv()\n",
 57 |     "aoai_endpoint = os.getenv(\"AOAI_FINETUNING_ENDPOINT\")\n",
 58 |     "aoai_api_key = os.getenv(\"AOAI_FINETUNING_API_KEY\")\n",
 59 |     "\n",
 60 |     "training_file_path = './data/training_data/banking_train.jsonl'\n",
 61 |     "encoding = tiktoken.encoding_for_model(\"gpt-4o-mini\")\n",
 62 |     "\n",
 63 |     "def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):\n",
 64 |     "    num_tokens = 0\n",
 65 |     "    for message in messages:\n",
 66 |     "        num_tokens += tokens_per_message\n",
 67 |     "        for key, value in message.items():\n",
 68 |     "            num_tokens += len(encoding.encode(value))\n",
 69 |     "            if key == \"role\":\n",
 70 |     "                num_tokens += tokens_per_name\n",
 71 |     "    num_tokens += 3\n",
 72 |     "    return num_tokens\n",
 73 |     "\n",
 74 |     "with open(training_file_path, 'r', encoding='utf-8') as f:\n",
 75 |     "    num_tokens=0\n",
 76 |     "    dataset = [json.loads(line) for line in f]\n",
 77 |     "    messages = [ d.get('messages') for d in dataset]\n",
 78 |     "    for message in messages:\n",
 79 |     "        \n",
 80 |     "        num_tokens += num_tokens_from_messages(message)\n",
 81 |     "    \n",
 82 |     "print(f\"Number of tokens in training data: {num_tokens}\")\n",
 83 |     "\n",
 84 |     "training_cost_per_token = 0.003300 / 1000\n",
 85 |     "num_epochs = 3\n",
 86 |     "total_cost = num_tokens * training_cost_per_token * num_epochs\n",
 87 |     "\n",
 88 |     "print(f\"Total estimated cost for training: {total_cost:.2f} USD\")"
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "markdown",
 93 |    "metadata": {},
 94 |    "source": [
 95 |     "### 1. Uploading the training and validation data to Azure OpenAI"
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "code",
100 |    "execution_count": null,
101 |    "metadata": {},
102 |    "outputs": [],
103 |    "source": [
104 |     "\n",
105 |     "client = AzureOpenAI(\n",
106 |     "  azure_endpoint = aoai_endpoint,\n",
107 |     "  api_key = aoai_api_key,\n",
108 |     "  api_version = \"2024-05-01-preview\"  # This API version or later is required to access seed/events/checkpoint features\n",
109 |     ")\n",
110 |     "\n",
111 |     "validation_file_path = './data/training_data/banking_validation.jsonl'\n",
112 |     "\n",
113 |     "# Upload the training and validation dataset files to Azure OpenAI with the SDK.\n",
114 |     "training_response = client.files.create(\n",
115 |     "    file = open(training_file_path, \"rb\"), purpose=\"fine-tune\"\n",
116 |     ")\n",
117 |     "training_file_id = training_response.id\n",
118 |     "\n",
119 |     "validation_response = client.files.create(\n",
120 |     "    file = open(validation_file_path, \"rb\"), purpose=\"fine-tune\"\n",
121 |     ")\n",
122 |     "validation_file_id = validation_response.id\n",
123 |     "\n",
124 |     "print(\"Training file ID:\", training_file_id)\n",
125 |     "print(\"Validation file ID:\", validation_file_id)"
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "markdown",
130 |    "metadata": {},
131 |    "source": [
132 |     "### 2. Creating the fine tuning job\n",
133 |     "\n",
134 |     "For each fine tuning job, you can specify the following hyperparameters. \n",
135 |     "\n",
136 |     "- epochs: An \"epoch\" is a term used to describe one complete pass through the entire training dataset\n",
137 |     "- learning rate multiplier: this will be used as the learning rate for the fine tuning job, as a multiple of the model's original learning rate. We recommend experimenting with values in the range 0.02 to 0.2 to see what produces the best results\n",
138 |     "- batch size:  how many training examples you use at one time during training, common choices are (32, 64, 128, 256). This value is to be tuned based on the size of your data and available compute.\n",
139 |     "\n",
140 |     "The general recommendation is to initially train without specifying any of these, Azure OpenAI will pick a default for you based on dataset size, then adjusting based on results to find the ideal combination"
141 |    ]
142 |   },
143 |   {
144 |    "cell_type": "code",
145 |    "execution_count": null,
146 |    "metadata": {},
147 |    "outputs": [],
148 |    "source": [
149 |     "# Submit fine-tuning training job\n",
150 |     "\n",
151 |     "response = client.fine_tuning.jobs.create(\n",
152 |     "    training_file = training_file_id,\n",
153 |     "    validation_file = validation_file_id,\n",
154 |     "    model = \"gpt-4o-mini\", # Enter base model name. Note that in Azure OpenAI the model name contains dashes and cannot contain dot/period characters.\n",
155 |     "    seed = 105 # seed parameter controls reproducibility of the fine-tuning job. If no seed is specified one will be generated automatically.\n",
156 |     ")\n",
157 |     "\n",
158 |     "job_id = response.id\n",
159 |     "\n",
160 |     "# You can use the job ID to monitor the status of the fine-tuning job.\n",
161 |     "# The fine-tuning job will take some time to start and complete.\n",
162 |     "\n",
163 |     "print(\"Job ID:\", response.id)\n",
164 |     "print(\"Status:\", response.status)\n",
165 |     "print(response.model_dump_json(indent=2))"
166 |    ]
167 |   },
168 |   {
169 |    "cell_type": "markdown",
170 |    "metadata": {},
171 |    "source": [
172 |     "### 3. Monitor the fine tuning job\n",
173 |     "\n",
174 |     "The training time depends on the number of tokens and number of epochs, typically you can expect a job this size to run for a little over an hour and a half. We have already fine-tuned and deployed a model so you can use it directly without waiting for your fine tuning job to complete.\n",
175 |     "\n",
176 |     "[Fine tuned model](https://oai.azure.com/resource/finetune/ftjob-6d1293138cd844e7bab02a141a60c697/details?wsid=/subscriptions/3c791225-4905-4a40-860b-0a0c9cd2af91/resourceGroups/RG-FineTuning-AIGBBWorkshop/providers/Microsoft.CognitiveServices/accounts/aoai-raft-gbb-workshop&tid=604b58b3-fa4e-4a57-b566-cac3f88a3ae8)\n",
177 |     "\n",
178 |     "[Fine tuned model deployment](https://oai.azure.com/resource/deployments/%2Fsubscriptions%2F3c791225-4905-4a40-860b-0a0c9cd2af91%2FresourceGroups%2FRG-FineTuning-AIGBBWorkshop%2Fproviders%2FMicrosoft.CognitiveServices%2Faccounts%2Faoai-raft-gbb-workshop%2Fdeployments%2Fgpt-4o-mini-ft-raft-banking?wsid=/subscriptions/3c791225-4905-4a40-860b-0a0c9cd2af91/resourceGroups/RG-FineTuning-AIGBBWorkshop/providers/Microsoft.CognitiveServices/accounts/aoai-raft-gbb-workshop&tid=604b58b3-fa4e-4a57-b566-cac3f88a3ae8)\n",
179 |     "\n",
180 |     "You can monitor your fine tuning job from this notebook or in the Azure OpenAI's new studio.\n",
181 |     "\n",
182 |     "Go to Tools > Fine-tuning > Click on your job \n",
183 |     "\n",
184 |     "![alt text](./static/ft_monitor.png \"Azure OpenAI Studio Fine tuning job\")"
185 |    ]
186 |   },
187 |   {
188 |    "cell_type": "markdown",
189 |    "metadata": {},
190 |    "source": [
191 |     "**We can also monitor the job from this notebook**"
192 |    ]
193 |   },
194 |   {
195 |    "cell_type": "code",
196 |    "execution_count": null,
197 |    "metadata": {},
198 |    "outputs": [],
199 |    "source": [
200 |     "from IPython.display import clear_output\n",
201 |     "import time\n",
202 |     "\n",
203 |     "start_time = time.time()\n",
204 |     "\n",
205 |     "# Get the status of our fine-tuning job.\n",
206 |     "response = client.fine_tuning.jobs.retrieve(job_id)\n",
207 |     "\n",
208 |     "status = response.status\n",
209 |     "\n",
210 |     "# If the job isn't done yet, poll it every 10 seconds.\n",
211 |     "while status not in [\"succeeded\", \"failed\"]:\n",
212 |     "    time.sleep(10)\n",
213 |     "\n",
214 |     "    response = client.fine_tuning.jobs.retrieve(job_id)\n",
215 |     "    print(response.model_dump_json(indent=2))\n",
216 |     "    print(\"Elapsed time: {} minutes {} seconds\".format(int((time.time() - start_time) // 60), int((time.time() - start_time) % 60)))\n",
217 |     "    status = response.status\n",
218 |     "    print(f'Status: {status}')\n",
219 |     "    clear_output(wait=True)\n",
220 |     "\n",
221 |     "print(f'Fine-tuning job {job_id} finished with status: {status}')\n",
222 |     "\n",
223 |     "# List all fine-tuning jobs for this resource.\n",
224 |     "print('Checking other fine-tune jobs for this resource.')\n",
225 |     "response = client.fine_tuning.jobs.list()\n",
226 |     "print(f'Found {len(response.data)} fine-tune jobs.')"
227 |    ]
228 |   },
229 |   {
230 |    "cell_type": "code",
231 |    "execution_count": null,
232 |    "metadata": {},
233 |    "outputs": [],
234 |    "source": [
235 |     "# Retrieve fine_tuned_model name\n",
236 |     "response = client.fine_tuning.jobs.retrieve(job_id)\n",
237 |     "fine_tuned_model = response.fine_tuned_model"
238 |    ]
239 |   },
240 |   {
241 |    "cell_type": "markdown",
242 |    "metadata": {},
243 |    "source": [
244 |     "### 4. Analyze the fine tuned model in Azure OpenAI Studio\n",
245 |     "\n",
246 |     "Head here for a fine tuned model in the shared AI GBB tenant:\n",
247 |     "[Fine tuning job](https://oai.azure.com/resource/finetune/ftjob-6d1293138cd844e7bab02a141a60c697/details?wsid=/subscriptions/3c791225-4905-4a40-860b-0a0c9cd2af91/resourceGroups/RG-FineTuning-AIGBBWorkshop/providers/Microsoft.CognitiveServices/accounts/aoai-raft-gbb-workshop&tid=604b58b3-fa4e-4a57-b566-cac3f88a3ae8)\n",
248 |     "\n",
249 |     "##### 4.a Training plots\n",
250 |     "\n",
251 |     "When the model is done training, head to your Azure OpenAI Studio to analyze your model training metrics.\n",
252 |     "\n",
253 |     "Two charts are available to analyze your fine tuning job and sanity check that the training went smoothly:\n",
254 |     "- Loss curve: Value of the loss function (how wrong the model is) over time during training process --> this curve should go down over time as the model weights converge towards the optimum. \n",
255 |     "- Token Accuracy: Shows the accuracy of the model's predictions at the token level (e.g., words or subwords) over time during training. A higher token accuracy suggests that the model is better able to capture the nuances of the language and generate more accurate text.\n",
256 |     "\n",
257 |     "Each of these charts has the metrics computed both on the training data and on the validation set. \n",
258 |     "\n",
259 |     "To analyze these plots, one should look for the following:\n",
260 |     "\n",
261 |     "- A smooth curve: A smooth curve indicates that the model is learning consistently. Sharp changes or spikes in the curve could indicate issues with the learning rate or data preprocessing.\n",
262 |     "- Plateau: A plateau in the curve indicates that the model has stopped improving and further training may not be necessary.\n",
263 |     "- Overfitting: If the training loss continues to decrease but the validation loss starts to increase, it could be a sign of overfitting. This means that the model is not generalizing well to new data and may perform poorly on unseen data.\n",
264 |     "- Underfitting: If both the training and validation loss remain high, it could be a sign of underfitting. This means that the model is not learning the patterns in the data well enough and may need a more complex - architecture or more training data.\n",
265 |     "- Optimal stopping point: By analyzing the loss curve and token accuracy plot, one can determine the optimal stopping point for training, where the model has reached its best performance without overfitting.\n",
266 |     "\n",
267 |     "Now head to the studio and ensure your curves look roughly like the below\n",
268 |     "\n",
269 |     "\n",
270 |     "\n",
271 |     "![Alt text](./static/ft_metrics.png \"AOAI training plots\")\n",
272 |     "\n",
273 |     "##### 4.b Model Checkpoints\n",
274 |     "\n",
275 |     "In the Studio, go to the checkpoints tab, you'll see a model checkpoint corresponding to each completed epoch. A checkpoint is a fully functional version of a model which can both be deployed and used as the target model for subsequent fine-tuning jobs. Checkpoints can be particularly useful, as they can provide a snapshot of your model prior to overfitting having occurred. \n",
276 |     "\n",
277 |     "![Alt text](./static/ft_checkpoints.png \"AOAI training plots\")"
278 |    ]
279 |   },
280 |   {
281 |    "cell_type": "markdown",
282 |    "metadata": {},
283 |    "source": [
284 |     "### 5. Create a new deployment with the fine tuned model\n",
285 |     "\n",
286 |     "When the fine-tuning job succeeds, the value of the fine_tuned_model variable in the response body is set to the name of your customized model. Your model is now also available for discovery from the list Models API. However, you can't issue completion calls to your customized model until your customized model is deployed. You must deploy your customized model to make it available for use with completion calls\n",
287 |     "\n",
288 |     "#### 5.a From the notebook\n",
289 |     "To create a new deployment from a notebook, you'll need an access token from Azure, \n",
290 |     "Open a terminal and run:\n",
291 |     "\n",
292 |     "`az login`\n",
293 |     "\n",
294 |     "`az account get-access-token`\n",
295 |     "\n",
296 |     "paste the token in the next cell"
297 |    ]
298 |   },
299 |   {
300 |    "cell_type": "code",
301 |    "execution_count": null,
302 |    "metadata": {},
303 |    "outputs": [],
304 |    "source": [
305 |     "# Deploy fine-tuned model\n",
306 |     "import requests\n",
307 |     "\n",
308 |     "token = \"YOUR TOKEN\"\n",
309 |     "subscription = os.getenv(\"AZURE_SUBSCRIPTION_ID\")\n",
310 |     "resource_group = os.getenv(\"AZURE_RESOURCE_GROUP_NAME\")\n",
311 |     "resource_name = aoai_endpoint.split(\"https://\")[1].split(\".\")[0]\n",
312 |     "model_deployment_name = \"gpt-4o-mini-ft-raft-banking\"\n",
313 |     "\n",
314 |     "deploy_params = {'api-version': \"2023-05-01\"}\n",
315 |     "deploy_headers = {'Authorization': 'Bearer {}'.format(token), 'Content-Type': 'application/json'}\n",
316 |     "\n",
317 |     "deploy_data = {\n",
318 |     "    \"sku\": {\"name\": \"standard\", \"capacity\": 50},\n",
319 |     "    \"properties\": {\n",
320 |     "        \"model\": {\n",
321 |     "            \"format\": \"OpenAI\",\n",
322 |     "            \"name\": fine_tuned_model, #retrieve this value from the previous call, it will look like gpt-35-turbo-0613.ft-b044a9d3cf9c4228b5d393567f693b83\n",
323 |     "            \"version\": \"1\"\n",
324 |     "        }\n",
325 |     "    }\n",
326 |     "}\n",
327 |     "deploy_data = json.dumps(deploy_data)\n",
328 |     "\n",
329 |     "request_url = f'https://management.azure.com/subscriptions/{subscription}/resourceGroups/{resource_group}/providers/Microsoft.CognitiveServices/accounts/{resource_name}/deployments/{model_deployment_name}'\n",
330 |     "\n",
331 |     "print('Creating a new deployment...')\n",
332 |     "\n",
333 |     "r = requests.put(request_url, params=deploy_params, headers=deploy_headers, data=deploy_data)\n",
334 |     "\n",
335 |     "print(r)\n",
336 |     "print(r.reason)\n",
337 |     "print(r.json())"
338 |    ]
339 |   },
340 |   {
341 |    "cell_type": "markdown",
342 |    "metadata": {},
343 |    "source": [
344 |     "#### 5.b From the studio\n",
345 |     "\n",
346 |     "On the fine tuning job page, click 'Deploy'\n",
347 |     "\n",
348 |     "![alt text](./static/ft_deploy.png \"AOAI Deploy model\")"
349 |    ]
350 |   },
351 |   {
352 |    "cell_type": "markdown",
353 |    "metadata": {},
354 |    "source": []
355 |   }
356 |  ],
357 |  "metadata": {
358 |   "kernelspec": {
359 |    "display_name": "openai",
360 |    "language": "python",
361 |    "name": "python3"
362 |   },
363 |   "language_info": {
364 |    "codemirror_mode": {
365 |     "name": "ipython",
366 |     "version": 3
367 |    },
368 |    "file_extension": ".py",
369 |    "mimetype": "text/x-python",
370 |    "name": "python",
371 |    "nbconvert_exporter": "python",
372 |    "pygments_lexer": "ipython3",
373 |    "version": "3.11.9"
374 |   }
375 |  },
376 |  "nbformat": 4,
377 |  "nbformat_minor": 2
378 | }
379 | 


--------------------------------------------------------------------------------
/3_raft_evaluation.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# 3. Model evaluation \n",
  8 |     "\n",
  9 |     "Did our fine tuning actually improve the model's performance on our RAG task? Let's find out!\n",
 10 |     "\n",
 11 |     "In order to measure the efficacy of RAFT, we'll compare a base gpt-4o-mini model with our fine tuned model.\n",
 12 |     "In the first notebook, we generated a test set that's never been used to train the model, we will use the test set to compare our 2 models.\n",
 13 |     "\n",
 14 |     "We will go through the following steps:\n",
 15 |     "1. Load the test set\n",
 16 |     "2. Perform inference with both models on the test set\n",
 17 |     "3. Clean up the models answers to remove the Chain of Thought and only keep the final answer\n",
 18 |     "3. Define evaluation metrics\n",
 19 |     "4. Run evaluation for both models\n",
 20 |     "5. Plot results\n",
 21 |     "\n",
 22 |     "\n",
 23 |     "#### 1. Loading the test set"
 24 |    ]
 25 |   },
 26 |   {
 27 |    "cell_type": "code",
 28 |    "execution_count": null,
 29 |    "metadata": {},
 30 |    "outputs": [],
 31 |    "source": [
 32 |     "import pandas as pd\n",
 33 |     "\n",
 34 |     "test_df = pd.read_json('./data/training_data/banking_test.jsonl', lines=True)\n",
 35 |     "test_df.head(2)"
 36 |    ]
 37 |   },
 38 |   {
 39 |    "cell_type": "markdown",
 40 |    "metadata": {},
 41 |    "source": [
 42 |     "#### 2. Run inference on the test set with both models\n",
 43 |     "\n",
 44 |     "Make sure your `.env` file contains endpoint, api key and deployment name for the baseline model and the fine tuned model. Here we compare the fine tuned gpt-4o-mini with gpt-4o-mini base but you could switch the baseline model to any model"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": null,
 50 |    "metadata": {},
 51 |    "outputs": [],
 52 |    "source": [
 53 |     "from dotenv import load_dotenv\n",
 54 |     "import os\n",
 55 |     "from openai import AzureOpenAI\n",
 56 |     "\n",
 57 |     "\n",
 58 |     "load_dotenv()\n",
 59 |     "# run the base and finetuned models through the dataset\n",
 60 |     "BASELINE_OPENAI_DEPLOYMENT = os.getenv(\"BASELINE_OPENAI_DEPLOYMENT\")\n",
 61 |     "BASELINE_OPENAI_ENDPOINT= os.getenv(\"BASELINE_OPENAI_ENDPOINT\")\n",
 62 |     "BASELINE_OPENAI_KEY= os.getenv(\"BASELINE_OPENAI_KEY\")\n",
 63 |     "\n",
 64 |     "FINETUNED_OPENAI_DEPLOYMENT = os.getenv(\"FINETUNED_OPENAI_DEPLOYMENT\")\n",
 65 |     "FINETUNED_OPENAI_ENDPOINT = os.getenv(\"FINETUNED_OPENAI_ENDPOINT\")\n",
 66 |     "FINETUNED_OPENAI_KEY = os.getenv(\"FINETUNED_OPENAI_KEY\")\n",
 67 |     "\n",
 68 |     "baseline_client = AzureOpenAI(\n",
 69 |     "    azure_endpoint=BASELINE_OPENAI_ENDPOINT, \n",
 70 |     "    api_key=BASELINE_OPENAI_KEY,\n",
 71 |     "    api_version=\"2024-02-01\"\n",
 72 |     "    )\n",
 73 |     "\n",
 74 |     "finetuned_client = AzureOpenAI(\n",
 75 |     "    azure_endpoint=FINETUNED_OPENAI_ENDPOINT, \n",
 76 |     "    api_key=FINETUNED_OPENAI_KEY,\n",
 77 |     "    api_version=\"2024-02-01\"\n",
 78 |     "\n",
 79 |     "    )\n",
 80 |     "\n",
 81 |     "# get the predictions\n",
 82 |     "def get_model_completions(client, prompt, deployment):\n",
 83 |     "    \"\"\"\n",
 84 |     "    This function generates a model completion from a given prompt using the OpenAI API.\n",
 85 |     "\n",
 86 |     "    Parameters:\n",
 87 |     "    client (openai.Client): The AzureOpenAI client being used.\n",
 88 |     "    prompt (str): The prompt to be sent to the model for completion.\n",
 89 |     "    deployment (str): The identifier of the model deployment to be used for completion.\n",
 90 |     "\n",
 91 |     "    Returns:\n",
 92 |     "    str: The completed message content from the model. If an exception occurs during the process, it returns None and prints the exception.\n",
 93 |     "    \"\"\"\n",
 94 |     "\n",
 95 |     "\n",
 96 |     "    messages = [\n",
 97 |     "        {'role':'user','content':prompt}\n",
 98 |     "        ]\n",
 99 |     "    try:\n",
100 |     "        response = client.chat.completions.create(\n",
101 |     "        messages=messages,\n",
102 |     "        model=deployment,\n",
103 |     "        temperature=0.3,\n",
104 |     "    )\n",
105 |     "    \n",
106 |     "        return response.choices[0].message.content\n",
107 |     "\n",
108 |     "    except Exception as e:\n",
109 |     "        print(e)\n",
110 |     "        return None\n",
111 |     "    \n"
112 |    ]
113 |   },
114 |   {
115 |    "cell_type": "code",
116 |    "execution_count": null,
117 |    "metadata": {},
118 |    "outputs": [],
119 |    "source": [
120 |     "\n",
121 |     "from tqdm.notebook import tqdm\n",
122 |     "\n",
123 |     "tqdm.pandas()\n",
124 |     "\n",
125 |     "test_df['baseline_model_response'] = test_df.progress_apply(lambda x: get_model_completions(baseline_client, x.instruction, BASELINE_OPENAI_DEPLOYMENT), axis=1)\n",
126 |     "test_df['finetuned_model_response'] = test_df.progress_apply(lambda x: get_model_completions(finetuned_client, x.instruction, FINETUNED_OPENAI_DEPLOYMENT), axis=1)"
127 |    ]
128 |   },
129 |   {
130 |    "cell_type": "markdown",
131 |    "metadata": {},
132 |    "source": [
133 |     "#### 3. Clean up the model answers\n",
134 |     "\n",
135 |     "Because our fine tuned model has been trained with Chain of Thought answers, we need to clean up the answers to extract the final answer and match the format of the baseline model answers.\n",
136 |     "\n",
137 |     "Similarly, to run evaluation on RAG, we'll need to extract a clean context string with the content of the retrieved documents"
138 |    ]
139 |   },
140 |   {
141 |    "cell_type": "code",
142 |    "execution_count": null,
143 |    "metadata": {},
144 |    "outputs": [],
145 |    "source": [
146 |     "def extract_final_answer(cot_answer: str) -> str:\n",
147 |     "    \"\"\"\n",
148 |     "    Extracts the final answer from the cot_answer field\n",
149 |     "    \"\"\"\n",
150 |     "    if cot_answer:\n",
151 |     "        return cot_answer.split(\"<ANSWER>: \")[-1]\n",
152 |     "    return None\n",
153 |     "\n",
154 |     "def extract_context(instruction: str) -> str:\n",
155 |     "    \"\"\"\n",
156 |     "    Extracts the context from the instruction field.\n",
157 |     "    Keeps all <DOCUMENTS/> and removes the last line with the question.\n",
158 |     "    \"\"\"\n",
159 |     "    return \"\\n\".join(instruction.split(\"\\n\")[:-1])\n",
160 |     "\n",
161 |     "test_df['gold_final_answer'] = test_df.cot_answer.apply(extract_final_answer)\n",
162 |     "test_df.rename(columns={'context':'context_docs'}, inplace=True)\n",
163 |     "test_df['context'] = test_df.instruction.apply(extract_context)\n",
164 |     "test_df['baseline_final_answer'] = test_df.baseline_model_response.apply(extract_final_answer)\n",
165 |     "test_df['finetuned_final_answer'] = test_df.finetuned_model_response.apply(extract_final_answer)\n"
166 |    ]
167 |   },
168 |   {
169 |    "cell_type": "code",
170 |    "execution_count": null,
171 |    "metadata": {},
172 |    "outputs": [],
173 |    "source": [
174 |     "test_df.head(2)"
175 |    ]
176 |   },
177 |   {
178 |    "cell_type": "markdown",
179 |    "metadata": {},
180 |    "source": [
181 |     "#### 4. Define evaluation metrics\n",
182 |     "\n",
183 |     "We'll use RAGAS to evaluate the performance of the models on this RAG task. Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. Ragas offers metrics tailored for evaluating each component of your RAG pipeline.\n",
184 |     "\n",
185 |     "For the scope of this workshop, we are only interested in evaluation the generation part of the pipeline.\n",
186 |     "\n",
187 |     "Ragas provides a few out of the box metrics we can compute, these metrics require either an LLM as a judge or an embedding model:\n",
188 |     "- Answer relevancy: assesses how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information and higher scores indicate better relevancy\n",
189 |     "- Faithfulness: This measures the factual consistency of the generated answer against the given context\n",
190 |     "- Answer similarity: semantic resemblance between the generated answer and the ground truth.\n",
191 |     "- Answer correctness: Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score"
192 |    ]
193 |   },
194 |   {
195 |    "cell_type": "code",
196 |    "execution_count": null,
197 |    "metadata": {},
198 |    "outputs": [],
199 |    "source": [
200 |     "from ragas.metrics import (\n",
201 |     "    answer_relevancy,\n",
202 |     "    faithfulness,\n",
203 |     "    answer_similarity,\n",
204 |     "    answer_correctness\n",
205 |     ")\n",
206 |     "from ragas.metrics.critique import harmfulness\n",
207 |     "\n",
208 |     "# list of metrics we're going to use\n",
209 |     "metrics = [\n",
210 |     "    faithfulness,\n",
211 |     "    answer_relevancy,\n",
212 |     "    answer_similarity,\n",
213 |     "    answer_correctness\n",
214 |     "]"
215 |    ]
216 |   },
217 |   {
218 |    "cell_type": "code",
219 |    "execution_count": null,
220 |    "metadata": {},
221 |    "outputs": [],
222 |    "source": [
223 |     "from langchain_openai.chat_models import AzureChatOpenAI\n",
224 |     "from langchain_openai.embeddings import AzureOpenAIEmbeddings\n",
225 |     "from ragas import evaluate\n",
226 |     "from dotenv import load_dotenv\n",
227 |     "import os\n",
228 |     "\n",
229 |     "load_dotenv()\n",
230 |     "\n",
231 |     "judge_model_endpoint = os.getenv(\"JUDGE_OPENAI_ENDPOINT\")\n",
232 |     "judge_model_api_key = os.getenv(\"JUDGE_OPENAI_API_KEY\")\n",
233 |     "judge_model_deployment = os.getenv(\"JUDGE_OPENAI_DEPLOYMENT\")\n",
234 |     "embedding_model_deployment= os.getenv(\"EMBEDDING_OPENAI_DEPLOYMENT\")\n",
235 |     "\n",
236 |     "azure_model = AzureChatOpenAI(\n",
237 |     "    openai_api_version=\"2024-02-01\",\n",
238 |     "    azure_endpoint=judge_model_endpoint,\n",
239 |     "    azure_deployment=judge_model_deployment,\n",
240 |     "    validate_base_url=False,\n",
241 |     "    api_key=judge_model_api_key,\n",
242 |     ")\n",
243 |     "\n",
244 |     "# init the embeddings for answer_relevancy, answer_correctness and answer_similarity\n",
245 |     "azure_embeddings = AzureOpenAIEmbeddings(\n",
246 |     "    openai_api_version=\"2024-02-01\",\n",
247 |     "    azure_endpoint=judge_model_endpoint,\n",
248 |     "    azure_deployment=embedding_model_deployment,\n",
249 |     "    api_key=judge_model_api_key,\n",
250 |     ")"
251 |    ]
252 |   },
253 |   {
254 |    "cell_type": "code",
255 |    "execution_count": null,
256 |    "metadata": {},
257 |    "outputs": [],
258 |    "source": [
259 |     "from datasets import Dataset\n",
260 |     "\n",
261 |     "baseline_df = test_df[['baseline_final_answer',\n",
262 |     "                      'context',\n",
263 |     "                      'gold_final_answer',\n",
264 |     "                      'question']]\n",
265 |     "\n",
266 |     "baseline_df.rename(columns={'baseline_final_answer':'answer', \n",
267 |     "                            'gold_final_answer':'ground_truth',\n",
268 |     "                            'context':'contexts'}, inplace=True)\n",
269 |     "#baseline_df['ground_truth'] = baseline_df['ground_truth'].apply(lambda x: [x] if x else [])\n",
270 |     "baseline_df['contexts'] = baseline_df['contexts'].apply(lambda x: [x] if x else [])\n",
271 |     "\n",
272 |     "dataset = Dataset.from_pandas(baseline_df)\n"
273 |    ]
274 |   },
275 |   {
276 |    "cell_type": "markdown",
277 |    "metadata": {},
278 |    "source": [
279 |     "#### 5. Computing the evaluation metrics for both models"
280 |    ]
281 |   },
282 |   {
283 |    "cell_type": "code",
284 |    "execution_count": null,
285 |    "metadata": {},
286 |    "outputs": [],
287 |    "source": [
288 |     "baseline_result = evaluate(\n",
289 |     "    dataset, metrics=metrics, llm=azure_model, embeddings=azure_embeddings\n",
290 |     ")\n",
291 |     "\n",
292 |     "baseline_result"
293 |    ]
294 |   },
295 |   {
296 |    "cell_type": "code",
297 |    "execution_count": null,
298 |    "metadata": {},
299 |    "outputs": [],
300 |    "source": [
301 |     "finetuned_df = test_df[['finetuned_final_answer',\n",
302 |     "                      'context',\n",
303 |     "                      'gold_final_answer',\n",
304 |     "                      'question']]\n",
305 |     "\n",
306 |     "finetuned_df.rename(columns={'finetuned_final_answer':'answer', \n",
307 |     "                            'gold_final_answer':'ground_truth',\n",
308 |     "                            'context':'contexts'}, inplace=True)\n",
309 |     "#baseline_df['ground_truth'] = baseline_df['ground_truth'].apply(lambda x: [x] if x else [])\n",
310 |     "finetuned_df['contexts'] = finetuned_df['contexts'].apply(lambda x: [x] if x else [])\n",
311 |     "\n",
312 |     "ft_dataset = Dataset.from_pandas(finetuned_df)\n",
313 |     "\n",
314 |     "ft_result = evaluate(\n",
315 |     "    ft_dataset, metrics=metrics, llm=azure_model, embeddings=azure_embeddings\n",
316 |     ")\n",
317 |     "\n",
318 |     "ft_result"
319 |    ]
320 |   },
321 |   {
322 |    "cell_type": "code",
323 |    "execution_count": null,
324 |    "metadata": {},
325 |    "outputs": [],
326 |    "source": [
327 |     "\n",
328 |     "baseline_dict = dict(baseline_result)\n",
329 |     "ft_dict = dict(ft_result)\n",
330 |     "\n",
331 |     "ft_dict['model']=os.getenv(\"FINETUNED_OPENAI_DEPLOYMENT\")\n",
332 |     "baseline_dict['model']=os.getenv(\"BASELINE_OPENAI_DEPLOYMENT\")\n",
333 |     "\n",
334 |     "results_df = pd.DataFrame([baseline_dict, ft_dict])"
335 |    ]
336 |   },
337 |   {
338 |    "cell_type": "markdown",
339 |    "metadata": {},
340 |    "source": [
341 |     "#### 6. Plotting the side-by-side comparison of the models"
342 |    ]
343 |   },
344 |   {
345 |    "cell_type": "code",
346 |    "execution_count": null,
347 |    "metadata": {},
348 |    "outputs": [],
349 |    "source": [
350 |     "import pandas as pd\n",
351 |     "import matplotlib.pyplot as plt\n",
352 |     "\n",
353 |     "# Assuming you have your results_df DataFrame\n",
354 |     "\n",
355 |     "# Reshape the DataFrame\n",
356 |     "melted_df = results_df.melt(id_vars='model', var_name='metric', value_name='value')\n",
357 |     "melted_df['value'] = melted_df['value'].round(2)\n",
358 |     "\n",
359 |     "# Create the bar plot\n",
360 |     "pivoted_data = melted_df.pivot_table(index='metric', columns='model', values='value')\n",
361 |     "ax = pivoted_data.plot(kind='bar', figsize=(10, 6))\n",
362 |     "\n",
363 |     "# Add value labels on top of the bars\n",
364 |     "for container in ax.containers:\n",
365 |     "    ax.bar_label(container)\n",
366 |     "\n",
367 |     "plt.ylabel('Metric Value')\n",
368 |     "plt.title('Model Comparison by Metric')\n",
369 |     "plt.show()"
370 |    ]
371 |   },
372 |   {
373 |    "cell_type": "code",
374 |    "execution_count": null,
375 |    "metadata": {},
376 |    "outputs": [],
377 |    "source": []
378 |   }
379 |  ],
380 |  "metadata": {
381 |   "kernelspec": {
382 |    "display_name": "openai",
383 |    "language": "python",
384 |    "name": "python3"
385 |   },
386 |   "language_info": {
387 |    "codemirror_mode": {
388 |     "name": "ipython",
389 |     "version": 3
390 |    },
391 |    "file_extension": ".py",
392 |    "mimetype": "text/x-python",
393 |    "name": "python",
394 |    "nbconvert_exporter": "python",
395 |    "pygments_lexer": "ipython3",
396 |    "version": "3.11.8"
397 |   }
398 |  },
399 |  "nbformat": 4,
400 |  "nbformat_minor": 2
401 | }
402 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## RAFT Fine-Tuning for RAG Using Azure OpenAI GBB Workshop
 2 | 
 3 | This repository provides a step-by-step guide on how to implement Retrieval Augmented Fine Tuning using Azure OpenAI. The repository is meant to be educational rather than a production ready implementation of RAFT. 
 4 | 
 5 | **Scenario** You are a developer at a bank working on a conversational assistant capable of answering questions about the bank's accounts and online tools. You have already tried Prompt Engineering and RAG but the business stakeholders still complain about incorrect answers and a general lack of domain adaptation. You suggest trying out RAFT as a potential solution and want to validate the efficacy of such a technique for this use case.
 6 | 
 7 | ###  RAFT: Adapting Language Model to Domain Specific RAG 📚
 8 | Retrieval Aware Fine-Tuning (RAFT), presents a novel recipe to prepare fine-tuning data to tailor the models for domain-specific open-book setting, equivalent to in-domain RAG. In RAFT, we prepare the training data such that each data point contains a question ( Q ), a set of documents (Dk), and a corresponding Chain-of-though style answer (A*) generated from one of the document (D*). We differentiate between two types of documents: oracle documents (D*) i.e. the documents from which the answer to the question can be deduced, and `distractor' documents (Di) that do not contain answer-relevant information. As an implementation detail, the oracle document doesn't need to be a single document, but can be more than one document, as is the case in HotpotQA. Then, for $P$ fraction of the questions (qi) in the dataset, we retain the oracle document (di*) along with distractor documents (dk-1). For (1-P) fraction of the questions (qi) in the dataset, we include no oracle document and only include distractor documents (dk). We then fine-tune the language model using standard supervised training (SFT) technique, training it to generate answers from the provided documents and question. The figure below illustrates the high-level design principal for RAFT.
 9 | 
10 | We demonstrate that our RAG approach trains the model to perform better RAG on the set of documents it is trained on i.e., in-domain. By removing the oracle documents in some instances of the training data, we are compelling the model to memorize domain-knowledge. The training data for RAFT is as follows, and an example training data can be seen in the figure below:
11 | 
12 | ![alt text](static/raft.png)
13 | 
14 | Paper: https://arxiv.org/abs/2403.10131
15 | 
16 | ### Getting Started
17 | 
18 | By design, this workshop only uses **Azure OpenAI service**. 
19 | 
20 | 1. **Deploy required models**
21 | For the workshop, you'll need GPT-4o, ada-002 embedding as well as GPT-4o-mini, make sure you deploy them and have enough capacity (approximately 100K TPM).
22 | 2. **Create and populate your credentials**
23 | Open the [sample.env](./sample.env) file, copy its content into a new `.env` file and fill the required credentials. Note that the FINETUNED model credentials can only be populated after running through notebook 2.
24 | 3. **Install dependencies**
25 | 
26 | **Install system dependencies:**
27 | 
28 | - on Linux run `sudo apt-get install -y poppler-utils`
29 | - on Mac run `brew install poppler`
30 | - On Windows run `conda install -c conda-forge poppler`
31 | 
32 | **Install Python dependencies:**
33 | ```shell
34 | conda create -n raft python=3.11
35 | 
36 | conda activate raft
37 | 
38 | pip install -r requirements.txt
39 | ```
40 | 4. Go through the 3 notebooks in order, execute all the cells in order 
41 | 
42 | 
43 | 
44 | 


--------------------------------------------------------------------------------
/data/better_banking_guide_en.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure-Samples/azure-openai-raft/2e8dfa1e2b5aafc4e895c0e7450b4132e46e09c3/data/better_banking_guide_en.pdf


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | datasets==2.19.1
 2 | langchain-core==0.2.30
 3 | langchain-experimental==0.0.62
 4 | langchain-text-splitters==0.2.2
 5 | openai==1.40.6
 6 | ragas==0.1.13
 7 | pdf2image==1.17.0
 8 | tqdm==4.66.2
 9 | pandas==2.2.1
10 | numpy==1.26.4
11 | matplotlib==3.8.3
12 | tiktoken==0.7.0
13 | python-dotenv
14 | ipykernel                                
15 | ipython                                  
16 | ipywidgets
17 | jupyterlab_widgets


--------------------------------------------------------------------------------
/sample.env:
--------------------------------------------------------------------------------
 1 | AOAI_GPT4o_API_KEY = ""
 2 | AOAI_GPT4o_ENDPOINT = "https://YOUR_RESOURCE_NAME.openai.azure.com"
 3 | AOAI_GPT4o_DEPLOYMENT = ""
 4 | 
 5 | AOAI_FINETUNING_ENDPOINT = "https://YOUR_RESOURCE_NAME.openai.azure.com"
 6 | AOAI_FINETUNING_API_KEY = ""
 7 | 
 8 | EMBEDDING_OPENAI_DEPLOYMENT = ""
 9 | 
10 | BASELINE_OPENAI_DEPLOYMENT = ""
11 | BASELINE_OPENAI_ENDPOINT = "https://YOUR_RESOURCE_NAME.openai.azure.com"
12 | BASELINE_OPENAI_KEY = ""
13 | 
14 | FINETUNED_OPENAI_DEPLOYMENT = "gpt-4o-mini-ft-raft-banking"
15 | FINETUNED_OPENAI_ENDPOINT = "https://YOUR_RESOURCE_NAME.openai.azure.com"
16 | FINETUNED_OPENAI_KEY = ""
17 | 
18 | JUDGE_OPENAI_DEPLOYMENT = ""
19 | JUDGE_OPENAI_ENDPOINT = "https://YOUR_RESOURCE_NAME.openai.azure.com"
20 | JUDGE_OPENAI_API_KEY = ""
21 | 
22 | AZURE_SUBSCRIPTION_ID= "YOUR_SUBSCRIPTION_ID"
23 | AZURE_RESOURCE_GROUP_NAME="YOUR AZURE_RESOURCE_GROUP_NAME FOR THE AOAI RESOURCE"


--------------------------------------------------------------------------------
/static/ft_checkpoints.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure-Samples/azure-openai-raft/2e8dfa1e2b5aafc4e895c0e7450b4132e46e09c3/static/ft_checkpoints.png


--------------------------------------------------------------------------------
/static/ft_deploy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure-Samples/azure-openai-raft/2e8dfa1e2b5aafc4e895c0e7450b4132e46e09c3/static/ft_deploy.png


--------------------------------------------------------------------------------
/static/ft_metrics.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure-Samples/azure-openai-raft/2e8dfa1e2b5aafc4e895c0e7450b4132e46e09c3/static/ft_metrics.png


--------------------------------------------------------------------------------
/static/ft_monitor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure-Samples/azure-openai-raft/2e8dfa1e2b5aafc4e895c0e7450b4132e46e09c3/static/ft_monitor.png


--------------------------------------------------------------------------------
/static/raft.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure-Samples/azure-openai-raft/2e8dfa1e2b5aafc4e895c0e7450b4132e46e09c3/static/raft.png


--------------------------------------------------------------------------------