├── lab06
│   ├── paper0.pdf
│   ├── paper1.pdf
│   ├── paper2.pdf
│   ├── paper3.pdf
│   ├── paper4.pdf
│   ├── paper5.pdf
│   ├── paper6.pdf
│   ├── paper7.pdf
│   ├── paper8.pdf
│   ├── paper9.pdf
│   ├── example2.pdf
│   └── text-06-prompt-engineering-rag.ipynb
├── lab02
│   ├── images
│   │   ├── decoder.png
│   │   ├── encoder.png
│   │   ├── sample_dataset.png
│   │   └── transformer_encoder.png
│   └── README.md
├── lab01
│   ├── backpropagation.png
│   └── computational_graph.png
├── lab07
│   ├── architecture_diagram.png
│   └── stories.csv
├── .gitignore
├── lab05
│   ├── README.md
│   ├── text-02-LoRA.ipynb
│   ├── text-01-quantization.ipynb
│   └── solution-01-quantization.ipynb
├── lab04
│   ├── README.md
│   ├── text-01-clm.ipynb
│   └── text-02-llms.ipynb
├── lab03
│   └── README.md
├── README.md
├── lab10
│   ├── function_01.py
│   └── text-10-test-generation.ipynb
├── lab09
│   ├── test_cases_01.py
│   └── text-09-code-generation.ipynb
└── lab08
    └── text-08-roc-router-design.ipynb
/lab06/paper0.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab06/paper0.pdf
--------------------------------------------------------------------------------
/lab06/paper1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab06/paper1.pdf
--------------------------------------------------------------------------------
/lab06/paper2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab06/paper2.pdf
--------------------------------------------------------------------------------
/lab06/paper3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab06/paper3.pdf
--------------------------------------------------------------------------------
/lab06/paper4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab06/paper4.pdf
--------------------------------------------------------------------------------
/lab06/paper5.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab06/paper5.pdf
--------------------------------------------------------------------------------
/lab06/paper6.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab06/paper6.pdf
--------------------------------------------------------------------------------
/lab06/paper7.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab06/paper7.pdf
--------------------------------------------------------------------------------
/lab06/paper8.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab06/paper8.pdf
--------------------------------------------------------------------------------
/lab06/paper9.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab06/paper9.pdf
--------------------------------------------------------------------------------
/lab06/example2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab06/example2.pdf
--------------------------------------------------------------------------------
/lab02/images/decoder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab02/images/decoder.png
--------------------------------------------------------------------------------
/lab02/images/encoder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab02/images/encoder.png
--------------------------------------------------------------------------------
/lab01/backpropagation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab01/backpropagation.png
--------------------------------------------------------------------------------
/lab01/computational_graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab01/computational_graph.png
--------------------------------------------------------------------------------
/lab02/images/sample_dataset.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab02/images/sample_dataset.png
--------------------------------------------------------------------------------
/lab07/architecture_diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab07/architecture_diagram.png
--------------------------------------------------------------------------------
/lab02/images/transformer_encoder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbdmg/llm/HEAD/lab02/images/transformer_encoder.png
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | results
2 | gpt2-medical-finetuned
3 | hf_token.txt
4 | model_answers.json
5 | augmented_model.onnx
6 | static_quantized_model
--------------------------------------------------------------------------------
/lab02/README.md:
--------------------------------------------------------------------------------
1 | ## Lab 02
2 |
3 | The second lab is divided into two parts:
4 | - "T5": we use the T5 encoder-decoder model to generate translations. This part will show how to use HuggingFace tokenizers and models. We will peek into the model's attention weights to understand what's going on (a minimal T5 usage sketch is included at the end of this README).
5 | - "Attention": we move away from the field of NLP and focus on the attention mechanism itself. We will predict the next value in a time series. The attention mechanism will be used to learn recurring patterns within the time series.
6 |
7 | The following are the files you will use for this lab:
8 | - T5 exercise ([text](./text-01-t5.ipynb)) ([solution](./solution-01-t5.ipynb))
9 | - Attention exercise ([text](./text-02-attention.ipynb)) ([solution](./solution-02-attention.ipynb))
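10 | 
11 | As a quick orientation (an illustrative sketch, not one of the exercise notebooks; the checkpoint and the prompt are arbitrary choices), translating a sentence with a pre-trained T5 model through HuggingFace looks roughly like this:
12 | 
13 | ```python
14 | from transformers import T5ForConditionalGeneration, T5Tokenizer
15 | 
16 | # Load an illustrative small T5 checkpoint and its tokenizer
17 | tokenizer = T5Tokenizer.from_pretrained("t5-small")
18 | model = T5ForConditionalGeneration.from_pretrained("t5-small")
19 | 
20 | # T5 is prompted with a task prefix; here we ask for an English-to-German translation
21 | inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
22 | outputs = model.generate(**inputs, max_new_tokens=40)
23 | print(tokenizer.decode(outputs[0], skip_special_tokens=True))
24 | ```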
--------------------------------------------------------------------------------
/lab05/README.md:
--------------------------------------------------------------------------------
1 | ## Lab 05
2 |
3 | The fifth lab is divided into two parts:
4 | - quantization: in this part we see how quantization works, by first implementing absmax (symmetric) and minmax (asymmetric) quantization, and then using dynamic quantization to quantize a model (post-training). A minimal absmax sketch is included at the end of this README.
5 |
6 | - LoRA: this part focuses on applying LoRA to a BERT model, with the goal of fine-tuning it while updating far fewer parameters. We first implement LoRA ourselves. Next, we use the HF PEFT library to apply LoRA directly.
7 |
8 | The following are the files you will use for this lab:
9 | - quantization exercise ([text](./text-01-quantization.ipynb)) ([solution](./solution-01-quantization.ipynb))
10 | - LoRA exercise ([text](./text-02-LoRA.ipynb)) ([solution](./solution-02-LoRA.ipynb))
11 |
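12 | To make the quantization part more concrete, here is a minimal, illustrative sketch of absmax (symmetric) quantization of a tensor to int8 (the function names are our own and do not come from the exercise notebooks):
13 | 
14 | ```python
15 | import torch
16 | 
17 | def absmax_quantize(x: torch.Tensor):
18 |     # Symmetric quantization: scale by the largest absolute value so values map into [-127, 127]
19 |     scale = 127 / torch.max(torch.abs(x))
20 |     x_quant = torch.round(scale * x).to(torch.int8)
21 |     return x_quant, scale
22 | 
23 | def absmax_dequantize(x_quant: torch.Tensor, scale: torch.Tensor):
24 |     # Recover a float approximation of the original tensor
25 |     return x_quant.to(torch.float32) / scale
26 | 
27 | w = torch.randn(4, 4)
28 | w_q, scale = absmax_quantize(w)
29 | print((w - absmax_dequantize(w_q, scale)).abs().max())  # quantization error
30 | ```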
--------------------------------------------------------------------------------
/lab04/README.md:
--------------------------------------------------------------------------------
1 | ## Lab 04
2 |
3 | The fourth lab is divided into two parts:
4 | - CLM: this part uses causal language modeling (CLM) to generate new sequences of tokens. You will use GPT-2 to generate new sequences of tokens. In particular, you will first fine-tune the model on a medical text dataset. Then, you will use the fine-tuned model to generate new sequences of tokens that are (somewhat) coherent with the medical domain.
5 |
6 | - LLMs: this part focuses on using various LLMs (Llama 3.2 and Mistral v0.2). You will see how instruction-tuned versions of those models can be used to get coherent answers to various questions (a minimal chat-template sketch is included at the end of this README).
7 |
8 | The following are the files you will use for this lab:
9 | - CLM exercise ([text](./text-01-clm.ipynb)) ([solution](./solution-01-clm.ipynb))
10 | - LLMs exercise ([text](./text-02-llms.ipynb)) ([solution](./solution-02-llms.ipynb))
11 |
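12 | As a minimal, illustrative sketch of querying an instruction-tuned model (the checkpoint below is an arbitrary example and is gated, so it requires a HuggingFace token; any chat model from the Hub works similarly):
13 | 
14 | ```python
15 | from transformers import AutoModelForCausalLM, AutoTokenizer
16 | 
17 | model_id = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative, gated checkpoint
18 | tokenizer = AutoTokenizer.from_pretrained(model_id)
19 | model = AutoModelForCausalLM.from_pretrained(model_id)
20 | 
21 | # Build the prompt with the model's chat template, then generate an answer
22 | messages = [{"role": "user", "content": "In one sentence, what is causal language modeling?"}]
23 | input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
24 | output = model.generate(input_ids, max_new_tokens=60)
25 | print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
26 | ```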
--------------------------------------------------------------------------------
/lab03/README.md:
--------------------------------------------------------------------------------
1 | ## Lab 03
2 |
3 | The third lab is divided into two parts:
4 | - BERT: in this part, you will use BERT, an encoder-only model that is often adopted to solve downstream tasks (e.g., sentence classification). You will explore its architecture, observe how the attention mechanism works in encoder-only models, use it on its original pre-training tasks (masked LM, Next Sentence Prediction) and, finally, fine-tune it to solve a new task (a minimal pipeline example for both models is sketched at the end of this README).
5 | - GPT-2: in this part, you will use GPT-2, a decoder-only model used for next token prediction. You will explore its architecture and masked self-attention (typical of decoders, to preserve causality). You will then use GPT-2 to generate new sequences of tokens, with various sampling policies.
6 |
7 | The following are the files you will use for this lab:
8 | - BERT exercise ([text](./text-01-bert.ipynb)) ([solution](./solution-01-bert.ipynb))
9 | - GPT-2 exercise ([text](./text-02-gpt2.ipynb)) ([solution](./solution-02-gpt2.ipynb))
10 |
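11 | As a quick, self-contained illustration (not one of the exercise notebooks; the prompts are arbitrary), both models can be tried out through HuggingFace pipelines:
12 | 
13 | ```python
14 | from transformers import pipeline
15 | 
16 | # BERT on its original masked language modeling objective
17 | fill_mask = pipeline("fill-mask", model="bert-base-uncased")
18 | for pred in fill_mask("The capital of France is [MASK]."):
19 |     print(pred["token_str"], round(pred["score"], 3))
20 | 
21 | # GPT-2 next-token generation with top-k sampling instead of greedy decoding
22 | generator = pipeline("text-generation", model="gpt2")
23 | print(generator("Deep learning is", max_new_tokens=20, do_sample=True, top_k=50)[0]["generated_text"])
24 | ```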
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Large Language Models for Software Engineering
2 |
3 | This is the repository for the *Large Language Models for Software Engineering* course at Politecnico di Torino.
4 |
5 |
6 | ### Course information
7 |
8 | - **A.Y.** 2025/26 (1st semester)
9 | - **CFU:** 6
10 | - **Lecturer:** Riccardo Coppola
11 | - **Co-lecturer:** Flavio Giobergia
12 | - **[LLM course website](https://dbdmg.polito.it/dbdmg_web/2025/large-language-models-for-software-engineering-2025-26/)**
13 |
14 |
15 | ### Part I - Foundations of Large Language Models
16 |
17 | - Lab 01: Introduction to deep learning with PyTorch
18 | - Lab 02: Using transformers (T5) with the HuggingFace suite (tokenizers, models), attention on time series data
19 | - Lab 03: Analyzing Encoder-only models (BERT) and Decoder-only models (GPT)
20 | - Lab 04: Causal Language Modeling (CLM) and LLMs
21 | - Lab 05: Efficient fine-tuning and inference in LLMs
22 |
23 | The labs of Part I have been curated by Claudio Savelli and Flavio Giobergia.
24 |
25 | ### Part II - Applications of LLMs to Software Engineering
26 |
27 | - Lab 06: Prompt Engineering and RAG with Llama
28 | - Lab 07: User Stories and Chain Architectures
29 | - Lab 08: ROC evaluations and Router Agents
30 |
31 | The labs of Part II have been curated by Riccardo Coppola and Tommaso Fulcini.
32 |
--------------------------------------------------------------------------------
/lab10/function_01.py:
--------------------------------------------------------------------------------
1 | def racer_disqualified(times, winner_times, n_penalties, penalties):
2 | """
3 | Determines if a racer is disqualified based on their times, penalties, and winner times.
4 |
5 | Parameters:
6 | times (list of int): List of the racer's times for three events.
7 | winner_times (list of int): List of winner times for the same three events.
8 | n_penalties (int): Number of penalties the racer incurred.
9 | penalties (list of int): List of penalty values.
10 |
11 | Returns:
12 | bool: True if the racer is disqualified, False otherwise.
13 |
14 | Raises:
15 | ValueError: If inputs do not meet the required types or constraints.
16 | """
17 | # Input validation
18 | if not (isinstance(times, list) and len(times) == 3 and all(isinstance(t, int) for t in times)):
19 | raise ValueError("times must be a list of three integers.")
20 |
21 | if not (isinstance(winner_times, list) and len(winner_times) == 3 and all(isinstance(wt, int) for wt in winner_times)):
22 | raise ValueError("winner_times must be a list of three integers.")
23 |
24 | if not isinstance(n_penalties, int):
25 | raise ValueError("n_penalties must be an integer.")
26 |
27 | if not (isinstance(penalties, list) and all(isinstance(p, int) for p in penalties)):
28 | raise ValueError("penalties must be a list of integers.")
29 |
30 | if n_penalties != len(penalties):
31 | raise ValueError("n_penalties must match the length of the penalties list.")
32 |
33 | disqualified = False
34 | tot_penalties = 0
35 |
36 | # Calculate total penalties and check for any excessive penalty
37 | for penalty in penalties:
38 | tot_penalties += penalty
39 | if penalty > 100:
40 | disqualified = True
41 |
42 | # Check for disqualification based on total penalties or number of penalties
43 | if tot_penalties > 100 or n_penalties > 5:
44 | disqualified = True
45 |
46 | # Check if any time exceeds 1.5 times the corresponding winner time
47 | for i in range(3):
48 | max_time = winner_times[i] * 1.5
49 | if times[i] > max_time:
50 | disqualified = True
51 |
52 |     return disqualified
--------------------------------------------------------------------------------
/lab09/test_cases_01.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | import ipytest
3 |
4 | from function_01 import compute_average
5 |
6 |
7 | def test_valid_grades():
8 | grades = [18, 25, 30, 33, 22, 28]
9 | result = compute_average(grades)
10 | assert result == pytest.approx(26.25, rel=1e-2)
11 |
12 | def test_valid_grades_with_30laude():
13 | grades = [20, 25, 30, 33, 29, 28]
14 | result = compute_average(grades)
15 | assert result == pytest.approx(27.75, rel=1e-2)
16 |
17 | # Boundary test cases
18 | def test_lowest_and_highest_valid_grades():
19 | grades = [18, 19, 20, 30, 33, 25]
20 | result = compute_average(grades)
21 | assert result == pytest.approx(23.5, rel=1e-2)
22 |
23 | def test_all_30laude():
24 | grades = [33, 33, 33, 33, 33, 33]
25 | result = compute_average(grades)
26 | assert result == 33.0
27 |
28 | # Invalid test cases
29 | def test_more_than_six_grades():
30 | grades = [18, 25, 30, 33, 22, 28, 29]
31 | with pytest.raises(ValueError):
32 | compute_average(grades)
33 |
34 | def test_less_than_six_grades():
35 | grades = [18, 25, 30, 33, 22]
36 | with pytest.raises(ValueError):
37 | compute_average(grades)
38 |
39 | def test_invalid_grade_too_low():
40 | grades = [17, 25, 30, 33, 22, 28]
41 | with pytest.raises(ValueError):
42 | compute_average(grades)
43 |
44 | def test_invalid_grade_too_high():
45 | grades = [18, 25, 30, 34, 22, 28]
46 | with pytest.raises(ValueError):
47 | compute_average(grades)
48 |
49 | def test_all_grades_are_the_same():
50 | grades = [25, 25, 25, 25, 25, 25]
51 | result = compute_average(grades)
52 | assert result == 25.0
53 |
54 | # Test cases with the lowest and highest grade removed
55 | def test_all_grades_except_highest_and_lowest():
56 | grades = [18, 30, 28, 25, 22, 33]
57 | result = compute_average(grades)
58 | assert result == pytest.approx(26.25, rel=1e-2)
59 |
60 | # Invalid test cases for non-numeric inputs
61 | def test_non_numeric_input_string():
62 | grades = ["a", 25, 30, 33, 22, 28]
63 | with pytest.raises(TypeError):
64 | compute_average(grades)
65 |
66 | def test_non_numeric_input_none():
67 | grades = [None, 25, 30, 33, 22, 28]
68 | with pytest.raises(TypeError):
69 | compute_average(grades)
70 |
71 | def test_non_numeric_input_boolean():
72 | grades = [True, 25, 30, 33, 22, 28]
73 | with pytest.raises(ValueError):
74 | compute_average(grades)
75 |
76 | def test_non_numeric_input_mixed():
77 | grades = [18, "25", 30, 33, 22, 28]
78 | with pytest.raises(TypeError):
79 | compute_average(grades)
--------------------------------------------------------------------------------
/lab07/stories.csv:
--------------------------------------------------------------------------------
1 | Issue-id,Type,BusinessValue,Summary,Description
2 | HT-1,User Story,200,Browse hikes,As a visitor I want to see the list (with filtering) of available hikes So that I can get information on them
3 | HT-2,User Story,198,Describe hikes,As a local guide I want to add a hike description So that users can look at it
4 | HT-3,User Story,196,Register,As a visitor I want to register to the platform So that I can use its advanced services
5 | HT-4,User Story,190,See hikes' details,As a hiker I want to see the full list of hikes So that I can get information (including tracks) on them
6 | HT-17,User Story,180,Start hike,As a hiker I want to start a registered hike So that I can record an ongoing hike
7 | HT-18,User Story,179,Terminate hike,As a hiker I want to terminate a hike So that the hike is added to my completed hikes
8 | HT-34,User Story,178,Completed hikes,As hiker I want to access the list of hikes I completed
9 | HT-5,User Story,164,Describe hut,As a local guide I want to insert a hut description
10 | HT-6,User Story,162,Describe parking,As a local guide I want to add a parking lot
11 | HT-7,User Story,160,Search hut,As a hiker I want to search for hut description
12 | HT-8,User Story,158,Link start/arrival,As a local guide I want to add parking lots and huts as start/arrivals points for hikes
13 | HT-9,User Story,156,Link hut,As a local guide I want to link a hut to a hike So that hikers can better plan their hike
14 | HT-33,User Story,154,Define reference points,As a local guide I want to define reference points for a hike I added So that hikers can be tracked
15 | HT-19,User Story,150,Record point,As a hiker I want to record reaching a reference point of an on-going hike So that I can track my progress on the hike
16 | HT-35,User Story,149,Performance stats,As a hiker I want to see my stats based on the completed hikes
17 | HT-31,User Story,145,Register local guide,As a local guide I want to register To be able to access reserved features
18 | HT-32,User Story,144,Validate local guide,As a platform manager I want to validate a local guide registration So that they can access specific features
19 | HT-27,User Story,141,New weather alert,As a platform manager I want to enter a weather alert for a given area So that hikers can be warned
20 | HT-29,User Story,140,Weather alert notification,As a hiker I want to receive a weather alert notification in my current hiking area So that I am warned
21 | HT-14,User Story,138,Update hike condition,As a hut worker I want to update the condition of a hike linked to the hut So that prospective hikers are informed
22 | HT-30,User Story,137,Modify hike description,As a local guide I want to modify and delete hikes I added
23 | HT-15,User Story,136,Hut info,As a hut worker I want to add information on the hut So that hikers can better plan their hike
24 | HT-12,User Story,132,Hut worker sign-up,As a hut worker I want to request a user login So that I can operate on the platform
25 | HT-13,User Story,131,Verify hut-worker,As a platform manager I want to verify hut worker users So that they can operate on the platform
26 | HT-10,User Story,130,Set profile,As a hiker I want to record my performance parameters So that I can get personalised recommendations
27 | HT-11,User Story,128,Filter hikes,As a hiker I want to filter the list of hikes based on my profile So that I can see them based on certain characteristics
28 | HT-16,User Story,110,Hut pictures,As a hut worker I want to add photos of the hut So that hikers can better plan their hike
29 | HT-20,User Story,104,Broadcasting URL,As a hiker I want to get the broadcasting URL for my hike So that I can share it with my friends to let them follow me
30 | HT-21,User Story,102,Monitor hike,As a friend (of a hiker) I want to monitor the progress of my hiker friend
31 | HT-22,User Story,100,Unfinished hike,As a hiker I want to be notified about an unfinished hike So that I can terminate it
32 | HT-23,User Story,99,Plan hike,As a hiker I want to register a planned hike So that it can be later traced
33 | HT-24,User Story,98,Add group,As a hiker I want to add an hiker to a group for a planned hike So that we can hike together
34 | HT-25,User Story,98,Confirm group,As a hiker I want to be able to confirm group participation To be part of a group hike
35 | HT-26,User Story,97,Confirm buddy end,As a hiker I want to notify that someone of my group has completed their hike So that the group hike track is updated
36 | HT-28,User Story,94,Notify buddy late,As a hiker I want to be notified if someone of my group has not completed the hike yet So that I can verify her presence
37 |
--------------------------------------------------------------------------------
/lab04/text-01-clm.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Causal Language Modeling (CLM) - Preprocess, Training and Inference"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "In this lab, we will explore **Causal Language Modeling (CLM)**, which is a core task in training autoregressive language models like GPT-2. CLM is the process of predicting the next word in a sequence, given the previous words. This type of modeling forms the backbone of text generation tasks, where the model learns to generate coherent text by focusing only on previous tokens in the sequence.\n",
15 | "\n",
16 | "The lab is divided into three major sections:\n",
17 | "1. **Preprocessing**: Preparing a dataset and the labels for the CLM task and tokenizing them.\n",
18 | "2. **Training**: Fine-tuning a pre-trained language model like GPT-2 on a specific dataset using the CLM task.\n",
19 | "3. **Inference**: Evaluating the model’s performance by generating text based on input prompts."
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 1,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments\n",
29 | "from datasets import load_dataset"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "### 1. Preprocessing"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "1. **Load the Domain-Specific Dataset**:\n",
44 | " - The first step in fine-tuning GPT-2 is to load a dataset that is specific to the domain of interest. In this case, we are using a publicly available **medical dataset** from the **PubMed** collection. PubMed contains a vast number of medical articles, and fine-tuning on such a dataset can help GPT-2 generate more accurate and context-specific medical text.\n",
45 | " - The dataset we're using is `\"japhba/pubmed_simple\"`, which is a simplified version of PubMed data. This dataset can be easily accessed using the `datasets` library from Hugging Face.\n"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": 2,
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "ds = load_dataset(\"japhba/pubmed_simple\", split=\"train\")\n",
55 | "train_dataset = ds.shuffle(seed=42).select(range(1000))\n",
56 | "eval_dataset = ds.shuffle(seed=42).select(range(1000, 1500))"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "We can check the contents of any one of the examples in the dataset to understand the structure of the data."
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "train_dataset[0]"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "We see that each entry is a dictionary with two keys:\n",
80 | "- \"abstract\": The abstract of the medical article\n",
81 | "- \"country\": The country where the article was published (we will not use this information in this lab)"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "2. **Tokenize the Dataset**:\n",
89 | " - The next step is to **tokenize** the data so that it can be processed by GPT-2.\n",
90 | "   - GPT-2 does not natively use a padding token, since it does not require fixed-length inputs. For this reason, we will substitute it with the EOS token, as suggested by the transformers library (remember, we are also passing an attention mask, so whatever value is used for padding will be ignored by the model!)"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": null,
96 | "metadata": {},
97 | "outputs": [],
98 | "source": [
99 | "# TODO: Tokenize the Dataset using the GPT-2 tokenizer\n",
100 | "# Hint: Use the `map` method of the dataset object\n",
101 | "model_name = \"gpt2\"\n",
102 | "tokenizer = ...\n",
103 | "tokenizer.pad_token = ... # Set padding token to EOS token\n",
104 | "\n",
105 | "def tokenize_function(samples):\n",
106 | " # Tokenize the text column `abstract` in the dataset (set `truncation=True` and `padding=\"max_length\"`)\n",
107 | "    # NOTE: You can either use tokenizer as a global variable, or\n",
108 | " # use a closure to capture the tokenizer variable\n",
109 | " return ...\n",
110 | "\n",
111 | "# NOTE: you should map the dataset using the tokenize function. \n",
112 | "# (You can optionally consider using the argument `batched=True` for better performance,\n",
113 | "# as long as tokenize_function can handle the batch! [note2: the tokenizer can do that!])\n",
114 | "tokenized_dataset = ...\n",
115 | "tokenized_dataset_eval = ...\n"
116 | ]
117 | },
118 | {
119 | "cell_type": "markdown",
120 | "metadata": {},
121 | "source": [
122 | "3. **Add labels to the Dataset for Next Token Prediction**:\n",
123 | " - In this step, we will add the **labels** that will be used during the next token prediction task. \n",
124 | " - In autoregressive language modeling, the **labels** represent the same sequence as the input, shifted one token to the right. This is because the model is trained to predict the next token in the sequence given the previous tokens.\n",
125 | "   - The shifting of the tokens is handled automatically inside the model when it computes the loss. We just add an extra field named `labels` to the dataset (when this argument is passed to the model, it will know to compute the loss for us!)"
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": 5,
131 | "metadata": {},
132 | "outputs": [],
133 | "source": [
134 | "# TODO: Add labels to the dataset\n",
135 | "def add_labels(samples):\n",
136 | " # NOTE: The labels, in causal modeling, should be the same as the input_ids\n",
137 | " samples[\"labels\"] = ...\n",
138 | " return samples\n",
139 | "\n",
140 | "tokenized_dataset = ...\n",
141 | "tokenized_dataset_eval = ..."
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "### 2. Training"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "1. **Fine-Tune the GPT-2 Model**:\n",
156 | "   - Set up the model and fine-tune it using the medical dataset.\n",
157 | "   - The pipeline to follow is the same as the one we have already seen in the previous lab (`lab03 - 01-bert`)"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": null,
163 | "metadata": {},
164 | "outputs": [],
165 | "source": [
166 | "# Set Training Parameters\n",
167 | "training_args = TrainingArguments(\n",
168 | " output_dir=\"./gpt2-medical-finetuned\",\n",
169 | " overwrite_output_dir=True,\n",
170 | " num_train_epochs=1,\n",
171 | " per_device_train_batch_size=6,\n",
172 | " save_total_limit=2,\n",
173 | " logging_dir=\"./logs\",\n",
174 | " logging_steps=100,\n",
175 | " eval_steps=10,\n",
176 | " eval_strategy=\"steps\",\n",
177 | ")\n",
178 | "\n",
179 | "# TODO: Initialize GPT-2 Model\n",
180 | "model = ...\n",
181 | "\n",
182 | "# TODO: Fine-Tune the Model with the `Trainer` method and pass also the eval_dataset \n",
183 | "trainer = Trainer(\n",
184 | " model=model,\n",
185 | " args=training_args,\n",
186 | " train_dataset=tokenized_dataset,\n",
187 | " eval_dataset=tokenized_dataset_eval,\n",
188 | ")\n",
189 | "\n",
190 | "trainer.train()\n",
191 | "\n",
192 | "# Save the Fine-Tuned Model\n",
193 | "model.save_pretrained(\"./gpt2-medical-finetuned\")\n"
194 | ]
195 | },
196 | {
197 | "cell_type": "markdown",
198 | "metadata": {},
199 | "source": [
200 | "### 3. Inference"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "1. **Compare Text Generation Before and After Fine-Tuning**:\n",
208 | " - Generate text using both the original pre-trained GPT-2 model and the fine-tuned model.\n",
209 | " - Provide the same input prompt and observe the differences in the outputs."
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": null,
215 | "metadata": {},
216 | "outputs": [],
217 | "source": [
218 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
219 | "\n",
220 | "# TODO: Load the pre-trained and fine-tuned GPT-2 models\n",
221 | "pretrained_model = ...\n",
222 | "finetuned_model = ...\n",
223 | "\n",
224 | "# TODO: Tokenize the prompt\n",
225 | "tokenizer = ...\n",
226 | "tokenizer.pad_token = ...\n",
227 | "\n",
228 | "prompt = \"The patient presents with chest pain and shortness of breath.\"\n",
229 | "\n",
230 | "inputs = ...\n",
231 | "input_ids = inputs['input_ids']"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": null,
237 | "metadata": {},
238 | "outputs": [],
239 | "source": [
240 | "# TODO: Generate Output of the Model Before Fine-Tuning (use `generate` and then `decode` methods of the model and tokenizer) \n",
241 | "output_pretrained = ...\n",
242 | "generated_pretrained = ...\n",
243 | "\n",
244 | "print(\"**Before Fine-Tuning (Pre-Trained GPT-2)**:\")\n",
245 | "print(generated_pretrained)"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "metadata": {},
252 | "outputs": [],
253 | "source": [
254 | "# TODO: Generate Output of the Model After Fine-Tuning\n",
255 | "output_finetuned = ...\n",
256 | "generated_finetuned = ...\n",
257 | "\n",
258 | "print(\"**After Fine-Tuning (Fine-Tuned on Medical Dataset)**:\")\n",
259 | "print(generated_finetuned)"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {},
265 | "source": [
266 | "Extra stuff!\n",
267 | "\n",
268 | "Training the model in this way produces batches with potentially very different lengths. This can be inefficient, as the model will have to pad the sequences to the length of the longest sequence in the batch.\n",
269 | "\n",
270 | "To avoid this, we can use a technique called **Dynamic Padding**. This technique groups the sequences in the batch by length and pads them to the length of the longest sequence in each group. This way, the model only has to pad the sequences to the length of the longest sequence in each group, which can significantly reduce the amount of padding required.\n",
271 | "\n",
272 | "As a first exercise, quantify the number of pad tokens being used in various situations:\n",
273 | "1. You pad all batches to the maximum allowed sequence length (1024 for GPT-2, this is what we used so far)\n",
274 | "2. You pad the entire batch to the length of the longest sequence in the batch (generate the batches by randomly sampling sentences)\n",
275 | "3. You pad the entire batch to the length of the longest sequence in the batch (generate the batches by placing sentences of similar lengths together)\n",
276 | "\n",
277 | "Next, introduce dynamic padding and compare the execution times of the previous execution and the one with dynamic padding.\n",
278 | "\n",
279 | "You can use the following resources to help you with this exercise:\n",
280 | "- `group_by_length` parameter ([TrainingArguments](https://huggingface.co/docs/transformers/v4.46.0/en/main_classes/trainer#transformers.TrainingArguments.group_by_length)) (parameter to group together samples with similar lengths)\n",
281 | "- `DataCollatorForSeq2Seq` ([DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForSeq2Seq)) (collator function that aggregates samples into batches and pads them to the maximum length of the batch)\n",
282 | "\n",
283 | "Note: you may find that the validation losses you observe may be different from the previous ones. This is because the cross entropy loss is computed as an average across tokens, and the number of tokens in a batch can vary depending on the padding strategy used."
284 | ]
285 | }
286 | ],
287 | "metadata": {
288 | "kernelspec": {
289 | "display_name": "Python 3",
290 | "language": "python",
291 | "name": "python3"
292 | },
293 | "language_info": {
294 | "codemirror_mode": {
295 | "name": "ipython",
296 | "version": 3
297 | },
298 | "file_extension": ".py",
299 | "mimetype": "text/x-python",
300 | "name": "python",
301 | "nbconvert_exporter": "python",
302 | "pygments_lexer": "ipython3",
303 | "version": "3.10.12"
304 | }
305 | },
306 | "nbformat": 4,
307 | "nbformat_minor": 2
308 | }
309 |
--------------------------------------------------------------------------------
/lab05/text-02-LoRA.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## LoRA: Low-Rank Adaptation of Large Language Models"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### **Introduction to LoRA**\n",
15 | "\n",
16 | "LoRA aims to adapt pre-trained language models by adding low-rank matrices to certain weight matrices, reducing the number of parameters that need to be updated. This saves memory and computation, making it ideal for large models. In LoRA, we introduce low-rank matrices to the weights of the model. This allows us to train only the low-rank parameters, while the rest of the model remains frozen. In this way, we can adapt the model to a specific task without having to train the entire model from scratch.\n"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "In this exercise, we will manually add the low-rank (trainable) matrices to an existing model (frozen). We will use BERT. "
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "import torch\n",
33 | "import torch.nn as nn\n",
34 | "from transformers import BertTokenizer, BertForSequenceClassification"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 2,
40 | "metadata": {},
41 | "outputs": [],
42 | "source": [
43 | "# From https://github.com/huggingface/peft/issues/41#issuecomment-1404611868\n",
44 | "def print_trainable_parameters(model):\n",
45 | " \"\"\"\n",
46 | " Prints the number of trainable parameters in the model.\n",
47 | " \"\"\"\n",
48 | " trainable_params = 0\n",
49 | " all_param = 0\n",
50 | " for _, param in model.named_parameters():\n",
51 | " all_param += param.numel()\n",
52 | " if param.requires_grad:\n",
53 | " trainable_params += param.numel()\n",
54 | " print(\n",
55 | " f\"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}\"\n",
56 | " )"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "### 1. Implement LoRA from scratch on a BERT model"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "In this section, we will implement Low-Rank Adaptation (LoRA) on a BERT model from scratch to better understand the concept and its benefits. We will use the `transformers` library to load the pre-trained BERT model and then modify its attention layers to include low-rank matrices. \n",
71 | "\n",
72 | "We will then train the modified BERT model on a downstream task to observe the efficiency of LoRA compared to standard fine-tuning. We will use the same training pipeline as proposed in `lab03` on bert to finetune it on a sentiment classification task on the IMDB dataset, so that we can easily compare the results obtained with previous ones. "
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "model_name = \"bert-base-uncased\"\n",
82 | "model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)\n",
83 | "tokenizer = BertTokenizer.from_pretrained(model_name)"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": null,
89 | "metadata": {},
90 | "outputs": [],
91 | "source": [
92 | "print_trainable_parameters(model)"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "We create a `LoRA` class, which inherits from `nn.Module` -- the base class for all neural network modules in PyTorch. The constructor takes an `original_layer` (e.g., a linear layer from BERT) and a `rank` parameter that determines the rank of the low-rank matrices. It initializes two low-rank matrices `A` and `B`, which will be used for the adaptation. The dimensions of these matrices are determined by the input and output features of the original layer.\n",
100 | "\n",
101 | "`A` and `B` are initialized so that their product is zero at the start of training, in the spirit of the original LoRA paper (https://arxiv.org/abs/2106.09685): one of the two matrices starts at zero and the other with random Gaussian values, so the adaptation initially adds nothing to the frozen layer's output.\n",
102 | "\n",
103 | "The `forward` method defines how the input `x` is processed through the LoRA layer. The input is multiplied by the low-rank matrix `A` to create a low-rank representation, which is then multiplied by the low-rank matrix `B` to obtain the adapted output. Finally, the output of the original layer is combined with the LoRA output."
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 5,
109 | "metadata": {},
110 | "outputs": [],
111 | "source": [
112 | "class LoRA(nn.Module):\n",
113 | " def __init__(self, original_layer, rank=8):\n",
114 | " super(LoRA, self).__init__()\n",
115 | " self.original_layer = original_layer\n",
116 | " self.rank = rank\n",
117 | " self.in_features = original_layer.in_features\n",
118 | " self.out_features = original_layer.out_features\n",
119 | "\n",
120 | " # Initialize the Low-rank matrices A and B\n",
121 | " self.A = nn.Parameter(torch.zeros(self.in_features, rank))\n",
122 | " self.B = nn.Parameter(torch.randn(size=(rank, self.out_features)))\n",
123 | "\n",
124 | " def forward(self, x):\n",
125 | " # The output is the original layer output plus the low-rank adaptation\n",
126 | "\n",
127 | " # LoRA output\n",
128 | " ...\n",
129 | "\n",
130 | " # layer output, which combines the original output with the LoRA one \n",
131 | " return ..."
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "We can choose a list of modules that we want to adapt with LoRA. We will update all linear layers found in the transformer architecture of BERT. "
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": 6,
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "\n",
148 | "for layer in model.bert.encoder.layer:\n",
149 | " layer.attention.self.query = LoRA(layer.attention.self.query)\n",
150 | " ..."
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "Next, we need to make sure that the model is frozen. We simply freeze all parameters by going through the list of parameters found in the model and setting their `requires_grad` attribute to `False`.\n",
158 | "\n",
159 | "Then, only for LoRA modules, we unfreeze A and B. "
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 7,
165 | "metadata": {},
166 | "outputs": [],
167 | "source": [
168 | "# Freeze all parameters except the LoRA parameters\n",
169 | "\n",
170 | "# Freeze all parameters\n",
171 | "for param in model.parameters():\n",
172 | " param.requires_grad = False \n",
173 | " \n",
174 | "# Unfreeze only LoRA parameters\n",
175 | "for layer in model.modules():\n",
176 | " if isinstance(layer, LoRA):\n",
177 | " layer.A.requires_grad = True\n",
178 | " layer.B.requires_grad = True"
179 | ]
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "metadata": {},
184 | "source": [
185 | "Remember, we are introducing low-rank adapters for:\n",
186 | "- query, key, value, output: each of these 4 linear layers gets two low-rank matrices of 768 * 8 parameters each (=> 768 * 8 * 2 * 4 = 49,152)\n",
187 | "- the two FFNN matrices (768x3072 and 3072x768): each gets a 768x8 and a 3072x8 low-rank matrix => 2 * (768 * 8 + 3072 * 8) = 61,440\n",
188 | "\n",
189 | "Repeating this for 12 layers, should give us 12 * (49152 + 61440) = 1,327,104\n",
190 | "\n",
191 | "We can easily verify if that's the case."
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": null,
197 | "metadata": {},
198 | "outputs": [],
199 | "source": [
200 | "print_trainable_parameters(model)"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "From this point onwards, the classic training pipeline can be applied to the model!"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 9,
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "from datasets import load_dataset\n",
217 | "\n",
218 | "# Load a sentiment analysis dataset\n",
219 | "dataset = load_dataset('imdb')\n",
220 | "train_dataset = dataset['train'].shuffle(seed=42).select(range(2000))\n",
221 | "test_dataset = dataset['test'].shuffle(seed=42).select(range(1000))"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": 10,
227 | "metadata": {},
228 | "outputs": [],
229 | "source": [
230 | "from sklearn.metrics import accuracy_score\n",
231 | "\n",
232 | "# Function to compute accuracy\n",
233 | "def compute_metrics(pred):\n",
234 | " labels = pred.label_ids\n",
235 | " preds = pred.predictions.argmax(-1)\n",
236 | " accuracy = accuracy_score(labels, preds)\n",
237 | " return {\"accuracy\": accuracy}"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "# Tokenize the dataset\n",
247 | "def tokenize_function(sample):\n",
248 | " return tokenizer(sample['text'], padding=\"max_length\", truncation=True)\n",
249 | "\n",
250 | "train_dataset = train_dataset.map(tokenize_function, batched=True)\n",
251 | "test_dataset = test_dataset.map(tokenize_function, batched=True)"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": 12,
257 | "metadata": {},
258 | "outputs": [],
259 | "source": [
260 | "from transformers import Trainer, TrainingArguments\n",
261 | "\n",
262 | "batch_size = 32\n",
263 | "num_train_epochs = 1\n",
264 | "\n",
265 | "learning_rate = 2e-4\n",
266 | "weight_decay = 0.01\n",
267 | "\n",
268 | "# Define training arguments\n",
269 | "training_args = TrainingArguments(\n",
270 | " output_dir='./results',\n",
271 | " eval_strategy=\"steps\",\n",
272 | " eval_steps=10,\n",
273 | " learning_rate=learning_rate,\n",
274 | " per_device_train_batch_size=batch_size,\n",
275 | " per_device_eval_batch_size=batch_size,\n",
276 | " num_train_epochs=num_train_epochs,\n",
277 | " weight_decay=weight_decay,\n",
278 | " logging_dir='./logs', # Directory for storing logs\n",
279 | " logging_steps=10, # Log every 10 steps\n",
280 | ")\n",
281 | "\n",
282 | "# Initialize the Trainer object\n",
283 | "trainer = Trainer(\n",
284 | " model=model,\n",
285 | " args=training_args,\n",
286 | " train_dataset=train_dataset,\n",
287 | " eval_dataset=test_dataset,\n",
288 | " compute_metrics=compute_metrics\n",
289 | ")"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": null,
295 | "metadata": {},
296 | "outputs": [],
297 | "source": [
298 | "# Evaluate the model\n",
299 | "results = trainer.evaluate()\n",
300 | "print(f\"Accuracy on the validation set: {results['eval_accuracy']:.4f}\")"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": null,
306 | "metadata": {},
307 | "outputs": [],
308 | "source": [
309 | "results = trainer.train()"
310 | ]
311 | },
312 | {
313 | "cell_type": "code",
314 | "execution_count": null,
315 | "metadata": {},
316 | "outputs": [],
317 | "source": [
318 | "# Evaluate the model\n",
319 | "results = trainer.evaluate()\n",
320 | "print(f\"Accuracy on the validation set: {results['eval_accuracy']:.4f}\")"
321 | ]
322 | },
323 | {
324 | "cell_type": "markdown",
325 | "metadata": {},
326 | "source": [
327 | "### 2. Using LoRA with Hugging Face Transformers"
328 | ]
329 | },
330 | {
331 | "cell_type": "markdown",
332 | "metadata": {},
333 | "source": [
334 | "In addition to implementing Low-Rank Adaptation (LoRA) from scratch, HuggingFace provides an automated way of applying LoRA to models through the **PEFT** library and `LoraConfig`."
335 | ]
336 | },
337 | {
338 | "cell_type": "code",
339 | "execution_count": null,
340 | "metadata": {},
341 | "outputs": [],
342 | "source": [
343 | "model_name = \"bert-base-uncased\"\n",
344 | "model = BertForSequenceClassification.from_pretrained(model_name)\n",
345 | "tokenizer = BertTokenizer.from_pretrained(model_name)"
346 | ]
347 | },
348 | {
349 | "cell_type": "markdown",
350 | "metadata": {},
351 | "source": [
352 | "**PEFT (Parameter-Efficient Fine-Tuning)** is a framework within Hugging Face's ecosystem designed to enable efficient fine-tuning of large language models. PEFT supports various parameter-efficient techniques, including LoRA, Prefix Tuning, and Adapter Layers, to adapt pre-trained models to specific tasks without requiring extensive training or memory resources.\n",
353 | "\n",
354 | "We create a PEFT configuration object (in this case, `LoraConfig` since we want to apply LoRA). We specify some parameters:\n",
355 | "- `r` (rank) determines the rank of the low-rank matrices.\n",
356 | "- `lora_alpha` a scaling parameter used in LoRA\n",
357 | "- `lora_dropout` dropout rate for the dropout layers introduced in LoRA\n",
358 | "- `target_modules` a list of module names that we want to adapt with LoRA. In this case, we adapt all linear layers found in the transformer architecture of BERT (the names of the modules can be found upon inspecting the model).\n",
359 | "\n",
360 | "We instantiate a `PeftModelForSequenceClassification` model. "
361 | ]
362 | },
363 | {
364 | "cell_type": "code",
365 | "execution_count": 23,
366 | "metadata": {},
367 | "outputs": [],
368 | "source": [
369 | "from peft import LoraConfig, get_peft_model, PeftModelForSequenceClassification\n",
370 | "\n",
371 | "config = LoraConfig(\n",
372 | " r=8,\n",
373 | " lora_alpha=32, \n",
374 | " lora_dropout=0.1,\n",
375 | " target_modules=[...],\n",
376 | ")\n",
377 | "\n",
378 | "peft_model = PeftModelForSequenceClassification(model, peft_config=config)"
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": null,
384 | "metadata": {},
385 | "outputs": [],
386 | "source": [
387 | "type(peft_model)"
388 | ]
389 | },
390 | {
391 | "cell_type": "markdown",
392 | "metadata": {},
393 | "source": [
394 | "If we look into the model, we find that, indeed, some extra layers have been added (e.g., `lora_A`, `lora_B`). Their behavior is the same as the layers we implemented from scratch in the previous section."
395 | ]
396 | },
397 | {
398 | "cell_type": "code",
399 | "execution_count": null,
400 | "metadata": {},
401 | "outputs": [],
402 | "source": [
403 | "peft_model.bert.encoder.layer[0]"
404 | ]
405 | },
406 | {
407 | "cell_type": "markdown",
408 | "metadata": {},
409 | "source": [
410 | "We see that the number of trainable parameters is approximately the one we expected."
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": null,
416 | "metadata": {},
417 | "outputs": [],
418 | "source": [
419 | "peft_model.print_trainable_parameters()"
420 | ]
421 | },
422 | {
423 | "cell_type": "markdown",
424 | "metadata": {},
425 | "source": [
426 | "Much like before, we can now run the training! We will reuse some of the objects already created for the previous part (e.g., functions to compute metrics, datasets, training arguments). "
427 | ]
428 | },
429 | {
430 | "cell_type": "code",
431 | "execution_count": 27,
432 | "metadata": {},
433 | "outputs": [],
434 | "source": [
435 | "from transformers import Trainer, TrainingArguments\n",
436 | "\n",
437 | "# Initialize the Trainer object\n",
438 | "trainer = Trainer(\n",
439 | " model=peft_model,\n",
440 | " args=training_args, # we will recycle the same training arguments as before!\n",
441 | " train_dataset=train_dataset, # also datasets, \n",
442 | " eval_dataset=test_dataset,\n",
443 | " compute_metrics=compute_metrics # and compute_metrics\n",
444 | ")"
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "execution_count": null,
450 | "metadata": {},
451 | "outputs": [],
452 | "source": [
453 | "trainer.train()"
454 | ]
455 | },
456 | {
457 | "cell_type": "code",
458 | "execution_count": null,
459 | "metadata": {},
460 | "outputs": [],
461 | "source": [
462 | "# Evaluate the model\n",
463 | "results = trainer.evaluate()\n",
464 | "print(f\"Accuracy on the validation set: {results['eval_accuracy']:.4f}\")"
465 | ]
466 | }
467 | ],
468 | "metadata": {
469 | "kernelspec": {
470 | "display_name": "Python 3",
471 | "language": "python",
472 | "name": "python3"
473 | },
474 | "language_info": {
475 | "codemirror_mode": {
476 | "name": "ipython",
477 | "version": 3
478 | },
479 | "file_extension": ".py",
480 | "mimetype": "text/x-python",
481 | "name": "python",
482 | "nbconvert_exporter": "python",
483 | "pygments_lexer": "ipython3",
484 | "version": "3.10.12"
485 | }
486 | },
487 | "nbformat": 4,
488 | "nbformat_minor": 2
489 | }
490 |
--------------------------------------------------------------------------------
/lab10/text-10-test-generation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Test case generation with Large Language Models\n",
8 | "\n",
9 | "In this series of exercises, we will investigate the use of LLMs to generate test cases.\n",
10 | "\n",
11 | "### Step 1: Our reference code\n",
12 | "\n",
13 | "As opposed to the previous experience with code generation - where we had valid test cases - we assume this time that we have valid solutions for given software requirements. Our task now is to generate test cases for valid code.\n",
14 | "\n"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 6,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "#the same code is saved in the python script function_01.py\n",
24 | "\n",
25 | "original_function=\"\"\"def racer_disqualified(times, winner_times, n_penalties, penalties):\n",
26 | " \\\"\"\"\n",
27 | " Determines if a racer is disqualified based on their times, penalties, and winner times.\n",
28 | "\n",
29 | " Parameters:\n",
30 | " times (list of int): List of the racer's times for three events.\n",
31 | " winner_times (list of int): List of winner times for the same three events.\n",
32 | " n_penalties (int): Number of penalties the racer incurred.\n",
33 | " penalties (list of int): List of penalty values.\n",
34 | "\n",
35 | " Returns:\n",
36 | " bool: True if the racer is disqualified, False otherwise.\n",
37 | "\n",
38 | " Raises:\n",
39 | " ValueError: If inputs do not meet the required types or constraints.\n",
40 | " \\\"\"\"\n",
41 | " # Input validation\n",
42 | " if not (isinstance(times, list) and len(times) == 3 and all(isinstance(t, int) for t in times)):\n",
43 | " raise ValueError(\"times must be a list of three integers.\")\n",
44 | "\n",
45 | " if not (isinstance(winner_times, list) and len(winner_times) == 3 and all(isinstance(wt, int) for wt in winner_times)):\n",
46 | " raise ValueError(\"winner_times must be a list of three integers.\")\n",
47 | "\n",
48 | " if not isinstance(n_penalties, int):\n",
49 | " raise ValueError(\"n_penalties must be an integer.\")\n",
50 | "\n",
51 | " if not (isinstance(penalties, list) and all(isinstance(p, int) for p in penalties)):\n",
52 | " raise ValueError(\"penalties must be a list of integers.\")\n",
53 | "\n",
54 | " if n_penalties != len(penalties):\n",
55 | " raise ValueError(\"n_penalties must match the length of the penalties list.\")\n",
56 | "\n",
57 | " disqualified = False\n",
58 | " tot_penalties = 0\n",
59 | "\n",
60 | " # Calculate total penalties and check for any excessive penalty\n",
61 | " for penalty in penalties:\n",
62 | " tot_penalties += penalty\n",
63 | " if penalty > 100:\n",
64 | " disqualified = True\n",
65 | "\n",
66 | " # Check for disqualification based on total penalties or number of penalties\n",
67 | " if tot_penalties > 100 or n_penalties > 5:\n",
68 | " disqualified = True\n",
69 | "\n",
70 | " # Check if any time exceeds 1.5 times the corresponding winner time\n",
71 | " for i in range(3):\n",
72 | " max_time = winner_times[i] * 1.5\n",
73 | " if times[i] > max_time:\n",
74 | " disqualified = True\n",
75 | "\n",
76 | " return disqualified\"\"\"\n",
77 | "\n",
78 | "\n",
79 | "file_path = \"function_01.py\"\n",
80 | "\n",
81 | "with open(file_path, 'w') as file:\n",
82 | " file.write(original_function)\n",
83 | "\n",
84 | "\n",
85 | "\n",
86 | "def racer_disqualified(times, winner_times, n_penalties, penalties):\n",
87 | " \"\"\"\n",
88 | " Determines if a racer is disqualified based on their times, penalties, and winner times.\n",
89 | "\n",
90 | " Parameters:\n",
91 | " times (list of int): List of the racer's times for three events.\n",
92 | " winner_times (list of int): List of winner times for the same three events.\n",
93 | " n_penalties (int): Number of penalties the racer incurred.\n",
94 | " penalties (list of int): List of penalty values.\n",
95 | "\n",
96 | " Returns:\n",
97 | " bool: True if the racer is disqualified, False otherwise.\n",
98 | "\n",
99 | " Raises:\n",
100 | " ValueError: If inputs do not meet the required types or constraints.\n",
101 | " \"\"\"\n",
102 | " # Input validation\n",
103 | " if not (isinstance(times, list) and len(times) == 3 and all(isinstance(t, int) for t in times)):\n",
104 | " raise ValueError(\"times must be a list of three integers.\")\n",
105 | "\n",
106 | " if not (isinstance(winner_times, list) and len(winner_times) == 3 and all(isinstance(wt, int) for wt in winner_times)):\n",
107 | " raise ValueError(\"winner_times must be a list of three integers.\")\n",
108 | "\n",
109 | " if not isinstance(n_penalties, int):\n",
110 | " raise ValueError(\"n_penalties must be an integer.\")\n",
111 | "\n",
112 | " if not (isinstance(penalties, list) and all(isinstance(p, int) for p in penalties)):\n",
113 | " raise ValueError(\"penalties must be a list of integers.\")\n",
114 | "\n",
115 | " if n_penalties != len(penalties):\n",
116 | " raise ValueError(\"n_penalties must match the length of the penalties list.\")\n",
117 | "\n",
118 | " disqualified = False\n",
119 | " tot_penalties = 0\n",
120 | "\n",
121 | " # Calculate total penalties and check for any excessive penalty\n",
122 | " for penalty in penalties:\n",
123 | " tot_penalties += penalty\n",
124 | " if penalty > 100:\n",
125 | " disqualified = True\n",
126 | "\n",
127 | " # Check for disqualification based on total penalties or number of penalties\n",
128 | " if tot_penalties > 100 or n_penalties > 5:\n",
129 | " disqualified = True\n",
130 | "\n",
131 | " # Check if any time exceeds 1.5 times the corresponding winner time\n",
132 | " for i in range(3):\n",
133 | " max_time = winner_times[i] * 1.5\n",
134 | " if times[i] > max_time:\n",
135 | " disqualified = True\n",
136 | "\n",
137 | " return disqualified"
138 | ]
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {},
143 | "source": [
144 | "### Step 2: Define some pytest test cases\n",
145 | "\n",
146 | "We now set up an environment to run test cases and obtain their coverage. To start, we define a couple of test cases with the pytest library."
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": null,
152 | "metadata": {},
153 | "outputs": [],
154 | "source": [
155 | "import pytest\n",
156 | "import ipytest\n",
157 | "\n",
158 | "#TODO \n",
159 | "#define test cases with pytest to run with ipytest below\n",
160 | "\n",
161 | "\n",
162 | "def run_tests():\n",
163 | " ipytest.run('-vv') \n",
164 | "\n",
165 | "# Running the tests with ipytests\n",
166 | "run_tests()\n",
167 | "\n"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | "### Step 3: Computing the pass rate\n",
175 | "\n",
176 | "The first objective of our analysis is computing the pass rate of the test cases.\n",
177 | "\n",
178 | "The pass rate for a test suite is defined as the ratio between the passing test cases and all the test cases executed.\n",
179 | "\n",
180 | "Notice that this ratio is computed in the same way as the Functional Correctness when you are comparing generated code against an existing test suite, but there is a subtle difference in what we are measuring: \n",
181 | "- when we compute functional correctness, we have a correct test suite, and we are verifying if the code complies to requirements by executing the test cases.\n",
182 | "- when we compute the pass rate, we have correct code, and we are verifying if the test cases comply to the requirements by executing them against the code.\n",
183 | "\n",
184 | "For now, we are defining the test cases manually: we make sure that the pass rate is 100%."
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "import pytest\n",
194 | "import io\n",
195 | "import sys\n",
196 | "import subprocess\n",
197 | "import re\n",
198 | "\n",
199 | "\n",
200 | "\n",
201 | "#TODO\n",
202 | "# Run the pytest command and capture the output\n",
203 | "result = \"\"\n",
204 | "\n",
205 | "#TODO\n",
206 | "# Extract test results from the pytest output\n",
207 | "# parse the results to find passed, failed and errors\n",
208 | "# (hint: you can use the code from previous lab)\n",
209 | "\n",
210 | "\n",
211 | "errors, failures, passes = (0,0,0)\n",
212 | "\n",
213 | "\n",
214 | "print(f\"# Passed: {passes}\")\n",
215 | "print(f\"# Failed: {failures}\")\n",
216 | "print(f\"# Errors: {errors}\")\n",
217 | "\n",
218 | "#compute the pass rate of the test cases\n",
219 | "pass_rate = 0\n",
220 | "print(f\"Pass Rate: {pass_rate}\")"
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {},
226 | "source": [
227 | "### Step 4: Compute the coverage\n",
228 | "\n",
229 | "To compute the coverage of a test suite over a function or a set of functions, we can use the coverage library.\n",
230 | "\n",
231 | "pip install pytest-cov\n",
232 | "\n",
233 | "Once we have the coverage module installed, it is possible to launch the coverage by launching the following command line instructions:\n",
234 | "- coverage run -m pytest test_function_name\n",
235 | "- coverage report -m\n",
236 | "\n",
237 | "In this code section, define multiple subprocess runs to obtain the results of the coverage computation inside a variable."
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "#TODO\n",
247 | "# Run the pytest coverage run command\n",
248 | "result = \"\"\n",
249 | "\n",
250 | "#TODO\n",
251 | "# Run the pytest coverage report command\n",
252 | "result2 = \"\"\n",
253 | "\n",
254 | "#TODO\n",
255 | "#define code to extract the coverage from the coverage report\n",
256 | "#1) find the line where the function is defined (the line will report the name of the file)\n",
257 | "#2) extract the coverage\n",
258 | "\n",
259 | "coverage = 0.0\n",
260 | "\n",
261 | "print(f\"Coverage: {coverage}%\")"
262 | ]
263 | },
264 | {
265 | "cell_type": "markdown",
266 | "metadata": {},
267 | "source": [
268 | "### Step 5: Introducing mutations\n",
269 | "\n",
270 | "To try out mutation testing, produce a set of variants of the function by changing operators and values. Save all these variants in a dictionary of mutations by modifying the text of the function like in the example below.\n",
271 | "\n",
272 | "Remeber to introduce a single mutant in each mutated version of the function.\n",
273 | "\n",
274 | "**Note**: several tools exist to automate mutation. You can refer to the libraries mutatest and mutpy to generate automatic mutations for test cases written with pytest. In this example, we will introduce mutations manually."
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": null,
280 | "metadata": {},
281 | "outputs": [],
282 | "source": [
283 | "\n",
284 | "#in this mutant, the check \"if penalty < 100\" is changed to \"if penalty > 100\"\n",
285 | "\n",
286 | "mutant1 = \"\"\"def racer_disqualified(times, winner_times, n_penalties, penalties):\n",
287 | " \\\"\"\"\n",
288 | " Determines if a racer is disqualified based on their times, penalties, and winner times.\n",
289 | "\n",
290 | " Parameters:\n",
291 | " times (list of int): List of the racer's times for three events.\n",
292 | " winner_times (list of int): List of winner times for the same three events.\n",
293 | " n_penalties (int): Number of penalties the racer incurred.\n",
294 | " penalties (list of int): List of penalty values.\n",
295 | "\n",
296 | " Returns:\n",
297 | " bool: True if the racer is disqualified, False otherwise.\n",
298 | "\n",
299 | " Raises:\n",
300 | " ValueError: If inputs do not meet the required types or constraints.\n",
301 | " \\\"\"\"\n",
302 | " # Input validation\n",
303 | " if not (isinstance(times, list) and len(times) == 3 and all(isinstance(t, int) for t in times)):\n",
304 | " raise ValueError(\"times must be a list of three integers.\")\n",
305 | "\n",
306 | " if not (isinstance(winner_times, list) and len(winner_times) == 3 and all(isinstance(wt, int) for wt in winner_times)):\n",
307 | " raise ValueError(\"winner_times must be a list of three integers.\")\n",
308 | "\n",
309 | " if not isinstance(n_penalties, int):\n",
310 | " raise ValueError(\"n_penalties must be an integer.\")\n",
311 | "\n",
312 | " if not (isinstance(penalties, list) and all(isinstance(p, int) for p in penalties)):\n",
313 | " raise ValueError(\"penalties must be a list of integers.\")\n",
314 | "\n",
315 | " if n_penalties != len(penalties):\n",
316 | " raise ValueError(\"n_penalties must match the length of the penalties list.\")\n",
317 | "\n",
318 | " disqualified = False\n",
319 | " tot_penalties = 0\n",
320 | "\n",
321 | " # Calculate total penalties and check for any excessive penalty\n",
322 | " for penalty in penalties:\n",
323 | " tot_penalties += penalty\n",
324 | " if penalty < 100:\n",
325 | " disqualified = True\n",
326 | "\n",
327 | " # Check for disqualification based on total penalties or number of penalties\n",
328 | " if tot_penalties > 100 or n_penalties > 5:\n",
329 | " disqualified = True\n",
330 | "\n",
331 | " # Check if any time exceeds 1.5 times the corresponding winner time\n",
332 | " for i in range(3):\n",
333 | " max_time = winner_times[i] * 1.5\n",
334 | " if times[i] > max_time:\n",
335 | " disqualified = True\n",
336 | "\n",
337 | " return disqualified\"\"\"\n",
338 | "\n",
339 | "\n",
340 | "#TODO\n",
341 | "#define additional mutations and combine them in a mutant list\n",
342 | "mutant2 = \"\"\n",
343 | "mutant3 = \"\"\n",
344 | "mutant4 = \"\"\n",
345 | "mutant5 = \"\" \n",
346 | "mutants = []\n",
347 | "\n"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
354 | "### Step 6: Calculating Mutation Score\n",
355 | "\n",
356 | "Now cycle over the list of mutants. For every mutant, overwrite the function function_01.py and re-execute the test cases. For each mutant you can compute the following outcome:\n",
357 | "- Mutant killed: one or more test cases failed\n",
358 | "- Mutant survived: all test cases passed\n",
359 | "\n",
360 | "At the end of the iteration over mutants, compute the mutation score:\n",
361 | "- Mutation score = survived mutants / total number of mutants"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": null,
367 | "metadata": {},
368 | "outputs": [],
369 | "source": [
370 | "#define the path where to save the mutants\n",
371 | "\n",
372 | "file_path = \"function_01.py\"\n",
373 | "\n",
374 | "\n",
375 | "#initialize killed mutants and survived mutants\n",
376 | "killed_mutants = 0\n",
377 | "survived_mutants = 0\n",
378 | "\n",
379 | "# Iterate over the list of mutants \n",
380 | "\n",
381 | "for mutant in mutants:\n",
382 | " \n",
383 | " #TODO\n",
384 | " #overwrite the file with the function with each mutant\n",
385 | "\n",
386 | " \n",
387 | " #TODO\n",
388 | " #run the test cases and collect the number of passed tests\n",
389 | "\n",
390 | "\n",
391 | " #TODO\n",
392 | " # Extract test results from the pytest output\n",
393 | " # parse the results to find passed, failed and errors\n",
394 | " # (hint: you can use the code from previous lab)\n",
395 | "\n",
396 | "\n",
397 | " #TODO\n",
398 | " #update the number of survived or killed mutants\n",
399 | "\n",
400 | " pass\n",
401 | "\n",
402 | "#TODO\n",
403 | "#compute the mutation score\n",
404 | "mutation_score = 0.0\n",
405 | "\n",
406 | "print(f\"Mutation score: {round(mutation_score*100, 2)}%\")\n",
407 | "\n",
408 | "\n",
409 | "\n"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "### Step 7 : Generating tests with LLMs\n",
417 | "\n",
418 | "This time, we will consider again at least two alternatives for test case generation:\n",
419 | "- a model from HuggingFace, e.g., CodeLLAMA\n",
420 | "- a chat engine, e.g., ChatGPT or Qwen2.5\n",
421 | "\n",
422 | "With each engine, we will generate a new test file (e.g., test_function_01_gpt.py, and test_function_01_llama.py), and replicate the pass rate, coverage and mutation analysis performed before with pre-defined test cases."
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": 4,
428 | "metadata": {},
429 | "outputs": [],
430 | "source": [
431 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
432 | "import torch\n",
433 | "import subprocess\n",
434 | "\n",
435 | "\n",
436 | "\n",
437 | "\n",
438 | "#TODO \n",
439 | "#obtain test code by using LLAMA or other chat engines, save the results in different files\n",
440 | "\n",
441 | "\n",
442 | "#TODO\n",
443 | "#append the results on different test files\n",
444 | "test_files = []\n",
445 | "\n",
446 | "\n",
447 | "#TODO\n",
448 | "#traverse all the test files saved in the list\n",
449 | "for test_file in test_files :\n",
450 | "\n",
451 | "\n",
452 | " print()\n",
453 | " print()\n",
454 | " print(\"Doing:\", test_file)\n",
455 | "\n",
456 | " #TODO\n",
457 | " #restore the original function in function_01.py after mutation analysis is performed\n",
458 | "\n",
459 | "\n",
460 | " #TODO\n",
461 | " #compute pass rate for the given test file\n",
462 | "\n",
463 | "\n",
464 | "\n",
465 | "\n",
466 | " number_of_tests = 0\n",
467 | " pass_rate = 0.0\n",
468 | " print(f\"Number of tests: {number_of_tests}\")\n",
469 | " print(f\"Pass Rate: {round(pass_rate*100, 2)}\")\n",
470 | "\n",
471 | " #TODO\n",
472 | " #compute coverage for the given test file\n",
473 | "\n",
474 | "\n",
475 | "\n",
476 | "\n",
477 | " print(f\"Coverage: {coverage}%\")\n",
478 | " \n",
479 | "\n",
480 | " #re-execute the mutation analysis\n",
481 | "\n",
482 | " survived_mutants = 0\n",
483 | " killed_mutants = 0\n",
484 | "\n",
485 | " for mutant in mutants:\n",
486 | " \n",
487 | " #TODO\n",
488 | " #overwrite the file with the function with each mutant\n",
489 | " pass\n",
490 | "\n",
491 | " #TODO\n",
492 | " #compute the mutation score\n",
493 | " mutation_score = 0.0\n",
494 | "\n",
495 | " print(f\"Mutation score: {round(mutation_score*100, 2)}%\")\n",
496 | "\n",
497 | "\n",
498 | " pass\n",
499 | "\n"
500 | ]
501 | }
502 | ],
503 | "metadata": {
504 | "kernelspec": {
505 | "display_name": "Python 3",
506 | "language": "python",
507 | "name": "python3"
508 | },
509 | "language_info": {
510 | "codemirror_mode": {
511 | "name": "ipython",
512 | "version": 3
513 | },
514 | "file_extension": ".py",
515 | "mimetype": "text/x-python",
516 | "name": "python",
517 | "nbconvert_exporter": "python",
518 | "pygments_lexer": "ipython3",
519 | "version": "3.10.10"
520 | }
521 | },
522 | "nbformat": 4,
523 | "nbformat_minor": 2
524 | }
525 |
--------------------------------------------------------------------------------
/lab08/text-08-roc-router-design.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Exercise 1: Analyzing thresholds\n",
8 | "\n",
9 | "The Receiver Operating Characteristic (ROC) curve is a graphical representation used to evaluate the performance of binary classification models. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.\n",
10 | "\n",
11 | "### Step 1: Generating some random data\n",
12 | "\n",
13 | "We simulate the result of the application of an LLM by generating two random vectors, of actual results and expected results (the ground truth). For this simplified example, the corresponding actual vector is always in the same place as the expected vector (it will not be always like this!)."
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": null,
19 | "metadata": {},
20 | "outputs": [],
21 | "source": [
22 | "import os\n",
23 | "\n",
24 | "import numpy as np\n",
25 | "from sklearn.metrics.pairwise import cosine_similarity\n",
26 | "import matplotlib.pyplot as plt\n",
27 | "from sklearn.metrics import auc\n",
28 | "\n",
29 | "\n",
30 | "# Generate random dataset (100 vectors, each of 100 dimensions)\n",
31 | "np.random.seed(42) # For reproducibility\n",
32 | "expected = \"\"\n",
33 | "\n",
34 | "\n",
35 | "# Perturbation factor\n",
36 | "perturbation_factor = 0.4\n",
37 | "\n",
38 | "# Actual vectors\n",
39 | "actual = expected + np.random.uniform(-perturbation_factor, perturbation_factor, size=expected.shape)\n",
40 | "\n",
41 | "# Print out the original and modified datasets for comparison\n",
42 | "print(\"Expected (First 2 Vectors):\")\n",
43 | "print(expected[:2])\n",
44 | "print(\"\\nActual (First 2 Vectors):\")\n",
45 | "print(actual[:2])\n",
46 | "\n",
47 | "# Comparisons can be made as usual through cosine similarity\n",
48 | "cos_sim_matrix = cosine_similarity(actual, expected)\n",
49 | "\n",
50 | "print(\"\\nCosine Similarity Matrix:\")\n",
51 | "print(cos_sim_matrix)"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "### Step 2: a function to compute TPR and FPR\n",
59 | "\n",
60 | "True Positive Rate (TPR), also called Sensitivity or Recall, is the proportion of actual positives that are correctly identified by the model. It is given by TPR = TP / (TP + FN), where TP = true positives; FN = False Negatives.\n",
61 | "\n",
62 | "False Positive Rate (FPR) is the proportion of actual negatives that are incorrectly classified as positives. It is given by FPR = FP / (FP + TN), where FP = False Positives, TN = True Negatives.\n",
63 | "\n",
64 | "A classifier typically outputs a probability score for each sample (the likelihood that a sample belongs to the positive class). To classify the sample, you apply a threshold on this score. If the score is above the threshold, the sample is classified as positive (class 1), and if it is below the threshold, it is classified as negative (class 0).\n"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": 2,
70 | "metadata": {},
71 | "outputs": [
72 | {
73 | "name": "stdout",
74 | "output_type": "stream",
75 | "text": [
76 | "True Positive Rate (TPR): 0\n",
77 | "False Positive Rate (FPR): 0\n"
78 | ]
79 | }
80 | ],
81 | "source": [
82 | "\n",
83 | "# Function to compute tpr and fpr given an actual vector, an expected vector, and a threshold\n",
84 | "\n",
85 | "def compute_tpr_fpr(data, data2, threshold=0.9):\n",
86 | "\n",
87 | " # Compute the cosine similarity between all pairs of vectors\n",
88 | " cos_sim_matrix = \"\"\n",
89 | " \n",
90 | " # Initialize counters for TP, FP, TN, FN\n",
91 | " tp = 0 # True Positives\n",
92 | " fp = 0 # False Positives\n",
93 | " tn = 0 # True Negatives\n",
94 | " fn = 0 # False Negatives\n",
95 | " \n",
96 | " # Loop over all the pairs in the matrix\n",
97 | " for i in range(len(data)):\n",
98 | " for j in range(len(data2)):\n",
99 | " # Compute ground truth by checking if the vectors are from the same index in original data\n",
100 | " if i == j:\n",
101 | " # Same vectors should be similar (positive pair)\n",
102 | " ground_truth = 1 # Positive pair (similar)\n",
103 | " else:\n",
104 | " # Different vectors should be dissimilar (negative pair)\n",
105 | " ground_truth = 0 # Negative pair (dissimilar)\n",
106 | " \n",
107 | " # Apply the threshold to the cosine similarity to obtain the predicted boolean values\n",
108 | " # TODO\n",
109 | " \n",
110 | " # Update the counts based on comparison of ground truth and prediction\n",
111 | " # TODO\n",
112 | "\n",
113 | " #compute tpr and fpr\n",
114 | " tpr = 0\n",
115 | " fpr = 0\n",
116 | "\n",
117 | " return tpr, fpr\n",
118 | "\n",
119 | "# Example of computation for a threshold equal to 0.95\n",
120 | "tpr, fpr = compute_tpr_fpr(actual, expected, threshold=0.95)\n",
121 | "\n",
122 | "# Print the result\n",
123 | "print(f\"True Positive Rate (TPR): {tpr}\")\n",
124 | "print(f\"False Positive Rate (FPR): {fpr}\")\n",
125 | "\n"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "### Step 3: Plotting the ROC curve\n",
133 | "\n",
134 | "In the ROC curve:\n",
135 | "\n",
136 | "- The x-axis represents the False Positive Rate (FPR)\n",
137 | "- The y-axis represents the True Positive Rate (TPR)\n",
138 | "\n",
139 | "Each point on the ROC curve corresponds to a specific threshold value. By adjusting the threshold, you change the trade-off between TPR and FPR.\n",
140 | "\n",
141 | "### Thresholding\n",
142 | "\n",
143 | "By varying this threshold from 0 to 1, you can calculate different values for TPR and FPR, generating a curve. The threshold determines the sensitivity (TPR) and the specificity (FPR) of the classifier:\n",
144 | "\n",
145 | "- At a high threshold, the model will classify fewer instances as positive, leading to fewer true positives and possibly many false negatives.\n",
146 | "\n",
147 | "- At a low threshold, the model will classify more instances as positive, leading to more true positives but also increasing false positives."
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": null,
153 | "metadata": {},
154 | "outputs": [],
155 | "source": [
156 | "# Compute TPR and FPR at varying thresholds\n",
157 | "thresholds = [] # Create 100 thresholds between 0 and 1\n",
158 | "tprs = []\n",
159 | "fprs = []\n",
160 | "\n",
161 | "# Loop through all thresholds and calculate TPR and FPR\n",
162 | "# TODO\n",
163 | "\n",
164 | "# Plot the ROC curve\n",
165 | "plt.figure(figsize=(8, 6))\n",
166 | "plt.plot(fprs, tprs, color='blue', label='ROC curve')\n",
167 | "plt.plot([0, 1], [0, 1], color='red', linestyle='--') # Random guess line (diagonal)\n",
168 | "plt.title('ROC Curve')\n",
169 | "plt.xlabel('False Positive Rate (FPR)')\n",
170 | "plt.ylabel('True Positive Rate (TPR)')\n",
171 | "plt.legend(loc='lower right')\n",
172 | "plt.grid(True)\n",
173 | "plt.show()\n",
174 | "\n",
175 | "# Calculate AUC (Area Under Curve)\n",
176 | "roc_auc = auc(fprs, tprs)\n",
177 | "print(f'Area Under the ROC Curve (AUC): {roc_auc}')\n"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "### Now reason about the following points:\n",
185 | "\n",
186 | "- What happens by varying the size of the vectors?\n",
187 | "- What happens by varying the perturbation factor?\n",
188 | "- How to cope with cases in which *we don't know* what is the ground truth (i.e., we don't know that the actual result correspond to the one in the same position in the expected results?)"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "---\n",
196 | "\n",
197 | "# Exercise 2: A simple router architecture\n",
198 | "\n",
199 | "In this architecture, we leverage a Large Language Model (LLM) to dynamically interpret user instructions and route them to the appropriate task-specific prompt. This approach ensures that complex software engineering tasks, such as generating use cases or class diagrams, are efficiently handled based on the user's needs.\n",
200 | "\n",
201 | "The architecture is split into two main stages:\n",
202 | "\n",
203 | "- Router LLM Stage:\n",
204 | " - The LLM analyzes the user's instruction and determines whether the task is related to generating use cases or a class diagram.\n",
205 | " - It outputs an instruction to route the next stage.\n",
206 | "\n",
207 | "- Task Execution LLM Stage:\n",
208 | " - Based on the generated prompt from the router, the LLM executes the required task by producing either:\n",
209 | " - A set of use cases, or\n",
210 | " - A UML class diagram.\n"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "metadata": {
216 | "ExecuteTime": {
217 | "end_time": "2025-11-29T09:22:27.716506Z",
218 | "start_time": "2025-11-29T09:21:25.733070Z"
219 | }
220 | },
221 | "source": [
222 | "# reference code for llama prompting\n",
223 | " \n",
224 | " \n",
225 | "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
226 | "from huggingface_hub import login\n",
227 | "import torch\n",
228 | "import os\n",
229 | "\n",
230 | "# HF login\n",
231 | "# -------------------- LOCAL RUNTIME -------------------------------------------------\n",
232 | "#hf_token = os.getenv(\"HUGGING_FACE_HUB_TOKEN\") # The token must have been previously set in the environment variables (the syntax varies depending on your environment)\n",
233 | "#login(hf_token)\n",
234 | "# ------------------------------------------------------------------------------------\n",
235 | "\n",
236 | "# Detect the device\n",
237 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
238 | "\n",
239 | "# Load the tokenizer and model\n",
240 | "model_id = \"meta-llama/Llama-3.2-3B-Instruct\"\n",
241 | "model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map=\"auto\")\n",
242 | "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
243 | "\n",
244 | "\n",
245 | "\n",
246 | "\n",
247 | "# We define a method to ask any prompt to llama\n",
248 | "def make_a_query(prompt: str, max_new_tokens:int = 200):\n",
249 | " \"\"\"\n",
250 | " Send a prompt to the Llama model and get a response.\n",
251 | "\n",
252 | " Args:\n",
253 | " - prompt (str): The input question or statement to the model.\n",
254 | " - max_new_tokens (int): The maximum length of the response.\n",
255 | "\n",
256 | " Returns:\n",
257 | " - str: The model's generated response.\n",
258 | " \"\"\"\n",
259 | "\n",
260 | " # Both the tokenizer and the model are global variables\n",
261 | "\n",
262 | " # Set pad_token_id if missing\n",
263 | " if tokenizer.pad_token_id is None:\n",
264 | " tokenizer.pad_token_id = tokenizer.eos_token_id\n",
265 | "\n",
266 | "\n",
267 | " # Tokenize the input with padding and truncation\n",
268 | " device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
269 | " inputs = tokenizer(prompt, return_tensors=\"pt\", padding=True, truncation=True).to(device)\n",
270 | "\n",
271 | " # Compute the lenght of the input prompt to be able to extract the model's response later\n",
272 | " input_ids = inputs[\"input_ids\"]\n",
273 | " prompt_length = input_ids.shape[1]\n",
274 | "\n",
275 | " # Generate a response\n",
276 | " output = model.generate(\n",
277 | " inputs['input_ids'],\n",
278 | " attention_mask=inputs['attention_mask'],\n",
279 | " max_new_tokens=max_new_tokens, # Limit the number of new tokens generated (e.g., a single word)\n",
280 | " #temperature=0.3, # Reduce randomness, use with do_sample = True\n",
281 | " repetition_penalty=2.0, # Penalize repetition\n",
282 | " no_repeat_ngram_size=3, # Avoid repeating bigrams\n",
283 | " do_sample= False, # Set to False to use Greedy or Beam search\n",
284 | " num_beams=3, # Use with do_sample = False\n",
285 | " eos_token_id=tokenizer.eos_token_id, # End generation at EOS token\n",
286 | " pad_token_id=tokenizer.pad_token_id, # Avoid padding tokens\n",
287 | " early_stopping=True,\n",
288 | " )\n",
289 | "\n",
290 | " generated_tokens = output[0, prompt_length:]\n",
291 | "\n",
292 | " # Decode the response into human-readable text\n",
293 | " response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()\n",
294 | "\n",
295 | " return response\n",
296 | "\n"
297 | ],
298 | "outputs": [
299 | {
300 | "name": "stderr",
301 | "output_type": "stream",
302 | "text": [
303 | "C:\\Users\\ZZARNAUDOA\\OneDrive - Vodafone Group\\Desktop\\PhD\\llm-main\\.venv\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
304 | " from .autonotebook import tqdm as notebook_tqdm\n"
305 | ]
306 | },
307 | {
308 | "ename": "NameError",
309 | "evalue": "name 'os' is not defined",
310 | "output_type": "error",
311 | "traceback": [
312 | "\u001B[31m---------------------------------------------------------------------------\u001B[39m",
313 | "\u001B[31mNameError\u001B[39m Traceback (most recent call last)",
314 | "\u001B[36mCell\u001B[39m\u001B[36m \u001B[39m\u001B[32mIn[1]\u001B[39m\u001B[32m, line 9\u001B[39m\n\u001B[32m 6\u001B[39m \u001B[38;5;28;01mimport\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34;01mtorch\u001B[39;00m\n\u001B[32m 8\u001B[39m \u001B[38;5;66;03m# HF login\u001B[39;00m\n\u001B[32m----> \u001B[39m\u001B[32m9\u001B[39m hf_token = \u001B[43mos\u001B[49m.getenv(\u001B[33m\"\u001B[39m\u001B[33mHUGGING_FACE_HUB_TOKEN\u001B[39m\u001B[33m\"\u001B[39m) \u001B[38;5;66;03m# The token must have been previously set in the environment variables\u001B[39;00m\n\u001B[32m 11\u001B[39m \u001B[38;5;66;03m# Detect the device\u001B[39;00m\n\u001B[32m 12\u001B[39m device = torch.device(\u001B[33m\"\u001B[39m\u001B[33mcuda\u001B[39m\u001B[33m\"\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m torch.cuda.is_available() \u001B[38;5;28;01melse\u001B[39;00m \u001B[33m\"\u001B[39m\u001B[33mcpu\u001B[39m\u001B[33m\"\u001B[39m)\n",
315 | "\u001B[31mNameError\u001B[39m: name 'os' is not defined"
316 | ]
317 | }
318 | ],
319 | "execution_count": 1
320 | },
321 | {
322 | "cell_type": "markdown",
323 | "metadata": {},
324 | "source": [
325 | "### Step 1: Router LLM - Decide the next step\n",
326 | "The first LLM will analyze the user's instruction and generate an instruction for the next prompt to use. For simplicity, in this case the router will only decide what is the type of diagram to create:\n",
327 | "\n",
328 | "- Use Case Diagram\n",
329 | "- Class Diagram"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": null,
335 | "metadata": {},
336 | "outputs": [],
337 | "source": [
338 | "requirements_text = \"The proposed platform is designed to enhance the hiking experience for various user groups, including visitors, local guides, platform managers, and hut workers. The platform provides a centralized repository of hiking routes, hut information, and parking facilities. It also enables interactive features such as real-time hike tracking, personalized recommendations, and group hike planning. By combining these capabilities, the platform seeks to foster safe, informed, and collaborative hiking experiences.\\\n",
339 | "The platform will be deployed as a cloud-based web and mobile application accessible to all stakeholders. The distribution strategy includes an app available on major mobile operating systems, such as iOS and Android, alongside a responsive web interface. It will require an internet connection for features like real-time tracking, notifications, and user authentication, though some offline capabilities, such as pre-downloaded hike information, will also be available.\\\n",
340 | "User authentication will be role-based, ensuring that only authorized users, such as verified hut workers and platform managers, can access sensitive or administrative features.\\\n",
341 | "Visitors are the primary users of the platform. They can browse a comprehensive list of hiking trails, filter them based on specific criteria such as difficulty, length, or starting point, and view detailed descriptions. To access advanced features like personalized recommendations, visitors can create user accounts by registering on the platform. Registered users can record their fitness parameters, enabling the system to suggest trails tailored to their capabilities.\\\n",
342 | "During a hike, visitors can record their progress by marking reference points and sharing their live location through a broadcasting URL. They can also initiate group activities by planning hikes, adding group members, and confirming group participation. The platform allows visitors to start, terminate, and track their hikes, with notifications for unfinished hikes or late group members to ensure safety and accountability.\\\n",
343 | "Local guides enrich the platform by contributing essential information. They can add detailed descriptions of hikes, parking facilities, and huts, ensuring hikers have accurate and comprehensive data. Local guides also link parking lots and huts to specific trails as starting or arrival points, enhancing the planning process.\\\n",
344 | "To aid in the visual representation and accessibility of information, local guides can upload pictures of huts and connect these locations directly to hikes. This integration simplifies route planning and helps visitors visualize their journey.\\\n",
345 | "Platform managers oversee the operational integrity and safety of the platform. They verify new hut worker registrations, ensuring that only authorized personnel can update hut-related data. Managers can also broadcast weather alerts for specific areas, notifying all hikers in those regions through push notifications. This ensures that users stay informed about potentially hazardous conditions.\\\n",
346 | "The platform manager's role includes maintaining an organized and secure user system while facilitating collaboration between local guides, hut workers, and visitors.\\\n",
347 | "Hut workers are critical to the maintenance of up-to-date trail and accommodation information. After registering and being verified, hut workers can log into the platform to add or update information about their assigned huts, including uploading pictures and describing the facilities available. They can also monitor and report on the condition of nearby trails, ensuring hikers receive current information.\\\n",
348 | "Hut workers play a vital role in providing situational updates for hikers. For instance, if a nearby trail is impacted by severe weather or physical obstructions, they can communicate these conditions through the platform. This enhances the safety and preparedness of all hikers relying on the platform.\"\n",
349 | "\n",
350 | "\n",
351 | "sys_prompt = \"\"\n",
352 | "\n",
353 | "\n",
354 | "user_instruction = f\"\"\" \"\"\"\n",
355 | "\n",
356 | "\n",
357 | "messages = []\n",
358 | "\n",
359 | "prompt_router = \"\" #TODO call apply_chat_template() with the correct params\n",
360 | "\n",
361 | "\n",
362 | "# The response here will be only used for guiding the generation of the next prompt.\n",
363 | "# It must be one of two alternatives: \"Use Cases\" or \"Class Diagram\"\n",
364 | "response = make_a_query(prompt_router, max_new_tokens = 2000)\n",
365 | "\n",
366 | "print(response)"
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "### Step 2: Task-Specific LLM - Generate Output\n",
374 | "The second LLM agent, based on the decision of the router, will either:\n",
375 | "\n",
376 | "- Generate use cases, or\n",
377 | "- Generate a class diagram\n",
378 | "\n",
379 | "In use case diagram design, the primary components typically include actors, which are entities interacting with the system, and use cases, which represent the goals or tasks the actors want to achieve. The diagram focuses on the interactions between these actors and the system, illustrating the functional requirements of the system from a user perspective. For this simplified example, we are focusing only on user-goal use cases (i.e., main functions of the system).\n",
380 | "\n",
381 | "In class diagram design, typically, the primary elements extracted are classes, their attributes, methods, and the relationships between them. Classes represent entities within the system, and attributes define their properties or characteristics. Methods outline the actions or operations that can be performed on or by a class. Additionally, relationships like associations, inheritance, and dependencies are represented to show how different classes interact with one another. For this simplified example, we are focusing only on the classes, leaving the recognition of individual attributes to other prompts."
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": null,
387 | "metadata": {},
388 | "outputs": [],
389 | "source": [
390 | "\n",
391 | "import re\n",
392 | "import string\n",
393 | "\n",
394 | "#remove all non textual characters from the response\n",
395 | "fixed_response = \"\"\n",
396 | "\n",
397 | "print(\"Router selection: \", fixed_response)\n",
398 | "\n",
399 | "\n",
400 | "if fixed_response == \"Use Cases\":\n",
401 | "\n",
402 | " sys_prompt = \"\"\n",
403 | "\n",
404 | " shot_example =[]\n",
405 | "\n",
406 | " query = f\"\"\"\"\"\"\n",
407 | "\n",
408 | " messages = []\n",
409 | "\n",
410 | " prompt_use_cases = tokenizer.appy_chat_template(messages, tokenize = False, add_generation_prompt = True)\n",
411 | "\n",
412 | " response_uc = ask_llama(prompt_use_cases, max_new_tokens = 2000)\n",
413 | "\n",
414 | " print(response_uc)\n",
415 | "\n",
416 | "\n",
417 | "elif fixed_response == \"Class Diagram\":\n",
418 | "\n",
419 | " sys_prompt = \"\"\n",
420 | "\n",
421 | " shot_example =[]\n",
422 | "\n",
423 | " query = f\"\"\"\"\"\"\n",
424 | "\n",
425 | " messages = []\n",
426 | "\n",
427 | " prompt_class_diagram = tokenizer.appy_chat_template(messages, tokenize = False, add_generation_prompt = True)\n",
428 | "\n",
429 | " response_cd = ask_llama(prompt_class_diagram, max_new_tokens = 2000)\n",
430 | "\n",
431 | " print(response_cd)\n",
432 | "\n",
433 | "\n",
434 | "else : \n",
435 | "\n",
436 | " print(\"Unrecognized command from the user\")\n"
437 | ]
438 | },
439 | {
440 | "cell_type": "markdown",
441 | "metadata": {},
442 | "source": [
443 | "### Step 3: Reasoning\n",
444 | "\n",
445 | "Now reason about the following steps:\n",
446 | "- How can I evaluate the results? \n",
447 | "- How can I extend the prompts to provide other aspects of class and uml diagrams?\n",
448 | "- Try to execute the prompts with the ChatGPT engine. What are your results?"
449 | ]
450 | }
451 | ],
452 | "metadata": {
453 | "kernelspec": {
454 | "display_name": "Python 3",
455 | "language": "python",
456 | "name": "python3"
457 | },
458 | "language_info": {
459 | "codemirror_mode": {
460 | "name": "ipython",
461 | "version": 3
462 | },
463 | "file_extension": ".py",
464 | "mimetype": "text/x-python",
465 | "name": "python",
466 | "nbconvert_exporter": "python",
467 | "pygments_lexer": "ipython3",
468 | "version": "3.10.10"
469 | }
470 | },
471 | "nbformat": 4,
472 | "nbformat_minor": 2
473 | }
474 |
--------------------------------------------------------------------------------
/lab05/text-01-quantization.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Quantization in Large Language Models"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### **Introduction to Quantization**\n",
15 | "\n",
16 | "Quantization is a process used to reduce the memory requirements and computational complexity of large machine learning models. By representing model parameters with lower-precision values, quantization makes it possible to run models more efficiently on devices with limited memory and computational resources.\n",
17 | "\n",
18 | "For large language models (LLMs), quantization can:\n",
19 | "- **Reduce Memory Usage:** Lower-precision data types (such as int8) use less memory than higher-precision types (like float32), allowing models to fit into memory-constrained environments.\n",
20 | "- **Improve Inference Speed:** By using simpler operations on smaller data types, quantization can reduce the time it takes for a model to process inputs and generate outputs.\n",
21 | "- **Preserve Accuracy:** Quantization is carefully designed to minimize the impact on model accuracy, though a trade-off often exists between precision and efficiency.\n",
22 | "\n",
23 | "We will focus on Post-Training Quantization (PTQ), a quantization technique that applies quantization to a pre-trained model. PTQ is a popular method for quantizing large language models because it can be applied to a wide range of models that we may want to use in inference mode. By contrast, quantization-aware training (QAT) requires retraining the model with quantization in mind, which can be more complex and time-consuming.\n",
24 | "\n",
25 | "We will first get a general understanding of quantization by manually implementing two commonly adopted approaches: absmax and minmax (or zero-point). \n",
26 | "\n",
27 | "Next, we will explore two different ways (one using PyTorch, the other using HuggingFace) to run a **dynamic quantization** (i.e., PTQ where only the weights are quantized, and not the activations). "
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "## 1. Absmax and Minmax Quantization\n",
35 | "\n",
36 | "The goal of quantization is, remember, mapping continuos values (e.g., float32) into a discrete set of values (e.g., int8). \n",
37 | "\n",
38 | "So let's create a matrix $W$ and a vector $x$ to be quantized. Let's initialize them randomly (but, just to see what happens, let's set W[0,0] and x[0] = 0)."
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 1,
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "import torch\n",
48 | "\n",
49 | "torch.random.manual_seed(0)\n",
50 | "\n",
51 | "n_rows = 3\n",
52 | "n_cols = 5\n",
53 | "\n",
54 | "# TODO: Create two random tensors, W (of shape n_rows, n_cols) and x (of shape n_cols)\n",
55 | "W = ...\n",
56 | "x = ...\n",
57 | "\n",
58 | "W[0,0] = 0\n",
59 | "x[0] = 0"
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "Let's first compute the matrix multiplication to observe the result. This is the operation that we typically want to execute, and that we want to quantize."
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "We will quantize $W$ and $x$ separately, and then multiply them together. Finally, we will need to dequantize the result to compare it with the original result."
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": null,
79 | "metadata": {},
80 | "outputs": [],
81 | "source": [
82 | "out = W @ x\n",
83 | "print(out)"
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "## 1.1 Absmax quantization\n",
91 | "\n",
92 | "In absmax quantization, we use a symmetric range around 0. This means that we need to identify the maximum absolute value in the matrix $W$ and the vector $x$.\n",
93 | "\n",
94 | "We define a function `absmax_quantize` that takes as input any tensor and produces a version of the same tensor, but quantized. "
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 3,
100 | "metadata": {},
101 | "outputs": [],
102 | "source": [
103 | "def absmax_quantize(W):\n",
104 | " # NOTE: we assume that we always map to 8-bit integers\n",
105 | "\n",
106 | " # TODO: find the scale factor that maps the maximum absolute value in W to the maximum value of int8\n",
107 | " max_value = ...\n",
108 | " scale = ... # how \"long\" the step between any two int8 values is\n",
109 | "\n",
110 | " # TODO: quantize W using the scale factor (hint: remember to round to the nearest integer and convert to int8)\n",
111 | " W_q = ...\n",
112 | " return W_q, scale"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "Notice, we return both the quantized tensor and the scale factor. The scale factor is used to dequantize the tensor. So we might as well define a dequantize function:"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 4,
125 | "metadata": {},
126 | "outputs": [],
127 | "source": [
128 | "def absmax_dequantize(W_q, scale):\n",
129 | " # TODO: dequantize W_q using the scale factor\n",
130 | " dequantized_W = ...\n",
131 | " return dequantized_W"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "Let's get the quantized version of W, and of x. Then, we can check how much we are losing by quantizing the values."
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": 5,
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "W_q, scale_W = absmax_quantize(W)\n",
148 | "x_q, scale_x = absmax_quantize(x)"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": null,
154 | "metadata": {},
155 | "outputs": [],
156 | "source": [
157 | "print(W_q)\n",
158 | "print(W)\n",
159 | "\n",
160 | "#TODO: dequantize W_q and check how close it is to the original W\n",
161 | "W_deq = ...\n",
162 | "print(W_deq)\n",
163 | "W_diff = ...\n",
164 | "print(W_diff)"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": null,
170 | "metadata": {},
171 | "outputs": [],
172 | "source": [
173 | "print(x_q)\n",
174 | "print(x)\n",
175 | "\n",
176 | "# TODO: dequantize x_q and check how close it is to the original x\n",
177 | "x_deq = ...\n",
178 | "print(x_deq)\n",
179 | "x_diff = ...\n",
180 | "print(x_diff)"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "Notice that, in both cases, absmax maps the value 0 to 0. This is a good property, as it allows us to represent the zero value without losing any information. This property stems from the symmetry around 0 we imposed.\n",
188 | "\n",
189 | "However, do note that we are also \"wasting\" some bits of the range! Can you spot where?\n",
190 | "\n",
191 | "Let's now compute the matrix multiplication between W_q and x_q. "
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": null,
197 | "metadata": {},
198 | "outputs": [],
199 | "source": [
200 | "# TODO: perform the matrix multiplication using the quantized values\n",
201 | "product = ... \n",
202 | "print(product)"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {},
208 | "source": [
209 | "Can you see that there's something wrong? Let's see what one of the rows of W_q and x_q contain:"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": null,
215 | "metadata": {},
216 | "outputs": [],
217 | "source": [
218 | "W_q[0], x_q"
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {},
224 | "source": [
225 | "The dot product of these two vectors definitely isn't what we get as the first number of the matrix multiplication -- i.e. (W_q @ x_q)[0]. Indeed, we can run as int16, and see that the result is quite different:"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {},
232 | "outputs": [],
233 | "source": [
234 | "# TODO: perform the matrix multiplication using the quantized values converted to int16\n",
235 | "out_q = ... \n",
236 | "print(out_q)"
237 | ]
238 | },
239 | {
240 | "cell_type": "markdown",
241 | "metadata": {},
242 | "source": [
243 | "The result of the dot product overflows the int8 range. This is a well-known problem. Indeed, the accumulation of results, in quantization, is typically done with higher precision than the single values. This is tricky to do in pure Python/PyTorch, but can be done efficiently in other ways.\n",
244 | "\n",
245 | "Let's stick to the simple approach for now. "
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "To get the correct result, we need to dequantize the result. This is done by multiplying the result by the scale factor of the two operands."
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": null,
258 | "metadata": {},
259 | "outputs": [],
260 | "source": [
261 | "# TODO: dequantize the result and check how close it is to the original result\n",
262 | "out_deq = ...\n",
263 | "\n",
264 | "print(out_deq)\n",
265 | "print(out)"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "Remember, our goal was `out`. How much did we lose by quantizing and dequantizing?"
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": null,
278 | "metadata": {},
279 | "outputs": [],
280 | "source": [
281 | "#TODO: check the difference between the original and the dequantized result\n",
282 | "out_diff = ..."
283 | ]
284 | },
285 | {
286 | "cell_type": "markdown",
287 | "metadata": {},
288 | "source": [
289 | "## 2. Minmax Quantization\n",
290 | "\n",
291 | "In minmax quantization, we use the minimum and maximum values in the matrix $W$ and the vector $x$ to define the range. In this way, we get a range that is as tight as possible around the values we are quantizing. This will, however, change the zero value, which will not be mapped to 0 anymore.\n",
292 | "\n"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": 14,
298 | "metadata": {},
299 | "outputs": [],
300 | "source": [
301 | "def minmax_quantize(W):\n",
302 | " # the following notations come from:\n",
303 | " # (1) scaling W to [0,1] ==> W' = (W - min(W)) / (max(W) - min(W)),\n",
304 | " # (2) scaling W' to [-128, 127] ==> W_q = W' * 255 - 128\n",
305 | " # by combining the two, we get that:\n",
306 | " # W_q = W * scale + offset\n",
307 | " # we will call the offset \"zero_point\", as it represent the value that maps to 0\n",
308 | " \n",
309 | " # TODO: find the scale factor that maps the minimum and maximum values in W to -128 and 127\n",
310 | " delta = ... # the range of values in W\n",
311 | " scale = ...\n",
312 | " zero_point = ... # the value that maps to 0\n",
313 | "\n",
314 | " # TODO: quantize W using the scale factor (hint: remember to round to the nearest integer and convert to int8)\n",
315 | " W_q = ...\n",
316 | " \n",
317 | " return W_q, scale, zero_point\n",
318 | "\n",
319 | "def minmax_dequantize(W_q, scale, zero_point):\n",
320 | " # TODO: dequantize W_q using the scale factor and zero_point\n",
321 | " dequantized_W = ...\n",
322 | " return dequantized_W"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": 15,
328 | "metadata": {},
329 | "outputs": [],
330 | "source": [
331 | "W_q, scale_W, zero_point_W = minmax_quantize(W)\n",
332 | "x_q, scale_x, zero_point_x = minmax_quantize(x)"
333 | ]
334 | },
335 | {
336 | "cell_type": "markdown",
337 | "metadata": {},
338 | "source": [
339 | "Let's see the results for $W$ (same considerations will apply for $x$)."
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": null,
345 | "metadata": {},
346 | "outputs": [],
347 | "source": [
348 | "print(W_q)\n",
349 | "print(W)\n",
350 | "\n",
351 | "#TODO: dequantize W_q and check how close it is to the original W\n",
352 | "W_deq = ...\n",
353 | "print(W_deq)\n",
354 | "W_diff = ...\n",
355 | "print(W_diff)\n",
356 | "print(\"zero point\", zero_point_W)"
357 | ]
358 | },
359 | {
360 | "cell_type": "markdown",
361 | "metadata": {},
362 | "source": [
363 | "First, notice that 0 no longer maps to 0! Indeed, it maps to zero_point_W (after rounding). This implies that the dequantization of 0 will no longer be 0. This may be a problem!\n",
364 | "\n",
365 | "But, notice that we are using the full range of the int8 values. This means that we are not wasting any bits of the range! (the minimum value is -128, the maximum value is 127). This can also be seen in the average absolute error, which is lower than what we had with absmax.\n",
366 | "\n",
367 | "Similarly to what we did before, let's compute the output of the operation, and then dequantize it!"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 17,
373 | "metadata": {},
374 | "outputs": [],
375 | "source": [
376 | "#TODO: perform the matrix multiplication using the quantized values converted to int16\n",
377 | "out_q = ..."
378 | ]
379 | },
380 | {
381 | "cell_type": "markdown",
382 | "metadata": {},
383 | "source": [
384 | "The dequantification is a bit trickier, in this case. Can you figure out why we need the following operations?\n",
385 | "\n",
386 | "Hint: consider the transformation we are applying to each value (value * scale + zero_point). What happens when we compute the dot product?"
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": 18,
392 | "metadata": {},
393 | "outputs": [],
394 | "source": [
395 | "out_deq = (out_q - W.shape[1] * zero_point_W * zero_point_x - W.sum(axis=1) * scale_W * zero_point_x - x.sum() * scale_x * zero_point_W) / (scale_W * scale_x)"
396 | ]
397 | },
398 | {
399 | "cell_type": "code",
400 | "execution_count": null,
401 | "metadata": {},
402 | "outputs": [],
403 | "source": [
404 | "print(out_deq)\n",
405 | "print(out)\n",
406 | "print((out_deq - out).abs().mean())"
407 | ]
408 | },
409 | {
410 | "cell_type": "markdown",
411 | "metadata": {},
412 | "source": [
413 | "Extra stuff!\n",
414 | "\n",
415 | "We could have computed scales and zero points at different granularities (e.g., for each row, or column of $W$). How would that have changed the results? What changes would we have to do to the code?"
416 | ]
417 | },
418 | {
419 | "cell_type": "markdown",
420 | "metadata": {},
421 | "source": [
422 | "# Dynamic quantization"
423 | ]
424 | },
425 | {
426 | "cell_type": "markdown",
427 | "metadata": {},
428 | "source": [
429 | "In this second part, we will apply dynamic quantization by using PyTorch or HuggingFace (with BitsAndBytes). We will quantize both to 8 and to 4 bits, and we will see how that affects LLMs (in terms of memory and speed). "
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": 20,
435 | "metadata": {},
436 | "outputs": [],
437 | "source": [
438 | "import torch\n",
439 | "import os\n",
440 | "import time \n",
441 | "from transformers import AutoTokenizer, AutoModelForCausalLM"
442 | ]
443 | },
444 | {
445 | "cell_type": "code",
446 | "execution_count": null,
447 | "metadata": {},
448 | "outputs": [],
449 | "source": [
450 | "from huggingface_hub import login\n",
451 | "\n",
452 | "# TODO: Login to the Hugging Face model hub to be able to upload models\n",
453 | "token = ...\n",
454 | "\n",
455 | "login(token=token)"
456 | ]
457 | },
458 | {
459 | "cell_type": "markdown",
460 | "metadata": {},
461 | "source": [
462 | "First, let's load our model (Llama 3.2 1B) and let's see some base statistics (memory usage, inference time)."
463 | ]
464 | },
465 | {
466 | "cell_type": "code",
467 | "execution_count": 22,
468 | "metadata": {},
469 | "outputs": [],
470 | "source": [
471 | "model_id = \"meta-llama/Llama-3.2-1B\"\n",
472 | "#TODO: load the model and tokenizer \n",
473 | "model = ... \n",
474 | "tokenizer = ...\n",
475 | "\n",
476 | "tokenizer.pad_token = tokenizer.eos_token"
477 | ]
478 | },
479 | {
480 | "cell_type": "code",
481 | "execution_count": null,
482 | "metadata": {},
483 | "outputs": [],
484 | "source": [
485 | "def get_model_size(model):\n",
486 | " \"\"\"Get the size of the model in MB\"\"\"\n",
487 | " torch.save(model.state_dict(), \"temp.pth\")\n",
488 | " size = os.path.getsize(\"temp.pth\") / 1e6 # size in \"MB\" (technically, it should be 1024**2, but we approximate to 1e6 to get an easier conversion #params <=> MB)\n",
489 | " os.remove(\"temp.pth\")\n",
490 | " return size\n",
491 | "\n",
492 | "print(f\"Model size before quantization {(get_model_size(model)):.2f} MB\")"
493 | ]
494 | },
495 | {
496 | "cell_type": "markdown",
497 | "metadata": {},
498 | "source": [
499 | "Wait, wasn't Llama 1B supposed to be 4GB (4 bytes * 1B parameters)? Why do we get ~ 5 GB (i.e., 1.25B parameters)?\n",
500 | "\n",
501 | "We are not considering the parameters used in the embedding layer (you can count how many parameters you have in the embedding layer and see that it matches the difference). \n",
502 | "\n",
503 | "Additionally, the count does not include the `lm_head`, i.e. the layer used to go from the hidden states to the logits. This is because in Llama (and other models) the `lm_head` is shared with the embedding layer, so it is not counted twice."
504 | ]
505 | },
506 | {
507 | "cell_type": "code",
508 | "execution_count": null,
509 | "metadata": {},
510 | "outputs": [],
511 | "source": [
512 | "text = \"The secret of life is\"\n",
513 | "# Notice we use a batch of 20 sentences -- we will get better results\n",
514 | "# on quantized models when processing a batch of inputs\n",
515 | "text = [text]*20\n",
516 | "\n",
517 | "#TODO: encode the text using the tokenizer\n",
518 | "inputs = ...\n",
519 | "\n",
520 | "tic = time.time()\n",
521 | "\n",
522 | "with torch.no_grad():\n",
523 | " #TODO: generate the output from the model\n",
524 | " baseline_output = ...\n",
525 | "\n",
526 | "elapsed_time = time.time() - tic\n",
527 | "\n",
528 | "#TODO: decode the output\n",
529 | "baseline_decoded = ...\n",
530 | "\n",
531 | "print(\"Baseline model output:\", baseline_decoded)\n",
532 | "print(\"\\nTime taken for baseline model:\", elapsed_time)"
533 | ]
534 | },
535 | {
536 | "cell_type": "markdown",
537 | "metadata": {},
538 | "source": [
539 | "Dynamic quantization applies lower precision to model weights and activations at runtime. This method doesn’t require modifications to the model architecture or retraining, which makes it relatively easy to apply.\n",
540 | "\n",
541 | "- **Advantages:** \n",
542 | " - Quick to implement with minimal changes. No calibration step is needed.\n",
543 | "\n",
544 | "- **Limitations:** \n",
545 | " - Activations are not pre-quantized, meaning some precision is maintained but at the cost of slightly higher resource use at inference time."
546 | ]
547 | },
548 | {
549 | "cell_type": "markdown",
550 | "metadata": {},
551 | "source": [
552 | "We can use the `quantize_dynamic()` function, available in PyTorch, to apply dynamic quantization to a model.\n",
553 | "\n",
554 | "We can specify a set of layer types to be quantize. Let's stick with Linear layers. We specify the desired type (represented by torch.qint8) , and off we go!"
555 | ]
556 | },
557 | {
558 | "cell_type": "code",
559 | "execution_count": null,
560 | "metadata": {},
561 | "outputs": [],
562 | "source": [
563 | "quantized_model = torch.quantization.quantize_dynamic(\n",
564 | " model, {torch.nn.Linear}, dtype=torch.qint8\n",
565 | ").to('cpu')\n",
566 | "\n",
567 | "# Model size after quantization\n",
568 | "print(f\"Model size after quantization {(get_model_size(quantized_model)):.2f} MB\")"
569 | ]
570 | },
571 | {
572 | "cell_type": "markdown",
573 | "metadata": {},
574 | "source": [
575 | "Okay -- 2.3GB? Why not 5GB / 4 = 1.25GB? After all, we are going from float32 to int8. \n",
576 | "\n",
577 | "That's correct -- technically. Except, we are only encoding linear layers, and not the embedding layer. That means that, of the original 1.25B parameters, we are only quantizing 1B. The rest, in the embedding layer, is kept as float32.\n",
578 | "\n",
579 | "If you run the numbers, though, you should still find a problem: 1B * 1 byte + 0.25B * 4 bytes = 2GB. What about the rest? There's one more thing: remember, the `lm_head` was shared with the Embedding layer. However, since it is \"copied\" into a linear layer in Llama, the quantization process will quantize it as well. So that's an extra 0.25B parameters encoded as int8 -- hence 2.3GB.\n",
580 | "\n",
581 | "Finally, we could technically also quantize the embeddings (it has been introduced in later versions of PyTorch), but for simplicity we will not do it here (it would require some additional steps)."
582 | ]
583 | },
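{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is the back-of-the-envelope arithmetic behind the 2.3GB figure (the parameter counts are rounded, so treat the result as approximate):\n",
"\n",
"```python\n",
"linear_params    = 1.0e9   # quantized to int8 (1 byte each)\n",
"embedding_params = 0.25e9  # kept in float32 (4 bytes each)\n",
"lm_head_params   = 0.25e9  # the tied head, quantized again as a Linear layer (1 byte each)\n",
"\n",
"size_bytes = linear_params * 1 + embedding_params * 4 + lm_head_params * 1\n",
"print(f\"~{size_bytes / 1e9:.2f} GB\")  # ~2.25 GB, close to the measured size\n",
"```"
]
},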
584 | {
585 | "cell_type": "code",
586 | "execution_count": null,
587 | "metadata": {},
588 | "outputs": [],
589 | "source": [
590 | "quantized_model"
591 | ]
592 | },
593 | {
594 | "cell_type": "code",
595 | "execution_count": null,
596 | "metadata": {},
597 | "outputs": [],
598 | "source": [
599 | "tic = time.time()\n",
600 | "\n",
601 | "with torch.no_grad():\n",
602 | " #TODO: generate the output from the quantized model\n",
603 | " output = ...\n",
604 | "\n",
605 | "elapsed_time = time.time() - tic\n",
606 | "\n",
607 | "#TODO: decode the output\n",
608 | "output_decoded = ...\n",
609 | "\n",
610 | "print(\"Quantized model output:\", output_decoded)\n",
611 | "print(\"\\nTime taken for baseline model:\", elapsed_time)"
612 | ]
613 | },
614 | {
615 | "cell_type": "markdown",
616 | "metadata": {},
617 | "source": [
618 | "Hugging Face provides several built-in quantization options, each suited to different model and deployment needs:\n",
619 | "https://huggingface.co/docs/transformers/v4.46.0/quantization/overview\n",
620 | "\n",
621 | "For this lab, we will use `BitsAndBytes`."
622 | ]
623 | },
624 | {
625 | "cell_type": "code",
626 | "execution_count": null,
627 | "metadata": {},
628 | "outputs": [],
629 | "source": [
630 | "from transformers import BitsAndBytesConfig\n",
631 | "\n",
632 | "quantization_config = BitsAndBytesConfig(load_in_4bit=True)\n",
633 | "\n",
634 | "quantized_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)\n",
635 | "\n",
636 | "print(f\"Model size after quantization: {get_model_size(quantized_model)} MB\")"
637 | ]
638 | },
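{
"cell_type": "markdown",
"metadata": {},
"source": [
"`BitsAndBytesConfig` exposes a few more knobs for 4-bit loading. The values below are only an illustrative sketch (whether they help depends on your model and hardware; see the BitsAndBytes documentation):\n",
"\n",
"```python\n",
"from transformers import BitsAndBytesConfig\n",
"import torch\n",
"\n",
"nf4_config = BitsAndBytesConfig(\n",
"    load_in_4bit=True,\n",
"    bnb_4bit_quant_type=\"nf4\",              # NormalFloat4 instead of plain FP4\n",
"    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants\n",
"    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual matmuls\n",
")\n",
"```"
]
},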
639 | {
640 | "cell_type": "code",
641 | "execution_count": null,
642 | "metadata": {},
643 | "outputs": [],
644 | "source": [
645 | "tic = time.time()\n",
646 | "\n",
647 | "with torch.no_grad():\n",
648 | " #TODO: generate the output from the quantized model\n",
649 | " output = ...\n",
650 | "\n",
651 | "elapsed_time = time.time() - tic\n",
652 | "\n",
653 | "#TODO: decode the output\n",
654 | "output_decoded = ...\n",
655 | "\n",
656 | "print(\"\\nquantized model output:\", output_decoded)\n",
657 | "print(\"\\nTime taken for baseline model:\", elapsed_time)"
658 | ]
659 | }
660 | ],
661 | "metadata": {
662 | "kernelspec": {
663 | "display_name": "Python 3",
664 | "language": "python",
665 | "name": "python3"
666 | },
667 | "language_info": {
668 | "codemirror_mode": {
669 | "name": "ipython",
670 | "version": 3
671 | },
672 | "file_extension": ".py",
673 | "mimetype": "text/x-python",
674 | "name": "python",
675 | "nbconvert_exporter": "python",
676 | "pygments_lexer": "ipython3",
677 | "version": "3.10.12"
678 | }
679 | },
680 | "nbformat": 4,
681 | "nbformat_minor": 2
682 | }
683 |
--------------------------------------------------------------------------------
/lab05/solution-01-quantization.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Quantization in Large Language Models"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### **Introduction to Quantization**\n",
15 | "\n",
16 | "Quantization is a process used to reduce the memory requirements and computational complexity of large machine learning models. By representing model parameters with lower-precision values, quantization makes it possible to run models more efficiently on devices with limited memory and computational resources.\n",
17 | "\n",
18 | "For large language models (LLMs), quantization can:\n",
19 | "- **Reduce Memory Usage:** Lower-precision data types (such as int8) use less memory than higher-precision types (like float32), allowing models to fit into memory-constrained environments.\n",
20 | "- **Improve Inference Speed:** By using simpler operations on smaller data types, quantization can reduce the time it takes for a model to process inputs and generate outputs.\n",
21 | "- **Preserve Accuracy:** Quantization is carefully designed to minimize the impact on model accuracy, though a trade-off often exists between precision and efficiency.\n",
22 | "\n",
23 | "We will focus on Post-Training Quantization (PTQ), a quantization technique that applies quantization to a pre-trained model. PTQ is a popular method for quantizing large language models because it can be applied to a wide range of models that we may want to use in inference mode. By contrast, quantization-aware training (QAT) requires retraining the model with quantization in mind, which can be more complex and time-consuming.\n",
24 | "\n",
25 | "We will first get a general understanding of quantization by manually implementing two commonly adopted approaches: absmax and minmax (or zero-point). \n",
26 | "\n",
27 | "Next, we will explore two different ways (one using PyTorch, the other using HuggingFace) to run a **dynamic quantization** (i.e., PTQ where only the weights are quantized, and not the activations). "
28 | ]
29 | },
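{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a compact reference, these are the two mappings we are about to implement (written for signed 8-bit targets, consistent with the code below):\n",
"\n",
"$$x_q = \\mathrm{round}\\left(\\frac{127}{\\max|x|}\\, x\\right), \\qquad \\hat{x} = \\frac{x_q}{s}, \\qquad s = \\frac{127}{\\max|x|} \\qquad \\text{(absmax)}$$\n",
"\n",
"$$x_q = \\mathrm{round}(s\\, x + z), \\qquad \\hat{x} = \\frac{x_q - z}{s} \\qquad \\text{(minmax / zero-point)}$$\n",
"\n",
"where for minmax the scale $s$ and zero point $z$ are chosen so that $[\\min x, \\max x]$ maps onto $[-128, 127]$ (the exact expressions are derived in the minmax section)."
]
},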
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "## 1. Absmax and Minmax Quantization\n",
35 | "\n",
36 | "The goal of quantization is, remember, mapping continuos values (e.g., float32) into a discrete set of values (e.g., int8). \n",
37 | "\n",
38 | "So let's create a matrix $W$ and a vector $x$ to be quantized. Let's initialize them randomly (but, just to see what happens, let's set W[0,0] and x[0] = 0)."
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "import torch\n",
48 | "\n",
49 | "torch.random.manual_seed(0)\n",
50 | "\n",
51 | "n_rows = 3\n",
52 | "n_cols = 5\n",
53 | "W = torch.randn(n_rows, n_cols)\n",
54 | "x = torch.randn(n_cols)\n",
55 | "\n",
56 | "W[0,0] = 0\n",
57 | "x[0] = 0"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": null,
63 | "metadata": {},
64 | "outputs": [],
65 | "source": [
66 | "W.shape, x.shape"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "Let's first compute the matrix multiplication to observe the result. This is the operation that we typically want to execute, and that we want to quantize."
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "We will quantize $W$ and $x$ separately, and then multiply them together. Finally, we will need to dequantize the result to compare it with the original result."
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "out = W @ x\n",
90 | "print(out)"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "## 1.1 Absmax quantization\n",
98 | "\n",
99 | "In absmax quantization, we use a symmetric range around 0. This means that we need to identify the maximum absolute value in the matrix $W$ and the vector $x$.\n",
100 | "\n",
101 | "We define a function `absmax_quantize` that takes as input any tensor and produces a version of the same tensor, but quantized. "
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": null,
107 | "metadata": {},
108 | "outputs": [],
109 | "source": [
110 | "def absmax_quantize(W):\n",
111 | " # NOTE: we assume that we always map to 8-bit integers\n",
112 | " max_value = W.abs().max()\n",
113 | " scale = 127 / max_value # how \"long\" the step between any two int8 values is\n",
114 | " W_q = (W * scale).round().to(torch.int8)\n",
115 | " return W_q, scale"
116 | ]
117 | },
118 | {
119 | "cell_type": "markdown",
120 | "metadata": {},
121 | "source": [
122 | "Notice, we return both the quantized tensor and the scale factor. The scale factor is used to dequantize the tensor. So we might as well define a dequantize function:"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "def absmax_dequantize(W_q, scale):\n",
132 | " return W_q.float() / scale"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "Let's get the quantized version of W, and of x. Then, we can check how much we are losing by quantizing the values."
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "metadata": {},
146 | "outputs": [],
147 | "source": [
148 | "W_q, scale_W = absmax_quantize(W)\n",
149 | "x_q, scale_x = absmax_quantize(x)"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": null,
155 | "metadata": {},
156 | "outputs": [],
157 | "source": [
158 | "# print(W_q)\n",
159 | "print(W)\n",
160 | "W_deq = absmax_dequantize(W_q, scale_W)\n",
161 | "print(W_deq)\n",
162 | "print((W - W_deq).abs().mean())"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": null,
168 | "metadata": {},
169 | "outputs": [],
170 | "source": [
171 | "print(x_q)\n",
172 | "print(x)\n",
173 | "x_deq = absmax_dequantize(x_q, scale_x)\n",
174 | "print(x_deq)\n",
175 | "print((x - x_deq).abs().mean())"
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {},
181 | "source": [
182 | "Notice that, in both cases, absmax maps the value 0 to 0. This is a good property, as it allows us to represent the zero value without losing any information. This property stems from the symmetry around 0 we imposed.\n",
183 | "\n",
184 | "However, do note that we are also \"wasting\" some bits of the range! Can you spot where?\n",
185 | "\n",
186 | "Let's now compute the matrix multiplication between W_q and x_q. "
187 | ]
188 | },
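{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to see the wasted part of the range, reusing `absmax_quantize` from above: with the symmetric mapping the code -128 is never produced, and if a tensor happens to contain only non-negative values the whole negative half of the range goes unused.\n",
"\n",
"```python\n",
"t = torch.rand(1000)         # non-negative values only\n",
"t_q, _ = absmax_quantize(t)\n",
"print(t_q.min(), t_q.max())  # codes stay in [0, 127]: half of the int8 range is unused\n",
"print(W_q.min(), W_q.max())  # even for W, the symmetric mapping never emits -128\n",
"```\n",
"\n",
"With that noted, back to the multiplication."
]
},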
189 | {
190 | "cell_type": "code",
191 | "execution_count": null,
192 | "metadata": {},
193 | "outputs": [],
194 | "source": [
195 | "W_q @ x_q"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "Can you see that there's something wrong? Let's see what one of the rows of W_q and x_q contain:"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": null,
208 | "metadata": {},
209 | "outputs": [],
210 | "source": [
211 | "W_q[0], x_q"
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": null,
217 | "metadata": {},
218 | "outputs": [],
219 | "source": [
220 | "a = torch.tensor(127, dtype=torch.int8)\n",
221 | "b = torch.tensor(1, dtype=torch.int8)\n",
222 | "c = a + b \n",
223 | "print(c) "
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {},
229 | "source": [
230 | "The dot product of these two vectors definitely isn't what we get as the first number of the matrix multiplication -- i.e. (W_q @ x_q)[0]. Indeed, we can run as int16, and see that the result is quite different:"
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": null,
236 | "metadata": {},
237 | "outputs": [],
238 | "source": [
239 | "W_q.to(torch.int16) @ x_q.to(torch.int16)"
240 | ]
241 | },
242 | {
243 | "cell_type": "markdown",
244 | "metadata": {},
245 | "source": [
246 | "The result of the dot product overflows the int8 range. This is a well-known problem. Indeed, the accumulation of results, in quantization, is typically done with higher precision than the single values. This is tricky to do in pure Python/PyTorch, but can be done efficiently in other ways.\n",
247 | "\n",
248 | "Let's stick to the simple approach for now. "
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": null,
254 | "metadata": {},
255 | "outputs": [],
256 | "source": [
257 | "out_q = W_q.to(torch.int16) @ x_q.to(torch.int16)"
258 | ]
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "To get the correct result, we need to dequantize the result. This is done by multiplying the result by the scale factor of the two operands."
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": null,
270 | "metadata": {},
271 | "outputs": [],
272 | "source": [
273 | "out_deq = absmax_dequantize(absmax_dequantize(out_q, scale_W), scale_x)\n",
274 | "# Or, alternatively:\n",
275 | "# out_deq = out_q / scale_W / scale_x\n",
276 | "# out_deq = absmax_dequantize(out_q, scale_W * scale_x)\n",
277 | "print(out_deq)\n",
278 | "print(out)"
279 | ]
280 | },
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {},
284 | "source": [
285 | "Remember, our goal was `out`. How much did we lose by quantizing and dequantizing?"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": null,
291 | "metadata": {},
292 | "outputs": [],
293 | "source": [
294 | "(out_deq - out).abs().mean()"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 | "## 2. Minmax Quantization\n",
302 | "\n",
303 | "In minmax quantization, we use the minimum and maximum values in the matrix $W$ and the vector $x$ to define the range. In this way, we get a range that is as tight as possible around the values we are quantizing. This will, however, change the zero value, which will not be mapped to 0 anymore.\n",
304 | "\n"
305 | ]
306 | },
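{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before looking at the code, it helps to spell out where the constants come from. We first rescale $W$ to $[0, 1]$ and then to $[-128, 127]$:\n",
"\n",
"$$W' = \\frac{W - \\min W}{\\max W - \\min W}, \\qquad W_q = \\mathrm{round}\\left(255\\, W' - 128\\right).$$\n",
"\n",
"Expanding and grouping the terms, with $\\Delta = \\max W - \\min W$, gives the affine form used in the code:\n",
"\n",
"$$W_q = \\mathrm{round}\\left(\\frac{255}{\\Delta}\\, W - \\frac{128 \\max W + 127 \\min W}{\\Delta}\\right), \\qquad \\text{scale} = \\frac{255}{\\Delta}, \\qquad \\text{zero\\_point} = -\\frac{128 \\max W + 127 \\min W}{\\Delta}.$$"
]
},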
307 | {
308 | "cell_type": "code",
309 | "execution_count": null,
310 | "metadata": {},
311 | "outputs": [],
312 | "source": [
313 | "def minmax_quantize(W):\n",
314 | " # the following notations come from:\n",
315 | " # (1) scaling W to [0,1] ==> W' = (W - min(W)) / (max(W) - min(W)),\n",
316 | " # (2) scaling W' to [-128, 127] ==> W_q = W' * 255 - 128\n",
317 | " # by combining the two, we get that:\n",
318 | " # W_q = W * scale + offset\n",
319 | " # we will call the offset \"zero_point\", as it represent the value that maps to 0\n",
320 | " delta = W.max() - W.min()\n",
321 | " scale = 255 / delta\n",
322 | " zero_point = -(128*W.max() + 127*W.min()) / delta\n",
323 | " W_q = (W * scale + zero_point).round().to(torch.int8)\n",
324 | " return W_q, scale, zero_point\n",
325 | "\n",
326 | "def minmax_dequantize(W_q, scale, zero_point):\n",
327 | " return (W_q.float() - zero_point) / scale"
328 | ]
329 | },
330 | {
331 | "cell_type": "code",
332 | "execution_count": null,
333 | "metadata": {},
334 | "outputs": [],
335 | "source": [
336 | "W_q, scale_W, zero_point_W = minmax_quantize(W)\n",
337 | "x_q, scale_x, zero_point_x = minmax_quantize(x)"
338 | ]
339 | },
340 | {
341 | "cell_type": "markdown",
342 | "metadata": {},
343 | "source": [
344 | "Let's see the results for $W$ (same considerations will apply for $x$)."
345 | ]
346 | },
347 | {
348 | "cell_type": "code",
349 | "execution_count": null,
350 | "metadata": {},
351 | "outputs": [],
352 | "source": [
353 | "print(W_q)\n",
354 | "print(W)\n",
355 | "W_deq = minmax_dequantize(W_q, scale_W, zero_point_W)\n",
356 | "print(W_deq)\n",
357 | "print((W - W_deq).abs().mean())"
358 | ]
359 | },
360 | {
361 | "cell_type": "markdown",
362 | "metadata": {},
363 | "source": [
364 | "First, notice that 0 no longer maps to 0! Indeed, it maps to zero_point_W (after rounding). This implies that the dequantization of 0 will no longer be 0. This may be a problem!\n",
365 | "\n",
366 | "But, notice that we are using the full range of the int8 values. This means that we are not wasting any bits of the range! (the minimum value is -128, the maximum value is 127). This can also be seen in the average absolute error, which is lower than what we had with absmax.\n",
367 | "\n",
368 | "Similarly to what we did before, let's compute the output of the operation, and then dequantize it!"
369 | ]
370 | },
371 | {
372 | "cell_type": "code",
373 | "execution_count": null,
374 | "metadata": {},
375 | "outputs": [],
376 | "source": [
377 | "out_q = W_q.to(torch.int16) @ x_q.to(torch.int16)\n",
378 | "print(out_q)"
379 | ]
380 | },
381 | {
382 | "cell_type": "markdown",
383 | "metadata": {},
384 | "source": [
385 | "The dequantification is a bit trickier, in this case. Can you figure out why we need the following operations?\n",
386 | "\n",
387 | "Hint: consider the transformation we are applying to each value (value * scale + zero_point). What happens when we compute the dot product?"
388 | ]
389 | },
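{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a sketch of the reasoning behind the expression in the next cell. Ignoring rounding, each quantized entry is $W_{q,ij} \\approx s_W W_{ij} + z_W$ and $x_{q,j} \\approx s_x x_j + z_x$, so the quantized dot product expands into four terms:\n",
"\n",
"$$\\sum_j W_{q,ij}\\, x_{q,j} \\approx s_W s_x \\sum_j W_{ij} x_j + s_W z_x \\sum_j W_{ij} + s_x z_W \\sum_j x_j + n\\, z_W z_x,$$\n",
"\n",
"where $n$ is the number of columns of $W$. Solving for $\\sum_j W_{ij} x_j$ gives exactly the correction below: subtract the three spurious terms from `out_q` and divide by $s_W s_x$."
]
},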
390 | {
391 | "cell_type": "code",
392 | "execution_count": null,
393 | "metadata": {},
394 | "outputs": [],
395 | "source": [
396 | "out_deq = (out_q - W.shape[1] * zero_point_W * zero_point_x - W.sum(axis=1) * scale_W * zero_point_x - x.sum() * scale_x * zero_point_W) / (scale_W * scale_x)"
397 | ]
398 | },
399 | {
400 | "cell_type": "code",
401 | "execution_count": null,
402 | "metadata": {},
403 | "outputs": [],
404 | "source": [
405 | "print(out_deq)\n",
406 | "print(out)\n",
407 | "print((out_deq - out).abs().mean())"
408 | ]
409 | },
410 | {
411 | "cell_type": "markdown",
412 | "metadata": {},
413 | "source": [
414 | "Extra stuff!\n",
415 | "\n",
416 | "We could have computed scales and zero points at different granularities (e.g., for each row, or column of $W$). How would that have changed the results? What changes would we have to do to the code?"
417 | ]
418 | },
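{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hint for the per-row case, here is a minimal sketch of a row-wise (per-channel) absmax variant: each row of $W$ gets its own scale, which usually tightens the ranges and reduces the error. Note that the dequantization of a matrix-vector product then has to broadcast the per-row scales accordingly.\n",
"\n",
"```python\n",
"def absmax_quantize_per_row(W):\n",
"    # one scale per row, shape (n_rows, 1), broadcast over the columns\n",
"    max_per_row = W.abs().max(dim=1, keepdim=True).values\n",
"    scale = 127 / max_per_row\n",
"    W_q = (W * scale).round().to(torch.int8)\n",
"    return W_q, scale\n",
"\n",
"W_q_r, scale_r = absmax_quantize_per_row(W)\n",
"W_deq_r = W_q_r.float() / scale_r\n",
"print((W - W_deq_r).abs().mean())  # typically smaller than the per-tensor error\n",
"```"
]
},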
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 | "# Dynamic quantization"
424 | ]
425 | },
426 | {
427 | "cell_type": "markdown",
428 | "metadata": {},
429 | "source": [
430 | "In this second part, we will apply dynamic quantization by using PyTorch or HuggingFace (with BitsAndBytes). We will quantize both to 8 and to 4 bits, and we will see how that affects LLMs (in terms of memory and speed). "
431 | ]
432 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": null,
436 | "metadata": {},
437 | "outputs": [],
438 | "source": [
439 | "import torch\n",
440 | "import os\n",
441 | "import time \n",
442 | "from transformers import AutoTokenizer, AutoModelForCausalLM"
443 | ]
444 | },
445 | {
446 | "cell_type": "code",
447 | "execution_count": null,
448 | "metadata": {},
449 | "outputs": [],
450 | "source": [
451 | "from huggingface_hub import login\n",
452 | "\n",
453 | "# Login to the Hugging Face model hub to be able to upload models\n",
454 | "with open(\"../hf_token.txt\", \"r\") as f:\n",
455 | " token = f.read()\n",
456 | " f.close()\n",
457 | "\n",
458 | "login(token=token)"
459 | ]
460 | },
461 | {
462 | "cell_type": "markdown",
463 | "metadata": {},
464 | "source": [
465 | "First, let's load our model (Llama 3.2 1B) and let's see some base statistics (memory usage, inference time)."
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "execution_count": null,
471 | "metadata": {},
472 | "outputs": [],
473 | "source": [
474 | "model_id = \"meta-llama/Llama-3.2-1B\"\n",
475 | "model = AutoModelForCausalLM.from_pretrained(model_id) \n",
476 | "tokenizer = AutoTokenizer.from_pretrained(model_id) \n",
477 | "\n",
478 | "tokenizer.pad_token = tokenizer.eos_token"
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": null,
484 | "metadata": {},
485 | "outputs": [],
486 | "source": [
487 | "def get_model_size(model):\n",
488 | " \"\"\"Get the size of the model in MB\"\"\"\n",
489 | " torch.save(model.state_dict(), \"temp.pth\")\n",
490 | " size = os.path.getsize(\"temp.pth\") / 1e6 # size in \"MB\" (technically, it should be 1024**2, but we approximate to 1e6 to get an easier conversion #params <=> MB)\n",
491 | " os.remove(\"temp.pth\")\n",
492 | " return size\n",
493 | "\n",
494 | "print(f\"Model size before quantization {(get_model_size(model)):.2f} MB\")"
495 | ]
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
501 | "Wait, wasn't Llama 1B supposed to be 4GB (4 bytes * 1B parameters)? Why do we get ~ 5 GB (i.e., 1.25B parameters)?\n",
502 | "\n",
503 | "We are not considering the parameters used in the embedding layer (you can count how many parameters you have in the embedding layer and see that it matches the difference). \n",
504 | "\n",
505 | "Additionally, the count does not include the `lm_head`, i.e. the layer used to go from the hidden states to the logits. This is because in Llama (and other models) the `lm_head` is shared with the embedding layer, so it is not counted twice."
506 | ]
507 | },
508 | {
509 | "cell_type": "code",
510 | "execution_count": null,
511 | "metadata": {},
512 | "outputs": [],
513 | "source": [
514 | "text = \"The secret of life is\"\n",
515 | "# Notice we use a batch of 20 sentences -- we will get better results\n",
516 | "# on quantized models when processing a batch of inputs\n",
517 | "inputs = tokenizer([text]*20, return_tensors=\"pt\")\n",
518 | "\n",
519 | "tic = time.time()\n",
520 | "\n",
521 | "with torch.no_grad():\n",
522 | " baseline_output = model.generate(**inputs, max_new_tokens=100)\n",
523 | "\n",
524 | "elapsed_time = time.time() - tic\n",
525 | "\n",
526 | "baseline_decoded = tokenizer.decode(baseline_output[0], skip_special_tokens=True)\n",
527 | "\n",
528 | "print(\"\\nBaseline model output:\", baseline_decoded)\n",
529 | "print(\"\\nTime taken for baseline model:\", elapsed_time)"
530 | ]
531 | },
532 | {
533 | "cell_type": "markdown",
534 | "metadata": {},
535 | "source": [
536 | "Dynamic quantization applies lower precision to model weights and activations at runtime. This method doesn’t require modifications to the model architecture or retraining, which makes it relatively easy to apply.\n",
537 | "\n",
538 | "- **Advantages:** \n",
539 | " - Quick to implement with minimal changes. No calibration step is needed.\n",
540 | "\n",
541 | "- **Limitations:** \n",
542 | " - Activations are not pre-quantized, meaning some precision is maintained but at the cost of slightly higher resource use at inference time."
543 | ]
544 | },
545 | {
546 | "cell_type": "markdown",
547 | "metadata": {},
548 | "source": [
549 | "\n",
550 | "We use **[TorchAO](https://docs.pytorch.org/ao/stable/api_ref_quantization.html?utm_source=chatgpt.com)** to apply dynamic quantization to a model. TorchAO is the new quantization framework that replaces the deprecated `torch.ao.quantization` APIs. \n",
551 | "\n",
552 | "We can specify a set of layer types to be quantize. Let's stick with Linear layers. We specify the desired type, and off we go!"
553 | ]
554 | },
555 | {
556 | "cell_type": "code",
557 | "execution_count": null,
558 | "metadata": {},
559 | "outputs": [],
560 | "source": [
561 | "import torch\n",
562 | "from torchao.quantization import quantize_, Int8DynamicActivationInt8WeightConfig\n",
563 | "\n",
564 | "quantize_(\n",
565 | " model,\n",
566 | " Int8DynamicActivationInt8WeightConfig(),\n",
567 | " filter_fn=lambda m, name: isinstance(m, torch.nn.Linear),\n",
568 | " device=\"cpu\",\n",
569 | ")\n",
570 | "\n",
571 | "print(f\"Model size after quantization {get_model_size(model):.2f} MB\")"
572 | ]
573 | },
574 | {
575 | "cell_type": "markdown",
576 | "metadata": {},
577 | "source": [
578 | "Okay -- 2.3GB? Why not 5GB / 4 = 1.25GB? After all, we are going from float32 to int8. \n",
579 | "\n",
580 | "That's correct -- technically. Except, we are only encoding linear layers, and not the embedding layer. That means that, of the original 1.25B parameters, we are only quantizing 1B. The rest, in the embedding layer, is kept as float32.\n",
581 | "\n",
582 | "If you run the numbers, though, you should still find a problem: 1B * 1 byte + 0.25B * 4 bytes = 2GB. What about the rest? There's one more thing: remember, the `lm_head` was shared with the Embedding layer. However, since it is \"copied\" into a linear layer in Llama, the quantization process will quantize it as well. So that's an extra 0.25B parameters encoded as int8 -- hence 2.3GB.\n",
583 | "\n",
584 | "Finally, we could technically also quantize the embeddings (it has been introduced in later versions of PyTorch), but for simplicity we will not do it here (it would require some additional steps)."
585 | ]
586 | },
587 | {
588 | "cell_type": "code",
589 | "execution_count": null,
590 | "metadata": {},
591 | "outputs": [],
592 | "source": [
593 | "model"
594 | ]
595 | },
596 | {
597 | "cell_type": "code",
598 | "execution_count": null,
599 | "metadata": {},
600 | "outputs": [],
601 | "source": [
602 | "tic = time.time()\n",
603 | "\n",
604 | "with torch.no_grad():\n",
605 | " output = model.generate(**inputs, max_new_tokens=100)\n",
606 | "\n",
607 | "elapsed_time = time.time() - tic\n",
608 | "\n",
609 | "output_decoded = tokenizer.decode(baseline_output[0], skip_special_tokens=True)\n",
610 | "\n",
611 | "print(\"\\nQuantized model output:\", output_decoded)\n",
612 | "print(\"\\nTime taken for baseline model:\", elapsed_time)"
613 | ]
614 | },
615 | {
616 | "cell_type": "markdown",
617 | "metadata": {},
618 | "source": [
619 | "Hugging Face provides several built-in quantization options, each suited to different model and deployment needs:\n",
620 | "https://huggingface.co/docs/transformers/v4.46.0/quantization/overview\n",
621 | "\n",
622 | "For this lab, we will use `Quanto`."
623 | ]
624 | },
625 | {
626 | "cell_type": "code",
627 | "execution_count": null,
628 | "metadata": {},
629 | "outputs": [],
630 | "source": [
631 | "from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig\n",
632 | "import torch\n",
633 | " \n",
634 | "model_id = \"meta-llama/Llama-3.2-1B\"\n",
635 | "\n",
636 | "# Quantize to 8-bit weights\n",
637 | "quant = QuantoConfig(weights=\"int8\")\n",
638 | "\n",
639 | "tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)\n",
640 | "model = AutoModelForCausalLM.from_pretrained(\n",
641 | " model_id,\n",
642 | " quantization_config=quant,\n",
643 | " device_map=\"cpu\", \n",
644 | ")\n",
645 | "\n",
646 | "print(f\"Model size after quantization: {get_model_size(model)} MB\")"
647 | ]
648 | },
649 | {
650 | "cell_type": "code",
651 | "execution_count": null,
652 | "metadata": {},
653 | "outputs": [],
654 | "source": [
655 | "model.model.layers[0].self_attn.q_proj.weight"
656 | ]
657 | },
658 | {
659 | "cell_type": "code",
660 | "execution_count": null,
661 | "metadata": {},
662 | "outputs": [],
663 | "source": [
664 | "import time \n",
665 | "\n",
666 | "text = \"The secret of life is\"\n",
667 | "inputs = tokenizer([text]*20, return_tensors=\"pt\")\n",
668 | "\n",
669 | "tic = time.time()\n",
670 | "\n",
671 | "with torch.no_grad():\n",
672 | " output = model.generate(**inputs, max_new_tokens=100)\n",
673 | "\n",
674 | "elapsed_time = time.time() - tic\n",
675 | "\n",
676 | "output_decoded = tokenizer.decode(baseline_output[0], skip_special_tokens=True)\n",
677 | "\n",
678 | "print(\"\\nquantized model output:\", output_decoded)\n",
679 | "print(\"\\nTime taken for baseline model:\", elapsed_time)"
680 | ]
681 | }
682 | ],
683 | "metadata": {
684 | "kernelspec": {
685 | "display_name": "Python 3",
686 | "language": "python",
687 | "name": "python3"
688 | },
689 | "language_info": {
690 | "codemirror_mode": {
691 | "name": "ipython",
692 | "version": 3
693 | },
694 | "file_extension": ".py",
695 | "mimetype": "text/x-python",
696 | "name": "python",
697 | "nbconvert_exporter": "python",
698 | "pygments_lexer": "ipython3",
699 | "version": "3.11.10"
700 | }
701 | },
702 | "nbformat": 4,
703 | "nbformat_minor": 2
704 | }
705 |
--------------------------------------------------------------------------------
/lab06/text-06-prompt-engineering-rag.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "metadata": {},
5 | "cell_type": "markdown",
6 | "source": [
7 | "## Exercise 1: Prompt Engineering\n",
8 | "\n",
9 | "Let's consider LLAMA as our starting point. In the following, we see a typical prompt feeding and text generation with LLAMA"
10 | ],
11 | "id": "f766dc288163d0d2"
12 | },
13 | {
14 | "metadata": {},
15 | "cell_type": "code",
16 | "outputs": [],
17 | "execution_count": null,
18 | "source": [
19 | "from huggingface_hub import login\n",
20 | "\n",
21 | "import torch\n",
22 | "from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM\n",
23 | "import datetime\n",
24 | "\n",
25 | "\n",
26 | "#----------------Google Colab only---------------------\n",
27 | "from google.colab import userdata\n",
28 | "login(userdata.get('HF_TOKEN'))\n",
29 | "#----------------Google Colab only---------------------\n",
30 | "\n",
31 | "#detect the device available\n",
32 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
33 | "\n",
34 | "model_id = \"meta-llama/Llama-3.2-3B-Instruct\"\n",
35 | "model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)\n",
36 | "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
37 | "\n",
38 | "\n",
39 | "query = \"Tell me the capital of France.\"\n",
40 | "\n",
41 | "\n",
42 | "\n",
43 | "def make_a_query(prompt, tokenizer, model):\n",
44 | "\n",
45 | " # Set pad_token_id if missing\n",
46 | " if tokenizer.pad_token_id is None:\n",
47 | " tokenizer.pad_token_id = tokenizer.eos_token_id\n",
48 | "\n",
49 | "\n",
50 | " # Tokenize the input with padding and truncation\n",
51 | " device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
52 | " inputs = tokenizer(prompt, return_tensors=\"pt\", padding=True, truncation=True).to(device)\n",
53 | "\n",
54 | " # Compute the lenght of the input prompt to be able to extract the model's response later\n",
55 | " input_ids = inputs[\"input_ids\"]\n",
56 | " prompt_length = input_ids.shape[1]\n",
57 | "\n",
58 | " # Generate a response\n",
59 | " output = model.generate(\n",
60 | " inputs['input_ids'],\n",
61 | " attention_mask=inputs['attention_mask'],\n",
62 | " max_new_tokens=200, # Limit the number of new tokens generated (e.g., a single word)\n",
63 | " #temperature=0.3, # Reduce randomness, use with do_sample = True\n",
64 | " repetition_penalty=2.0, # Penalize repetition\n",
65 | " no_repeat_ngram_size=3, # Avoid repeating bigrams\n",
66 | " do_sample= False, # Set to False to use Greedy or Beam search\n",
67 | " num_beams=3, # Use with do_sample = False\n",
68 | " eos_token_id=tokenizer.eos_token_id, # End generation at EOS token\n",
69 | " pad_token_id=tokenizer.pad_token_id, # Avoid padding tokens\n",
70 | " early_stopping=True,\n",
71 | " )\n",
72 | "\n",
73 | " generated_tokens = output[0, prompt_length:]\n",
74 | "\n",
75 | " # Decode the response into human-readable text\n",
76 | " response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()\n",
77 | "\n",
78 | " return response\n",
79 | "\n",
80 | "print(\"\\n----------------------------------------------\\n\\n\")\n",
81 | "print(make_a_query(query, tokenizer, model))\n",
82 | "\n",
83 | "\n",
84 | "#alternative 2: use the chat template and pipelines from huggingface\n",
85 | "\n",
86 | "pipe = pipeline(\n",
87 | " \"text-generation\",\n",
88 | " model=model,\n",
89 | " tokenizer=tokenizer,\n",
90 | " torch_dtype=torch.float16,\n",
91 | " device_map=\"auto\",\n",
92 | ")\n",
93 | "\n",
94 | "\n",
95 | "outputs = pipe(\n",
96 | " query,\n",
97 | " max_new_tokens=200, # Limit the number of new tokens generated (e.g., a single word)\n",
98 | " #temperature=0.3, # Reduce randomness\n",
99 | " repetition_penalty=2.0, # Penalize repetition\n",
100 | " no_repeat_ngram_size=3, # Avoid repeating bigrams\n",
101 | " do_sample=False, # Make the output deterministic (not sampled)\n",
102 | " eos_token_id=tokenizer.eos_token_id, # End generation at EOS token\n",
103 | " pad_token_id=tokenizer.pad_token_id # Avoid padding tokens\n",
104 | ")\n",
105 | "\n",
106 | "print(\"\\n----------------------------------------------\\n\\n\")\n",
107 | "print(outputs[0][\"generated_text\"])\n"
108 | ],
109 | "id": "b8f6fbb1ca28287f"
110 | },
111 | {
112 | "metadata": {},
113 | "cell_type": "markdown",
114 | "source": [
115 | "### Fitz\n",
116 | "\n",
117 | "Reference libraries to install: pip install openai pymupdf faiss-cpu scikit-learn\n",
118 | "\n",
119 | "PyMuPDF is a Python library that provides tools for working with PDF files (as well as other document formats like XPS, OpenXPS, CBZ, EPUB, and FB2). It's built on the MuPDF library, a lightweight, high-performance PDF and XPS rendering engine. With PyMuPDF, you can perform various tasks like reading, creating, editing, and extracting content from PDFs, images, and annotations."
120 | ],
121 | "id": "1a17729410fab0f5"
122 | },
123 | {
124 | "metadata": {},
125 | "cell_type": "code",
126 | "outputs": [],
127 | "execution_count": null,
128 | "source": [
129 | "import fitz\n",
130 | "\n",
131 | "#open an example pdf\n",
132 | "doc = fitz.open(\"example.pdf\")\n",
133 | "\n",
134 | "# Extract text from the first page\n",
135 | "page = doc.load_page(0)\n",
136 | "text = page.get_text(\"text\") # Use 'text' mode to get raw text\n",
137 | "print(text)\n"
138 | ],
139 | "id": "29b56ba4cf842497"
140 | },
141 | {
142 | "metadata": {},
143 | "cell_type": "markdown",
144 | "source": [
145 | "### Example: Text Summarization\n",
146 | "\n",
147 | "Let's ask LLAMA to perform a summarization of the example PDF."
148 | ],
149 | "id": "acc8edbcaa9cc989"
150 | },
151 | {
152 | "metadata": {},
153 | "cell_type": "code",
154 | "outputs": [],
155 | "execution_count": null,
156 | "source": [
157 | "#define the prompt to ask for text summarization.\n",
158 | "text_summarization_prompt = \"\" #define your prompt here\n",
159 | "text = \"\" #load here the FULL text of the article\n",
160 | "p1 = \"\"\"{PROMPT}. article: {BODY}\"\"\".format(PROMPT=text_summarization_prompt, BODY=text)\n",
161 | "\n",
162 | "#feed the prompt to llama\n",
163 | "#print the result of text summarization into bullets\n",
164 | "\n",
165 | "r1 = \"\""
166 | ],
167 | "id": "6537dfbc8aa66ee1"
168 | },
169 | {
170 | "metadata": {},
171 | "cell_type": "markdown",
172 | "source": [
173 | "### Adding a System Prompt\n",
174 | "\n",
175 | "Llama was trained with a system message that set the context and persona to assume when solving a task. One of the unsung advantages of open-access models is that you have full control over the system prompt in chat applications. This is essential to specify the behavior of your chat assistant –and even imbue it with some personality–, but it's unreachable in models served behind APIs.\n"
176 | ],
177 | "id": "49ed0f2f8a16cd52"
178 | },
179 | {
180 | "metadata": {},
181 | "cell_type": "code",
182 | "outputs": [],
183 | "execution_count": null,
184 | "source": [
185 | "#default standard system message from the Hugging Face blog to the prompt from above\n",
186 | "system_prompt = \"\"\"You are a helpful, respectful and honest assistant. \\\n",
187 | " Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, \\\n",
188 | " unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses \\\n",
189 | " are socially unbiased and positive in nature. If a question does not make any sense, or is not factually \\\n",
190 | " coherent, explain why instead of answering something not correct. If you don't know the answer to a question, \\\n",
191 | " please don't share false information.\"\"\"\n",
192 | "\n",
193 | "#concatenate the system prompt with your prompt and get the response\n",
194 | "p2 = \"\"\"\n",
195 | "\n",
196 | "r2 = \"\"\n",
197 | "\n",
198 | "#what changes?"
199 | ],
200 | "id": "bb4dc76df41fde7b"
201 | },
202 | {
203 | "metadata": {},
204 | "cell_type": "markdown",
205 | "source": [
206 | "### Customizing the System prompt\n",
207 | "\n",
208 | "With Llama we have full control over the system prompt. The following experiment will instruct Llama to assume the persona of a researcher tasked with writing a concise brief.\n",
209 | "\n",
210 | "Apply the following changes the original system prompt:\n",
211 | "- Use the researcher persona and specify the tasks to summarize articles.\n",
212 | "- Remove safety instructions; they are unnecessary since we ask Llama to be truthful to the article.\n"
213 | ],
214 | "id": "ef7d5d32857a31d3"
215 | },
216 | {
217 | "metadata": {},
218 | "cell_type": "code",
219 | "outputs": [],
220 | "execution_count": null,
221 | "source": [
222 | "new_system_prompt = \"\"\n",
223 | "\n",
224 | "p3 = \"\"\n",
225 | "\n",
226 | "r3 = \"\""
227 | ],
228 | "id": "676d7e7fc770b6c9"
229 | },
230 | {
231 | "metadata": {},
232 | "cell_type": "markdown",
233 | "source": [
234 | "### Chain-of-Thought prompting\n",
235 | "\n",
236 | "Chain-of-thought is when a prompt is being constructed using a previous prompt answer. For our use case to extract information from text, we will first ask Llama what the article is about and then use the response to ask a second question: what problem does [what the article is about] solve?\n",
237 | "\n"
238 | ],
239 | "id": "86efc5f1cf63975a"
240 | },
241 | {
242 | "metadata": {},
243 | "cell_type": "code",
244 | "outputs": [],
245 | "execution_count": null,
246 | "source": [
247 | "#define a prompt to ask what the article is about\n",
248 | "\n",
249 | "p4 = \"\"\n",
250 | "\n",
251 | "r4 = \"\"\n",
252 | "\n",
253 | "#now embed the result of the previous prompt in a new prompt to ask what that solves\n",
254 | "\n",
255 | "p5 = \"\"\n",
256 | "\n",
257 | "r5 = \"\"\n",
258 | "\n",
259 | "\n"
260 | ],
261 | "id": "d6f92fe96dc1bb9f"
262 | },
263 | {
264 | "metadata": {},
265 | "cell_type": "markdown",
266 | "source": [
267 | "### Generating JSONs with Llama\n",
268 | "\n",
269 | "Llama needs precise instructions when asking it to generate JSON. In essence, here is what works for me to get valid JSON consistently:\n",
270 | "\n",
271 | "- Explicitly state — “ All output must be in valid JSON. Don’t add explanation beyond the JSON” in the system prompt.\n",
272 | "- Add an “explanation” variable to the JSON example. Llama enjoys explaining its answers. Give it an outlet.\n",
273 | "- Use the JSON as part of the instruction. See the “in_less_than_ten_words” example below.\n",
274 | "Change “write the answer” to “output the answer.”\n"
275 | ],
276 | "id": "ae2e9e130f76aec1"
277 | },
278 | {
279 | "metadata": {},
280 | "cell_type": "code",
281 | "outputs": [],
282 | "execution_count": null,
283 | "source": [
284 | "\n",
285 | "\n",
286 | "#example addition to a prompt to deal with jsons\n",
287 | "json_prompt_addition = \"Output must be in valid JSON like the following example {{\\\"topic\\\": topic, \\\"explanation\\\": [in_less_than_ten_words]}}. Output must include only JSON.\"\n",
288 | "\n",
289 | "#now generate a prompt by correctly concatenating the system prompt, the json prompt instruction, and an article\n",
290 | "p6 = \"\"\n",
291 | "\n",
292 | "r6 = \"\"\n",
293 | "\n",
294 | "#compare the difference between the prompt with the formatting instruction and a regular prompt without formatting instructions. is there any difference?\n",
295 | "\n",
296 | "\n"
297 | ],
298 | "id": "6634ad897d4e0e52"
299 | },
300 | {
301 | "metadata": {},
302 | "cell_type": "markdown",
303 | "source": [
304 | "### One-to-Many Shot Learning Prompting\n",
305 | "\n",
306 | "One-to-Many Shot Learning is a term that refers to a type of machine learning problem where the goal is to learn to recognize many different classes of objects from only one or a few examples of each class. For example, if you have only one image of a cat and one image of a dog, can you train a model to distinguish between cats and dogs in new images? This is a challenging problem because the model has to generalize well from minimal data (source)\n",
307 | "\n",
308 | "Important points about the prompts:\n",
309 | "\n",
310 | "- The system prompt includes the instructions to output the answer in JSON.\n",
311 | "- The prompt consists of an one-to-many shot learning section that starts after the end of the system prompt.\n",
312 | "- Shot examples are represented through pairs composed by user's question an assistant's response, as reported in the template below.\n",
313 | "- The examples are given in JSON because the answers need to be JSON.\n",
314 | "- The JSON allows defining the response with name, type, and explanation.\n",
315 | "- The prompt question is represented by the last user's question - assistant's response pair, where the response is blank and the last <|eot_id|> is missing.\n",
316 | "\n",
317 | "```\n",
318 | "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n",
319 | "SYSTEM PROMPT\n",
320 | "<|eot_id|>\n",
321 | "\n",
322 | "<|start_header_id|>user<|end_header_id|>\n",
323 | "Question shot 1.\n",
324 | "<|eot_id|>\n",
325 | "<|start_header_id|>assistant<|end_header_id|>\n",
326 | "Answer shot 1.\n",
327 | "<|eot_id|>\n",
328 | "\n",
329 | "<|start_header_id|>user<|end_header_id|>\n",
330 | "Question shot 2.\n",
331 | "<|eot_id|>\n",
332 | "<|start_header_id|>assistant<|end_header_id|>\n",
333 | "Answer shot 2.\n",
334 | "<|eot_id|>\n",
335 | "\n",
336 | "<|start_header_id|>user<|end_header_id|>\n",
337 | "Final prompt.\n",
338 | "<|eot_id|>\n",
339 | "<|start_header_id|>assistant<|end_header_id|>\n",
340 | "```"
341 | ],
342 | "id": "6bf5f28ceddaac11"
343 | },
344 | {
345 | "metadata": {},
346 | "cell_type": "code",
347 | "outputs": [],
348 | "execution_count": null,
349 | "source": [
350 | "#describe all the main nouns in the example.pdf article\n",
351 | "\n",
352 | "#use the following addition for one-to-many prompting exampling\n",
353 | "nouns = \"\"\"[\\\n",
354 | "{{\"name\": \"semiconductor\", \"type\": \"industry\", \"explanation\": \"Companies engaged in the design and fabrication of semiconductors and semiconductor devices\"}},\\\n",
355 | "{{\"name\": \"NBA\", \"type\": \"sport league\", \"explanation\": \"NBA is the national basketball league\"}},\\\n",
356 | "{{\"name\": \"Ford F150\", \"type\": \"vehicle\", \"explanation\": \"Article talks about the Ford F150 truck\"}},\\\n",
357 | "{{\"name\": \"Ford\", \"type\": \"company\", \"explanation\": \"Ford is a company that built vehicles\"}},\\\n",
358 | "{{\"name\": \"John Smith\", \"type\": \"person\", \"explanation\": \"Mentioned in the article\"}},\\\n",
359 | "]\"\"\"\n",
360 | "\n",
361 | "#now build the prompt following the template described above\n",
362 | "p7 = \"\"\n",
363 | "\n",
364 | "r7 = \"\"\n",
365 | "\n",
366 | "#compare the response of the prompt described above and a zero-shot prompt. Are there any differences?\n"
367 | ],
368 | "id": "cf6fdc6223195ea2"
369 | },
370 | {
371 | "metadata": {},
372 | "cell_type": "markdown",
373 | "source": [
374 | "## Exercise 2: RAG (Retrieval-Augmented-Generation)\n",
375 | "\n",
376 | "RAG (Retrieval-Augmented Generation) is a powerful framework in Natural Language Processing (NLP) that enhances the performance of language models by combining traditional generative models with external knowledge retrieval. This hybrid approach allows models to retrieve relevant information from a large corpus (like a database or document collection) and incorporate this information into the generation process. It is particularly useful when a model needs to answer questions, generate content, or provide explanations based on real-time or domain-specific data.\n",
377 | "\n"
378 | ],
379 | "id": "653cefd0237ca591"
380 | },
381 | {
382 | "metadata": {},
383 | "cell_type": "code",
384 | "outputs": [],
385 | "execution_count": null,
386 | "source": [
387 | "import os\n",
388 | "import glob\n",
389 | "\n",
390 | "\n",
391 | "#TODO: Function to extract text from a PDF\n",
392 | "def extract_text_from_pdf(pdf_path):\n",
393 | " print(\"\")\n",
394 | " #your code here...\n",
395 | "\n",
396 | "# Extract text from all uploaded PDF files\n",
397 | "pdf_texts = {}\n",
398 | "# your code here...\n",
399 | "\n",
400 | "#Display the text from all the PDF files\n",
401 | "for pdf_file, text in pdf_texts.items():\n",
402 | " print(\"\") #implement PDF read"
403 | ],
404 | "id": "adc42a2612519ae1"
405 | },
406 | {
407 | "metadata": {},
408 | "cell_type": "markdown",
409 | "source": "",
410 | "id": "3c0962e6f594560"
411 | },
412 | {
413 | "metadata": {},
414 | "cell_type": "markdown",
415 | "source": [
416 | "### Creating an index of vectors to represent the documents\n",
417 | "\n",
418 | "To perform efficient searches, we need to convert our text data into numerical vectors. To do so, we will use the first step of the BERT transformer.\n",
419 | "\n",
420 | "Since our full pdf files are very long to be fed as input into BERT, we perform a step in which we create a structure where we associate a document number to its abstract, and in a separate dictionary we associate a document number to its full text.\n"
421 | ],
422 | "id": "eac3a2264e809be9"
423 | },
424 | {
425 | "metadata": {},
426 | "cell_type": "code",
427 | "outputs": [],
428 | "execution_count": null,
429 | "source": [
430 | "from transformers import AutoModel, AutoTokenizer\n",
431 | "from sklearn.metrics.pairwise import cosine_similarity\n",
432 | "from sklearn.metrics.pairwise import cosine_similarity\n",
433 | "import numpy as np\n",
434 | "\n",
435 | "\n",
436 | "\n",
437 | "\n",
438 | "#import the Bert pretrained model from the transformers library\n",
439 | "model = AutoModel.from_pretrained(\"bert-base-uncased\")\n",
440 | "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-uncased\")\n",
441 | "### Don't forget to move it to the device\n",
442 | "\n",
443 | "#initialization of the dictionary of abstracts. Substitute this with the abstracts of the 10 papers considered as sources for RAG\n",
444 | "#(we could use functions to read the PDFs to \"cut\" the abstracts from the papers. For simplicity reasons, we will copy and paste them)\n",
445 | "abstracts_dict = {\n",
446 | " 0: \"\"\n",
447 | "}\n",
448 | "\n",
449 | "#the text for rag is used as an input to the BERT model\n",
450 | "\n",
451 | "#The tokenized inputs are passed to the BERT model for processing.\n",
452 | "#(#remember padding=True: Ensures that all inputs are padded to the same length, allowing batch processing.)\n",
453 | "#The model outputs a tensor (last_hidden_state), where each input token is represented by a high-dimensional vector.\n",
454 | "#last_hidden_state is of shape (batch_size, sequence_length, hidden_size), where:\n",
455 | "#batch_size: Number of input texts.\n",
456 | "#sequence_length: Length of each tokenized text (after padding).\n",
457 | "#hidden_size: Dimensionality of the vector representation for each token (default 768 for bert-base-uncased).\n",
458 | "\n",
459 | "#last_hidden_state[:, 0]: Selects the representation of the [CLS] token for each input text. The [CLS] token is a special token added at the start of each input and is often used as the aggregate representation for the entire sequence.\n",
460 | "\n",
461 | "# -------------------------------------- HINT -----------------------------------\n",
462 | "tokenized_inputs = tokenizer(\n",
463 | " list(abstracts_dict.values()),\n",
464 | " return_tensors='pt', # stands for 'PyTorch': tensors will be returned (if omitted, we will get Python lists)\n",
465 | " padding=True) # all the abstracts will be padded to the same length, in order to have a uniformed batch of abstracts => an attention mask will be created\n",
466 | "tokenized_inputs = {key: value.to(device) for key, value in tokenized_inputs.items()} # move the tensors to the device\n",
467 | "\n",
468 | "abstract_vectors = model_indexing(**tokenized_inputs).last_hidden_state[:, 0]\n",
469 | "\n",
470 | "#abstract_vectors is a tensor of shape (batch_size, hidden_size) (e.g., (3, 768) in this case), representing each text as a single 768-dimensional vector.\n",
471 | "\n",
472 | "print(abstract_vectors.shape)\n",
473 | "\n"
474 | ],
475 | "id": "d3a357da8e8e4944"
476 | },
477 | {
478 | "metadata": {},
479 | "cell_type": "markdown",
480 | "source": [
481 | "### Search\n",
482 | "\n",
483 | "With our text data vectorized and indexed, we can now perform searches. We will define a function to search the index for the most relevant documents based on a query.\n",
484 | "\n",
485 | "To perform the search, we need a function (search documents) where we perform the cosine similarity between the query vector and all the abstract vectors. This function will give our the top-k indexes. Once we find the top-k indexes, with another function, we can collect the full text of the documents from the paper dictionary.\n",
486 | "\n",
487 | "To compute cosine similarity, refer to the following formula\n",
488 | "\n",
489 | "```cs = cosine_similarity(vector_a.detach().numpy(), vector_b.detach().numpy())```\n",
490 | "\n"
491 | ],
492 | "id": "94dd0f88dc4adbd1"
493 | },
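{
"metadata": {},
"cell_type": "markdown",
"source": [
"Before wiring this into the search function, it can help to check the input/output shapes that scikit-learn's `cosine_similarity` expects. This toy sketch uses random vectors (nothing from the lab) and shows that both arguments must be 2-D and that the result has shape `(n_queries, n_documents)`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"query_vec = np.random.rand(1, 768)  # one query vector, 768-dimensional (like a BERT [CLS] vector)\n",
"doc_vecs = np.random.rand(10, 768)  # ten \"abstract\" vectors\n",
"\n",
"sims = cosine_similarity(query_vec, doc_vecs)\n",
"print(sims.shape)  # (1, 10): one similarity score per document\n",
"```"
],
"id": "added-cosine-similarity-shapes"
},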
494 | {
495 | "metadata": {},
496 | "cell_type": "code",
497 | "outputs": [],
498 | "execution_count": null,
499 | "source": [
500 | "\n",
501 | "\n",
502 | "\n",
503 | "\n",
504 | "def get_top_k_similar_indices(query_vector, # i.e., the QUERY in the RAG system\n",
505 | " abstract_vectors, # i.e., the KEY in the RAG system\n",
506 | " k):\n",
507 | "\n",
508 | " #Parameters:\n",
509 | " #- query_vector: A tensor of shape (1, hidden_size) representing the query vector.\n",
510 | " #- abstract_vectors: A tensor of shape (batch_size, hidden_size) representing the abstract vectors.\n",
511 | " #- k: The number of top indices to return.\n",
512 | "\n",
513 | " #Returns:\n",
514 | " #- sorted_indices: A numpy array of shape (1, k) containing the indices of the top k most similar abstracts.\n",
515 | "\n",
516 | " # ------------- HINT ----------------------------------------------------------------\n",
517 | " #Computes the top k indices of the most similar abstracts to the query based on cosine similarity.\n",
518 | " similarities = cosine_similarity( # from sklearn.metrics.pairwise\n",
519 | " query_vector.cpu().detach().numpy(), # it is needed to move the tensor on the cpu!\n",
520 | " abstract_vectors.cpu().detach().numpy())\n",
521 | "\n",
522 | " # IMPORTANT: reason about the size of 'similarities' to understand how to sort it (you can print it)\n",
523 | " # ------------- HINT ----------------------------------------------------------------\n",
524 | "\n",
525 | " return \"\"\n",
526 | "\n",
527 | "\n",
528 | "def retrieve_documents(indices, documents_dict):\n",
529 | "\n",
530 | " #Retrieves the documents corresponding to the given indices and concatenates them into a single string.\n",
531 | "\n",
532 | " #Parameters:\n",
533 | " #- indices: A numpy array or list of top-k indices of the most similar documents.\n",
534 | " #- documents_dict: A dictionary where keys are document indices (integers) and values are the document texts (strings).\n",
535 | "\n",
536 | " #Returns:\n",
537 | " #- concatenated_documents: A string containing the concatenated texts of the retrieved documents.\n",
538 | "\n",
539 | " return \"\"\n",
540 | "\n",
541 | "\n",
542 | "\n",
543 | "#now I create a vector also for my query\n",
544 | "\n",
545 | "# 1) from NL to tokens\n",
546 | "\n",
547 | "query = \"\" # remember to move the tokenised input on the correct device (the same as the model)\n",
548 | "\n",
549 | "# 2) from tokens to vector (i.e., embedding)\n",
550 | "\n",
551 | "query_vector = \"\"\n",
552 | "\n",
553 | "# TODO: get the top k abstracts similar to the query ( call get_top_k_similar_indices() )\n",
554 | "\n",
555 | "# TODO: get the relative texts ( call retrieve_documents() )\n",
556 | "\n",
557 | "\n",
558 | "\n"
559 | ],
560 | "id": "2a03db1509cb6cd0"
561 | },
562 | {
563 | "metadata": {},
564 | "cell_type": "markdown",
565 | "source": [
566 | "### A function to perform Retrieval Augmented Generation\n",
567 | "\n",
568 | "In this step, we’ll combine the context retrieved from our documents with LLAMA to generate responses. The context will provide the necessary information to the model to produce more accurate and relevant answers."
569 | ],
570 | "id": "2136247419aa8d3c"
571 | },
572 | {
573 | "metadata": {},
574 | "cell_type": "code",
575 | "outputs": [],
576 | "execution_count": null,
577 | "source": [
578 | "\n",
579 | "\n",
580 | "#now we put it all together\n",
581 | "\n",
582 | "def generate_augmented_response(query, documents):\n",
583 | "\n",
584 | " system = \"\" #TODO: define system prompt\n",
585 | "\n",
586 | " context = \"\" #TODO: concatenate here all the search results\n",
587 | "\n",
588 | "\n",
589 | " prompt = \"\" #TODO: create the prompt for LLAMA (system + context + query)\n",
590 | "\n",
591 | " response = \"\"\n",
592 | "\n",
593 | " #perform a query with LLAMA in the usual way ( call make_a_query() )\n",
594 | "\n",
595 | " #return the response\n",
596 | " return \"\"\n",
597 | "\n",
598 | "\n",
599 | "# TODO: generate the queries!\n",
600 | "query = \"\"\n",
601 | "response = generate_augmented_response(query)\n",
602 | "print(response)\n",
603 | "\n",
604 | "#TODO: now compare the results with a prompt without RAG. What are the results?\n"
605 | ],
606 | "id": "1a6128cfae3108e7"
607 | }
608 | ],
609 | "metadata": {
610 | "kernelspec": {
611 | "display_name": "Python 3",
612 | "language": "python",
613 | "name": "python3"
614 | },
615 | "language_info": {
616 | "codemirror_mode": {
617 | "name": "ipython",
618 | "version": 2
619 | },
620 | "file_extension": ".py",
621 | "mimetype": "text/x-python",
622 | "name": "python",
623 | "nbconvert_exporter": "python",
624 | "pygments_lexer": "ipython2",
625 | "version": "2.7.6"
626 | }
627 | },
628 | "nbformat": 4,
629 | "nbformat_minor": 5
630 | }
631 |
--------------------------------------------------------------------------------
/lab04/text-02-llms.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Exploring Large Language Models: LLaMA and Mistral"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "In this notebook, we will dive into two famous large language models, LLaMA and Mistral, along with their instruction-tuned versions. We'll explore how each model performs on various tasks, with a particular focus on generating structured responses in JSON format.\n",
15 | "\n",
16 | "These models have been fine-tuned to follow instructions, making them suitable for a range of NLP applications. Through this lab, you will:\n",
17 | "\n",
18 | "- Learn how to load and interact with LLaMA and Mistral models using the `pipeline` and `chat_template` functions.\n",
19 | "- Examine the performance of their instruction-based variants.\n",
20 | "- Generate structured outputs, specifically in JSON, for practical applications."
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "> **Disclaimer**: Before starting this lab, ensure you have requested access to the required models on Hugging Face and have logged in to your Hugging Face account. Access is necessary for the following models:\n",
28 | ">\n",
29 | "> - [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)\n",
30 | "> - [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)\n",
31 | "> - [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)\n",
32 | ">\n",
33 | "> You can log in to Hugging Face directly from this notebook using the provided code snippet.\n"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": null,
39 | "metadata": {},
40 | "outputs": [],
41 | "source": [
42 | "from huggingface_hub import login\n",
43 | "\n",
44 | "import torch\n",
45 | "from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "# TODO: Login to the Hugging Face model hub to be able to upload models\n",
55 | "token = \"YOUR HUGGING FACE TOKEN\"\n",
56 | "\n",
57 | "login(token=token)"
58 | ]
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "# 1. LLaMA"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "In this part of the lab, we will explore **LLaMA (Large Language Model Meta AI)**, which is one of the most known large language models developed by Meta (Facebook). \n",
72 | "\n",
73 | "Next, we will focus on **Instruction LLaMA**, a version of LLaMA fine-tuned to better understand and follow user instructions. "
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "We will use Llama 3.2 (released in September 2024). In particular, we will adopt the 1B version. On the scale of things, this model is on the smaller side, but it is still a very powerful model.\n",
81 | "\n",
82 | "It has been released (along with a 3B version) with the intention of allowing running it on devices with modest hardware (e.g., mobile phones or other edge devices). "
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": 63,
88 | "metadata": {},
89 | "outputs": [],
90 | "source": [
91 | "model_id = \"meta-llama/Llama-3.2-1B\"\n",
92 | "\n",
93 | "# TODO: Load the model with `torch.float16` precision and the tokenizer \n",
94 | "# (You can specify the precision with `torch_dtype=torch.float16`)\n",
95 | "model = ...\n",
96 | "tokenizer = ...\n",
97 | "\n",
98 | "tokenizer.pad_token = tokenizer.eos_token"
99 | ]
100 | },
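One possible way to complete the TODOs in the cell above, following the hint about half precision (a minimal sketch; argument names are the standard transformers ones):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-3.2-1B"

    # Load the weights in float16 and the matching tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token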
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "We can use this model to generate text using the generate() method. We use random sampling (`do_sample=True`) and extract 5 samples (`num_return_sequences=5`). You can find other generation parameters [here](https://huggingface.co/docs/transformers/v4.46.0/en/main_classes/text_generation#transformers.GenerationConfig)."
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": null,
111 | "metadata": {},
112 | "outputs": [],
113 | "source": [
114 | "tokens = tokenizer(\"Hello, my name is\", return_tensors=\"pt\").to(model.device)\n",
115 | "batch = model.generate(**tokens, do_sample=True, max_length=50, num_return_sequences=5, pad_token_id=tokenizer.eos_token_id) # (assigning pad_token_id avoids a warning)\n",
116 | "tokenizer.batch_decode(batch)"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "### **Understanding the `tokenizer.chat_template`**\n",
124 | "\n",
125 | "In this section, we will explore the **chat template** that is used to format and structure messages for a conversational assistant. The `tokenizer.chat_template` is a convenient way of organizing interactions between the user, system, and assistant so that the model can easily process them and generate coherent responses.\n",
126 | "\n",
127 | "### **What is a Chat Template?**\n",
128 | "\n",
129 | "The chat template is a predefined format that ensures consistent structure for conversations. It marks the different roles in the interaction (system, user, assistant), and separates the various elements of the conversation using special tokens. This helps the language model understand which parts of the dialogue are instructions, which parts are user inputs, and where the assistant’s response should be generated.\n",
130 | "\n",
131 | "Let's create an example of a possible (simplified) chat template:"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": 4,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "import datetime\n",
141 | "\n",
142 | "chat_template = \"\"\"\n",
143 | "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n",
144 | "\n",
145 | "Cutting Knowledge Date: December 2023\n",
146 | "Today Date: \"\"\"+datetime.datetime.now().strftime(\"%d %b %Y\")+\"\"\"\n",
147 | "\n",
148 | "{system_message}\n",
149 | "\n",
150 | "<|eot_id|>\n",
151 | "<|start_header_id|>user<|end_header_id|>\n",
152 | "\n",
153 | "{user_message}\n",
154 | "\n",
155 | "<|eot_id|>\n",
156 | "\"\"\""
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "### **Hugging Face Pipeline Overview**\n",
164 | "\n",
165 | "The **`pipeline`** method from Hugging Face’s Transformers library is a high-level API designed to streamline the process of using pre-trained models for a wide variety of **natural language processing (NLP) tasks**.\n",
166 | "\n",
167 | "#### **What is a Pipeline?**\n",
168 | "\n",
169 | "A pipeline is a modular tool that wraps around a pre-trained model, tokenizer, and task-specific configurations. It makes it easy to load and apply these models directly to different tasks, such as:\n",
170 | "- **Text generation**\n",
171 | "- **Text classification**\n",
172 | "- **Question answering**\n",
173 | "- **Summarization**\n",
174 | "- **Translation**\n",
175 | "\n",
176 | "By simply specifying the type of task (e.g., `\"text-generation\"`), `pipeline` takes care of loading and configuring a compatible model and tokenizer, providing a ready-to-use interface for generating results.\n",
177 | "\n",
178 | "You can find a full list of supported pipelines on the [Hugging Face documentation](https://huggingface.co/docs/transformers/main_classes/pipelines)."
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 66,
184 | "metadata": {},
185 | "outputs": [],
186 | "source": [
187 | "# Create the pipeline with the model and tokenizer\n",
188 | "pipe = pipeline(\n",
189 | " \"text-generation\",\n",
190 | " model=model,\n",
191 | " tokenizer=tokenizer,\n",
192 | " torch_dtype=torch.float16,\n",
193 | " device_map=\"auto\",\n",
194 | ")"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": null,
200 | "metadata": {},
201 | "outputs": [],
202 | "source": [
203 | "\n",
204 | "messages = [\n",
205 | " {\"role\": \"system\", \"content\": \"You are a pirate chatbot who always responds in pirate speak!\"},\n",
206 | " {\"role\": \"user\", \"content\": \"What is 2 + 2?\"},\n",
207 | "]\n",
208 | "\n",
209 | "# Format the messages using the chat template\n",
210 | "formatted_messages = chat_template.format(\n",
211 | " system_message=messages[0][\"content\"],\n",
212 | " user_message=messages[1][\"content\"]\n",
213 | ")\n",
214 | "\n",
215 | "\n",
216 | "print(formatted_messages)"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "Now, remember that for models to follow instructions, they need to have been fine-tuned on this kind of data. In this case, we are not using the instruction-tuned version.\n",
224 | "\n",
225 | "So, we can expect the model to produce a garbage response (it has never seen this kind of input before!). But let's try it anyway!"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {},
232 | "outputs": [],
233 | "source": [
234 | "# Generate the output text \n",
235 | "outputs = pipe(\n",
236 | " formatted_messages,\n",
237 | " max_new_tokens=256,\n",
238 | " do_sample=True,\n",
239 | ")\n",
240 | "\n",
241 | "print(outputs[0][\"generated_text\"])"
242 | ]
243 | },
244 | {
245 | "cell_type": "markdown",
246 | "metadata": {},
247 | "source": [
248 | "### **Differences Between Standard and Instruct Versions of Large Language Models (LLMs)**\n",
249 | "\n",
250 | "Large Language Models (LLMs) come in different versions, with **standard** and **instruction-tuned (Instruct)** versions being the most common. Here’s a brief comparison:\n",
251 | "\n",
252 | "#### **1. Purpose and Training**:\n",
253 | "   - **Standard LLM**: The standard model is generally pre-trained on large datasets without specific instruction-following capabilities. It typically generates more open-ended responses, which can be useful for creative writing or general information retrieval where the response style is flexible.\n",
254 | " - **Instruct LLM**: Instruction-tuned models, like the **Llama-3.2 Instruct**, are fine-tuned on datasets designed to help the model understand and follow instructions effectively. This tuning enhances the model's ability to respond directly to user prompts and handle structured requests. It is fine-tuned to produce concise, direct responses that are often more relevant in task-specific or conversational AI applications.\n",
255 | "\n",
256 | "Let's compare the outputs of the standard and Instruct versions of LLaMA to see the differences in their responses."
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {},
263 | "outputs": [],
264 | "source": [
265 | "import torch\n",
266 | "from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM\n",
267 | "\n",
268 | "model_id = \"meta-llama/Llama-3.2-1B-Instruct\"\n",
269 | "\n",
270 | "# TODO: Load the model with `torch.float16` precision and the tokenizer \n",
271 | "model = ...\n",
272 | "tokenizer = ...\n",
273 | "\n",
274 | "# TODO: Set the pad token to the end of the sequence token\n",
275 | "tokenizer.pad_token = ...\n",
276 | "\n",
277 | "# TODO: Create the pipeline with the model and tokenizer\n",
278 | "pipe = ...\n",
279 | "\n",
280 | "messages = [\n",
281 | " {\"role\": \"system\", \"content\": \"You are a pirate chatbot who always responds in pirate speak!\"},\n",
282 | " {\"role\": \"user\", \"content\": \"What is 2 + 2?\"},\n",
283 | "]\n",
284 | "\n",
285 | "# TODO: Format the messages using the chat template and generate the output text\n",
286 | "formatted_messages = ...\n",
287 | "outputs = ...\n",
288 | "\n",
289 | "print(outputs[0][\"generated_text\"])"
290 | ]
291 | },
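A possible completion of the TODOs above, sketched under the assumption that the simplified chat_template string defined earlier in the notebook is reused:

    import torch
    from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

    model_id = "meta-llama/Llama-3.2-1B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                    torch_dtype=torch.float16, device_map="auto")

    messages = [
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "What is 2 + 2?"},
    ]

    # Reuse the simplified chat_template defined earlier in the notebook
    formatted_messages = chat_template.format(
        system_message=messages[0]["content"],
        user_message=messages[1]["content"],
    )
    outputs = pipe(formatted_messages, max_new_tokens=256, do_sample=True)
    print(outputs[0]["generated_text"])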
292 | {
293 | "cell_type": "markdown",
294 | "metadata": {},
295 | "source": [
296 | "### **Evaluation of the Tokenizer Chat Template**\n",
297 | "\n",
298 | "Actually, the chat template of `meta-llama/Llama-3.2-1B-Instruct` is much more complex than the example above. It includes various components that help the model understand the context of the conversation, manage dates, handle tools, and structure messages effectively.\n",
299 | "\n",
300 | "The template is written in [jinja](https://jinja.palletsprojects.com/en/stable/templates/), a language that allows for the dynamic generation of content based on variables, conditions and loops.\n",
301 | "\n",
302 | "\n",
303 | "Let's print it and analyze its key components:\n",
304 | " \n",
305 | "\n",
306 | "#### **Key Components of the Template**:\n",
307 | "1. **System Message Extraction**:\n",
308 | " - The system message is extracted if the first role in the message list is labeled \"system.\" This allows the template to clearly differentiate between user queries and system instructions.\n",
309 | " - If a system message exists, it is added to the template between special tokens (`<|start_header_id|>` and `<|end_header_id|>`), ensuring that the model knows when the system message starts and ends.\n",
310 | "\n",
311 | "2. **Date Management**:\n",
312 | " - The template automatically handles the current date using either a provided `strftime_now` function or a default date (`\"26 Jul 2024\"`). This can be useful when the model needs to be aware of the date in contexts such as time-sensitive responses.\n",
313 | "\n",
314 | "3. **Handling Tools**:\n",
315 | " - The template checks if **tools** are defined. If tools are available, it includes a description of these tools in the system message or the user message, depending on where they need to appear.\n",
316 | " - If the tools are part of the user message, the template ensures that the first user message prompts the user to respond in a structured format, such as using JSON for function calls.\n",
317 | "\n",
318 | "4. **Message Processing**:\n",
319 | " - The template loops through the list of messages and processes each based on the role (`user`, `assistant`, `ipython`, or `tool`). It formats each message using start and end tokens for the roles, helping the model understand the structure of the conversation.\n",
320 | " - If the message involves tool calls, the template ensures that they are properly formatted into a structured JSON format to be passed back to the model for further processing.\n",
321 | "\n",
322 | "5. **Ending the Assistant's Response**:\n",
323 | " - The template leaves a placeholder for the assistant’s response, which the model will generate during inference. This ensures that the assistant's response begins in the correct format, ready to be populated with the generated content.\n",
324 | "\n",
325 | "#### **Why Is This Template Needed?**\n",
326 | "\n",
327 | "- **Maintains Consistency**: This template ensures that the conversation is structured in a consistent manner, which is crucial for models designed to follow complex instructions or engage in multi-turn conversations.\n",
328 | "- **Handles Tools**: By incorporating the ability to dynamically introduce tools and functionality, the template allows the model to expand beyond simple text-based conversations and perform function-based tasks.\n",
329 | "- **Structured Outputs for Tools**: When the conversation involves tool calls (e.g., through APIs or function calls), the template ensures that these interactions are formatted properly for execution."
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": null,
335 | "metadata": {},
336 | "outputs": [],
337 | "source": [
338 | "model_id = \"meta-llama/Llama-3.2-1B-Instruct\"\n",
339 | "\n",
340 | "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
341 | "\n",
342 | "print(tokenizer.chat_template)"
343 | ]
344 | },
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {},
348 | "source": [
349 | "Let's generate again the same example using the `chat_template` of `meta-llama/Llama-3.2-1B-Instruct` and analyze the output."
350 | ]
351 | },
352 | {
353 | "cell_type": "markdown",
354 | "metadata": {},
355 | "source": [
356 | "With a tokenizer that supports a chat template, we can directly call the `apply_chat_template()` method to convert a list of messages (each one a dictionary in the format discussed above) into a prompt.\n",
357 | "\n",
358 | "Notice that, since we are not using any particular tools or other functionalities, our template will be similar to the one we manually introduced earlier."
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": null,
364 | "metadata": {},
365 | "outputs": [],
366 | "source": [
367 | "# TODO: Create the pipeline with the model and tokenizer\n",
368 | "pipe = ...\n",
369 | "\n",
370 | "messages = [\n",
371 | " {\"role\": \"system\", \"content\": \"You are a pirate chatbot who always responds in pirate speak!\"},\n",
372 | " {\"role\": \"user\", \"content\": \"Who are you?\"},\n",
373 | "]\n",
374 | "input_tokens = tokenizer.apply_chat_template(messages)\n",
375 | "print(tokenizer.decode(input_tokens))\n",
376 | "\n",
377 | "# TODO: Generate the output text (pass input_tokens\n",
378 | "# as the input. You can use the `max_new_tokens` parameter\n",
379 | "# to control the length of the output)\n",
380 | "outputs = ...\n",
381 | "\n",
382 | "# we are getting back the full conversation history\n",
383 | "# as a list of messages outputs[0][\"generated_text\"]\n",
384 | "# -1 : last message (assistant response)\n",
385 | "print(outputs[0][\"generated_text\"][-1][\"content\"])"
386 | ]
387 | },
388 | {
389 | "cell_type": "markdown",
390 | "metadata": {},
391 | "source": [
392 | "Notice that the pipeline already supports chat mode, so we can pass the list of messages (as long as they contain role/content keys) directly to the pipeline.\n",
393 | "\n",
394 | "Alternatively, we could have passed the prompt as a string. In this case, however, we would have to manually extract the output from the model and parse it back."
395 | ]
396 | },
397 | {
398 | "cell_type": "code",
399 | "execution_count": null,
400 | "metadata": {},
401 | "outputs": [],
402 | "source": [
403 | "messages = [\n",
404 | " {\"role\": \"system\", \"content\": \"You are a pirate chatbot who always responds in pirate speak!\"},\n",
405 | " {\"role\": \"user\", \"content\": \"Who are you?\"},\n",
406 | "]\n",
407 | "\n",
408 | "# TODO: Format the messages using the chat template and generate the output text\n",
409 | "input_tokens = ...\n",
410 | "prompt_string = ...\n",
411 | "\n",
412 | "outputs = ...\n",
413 | "\n",
414 | "print(outputs[0][\"generated_text\"])"
415 | ]
416 | },
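One way to fill in the TODOs above; add_generation_prompt=True is an assumption about what the exercise expects (it appends the header that cues the assistant's turn):

    # Render the conversation with the tokenizer's own chat template
    input_tokens = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    prompt_string = tokenizer.decode(input_tokens)

    # Passing a plain string: the pipeline returns the prompt plus the completion as one string
    outputs = pipe(prompt_string, max_new_tokens=256)
    print(outputs[0]["generated_text"])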
417 | {
418 | "cell_type": "markdown",
419 | "metadata": {},
420 | "source": [
421 | "# 2. Mistral"
422 | ]
423 | },
424 | {
425 | "cell_type": "markdown",
426 | "metadata": {},
427 | "source": [
428 | "In this part, we will explore the use of `Mistral-7B-Instruct-v0.2`, developed by Mistral AI, to generate structured responses in JSON format.\n",
429 | "\n",
430 | "In this exercise, we will generate random math questions and instruct Mistral-7B to respond in a structured JSON format. We will then save the responses to a JSON file and verify the answers programmatically. \n",
431 | "\n",
432 | "Let's first repeat the same example we did with LLaMA, but now using Mistral.\n"
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "execution_count": null,
438 | "metadata": {},
439 | "outputs": [],
440 | "source": [
441 | "# Define the model ID\n",
442 | "model_id = \"mistralai/Mistral-7B-Instruct-v0.2\"\n",
443 | "\n",
444 | "# TODO: Load the model and the tokenizer \n",
445 | "model = ...\n",
446 | "\n",
447 | "tokenizer = ...\n",
448 | "tokenizer.pad_token = ...\n",
449 | "\n",
450 | "# TODO: Initialize the pipeline for text generation\n",
451 | "pipe = ...\n",
452 | "\n",
453 | "# Define the message prompts for the conversation\n",
454 | "messages = [\n",
455 | " {\"role\": \"system\", \"content\": \"You are a pirate chatbot who always responds in pirate speak!\"},\n",
456 | " {\"role\": \"user\", \"content\": \"Who are you?\"}\n",
457 | "]\n",
458 | "\n",
459 | "# TODO: Generate the response\n",
460 | "outputs = pipe(messages, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)\n",
461 | "\n",
462 | "# Print the model's generated response\n",
463 | "print(outputs[0][\"generated_text\"][-1][\"content\"])\n"
464 | ]
465 | },
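The loading TODOs mirror the LLaMA case; a sketch (device_map="auto" assumes the accelerate package is installed):

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)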
466 | {
467 | "cell_type": "markdown",
468 | "metadata": {},
469 | "source": [
470 | "Now let's generate a random math question and instruct Mistral-7B to respond in a structured JSON format. We will then save the responses to a JSON file and verify the answers programmatically!"
471 | ]
472 | },
473 | {
474 | "cell_type": "markdown",
475 | "metadata": {},
476 | "source": [
477 | "1. **Generate random math questions:** Use Python to create questions with random numbers in a conversational style (e.g., “What is the sum of 245 and 173?”)."
478 | ]
479 | },
480 | {
481 | "cell_type": "code",
482 | "execution_count": 54,
483 | "metadata": {},
484 | "outputs": [],
485 | "source": [
486 | "import random\n",
487 | "\n",
488 | "def generate_random_math_questions(num_samples=5):\n",
489 | " \"\"\"\n",
490 | " Generate random math questions with two numbers for a given number of samples.\n",
491 | "\n",
492 | " Args:\n",
493 | " num_samples (int): The number of math questions to generate.\n",
494 | " \n",
495 | " Returns:\n",
496 | " List of tuples: A list of tuples containing the math question and the two numbers (question, num1, num2).\n",
497 | " \"\"\"\n",
498 | "\n",
499 | " # Define templates for math questions with two numbers\n",
500 | " templates = [\n",
501 | " \"What is the sum of {} and {}?\",\n",
502 | " \"Can you add {} and {}?\",\n",
503 | " \"Calculate the sum of {} and {} for me.\",\n",
504 | " \"How much is {} plus {}?\",\n",
505 | " \"Please add {} and {}.\"\n",
506 | " ]\n",
507 | " \n",
508 | " questions = []\n",
509 | " for _ in range(num_samples):\n",
510 | " # TODO: Randomly select a template and generate two random numbers\n",
511 | " template = ...\n",
512 | " num1 = ...\n",
513 | " num2 = ...\n",
514 | " question = template.format(num1, num2)\n",
515 | " questions.append((question, num1, num2)) # store question with numbers for validation\n",
516 | " return questions\n"
517 | ]
518 | },
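A minimal way to complete the TODO inside the loop; the 1-999 range is an arbitrary choice:

    template = random.choice(templates)
    num1 = random.randint(1, 999)
    num2 = random.randint(1, 999)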
519 | {
520 | "cell_type": "markdown",
521 | "metadata": {},
522 | "source": [
523 | "2. **Instruct the model to respond in JSON:** Use a system role instruction to ensure Mistral-7B answers in a JSON format containing the fields `num_1`, `num_2`, and `answer`. This makes the output compatible with automated processing or JSON parsers."
524 | ]
525 | },
526 | {
527 | "cell_type": "code",
528 | "execution_count": 55,
529 | "metadata": {},
530 | "outputs": [],
531 | "source": [
532 | "role_instruction = {\n",
533 | " \"role\": \"system\",\n",
534 | " \"content\": \"Answer each question in JSON format with the fields 'num_1', 'num_2', and 'answer'. Provide only JSON to ensure compatibility with a JSON parser.\"\n",
535 | "}\n"
536 | ]
537 | },
538 | {
539 | "cell_type": "markdown",
540 | "metadata": {},
541 | "source": [
542 | "3. **Save and verify responses:** Generate and store model responses in a JSON file and check if the answers match expected values."
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "execution_count": 57,
548 | "metadata": {},
549 | "outputs": [],
550 | "source": [
551 | "import json\n",
552 | "from tqdm import tqdm\n",
553 | "\n",
554 | "# Generate questions and answers, then save to JSON\n",
555 | "questions = generate_random_math_questions(num_samples=5)\n",
556 | "answers = []\n",
557 | "\n",
558 | "# Generate structured answers for each question\n",
559 | "for question, num1, num2 in tqdm(questions):\n",
560 | " # TODO: Define the message prompts\n",
561 | " formatted_messages = ...\n",
562 | " \n",
563 | " # TODO: Generate the response\n",
564 | " outputs = ...\n",
565 | " \n",
566 | " # Extract the model's JSON output\n",
567 | " structured_answer = outputs[0][\"generated_text\"]\n",
568 | "\n",
569 | " answers.append({\n",
570 | " \"question\": question,\n",
571 | " \"num_1\": num1,\n",
572 | " \"num_2\": num2,\n",
573 | " \"model_answer\": structured_answer\n",
574 | " })\n",
575 | "\n",
576 | "# Save answers to a JSON file\n",
577 | "with open(\"model_answers.json\", \"w\") as f:\n",
578 | " json.dump(answers, f, indent=2)\n"
579 | ]
580 | },
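One possible completion of the TODOs in the loop above, assuming the pipe created for Mistral in the previous cells; if the model's chat template rejects a system role, the instruction can be prepended to the user message instead:

    formatted_messages = [
        role_instruction,
        {"role": "user", "content": question},
    ]
    outputs = pipe(formatted_messages, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)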
581 | {
582 | "cell_type": "code",
583 | "execution_count": null,
584 | "metadata": {},
585 | "outputs": [],
586 | "source": [
587 | "import json\n",
588 | "\n",
589 | "# Function to parse model's answer and verify correctness\n",
590 | "def verify_answer(entry):\n",
591 | " try:\n",
592 | " # Extract expected values\n",
593 | " num1, num2 = entry[\"num_1\"], entry[\"num_2\"]\n",
594 | " expected_answer = num1 + num2\n",
595 | " \n",
596 | " # Extract the assistant's response from the list of messages\n",
597 | " assistant_message = next(\n",
598 | " (msg[\"content\"] for msg in entry[\"model_answer\"] if msg[\"role\"] == \"assistant\"), None\n",
599 | " )\n",
600 | " \n",
601 | " if assistant_message is None:\n",
602 | " raise ValueError(\"Assistant's message not found in model_answer\")\n",
603 | " \n",
604 | " # Parse model's structured answer from JSON\n",
605 | " model_response = json.loads(assistant_message.strip()) # Ensure model_answer is a string\n",
606 | " \n",
607 | " print(f\"Expected answer: {num1} + {num2} = {expected_answer}\")\n",
608 | " print(f\"Model's answer: {model_response['num_1']} + {model_response['num_2']} = {model_response['answer']}\")\n",
609 | " \n",
610 | " # Check if the values match\n",
611 | " if (model_response[\"num_1\"] == num1 and \n",
612 | " model_response[\"num_2\"] == num2 and \n",
613 | " model_response[\"answer\"] == expected_answer):\n",
614 | " return True\n",
615 | " else:\n",
616 | " return False\n",
617 | " except (json.JSONDecodeError, KeyError, TypeError, ValueError) as e:\n",
618 | " # Handle cases where parsing fails or keys are missing\n",
619 | " print(f\"Error verifying entry: {entry}. Error: {e}\")\n",
620 | " return False\n",
621 | "\n",
622 | "# Load answers from the JSON file and verify\n",
623 | "try:\n",
624 | " with open(\"model_answers.json\", \"r\") as f:\n",
625 | " saved_answers = json.load(f)\n",
626 | "except (json.JSONDecodeError, FileNotFoundError) as e:\n",
627 | " print(f\"Error loading JSON file: {e}\")\n",
628 | " saved_answers = []\n",
629 | "\n",
630 | "for i, entry in enumerate(saved_answers, 1):\n",
631 | " result = verify_answer(entry)\n",
632 | " print(f\"Question {i}:\", \"Correct\" if result else \"Incorrect\", \"\\n\")"
633 | ]
634 | }
635 | ],
636 | "metadata": {
637 | "kernelspec": {
638 | "display_name": "Python 3",
639 | "language": "python",
640 | "name": "python3"
641 | },
642 | "language_info": {
643 | "codemirror_mode": {
644 | "name": "ipython",
645 | "version": 3
646 | },
647 | "file_extension": ".py",
648 | "mimetype": "text/x-python",
649 | "name": "python",
650 | "nbconvert_exporter": "python",
651 | "pygments_lexer": "ipython3",
652 | "version": "3.10.12"
653 | }
654 | },
655 | "nbformat": 4,
656 | "nbformat_minor": 2
657 | }
658 |
--------------------------------------------------------------------------------
/lab09/text-09-code-generation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Exercise 1: A function to compute average grades\n",
8 | "\n",
9 | "In this series of exercises, we will use LLMs to generate code based on specific requirements and then compare the output against predefined test cases to ensure correctness. The focus will be on generating functional code and executing tests to verify that it meets the given specifications.\n",
10 | "\n",
11 | "### Step 1: Requirements\n",
12 | "\n",
13 | "The function receives the grades of a student for her courses (for simplicity, exactly six grades for six courses are considered) and computes the average. Grades can be from 18 to 30, or 30Laude (which is represented by the integer 33). The average is computed excluding the best and the worst grade."
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": null,
19 | "metadata": {},
20 | "outputs": [],
21 | "source": [
22 | "\n",
23 | "\n",
24 | "\n",
25 | "def compute_average(grades):\n",
26 | " \"\"\"\n",
27 | " Computes the average of grades, excluding the best and worst grades.\n",
28 | " \n",
29 | " Args:\n",
30 | " grades (list): A list of six grades, where each grade is between 18 and 33.\n",
31 | " \n",
32 | " Returns:\n",
33 | " float: The computed average.\n",
34 | " \"\"\"\n",
35 | "\n",
36 | " #TODO\n",
37 | " #define here your reference solution\n",
38 | "\n",
39 | " return 0.0\n",
40 | "\n",
41 | "\n",
42 | "#TODO\n",
43 | "#paste here the code of the function as generated by ChatGPT\n",
44 | "\n",
45 | "gpt_result = \"\"\"\"\"\"\n",
46 | "\n",
47 | "\n",
48 | "#the dictionary my_codes will contain all the generated codes. Let's start by adding the one generated by chatgpt\n",
49 | "\n",
50 | "my_codes = {}\n",
51 | "my_codes[\"GPT\"] = gpt_result\n",
52 | "\n",
53 | "\n",
54 | "# Example usage of the function\n",
55 | "\n",
56 | "grades = [18, 25, 30, 33, 22, 28]\n",
57 | "average = compute_average(grades)\n",
58 | "print(f\"The computed average is: {average:.2f}\")\n",
59 | "\n",
60 | "\n"
61 | ]
62 | },
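A possible reference solution for the TODO above (a sketch; the exact exception types for invalid inputs are a design choice, here ValueError):

    def compute_average(grades):
        """Average of six grades, excluding one best and one worst grade."""
        if not isinstance(grades, (list, tuple)) or len(grades) != 6:
            raise ValueError("Exactly six grades are required")
        for g in grades:
            if isinstance(g, bool) or not isinstance(g, (int, float)):
                raise ValueError("Grades must be numeric")
            if not (18 <= g <= 30 or g == 33):
                raise ValueError("Grades must be between 18 and 30, or 33 (30Laude)")
        trimmed = sorted(grades)[1:-1]  # drop one lowest and one highest grade
        return sum(trimmed) / len(trimmed)

    print(compute_average([18, 25, 30, 33, 22, 28]))  # 26.25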
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "### Step 2: Test Cases\n",
68 | "\n",
69 | "To generate a comprehensive black-box test suite for the compute_average function, we will create tests that cover different equivalence classes, boundary conditions, and typical scenarios.\n",
70 | "\n",
71 | "- Valid inputs: \n",
72 | " - Standard valid grades within the range of 18 to 33, including \"30Laude\" (33).\n",
73 | " - Tests where all grades are different, with some being 30Laude.\n",
74 | "\n",
75 | "- Boundary values:\n",
76 | " - Input with the lowest valid grade (18) and the highest valid grade (33).\n",
77 | "\n",
78 | "- Invalid inputs:\n",
79 | " - A grade list that has more or fewer than 6 grades.\n",
80 | " - Grades outside the valid range (less than 18 or greater than 33).\n",
81 | " - non-numeric inputs\n",
82 | "\n",
83 | "- Edge cases:\n",
84 | " - All grades are the same.\n",
85 | " - All grades are 30Laude (33).\n"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "metadata": {},
92 | "outputs": [],
93 | "source": [
94 | "import pytest\n",
95 | "\n",
96 | "# An example of a valid test case. Notice the use of the assert statement and pytest.approx to cope with possible rounding\n",
97 | "\n",
98 | "def test_valid_grades():\n",
99 | " grades = [18, 25, 30, 33, 22, 28]\n",
100 | " result = compute_average(grades)\n",
101 | " assert result == pytest.approx(26.25, rel=1e-2)\n",
102 | "\n",
103 | "# An example of an invalid test case. Notice the use of the with pytest.raises syntax, to catch exceptions in the function\n",
104 | "\n",
105 | "# Invalid test cases\n",
106 | "def test_more_than_six_grades():\n",
107 | " grades = [18, 25, 30, 33, 22, 28, 29]\n",
108 | " with pytest.raises(ValueError): # HINT: pay attention to the type of exception expected\n",
109 | " compute_average(grades)\n",
110 | "\n",
111 | "#TODO\n",
112 | "# Try to define here additional valid and invalid test cases, that can be used to verify the generated code.\n",
113 | "# Further examples are already defined in test_cases_01.py"
114 | ]
115 | },
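A few additional black-box tests that could be added for the TODO above; they assume the reference solution raises ValueError (or TypeError) for invalid inputs, which should be aligned with your own implementation:

    def test_fewer_than_six_grades():
        with pytest.raises(ValueError):
            compute_average([18, 25, 30, 33, 22])

    def test_grade_below_range():
        with pytest.raises(ValueError):
            compute_average([17, 25, 30, 33, 22, 28])

    def test_non_numeric_input():
        with pytest.raises((ValueError, TypeError)):
            compute_average([18, "twenty", 30, 33, 22, 28])

    def test_all_grades_equal():
        assert compute_average([24, 24, 24, 24, 24, 24]) == pytest.approx(24.0, rel=1e-2)

    def test_all_30laude():
        assert compute_average([33, 33, 33, 33, 33, 33]) == pytest.approx(33.0, rel=1e-2)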
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {},
119 | "source": [
120 | "### Step 3: Run the test cases\n",
121 | "\n",
122 | "This code defines a function run_tests that uses the ipytest library to run test cases within a Jupyter notebook. The ipytest library searches for test cases defined in the notebook and runs them. The '-vv' flag provides verbose output."
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "# Run the test suite\n",
132 | "\n",
133 | "import pytest\n",
134 | "import ipytest\n",
135 | "\n",
136 | "def run_tests():\n",
137 | " ipytest.run('-vv') \n",
138 | "\n",
139 | "# Running the tests\n",
140 | "run_tests()\n",
141 | "\n",
142 | "#TODO\n",
143 | "#what is the results of the tests?"
144 | ]
145 | },
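Depending on the ipytest version, the runner may need to be configured once before run() is called; a minimal sketch:

    import ipytest
    ipytest.autoconfig()  # register the notebook with pytest (needed once per session)

    ipytest.run('-vv')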
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "### Step 4: Generating code with CodeLLAMA\n",
151 | "\n",
152 | "In this step, we leverage CodeLLAMA to automatically generate Python code based on specified requirements.\n",
153 | "\n",
154 | "We define a coding prompt by providing the requirements, the arguments, and the expected return value.\n"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "metadata": {},
161 | "outputs": [],
162 | "source": [
163 | "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
164 | "from huggingface_hub import login\n",
165 | "import torch\n",
166 | "import os\n",
167 | "\n",
168 | "# HF login\n",
169 | "# ---------------- LOCAL RUNTIME --------------------------------------------\n",
170 | "#hf_token = os.getenv(\"HUGGING_FACE_HUB_TOKEN\") # The token must have been previously set in the environment variables (the syntax varies depending on your environment)\n",
171 | "# ---------------------------------------------------------------------------\n",
172 | "\n",
173 | "# ----------------- Google Colab --------------------------------------------\n",
174 | "from google.colab import userdata\n",
175 | "hf_token = userdata.get('HF_TOKEN')\n",
176 | "#----------------------------------------------------------------------------\n",
177 | "\n",
178 | "login(hf_token)\n",
179 | "\n",
180 | "# Detect the device\n",
181 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
182 | "\n",
183 | "\n",
184 | "# Load the model and tokenizer from Hugging Face\n",
185 | "model_name = \"codellama/CodeLlama-13b-Instruct-hf\" # Specify the model name\n",
186 | "model = AutoModelForCausalLM.from_pretrained(model_name).to(device) # Move model to GPU if available\n",
187 | "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
188 | "\n",
189 | "\n",
190 | "# We define a method to ask any prompt to llama\n",
191 | "def make_a_query(prompt: str, max_new_tokens:int = 200):\n",
192 | " \"\"\"\n",
193 | " Send a prompt to the Llama model and get a response.\n",
194 | "\n",
195 | " Args:\n",
196 | " - prompt (str): The input question or statement to the model.\n",
197 | " - max_new_tokens (int): The maximum length of the response.\n",
198 | "\n",
199 | " Returns:\n",
200 | " - str: The model's generated response.\n",
201 | " \"\"\"\n",
202 | "\n",
203 | " # Both the tokenizer and the model are global variables\n",
204 | "\n",
205 | " # Set pad_token_id if missing\n",
206 | " if tokenizer.pad_token_id is None:\n",
207 | " tokenizer.pad_token_id = tokenizer.eos_token_id\n",
208 | "\n",
209 | "\n",
210 | " # Tokenize the input with padding and truncation\n",
211 | " device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
212 | " inputs = tokenizer(prompt, return_tensors=\"pt\", padding=True, truncation=True).to(device)\n",
213 | "\n",
214 | "    # Compute the length of the input prompt to be able to extract the model's response later\n",
215 | " input_ids = inputs[\"input_ids\"]\n",
216 | " prompt_length = input_ids.shape[1]\n",
217 | "\n",
218 | " # Generate a response\n",
219 | " output = model.generate(\n",
220 | " inputs['input_ids'],\n",
221 | " attention_mask=inputs['attention_mask'],\n",
222 | "        max_new_tokens=max_new_tokens, # Limit the number of new tokens generated\n",
223 | " #temperature=0.3, # Reduce randomness, use with do_sample = True\n",
224 | " repetition_penalty=2.0, # Penalize repetition\n",
225 | "        no_repeat_ngram_size=3, # Avoid repeating 3-grams\n",
226 | " do_sample= False, # Set to False to use Greedy or Beam search\n",
227 | " num_beams=3, # Use with do_sample = False\n",
228 | " eos_token_id=tokenizer.eos_token_id, # End generation at EOS token\n",
229 | " pad_token_id=tokenizer.pad_token_id, # Avoid padding tokens\n",
230 | " early_stopping=True,\n",
231 | " )\n",
232 | "\n",
233 | " generated_tokens = output[0, prompt_length:]\n",
234 | "\n",
235 | " # Decode the response into human-readable text\n",
236 | " response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()\n",
237 | "\n",
238 | " return response\n",
239 | "\n",
240 | "#TODO\n",
241 | "# Define the prompt\n",
242 | "# HINT: When defining the prompt, also think about the output format necessary for the subsequent phases\n",
243 | "sys_prompt = \"\"\"\n",
244 | "\"\"\"\n",
245 | "query = \"\"\"\n",
246 | "\"\"\"\n",
247 | "\n",
248 | "# Apply the chat template\n",
249 | "input = \"\"\n",
250 | "\n",
251 | "# Generate code using the model\n",
252 | "generated_code = \"\"\n",
253 | "my_codes[\"CODELLAMA\"] = generated_code"
254 | ]
255 | },
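A possible way to fill in the prompt TODOs; the wording of sys_prompt and query is only an example, and the sketch assumes the CodeLlama-Instruct tokenizer ships a chat template (as the Instruct checkpoints on the Hub do):

    sys_prompt = (
        "You are a careful Python programmer. Return only the code of the requested "
        "function, with no explanations, so it can be written to a .py file as-is."
    )
    query = (
        "Write a Python function compute_average(grades) that receives exactly six grades "
        "(integers from 18 to 30, or 33 for 30Laude) and returns the average computed after "
        "excluding the best and the worst grade. Raise ValueError for invalid inputs."
    )

    messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": query},
    ]

    # Apply the chat template to obtain a single prompt string
    input = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Generate code using the model
    generated_code = make_a_query(input, max_new_tokens=300)
    my_codes["CODELLAMA"] = generated_code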
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "### Step 5: organizing and analyzing test case results with pytest\n",
261 | "\n",
262 | "The code below automates running test cases in a Python file with pytest and parsing the results.\n\nThe function parse_test_results extracts the counts of errors, failures, and passes from the summary line of the pytest output using regular expressions. The script runs the specified test file (test_cases_01.py) via a subprocess call to pytest and captures its plain-text output. From the output, it locates the summary line containing the test results and extracts the key metrics.\n\nFinally, it computes the functional correctness ratio as the fraction of passed tests over the total number of tests executed, and prints the counts of passed, failed, and errored tests along with this ratio."
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": null,
268 | "metadata": {},
269 | "outputs": [],
270 | "source": [
271 | "import pytest\n",
272 | "import subprocess\n",
273 | "import re\n",
274 | "\n",
275 | "\n",
276 | "def parse_test_results(result_string):\n",
277 | " \"\"\"\n",
278 | " Parses the test result string to extract numbers of errors, failures, and passes.\n",
279 | "\n",
280 | " Args:\n",
281 | " result_string (str): A string containing test results (e.g., \"3 failed, 11 passed in 0.04s\").\n",
282 | "\n",
283 | " Returns:\n",
284 | " dict: A dictionary with keys 'errors', 'failures', and 'passed' and their respective counts.\n",
285 | " \"\"\"\n",
286 | " # Regular expressions to match the counts for errors, failures, and passes\n",
287 | " errors = re.search(r\"(\\d+)\\s+errors?\", result_string)\n",
288 | " failures = re.search(r\"(\\d+)\\s+failed\", result_string)\n",
289 | " passes = re.search(r\"(\\d+)\\s+passed\", result_string)\n",
290 | "\n",
291 | " # Extract numbers or default to 0 if not found\n",
292 | " return int(errors.group(1)) if errors else 0, int(failures.group(1)) if failures else 0, int(passes.group(1)) if passes else 0\n",
293 | "\n",
294 | "# Define the path to your test file\n",
295 | "test_file = \"test_cases_01.py\" \n",
296 | "\n",
297 | "\n",
298 | "\n",
299 | "\n",
300 | "\n",
301 | "\n",
302 | "#TODO\n",
303 | "# Define code to write the function generated with GPT in a file \"function_01.py\". This generated function will be called by the test cases in the file test_cases_01.py\n",
304 | "\n",
305 | "# HINT: with ...\n",
306 | "\n",
307 | "\n",
308 | "\n",
309 | "\n",
310 | "# Run the pytest command and capture the output\n",
311 | "result = subprocess.run(\n",
312 | " [\"pytest\", test_file, \"--disable-warnings\", \"--tb=short\", \"-q\", \"--color=no\"],\n",
313 | " stdout=subprocess.PIPE,\n",
314 | " stderr=subprocess.PIPE,\n",
315 | " text=True\n",
316 | ")\n",
317 | "\n",
318 | "# Extract test results from the pytest output\n",
319 | "output_lines = result.stdout.split(\"\\n\")\n",
320 | "summary_line = next((line for line in output_lines if \"passed\" in line or \"failed\" in line or \"error\" in line), None)\n",
321 | "result_line = output_lines[-2]\n",
322 | "\n",
323 | "errors, failures, passes = parse_test_results(result_line)\n",
324 | "gpt_functional_correctness = passes/(errors+failures+passes)\n",
325 | "\n",
326 | "\n",
327 | "# Print the results\n",
328 | "print(\"GPT\")\n",
329 | "print(f\"# Passed: {passes}\")\n",
330 | "print(f\"# Failed: {failures}\")\n",
331 | "print(f\"# Errors: {errors}\")\n",
332 | "print(f\"Functional Correctness Ratio: {gpt_functional_correctness:.2f}\")\n"
333 | ]
334 | },
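One way to complete the TODO that writes the GPT-generated function to disk, so that test_cases_01.py can import it:

    with open("function_01.py", "w") as f:
        f.write(my_codes["GPT"])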
335 | {
336 | "cell_type": "markdown",
337 | "metadata": {},
338 | "source": [
339 | "### Step 6: comparing results\n",
340 | "\n",
341 | "We now extend our analysis to compare the results provided by different agents.\n",
342 | "\n",
343 | "For simplicity, we store the function always on the same file (e.g., function_01.py) and we re-execute the same test file (e.g., test_cases_01.py)\n",
344 | "\n",
345 | "We evaluate the results based on the functional correctness ratio, and we select the best option that was generated."
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": null,
351 | "metadata": {},
352 | "outputs": [],
353 | "source": [
354 | "\n",
355 | "#TODO\n",
356 | "# Define code to write the function generated with LLAMA in the same file \"function_01.py\". This generated function will be called by the test cases in the file test_cases_01.py\n",
357 | "\n",
358 | "# HINT: with ...\n",
359 | "\n",
360 | "\n",
361 | "result = subprocess.run(\n",
362 | " [\"pytest\", test_file, \"--disable-warnings\", \"--tb=short\", \"-q\", \"--color=no\"],\n",
363 | " stdout=subprocess.PIPE,\n",
364 | " stderr=subprocess.PIPE,\n",
365 | " text=True\n",
366 | ")\n",
367 | "\n",
368 | "# Extract test results from the pytest output\n",
369 | "output_lines = result.stdout.split(\"\\n\")\n",
370 | "summary_line = next((line for line in output_lines if \"passed\" in line or \"failed\" in line or \"error\" in line), None)\n",
371 | "\n",
372 | "\n",
373 | "result_line = output_lines[-2]\n",
374 | "\n",
375 | "errors, failures, passes = parse_test_results(result_line)\n",
376 | "llama_functional_correctness = passes/(errors+failures+passes)\n",
377 | "\n",
378 | "\n",
379 | "# Print the results\n",
380 | "print(\"CODELLAMA\")\n",
381 | "print(f\"# Passed: {passes}\")\n",
382 | "print(f\"# Failed: {failures}\")\n",
383 | "print(f\"# Errors: {errors}\")\n",
384 | "print(f\"Functional Correctness Ratio: {llama_functional_correctness:.2f}\")\n",
385 | "\n"
386 | ]
387 | },
388 | {
389 | "cell_type": "markdown",
390 | "metadata": {},
391 | "source": [
392 | "### Step 7: Analyzing and comparing the quality of the generated code\n",
393 | "\n",
394 | "Static code quality metrics are used to evaluate the structure, readability, and maintainability of code without executing it. Below is a Python program that calculates some common static code quality metrics for a given Python file:\n",
395 | "\n",
396 | "- Lines of Code (LOC): The total number of lines in the file.\n",
397 | "- Comment Density: The percentage of lines that are comments.\n",
398 | "- Cyclomatic Complexity: A measure of the complexity of a program based on the number of linearly independent paths.\n",
399 | "- Maintainability Index (MI): A software metric used to predict how maintainable the code is. It is often calculated using a combination of Cyclomatic Complexity (CC), Lines of Code (LOC), and Halstead Volume (HV).\n",
400 | "\n",
401 | "The **Radon** tool is a Python package that helps analyze various aspects of code quality, such as Cyclomatic Complexity (CC), Maintainability Index (MI), Raw Metrics (LOC), and Halstead metrics. You can use Radon to analyze the static code quality of your Python code by running it from the command line or through Python code.\n",
402 | "\n",
403 | "The tool can be installed with the following command: `pip install radon`\n"
404 | ]
405 | },
406 | {
407 | "cell_type": "code",
408 | "execution_count": null,
409 | "metadata": {},
410 | "outputs": [],
411 | "source": [
412 | "import radon.complexity as radon_complexity\n",
413 | "import radon.metrics as radon_metrics\n",
414 | "import radon.raw as radon_raw\n",
415 | "\n",
416 | "def analyze_code_with_radon(file_path):\n",
417 | " \"\"\"\n",
418 | " Analyzes a Python file and calculates code quality metrics:\n",
419 | " - Cyclomatic Complexity (CC)\n",
420 | " - Maintainability Index (MI)\n",
421 | " - Raw Metrics (LOC, number of functions, etc.)\n",
422 | " \n",
423 | " Args:\n",
424 | " file_path (str): Path to the Python file to analyze.\n",
425 | " \"\"\"\n",
426 | " with open(file_path, 'r') as file:\n",
427 | " code = file.read()\n",
428 | "\n",
429 | " # Cyclomatic Complexity Analysis (CC)\n",
430 | " cc_results = radon_complexity.cc_visit(code)\n",
431 | " print(\"Cyclomatic Complexity:\")\n",
432 | " for result in cc_results:\n",
433 | " print(f\"Function: {result.name}, Complexity: {result.complexity}\")\n",
434 | "\n",
435 | " # Maintainability Index (MI)\n",
436 | " maintainability_index = radon_metrics.mi_visit(code, multi=False) # Set multi=False for single file\n",
437 | " print(f\"\\nMaintainability Index: {maintainability_index}\")\n",
438 | "\n",
439 | " # Raw Metrics (LOC, number of functions, etc.)\n",
440 | "    raw_metrics = radon_raw.analyze(code) # Use the analyze() function for raw metrics\n",
441 | " print(f\"\\nLines of Code (LOC): {raw_metrics.loc}\")\n",
442 | " print(f\"Number of Comments: {raw_metrics.comments}\")\n",
443 | " #print(f\"Blank Lines: {raw_metrics.blank_lines}\")\n",
444 | "\n",
445 | " result = {}\n",
446 | " result[\"CC\"] = cc_results[0].complexity\n",
447 | " result[\"MI\"] = maintainability_index\n",
448 | " result[\"LOC\"] = raw_metrics.loc\n",
449 | " result[\"Comments\"] = raw_metrics.comments\n",
450 | " return result\n",
451 | "\n",
452 | "\n",
453 | "print(my_codes)\n",
454 | "\n",
455 | "\n",
456 | "overall_results = {}\n",
457 | "\n",
458 | "\n",
459 | "#TODO: cycle over all the pairs language, code in the \"my_codes\" dictionary. For each code, save the function in the function_01.py file and:\n",
460 | "# - execute the radon analysis with the function above\n",
461 | "# - run the python tests by using the code provided in a previous cell\n",
462 | "\n",
463 | "#Example of the final structure of the overall_results dictionary\n",
464 | "#{'GPT': {'CC': 4, 'MI': 93.26472421003476, 'LOC': 24, 'Comments': 3, 'FunctionalCorrectness': 0.80}, 'CODELLAMA': {'CC': 1, 'MI': 77.16231550361674, 'LOC': 5, 'Comments': 0, 'FunctionalCorrectness': 0.43}}\n",
465 | " \n",
466 | "\n",
467 | "print(overall_results)\n",
468 | "\n"
469 | ]
470 | },
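A sketch of the loop requested by the TODO above; it reuses the helpers defined earlier (analyze_code_with_radon, parse_test_results, test_file) and assumes each entry in my_codes is plain Python source:

    for agent, code in my_codes.items():
        # Save the candidate implementation where the test suite expects it
        with open("function_01.py", "w") as f:
            f.write(code)

        # Static analysis with radon
        metrics = analyze_code_with_radon("function_01.py")

        # Dynamic analysis: run the pytest suite and parse the summary line
        run = subprocess.run(
            ["pytest", test_file, "--disable-warnings", "--tb=short", "-q", "--color=no"],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True,
        )
        errors, failures, passes = parse_test_results(run.stdout.split("\n")[-2])
        total = errors + failures + passes
        metrics["FunctionalCorrectness"] = passes / total if total else 0.0

        overall_results[agent] = metrics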
471 | {
472 | "cell_type": "markdown",
473 | "metadata": {},
474 | "source": [
475 | "### Step 8: Plotting the results\n",
476 | "\n",
477 | "We generate a set of bar charts to visualize various software metrics for the different code generators, including Cyclomatic Complexity (CC), Maintainability Index (MI), Lines of Code (LOC), and the number of comments.\n",
478 | "\n",
479 | "We start by extracting the relevant data from the overall_results dictionary, which contains these metrics for each generator. Then, using Matplotlib, we create a 3x2 grid of subplots, each dedicated to a different metric.\n",
480 | "\n",
481 | "The first four subplots display the values for CC, MI, LOC, and comments, while the last subplot shows the functional correctness values for the two models, GPT and CodeLLAMA."
482 | ]
483 | },
484 | {
485 | "cell_type": "code",
486 | "execution_count": null,
487 | "metadata": {},
488 | "outputs": [],
489 | "source": [
490 | "\n",
491 | "import matplotlib.pyplot as plt\n",
492 | "import numpy as np\n",
493 | "\n",
494 | "\n",
495 | "languages = list(overall_results.keys()) \n",
496 | "\n",
497 | "\n",
498 | "\n",
499 | "cc_values = [overall_results[language]['CC'] for language in languages]\n",
500 | "mi_values = [overall_results[language]['MI'] for language in languages]\n",
501 | "loc_values = [overall_results[language]['LOC'] for language in languages]\n",
502 | "comments_values = [overall_results[language]['Comments'] for language in languages]\n",
503 | "functional_correctness = [overall_results[language]['FunctionalCorrectness'] for language in languages]\n",
504 | "\n",
505 | "# Create a figure with subplots for each metric\n",
506 | "fig, axs = plt.subplots(3, 2, figsize=(10, 8))\n",
507 | "\n",
508 | "# Plotting CC values\n",
509 | "axs[0, 0].bar(languages, cc_values, color='skyblue')\n",
510 | "axs[0, 0].set_title('Cyclomatic Complexity (CC)')\n",
511 | "axs[0, 0].set_ylabel('CC')\n",
512 | "\n",
513 | "# Plotting MI values\n",
514 | "axs[0, 1].bar(languages, mi_values, color='lightgreen')\n",
515 | "axs[0, 1].set_title('Maintainability Index (MI)')\n",
516 | "axs[0, 1].set_ylabel('MI')\n",
517 | "\n",
518 | "# Plotting LOC values\n",
519 | "axs[1, 0].bar(languages, loc_values, color='blue')\n",
520 | "axs[1, 0].set_title('Lines of Code (LOC)')\n",
521 | "axs[1, 0].set_ylabel('LOC')\n",
522 | "\n",
523 | "# Plotting Comments values\n",
524 | "axs[1, 1].bar(languages, comments_values, color='lightcoral')\n",
525 | "axs[1, 1].set_title('Number of Comments')\n",
526 | "axs[1, 1].set_ylabel('Comments')\n",
527 | "\n",
528 | "#plotting functional correctness\n",
529 | "axs[2, 0].bar(languages, functional_correctness, color='green')\n",
530 | "axs[2, 0].set_title('Functional correctness')\n",
531 | "axs[2, 0].set_ylabel('Correctness ratio')\n",
532 | "\n",
533 | "# Adjust layout and display the plot\n",
534 | "plt.tight_layout()\n",
535 | "plt.show()\n"
536 | ]
537 | },
538 | {
539 | "cell_type": "markdown",
540 | "metadata": {},
541 | "source": [
542 | "# Exercise 2: Comparing more LLM agents\n",
543 | "\n",
544 | "After analyzing a chat engine (GPT) and a Hugging Face model (CodeLLAMA), try to generate additional examples of code with other engines.\n",
545 | "\n",
546 | "For this purpose, you can use other models provided by Hugging Face and/or other ready-made chat engines (e.g., Qwen).\n",
547 | "NOTE: choose the *-Instruct version to be able to use the chat template; otherwise, refer to the model's Hugging Face page for the prompting syntax."
548 | ]
549 | },
550 | {
551 | "cell_type": "code",
552 | "execution_count": null,
553 | "metadata": {},
554 | "outputs": [],
555 | "source": [
556 | "#TODO generate additional code samples with other LLM engines\n",
557 | "qwen_chat = \"\"\n",
558 | "\n",
559 | "#add the results to the dictionary of codes\n",
560 | "\n",
561 | "#re-execute the analysis in the previous code boxes by comparing more languages"
562 | ]
563 | },
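A hedged sketch for Exercise 2 with a Hugging Face Instruct model; the model id below is just an example to adapt, and sys_prompt / query are assumed to be the ones defined in Step 4:

    from transformers import pipeline

    # Example model id; any *-Instruct code model from the Hub works similarly
    qwen_pipe = pipeline("text-generation", model="Qwen/Qwen2.5-Coder-1.5B-Instruct", device_map="auto")

    qwen_messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": query},
    ]
    qwen_out = qwen_pipe(qwen_messages, max_new_tokens=300)
    qwen_chat = qwen_out[0]["generated_text"][-1]["content"]

    # Add the result to the dictionary of codes and re-run the comparison cells
    my_codes["QWEN"] = qwen_chat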
564 | {
565 | "cell_type": "markdown",
566 | "metadata": {},
567 | "source": [
568 | "# Exercise 3: Additional functions\n",
569 | "\n",
570 | "To evaluate the generalizability of the LLM code generation, now modify the functions on which you are applying your analysis.\n",
571 | "\n",
572 | "Consider the following requirements:\n",
573 | "\n",
574 | "### Railway company\n",
575 | "\n",
576 | "A railway company allows people under 15 to travel for free. The offer is dedicated to groups\n",
577 | "of 2 to 5 people travelling together.\n",
578 | "To be eligible for the offer, at least one member of the group must be at least 18 years old. If this condition\n",
579 | "applies, all the under-15 members of the group travel for free, and the others pay the base price.\n",
580 | "The function computeFee receives as parameters basePrice (the price of the ticket), n_passengers (the\n",
581 | "number of passengers in the group), n_over18 (the number of passengers at least 18 years old), and n_under15 (the\n",
582 | "number of passengers under 15 years old). It returns the amount that the whole group has to pay. It\n",
583 | "raises an error if the group is composed of more than 5 people.\n",
584 | "double computeFee(double basePrice, int n_passengers, int n_over18, int n_under15);\n",
585 | "\n",
586 | "- define test cases for this case study in test_cases_02.py\n",
587 | "- use a file function_02.py to host the generated results for this function\n",
588 | "\n",
589 | "### Bike Race\n",
590 | "\n",
591 | "In a bike race, the bikers must complete the entire track within a maximum time, otherwise their race is not\n",
592 | "valid. The maximum time is computed, for each race, based on the winner's time, on the average speed on the\n",
593 | "track, and on the category of the track.\n",
594 | "For tracks of category 'A' (easy tracks) the maximum time is computed as the winner's time increased by 5% if\n",
595 | "the average speed is lower than 30 km/h (30 included), 10% if the average speed is between 30 and 35 km/h (35\n",
596 | "included), and 15% if the average speed is higher than 35 km/h.\n",
597 | "For tracks of category 'B' (normal tracks) the maximum time is computed as the winner's time increased by 20%\n",
598 | "if the average speed is lower than 30 km/h (30 included), 25% if the average speed is between 30 and 35 km/h\n",
599 | "(35 included), and 30% if the average speed is higher than 35 km/h.\n",
600 | "For tracks of category 'C' (hard tracks) the maximum time does not depend on average speed, and is always\n",
601 | "computed as the winner's time increased by 50%.\n",
602 | "The function computeMaxTime receives as parameters winner_time (the time of the winner, in minutes),\n",
603 | "avg_speed (the average speed of the track, in km/h) and track_type (a char, whose valid values are 'A', 'B', or\n",
604 | "'C'). It returns the maximum time, or 0 if there are errors in the input.\n",
605 | "double computeMaxTime(double winner_time, double avg_speed, char track_type)\n",
606 | "\n",
607 | "- define test cases for this case study in test_cases_03.py\n",
608 | "- use a file function_03.py to host the generated results for this function\n",
609 | "\n",
610 | "Re-execute the analysis once you have your test cases and adapted prompts. What changes based on the complexity of the requirements to implement?\n",
611 | "\n"
612 | ]
613 | }
614 | ],
615 | "metadata": {
616 | "kernelspec": {
617 | "display_name": "Python 3",
618 | "language": "python",
619 | "name": "python3"
620 | },
621 | "language_info": {
622 | "codemirror_mode": {
623 | "name": "ipython",
624 | "version": 3
625 | },
626 | "file_extension": ".py",
627 | "mimetype": "text/x-python",
628 | "name": "python",
629 | "nbconvert_exporter": "python",
630 | "pygments_lexer": "ipython3",
631 | "version": "3.10.10"
632 | }
633 | },
634 | "nbformat": 4,
635 | "nbformat_minor": 2
636 | }
637 |
--------------------------------------------------------------------------------