├── 1_nlp
    ├── README.md
    ├── part_III_llm_finetuning
    │   ├── LoRA.ipynb
    │   └── LoRA_solved.ipynb
    ├── part_II_text_generation
    │   ├── Transformer_Decoder_MoE.ipynb
    │   └── Transformer_Decoder_MoE_solved.ipynb
    └── part_I_text_classification
    │   ├── Transformer_Encoder_Classification.ipynb
    │   └── Transformer_Encoder_Classification_solved.ipynb
├── 2_cv
    ├── First_exercise
    │   ├── M2L_Tutor_1_exercise.ipynb
    │   └── M2L_Tutor_1_solution.ipynb
    ├── README.md
    ├── Second_exercise
    │   ├── M2L_Tutor_2_exercise.ipynb
    │   └── M2L_Tutor_2_solution.ipynb
    └── Third_exercise
    │   ├── M2L_Tutor_3_exercise.ipynb
    │   ├── M2L_Tutor_3_solution.ipynb
    │   └── requirements.txt
├── 3_diffusion
    ├── README.md
    ├── diffusion.ipynb
    └── diffusion_solved.ipynb
├── 4_rl
    ├── Part 01
    │   ├── M2L24_RL01 _Intro_to_RL_exercise.ipynb
    │   └── M2L24_RL01 _Intro_to_RL_solution.ipynb
    ├── Part 02
    │   ├── M2L24_RL02 _Policy_gradient_methods_exercise.ipynb
    │   └── M2L24_RL02 _Policy_gradient_methods_solution.ipynb
    ├── Part 03
    │   ├── M2L24_RL03_RLHF_exercise.ipynb
    │   └── M2L24_RL03_RLHF_solution.ipynb
    └── README.md
├── 5_gnn
    ├── README.md
    ├── part_I
    │   ├── introduction_to_gnns.ipynb
    │   └── introduction_to_gnns_solved.ipynb
    ├── part_II
    │   ├── Missing_data_estimation.ipynb
    │   └── Missing_data_estimation_solved.ipynb
    └── part_III
    │   ├── gnns_advanced_topics.ipynb
    │   └── gnns_advanced_topics_solved.ipynb
├── LICENSE
└── README.md


/1_nlp/README.md:
--------------------------------------------------------------------------------
 1 | # [[M2L2024](https://www.m2lschool.org/home)] Tutorial 1: Natural Language Processing
 2 | 
 3 | **Authors:** [Georgios Peikos](https://www.linkedin.com/in/peikosgeorgios/), [Luca Herranz-Celotti](https://lucehe.github.io/)
 4 | 
 5 | --- 
 6 | 
 7 | This is the tutorial of the 2024 Mediterranean Machine Learning Summer School on Natural 
 8 | Language Processing!
 9 | 
10 | This tutorial will explore the fundamental aspects of Natural Language Processing (NLP). 
11 | Basic Python programming skills are expected. Prior knowledge of standard NLP techniques 
12 | (e.g. text tokenization and classification with ML) is beneficial but optional when working 
13 | through the notebooks as they assume minimal prior knowledge.
14 | 
15 | This tutorial combines detailed analysis and development of essential NLP concepts via 
16 | custom (i.e. from scratch) implementations. Other necessary NLP components will be developed 
17 | using PyTorch's NLP library implementations. As a result, the tutorial offers deep 
18 | understanding and facilitates easy usage in future applications.
19 | 
20 | ### Outline
21 | 
22 | * Part I: Introduction to Text Tokenization and Classification
23 |   *  Text Classification: Simple Classifier
24 |   *  Text Classification: Encoder-only Transformer
25 | 
26 | * Part II: Introduction to Decoder-only Transformer and Sparse Mixture of Experts Architecture
27 |   *  Text Generation: Decoder-only Transformer
28 |   *  Text Generation: Decoder-only Transformer + MoE
29 | 
30 | * Part III: Introduction to Parameter Efficient Fine-tuning
31 |   *  Fine-tuning the full Pre-trained Models
32 |   *  Fine-tuning using Low-Rank Adaptation of Large Language Models (LoRA)
33 | 
34 | ### Notebooks
35 | 
36 | #### Part I: 
37 | Tutorial: [![Open In 
38 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/1_nlp/part_I_text_classification/Transformer_Encoder_Classification.ipynb)
39 | 
40 | Solution: [![Open In 
41 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/1_nlp/part_I_text_classification/Transformer_Encoder_Classification_solved.ipynb)
42 | 
43 | #### Part II: 
44 | Tutorial: [![Open In 
45 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/1_nlp/part_II_text_generation/Transformer_Decoder_MoE.ipynb)
46 | 
47 | Solution: [![Open In 
48 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/1_nlp/part_II_text_generation/Transformer_Decoder_MoE_solved.ipynb)
49 | 
50 | #### Part III: 
51 | Tutorial: [![Open In 
52 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/1_nlp/part_III_llm_finetuning/LoRA.ipynb)
53 | 
54 | Solution: [![Open In 
55 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/1_nlp/part_III_llm_finetuning/LoRA_solved.ipynb)
56 | 
57 | ---
58 | 


--------------------------------------------------------------------------------
/1_nlp/part_III_llm_finetuning/LoRA.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","source":["Natural Language Processing Tutorial\n","======\n","\n","This is the tutorial of the 2024 [Mediterranean Machine Learning Summer School](https://www.m2lschool.org/) on Natural Language Processing!\n","\n","This tutorial will explore the fundamental aspects of Natural Language Processing (NLP). Basic Python programming skills are expected.\n","Prior knowledge of standard NLP techniques (e.g. text tokenization and classification with ML) is beneficial but optional when working through the notebooks as they assume minimal prior knowledge.\n","\n","This tutorial combines detailed analysis and development of essential NLP concepts via custom (i.e. from scratch) implementations. Other necessary NLP components will be developed using PyTorch's NLP library implementations. As a result, the tutorial offers deep understanding and facilitates easy usage in future applications.\n","\n","## Outline\n","\n","* Part I: Introduction to Text Tokenization and Classification\n","  *  Text Classification: Simple Classifier\n","  *  Text Classification: Encoder-only Transformer\n","\n","* Part II: Introduction to Decoder-only Transformer and Sparse Mixture of Experts Architecture\n","  *  Text Generation: Decoder-only Transformer\n","  *  Text Generation: Decoder-only Transformer + MoE\n","\n","* Part III: Introduction to Parameter Efficient Fine-tuning\n","  *  Fine-tuning the full Pre-trained Models\n","  *  Fine-tuning using Low-Rank Adaptation of Large Language Models (LoRA)\n","\n","## Notation\n","\n","* Sections marked with [📚] contain cells that you should read, modify and complete to understand how your changes alter the obtained results.\n","* External resources are mentioned with [✨]. These provide valuable supplementary information for this tutorial and offer opportunities for further in-depth exploration of the topics covered.\n","\n","\n","## Libraries\n","\n","This tutorial leverages [PyTorch](https://pytorch.org/) for neural network implementation and training, complemented by standard Python libraries for data processing and the [Hugging Face](https://huggingface.co/) datasets library for accessing NLP resources.\n","\n","GPU access is recommended for optimal performance, particularly for model training and text generation. While all code can run on CPU, a CUDA-enabled environment will significantly speed up these processes.\n","\n","## Credits\n","\n","The tutorial is created by:\n","\n","* [Luca Herranz-Celotti](http://LuCeHe.github.io)\n","* [Georgios Peikos](https://www.linkedin.com/in/peikosgeorgios/)\n","\n","It is inspired by and synthesizes various online resources, which are cited throughout for reference and further reading.\n","\n","## Note for Colab users\n","\n","To grab a GPU (if available), make sure you go to `Edit -> Notebook settings` and choose a GPU under `Hardware accelerator`\n","\n"],"metadata":{"id":"F45HFdoiriet"}},{"cell_type":"markdown","source":["## Part III: Introduction to Parameter Efficient Fine-tuning\n","\n","We show how to adapt a model that has been pre-trained on a lot of data, can be adapted to be used in a downstream task, by fine-tuning it on a target dataset. The first idea could be to consider adapting all the weights of the network to the new task, but this can be resource intensive. This could lead us to decide that we can freeze all the weights, except the final output linear layer. We will see that this results in faster training, but also in worse performance on our target task. Finally we introduce a newer way of thinking, Parameter Efficient Fine-Tuning (PEFT) and one approach in that family, LoRA, that will provide us with a way to improve performance in a fine-tuning task, while being less resource intensive.\n"],"metadata":{"id":"GcFmNhCLZbtp"}},{"cell_type":"markdown","metadata":{"id":"Voy7Ivrr0MSi"},"source":["##Step 1: Load Packages\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"XNg1OQ7hUVQM","collapsed":true},"outputs":[],"source":["!pip install peft datasets evaluate"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"qFNm9eSAUD8l"},"outputs":[],"source":["import math\n","import torch\n","import torch.nn as nn\n","\n","from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSequenceClassification\n","from transformers.utils import PushToHubMixin\n","\n","from peft.tuners.lora.layer import dispatch_default, Linear\n","from peft.tuners.tuners_utils import BaseTunerLayer\n","from peft import LoraConfig, PeftModel, LoraModel, get_peft_model\n","from datasets import load_dataset\n","\n","import numpy as np\n","import evaluate\n","from transformers import TrainingArguments, Trainer"]},{"cell_type":"markdown","source":["We will fine-tune the ✨ [BERT](https://arxiv.org/pdf/1810.04805) architecture, a well known language classification architecture built based on the Transformer encoder. The pre-trained model is openly available at different sources. We will focus on the HuggingFace library, since it has become a standard for Large Language Models, and it includes a large number of convenient tools for language processing and generation."],"metadata":{"id":"PGW42rGoqypH"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"pdDrMfPbyUky"},"outputs":[],"source":["model_name_or_path = \"google-bert/bert-base-cased\"\n","tokenizer_name_or_path = \"google-bert/bert-base-cased\""]},{"cell_type":"markdown","metadata":{"id":"CA4zbqo40Qnj"},"source":["##📚 Step 2: Load Dataset"]},{"cell_type":"markdown","metadata":{"id":"WV8gPKOeyXi4"},"source":["Let's pick a dataset and use the tokenizer that corresponds to the BERT model. The ✨ [Yelp reviews dataset](https://huggingface.co/datasets/Yelp/yelp_review_full) consists of reviews from Yelp, and each review has a number of stars between one and five. The neural network will see the review at the input, and will have to predict the number of stars that correspond to that review.\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Q9WzUAxwyRyH","collapsed":true},"outputs":[],"source":["# EXERCISE: load the yelp_review_full dataset using load_dataset\n","dataset =\n","\n","print(dataset)\n","print(dataset[\"train\"][100])\n","\n","# EXERCISE: load the BERT tokenizer with AutoTokenizer\n","tokenizer =\n","\n","def tokenize_function(examples):\n","    # EXERCISE: pad to max length and truncate sentences\n","    return tokenizer(examples[\"text\"],\n","\n","tokenized_datasets = dataset.map(tokenize_function, batched=True)\n","small_train_dataset = tokenized_datasets[\"train\"].shuffle(seed=42).select(range(1000))\n","small_eval_dataset = tokenized_datasets[\"test\"].shuffle(seed=42).select(range(1000))"]},{"cell_type":"markdown","metadata":{"id":"wbhu7MQT01w_"},"source":["## 📚 Step 3: Define Training and Evaluation Loop\n","Let's standardize the training and evaluation loop, so we can better appreciate the difference in the final result between the three finetuning techniques explained."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"10j0UC3502Ul"},"outputs":[],"source":["def train_and_evaluate(model, max_steps=-1, num_train_epochs=2, learning_rate=5e-5):\n","    metric = evaluate.load(\"accuracy\")\n","\n","    def compute_metrics(eval_pred):\n","        logits, labels = eval_pred\n","        # EXERCISE: the greedy prediction is the argmax of the logits\n","        predictions =\n","        return metric.compute(predictions=predictions, references=labels)\n","\n","    training_args = TrainingArguments(\n","        output_dir=\"test_trainer\",\n","        num_train_epochs=num_train_epochs,\n","        max_steps=max_steps,\n","        learning_rate=learning_rate,\n","        label_names=[\"labels\"],\n","    )\n","\n","    trainer = Trainer(\n","        model=model,\n","        args=training_args,\n","        train_dataset=small_train_dataset,\n","        eval_dataset=small_eval_dataset,\n","        compute_metrics=compute_metrics,\n","    )\n","\n","    train_metrics = trainer.train()\n","    print(train_metrics)\n","    eval_metrics = trainer.evaluate()\n","    print(eval_metrics)"]},{"cell_type":"markdown","metadata":{"id":"_5bPfW1n1NOu"},"source":["We also introduce an auxiliary function to count the number of trainable parameters in each case."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"PE2Ro46X1Mob"},"outputs":[],"source":["def print_trainable_parameters(model):\n","    \"\"\"\n","    Prints the number of trainable parameters in the model.\n","    \"\"\"\n","    trainable_params = 0\n","    all_param = 0\n","    for _, param in model.named_parameters():\n","        all_param += param.numel()\n","        if param.requires_grad:\n","            trainable_params += param.numel()\n","    print(\n","        f\"trainable params: {trainable_params} || all params: {all_param} || trainable: {100 * trainable_params / all_param:.2f}%\"\n","    )"]},{"cell_type":"markdown","metadata":{"id":"nNfNGUSuyP_t"},"source":["## 📚 Step 4: Full Finetuning\n","\n","The simplest possibility is to fine-tune all the model, the pre-trained BERT, but also the new linear classifier on top. This might usually achieve the best final accuracy, but it results in slow fine-tuning. This is because all the matrices in the model have to be updated, which can be very large and consume a lot of memory."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"st1O7DuM1a2v"},"outputs":[],"source":["model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=5)\n","print_trainable_parameters(model)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"ijVpFeyR1eaq"},"outputs":[],"source":["# EXERCISE: explore learning rates in the set [5e-2, 5e-3, 5e-4, 5e-5] to find the best\n","# one with this configuration\n","train_and_evaluate(model, learning_rate="]},{"cell_type":"markdown","metadata":{"id":"AS2k9B120gOv"},"source":["## 📚 Step 5: Head Finetuning\n","\n","Another possibility is to fix the weights of the pre-trained BERT, and fine-tune only the head, the linear classifier that HuggingFace has placed on top when we say `num_labels=5`. This will drastically reduce the number of trainable parameters, and therefore it will significantly speed up fine-tuning."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"eJggghVeYPUj"},"outputs":[],"source":["model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=5)\n","\n","# EXERCISE: set as trainable only the parameters of the classifier\n","for name, param in model.named_parameters():\n","\n","print_trainable_parameters(model)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"nZsfd65FfBea"},"outputs":[],"source":["# EXERCISE: explore learning rates in the set [5e-2, 5e-3, 5e-4, 5e-5] to find the best\n","# one with this configuration\n","train_and_evaluate(model, learning_rate="]},{"cell_type":"markdown","metadata":{"id":"UiAx9Rb90idU"},"source":["## 📚 Step 6: LoRA Finetuning\n","\n","A newer line of research, called Parameter Efficient Fine-Tuning (PEFT) attempts to figure out different techniques to drastically reduce the number of parameters to fine-tune, and still achieve good performance. One of the most popular options is called ✨ [LoRA](https://arxiv.org/pdf/2106.09685), for Low-Rank adaptation of Language Models. It consists on constructing the new matrices as $\\theta = \\hat{\\theta} + A^TB$, where $\\theta$ is the new matrix of our model, the pre-trained weights $\\hat{\\theta}$ are kept fixed, and only an additive component made up by multiplying two smaller matrices $A,B$ is learned. This drastically reduces the number of parameters to train, if $A,B$ are chosen appropriately.\n","\n","<img src=\"https://heidloff.net/assets/img/2023/08/lora.png\" alt=\"drawing\" width=\"50%\"/>\n","\n","\n","The speed up is noticeable with BERT, and becomes more significant for larger models. The matrices $A,B$ have size $A,B\\in\t\\mathbb{R}^{r\\times N}$, where the size of the original matrix was $\\theta,\\hat{\\theta}\\in\t\\mathbb{R}^{N\\times N}$.\n","\n","Now, let's first define the hyper-parameters of our LoRA:"]},{"cell_type":"code","source":["peft_config = LoraConfig(\n","    r=16,\n","    lora_alpha=32,\n","    lora_dropout=0.05,\n","    bias=\"none\",\n","    task_type=\"CAUSAL_LM\",\n",")"],"metadata":{"id":"lC2BGL8Z7EX8"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Then let's define the LoRA layer itself."],"metadata":{"id":"J1hYxrb67KaV"}},{"cell_type":"code","source":["class CustomLinearLoRA(Linear):\n","    def update_layer(\n","            self, adapter_name, r, lora_alpha, lora_dropout, init_lora_weights, use_rslora=False, use_dora=False\n","    ):\n","        # This code works for linear layers, override for other layer types\n","        if r <= 0:\n","            raise ValueError(f\"`r` should be a positive integer value but the value passed is {r}\")\n","\n","        self.r[adapter_name] = r\n","        self.lora_alpha[adapter_name] = lora_alpha\n","\n","        # EXERCISE: define a dropout layer\n","        lora_dropout_layer =\n","\n","        self.lora_dropout.update(nn.ModuleDict({adapter_name: lora_dropout_layer}))\n","\n","        # Actual trainable parameters\n","        # EXERCISE: write a linear layer that goes from self.in_features to r\n","        # and without bias\n","        self.lora_A[adapter_name] =\n","        # EXERCISE: write a linear layer that goes from r to self.out_features\n","        # and without bias\n","        self.lora_B[adapter_name] =\n","\n","        self.scaling[adapter_name] = lora_alpha / r\n","\n","        self.reset_lora_parameters(adapter_name, init_lora_weights)\n","        self.set_adapter(self.active_adapters)\n","\n","    def forward(self, x, *args, **kwargs):\n","        result = self.base_layer(x, *args, **kwargs)\n","        torch_result_dtype = result.dtype\n","        for active_adapter in self.active_adapters:\n","            if active_adapter not in self.lora_A.keys():\n","                continue\n","            lora_A = self.lora_A[active_adapter]\n","            lora_B = self.lora_B[active_adapter]\n","            dropout = self.lora_dropout[active_adapter]\n","            scaling = self.scaling[active_adapter]\n","\n","            x = x.to(lora_A.weight.dtype)\n","\n","            x = dropout(x)\n","\n","            # EXERCISE: add to the result of the base layer, the output of\n","            # lora_B and lora_A and multiply by the scaling factor\n","            result = result +\n","\n","        result = result.to(torch_result_dtype)\n","\n","        return result"],"metadata":{"id":"e9BeyUqp7UGm"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Since we are using the HuggingFace PEFT library framework, we need to tweak some of its internal workings to be able to expose the LoRA layer above. Therefore the following cell is not very insightful to understand."],"metadata":{"id":"4h2IO1wO7g9j"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"JRP7Mq0L8mD8"},"outputs":[],"source":["def custom_dispatch_default(target: torch.nn.Module, adapter_name, lora_config, **kwargs):\n","    new_module = None\n","    target_base_layer = target.get_base_layer() if isinstance(target, BaseTunerLayer) else target\n","\n","    if isinstance(target_base_layer, torch.nn.Linear):\n","        kwargs.update(lora_config.loftq_config)\n","        new_module = CustomLinearLoRA(target, adapter_name, **kwargs)\n","\n","    if new_module is None:\n","        new_module = dispatch_default(target, adapter_name, lora_config=lora_config, **kwargs)\n","    return new_module\n","\n","class CustomLoraModel(LoraModel):\n","    @staticmethod\n","    def _create_new_module(lora_config, adapter_name, target, **kwargs):\n","        return custom_dispatch_default(target, adapter_name, lora_config=lora_config, **kwargs)\n","\n","class CustomPeftModel(PeftModel):\n","    def __init__(self, model, peft_config, adapter_name=\"default\"):\n","        PushToHubMixin.__init__(self)\n","        torch.nn.Module.__init__(self)\n","\n","        self.modules_to_save = None\n","        self.active_adapter = adapter_name\n","        self.peft_type = peft_config.peft_type\n","        # These args are special PEFT arguments that users can pass. They need to be removed before passing them to\n","        # forward.\n","        self.special_peft_forward_args = {\"adapter_names\"}\n","\n","        self._is_prompt_learning = peft_config.is_prompt_learning\n","        self._peft_config = None\n","        self.base_model = CustomLoraModel(model, {adapter_name: peft_config}, adapter_name)\n","\n","        self.set_additional_trainable_modules(peft_config, adapter_name)\n","\n","        if getattr(model, \"is_gradient_checkpointing\", True):\n","            model = self._prepare_model_for_gradient_checkpointing(model)\n","\n","        # the `pretraining_tp` is set for some models to simulate Tensor Parallelism during inference to avoid\n","        # numerical differences, https://github.com/pytorch/pytorch/issues/76232 - to avoid any unexpected\n","        # behavior we disable that in this line.\n","        if hasattr(self.base_model, \"config\") and hasattr(self.base_model.config, \"pretraining_tp\"):\n","            self.base_model.config.pretraining_tp = 1"]},{"cell_type":"markdown","source":["Now we have everything we need to fine-tune BERT with LoRA. We load again the model, we upgrade it with LoRA, we count the trainable parameters and let's see what happens when we fine-tune it."],"metadata":{"id":"ojIzZpQl8Y3s"}},{"cell_type":"code","source":["model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=5)\n","\n","model = CustomPeftModel(model, peft_config)\n","print_trainable_parameters(model)"],"metadata":{"id":"7XnFXXab7q7J"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Xxe5mXn3yiOK"},"outputs":[],"source":["# EXERCISE: explore learning rates in the set [5e-2, 5e-3, 5e-4, 5e-5] to find the best\n","# one with this configuration\n","train_and_evaluate(model, learning_rate="]},{"cell_type":"markdown","source":["As you see, LoRA was faster than full fine-tuning, with a better final performance than just updating the last linear layer."],"metadata":{"id":"aF4F0PxD7v8Q"}},{"cell_type":"markdown","source":["## ✨ Resources used for this tutorial and references\n","- [LORA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685)\n","- [DoRA: Weight-Decomposed Low-Rank Adaptation](https://arxiv.org/pdf/2402.09353)\n","- [HuggingFace PEFT Tutorial](https://huggingface.co/blog/peft)\n","- [HuggingFace PEFT Tutorial for image classification](https://huggingface.co/docs/peft/main/en/task_guides/image_classification_lora)\n","- [HuggingFace Training Tutorial](https://huggingface.co/docs/transformers/training)\n"],"metadata":{"id":"5B2EVpV4_8q6"}}],"metadata":{"colab":{"provenance":[]},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":0}


--------------------------------------------------------------------------------
/1_nlp/part_II_text_generation/Transformer_Decoder_MoE.ipynb:
--------------------------------------------------------------------------------
1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"gpuType":"T4","machine_shape":"hm","toc_visible":true},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","source":["Natural Language Processing Tutorial\n","======\n","\n","This is the tutorial of the 2024 [Mediterranean Machine Learning Summer School](https://www.m2lschool.org/) on Natural Language Processing!\n","\n","This tutorial will explore the fundamental aspects of Natural Language Processing (NLP). Basic Python programming skills are expected.\n","Prior knowledge of standard NLP techniques (e.g. text tokenization and classification with ML) is beneficial but optional when working through the notebooks as they assume minimal prior knowledge.\n","\n","This tutorial combines detailed analysis and development of essential NLP concepts via custom (i.e. from scratch) implementations. Other necessary NLP components will be developed using PyTorch's NLP library implementations. As a result, the tutorial offers deep understanding and facilitates easy usage in future applications.\n","\n","## Outline\n","\n","* Part I: Introduction to Text Tokenization and Classification\n","  *  Text Classification: Simple Classifier\n","  *  Text Classification: Encoder-only Transformer\n","\n","* Part II: Introduction to Decoder-only Transformer and Sparse Mixture of Experts Architecture\n","  *  Text Generation: Decoder-only Transformer\n","  *  Text Generation: Decoder-only Transformer + MoE\n","\n","* Part III: Introduction to Parameter Efficient Fine-tuning\n","  *  Fine-tuning the full Pre-trained Models\n","  *  Fine-tuning using Low-Rank Adaptation of Large Language Models (LoRA)\n","\n","## Notation\n","\n","* Sections marked as [📝] contain cells with missing code that you should complete.\n","* Sections marked with [📚] contain cells that you should read and modify to understand how your changes alter the obtained results.\n","* External resources are mentioned with [✨]. These provide valuable supplementary information for this tutorial and offer opportunities for further in-depth exploration of the topics covered.\n","* Sections that contain code that test the functionality of other sections are marked with [✍]. You are more that welcome to modify these sections so that you can understand code functionality.\n","\n","\n","## Libraries\n","\n","This tutorial leverages [PyTorch](https://pytorch.org/) for neural network implementation and training, complemented by standard Python libraries for data processing and the [Hugging Face](https://huggingface.co/) datasets library for accessing NLP resources.\n","\n","GPU access is recommended for optimal performance, particularly for model training and text generation. While all code can run on CPU, a CUDA-enabled environment will significantly speed up these processes.\n","\n","## Credits\n","\n","The tutorial is created by:\n","\n","* [Georgios Peikos](https://www.linkedin.com/in/peikosgeorgios/)\n","* [Luca Herranz-Celotti](http://LuCeHe.github.io)\n","\n","It is inspired by and synthesizes various online resources, which are cited throughout for reference and further reading.\n","\n","## Note for Colab users\n","\n","To grab a GPU (if available), make sure you go to `Edit -> Notebook settings` and choose a GPU under `Hardware accelerator`\n","\n"],"metadata":{"id":"KBrDjSR61FHy"}},{"cell_type":"markdown","source":["\n","\n","---\n","\n"],"metadata":{"id":"nhQDxGSyOQh3"}},{"cell_type":"markdown","source":["# Part II: Introduction to the decoder-only Transformers architecture and Sparse Mixture of Expert\n","\n","We create a decoder-only Transformer architecture from the bottom up, including a custom text tokenizer and an efficient dataset handler. We will explore all essential components of this architecture, train the model, and show its capabilities in text generation.\n","\n","Then, we will enhance our base model by incorporating a gating function and implementing a sparse mixture of experts."],"metadata":{"id":"cCNKdaEAOFeJ"}},{"cell_type":"markdown","source":["\n","\n","---\n","\n"],"metadata":{"id":"7wFJ2UqTOcpT"}},{"cell_type":"markdown","source":["# Decoder-only Transformer Architecture\n","\n","The decoder-only transformer architecture consists of multiple identical blocks stacked sequentially. Each block is composed of two main elements:\n","- A masked multi-head self-attention mechanism.\n","- A feed-forward neural network.\n","\n","These components are typically encapsulated within residual connections and layer normalization. In this section, we will explore the internal structure of these blocks in greater depth and provide a practical PyTorch implementation."],"metadata":{"id":"bI8usToL1SAM"}},{"cell_type":"markdown","source":["![Decoder Only Architecture](https://drive.google.com/uc?id=1ksROxQxf3b7dlBUoIQggzyLeBaPO-AQn)\n","\n","\n"],"metadata":{"id":"V1Z2GWVTShRR"}},{"cell_type":"markdown","source":["\n","\n","---\n","\n"],"metadata":{"id":"UJPlhQquMr9W"}},{"cell_type":"markdown","source":["## Importing Libraries"],"metadata":{"id":"cSQDU10O6YUC"}},{"cell_type":"code","source":["!pip install datasets\n","import torch\n","import torch.nn as nn\n","import torch.optim as optim\n","from torch.utils.data import Dataset, DataLoader\n","import math\n","from collections import Counter\n","from typing import List, Tuple, Union\n","from datasets import load_dataset\n","import torch.nn.functional as F\n","\n","device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n","print(f\"Using device: {device}\")"],"metadata":{"id":"E532K_c76aIB"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📚 Text Tokenization from Scratch\n","\n","Tokenization is a fundamental step in NLP that converts raw text into a format that systems can understand and process. It enables the transformation of variable-length text sequences into fixed-size numerical representations, which is crucial for input to neural network models.\n","\n","Here, we create a simple text tokenizer for basic word-level tokenization tasks.\n","The tokenizer can be improved so that:\n","\n","1.   Methods for handling very large vocabularies (e.g., frequency thresholding)\n","2.   Support for n-grams or phrase detection\n","3.   Handle punctuation. For instance now, tokens like \"word.\" and \"word\" are being treated differently.\n","\n","Also, we create a testing function to showcase the codes behavior.\n","\n","**✨ Additional Resources:**\n","\n","*   Overview of hugging Face tokenizers [Link-huggingface](https://huggingface.co/docs/transformers/en/tokenizer_summary)"],"metadata":{"id":"HNcnHkLd6Vab"}},{"cell_type":"code","source":["class SimpleTokenizer:\n","    def __init__(self):\n","        \"\"\"Initialize the tokenizer with special tokens and prepare vocabulary structures.\"\"\"\n","        # Special tokens are used for various purposes in NLP tasks:\n","        # <PAD>: Used for padding sequences to a fixed length\n","        # <UNK>: Represents unknown words not in the vocabulary\n","        # <SOS>: Marks the start of a sequence\n","        # <EOS>: Marks the end of a sequence\n","        self.special_tokens = [\"<PAD>\", \"<UNK>\", \"<SOS>\", \"<EOS>\"]\n","\n","        # word_to_idx: Maps words to unique integer indices\n","        # This is crucial for converting text into a format that neural networks can process\n","        self.word_to_idx = {token: idx for idx, token in enumerate(self.special_tokens)}\n","\n","        # idx_to_word: The reverse mapping of word_to_idx\n","        # This is used for converting model outputs back into readable text\n","        self.idx_to_word = {idx: token for idx, token in enumerate(self.special_tokens)}\n","\n","        # Counter object to keep track of word frequencies in the corpus\n","        self.word_count = Counter()\n","\n","    def fit(self, texts: List[str]) -> None:\n","        \"\"\"Build the vocabulary from a list of texts.\"\"\"\n","        # Count the frequency of each word in the entire corpus\n","        for text in texts:\n","            self.word_count.update(text.split())\n","\n","        # Add each unique word to the vocabulary\n","        # We assign a unique index to each word, which the model will use to represent words\n","        for word in self.word_count:\n","            if word not in self.word_to_idx:\n","                idx = len(self.word_to_idx)\n","                self.word_to_idx[word] = idx\n","                self.idx_to_word[idx] = word\n","\n","    def encode(self, text: str) -> List[int]:\n","        \"\"\"Convert a text string to a list of indices.\"\"\"\n","        # This method is used to prepare input for the model\n","        # It converts each word to its corresponding index\n","        # If a word is not in the vocabulary, it uses the <UNK> token\n","        return [self.word_to_idx.get(word, self.word_to_idx[\"<UNK>\"]) for word in text.split()]\n","\n","    def decode(self, indices: List[int]) -> str:\n","        \"\"\"Convert a list of indices back to a text string.\"\"\"\n","        # This method is used to convert model output back into readable text\n","        # It maps each index back to its corresponding word\n","        return \" \".join([self.idx_to_word.get(idx, \"<UNK>\") for idx in indices])\n","\n","    def encode_batch(self, texts: List[str]) -> List[List[int]]:\n","        \"\"\"Convert a batch of text strings to lists of indices.\"\"\"\n","        return [self.encode(text) for text in texts]\n","\n","    def decode_batch(self, batch_indices: List[List[int]]) -> List[str]:\n","        \"\"\"Convert a batch of lists of indices back to text strings.\"\"\"\n","        return [self.decode(indices) for indices in batch_indices]\n","\n","    def show_vocab(self):\n","        \"\"\"Display the vocabulary.\"\"\"\n","        # Useful for debugging and understanding the tokenizer's state\n","        print(\"Vocabulary:\")\n","        for word, idx in self.word_to_idx.items():\n","            print(f\"{word}: {idx}\")\n","\n","    def __len__(self):\n","        \"\"\"Return the size of the vocabulary.\"\"\"\n","        # The vocabulary size is an important parameter for the model\n","        # It determines the dimensionality of the model's output layer\n","        return len(self.word_to_idx)"],"metadata":{"id":"mZ6q24hC1Q2J"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### ✍ Testing the Tokenizer\n","\n","This testing function shows examples of text tokenization presenting also extreme use cases."],"metadata":{"id":"V8FKgjcjZyj2"}},{"cell_type":"code","source":["def test_tokenizer():\n","    print(\"\\nTesting SimpleTokenizer\")\n","    print(\"=\" * 30)\n","\n","    # Sample texts\n","    texts = [\n","        \"The quick brown fox jumps over the lazy dog.\",\n","        \"Pack my box with five dozen liquor jugs!\",\n","        \"How vexingly quick daft zebras jump!\",\n","        \"This is a sentence with some punctuation, including commas.\",\n","        \"This text contains an unknown word: monkey\",\n","        \"\"  # Empty string to test edge case\n","    ]\n","\n","    # Initialize and fit the tokenizer\n","    tokenizer = SimpleTokenizer()\n","    tokenizer.fit(texts)\n","\n","    # Display vocabulary\n","    tokenizer.show_vocab()\n","    print(f\"\\nVocabulary size: {len(tokenizer)}\")\n","\n","    # Test encoding and decoding\n","    print(\"\\nEncoding and Decoding Test:\")\n","    for text in texts:\n","        encoded = tokenizer.encode(text)\n","        decoded = tokenizer.decode(encoded)\n","        print(f\"\\nOriginal: {text}\")\n","        print(f\"Encoded : {encoded}\")\n","        print(f\"Decoded : {decoded}\")\n","        print(f\"Match   : {'✓' if text.strip().lower() == decoded.strip().lower() else '✗'}\")\n","\n","    # Test unknown word handling\n","    print(\"\\nUnknown Word Handling Test:\")\n","    unknown_text = \"This text contains an unknown word: xylophone\"\n","    encoded_unknown = tokenizer.encode(unknown_text)\n","    decoded_unknown = tokenizer.decode(encoded_unknown)\n","    print(f\"Original: {unknown_text}\")\n","    print(f\"Encoded : {encoded_unknown}\")\n","    print(f\"Decoded : {decoded_unknown}\")\n","\n","    # Test special tokens\n","    print(\"\\nSpecial Tokens Test:\")\n","    special_text = \"< SOS > This is a test sentence <EOS>\"\n","    encoded_special = tokenizer.encode(special_text)\n","    decoded_special = tokenizer.decode(encoded_special)\n","    print(f\"Original: {special_text}\")\n","    print(f\"Encoded : {encoded_special}\")\n","    print(f\"Decoded : {decoded_special}\")\n","\n","    # Test case sensitivity\n","    print(\"\\nCase Sensitivity Test:\")\n","    case_text = \"The Quick Brown Fox\"\n","    encoded_case = tokenizer.encode(case_text)\n","    decoded_case = tokenizer.decode(encoded_case)\n","    print(f\"Original: {case_text}\")\n","    print(f\"Encoded : {encoded_case}\")\n","    print(f\"Decoded : {decoded_case}\")\n","\n","print(\"\\nChecking the tokenizer's outputs\")\n","test_tokenizer()"],"metadata":{"collapsed":true,"id":"tc3YteacXJOs"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📚 TextDataset: Efficient Text Processing\n","\n","The TextDataset class is a crucial component in preparing text data for deep learning models, implementing a sliding window approach that allows processing of variable-length texts while maintaining context.\n","\n","This class bridges the gap between raw text data and the input requirements of neural networks, handling tasks such as tokenization, padding, and attention mask generation, which are essential for training effective sequence models like Transformers.\n","\n","**✨ Additional Resources:**\n","\n","*   Padding and truncation [Link-huggingface](https://huggingface.co/docs/transformers/en/pad_truncation)\n"],"metadata":{"id":"JS3le4cC6do5"}},{"cell_type":"code","source":["class TextDataset(Dataset):\n","    def __init__(self, texts: List[str], tokenizer: SimpleTokenizer, max_length: int, overlap: int = 50):\n","        \"\"\"\n","        Initialize the TextDataset with sliding window functionality.\n","\n","        Args:\n","            texts (List[str]): List of input texts.\n","            tokenizer (SimpleTokenizer): Tokenizer object for encoding texts.\n","            max_length (int): Maximum length of encoded sequences.\n","            overlap (int): Number of overlapping tokens between windows.\n","        \"\"\"\n","        self.tokenizer = tokenizer\n","        self.max_length = max_length\n","        self.overlap = overlap\n","        self.data = []\n","        self.attention_masks = []\n","        self.document_map = []  # Maps each window to its original document\n","        self.original_texts = texts  # Store original texts\n","\n","        for doc_idx, text in enumerate(texts):\n","            tokens = self.tokenizer.encode(text)\n","            windows = self.create_sliding_windows(tokens)\n","\n","            for window in windows:\n","                attention_mask = [1] * len(window)  # 1 for real tokens\n","\n","                # Pad if necessary\n","                if len(window) < max_length:\n","                    padding_length = max_length - len(window)\n","                    window = window + [self.tokenizer.word_to_idx[\"<PAD>\"]] * padding_length\n","                    attention_mask = attention_mask + [0] * padding_length  # 0 for padding in attention mask\n","\n","                self.data.append(window)\n","                self.attention_masks.append(attention_mask)\n","                self.document_map.append(doc_idx)\n","\n","    def create_sliding_windows(self, tokens: List[int]) -> List[List[int]]:\n","        \"\"\"\n","        Create sliding windows from a list of tokens.\n","\n","        Args:\n","            tokens (List[int]): List of token ids.\n","\n","        Returns:\n","            List[List[int]]: List of token windows.\n","        \"\"\"\n","        windows = []\n","        # Calculate stride: how many tokens to move for each new window\n","        # -1 accounts for the added <SOS> token at the start of each window\n","        stride = self.max_length - self.overlap - 1\n","\n","        for start in range(0, len(tokens), stride):\n","            # Create a window starting with <SOS> token\n","            window = [self.tokenizer.word_to_idx[\"<SOS>\"]] + tokens[start:start + self.max_length - 1]\n","            if len(window) < self.max_length:\n","                # This is the last window, add <EOS> token\n","                window.append(self.tokenizer.word_to_idx[\"<EOS>\"])\n","            windows.append(window)\n","\n","        return windows\n","\n","    def get_original_document(self, doc_idx: int) -> str:\n","        \"\"\"Retrieve the original document text.\"\"\"\n","        if 0 <= doc_idx < len(self.original_texts):\n","            return self.original_texts[doc_idx]\n","        else:\n","            raise IndexError(f\"Document index {doc_idx} is out of range.\")\n","\n","    def get_document_length(self, doc_idx: int) -> int:\n","        \"\"\"Get the number of tokens in the original document.\"\"\"\n","        if 0 <= doc_idx < len(self.original_texts):\n","            return len(self.tokenizer.encode(self.original_texts[doc_idx]))\n","        else:\n","            raise IndexError(f\"Document index {doc_idx} is out of range.\")\n","\n","    def window_to_document_position(self, window_idx: int, token_idx: int) -> Tuple[int, int]:\n","        \"\"\"Map a position in a window back to its position in the original document.\"\"\"\n","        if 0 <= window_idx < len(self.data):\n","            doc_idx = self.document_map[window_idx]\n","            doc_windows = self.get_document_windows(doc_idx)\n","            # Find which window of the document this is\n","            relative_window_idx = doc_windows.index(window_idx)\n","            # Calculate the start position of this window in the document\n","            window_start = relative_window_idx * (self.max_length - self.overlap - 1)\n","            # -1 to account for <SOS> token at the start of each window\n","            return doc_idx, window_start + token_idx - 1\n","        else:\n","            raise IndexError(f\"Window index {window_idx} is out of range.\")\n","\n","    def __len__(self) -> int:\n","        \"\"\"Get the number of windows in the dataset.\"\"\"\n","        return len(self.data)\n","\n","    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor, int]:\n","        \"\"\"\n","        Get a sample from the dataset.\n","\n","        Args:\n","            idx (int): Index of the sample.\n","\n","        Returns:\n","            Tuple[torch.Tensor, torch.Tensor, int]:\n","                A tuple containing (token_ids, attention_mask, document_index).\n","        \"\"\"\n","        if 0 <= idx < len(self.data):\n","            # Add an extra dimension to make it batch-first (batch_size=1)\n","            return (torch.tensor(self.data[idx]).unsqueeze(0),\n","                    torch.tensor(self.attention_masks[idx]).unsqueeze(0),\n","                    self.document_map[idx])\n","        else:\n","            raise IndexError(f\"Index {idx} is out of range.\")\n","\n","\n","    def get_document_windows(self, doc_idx: int) -> List[int]:\n","        \"\"\"\n","        Get all window indices for a specific document.\n","\n","        Args:\n","            doc_idx (int): Index of the document.\n","\n","        Returns:\n","            List[int]: List of window indices belonging to the document.\n","        \"\"\"\n","        return [i for i, doc in enumerate(self.document_map) if doc == doc_idx]"],"metadata":{"id":"CeBCUMY16jms"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### ✍ Testing the Dataset Processing\n","\n","This testing function shows how the TextDataset and SimpleTokenizer classes work together."],"metadata":{"id":"QMa3_yaQgAYf"}},{"cell_type":"code","source":["def test_sliding_window_dataset():\n","    print(\"\\n--- Testing Sliding Window Dataset ---\\n\")\n","\n","    texts = [\n","        \"This is a short sentence.\",\n","        \"This is a much longer sentence that will be split into multiple windows to demonstrate the sliding window approach. It contains enough tokens to create at least two or three windows depending on the chosen maximum length and overlap.\",\n","        \"Another sentence of medium length that might create two windows.\",\n","        \"\",  # Empty text to test edge case\n","        \"Short.\"  # Very short text to test edge case\n","    ]\n","\n","    try:\n","        tokenizer = SimpleTokenizer()\n","        tokenizer.fit(texts)\n","\n","        max_length = 16\n","        overlap = 5\n","        dataset = TextDataset(texts, tokenizer, max_length, overlap)\n","\n","        print(f\"Dataset configuration:\")\n","        print(f\"  Max length: {max_length}\")\n","        print(f\"  Overlap: {overlap}\")\n","        print(f\"  Total windows: {len(dataset)}\")\n","        print(f\"  Vocabulary size: {len(tokenizer)}\\n\")\n","\n","        for doc_idx, text in enumerate(texts):\n","            print(f\"Document {doc_idx}:\")\n","            print(f\"  Original text: '{text}'\")\n","            print(f\"  Original length: {len(text.split())}\")\n","\n","            window_indices = dataset.get_document_windows(doc_idx)\n","            print(f\"  Number of windows: {len(window_indices)}\")\n","\n","            for i, window_idx in enumerate(window_indices):\n","                tokens, attention_mask, _ = dataset[window_idx]\n","                # Remove the batch dimension for decoding\n","                decoded = tokenizer.decode(tokens.squeeze(0).tolist())\n","                print(f\"\\n    Window {i}:\")\n","                print(f\"    Tokens shape: {tokens.shape}\")\n","                print(f\"    Tokens: {tokens.squeeze(0).tolist()}\")\n","                print(f\"    Attention mask shape: {attention_mask.shape}\")\n","                print(f\"    Attention mask: {attention_mask.squeeze(0).tolist()}\")\n","                print(f\"    Decoded: '{decoded}'\")\n","                print(f\"    Window length: {tokens.size(1)}\")  # Use size(1) for sequence length\n","\n","            print(\"\\n\" + \"-\"*50)\n","\n","        tokenizer.show_vocab()\n","\n","    except Exception as e:\n","        print(f\"An error occurred: {str(e)}\")\n","\n","    print(\"\\n--- End of Test ---\")\n","\n","# Run the test\n","test_sliding_window_dataset()"],"metadata":{"collapsed":true,"id":"AwEtTxasgHC4"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📝 Positional Encoding\n","\n","Positional Encoding adds information about the position of each token in the sequence. This is necessary because the self-attention mechanism in Transformers doesn't inherently have a notion of token order.\n","\n","\n","\\begin{equation}\n","PE_{(pos, 2i)} = \\sin\\left(\\frac{pos}{10000^{\\frac{2i}{d_{model}}}}\\right)\n","\\end{equation}\n","\n","\\begin{equation}\n","PE_{(pos, 2i+1)} = \\cos\\left(\\frac{pos}{10000^{\\frac{2i}{d_{model}}}}\\right)\n","\\end{equation}\n","\n","\n","\n","**✨ Additional Resources:**\n","\n","*   Transformer Architecture: The Positional Encoding [Link-kazemnejad](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)\n","\n","*   Positional Encoding in Transformers [Link-geeksforgeeks](https://www.geeksforgeeks.org/positional-encoding-in-transformers/)\n","\n","\n"],"metadata":{"id":"BHKvbddC6lpB"}},{"cell_type":"code","source":["class PositionalEncoding(nn.Module):\n","\n","    def __init__(self, d_model, max_len=5000):\n","        \"\"\"\n","        Inputs\n","            d_model - Hidden dimensionality of the input.\n","            max_len - Maximum length of a sequence to expect.\n","        \"\"\"\n","        super().__init__()\n","\n","        #######################Code Here###############################\n","\n","        # Create matrix of [SeqLen, HiddenDim] representing the positional encoding for max_len inputs\n","        ## pe = ...\n","\n","        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)\n","\n","        # Compute the `div_term` for the frequency component using the formula involving d_model\n","        ## div_term =\n","\n","\n","        # Populate the even indices (0, 2, 4, ...) of `pe` with the sine of the position multiplied by the `div_term`\n","        ## pe[:, 0::2] =\n","\n","\n","        # Populate the odd indices (1, 3, 5, ...) of `pe` with the cosine of the position multiplied by the `div_term`\n","\n","\n","        # Add an *extra dimension* to `pe` to fit the batch dimension requirements\n","\n","\n","        # Register `pe` as a buffer to ensure it is not considered a model parameter but is still moved to the appropriate device when the model is moved\n","\n","\n","        ###############################################################\n","\n","    def forward(self, x):\n","        x = x + self.pe[:, :x.size(1)]\n","        return x"],"metadata":{"id":"9owPpdm-neKA"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### ✍ Testing your Positional Encoding Function"],"metadata":{"id":"mLYaklgz6bq5"}},{"cell_type":"code","source":["def test_positional_encoding_consistency():\n","    # Test parameters\n","    d_model = 4\n","    max_len = 10\n","    batch_size = 2\n","    seq_len = 3\n","\n","    # Initialize a new PositionalEncoding module each time\n","    pe = PositionalEncoding(d_model, max_len)\n","\n","    # Create a small input tensor\n","    x1 = torch.zeros(batch_size, seq_len, d_model)\n","\n","    # Run the positional encoding\n","    output = pe(x1)\n","\n","    print(\"Input:\")\n","    print(x1)\n","    print(\"\\nOutput:\")\n","    print(output)\n","\n","    print(\"\"\"\\nNote: If your\n","    Output:\n","        tensor([[[ 0.0000,  1.0000,  0.0000,  1.0000],\n","                [ 0.8415,  0.5403,  0.0100,  0.9999],\n","                [ 0.9093, -0.4161,  0.0200,  0.9998]],\n","\n","                [[ 0.0000,  1.0000,  0.0000,  1.0000],\n","                [ 0.8415,  0.5403,  0.0100,  0.9999],\n","                [ 0.9093, -0.4161,  0.0200,  0.9998]]])\n","             , the PositionalEncoding is consistent.\"\"\")\n","\n","test_positional_encoding_consistency()"],"metadata":{"id":"LfSwCfMy6gZH"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### ✍ Testing the Positional Encoding for Text"],"metadata":{"id":"vcOi9vZdmc1t"}},{"cell_type":"code","source":["def test_positional_encoding_with_dataset():\n","    print(\"\\n--- Testing Positional Encoding with Dataset ---\\n\")\n","\n","    # Set a fixed seed for reproducibility\n","    torch.manual_seed(42)\n","\n","    # Sample texts\n","    texts = [\n","        \"This is a short sentence.\",\n","        \"This is a much longer sentence that will be split into multiple windows. Observe the term overlapping?\",\n","        \"Another sentence of medium length.\"\n","    ]\n","\n","    # Initialize tokenizer and fit it to the texts\n","    tokenizer = SimpleTokenizer()\n","    tokenizer.fit(texts)\n","\n","    # Create dataset\n","    max_length = 10\n","    overlap = 2\n","    dataset = TextDataset(texts, tokenizer, max_length, overlap)\n","\n","    # Initialize positional encoding\n","    d_model = 16  # Small dimension for demonstration\n","    pos_encoder = PositionalEncoding(d_model, max_length)\n","\n","    print(f\"Dataset configuration:\")\n","    print(f\"  Max length: {max_length}\")\n","    print(f\"  Overlap: {overlap}\")\n","    print(f\"  Total windows: {len(dataset)}\")\n","    print(f\"  Vocabulary size: {len(tokenizer)}\")\n","    print(f\"  Embedding dimension: {d_model}\\n\")\n","\n","    # Process each window through the positional encoding\n","    for i in range(len(dataset)):\n","        tokens, attention_mask, doc_idx = dataset[i]\n","\n","        # Convert tokens to \"embeddings\" (just for demonstration)\n","        pseudo_embeddings = torch.rand(1, tokens.size(1), d_model)  # (batch_size, seq_len, d_model)\n","\n","        # Apply positional encoding\n","        encoded = pos_encoder(pseudo_embeddings)\n","\n","        print(f\"Window {i} (from document {doc_idx}):\")\n","        print(f\"  Original tokens: {tokens.squeeze(0).tolist()}\")\n","        print(f\"  Attention mask: {attention_mask.squeeze(0).tolist()}\")\n","        print(f\"  Decoded: '{tokenizer.decode(tokens.squeeze(0).tolist())}'\")\n","        print(f\"  Shape after positional encoding: {encoded.shape}\")\n","\n","        # Display the positional encoding effect for all tokens\n","        print(f\"  Positional encoding effect:\")\n","        for j in range(tokens.size(1)):\n","            if attention_mask[0, j] == 1:  # Only show for non-padding tokens\n","                print(f\"    Token {j}:\")\n","                print(f\"      Before: {pseudo_embeddings[0, j, :].tolist()}\")\n","                print(f\"      After:  {encoded[0, j, :].tolist()}\")\n","\n","        print()\n","\n","    print(\"--- End of Test ---\")\n","\n","# Run the test\n","test_positional_encoding_with_dataset()"],"metadata":{"id":"jOzH22PBmfyG"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📝 Masked Multihead Attention Mechanism\n","\n","As you have implemented the Attention Mechanism in Part I, here you will have to use its ready implementation from Pytorch.\n","\n","Masked Attention mechanism allows the transformer model to focus on relevant parts of the input sequence while preventing information leakage from future tokens during sequential processing (i.e. we use the term Masked).\n","\n","Please, visit [Link-pytorch](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) for a detailed description of the MultiheadAttention function in PyTorch.\n","\n","**✨ Additional Resources:**\n","\n","*   Multi-head Attention, deep dive [Link-towardsdatascience](https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853)\n","* Attention Is All You Need (original Transformer paper) [Link-ArXiv](https://arxiv.org/abs/1706.03762)\n","\n","* A visual explanation of the attention mechanism [Link-youtube](https://www.youtube.com/watch?v=bCz4OMemCcA&t=1208s&ab_channel=UmarJamil)"],"metadata":{"id":"i_d5KqeSAdiv"}},{"cell_type":"code","source":["class MaskedAttention(nn.Module):\n","    def __init__(self, d_model: int, nhead: int, dropout: float = 0.1):\n","        super().__init__()\n","\n","        #######################Code Here###############################\n","\n","        # Initialize the MultiheadAttention layer using its PyTorch implementation\n","\n","        # Initialize the parameters\n","\n","        ###############################################################\n","\n","\n","    def generate_square_subsequent_mask(self, sz: int) -> torch.Tensor:\n","\n","        #######################Code Here###############################\n","        # Create a triangular mask tensor in PyTorch where the lower triangle (including the diagonal) is\n","        # filled with ones and the upper triangle is filled with zeros. The tensor should be of size (sz, sz),\n","        # where sz is a given integer.\n","        ##############################################################\n","\n","        # The masked positions are filled with float('-inf'). Unmasked positions are filled with float(0.0). See: https://pytorch.org/docs/stable/_modules/torch/nn/modules/transformer.html#Transformer\n","        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))\n","        return mask\n","\n","    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:\n","        seq_len = x.size(1)\n","        attn_mask = self.generate_square_subsequent_mask(seq_len).to(x.device)\n","        output, _ = self.multihead_attn(x, x, x, attn_mask=attn_mask)\n","        return output, attn_mask"],"metadata":{"id":"oDqt8XggAlFe"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### ✍ Testing the Multihead Attention"],"metadata":{"id":"MAMRpHUKwGKP"}},{"cell_type":"code","source":["def test_expanded_transformer_components():\n","    print(\"\\n--- Testing Expanded Transformer Components ---\\n\")\n","\n","    # Set a fixed seed for reproducibility\n","    torch.manual_seed(42)\n","\n","    # Sample texts\n","    texts = [\n","        \"This is a short sentence.\",\n","        \"This is a much longer sentence that will be split into multiple windows. Observe the term overlapping?\",\n","        \"Another sentence of medium length.\"\n","    ]\n","\n","    # Initialize tokenizer and fit it to the texts\n","    tokenizer = SimpleTokenizer()\n","    tokenizer.fit(texts)\n","\n","    # Create dataset\n","    max_length = 10\n","    overlap = 2\n","    dataset = TextDataset(texts, tokenizer, max_length, overlap)\n","\n","    # Hyperparameters\n","    d_model = 16  # Small dimension for demonstration\n","    nhead = 2\n","\n","    # Initialize components\n","    pos_encoder = PositionalEncoding(d_model, max_length)\n","    masked_self_attn = MaskedAttention(d_model, nhead)\n","\n","    print(f\"Dataset configuration:\")\n","    print(f\"  Max length: {max_length}\")\n","    print(f\"  Overlap: {overlap}\")\n","    print(f\"  Total windows: {len(dataset)}\")\n","    print(f\"  Vocabulary size: {len(tokenizer)}\")\n","    print(f\"  Embedding dimension: {d_model}\")\n","    print(f\"  Number of attention heads: {nhead}\\n\")\n","\n","    # Process each window\n","    for i in range(len(dataset)):\n","        tokens, attention_mask, doc_idx = dataset[i]\n","\n","        print(f\"Window {i} (from document {doc_idx}):\")\n","        print(f\"  Original tokens: {tokens.squeeze(0).tolist()}\")\n","        print(f\"  Attention mask: {attention_mask.squeeze(0).tolist()}\")\n","        print(f\"  Decoded: '{tokenizer.decode(tokens.squeeze(0).tolist())}'\")\n","\n","        # Convert tokens to \"embeddings\" (just for demonstration)\n","        pseudo_embeddings = torch.rand(1, tokens.size(1), d_model)  # (batch_size, seq_len, d_model)\n","        print(f\"  Shape of pseudo embeddings: {pseudo_embeddings.shape}\")\n","\n","        # Apply positional encoding\n","        pos_encoded = pos_encoder(pseudo_embeddings)\n","        print(f\"  Shape after positional encoding: {pos_encoded.shape}\")\n","\n","        # Apply masked self-attention\n","        attn_output, attn_mask = masked_self_attn(pos_encoded)\n","        print(f\"  Shape after masked self-attention: {attn_output.shape}\")\n","\n","        # Display the effect of positional encoding and attention for all tokens\n","        print(f\"  Transformer effect on tokens:\")\n","        for j in range(tokens.size(1)):\n","            if attention_mask[0, j] == 1:  # Only show for non-padding tokens\n","                print(f\"    Token {j}:\")\n","                print(f\"      Initial:   {pseudo_embeddings[0, j, :5].tolist()}\")\n","                print(f\"      Positional:{pos_encoded[0, j, :5].tolist()}\")\n","                print(f\"      Attention Mask: {attn_mask[j, :5].tolist()}\")  # Show first 5 values of attention mask\n","                print(f\"      Attention: {attn_output[0, j, :5].tolist()}\")\n","        print()\n","\n","    print(\"--- End of Expanded Test ---\")\n","\n","# Run the expanded test\n","test_expanded_transformer_components()"],"metadata":{"id":"e0UqHQVswJzX"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📝 Feed Forward Netwrok\n","\n","A feed-forward network is a multi-layered structure in which information moves in a single direction, from the input layer to the output layer.\n","\n","\n","**✨ Additional Resources:**\n","\n","*   Transformer Feed-Forward Layers Are Key-Value Memories\n"," [Link-ArXiv](https://arxiv.org/abs/2012.14913)\n","\n","\n","\n","\n","\n"],"metadata":{"id":"AGmlNshUAmmM"}},{"cell_type":"code","source":["class FeedForward(nn.Module):\n","    def __init__(self, d_model, d_ff, dropout=0.1):\n","        super().__init__()\n","\n","        #######################Code Here###############################\n","        # Define feed-forward network using nn.Sequential\n","        self.net = nn.Sequential(\n","          # First linear layer\n","          # ReLU activation\n","          # Dropout for regularization\n","          # Second linear layer\n","        )\n","        ###############################################################\n","\n","\n","    def forward(self, x):\n","        return self.net(x)  # Apply the feed-forward network"],"metadata":{"id":"yH4rocZtFSY7"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### ✍ Testing the Feed Forward Layer"],"metadata":{"id":"XxPG41aLFJz6"}},{"cell_type":"code","source":["def test_all_transformer_components():\n","    print(\"\\n--- Testing Transformer Components ---\\n\")\n","\n","    # Set a fixed seed for reproducibility\n","    torch.manual_seed(42)\n","\n","    # Sample texts\n","    texts = [\n","        \"This is a short sentence.\",\n","        \"This is a much longer sentence that will be split into multiple windows. Observe the term overlapping?\",\n","        \"Another sentence of medium length.\"\n","    ]\n","\n","    # Initialize tokenizer and fit it to the texts\n","    tokenizer = SimpleTokenizer()\n","    tokenizer.fit(texts)\n","\n","    # Create dataset\n","    max_length = 10\n","    overlap = 2\n","    dataset = TextDataset(texts, tokenizer, max_length, overlap)\n","\n","    # Hyperparameters\n","    d_model = 16  # Small dimension for demonstration\n","    nhead = 2\n","    d_ff = d_model  # Feed-forward dimension\n","    dropout = 0.1\n","\n","    # Initialize components\n","    pos_encoder = PositionalEncoding(d_model, max_length)\n","    masked_self_attn = MaskedAttention(d_model, nhead)\n","    feed_forward = FeedForward(d_model, d_ff, dropout)\n","\n","    print(f\"Dataset configuration:\")\n","    print(f\"  Max length: {max_length}\")\n","    print(f\"  Overlap: {overlap}\")\n","    print(f\"  Total windows: {len(dataset)}\")\n","    print(f\"  Vocabulary size: {len(tokenizer)}\")\n","    print(f\"  Embedding dimension: {d_model}\")\n","    print(f\"  Number of attention heads: {nhead}\")\n","    print(f\"  Feed-forward dimension: {d_ff}\\n\")\n","\n","    # Process each window\n","    for i in range(len(dataset)):\n","        tokens, attention_mask, doc_idx = dataset[i]\n","\n","        print(f\"Window {i} (from document {doc_idx}):\")\n","        print(f\"  Original tokens: {tokens.squeeze(0).tolist()}\")\n","        print(f\"  Attention mask: {attention_mask.squeeze(0).tolist()}\")\n","        print(f\"  Decoded: '{tokenizer.decode(tokens.squeeze(0).tolist())}'\")\n","\n","        # Convert tokens to \"embeddings\" (just for demonstration)\n","        pseudo_embeddings = torch.rand(1, tokens.size(1), d_model)  # (batch_size, seq_len, d_model)\n","        print(f\"  Shape of pseudo embeddings: {pseudo_embeddings.shape}\")\n","\n","        # Apply positional encoding\n","        pos_encoded = pos_encoder(pseudo_embeddings)\n","        print(f\"  Shape after positional encoding: {pos_encoded.shape}\")\n","\n","        # Apply masked self-attention\n","        attn_output, attn_mask = masked_self_attn(pos_encoded)\n","        print(f\"  Shape after masked self-attention: {attn_output.shape}\")\n","\n","        # Apply feed-forward network\n","        ff_output = feed_forward(attn_output)\n","        print(f\"  Shape after feed-forward: {ff_output.shape}\")\n","\n","        # Display the effect of positional encoding, attention, and feed-forward for all tokens\n","        print(f\"  Transformer effect on tokens:\")\n","        for j in range(tokens.size(1)):\n","            if attention_mask[0, j] == 1:  # Only show for non-padding tokens\n","                print(f\"    Token {j}:\")\n","                print(f\"      Initial:   {pseudo_embeddings[0, j, :5].tolist()}\")\n","                print(f\"      Positional:{pos_encoded[0, j, :5].tolist()}\")\n","                print(f\"      Attention Mask: {attn_mask[j, :5].tolist()}\")  # Show first 5 values\n","                print(f\"      Attention: {attn_output[0, j, :5].tolist()}\")\n","                print(f\"      Feed-Forward: {ff_output[0, j, :5].tolist()}\")\n","        print()\n","\n","    print(\"--- End of Expanded Test ---\")\n","\n","# Run the expanded test\n","test_all_transformer_components()"],"metadata":{"id":"FyyrKkJZAuMA"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📝 Decoder Layer\n","\n","Implementation of a Transformer Decoder Layer with Masked Multi-Head Attention and Feed Forward Network\n","\n","\n","\n","\n","\n","\n"],"metadata":{"id":"HM696sBjAtmY"}},{"cell_type":"markdown","source":["![Decoder Only Architecture](https://drive.google.com/uc?id=1ksROxQxf3b7dlBUoIQggzyLeBaPO-AQn)\n","\n","\n"],"metadata":{"id":"YPWxAfNbXhJ9"}},{"cell_type":"code","source":["class DecoderLayer(nn.Module):\n","    def __init__(self, d_model: int, nhead: int, d_ff: int, dropout: float = 0.1):\n","        super().__init__()\n","        self.norm1 = nn.LayerNorm(d_model)\n","\n","        #######################Code Here###############################\n","\n","        # Define the decoder - Layer Normalization, Attention, Feed_forward Dropouts\n","\n","        ###############################################################\n","\n","    def forward(self, x: torch.Tensor) -> torch.Tensor:\n","        # Masked Multi-Head Attention\n","        normed_x = self.norm1(x)\n","        attn_output, _ = self.masked_attention(normed_x) # _ because we returned also the mask in the previous demonstration\n","\n","        #######################Code Here###############################.\n","\n","        # Create the remaining connection and residual connections - What is a residual connection? (https://towardsdatascience.com/what-is-residual-connection-efb07cab0d55)\n","\n","        ###############################################################\n","\n","        return x"],"metadata":{"id":"P02MAkENBZrS"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📚 Decoder-only Transformer\n","\n","The components of a Decoder-Only Transformer include an embedding layer for token representation, positional encoding for sequential information, stacked decoder layers for hierarchical processing, layer normalization for stability, and an output projection layer for generating tokens."],"metadata":{"id":"VJULRnp4BZJS"}},{"cell_type":"code","source":["class DecoderOnlyTransformer(nn.Module):\n","    def __init__(self, vocab_size: int, d_model: int, nhead: int, num_layers: int,\n","                 d_ff: int, max_seq_length: int, dropout: float = 0.1):\n","        super().__init__()\n","        self.d_model = d_model\n","        self.vocab_size = vocab_size\n","        self.max_seq_length = max_seq_length\n","\n","        # Embedding layer\n","        self.embedding = nn.Embedding(vocab_size, d_model)\n","\n","        # Positional encoding\n","        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)\n","\n","        # Decoder layers\n","        self.layers = nn.ModuleList([\n","            DecoderLayer(d_model, nhead, d_ff, dropout)\n","            for _ in range(num_layers)\n","        ])\n","\n","        # Final layer norm\n","        self.final_norm = nn.LayerNorm(d_model)\n","\n","        # Output projection\n","        self.output_projection = nn.Linear(d_model, vocab_size)\n","\n","    def forward(self, x: torch.Tensor) -> torch.Tensor:\n","        # x shape: (batch_size, seq_len)\n","\n","        # Embed the input\n","        x = self.embedding(x) * math.sqrt(self.d_model)\n","\n","        # Add positional encoding\n","        x = self.positional_encoding(x)\n","\n","        # Apply decoder layers\n","        for layer in self.layers:\n","            x = layer(x)\n","\n","        # Apply final layer norm\n","        x = self.final_norm(x)\n","\n","        # Project to vocabulary size\n","        output = self.output_projection(x)\n","\n","        return output\n","\n","    def generate(self, start_tokens: torch.Tensor, max_length: int,\n","                 temperature: float = 1.0) -> torch.Tensor:\n","        self.eval()\n","        current_seq = start_tokens\n","\n","        with torch.no_grad():\n","            for _ in range(max_length - start_tokens.size(1)):\n","                # Ensure we're not exceeding the maximum sequence length\n","                if current_seq.size(1) > self.max_seq_length:\n","                    current_seq = current_seq[:, -self.max_seq_length:]\n","\n","                # Get model predictions\n","                logits = self(current_seq)\n","                next_token_logits = logits[:, -1, :] / temperature\n","\n","                # Sample next token\n","                probs = F.softmax(next_token_logits, dim=-1)\n","                next_token = torch.multinomial(probs, num_samples=1)\n","\n","                # Append next token to sequence\n","                current_seq = torch.cat([current_seq, next_token], dim=1)\n","\n","                # Check if we've generated an EOS token\n","                if next_token.item() == self.vocab_size - 1:  # Assuming EOS is the last token\n","                    break\n","\n","        return current_seq"],"metadata":{"id":"hs7mWseS3hWV"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### ✍ Displaying the Decoder-only Transformer Architecture"],"metadata":{"id":"YyhgXe9X7ytF"}},{"cell_type":"code","source":["!pip install torchinfo\n","from torchinfo import summary\n","\n","# Initialize the model with some example parameters\n","vocab_size = 10000\n","d_model = 512\n","nhead = 2\n","num_layers = 1\n","d_ff = 2048\n","max_seq_length = 1024\n","dropout = 0.1\n","\n","# Define your model\n","model = DecoderOnlyTransformer(\n","    vocab_size=vocab_size,\n","    d_model=d_model,\n","    nhead=nhead,\n","    num_layers=num_layers,\n","    d_ff=d_ff,\n","    max_seq_length=max_seq_length,\n","    dropout=dropout\n",")\n","\n","# Print the model summary\n","summary(model, input_size=(1, max_seq_length), dtypes=[torch.int64])"],"metadata":{"id":"49nYvRll78Ug"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📚 Training the Decoder-only Transformer"],"metadata":{"id":"qJMRxOoSB0Go"}},{"cell_type":"code","source":["# Load the tiny_shakespeare dataset\n","dataset = load_dataset(\"tiny_shakespeare\", split=\"train\")\n","# Load the tiny_shakespeare dataset\n","# dataset = load_dataset(\"lyimo/shakespear\", split=\"train\")\n","\n","# Extract the text from the dataset\n","texts = dataset[\"text\"]\n","\n","# Hyperparameters\n","d_model = 256\n","nhead = 2\n","num_layers = 2\n","d_ff = 256\n","max_seq_length = 128\n","batch_size = 64\n","num_epochs = 10\n","learning_rate = 0.0001\n","dropout = 0.2\n","\n","# Tokenize and prepare data\n","tokenizer = SimpleTokenizer()\n","tokenizer.fit(texts)\n","vocab_size = len(tokenizer.word_to_idx)\n","\n","dataset = TextDataset(texts, tokenizer, max_seq_length)\n","train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)\n","\n","print(f\"Vocabulary size: {vocab_size}\")\n","\n","# Device configuration\n","device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n","\n","# Create model and move to device\n","model = DecoderOnlyTransformer(vocab_size, d_model, nhead, num_layers, d_ff, max_seq_length, dropout).to(device)\n","\n","# Create optimizer and loss function\n","optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)\n","criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.word_to_idx[\"<PAD>\"])\n","\n","# Training loop\n","for epoch in range(num_epochs):\n","    model.train()\n","    total_loss = 0\n","    for batch_idx, batch in enumerate(train_loader):\n","        optimizer.zero_grad()\n","\n","        input_seq, _, _ = batch  # Unpack batch\n","        input_seq = input_seq.squeeze(1).to(device)  # Move input to device and remove extra dimension\n","\n","        # Forward pass\n","        output = model(input_seq)\n","\n","        # Reshape output tensor\n","        output = output[:, :-1, :].contiguous().view(-1, output.size(-1))  # Shift predictions to the left\n","\n","        # Shift targets to the right (original targets)\n","        target_seq = input_seq[:, 1:].contiguous().view(-1)\n","\n","        # Compute loss\n","        loss = criterion(output, target_seq)\n","\n","        # Debugging prints\n","        print(f\"Loss: {loss.item()}\")\n","\n","        # Backward pass and optimize\n","        loss.backward()\n","        optimizer.step()\n","\n","        total_loss += loss.item()\n","\n","        if batch_idx == 0:\n","          # Debugging prints\n","          print(f\"Epoch: {epoch+1}, Batch: {batch_idx+1}\")\n","          print(f\"Input sequence shape: {input_seq.shape}\")\n","          print(f\"Input sequence: {input_seq.unsqueeze(1)}\")\n","          print(f\"Output shape before reshape: {output.shape}\")\n","          print(f\"Output shape after reshape: {output.shape}\")\n","          print(f\"Target sequence shape: {target_seq.shape}\")\n","\n","    # Print epoch loss\n","    print(f\"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss / len(train_loader):.4f}\")"],"metadata":{"id":"gSUt8j7yBrcP"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### ✍ Testing the Decoder-only Transformer"],"metadata":{"id":"XEkV3lWgCsHH"}},{"cell_type":"code","source":["texts = [\"Better three hours too soon than\", \" I believe I can \", \"My words fly up, my\", \"Brevity is \", \"Love looks not with the eyes, but\", \"To be or \"]\n","\n","for quote in texts:\n","  start_tokens = torch.tensor(tokenizer.encode(quote)).unsqueeze(0).to(device)  # Add batch dimension and move to device\n","\n","  generated_tokens = model.generate(start_tokens, max_length=20, temperature=.9)\n","  generated_text = tokenizer.decode(generated_tokens.squeeze().tolist())\n","\n","  print(generated_text)"],"metadata":{"id":"ioC249m3-X2H"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["_____________________________________________"],"metadata":{"id":"6_9fxH9cKaGq"}},{"cell_type":"markdown","source":["# Decoder-only with MoE instead of FFN\n","\n","In the Sparse Mixture of Experts (MoE) architecture, the self-attention mechanism within each transformer block stays the same.\n","\n","However, a key modification is made to the structure **of each block**: the standard **feed-forward neural network** is replaced with **multiple sparsely activated feed-forward networks, known as experts.**\n","\n","\"Sparse activation\" means that each token in the sequence is routed to only a small number of these experts—usually one or two—out of the entire pool.\n","\n","\n","\n","**✨ Additional Resources:**\n","\n","*   makeMoE: Implement a Sparse Mixture of Experts Language Model from Scratch [Link-huggingface](https://huggingface.co/blog/AviSoori1x/makemoe-from-scratch)\n","\n","\n"],"metadata":{"id":"LYtAnvTb83kj"}},{"cell_type":"markdown","source":["_____________________________________________"],"metadata":{"id":"Um0oVSr_KdMV"}},{"cell_type":"markdown","source":["## 📝 Expert Layer"],"metadata":{"id":"uDOiZz6LKT69"}},{"cell_type":"code","source":["class Expert(nn.Module):\n","    \"\"\" An MLP with a single hidden layer and ReLU activation, serving as an expert in a Mixture of Experts. \"\"\"\n","    def __init__(self, n_embd: int, dropout: float = 0.1):\n","        super().__init__()\n","        self.net = nn.Sequential(\n","\n","        #######################Code Here###############################\n","\n","        # Define a Linear, ReLU, Linaer and Dropout, Linear should be (n_embd, 4 * n_embd)\n","\n","        ###############################################################\n","\n","      )\n","\n","    def forward(self, x: torch.Tensor) -> torch.Tensor:\n","        return self.net(x)"],"metadata":{"id":"S9PuG3xQKZcp"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📚 Gating in MoE Architectures\n","\n","Types of gating in Mixture of Experts (MoE) systems include Top-k gating, Noisy Top-k gating (as implemented here), and other variants like Hierarchical gating or Soft gating.\n","\n","Gating is essential in MoE systems because it determines which experts to use for each input, allowing the model to specialize different experts for different types of inputs or tasks.\n","\n","Specifically, Noisy Top-k gating adds controlled randomness to the expert selection process, which can help balance expert utilization and potentially improve model performance by introducing exploration in the routing mechanism."],"metadata":{"id":"KhIocpeCKevX"}},{"cell_type":"code","source":["class NoisyTopkRouter(nn.Module):\n","    def __init__(self, n_embed, num_experts, top_k_moe):\n","        super(NoisyTopkRouter, self).__init__()\n","        # Store the top_k_moe parameter which specifies the number of top experts to select\n","        self.top_k_moe = top_k_moe\n","        # Linear layer to compute logits for routing\n","        self.topkroute_linear = nn.Linear(n_embed, num_experts)\n","        # Linear layer to compute noise logits for added noise\n","        self.noise_linear = nn.Linear(n_embed, num_experts)\n","\n","    def forward(self, mh_output):\n","        # Compute the logits for routing to experts\n","        logits = self.topkroute_linear(mh_output)\n","        # Compute the noise logits\n","        noise_logits = self.noise_linear(mh_output)\n","        # Generate noise with standard deviation determined by softplus of noise logits\n","        noise = torch.randn_like(logits) * F.softplus(noise_logits)\n","        # Add noise to the original logits to get noisy logits\n","        noisy_logits = logits + noise\n","        # Select the top k logits and their indices from the noisy logits\n","        top_k_moe_logits, indices = noisy_logits.topk(self.top_k_moe, dim=-1)\n","        # Create a tensor full of -inf values\n","        zeros = torch.full_like(noisy_logits, float('-inf'))\n","        # Scatter the top k logits into the zeros tensor to create a sparse logits tensor\n","        sparse_logits = zeros.scatter(-1, indices, top_k_moe_logits)\n","        # Apply softmax to the sparse logits to get the final router output\n","        router_output = F.softmax(sparse_logits, dim=-1)\n","        # Return the router output and the indices of the selected experts\n","        return router_output, indices"],"metadata":{"id":"N1z8IuwGKjV3"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📚 Sparse MoE Layer"],"metadata":{"id":"OXoO2q9IKlmc"}},{"cell_type":"code","source":["class SparseMoE(nn.Module):\n","    def __init__(self, n_embed, num_experts, top_k_moe):\n","        super(SparseMoE, self).__init__()\n","        # Initialize the NoisyTopkRouter to determine which experts to activate\n","        self.router = NoisyTopkRouter(n_embed, num_experts, top_k_moe)\n","        # Create a list of expert networks, each being a feed-forward network\n","        self.experts = nn.ModuleList([Expert(n_embed) for _ in range(num_experts)])\n","        # Store the number of top experts to activate\n","        self.top_k_moe = top_k_moe\n","\n","    def forward(self, x):\n","        # Get the gating output and indices from the router\n","        gating_output, indices = self.router(x)\n","        # Initialize the final output tensor with zeros, having the same shape as x\n","        final_output = torch.zeros_like(x)\n","        # Flatten the input tensor to simplify processing\n","        flat_x = x.view(-1, x.size(-1))\n","        # Flatten the gating output tensor to align with the flattened input\n","        flat_gating_output = gating_output.view(-1, gating_output.size(-1))\n","\n","        # Iterate over each expert\n","        for i, expert in enumerate(self.experts):\n","            # Create a mask to identify where the current expert is used\n","            expert_mask = (indices == i).any(dim=-1)\n","            # Flatten the expert mask to match the flattened input\n","            flat_mask = expert_mask.view(-1)\n","\n","            if flat_mask.any():  # Check if there are any positions using the current expert\n","                # Extract the inputs for the current expert based on the mask\n","                expert_input = flat_x[flat_mask]\n","                # Get the output from the current expert\n","                expert_output = expert(expert_input)\n","                # Get the gating scores for the current expert\n","                gating_scores = flat_gating_output[flat_mask, i].unsqueeze(1)\n","                # Compute the weighted output based on the gating scores\n","                weighted_output = expert_output * gating_scores\n","                # Add the weighted output to the final output tensor\n","                final_output[expert_mask] += weighted_output.view_as(final_output[expert_mask])\n","\n","        # Return the final output tensor which combines the results from all activated experts\n","        return final_output\n"],"metadata":{"id":"N6kQTACyKvhF"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📝 Decoder with Sparse MoE\n","\n","![Decoder Only Architecture](https://drive.google.com/uc?id=1ksROxQxf3b7dlBUoIQggzyLeBaPO-AQn)\n","\n"],"metadata":{"id":"pru3rAF3Kwg7"}},{"cell_type":"code","source":["class DecoderLayerMoE(nn.Module):\n","    def __init__(self, d_model, nhead, d_ff, num_experts, top_k_moe, dropout=0.1):\n","        super().__init__()\n","        #######################Code Here###############################\n","\n","        # Create the Decoder architecture as before, but now add the MoE block instead of the FFN\n","\n","\n","    def forward(self, x):\n","\n","        return x\n","\n","        ###############################################################\n","\n","    def generate_square_subsequent_mask(self, sz):\n","        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)\n","        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))\n","        return mask"],"metadata":{"id":"t38S7EsDK3_r"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### ✍ Showcase how Sparse MoE handles its inputs\n","\n","-  Experiment: Change the number of experts in the SparseMoE_example model.\n","  -  Observation: Observe how increasing or decreasing the number of experts affects the routing, gating outputs, and final output.\n","\n","- Experiment: Adjust the top_k_moe parameter to select more or fewer top experts.\n","  - Observation: See how the number of experts activated for each token changes and how it impacts the final output.\n","\n"],"metadata":{"id":"ZGu0o0A00f3t"}},{"cell_type":"code","source":["import torch\n","import torch.nn as nn\n","import torch.nn.functional as F\n","\n","class SparseMoE_example(nn.Module):\n","    def __init__(self, n_embed, num_experts, top_k_moe):\n","        super(SparseMoE_example, self).__init__()\n","        self.router = NoisyTopkRouter(n_embed, num_experts, top_k_moe)\n","        self.experts = nn.ModuleList([Expert(n_embed) for _ in range(num_experts)])\n","        self.top_k_moe = top_k_moe\n","\n","    def forward(self, x):\n","        gating_output, indices = self.router(x)\n","        final_output = torch.zeros_like(x)\n","        flat_x = x.view(-1, x.size(-1))\n","        flat_gating_output = gating_output.view(-1, gating_output.size(-1))\n","\n","        for i, expert in enumerate(self.experts):\n","            expert_mask = (indices == i).any(dim=-1)\n","            flat_mask = expert_mask.view(-1)\n","\n","            if flat_mask.any():\n","                expert_input = flat_x[flat_mask]\n","                expert_output = expert(expert_input)\n","                gating_scores = flat_gating_output[flat_mask, i].unsqueeze(1)\n","                weighted_output = expert_output * gating_scores\n","                final_output[expert_mask] += weighted_output.view_as(final_output[expert_mask])\n","\n","        return final_output\n","\n","    def forward_debug_example(self, x):\n","        # Forward pass with debug prints\n","        gating_output, indices = self.router(x)\n","        print(\"Gating Output Shape:\", gating_output.shape)\n","        print(\"Gating Output:\", gating_output)\n","        print(\"Expert Indices Shape:\", indices.shape)\n","        print(\"Expert Indices:\", indices)\n","\n","        print(\"Input Shape:\", x.shape)\n","        print(\"Input:\", x)\n","\n","        final_output = torch.zeros_like(x)\n","        flat_x = x.view(-1, x.size(-1))\n","        flat_gating_output = gating_output.view(-1, gating_output.size(-1))\n","\n","        for i, expert in enumerate(self.experts):\n","            print(\"\\n\" + \"-\"*50)\n","            expert_mask = (indices == i).any(dim=-1)\n","            flat_mask = expert_mask.view(-1)\n","\n","            if flat_mask.any():\n","                expert_input = flat_x[flat_mask]\n","                print(f\"Expert {i} Input Shape:\", expert_input.shape)\n","                print(f\"Expert {i} Input:\", expert_input)\n","\n","                expert_output = expert(expert_input)\n","                print(f\"Expert {i} Output Shape:\", expert_output.shape)\n","                print(f\"Expert {i} Output:\", expert_output)\n","\n","                gating_scores = flat_gating_output[flat_mask, i].unsqueeze(1)\n","                print(f\"Gating Scores for Expert {i}:\")\n","                print(gating_scores.squeeze())\n","\n","                print(f\"Weighted Output for Expert {i}:\")\n","                weighted_output = expert_output * gating_scores\n","                print(weighted_output)\n","\n","                final_output[expert_mask] += weighted_output.view_as(final_output[expert_mask])\n","                print(f\"Expert {i} final_output Shape:\", final_output[expert_mask].shape)\n","                print(f\"Expert {i} final_output:\", final_output[expert_mask])\n","\n","        print(\"Final MoE Output Shape:\", final_output.shape)\n","        print(\"Final MoE Output:\", final_output)\n","        print(\"-\"*50)\n","        return final_output\n","\n","\n","# Example usage and debugging prints\n","def test_sparse_moe():\n","    # Parameters\n","    batch_size = 2   #\n","    seq_length = 1   # number of tokens, if 1 then it is easier to see which experts are activated and how each embedding is calculated\n","    n_embed = 5\n","    num_experts = 6  # Increased number of experts\n","    top_k_moe = 2   # If you modify, more or less experts will be activated for each input token\n","\n","    # Random input tensor (simulating token embeddings)\n","    random_input = torch.randn(batch_size, seq_length, n_embed)\n","\n","    # Initialize SparseMoE\n","    sparse_moe = SparseMoE_example(n_embed, num_experts, top_k_moe)\n","\n","    # Forward pass with debugging example\n","    final_output = sparse_moe.forward_debug_example(random_input)\n","\n","    print(\"\\nRandom Input Tensor:\")\n","    print(random_input)\n","    print(\"\\nFinal Output Tensor (after MoE processing):\")\n","    print(final_output)\n","\n","# Run the test function\n","test_sparse_moe()"],"metadata":{"id":"9DBAPMat0oJh"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📚 Decoder-only Transformer with MoE"],"metadata":{"id":"5BfeO9sAK5YD"}},{"cell_type":"code","source":["class DecoderOnlyTransformerMoE(nn.Module):\n","    def __init__(self, vocab_size, d_model, nhead, num_layers, d_ff, max_seq_length, dropout, num_experts, top_k_moe):\n","        super().__init__()\n","        self.embedding = nn.Embedding(vocab_size, d_model)\n","        self.pos_encoder = PositionalEncoding(d_model, max_seq_length)\n","        self.layers = nn.ModuleList([DecoderLayerMoE(d_model, nhead, d_ff, num_experts, top_k_moe, dropout) for _ in range(num_layers)])\n","        self.norm = nn.LayerNorm(d_model)\n","        self.output = nn.Linear(d_model, vocab_size)\n","        self.max_seq_length = max_seq_length\n","        self.num_experts = num_experts\n","        self.top_k_moe = top_k_moe\n","        self.vocab_size = vocab_size\n","\n","    def forward(self, x):\n","        x = self.embedding(x)\n","        x = self.pos_encoder(x)\n","\n","        for layer in self.layers:\n","            x = layer(x)\n","\n","        x = self.norm(x)\n","        return self.output(x)\n","\n","    def generate(self, start_tokens: torch.Tensor, max_length: int, temperature: float = 1.0) -> torch.Tensor:\n","        self.eval()\n","        current_seq = start_tokens\n","\n","         with torch.no_grad():  # Disable gradient computation\n","            # Generate tokens until max_length is reached or end token is generated\n","            for _ in range(max_length - start_tokens.size(1)):\n","                # Ensure the sequence length does not exceed max_seq_length\n","                if current_seq.size(1) > self.max_seq_length:\n","                    current_seq = current_seq[:, -self.max_seq_length:]\n","\n","                # Get logits from the model\n","                logits = self(current_seq)\n","\n","                # Extract logits for the next token and scale by temperature\n","                next_token_logits = logits[:, -1, :] / temperature\n","\n","                # Compute probabilities using softmax\n","                probs = F.softmax(next_token_logits, dim=-1)\n","\n","                # Sample the next token from the probability distribution\n","                next_token = torch.multinomial(probs, num_samples=1)\n","\n","                # Append the next token to the current sequence\n","                current_seq = torch.cat([current_seq, next_token], dim=1)\n","\n","                # Stop if the end token is generated (vocab_size - 1 assumed to be the end token)\n","                if next_token.item() == self.vocab_size - 1:\n","                    break\n","\n","        # Return the generated sequence\n","        return current_seq"],"metadata":{"id":"KvbjHbTCLEuO"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### ✍ Displaying the Decoder-only Transformer with MoE Architecture"],"metadata":{"id":"gXwM_wnqLL4d"}},{"cell_type":"code","source":["# Initialize the model with some example parameters\n","vocab_size = 10000\n","d_model = 512\n","nhead = 2\n","num_layers = 1\n","d_ff = 2048\n","max_seq_length = 1024\n","dropout = 0.1\n","num_experts = 4\n","top_k_moe = 2\n","\n","# Define your model\n","model = DecoderOnlyTransformer(\n","    vocab_size=vocab_size,\n","    d_model=d_model,\n","    nhead=nhead,\n","    num_layers=num_layers,\n","    d_ff=d_ff,\n","    max_seq_length=max_seq_length,\n","    dropout=dropout,\n",")\n","\n","# Define your model\n","model_moe = DecoderOnlyTransformerMoE(\n","    vocab_size=vocab_size,\n","    d_model=d_model,\n","    nhead=nhead,\n","    num_layers=num_layers,\n","    d_ff=d_ff,\n","    max_seq_length=max_seq_length,\n","    dropout=dropout,\n","    num_experts=num_experts,\n","    top_k_moe=top_k_moe\n",")\n","\n","# Print the model summary\n","print(50*\"-\")\n","print(summary(model, input_size=(1, max_seq_length), dtypes=[torch.int64]))\n","print(50*\"-\")\n","print(summary(model_moe, input_size=(1, max_seq_length), dtypes=[torch.int64]))"],"metadata":{"id":"HPDbqV4ULPmA"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["def print_model_summary(model, input_size):\n","    device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n","    model.to(device)\n","\n","    dummy_input = torch.zeros(input_size, dtype=torch.int64).to(device)\n","\n","    def register_hook(module):\n","        def hook(module, input, output):\n","            class_name = module.__class__.__name__\n","            module_idx = len(summary)\n","            m_key = f\"{module_idx:03d} {class_name}\"\n","            summary[m_key] = {}\n","            summary[m_key][\"input_shape\"] = list(input[0].size())\n","            if isinstance(output, torch.Tensor):\n","                summary[m_key][\"output_shape\"] = list(output.size())\n","            elif isinstance(output, (tuple, list)) and len(output) > 0 and isinstance(output[0], torch.Tensor):\n","                summary[m_key][\"output_shape\"] = [list(out.size()) for out in output]\n","            else:\n","                summary[m_key][\"output_shape\"] = \"multiple outputs\"\n","            params = sum(p.numel() for p in module.parameters())\n","            summary[m_key][\"num_params\"] = params\n","\n","        if not isinstance(module, nn.Sequential) and not isinstance(module, nn.ModuleList):\n","            hooks.append(module.register_forward_hook(hook))\n","\n","    summary = {}\n","    hooks = []\n","    model.apply(register_hook)\n","    model(dummy_input)\n","    for h in hooks:\n","        h.remove()\n","\n","    print(\"----------------------------------------------------------------\")\n","    print(\"{:>20}  {:>25} {:>15}\".format(\"Layer (type)\", \"Input Shape\", \"Param #\"))\n","    print(\"================================================================\")\n","    total_params = 0\n","    total_output = 0\n","    for layer in summary:\n","        line_new = \"{:>20}  {:>25} {:>15}\".format(\n","            layer,\n","            str(summary[layer][\"input_shape\"]),\n","            \"{0:,}\".format(summary[layer][\"num_params\"]),\n","        )\n","        total_params += summary[layer][\"num_params\"]\n","        if isinstance(summary[layer][\"output_shape\"], list) and all(isinstance(i, int) for i in summary[layer][\"output_shape\"]):\n","            total_output += np.prod(summary[layer][\"output_shape\"])\n","        print(line_new)\n","    print(\"================================================================\")\n","    print(f\"Total params: {total_params:,}\")\n","    print(\"----------------------------------------------------------------\")\n","\n","vocab_size = 10000\n","d_model = 512\n","nhead = 8\n","num_layers = 1\n","d_ff = 2048\n","max_seq_length = 1024\n","dropout = 0.1\n","num_experts = 2\n","top_k_moe = 1\n","\n","# Ensure you have the correct definitions for these classes\n","# from your_model_definitions import DecoderOnlyTransformer, DecoderOnlyTransformerMoE\n","\n","model = DecoderOnlyTransformer(\n","    vocab_size=vocab_size,\n","    d_model=d_model,\n","    nhead=nhead,\n","    num_layers=num_layers,\n","    d_ff=d_ff,\n","    max_seq_length=max_seq_length,\n","    dropout=dropout,\n",")\n","\n","model_moe = DecoderOnlyTransformerMoE(\n","    vocab_size=vocab_size,\n","    d_model=d_model,\n","    nhead=nhead,\n","    num_layers=num_layers,\n","    d_ff=d_ff,\n","    max_seq_length=max_seq_length,\n","    dropout=dropout,\n","    num_experts=num_experts,\n","    top_k_moe=top_k_moe\n",")\n","\n","print(\"Summary for DecoderOnlyTransformer\")\n","print_model_summary(model, (1, max_seq_length))\n","\n","print(\"\\nSummary for DecoderOnlyTransformerMoE\")\n","print_model_summary(model_moe, (1, max_seq_length))\n"],"metadata":{"id":"MGHFvdZwpdqT"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 📚 Training the Decoder-only Transformer with MoE"],"metadata":{"id":"lV4CGY2ZLo1r"}},{"cell_type":"code","source":["# Load the tiny_shakespeare dataset\n","dataset = load_dataset(\"tiny_shakespeare\", split=\"train\")\n","\n","# Extract the text from the dataset\n","texts = dataset[\"text\"]\n","\n","# Hyperparameters\n","d_model = 128\n","nhead = 2\n","num_layers = 2\n","d_ff = 256\n","max_seq_length = 64\n","batch_size = 32\n","num_epochs = 1\n","learning_rate = 0.0001\n","dropout = 0.2\n","num_experts=4\n","top_k_moe=2\n","\n","# Tokenize and prepare data\n","tokenizer = SimpleTokenizer()\n","tokenizer.fit(texts)\n","vocab_size = len(tokenizer.word_to_idx)\n","\n","dataset = TextDataset(texts, tokenizer, max_seq_length)\n","train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)\n","\n","print(f\"Vocabulary size: {vocab_size}\")\n","\n","# Device configuration\n","device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n","\n","# Create model and move to device\n","model_moe = DecoderOnlyTransformerMoE(vocab_size, d_model, nhead, num_layers, d_ff, max_seq_length, dropout, num_experts, top_k_moe).to(device)\n","\n","# Create optimizer and loss function\n","optimizer = torch.optim.AdamW(model_moe.parameters(), lr=learning_rate)\n","criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.word_to_idx[\"<PAD>\"])\n","\n","# Training loop\n","for epoch in range(num_epochs):\n","    model_moe.train()\n","    total_loss = 0\n","    for batch_idx, batch in enumerate(train_loader):\n","        optimizer.zero_grad()\n","\n","        input_seq, _, _ = batch  # Unpack batch\n","        input_seq = input_seq.squeeze(1).to(device)  # Move input to device and remove extra dimension\n","\n","        # Forward pass\n","        output = model_moe(input_seq)\n","\n","\n","        # Reshape output tensor\n","        output = output[:, :-1, :].contiguous().view(-1, output.size(-1))  # Shift predictions to the left\n","\n","        # Shift targets to the right (original targets)\n","        target_seq = input_seq[:, 1:].contiguous().view(-1)\n","\n","\n","        # Compute loss\n","        loss = criterion(output, target_seq)\n","\n","        # Debugging prints\n","        print(f\"Loss: {loss.item()}\")\n","\n","        # Backward pass and optimize\n","        loss.backward()\n","        optimizer.step()\n","\n","        total_loss += loss.item()\n","\n","        if batch_idx == 0:\n","          # Debugging prints\n","          print(f\"Epoch: {epoch+1}, Batch: {batch_idx+1}\")\n","          print(f\"Input sequence shape: {input_seq.shape}\")\n","          print(f\"Input sequence: {input_seq.unsqueeze(1)}\")\n","          print(f\"Output shape before reshape: {output.shape}\")\n","          print(f\"Output shape after reshape: {output.shape}\")\n","          print(f\"Target sequence shape: {target_seq.shape}\")\n","\n","    # Print epoch loss\n","    print(f\"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss / len(train_loader):.4f}\")\n"],"metadata":{"id":"-HgiNedeLo1s"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### ✍ Testing the Decoder-only Transformer with MoE"],"metadata":{"id":"Ll7zTpyWLo1s"}},{"cell_type":"code","source":["texts = [\"Better three hours too soon than\", \" I believe I can \", \"My words fly up, my\", \"Brevity is \", \"Love looks not with the eyes, but\", \"To be or \"]\n","\n","for quote in texts:\n","  start_tokens = torch.tensor(tokenizer.encode(quote)).unsqueeze(0).to(device)  # Add batch dimension and move to device\n","\n","  generated_tokens = model_moe.generate(start_tokens, max_length=20, temperature=.9)\n","  generated_text = tokenizer.decode(generated_tokens.squeeze().tolist())\n","\n","  print(generated_text)"],"metadata":{"id":"yDzYIElYLo1s"},"execution_count":null,"outputs":[]}]}


--------------------------------------------------------------------------------
/1_nlp/part_I_text_classification/Transformer_Encoder_Classification.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","source":["Natural Language Processing Tutorial\n","======\n","\n","This is the tutorial of the 2024 [Mediterranean Machine Learning Summer School](https://www.m2lschool.org/) on Natural Language Processing!\n","\n","This tutorial will explore the fundamental aspects of Natural Language Processing (NLP). Basic Python programming skills are expected.\n","Prior knowledge of standard NLP techniques (e.g. text tokenization and classification with ML) is beneficial but optional when working through the notebooks as they assume minimal prior knowledge.\n","\n","This tutorial combines detailed analysis and development of essential NLP concepts via custom (i.e. from scratch) implementations. Other necessary NLP components will be developed using PyTorch's NLP library implementations. As a result, the tutorial offers deep understanding and facilitates easy usage in future applications.\n","\n","## Outline\n","\n","* Part I: Introduction to Text Tokenization and Classification\n","  *  Text Classification: Simple Classifier\n","  *  Text Classification: Encoder-only Transformer\n","\n","* Part II: Introduction to Decoder-only Transformer and Sparse Mixture of Experts Architecture\n","  *  Text Generation: Decoder-only Transformer\n","  *  Text Generation: Decoder-only Transformer + MoE\n","\n","* Part III: Introduction to Parameter Efficient Fine-tuning\n","  *  Fine-tuning the full Pre-trained Models\n","  *  Fine-tuning using Low-Rank Adaptation of Large Language Models (LoRA)\n","\n","## Notation\n","\n","* Sections marked with [📚] contain cells that you should read, modify and complete to understand how your changes alter the obtained results.\n","* External resources are mentioned with [✨]. These provide valuable supplementary information for this tutorial and offer opportunities for further in-depth exploration of the topics covered.\n","\n","\n","## Libraries\n","\n","This tutorial leverages [PyTorch](https://pytorch.org/) for neural network implementation and training, complemented by standard Python libraries for data processing and the [Hugging Face](https://huggingface.co/) datasets library for accessing NLP resources.\n","\n","GPU access is recommended for optimal performance, particularly for model training and text generation. While all code can run on CPU, a CUDA-enabled environment will significantly speed up these processes.\n","\n","## Credits\n","\n","The tutorial is created by:\n","\n","* [Luca Herranz-Celotti](http://LuCeHe.github.io)\n","* [Georgios Peikos](https://www.linkedin.com/in/peikosgeorgios/)\n","\n","It is inspired by and synthesizes various online resources, which are cited throughout for reference and further reading.\n","\n","## Note for Colab users\n","\n","To grab a GPU (if available), make sure you go to `Edit -> Notebook settings` and choose a GPU under `Hardware accelerator`\n","\n"],"metadata":{"id":"832pEvfsciyd"}},{"cell_type":"markdown","source":["In this notebook we will show how a simple sentiment classification task can be solved using first a simple neural network in PyTorch, and then using the great Transformer encoder. Let's begin."],"metadata":{"id":"-xZzJjAao8PS"}},{"cell_type":"markdown","metadata":{"id":"dxXRV09-ixEA"},"source":["# Chapter I. Simple Architecture for Language Classification"]},{"cell_type":"markdown","metadata":{"id":"G7Wv5Dm6J7xI"},"source":["##Step 1: Load Packages"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"7zSBW7gbJJdy","collapsed":true},"outputs":[],"source":["!pip install datasets"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"5GcxgGKuH1it"},"outputs":[],"source":["from tqdm import tqdm\n","\n","import torch\n","from torch.utils.data import DataLoader, TensorDataset\n","import torch.nn as nn\n","import torch.optim as optim\n","\n","from datasets import load_dataset\n","from tokenizers import Tokenizer\n","from tokenizers.models import WordPiece\n","from tokenizers.pre_tokenizers import Whitespace\n","from tokenizers.trainers import WordPieceTrainer\n","from tokenizers.processors import BertProcessing"]},{"cell_type":"markdown","metadata":{"id":"nfnZCzvnKFxJ"},"source":["##📚 Step 2: Load a Dataset\n","We'll use the ✨ [HuggingFace datasets](https://huggingface.co/datasets/nyu-mll/glue) library to load a dataset to play with. Let's use the GLUE MRPC dataset for sentiment analysis. ✨ [GLUE](https://gluebenchmark.com/), the General Language Understanding Evaluation benchmark, is a collection of resources for training, evaluating, and analyzing natural language understanding systems. The ✨ [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398), Microsoft Research Paraphrase Corpus, is a corpus of sentence pairs automatically extracted from online news sources, with human annotations indicating whether the sentences in the pair are semantically equivalent. We will load also another dataset that we will use to create our tokenizer."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"n9RmyOZjZHp3"},"outputs":[],"source":["# EXERCISE: Load the GLUE MRPC dataset\n","dataset = load_dataset(\n","\n","# EXERCISE: Create the tokenizer on a different dataset than the one used for\n","# training. Load the train split of the wikitext-103-raw-v1 dataset.\n","# For speed we will use only the 100K sentences.\n","num_sentences = 100_000\n","tokenizer_dataset = load_dataset("]},{"cell_type":"markdown","metadata":{"id":"kUOcwwTJK3qu"},"source":["##📚 Step 3: Tokenize the Dataset with tokenizers\n","In order to turn long texts into numbers that can be used by the mathematics of a neural network, we have to first cut long texts into small pieces, which is called tokenization, which in turn will make the step to turn those pieces into integers very simple."]},{"cell_type":"markdown","metadata":{"id":"2elGnb5gXzXX"},"source":["Four well-known types of tokenization are Character-level, Word-level, BPE and WordPiece, the last two known as two Subword tokenizers:\n","\n","1. **Character-Level Tokenization:**\n","  - **Description:** Character-level tokenization breaks down text into individual characters. Each character, including spaces and punctuation, is treated as a separate token. Used in the old days.\n","  - **Example:** For the sentence \"Hello, world!\", character-level tokenization would result in tokens: ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']\n","\n","2. **Word-Level Tokenization:**\n","  - **Description:** Word-level tokenization splits text into words based on whitespace or punctuation. Each word is considered a separate token. Used in the old days.\n","  - **Example:** For the sentence \"Hello, world!\", word-level tokenization would result in tokens: ['Hello', ',', 'world', '!']\n","3. **Byte-Level Byte Pair Encoding (Byte-level BPE):**\n","  - **Description:** Byte-level BPE tokenization operates on bytes of the input text. It uses a merge operation to gradually build a vocabulary of byte pairs, making it useful for handling multilingual texts and rare characters. Used by e.g. GPT-2, RoBERTa.\n","  - **Example:** It creates tokens based on byte pairs, such as \"b\" and \"an\" merging into a single token of \"ban\".\n","4. **WordPiece Tokenization:**\n","  - **Description:** WordPiece tokenization breaks words into smaller units. It begins with a basic vocabulary of individual characters and merges the most frequent character sequences to form new tokens. Used by e.g. BERT, DistilBERT, and Electra.\n","  - **Example:** For the word \"tokenization\", WordPiece might create tokens like \"token\", \"##ization\" where \"##\" indicates continuation.\n","\n","Better tokenizers have been developed to serve different purposes in natural language processing and generation tasks, from handling character-level nuances to efficiently managing vocabulary size and handling unseen words.\n","\n","Typically you will end up using an existing tokenizer, for example the one used by GPT-2 is relatively popular, but here we show you the steps to create one from scratch using the tokenizers library by HuggingFace."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"YxIccr_FNY9g"},"outputs":[],"source":["# Set the maximal number of integers fed to the Neural Network per sentence\n","max_length = 128\n","\n","# Set the number of elements the tokenizer will create as its vocabulary\n","vocab_size = 30522\n","\n","# Initialize the tokenizer with a WordPiece model, using \"[UNK]\" for unknown tokens\n","tokenizer = Tokenizer(WordPiece(unk_token=\"[UNK]\"))\n","\n","# EXERCISE: Configure the tokenizer to split input text based on whitespace, as a .pre_tokenizer\n","tokenizer.pre_tokenizer =\n","\n","# Display the dataset to be used for training the tokenizer\n","print(tokenizer_dataset)\n","train_texts = tokenizer_dataset['text']\n","\n","# EXERCISE: Train the tokenizer\n","trainer = WordPieceTrainer(vocab_size=vocab_size, special_tokens=[\"[UNK]\", \"[CLS]\", \"[SEP]\", \"[PAD]\", \"[MASK]\"])\n","tokenizer.train_from_iterator(\n","\n","# Set up post-processing to handle padding and truncation as BERT inputs\n","tokenizer.post_processor = BertProcessing(\n","    (\"[SEP]\", tokenizer.token_to_id(\"[SEP]\")), # Token to mark the end of a sequence\n","    (\"[CLS]\", tokenizer.token_to_id(\"[CLS]\")) # Token to mark the beginning of a sequence\n",")\n","\n","# EXERCISE: Enable truncation to ensure long sequences do not exceed max_length\n","tokenizer.enable_\n","\n","# EXERCISE: Enable padding to ensure short sequences reach the max_length, adding the\n","# \"[PAD]\" token at the end of the sentence.\n","tokenizer.enable_"]},{"cell_type":"code","source":["# Example texts\n","texts = [\n","    \"Hello, how are you?\",\n","    \"I am fine, thank you!\",\n","    \"What about you?\",\n","    \"[MASK][CLS]\"\n","]\n","\n","# Show the effect of tokenizing random sentences\n","for text in texts:\n","    print('-'*30)\n","    print(\"Text:  \", text)\n","    output = tokenizer.encode(text, 'nice')\n","    print(\"Tokens:\", output.tokens)\n","    print(\"IDs:   \", output.ids)\n","    print(\"length:\", len(output.ids))"],"metadata":{"id":"BWlFipMxK8O1"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"T4ng6UL7fqZI"},"source":["Let's apply our newly created tokenizer to the dataset we want to train our Neural Network to solve."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"XpZib3yWSkjq"},"outputs":[],"source":["def tokenize_function(batch):\n","    # EXERCISE: Tokenize each example in the batch\n","    tokenized_batch = tokenizer.encode_batch(list(zip(batch['sentence1'],\n","\n","    # EXERCISE: Prepare tokenized outputs in the required format\n","    tokenized_dict = {\n","        'input_ids': [ for encoding in tokenized_batch],\n","        'attention_mask': [ for encoding in tokenized_batch]\n","    }\n","\n","    return tokenized_dict\n","\n","# Tokenize the dataset\n","tokenized_datasets = dataset.map(tokenize_function, batched=True)"]},{"cell_type":"markdown","metadata":{"id":"0xqBYWUrmi6D"},"source":["Now we need to turn the list of sequences into matrices, also known as mini-batches of data, of shape (batch_size, max_length), which is the standard way to feed data to Neural Networks."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Jh9XOfk8H8in"},"outputs":[],"source":["batch_size = 64\n","\n","# Convert the tokenized datasets to TensorDatasets\n","def convert_to_tensors(tokenized_dataset):\n","    input_ids = torch.tensor(tokenized_dataset['input_ids'])\n","    attention_mask = torch.tensor(tokenized_dataset['attention_mask'])\n","    # EXERCISE: convert labels to tensor too\n","    labels =\n","    return TensorDataset(input_ids, attention_mask, labels)\n","\n","train_dataset = convert_to_tensors(tokenized_datasets['train'])\n","test_dataset = convert_to_tensors(tokenized_datasets['test'])\n","\n","# Create DataLoader objects\n","# EXERCISE: set the batch_size and shuffle only the train set\n","train_dataloader = DataLoader(\n","test_dataloader = DataLoader("]},{"cell_type":"markdown","metadata":{"id":"y96Y-oxDnCB1"},"source":["## 📚 Step 4: Define the Neural Network\n","\n","We start with an extremely simple Neural Network. First, each integer defined through the tokenization process is assigned a random vector that is learnable, meaning the training process will change its values. That random vector is called embedding, so each word in the sentence will be represented by a learnable vector.\n","\n","Second, since each sentence has variable length, we will take the mean of the sentence over the time axis, to end up with a representation of the sentence that is of the length of the embedding vector.\n","\n","Finally we will use a linear layer to turn that mean embedding, into two possible outcomes: one will represent the network's estimate of the sentence being negative, and the other will represent its estimate of the sentence being positive."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"1KF6yOcdICzN"},"outputs":[],"source":["class SimpleModel(nn.Module):\n","    def __init__(self, vocab_size, embedding_dim, output_dim):\n","        super(SimpleModel, self).__init__()\n","        self.embedding = nn.Embedding(vocab_size, embedding_dim)\n","        self.fc = nn.Linear(embedding_dim, output_dim)\n","\n","    def forward(self, input_ids, attention_mask):\n","        embedded = self.embedding(input_ids)\n","\n","        # Apply attention mask\n","        masked_embedded = embedded * attention_mask.unsqueeze(-1).float()\n","\n","        # Average the embeddings across the temporal dimension\n","        # EXERCISE: average each sentence score with the sentence length\n","        summed =\n","        counts =\n","        averaged =\n","\n","        # Pass through the fully connected layer\n","        output = self.fc(averaged)\n","        return output"]},{"cell_type":"markdown","metadata":{"id":"dmg6RyR3pPb1"},"source":["## 📚 Step 5: Train and Evaluate"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"55kGha23IGn0"},"outputs":[],"source":["def train(model, num_epochs = 2, lr=1e-3, weight_decay=0.01):\n","    # Define loss function and optimizer\n","    criterion = nn.CrossEntropyLoss()\n","    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)\n","\n","    # Initialize the model\n","    initial_embedding_weights = model.embedding.weight.data.clone()\n","\n","    # Training loop\n","    for epoch in range(num_epochs):\n","        model.train()\n","        pbar = tqdm(train_dataloader)\n","        for batch in pbar:\n","            input_ids, attention_mask, labels = batch\n","\n","            # EXERCISE: Zero the gradients\n","            optimizer.\n","\n","            # EXERCISE: Forward pass, pass the inputs and the attention_mask\n","            outputs = model(\n","\n","            # EXERCISE: Compute loss\n","            loss = criterion(\n","\n","            # EXERCISE: Backward pass and optimization step\n","            loss.\n","            optimizer.\n","            pbar.set_description(f\"Loss: {loss.item():.2f}\")\n","\n","    print(f\"Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item()}\")\n","\n","    final_embedding_weights = model.embedding.weight.data\n","\n","    # Check if the weights have changed\n","    weights_changed = not torch.equal(initial_embedding_weights, final_embedding_weights)\n","    print(\"Embedding weights changed during training:\", weights_changed)\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"9Lxq89uNIL4y"},"outputs":[],"source":["def evaluate(model):\n","    # Evaluation loop\n","    model.eval()\n","    correct = 0\n","    total = 0\n","\n","    with torch.no_grad():\n","        for batch in tqdm(test_dataloader):\n","            input_ids, attention_mask, labels = batch\n","\n","            # EXERCISE: Forward pass with masks\n","            outputs = model(\n","\n","            # Get predictions\n","            _, predicted = torch.max(outputs, 1)\n","\n","            # Update accuracy\n","            total += labels.size(0)\n","            correct += (predicted == labels).sum().item()\n","\n","    # EXERCISE: compute the total accuracy\n","    accuracy =\n","    print(f\"\\n\\nTest Accuracy: {accuracy * 100:.2f}%\")"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"dDTjRmWUSY45"},"outputs":[],"source":["# Hyperparameters\n","vocab_size = tokenizer.get_vocab_size()\n","embedding_dim = 64\n","output_dim = 2  # Binary classification\n","\n","simple_model = SimpleModel(vocab_size, embedding_dim, output_dim)\n","\n","params = sum(p.numel() for p in simple_model.parameters() if p.requires_grad)\n","print(f'The model has {params} parameters')\n","\n","# EXERCISE: play with learning rates in the set [1e-2, 1e-3, 1e-4, 1e-5]\n","# to find the best choice\n","train(simple_model, lr=1e-4, weight_decay=0.001)\n","evaluate(simple_model)"]},{"cell_type":"markdown","metadata":{"id":"IKkBrDbJpdaN"},"source":["# Chapter II: Transformer-based Architecture for Language Classification\n","\n","✨ [Transformers](https://arxiv.org/pdf/1706.03762) appeared as the best option first for language translation, replacing RNNs. Now RNNs are making a come back but Transformers are still the standard, essentially thanks to having what is known as an attention mechanism everywhere in the architecture, that allows them to be able to consider all the previous time steps, while RNNs were in theory limited by being able to see only the previous time step.\n","\n","The introduction of a typical Transformer-based classifier, like ✨ [BERT](https://arxiv.org/pdf/1810.04805), has to be preceeded by the introduction of the MultiHearAttention as the main ingredient, and of the PositionalEncoding and FeedForward layers used as its building blocks."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"0zWvzxD_I4hm"},"outputs":[],"source":["import torch\n","import torch.nn as nn\n","import torch.optim as optim\n","import numpy as np\n","\n","# Constants\n","num_heads = 8    # Number of attention heads\n","num_layers = 8   # Number of Transformer layers"]},{"cell_type":"markdown","metadata":{"id":"d_90-JTIL-Qa"},"source":["## 📚 Step 1: MultiHeadAttention\n","\n","The key to the MultiHeadAttention mechanism is the softmax attention.\n","The attention scores are computed using the scaled dot-product of the Query and Key vectors. The formula is:\n","\n","$$\n","Attention(Q,K,V)=softmax\\Big(\\frac{QK^T}{\\sqrt{d_k}}\\Big) V\n","$$\n","\n","where $d_k$ is the dimensionality of the $Q,K,V$ vectors. The softmax operation ensures that the scores are normalized, and the scaling factor $\\sqrt{d_k}$ helps mitigate the issue of large dot-product values. Typically $Q,K,V$ are going to be linear projections of the same tensor."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"O_HM31Lfw-u2"},"outputs":[],"source":["# Multi-Head Attention\n","class MultiHeadAttention(nn.Module):\n","    def __init__(self, d_model, num_heads):\n","        super(MultiHeadAttention, self).__init__()\n","        assert d_model % num_heads == 0\n","        self.d_k = d_model // num_heads\n","        self.num_heads = num_heads\n","\n","        # EXERCISE: create the following 4 linear layers without bias\n","        self.linear_q =\n","        self.linear_k =\n","        self.linear_v =\n","        self.linear_out =\n","\n","    def forward(self, query, mask=None):\n","        batch_size = query.size(0)\n","\n","        # Linear projections\n","        Q = self.linear_q(query)\n","        K = self.linear_k(query)\n","        V = self.linear_v(query)\n","\n","        # Split into multiple heads\n","        Q = Q.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)  # (batch_size, num_heads, seq_len, d_k)\n","        K = K.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)  # (batch_size, num_heads, seq_len, d_k)\n","        V = V.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)  # (batch_size, num_heads, seq_len, d_k)\n","\n","        # Attention scores\n","        scores = torch.matmul(Q, K.transpose(-2, -1))  # (batch_size, num_heads, seq_len, seq_len)\n","\n","        # EXERCISE: divide the scores by the sqrt of\n","        scores =\n","\n","        if mask is not None:\n","            mask = mask.unsqueeze(1).unsqueeze(1)  # (batch_size, 1, 1, seq_len)\n","            scores = scores.masked_fill(mask == 0, -1e9)\n","\n","        # Attention weights\n","        # EXERCISE: apply the softmax to the scores to have the attn_weights\n","        attn_weights =\n","\n","        # Weighted sum of values\n","        attn_output = torch.matmul(attn_weights, V)  # (batch_size, num_heads, seq_len, d_k)\n","\n","        # Concatenate heads and project back to d_model\n","        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)  # (batch_size, seq_len, d_model)\n","        attn_output = self.linear_out(attn_output)  # (batch_size, seq_len, d_model)\n","\n","        return attn_output"]},{"cell_type":"markdown","metadata":{"id":"m7NxFPXtOIqU"},"source":["## 📚 Step 2: PositionalEncoding and FeedForward\n","\n","Next key factors to Transformer-based architectures success are the positional encoding and the interleaved feedforward network. The PositionalEncoding adds positional information to input tokens using sinusoidal functions. The FeedForward layer is a simple feedforward neural network with two linear layers and ReLU activation, that projects the MHA representation into a 4x wider representation."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"WGrGgebTw_mQ"},"outputs":[],"source":["import math\n","\n","# Positional Encoding\n","class PositionalEncoding(nn.Module):\n","    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):\n","        super().__init__()\n","        self.dropout = nn.Dropout(p=dropout)\n","\n","        position = torch.arange(max_len).unsqueeze(1)\n","        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))\n","        pe = torch.zeros(max_len, 1, d_model)\n","        pe[:, 0, 0::2] = torch.sin(position * div_term)\n","        pe[:, 0, 1::2] = torch.cos(position * div_term)\n","        self.register_buffer('pe', pe)\n","\n","    def forward(self, x):\n","        x = x + self.pe[:x.size(0)]\n","        return x\n","\n","# Feedforward Layer\n","class FeedForward(nn.Module):\n","    def __init__(self, d_model, d_ff):\n","        super(FeedForward, self).__init__()\n","        self.linear1 = nn.Linear(d_model, d_ff)\n","        self.activation = nn.GELU()\n","        self.linear2 = nn.Linear(d_ff, d_model)\n","\n","    def forward(self, x):\n","        # EXERCISE: x = linear(activation(linear(x)))\n","        return x"]},{"cell_type":"markdown","metadata":{"id":"hHj74Z_NPCSH"},"source":["## 📚 Step 3: Transformer Encoder\n","\n","The final architecture is a sequence of Transformer blocks where a specific sequence of LayerNormalization, Dropout, skip connections and the layers introduced above are used. Finally the Transformer blocks are chained after the embedding and the positional embedding."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"F_lCluW2xqs6"},"outputs":[],"source":["# Transformer Decoder Layer\n","class TransformerEncoderLayer(nn.Module):\n","    def __init__(self, d_model, num_heads, d_ff):\n","        super(TransformerEncoderLayer, self).__init__()\n","        self.self_attn = MultiHeadAttention(d_model, num_heads)\n","\n","        self.norm1 = nn.LayerNorm(d_model)\n","        self.dropout1 = nn.Dropout(p=0.1)\n","\n","        self.ff = FeedForward(d_model, d_ff)\n","        self.norm2 = nn.LayerNorm(d_model)\n","        self.dropout2 = nn.Dropout(p=0.1)\n","\n","    def forward(self, tgt, tgt_mask=None):\n","        tgt2 = self.self_attn(tgt, mask=tgt_mask)\n","\n","        # EXERCISE: norm(tgt + dropout(tgt2))\n","        tgt =\n","\n","        # EXERCISE: norm(tgt + dropout(ff(tgt)))\n","        tgt =\n","\n","        return tgt\n","\n","# Transformer Encoder\n","class TransformerEncoder(nn.Module):\n","    def __init__(self, input_dim, d_model, num_heads, num_layers, d_ff, output_dim):\n","        super(TransformerEncoder, self).__init__()\n","        self.embedding = nn.Embedding(input_dim, d_model)\n","        self.pos_encoder = PositionalEncoding(d_model)\n","\n","        # EXERCISE: a list of layers has to be recorded as a ModuleList in pytorch\n","        self.layers = [TransformerEncoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)]\n","        self.dropout = nn.Dropout(p=0.1)\n","        self.fc_out = nn.Linear(d_model, output_dim)\n","\n","    def forward(self, tgt, tgt_mask=None):\n","        # EXERCISE: do the embedding and follow them with a pos_encoder\n","        tgt =\n","        tgt =\n","\n","        for layer in self.layers:\n","            tgt = layer(tgt, tgt_mask)\n","\n","        summed = tgt.sum(1)\n","        counts = tgt_mask.sum(1, keepdim=True)\n","        tgt = summed / counts\n","\n","        output = self.fc_out(self.dropout(tgt))\n","        return output"]},{"cell_type":"markdown","source":["## 📚 Step 4: Train and Evaluate"],"metadata":{"id":"fBG020UX4hBi"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"YNcAXxMO360Z"},"outputs":[],"source":["bert_model = TransformerEncoder(\n","    input_dim=vocab_size,\n","    d_model=embedding_dim,\n","    num_heads=num_heads,\n","    num_layers=num_layers,\n","    d_ff=4*embedding_dim,\n","    output_dim=output_dim\n",")\n","\n","params = sum(p.numel() for p in bert_model.parameters() if p.requires_grad)\n","print(f'The model has {params} parameters')\n","\n","# EXERCISE: play with learning rates in the set [1e-2, 1e-3, 1e-4, 1e-5]\n","# to find the best choice\n","train(bert_model, lr=1e-2, weight_decay=0.001)\n","evaluate(bert_model)"]},{"cell_type":"code","source":[],"metadata":{"id":"Lf_VhyEOgXuN"},"execution_count":null,"outputs":[]}],"metadata":{"accelerator":"GPU","colab":{"gpuType":"T4","provenance":[]},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":0}


--------------------------------------------------------------------------------
/2_cv/README.md:
--------------------------------------------------------------------------------
 1 | # [[M2L2024](https://www.m2lschool.org/home)] Tutorial 2: Computer Vision
 2 | 
 3 | **Authors:** [Mirko Paolo Barbato](https://www.linkedin.com/in/mirko-barbato-88709424a), [Srivatsan Srinivasan](https://www.linkedin.com/in/srivatsan-srinivasan-49501123)
 4 | 
 5 | --- 
 6 | This tutorial of the 2024 Mediterranean Machine Learning Summer School will give an insight into the basis of Computer Vision.
 7 | 
 8 | The tutorial is divided into three practical exercises that will show different tasks and aspects of computer vision. The exercises have different difficulties, from the easiest (beginner) to the hardest (advanced).
 9 | 
10 | ## Practical 1: Image classification with Convolutional Neural Networks (CNN)
11 | 
12 | In this practical, you will develop a simple Convolutional Neural Network for image classification.
13 | 
14 | ### Outline
15 | 
16 | - Definition of the dataset class
17 | - Design of a CNN architecture
18 | - Creation of the training and test process
19 | 
20 | ### Notebooks
21 | 
22 | Tutorial: [![Open In 
23 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/2_cv/First_exercise/M2L_Tutor_1_exercise.ipynb)
24 | 
25 | Solution: [![Open In 
26 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/2_cv/First_exercise/M2L_Tutor_1_solution.ipynb)
27 | 
28 | ## Practical 2: Semantic Segmentation with Vision Transformer
29 | 
30 | In this practical, you will develop a Neural Network for semantic segmentation based on the Vision Transformer (ViT).
31 | 
32 | ### Outline
33 | 
34 | - Understanding of the Albumentation library
35 | - Development of the Vision Transformer
36 | 
37 | ### Notebooks
38 | 
39 | Tutorial: [![Open In 
40 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/2_cv/Second_exercise/M2L_Tutor_2_exercise.ipynb)
41 | 
42 | Solution: [![Open In 
43 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/2_cv/Second_exercise/M2L_Tutor_2_solution.ipynb)
44 | 
45 | ## Practical 3: Vector Quantization with Variational AutoEncoder (VQ-VAE)
46 | 
47 | In this practical, you will understand what is and how to implement the Vector Quantization of an image using Variational Autoencoders.
48 | 
49 | ### Outline
50 | 
51 | - Implementation and understanding of Vector Quantization
52 | 
53 | ### Notebooks
54 | 
55 | Tutorial: [![Open In 
56 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/2_cv/Third_exercise/M2L_Tutor_3_exercise.ipynb)
57 | 
58 | Solution: [![Open In 
59 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/2_cv/Third_exercise/M2L_Tutor_3_solution.ipynb)
60 | ---
61 | 


--------------------------------------------------------------------------------
/2_cv/Third_exercise/M2L_Tutor_3_exercise.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "cells": [
  3 |     {
  4 |       "cell_type": "markdown",
  5 |       "metadata": {
  6 |         "id": "n-k9uGuAyA9D"
  7 |       },
  8 |       "source": [
  9 |         "# [VQ-VAE](https://arxiv.org/abs/1711.00937)\n",
 10 |         "\n",
 11 |         "## Introduction\n",
 12 |         "\n",
 13 |         "Variational Auto Encoders (VAEs) can be thought of as what all but the last layer of a neural network is doing, namely feature extraction or seperating out the data. Thus given some data we can think of using a neural network for representation generation.\n",
 14 |         "\n",
 15 |         "Recall that the goal of a generative model is to estimate the probability distribution of high dimensional data such as images, videos, audio or even text by learning the underlying structure in the data as well as the dependencies between the different elements of the data. This is very useful since we can then use this representation to generate new data with similar properties. This way we can also learn useful features from the data in an unsupervised fashion.\n",
 16 |         "\n",
 17 |         "The VQ-VAE uses a discrete latent representation mostly because many important real-world objects are discrete. For example in images we might have categories like \"Cat\", \"Car\", etc. and it might not make sense to interpolate between these categories. Discrete representations are also easier to model since each category has a single value whereas if we had a continous latent space then we will need to normalize this density function and learn the dependencies between the different variables which could be very complex.\n",
 18 |         "\n",
 19 |         "\n",
 20 |         "## Basic Idea\n",
 21 |         "\n",
 22 |         "The overall architecture is summarized in the diagram below:\n",
 23 |         "\n",
 24 |         "![](https://github.com/zalandoresearch/pytorch-vq-vae/blob/master/images/vq-vae.png?raw=1)"
 25 |       ]
 26 |     },
 27 |     {
 28 |       "cell_type": "markdown",
 29 |       "metadata": {
 30 |         "id": "WmBW0M9wyA9F"
 31 |       },
 32 |       "source": [
 33 |         "We start by defining a latent embedding space of dimension `[K, D]` where `K` are the number of embeddings and `D` is the dimensionality of each latent embeddng vector, i.e. $e_i \\in \\mathbb{R}^{D}$. The model is comprised of an encoder and a decoder. The encoder will map the input to a sequence of discrete latent variables, whereas the decoder will try to reconstruct the input from these latent sequences.\n",
 34 |         "\n",
 35 |         "More preciesly, the model will take in batches of RGB images,  say $x$, each of size 32x32 for our example, and pass it through a ConvNet encoder producing some output $E(x)$, where we make sure the channels are the same as the dimensionality of the latent embedding vectors. To calculate the discrete latent variable we find the nearest embedding vector and output it's index.\n",
 36 |         "\n",
 37 |         "The input to the decoder is the embedding vector corresponding to the index which is passed through the decoder to produce the reconstructed image.\n",
 38 |         "\n",
 39 |         "Since the nearest neighbour lookup has no real gradient in the backward pass we simply pass the gradients from the decoder to the encoder  unaltered. The intuition is that since the output representation of the encoder and the input to the decoder share the same `D` channel dimensional space, the gradients contain useful information for how the encoder has to change its output to lower the reconstruction loss.\n",
 40 |         "\n",
 41 |         "## Loss\n",
 42 |         "\n",
 43 |         "The total loss is actually composed of three components\n",
 44 |         "\n",
 45 |         "1. **reconstruction loss**: which optimizes the decoder and encoder\n",
 46 |         "1. **codebook loss**: due to the fact that gradients bypass the embedding, we use a dictionary learning algorithm  which uses an $l_2$  error to move the embedding vectors $e_i$ towards the encoder output\n",
 47 |         "1. **commitment loss**:  since the volume of the embedding space is dimensionless, it can grow arbirtarily if the embeddings $e_i$ do not train as fast as  the encoder parameters, and thus we add a commitment loss to make sure that the encoder commits to an embedding"
 48 |       ]
 49 |     },
 50 |     {
 51 |       "cell_type": "code",
 52 |       "execution_count": null,
 53 |       "metadata": {
 54 |         "id": "HivvzLl3yA9G"
 55 |       },
 56 |       "outputs": [],
 57 |       "source": [
 58 |         "from google.colab import files\n",
 59 |         "\n",
 60 |         "uploaded = files.upload()\n",
 61 |         "\n"
 62 |       ]
 63 |     },
 64 |     {
 65 |       "cell_type": "code",
 66 |       "source": [
 67 |         "!pip3 install -U -r requirements.txt"
 68 |       ],
 69 |       "metadata": {
 70 |         "id": "Bd0s_Mxk0kjk"
 71 |       },
 72 |       "execution_count": null,
 73 |       "outputs": []
 74 |     },
 75 |     {
 76 |       "cell_type": "code",
 77 |       "execution_count": null,
 78 |       "metadata": {
 79 |         "id": "DF0RwrtCyA9G"
 80 |       },
 81 |       "outputs": [],
 82 |       "source": [
 83 |         "from __future__ import print_function\n",
 84 |         "\n",
 85 |         "\n",
 86 |         "import matplotlib.pyplot as plt\n",
 87 |         "import numpy as np\n",
 88 |         "from scipy.signal import savgol_filter\n",
 89 |         "\n",
 90 |         "\n",
 91 |         "from six.moves import xrange\n",
 92 |         "\n",
 93 |         "#import umap\n",
 94 |         "\n",
 95 |         "import torch\n",
 96 |         "import torch.nn as nn\n",
 97 |         "import torch.nn.functional as F\n",
 98 |         "from torch.utils.data import DataLoader\n",
 99 |         "import torch.optim as optim\n",
100 |         "\n",
101 |         "import torchvision.datasets as datasets\n",
102 |         "import torchvision.transforms as transforms\n",
103 |         "from torchvision.utils import make_grid"
104 |       ]
105 |     },
106 |     {
107 |       "cell_type": "code",
108 |       "execution_count": null,
109 |       "metadata": {
110 |         "id": "XY1tyIcAyA9G"
111 |       },
112 |       "outputs": [],
113 |       "source": [
114 |         "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")"
115 |       ]
116 |     },
117 |     {
118 |       "cell_type": "markdown",
119 |       "metadata": {
120 |         "id": "MvBLzjkyyA9H"
121 |       },
122 |       "source": [
123 |         "## Load Data"
124 |       ]
125 |     },
126 |     {
127 |       "cell_type": "code",
128 |       "execution_count": null,
129 |       "metadata": {
130 |         "id": "smrt5_1XyA9H"
131 |       },
132 |       "outputs": [],
133 |       "source": [
134 |         "training_data = datasets.CIFAR10(root=\"data\", train=True, download=True,\n",
135 |         "                                  transform=transforms.Compose([\n",
136 |         "                                      transforms.ToTensor(),\n",
137 |         "                                      transforms.Normalize((0.5,0.5,0.5), (1.0,1.0,1.0))\n",
138 |         "                                  ]))\n",
139 |         "\n",
140 |         "validation_data = datasets.CIFAR10(root=\"data\", train=False, download=True,\n",
141 |         "                                  transform=transforms.Compose([\n",
142 |         "                                      transforms.ToTensor(),\n",
143 |         "                                      transforms.Normalize((0.5,0.5,0.5), (1.0,1.0,1.0))\n",
144 |         "                                  ]))"
145 |       ]
146 |     },
147 |     {
148 |       "cell_type": "code",
149 |       "execution_count": null,
150 |       "metadata": {
151 |         "id": "EvjWtBp8yA9H"
152 |       },
153 |       "outputs": [],
154 |       "source": [
155 |         "data_variance = np.var(training_data.data / 255.0)"
156 |       ]
157 |     },
158 |     {
159 |       "cell_type": "markdown",
160 |       "metadata": {
161 |         "id": "csjwifL3yA9H"
162 |       },
163 |       "source": [
164 |         "## Vector Quantizer Layer\n",
165 |         "\n",
166 |         "### Exercise 1: Setup vector quantizer layer\n",
167 |         "\n",
168 |         "This layer takes a tensor to be quantized. The channel dimension will be used as the space in which to quantize. All other dimensions will be flattened and will be seen as different examples to quantize.\n",
169 |         "\n",
170 |         "The output tensor will have the same shape as the input.\n",
171 |         "\n",
172 |         "As an example for a `BCHW` tensor of shape `[16, 64, 32, 32]`, we will first convert it to an `BHWC` tensor of shape `[16, 32, 32, 64]` and then reshape it into `[16384, 64]` and all `16384` vectors of size `64`  will be quantized independently. In otherwords, the channels are used as the space in which to quantize. All other dimensions will be flattened and be seen as different examples to quantize, `16384` in this case."
173 |       ]
174 |     },
175 |     {
176 |       "cell_type": "code",
177 |       "execution_count": null,
178 |       "metadata": {
179 |         "id": "EUCeSyr5yA9I"
180 |       },
181 |       "outputs": [],
182 |       "source": [
183 |         "class VectorQuantizer(nn.Module):\n",
184 |         "    def __init__(self, num_embeddings, embedding_dim, commitment_cost):\n",
185 |         "        super(VectorQuantizer, self).__init__()\n",
186 |         "\n",
187 |         "        self._embedding_dim = embedding_dim\n",
188 |         "        self._num_embeddings = num_embeddings\n",
189 |         "\n",
190 |         "        self._embedding = nn.Embedding(self._num_embeddings, self._embedding_dim)\n",
191 |         "        self._embedding.weight.data.uniform_(-1/self._num_embeddings, 1/self._num_embeddings)\n",
192 |         "        self._commitment_cost = commitment_cost\n",
193 |         "\n",
194 |         "    def forward(self, inputs):\n",
195 |         "        # convert inputs from BCHW -> BHWC\n",
196 |         "        # TODO:\n",
197 |         "\n",
198 |         "        # Flatten input\n",
199 |         "        # TODO:\n",
200 |         "\n",
201 |         "        # TODO: Calculate distances between flat input and embeddings.\n",
202 |         "\n",
203 |         "        # Encoding\n",
204 |         "        encoding_indices = torch.argmin(distances, dim=1).unsqueeze(1)\n",
205 |         "        encodings = torch.zeros(encoding_indices.shape[0], self._num_embeddings, device=inputs.device)\n",
206 |         "        encodings.scatter_(1, encoding_indices, 1)\n",
207 |         "\n",
208 |         "        # Quantize and unflatten\n",
209 |         "        quantized = torch.matmul(encodings, self._embedding.weight).view(input_shape)\n",
210 |         "\n",
211 |         "        # Loss\n",
212 |         "        e_latent_loss = F.mse_loss(quantized.detach(), inputs)\n",
213 |         "        q_latent_loss = F.mse_loss(quantized, inputs.detach())\n",
214 |         "        loss = q_latent_loss + self._commitment_cost * e_latent_loss\n",
215 |         "\n",
216 |         "        quantized = inputs + (quantized - inputs).detach()\n",
217 |         "        avg_probs = torch.mean(encodings, dim=0)\n",
218 |         "        perplexity = torch.exp(-torch.sum(avg_probs * torch.log(avg_probs + 1e-10)))\n",
219 |         "\n",
220 |         "        # convert quantized from BHWC -> BCHW\n",
221 |         "        return loss, quantized.permute(0, 3, 1, 2).contiguous(), perplexity, encodings"
222 |       ]
223 |     },
224 |     {
225 |       "cell_type": "markdown",
226 |       "metadata": {
227 |         "id": "hhdFx2_8yA9I"
228 |       },
229 |       "source": [
230 |         "We will also implement a slightly modified version  which will use exponential moving averages to update the embedding vectors instead of an auxillary loss. This has the advantage that the embedding updates are independent of the choice of optimizer for the encoder, decoder and other parts of the architecture. For most experiments the EMA version trains faster than the non-EMA version."
231 |       ]
232 |     },
233 |     {
234 |       "cell_type": "code",
235 |       "execution_count": null,
236 |       "metadata": {
237 |         "id": "1dgpj0GCyA9I"
238 |       },
239 |       "outputs": [],
240 |       "source": [
241 |         "class VectorQuantizerEMA(nn.Module):\n",
242 |         "    def __init__(self, num_embeddings, embedding_dim, commitment_cost, decay, epsilon=1e-5):\n",
243 |         "        super(VectorQuantizerEMA, self).__init__()\n",
244 |         "\n",
245 |         "        self._embedding_dim = embedding_dim\n",
246 |         "        self._num_embeddings = num_embeddings\n",
247 |         "\n",
248 |         "        self._embedding = nn.Embedding(self._num_embeddings, self._embedding_dim)\n",
249 |         "        self._embedding.weight.data.normal_()\n",
250 |         "        self._commitment_cost = commitment_cost\n",
251 |         "\n",
252 |         "        self.register_buffer('_ema_cluster_size', torch.zeros(num_embeddings))\n",
253 |         "        self._ema_w = nn.Parameter(torch.Tensor(num_embeddings, self._embedding_dim))\n",
254 |         "        self._ema_w.data.normal_()\n",
255 |         "\n",
256 |         "        self._decay = decay\n",
257 |         "        self._epsilon = epsilon\n",
258 |         "\n",
259 |         "    def forward(self, inputs):\n",
260 |         "        # convert inputs from BCHW -> BHWC\n",
261 |         "        # TODO:\n",
262 |         "\n",
263 |         "        # Flatten input\n",
264 |         "        # TODO:\n",
265 |         "\n",
266 |         "        # # TODO: Calculate distances between flat input and embeddings.\n",
267 |         "\n",
268 |         "        # Encoding\n",
269 |         "        encoding_indices = torch.argmin(distances, dim=1).unsqueeze(1)\n",
270 |         "        encodings = torch.zeros(encoding_indices.shape[0], self._num_embeddings, device=inputs.device)\n",
271 |         "        encodings.scatter_(1, encoding_indices, 1)\n",
272 |         "\n",
273 |         "        # Quantize and unflatten\n",
274 |         "        quantized = torch.matmul(encodings, self._embedding.weight).view(input_shape)\n",
275 |         "\n",
276 |         "        # Use EMA to update the embedding vectors\n",
277 |         "        if self.training:\n",
278 |         "            self._ema_cluster_size = self._ema_cluster_size * self._decay + \\\n",
279 |         "                                     (1 - self._decay) * torch.sum(encodings, 0)\n",
280 |         "\n",
281 |         "            # Laplace smoothing of the cluster size\n",
282 |         "            n = torch.sum(self._ema_cluster_size.data)\n",
283 |         "            self._ema_cluster_size = (\n",
284 |         "                (self._ema_cluster_size + self._epsilon)\n",
285 |         "                / (n + self._num_embeddings * self._epsilon) * n)\n",
286 |         "\n",
287 |         "            dw = torch.matmul(encodings.t(), flat_input)\n",
288 |         "            self._ema_w = nn.Parameter(self._ema_w * self._decay + (1 - self._decay) * dw)\n",
289 |         "\n",
290 |         "            self._embedding.weight = nn.Parameter(self._ema_w / self._ema_cluster_size.unsqueeze(1))\n",
291 |         "\n",
292 |         "        # Loss\n",
293 |         "        e_latent_loss = F.mse_loss(quantized.detach(), inputs)\n",
294 |         "        loss = self._commitment_cost * e_latent_loss\n",
295 |         "\n",
296 |         "        # Straight Through Estimator\n",
297 |         "        quantized = inputs + (quantized - inputs).detach()\n",
298 |         "        avg_probs = torch.mean(encodings, dim=0)\n",
299 |         "        perplexity = torch.exp(-torch.sum(avg_probs * torch.log(avg_probs + 1e-10)))\n",
300 |         "\n",
301 |         "        # convert quantized from BHWC -> BCHW\n",
302 |         "        return loss, quantized.permute(0, 3, 1, 2).contiguous(), perplexity, encodings"
303 |       ]
304 |     },
305 |     {
306 |       "cell_type": "markdown",
307 |       "metadata": {
308 |         "id": "1y4nhfq_yA9I"
309 |       },
310 |       "source": [
311 |         "## Encoder & Decoder Architecture\n",
312 |         "\n",
313 |         "### Exercise 2: Setup residual stack\n",
314 |         "\n",
315 |         "The encoder and decoder architecture is based on a ResNet and is implemented below:"
316 |       ]
317 |     },
318 |     {
319 |       "cell_type": "code",
320 |       "execution_count": null,
321 |       "metadata": {
322 |         "id": "y29OGaKEyA9I"
323 |       },
324 |       "outputs": [],
325 |       "source": [
326 |         "class Residual(nn.Module):\n",
327 |         "    def __init__(self, in_channels, num_hiddens, num_residual_hiddens):\n",
328 |         "        super(Residual, self).__init__()\n",
329 |         "        # TODO: ReLU, conv: in X num_residual_hidden, ReLU, conv: num_residual_hidden x num_hidden\n",
330 |         "        self._block =\n",
331 |         "\n",
332 |         "    def forward(self, x):\n",
333 |         "        # TODO:\n",
334 |         "\n",
335 |         "\n",
336 |         "class ResidualStack(nn.Module):\n",
337 |         "    def __init__(self, in_channels, num_hiddens, num_residual_layers, num_residual_hiddens):\n",
338 |         "        super(ResidualStack, self).__init__()\n",
339 |         "        self._num_residual_layers = num_residual_layers\n",
340 |         "        self._layers = nn.ModuleList([Residual(in_channels, num_hiddens, num_residual_hiddens)\n",
341 |         "                             for _ in range(self._num_residual_layers)])\n",
342 |         "\n",
343 |         "    def forward(self, x):\n",
344 |         "        for i in range(self._num_residual_layers):\n",
345 |         "            x = self._layers[i](x)\n",
346 |         "        return F.relu(x)"
347 |       ]
348 |     },
349 |     {
350 |       "cell_type": "markdown",
351 |       "source": [
352 |         "### Exercise 3: Setup encoder and decoder."
353 |       ],
354 |       "metadata": {
355 |         "id": "8nSn7LK80T8d"
356 |       }
357 |     },
358 |     {
359 |       "cell_type": "code",
360 |       "execution_count": null,
361 |       "metadata": {
362 |         "id": "qh8lG14QyA9I"
363 |       },
364 |       "outputs": [],
365 |       "source": [
366 |         "class Encoder(nn.Module):\n",
367 |         "    def __init__(self, in_channels, num_hiddens, num_residual_layers, num_residual_hiddens):\n",
368 |         "        super(Encoder, self).__init__()\n",
369 |         "\n",
370 |         "        self._conv_1 = nn.Conv2d(in_channels=in_channels,\n",
371 |         "                                 out_channels=num_hiddens//2,\n",
372 |         "                                 kernel_size=4,\n",
373 |         "                                 stride=2, padding=1)\n",
374 |         "        self._conv_2 = nn.Conv2d(in_channels=num_hiddens//2,\n",
375 |         "                                 out_channels=num_hiddens,\n",
376 |         "                                 kernel_size=4,\n",
377 |         "                                 stride=2, padding=1)\n",
378 |         "        self._conv_3 = nn.Conv2d(in_channels=num_hiddens,\n",
379 |         "                                 out_channels=num_hiddens,\n",
380 |         "                                 kernel_size=3,\n",
381 |         "                                 stride=1, padding=1)\n",
382 |         "        self._residual_stack = ResidualStack(in_channels=num_hiddens,\n",
383 |         "                                             num_hiddens=num_hiddens,\n",
384 |         "                                             num_residual_layers=num_residual_layers,\n",
385 |         "                                             num_residual_hiddens=num_residual_hiddens)\n",
386 |         "\n",
387 |         "    def forward(self, inputs):\n",
388 |         "      # TODO: conv_1, relu, conv_2, relu, conv_3, residual_stack"
389 |       ]
390 |     },
391 |     {
392 |       "cell_type": "code",
393 |       "execution_count": null,
394 |       "metadata": {
395 |         "id": "26bAuEgzyA9I"
396 |       },
397 |       "outputs": [],
398 |       "source": [
399 |         "class Decoder(nn.Module):\n",
400 |         "    def __init__(self, in_channels, num_hiddens, num_residual_layers, num_residual_hiddens):\n",
401 |         "        super(Decoder, self).__init__()\n",
402 |         "\n",
403 |         "        # TODO: conv (in_channels X num_hiddens), residual_stack,\n",
404 |         "        # trans-conv (num_hiddens X num_hiddens //2),\n",
405 |         "        # trans-conv (num_hiddens //2 X 3)\n",
406 |         "\n",
407 |         "    def forward(self, inputs):\n",
408 |         "        x = self._conv_1(inputs)\n",
409 |         "\n",
410 |         "        x = self._residual_stack(x)\n",
411 |         "\n",
412 |         "        x = self._conv_trans_1(x)\n",
413 |         "        x = F.relu(x)\n",
414 |         "\n",
415 |         "        return self._conv_trans_2(x)"
416 |       ]
417 |     },
418 |     {
419 |       "cell_type": "markdown",
420 |       "metadata": {
421 |         "id": "P_YK-5KNyA9J"
422 |       },
423 |       "source": [
424 |         "## Train\n",
425 |         "\n",
426 |         "We use the hyperparameters from the author's code:"
427 |       ]
428 |     },
429 |     {
430 |       "cell_type": "code",
431 |       "execution_count": null,
432 |       "metadata": {
433 |         "id": "HXc4VdzhyA9J"
434 |       },
435 |       "outputs": [],
436 |       "source": [
437 |         "batch_size = 256\n",
438 |         "num_training_updates = 15000\n",
439 |         "\n",
440 |         "num_hiddens = 128\n",
441 |         "num_residual_hiddens = 32\n",
442 |         "num_residual_layers = 2\n",
443 |         "\n",
444 |         "embedding_dim = 64\n",
445 |         "num_embeddings = 512\n",
446 |         "\n",
447 |         "commitment_cost = 0.25\n",
448 |         "\n",
449 |         "decay = 0.99\n",
450 |         "\n",
451 |         "learning_rate = 1e-3"
452 |       ]
453 |     },
454 |     {
455 |       "cell_type": "code",
456 |       "execution_count": null,
457 |       "metadata": {
458 |         "id": "5_b2f3TFyA9J"
459 |       },
460 |       "outputs": [],
461 |       "source": [
462 |         "training_loader = DataLoader(training_data,\n",
463 |         "                             batch_size=batch_size,\n",
464 |         "                             shuffle=True,\n",
465 |         "                             pin_memory=True)"
466 |       ]
467 |     },
468 |     {
469 |       "cell_type": "code",
470 |       "execution_count": null,
471 |       "metadata": {
472 |         "id": "9E1thqg1yA9J"
473 |       },
474 |       "outputs": [],
475 |       "source": [
476 |         "validation_loader = DataLoader(validation_data,\n",
477 |         "                               batch_size=32,\n",
478 |         "                               shuffle=True,\n",
479 |         "                               pin_memory=True)"
480 |       ]
481 |     },
482 |     {
483 |       "cell_type": "markdown",
484 |       "source": [
485 |         "### Exercise 4: Wire everything together in the model."
486 |       ],
487 |       "metadata": {
488 |         "id": "dKnpSwgy0YUT"
489 |       }
490 |     },
491 |     {
492 |       "cell_type": "code",
493 |       "execution_count": null,
494 |       "metadata": {
495 |         "id": "naqec50MyA9J"
496 |       },
497 |       "outputs": [],
498 |       "source": [
499 |         "class Model(nn.Module):\n",
500 |         "    def __init__(self, num_hiddens, num_residual_layers, num_residual_hiddens,\n",
501 |         "                 num_embeddings, embedding_dim, commitment_cost, decay=0):\n",
502 |         "        super(Model, self).__init__()\n",
503 |         "\n",
504 |         "        # TODO: encoder, pre_vq_conv, vector quantizer, decoder.\n",
505 |         "\n",
506 |         "    def forward(self, x):\n",
507 |         "        # TODO: forward pass"
508 |       ]
509 |     },
510 |     {
511 |       "cell_type": "code",
512 |       "execution_count": null,
513 |       "metadata": {
514 |         "id": "jNws_Sk2yA9J"
515 |       },
516 |       "outputs": [],
517 |       "source": [
518 |         "model = Model(num_hiddens, num_residual_layers, num_residual_hiddens,\n",
519 |         "              num_embeddings, embedding_dim,\n",
520 |         "              commitment_cost, decay).to(device)"
521 |       ]
522 |     },
523 |     {
524 |       "cell_type": "code",
525 |       "execution_count": null,
526 |       "metadata": {
527 |         "id": "QpUSbhN1yA9J"
528 |       },
529 |       "outputs": [],
530 |       "source": [
531 |         "optimizer = optim.Adam(model.parameters(), lr=learning_rate, amsgrad=False)"
532 |       ]
533 |     },
534 |     {
535 |       "cell_type": "code",
536 |       "execution_count": null,
537 |       "metadata": {
538 |         "id": "Qso7cnE7yA9J"
539 |       },
540 |       "outputs": [],
541 |       "source": [
542 |         "model.train()\n",
543 |         "train_res_recon_error = []\n",
544 |         "train_res_perplexity = []\n",
545 |         "\n",
546 |         "for i in xrange(num_training_updates):\n",
547 |         "    (data, _) = next(iter(training_loader))\n",
548 |         "    data = data.to(device)\n",
549 |         "    optimizer.zero_grad()\n",
550 |         "\n",
551 |         "    vq_loss, data_recon, perplexity = model(data)\n",
552 |         "    recon_error = F.mse_loss(data_recon, data) / data_variance\n",
553 |         "    loss = recon_error + vq_loss\n",
554 |         "    loss.backward()\n",
555 |         "\n",
556 |         "    optimizer.step()\n",
557 |         "\n",
558 |         "    train_res_recon_error.append(recon_error.item())\n",
559 |         "    train_res_perplexity.append(perplexity.item())\n",
560 |         "\n",
561 |         "    if (i+1) % 100 == 0:\n",
562 |         "        print('%d iterations' % (i+1))\n",
563 |         "        print('recon_error: %.3f' % np.mean(train_res_recon_error[-100:]))\n",
564 |         "        print('perplexity: %.3f' % np.mean(train_res_perplexity[-100:]))\n",
565 |         "        print()"
566 |       ]
567 |     },
568 |     {
569 |       "cell_type": "markdown",
570 |       "metadata": {
571 |         "id": "pSDhXKYVyA9J"
572 |       },
573 |       "source": [
574 |         "## Plot Loss"
575 |       ]
576 |     },
577 |     {
578 |       "cell_type": "code",
579 |       "execution_count": null,
580 |       "metadata": {
581 |         "id": "57rGquZzyA9J"
582 |       },
583 |       "outputs": [],
584 |       "source": [
585 |         "train_res_recon_error_smooth = savgol_filter(train_res_recon_error, 201, 7)\n",
586 |         "train_res_perplexity_smooth = savgol_filter(train_res_perplexity, 201, 7)"
587 |       ]
588 |     },
589 |     {
590 |       "cell_type": "code",
591 |       "execution_count": null,
592 |       "metadata": {
593 |         "id": "3SPBT3rVyA9J"
594 |       },
595 |       "outputs": [],
596 |       "source": [
597 |         "f = plt.figure(figsize=(16,8))\n",
598 |         "ax = f.add_subplot(1,2,1)\n",
599 |         "ax.plot(train_res_recon_error_smooth)\n",
600 |         "ax.set_yscale('log')\n",
601 |         "ax.set_title('Smoothed NMSE.')\n",
602 |         "ax.set_xlabel('iteration')\n",
603 |         "\n",
604 |         "ax = f.add_subplot(1,2,2)\n",
605 |         "ax.plot(train_res_perplexity_smooth)\n",
606 |         "ax.set_title('Smoothed Average codebook usage (perplexity).')\n",
607 |         "ax.set_xlabel('iteration')"
608 |       ]
609 |     },
610 |     {
611 |       "cell_type": "markdown",
612 |       "metadata": {
613 |         "id": "lGUtgWxxyA9J"
614 |       },
615 |       "source": [
616 |         "## View Reconstructions"
617 |       ]
618 |     },
619 |     {
620 |       "cell_type": "code",
621 |       "execution_count": null,
622 |       "metadata": {
623 |         "id": "3BhecME9yA9J"
624 |       },
625 |       "outputs": [],
626 |       "source": [
627 |         "model.eval()\n",
628 |         "\n",
629 |         "(valid_originals, _) = next(iter(validation_loader))\n",
630 |         "valid_originals = valid_originals.to(device)\n",
631 |         "\n",
632 |         "vq_output_eval = model._pre_vq_conv(model._encoder(valid_originals))\n",
633 |         "_, valid_quantize, _, _ = model._vq_vae(vq_output_eval)\n",
634 |         "valid_reconstructions = model._decoder(valid_quantize)"
635 |       ]
636 |     },
637 |     {
638 |       "cell_type": "code",
639 |       "execution_count": null,
640 |       "metadata": {
641 |         "id": "b3mSuFZzyA9J"
642 |       },
643 |       "outputs": [],
644 |       "source": [
645 |         "(train_originals, _) = next(iter(training_loader))\n",
646 |         "train_originals = train_originals.to(device)\n",
647 |         "_, train_reconstructions, _, _ = model._vq_vae(train_originals)"
648 |       ]
649 |     },
650 |     {
651 |       "cell_type": "code",
652 |       "execution_count": null,
653 |       "metadata": {
654 |         "id": "uYPxmXBdyA9J"
655 |       },
656 |       "outputs": [],
657 |       "source": [
658 |         "def show(img):\n",
659 |         "    npimg = img.numpy()\n",
660 |         "    fig = plt.imshow(np.transpose(npimg, (1,2,0)), interpolation='nearest')\n",
661 |         "    fig.axes.get_xaxis().set_visible(False)\n",
662 |         "    fig.axes.get_yaxis().set_visible(False)"
663 |       ]
664 |     },
665 |     {
666 |       "cell_type": "code",
667 |       "execution_count": null,
668 |       "metadata": {
669 |         "id": "fo7nn-ZGyA9K"
670 |       },
671 |       "outputs": [],
672 |       "source": [
673 |         "show(make_grid(valid_reconstructions.cpu().data)+0.5, )"
674 |       ]
675 |     },
676 |     {
677 |       "cell_type": "code",
678 |       "execution_count": null,
679 |       "metadata": {
680 |         "id": "m2ey6jhMyA9K"
681 |       },
682 |       "outputs": [],
683 |       "source": [
684 |         "show(make_grid(valid_originals.cpu()+0.5))"
685 |       ]
686 |     },
687 |     {
688 |       "cell_type": "markdown",
689 |       "metadata": {
690 |         "id": "KtEvf-7nyA9K"
691 |       },
692 |       "source": [
693 |         "## View Embedding"
694 |       ]
695 |     },
696 |     {
697 |       "cell_type": "code",
698 |       "execution_count": null,
699 |       "metadata": {
700 |         "id": "9KSzIjXtyA9K"
701 |       },
702 |       "outputs": [],
703 |       "source": [
704 |         "proj = umap.UMAP(n_neighbors=3,\n",
705 |         "                 min_dist=0.1,\n",
706 |         "                 metric='cosine').fit_transform(model._vq_vae._embedding.weight.data.cpu())"
707 |       ]
708 |     },
709 |     {
710 |       "cell_type": "code",
711 |       "execution_count": null,
712 |       "metadata": {
713 |         "id": "Sxpkx22IyA9K"
714 |       },
715 |       "outputs": [],
716 |       "source": [
717 |         "plt.scatter(proj[:,0], proj[:,1], alpha=0.3)"
718 |       ]
719 |     }
720 |   ],
721 |   "metadata": {
722 |     "kernelspec": {
723 |       "display_name": "Python 3",
724 |       "name": "python3"
725 |     },
726 |     "language_info": {
727 |       "codemirror_mode": {
728 |         "name": "ipython",
729 |         "version": 3
730 |       },
731 |       "file_extension": ".py",
732 |       "mimetype": "text/x-python",
733 |       "name": "python",
734 |       "nbconvert_exporter": "python",
735 |       "pygments_lexer": "ipython3",
736 |       "version": "3.6.7"
737 |     },
738 |     "colab": {
739 |       "provenance": [],
740 |       "gpuType": "T4"
741 |     },
742 |     "accelerator": "GPU"
743 |   },
744 |   "nbformat": 4,
745 |   "nbformat_minor": 0
746 | }


--------------------------------------------------------------------------------
/2_cv/Third_exercise/requirements.txt:
--------------------------------------------------------------------------------
1 | torch
2 | torchvision
3 | matplotlib
4 | numpy
5 | scipy
6 | six
7 | umap-learn


--------------------------------------------------------------------------------
/3_diffusion/README.md:
--------------------------------------------------------------------------------
 1 | # [[M2L2024](https://www.m2lschool.org/home)] Tutorial 3: Diffusion Models
 2 | 
 3 | **Authors:** 
 4 | [Virginia Aglietti](https://www.linkedin.com/in/virginia-aglietti-a80321a4/?originalSubdomain=uk) and [Ira Ktena](https://www.linkedin.com/in/ira-ktena-phd-12b04b58)
 5 | 
 6 | --- 
 7 | 
 8 | In this tutorial, we will have a single running colab covering three practicals. Please start with the beginner section of each practical, and then move on to the intermediate section.
 9 | 
10 | ## Practical 1: Forward Diffusion Process - Introducing Noise Schedulers
11 | 
12 | In this practical, you will implement the forward diffusion process where you will design noise schedulers to corrupt the original image to map the image to the noise space, in a tractable manner.
13 | 
14 | ### Outline
15 | - [Beginner] Implement a linear noise schedule. This introduces the process of adding noise to images
16 | - [Intermediate] Replace the linear schedule with a cosine noise schedule
17 | - [Advanced] Implement another noise schedule from the literature
18 | 
19 | ## Practical 2: Backward Diffusion Process - Defining a model architecture
20 | 
21 | In this practical, you will implement the backward diffusion process where you will design a neural network architecture that will map the noise back to the original image.
22 | 
23 | ### Outline
24 | - [Beginner] Implement a UNet that does not take into account time embeddings
25 | - [Intermediate] Implement a UNet NN with time embeddings
26 | 
27 | ## Practical 3: Inference
28 | In this practical, you will implement algorithms to generate images using a trained diffusion model.
29 | 
30 | ### Outline
31 | - [Beginner] Implement the backward inference process 
32 | - [Intermediate] Improve the backward inference process  using steering
33 | 
34 | 
35 | ### Notebooks
36 | 
37 | Tutorial: [![Open In 
38 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/3_diffusion/diffusion.ipynb)
39 | 
40 | Solution: [![Open In 
41 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/3_diffusion/diffusion_solved.ipynb)
42 | 
43 | ---
44 | 


--------------------------------------------------------------------------------
/4_rl/README.md:
--------------------------------------------------------------------------------
 1 | # [[M2L2024](https://www.m2lschool.org/home)] Tutorial 4: Reinforcement Learning
 2 | 
 3 | **Authors:** [Diego Calanzone](https://halixness.github.io), [Gianmarco Tedeschi](#)
 4 | 
 5 | --- 
 6 | 
 7 | Welcome to the Reinforcement Learning (RL) Tutorial at the Mediterranean Machine Learning (M2L) Summer School 2024!
 8 | In this practicum we introduce fundamental concepts from RL literature with exercises on toy games provided by the Gym library. The first two notebooks do not require extensive knowledge beyond the basics of machine learning and neural networks; however, knowledge on modern Natural Language Processing (transformers, large language models) could be very helpful in order to complete part 3.
 9 | 
10 | You will find theoretical explanations in each chapter that will give you enough understanding of the exercises. However, you are invited to take some time to dive in the provided learning resources if you are interested in the topic. With this tutorial we aim at introducing you to three stages in reinforcement learning research: tabular methods, policy gradient methods and finally, the intersection between Deep RL and Large Language Models to tackle complex tasks.   
11 | 
12 | ### Outline
13 | 
14 | * Part I: Introduction to Reinforcement Learning
15 |   *  Environments
16 |   *  Agent-Environment Loop
17 |   *  Value-based Reinforcement Learning (Q-Learning, Greedy Agent, Eps-Greedy Agent)
18 | 
19 | * Part II: Policy Gradient Methods and Deep RL
20 |   *  Policy definition
21 |   *  Policy gradient algorithms (Softmax NN, REINFORCE, GPOMDP)
22 | 
23 | * Part III: Reinforcement Learning with (Any-) Human Feedback
24 |   *  Pre-trained Language Models
25 |   *  Using Reward Models
26 |   *  LM fine-tuning with Proximal Policy Optimization (PPO)
27 |   *  Drawbacks and Introduction to Preference Optimization methods
28 | 
29 | ### Notebooks
30 | 
31 | #### Part I:
32 | Tutorial: [![Open In 
33 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/4_rl/Part%2001/M2L24_RL01%20_Intro_to_RL_exercise.ipynb)
34 | 
35 | Solution: [![Open In 
36 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/4_rl/Part%2001/M2L24_RL01%20_Intro_to_RL_solution.ipynb)
37 | 
38 | 
39 | #### Part II:
40 | Tutorial: [![Open In 
41 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/4_rl/Part%2002/M2L24_RL02%20_Policy_gradient_methods_exercise.ipynb)
42 | 
43 | Solution: [![Open In 
44 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/4_rl/Part%2002/M2L24_RL02%20_Policy_gradient_methods_solution.ipynb)
45 | 
46 | #### Part III:
47 | Tutorial: [![Open In 
48 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/4_rl/Part%2003/M2L24_RL03_RLHF_exercise.ipynb)
49 | 
50 | Solution: [![Open In 
51 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/4_rl/Part%2003/M2L24_RL03_RLHF_solution.ipynb)
52 | 


--------------------------------------------------------------------------------
/5_gnn/README.md:
--------------------------------------------------------------------------------
 1 | # [[M2L2024](https://www.m2lschool.org/home)] Tutorial 5: Graph Neural Networks
 2 | 
 3 | **Authors:** Masha Samsikova and Wilfried Bounsi
 4 | 
 5 | --- 
 6 | 
 7 | Welcome to the final tutorial of the summer school! Congratulations on getting this far :-)
 8 | 
 9 | Today we will look at Graph Neural Networks (GNNs)! GNNs are a powerful and widespread class of Neural Network models specifically designed to operate over graph structures. In Part I, we will look at how to build the most common types of GNNs. Next, in Part II, we will look at an interesting application of GNNs -- using them to estimate missing data in a dataset. Finally, in Part III, we will look at more advanced topics such as how to scale GNNs to massive graphs and on a mathematical phenomenon known as over-squashing -- information being "squashed" due to exponential amounts of information being observed by a single node. 
10 | 
11 | ### Outline
12 | 
13 | * Part I: Introduction to Graph Neural Nets
14 |     * Graph Convolution Network (GCN) Implementation.
15 |     * Graph Attention Network (GAT) Implementation.
16 |     * Efficient GAT and GCN implementations with Sparse Matrix operations.
17 |     * Node classification on the Cora dataset: training, evaluation & analysis.
18 | * Part II: Missing Data Estimation
19 |     *  Use GNNs to estimate missing training data
20 | * Part III: Advanced Topics
21 |     * Scaling GNNs to massive graphs
22 |     * Over-squashing
23 | 
24 | ### Notebooks
25 | 
26 | #### Part I:
27 | Tutorial: [![Open In 
28 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/5_gnn/part_I/introduction_to_gnns.ipynb)
29 | 
30 | Solution: [![Open In 
31 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/5_gnn/part_I/introduction_to_gnns_solved.ipynb)
32 | 
33 | #### Part II:
34 | Tutorial: [![Open In 
35 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/5_gnn/part_II/Missing_data_estimation.ipynb)
36 | 
37 | Solved: [![Open In 
38 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/5_gnn/part_II/Missing_data_estimation_solved.ipynb)
39 | 
40 | #### Part III: 
41 | Tutorial: [![Open In 
42 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/5_gnn/part_III/gnns_advanced_topics.ipynb)
43 | 
44 | Solved: [![Open In 
45 | Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.sandbox.google.com/github/M2Lschool/tutorials2024/blob/main/5_gnn/part_III/gnns_advanced_topics_solved.ipynb)
46 | 
47 | ---
48 | 


--------------------------------------------------------------------------------
/5_gnn/part_II/Missing_data_estimation.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "cells": [
  3 |     {
  4 |       "cell_type": "markdown",
  5 |       "metadata": {
  6 |         "id": "OHa51EeXfL9E"
  7 |       },
  8 |       "source": [
  9 |         "In this part of our tutorial, we will explore how GNNs can be used to estimate missing tabular data on the [extended Iris dataset](https://www.kaggle.com/datasets/samybaladram/iris-dataset-extended/data). The approach we will be implementing is called GRAPE and can be found [this paper](https://proceedings.neurips.cc/paper/2020/file/dc36f18a9a0a776671d4879cae69b551-Paper.pdf)."
 10 |       ]
 11 |     },
 12 |     {
 13 |       "cell_type": "code",
 14 |       "execution_count": null,
 15 |       "metadata": {
 16 |         "id": "g8Ot6qEL8hEV",
 17 |         "cellView": "form"
 18 |       },
 19 |       "outputs": [],
 20 |       "source": [
 21 |         "# @title Required setup\n",
 22 |         "!pip3 install torch_geometric\n",
 23 |         "\n",
 24 |         "# Downloads and unpacks the dataset\n",
 25 |         "!kaggle datasets download -d samybaladram/iris-dataset-extended\n",
 26 |         "!unzip iris-dataset-extended.zip\n",
 27 |         "\n",
 28 |         "import matplotlib.pyplot as plt\n",
 29 |         "import numpy as np\n",
 30 |         "import pandas as pd\n",
 31 |         "from sklearn import model_selection\n",
 32 |         "from sklearn import preprocessing\n",
 33 |         "import torch\n",
 34 |         "import torch.nn as nn\n",
 35 |         "import torch.nn.functional as F\n",
 36 |         "import torch_geometric\n",
 37 |         "from torch_geometric.data import Data\n",
 38 |         "import torch_geometric.datasets as datasets\n",
 39 |         "\n",
 40 |         "\n",
 41 |         "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')"
 42 |       ]
 43 |     },
 44 |     {
 45 |       "cell_type": "markdown",
 46 |       "metadata": {
 47 |         "id": "y4DTf1gagbWk"
 48 |       },
 49 |       "source": [
 50 |         "First, let's look at the data."
 51 |       ]
 52 |     },
 53 |     {
 54 |       "cell_type": "code",
 55 |       "execution_count": null,
 56 |       "metadata": {
 57 |         "id": "4zW6oDiJgF53"
 58 |       },
 59 |       "outputs": [],
 60 |       "source": [
 61 |         "iris_df = pd.read_csv('iris_extended.csv')\n",
 62 |         "iris_df.sample(5)"
 63 |       ]
 64 |     },
 65 |     {
 66 |       "cell_type": "markdown",
 67 |       "metadata": {
 68 |         "id": "iFcEjPTEknRf"
 69 |       },
 70 |       "source": [
 71 |         "We have a bunch of categorical variables, let's encode them using one-hot encoding."
 72 |       ]
 73 |     },
 74 |     {
 75 |       "cell_type": "code",
 76 |       "execution_count": null,
 77 |       "metadata": {
 78 |         "id": "yybdzUshkZW8"
 79 |       },
 80 |       "outputs": [],
 81 |       "source": [
 82 |         "iris_df = pd.get_dummies(iris_df, drop_first=True)\n",
 83 |         "iris_df.sample(5)"
 84 |       ]
 85 |     },
 86 |     {
 87 |       "cell_type": "code",
 88 |       "execution_count": null,
 89 |       "metadata": {
 90 |         "cellView": "form",
 91 |         "id": "cG7hNrtxipIk"
 92 |       },
 93 |       "outputs": [],
 94 |       "source": [
 95 |         "# @title Consider normalizing the dataset\n",
 96 |         "\n",
 97 |         "NORMALIZE = False  # @param {'type': 'boolean'}\n",
 98 |         "\n",
 99 |         "iris_df = iris_df.to_numpy()\n",
100 |         "if NORMALIZE:\n",
101 |         "  iris_df = preprocessing.MinMaxScaler().fit_transform(iris_df)"
102 |       ]
103 |     },
104 |     {
105 |       "cell_type": "markdown",
106 |       "metadata": {
107 |         "id": "qU_r29O2lm52"
108 |       },
109 |       "source": [
110 |         "Now let's encode the data into a graph according to the following rules:\n",
111 |         "Nodes - dataset entries and features, edges - feature values."
112 |       ]
113 |     },
114 |     {
115 |       "cell_type": "code",
116 |       "execution_count": null,
117 |       "metadata": {
118 |         "id": "p0dMlNbGlkT_"
119 |       },
120 |       "outputs": [],
121 |       "source": [
122 |         "def encode_data(data: pd.DataFrame, train_mask: np.ndarray) -> np.ndarray:\n",
123 |         "  \"\"\"Encodes tabular data into a graph.\"\"\"\n",
124 |         "  # Number of dataset entries.\n",
125 |         "  num_entries = data.shape[0]\n",
126 |         "\n",
127 |         "  # Number of features in the dataset.\n",
128 |         "  num_features = data.shape[1]\n",
129 |         "\n",
130 |         "  # Compute the number of edges in the graph.\n",
131 |         "  num_edges = None # <----- Your code here ----->\n",
132 |         "\n",
133 |         "  # Creates train and test indices according to the `train_mask`.\n",
134 |         "  train_indices = np.arange(num_edges)[train_mask]\n",
135 |         "  test_indices = np.arange(num_edges)[~train_mask]\n",
136 |         "\n",
137 |         "  # Finds the index of the first feature node.\n",
138 |         "  # First `num_entries` nodes correspons to the entries of the dataset.\n",
139 |         "  least_feature_node_id = num_entries\n",
140 |         "\n",
141 |         "  # Use one-hot encoding to specify the type of each node (column type or\n",
142 |         "  # dataset entry).\n",
143 |         "  # <----- Your code here ----->\n",
144 |         "\n",
145 |         "  # Defines graph connectivity and has the final shape of [2, `num_edges`].\n",
146 |         "  edge_index = []\n",
147 |         "  # Edge feature matrix with shape [`num_edges`, `num_features`].\n",
148 |         "  edge_attr = []\n",
149 |         "  # Retrieve edge indices (indices of nodes that are connected by that edge).\n",
150 |         "  # Assume a directed graph, where all nodes start in an entry node and end in\n",
151 |         "  # a feature node.\n",
152 |         "  # Don't forget to collect edge attributes too!\n",
153 |         "  # <----- Your code here ----->\n",
154 |         "\n",
155 |         "\n",
156 |         "  # Splits edges and attributes into train and tests subsets.\n",
157 |         "  edge_index_train = edge_index[:, train_indices]\n",
158 |         "  edge_index_test = edge_index[:, test_indices]\n",
159 |         "  edge_attr_train = edge_attr[train_indices]\n",
160 |         "  edge_attr_test = edge_attr[test_indices]\n",
161 |         "  return Data(\n",
162 |         "      x=nodes_features,\n",
163 |         "      edge_index_train=edge_index_train,\n",
164 |         "      edge_index_test=edge_index_test,\n",
165 |         "      edge_attr_train=edge_attr_train,\n",
166 |         "      edge_attr_test=edge_attr_test,\n",
167 |         "  )"
168 |       ]
169 |     },
170 |     {
171 |       "cell_type": "code",
172 |       "execution_count": null,
173 |       "metadata": {
174 |         "id": "VxXBjrh_kUns"
175 |       },
176 |       "outputs": [],
177 |       "source": [
178 |         "TRAIN_RATIO = 0.7  # @param {'type': 'number'}\n",
179 |         "train_mask = (\n",
180 |         "    np.random.RandomState(0)\n",
181 |         "    .binomial(1, TRAIN_RATIO, iris_df.shape[0] * iris_df.shape[1])\n",
182 |         "    .astype(bool)\n",
183 |         ")"
184 |       ]
185 |     },
186 |     {
187 |       "cell_type": "code",
188 |       "execution_count": null,
189 |       "metadata": {
190 |         "id": "muqluvpsU6_d"
191 |       },
192 |       "outputs": [],
193 |       "source": [
194 |         "# @title Let's visualize the resulting train/test split\n",
195 |         "\n",
196 |         "plt.imshow(train_mask.reshape(iris_df.shape[0], iris_df.shape[1])[:40])\n",
197 |         "plt.colorbar()\n",
198 |         "plt.show()"
199 |       ]
200 |     },
201 |     {
202 |       "cell_type": "code",
203 |       "execution_count": null,
204 |       "metadata": {
205 |         "id": "j9L1pF673Rra"
206 |       },
207 |       "outputs": [],
208 |       "source": [
209 |         "class Net(torch.nn.Module):\n",
210 |         "\n",
211 |         "  def __init__(\n",
212 |         "      self,\n",
213 |         "      *,\n",
214 |         "      node_input_dim: int,\n",
215 |         "      edge_input_dim: int,\n",
216 |         "      node_hidden_dim: int,\n",
217 |         "      edge_hidden_dim: int,\n",
218 |         "  ):\n",
219 |         "    super().__init__()\n",
220 |         "\n",
221 |         "    self.node_conv = torch_geometric.nn.SAGEConv(\n",
222 |         "        node_input_dim, node_hidden_dim\n",
223 |         "    )\n",
224 |         "    self.edge_update_mlps = nn.Sequential(\n",
225 |         "        nn.Linear(2 * node_hidden_dim + edge_input_dim, edge_hidden_dim),\n",
226 |         "        torch.nn.ReLU(),\n",
227 |         "        nn.Linear(edge_hidden_dim, edge_input_dim),\n",
228 |         "        torch.nn.ReLU(),\n",
229 |         "    )\n",
230 |         "\n",
231 |         "  def forward(\n",
232 |         "      self, x: torch.Tensor, edge_attr: torch.Tensor, edge_index: torch.Tensor\n",
233 |         "  ):\n",
234 |         "    # Apply node convolution.\n",
235 |         "    # <----- Your code here ----->\n",
236 |         "\n",
237 |         "    # Apply MLP on the edges.\n",
238 |         "    edge_attr = None # <----- Your code here ----->\n",
239 |         "    return x, edge_attr"
240 |       ]
241 |     },
242 |     {
243 |       "cell_type": "code",
244 |       "execution_count": null,
245 |       "metadata": {
246 |         "id": "crw9gH_K4Gy9"
247 |       },
248 |       "outputs": [],
249 |       "source": [
250 |         "NUM_EPOCHS = 200  # @param {'type': 'number'}\n",
251 |         "LEARNING_RATE = 0.001  # @param {'type': 'number'}\n",
252 |         "WEIGHT_DECAY = 0.0000001  # @param {'type': 'number'}\n",
253 |         "\n",
254 |         "\n",
255 |         "def train(gnn: torch.nn.Module, graph: Data) -> tuple[list[float], list[float]]:\n",
256 |         "  train_loss, val_loss = [], []\n",
257 |         "  # Puts all of the tensors to the device in use.\n",
258 |         "  x = torch.from_numpy(graph.x).to(device)\n",
259 |         "  edge_attr = torch.from_numpy(graph.edge_attr_train).to(device)\n",
260 |         "  edge_attr_test = torch.from_numpy(graph.edge_attr_test).to(device)\n",
261 |         "  edge_index_train = torch.from_numpy(graph.edge_index_train).to(device)\n",
262 |         "  edge_index_test = torch.from_numpy(graph.edge_index_test).to(device)\n",
263 |         "\n",
264 |         "  optimizer = torch.optim.Adam(\n",
265 |         "      gnn.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY\n",
266 |         "  )\n",
267 |         "  for epoch in range(NUM_EPOCHS):\n",
268 |         "    gnn.train()\n",
269 |         "    optimizer.zero_grad()\n",
270 |         "    out, out_edge = gnn(x, edge_attr, edge_index_train)\n",
271 |         "    loss = F.mse_loss(edge_attr, out_edge)\n",
272 |         "    loss.backward()\n",
273 |         "    optimizer.step()\n",
274 |         "    out.detach().to('cpu')\n",
275 |         "    out_edge.detach().to('cpu')\n",
276 |         "    del out\n",
277 |         "    del out_edge\n",
278 |         "    train_loss.append(loss.item())\n",
279 |         "    with torch.no_grad():\n",
280 |         "      out, out_edge_test = gnn(x, edge_attr_test, edge_index_test)\n",
281 |         "      loss = F.mse_loss(edge_attr_test, out_edge_test)\n",
282 |         "      out.detach().to('cpu')\n",
283 |         "      out_edge_test.detach().to('cpu')\n",
284 |         "      val_loss.append(loss.item())\n",
285 |         "  return train_loss, val_loss"
286 |       ]
287 |     },
288 |     {
289 |       "cell_type": "code",
290 |       "execution_count": null,
291 |       "metadata": {
292 |         "id": "0dUHOXsg1xlI"
293 |       },
294 |       "outputs": [],
295 |       "source": [
296 |         "graph = encode_data(iris_df, train_mask)"
297 |       ]
298 |     },
299 |     {
300 |       "cell_type": "code",
301 |       "execution_count": null,
302 |       "metadata": {
303 |         "id": "JSrA9x8WQT5W"
304 |       },
305 |       "outputs": [],
306 |       "source": [
307 |         "# @title Instantiates a GNN\n",
308 |         "\n",
309 |         "NODE_HIDDEN_DIM = 128  # @param {'type': 'number'}\n",
310 |         "EDGE_HIDDEN_DIM = 128  # @param {'type': 'number'}\n",
311 |         "\n",
312 |         "\n",
313 |         "gnn = Net(\n",
314 |         "    node_input_dim=graph.x.shape[1],\n",
315 |         "    edge_input_dim=graph.edge_attr_train.shape[1],\n",
316 |         "    node_hidden_dim=NODE_HIDDEN_DIM,\n",
317 |         "    edge_hidden_dim=EDGE_HIDDEN_DIM,\n",
318 |         ")\n",
319 |         "gnn = gnn.to(device)"
320 |       ]
321 |     },
322 |     {
323 |       "cell_type": "code",
324 |       "execution_count": null,
325 |       "metadata": {
326 |         "id": "SzgtpPMfQAJ2"
327 |       },
328 |       "outputs": [],
329 |       "source": [
330 |         "train_loss, val_loss = train(gnn, graph)\n",
331 |         "\n",
332 |         "plt.plot((x := list(range(NUM_EPOCHS))), train_loss, label='train loss')\n",
333 |         "plt.plot(x, val_loss, label='val loss')\n",
334 |         "plt.legend()\n",
335 |         "plt.title('Training progress')\n",
336 |         "plt.xlabel('Epoch')\n",
337 |         "plt.ylabel('Loss value')\n",
338 |         "plt.show()"
339 |       ]
340 |     }
341 |   ],
342 |   "metadata": {
343 |     "colab": {
344 |       "private_outputs": true,
345 |       "provenance": []
346 |     },
347 |     "kernelspec": {
348 |       "display_name": "Python 3",
349 |       "name": "python3"
350 |     },
351 |     "language_info": {
352 |       "name": "python"
353 |     }
354 |   },
355 |   "nbformat": 4,
356 |   "nbformat_minor": 0
357 | }


--------------------------------------------------------------------------------
/5_gnn/part_II/Missing_data_estimation_solved.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "cells": [
  3 |     {
  4 |       "cell_type": "markdown",
  5 |       "metadata": {
  6 |         "id": "OHa51EeXfL9E"
  7 |       },
  8 |       "source": [
  9 |         "In this part of our tutorial, we will explore how GNNs can be used to estimate missing tabular data on the [extended Iris dataset](https://www.kaggle.com/datasets/samybaladram/iris-dataset-extended/data). The approach we will be implementing is called GRAPE and can be found [this paper](https://proceedings.neurips.cc/paper/2020/file/dc36f18a9a0a776671d4879cae69b551-Paper.pdf)."
 10 |       ]
 11 |     },
 12 |     {
 13 |       "cell_type": "code",
 14 |       "execution_count": null,
 15 |       "metadata": {
 16 |         "id": "g8Ot6qEL8hEV",
 17 |         "cellView": "form"
 18 |       },
 19 |       "outputs": [],
 20 |       "source": [
 21 |         "# @title Required setup\n",
 22 |         "!pip3 install torch_geometric\n",
 23 |         "\n",
 24 |         "# Downloads and unpacks the dataset\n",
 25 |         "!kaggle datasets download -d samybaladram/iris-dataset-extended\n",
 26 |         "!unzip iris-dataset-extended.zip\n",
 27 |         "\n",
 28 |         "import matplotlib.pyplot as plt\n",
 29 |         "import numpy as np\n",
 30 |         "import pandas as pd\n",
 31 |         "from sklearn import model_selection\n",
 32 |         "from sklearn import preprocessing\n",
 33 |         "import torch\n",
 34 |         "import torch.nn as nn\n",
 35 |         "import torch.nn.functional as F\n",
 36 |         "import torch_geometric\n",
 37 |         "from torch_geometric.data import Data\n",
 38 |         "import torch_geometric.datasets as datasets\n",
 39 |         "\n",
 40 |         "\n",
 41 |         "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')"
 42 |       ]
 43 |     },
 44 |     {
 45 |       "cell_type": "markdown",
 46 |       "metadata": {
 47 |         "id": "y4DTf1gagbWk"
 48 |       },
 49 |       "source": [
 50 |         "First, let's look at the data."
 51 |       ]
 52 |     },
 53 |     {
 54 |       "cell_type": "code",
 55 |       "execution_count": null,
 56 |       "metadata": {
 57 |         "id": "4zW6oDiJgF53"
 58 |       },
 59 |       "outputs": [],
 60 |       "source": [
 61 |         "iris_df = pd.read_csv('iris_extended.csv')\n",
 62 |         "iris_df.sample(5)"
 63 |       ]
 64 |     },
 65 |     {
 66 |       "cell_type": "markdown",
 67 |       "metadata": {
 68 |         "id": "iFcEjPTEknRf"
 69 |       },
 70 |       "source": [
 71 |         "We have a bunch of categorical variables, let's encode them using one-hot encoding."
 72 |       ]
 73 |     },
 74 |     {
 75 |       "cell_type": "code",
 76 |       "execution_count": null,
 77 |       "metadata": {
 78 |         "id": "yybdzUshkZW8"
 79 |       },
 80 |       "outputs": [],
 81 |       "source": [
 82 |         "iris_df = pd.get_dummies(iris_df, drop_first=True)\n",
 83 |         "iris_df.sample(5)"
 84 |       ]
 85 |     },
 86 |     {
 87 |       "cell_type": "code",
 88 |       "execution_count": null,
 89 |       "metadata": {
 90 |         "cellView": "form",
 91 |         "id": "cG7hNrtxipIk"
 92 |       },
 93 |       "outputs": [],
 94 |       "source": [
 95 |         "# @title Consider normalizing the dataset\n",
 96 |         "\n",
 97 |         "NORMALIZE = False  # @param {'type': 'boolean'}\n",
 98 |         "\n",
 99 |         "iris_df = iris_df.to_numpy()\n",
100 |         "if NORMALIZE:\n",
101 |         "  iris_df = preprocessing.MinMaxScaler().fit_transform(iris_df)"
102 |       ]
103 |     },
104 |     {
105 |       "cell_type": "markdown",
106 |       "metadata": {
107 |         "id": "qU_r29O2lm52"
108 |       },
109 |       "source": [
110 |         "Now let's encode the data into a graph according to the following rules:\n",
111 |         "Nodes - dataset entries and features, edges - feature values."
112 |       ]
113 |     },
114 |     {
115 |       "cell_type": "code",
116 |       "execution_count": null,
117 |       "metadata": {
118 |         "id": "p0dMlNbGlkT_"
119 |       },
120 |       "outputs": [],
121 |       "source": [
122 |         "def encode_data(data: pd.DataFrame, train_mask: np.ndarray) -> np.ndarray:\n",
123 |         "  \"\"\"Encodes tabular data into a graph.\"\"\"\n",
124 |         "  # Number of dataset entries.\n",
125 |         "  num_entries = data.shape[0]\n",
126 |         "\n",
127 |         "  # Number of features in the dataset.\n",
128 |         "  num_features = data.shape[1]\n",
129 |         "\n",
130 |         "  # Computes the number of edges in the graph.\n",
131 |         "  num_edges = num_entries * num_features\n",
132 |         "\n",
133 |         "  # Creates train and test indices according to the `train_mask`.\n",
134 |         "  train_indices = np.arange(num_edges)[train_mask]\n",
135 |         "  test_indices = np.arange(num_edges)[~train_mask]\n",
136 |         "\n",
137 |         "  # Finds the index of the first feature node.\n",
138 |         "  # First `num_entries` nodes correspons to the entries of the dataset.\n",
139 |         "  least_feature_node_id = num_entries\n",
140 |         "\n",
141 |         "  # Specifies nodes features. Here, we are using them to specify the\n",
142 |         "  # one-hot-encoded type of a node.\n",
143 |         "  entry_nodes = np.concatenate(\n",
144 |         "      [np.ones((num_entries, 1)), np.zeros((num_entries, num_features))], axis=1\n",
145 |         "  )\n",
146 |         "  feature_nodes = np.concatenate(\n",
147 |         "      [np.zeros((num_features, 1)), np.identity(num_features)], axis=1\n",
148 |         "  )\n",
149 |         "  nodes_features = np.concatenate([entry_nodes, feature_nodes]).astype(\n",
150 |         "      np.float32\n",
151 |         "  )\n",
152 |         "\n",
153 |         "  # Defines graph connectivity and has the final shape of [2, `num_edges`].\n",
154 |         "  edge_index = []\n",
155 |         "  # Edge feature matrix with shape [`num_edges`, `num_features`].\n",
156 |         "  edge_attr = []\n",
157 |         "  # Retrieves edge indices (indices of nodes that are connected by that edge).\n",
158 |         "  # Builds a directed graph, where all nodes start in an entry node and end in\n",
159 |         "  # a feature node.\n",
160 |         "  for entry_index, features_per_entry in enumerate(data):\n",
161 |         "    for feature_index, feature_value in enumerate(features_per_entry):\n",
162 |         "      edge_index.append([entry_index, least_feature_node_id + feature_index])\n",
163 |         "      edge_attr.append(feature_value)\n",
164 |         "\n",
165 |         "  edge_index = np.array(edge_index, dtype=np.int64).T\n",
166 |         "  edge_attr = np.array(edge_attr, dtype=np.float32).reshape(-1, 1)\n",
167 |         "\n",
168 |         "  # Splits edges and attributes into train and tests subsets.\n",
169 |         "  edge_index_train = edge_index[:, train_indices]\n",
170 |         "  edge_index_test = edge_index[:, test_indices]\n",
171 |         "  edge_attr_train = edge_attr[train_indices]\n",
172 |         "  edge_attr_test = edge_attr[test_indices]\n",
173 |         "  return Data(\n",
174 |         "      x=nodes_features,\n",
175 |         "      edge_index_train=edge_index_train,\n",
176 |         "      edge_index_test=edge_index_test,\n",
177 |         "      edge_attr_train=edge_attr_train,\n",
178 |         "      edge_attr_test=edge_attr_test,\n",
179 |         "  )"
180 |       ]
181 |     },
182 |     {
183 |       "cell_type": "code",
184 |       "execution_count": null,
185 |       "metadata": {
186 |         "id": "VxXBjrh_kUns"
187 |       },
188 |       "outputs": [],
189 |       "source": [
190 |         "TRAIN_RATIO = 0.7  # @param {'type': 'number'}\n",
191 |         "train_mask = (\n",
192 |         "    np.random.RandomState(0)\n",
193 |         "    .binomial(1, TRAIN_RATIO, iris_df.shape[0] * iris_df.shape[1])\n",
194 |         "    .astype(bool)\n",
195 |         ")"
196 |       ]
197 |     },
198 |     {
199 |       "cell_type": "code",
200 |       "execution_count": null,
201 |       "metadata": {
202 |         "id": "muqluvpsU6_d"
203 |       },
204 |       "outputs": [],
205 |       "source": [
206 |         "# @title Let's visualize the resulting train/test split\n",
207 |         "\n",
208 |         "plt.imshow(train_mask.reshape(iris_df.shape[0], iris_df.shape[1])[:40])\n",
209 |         "plt.colorbar()\n",
210 |         "plt.show()"
211 |       ]
212 |     },
213 |     {
214 |       "cell_type": "code",
215 |       "execution_count": null,
216 |       "metadata": {
217 |         "id": "j9L1pF673Rra"
218 |       },
219 |       "outputs": [],
220 |       "source": [
221 |         "class Net(torch.nn.Module):\n",
222 |         "\n",
223 |         "  def __init__(\n",
224 |         "      self,\n",
225 |         "      *,\n",
226 |         "      node_input_dim: int,\n",
227 |         "      edge_input_dim: int,\n",
228 |         "      node_hidden_dim: int,\n",
229 |         "      edge_hidden_dim: int,\n",
230 |         "  ):\n",
231 |         "    super().__init__()\n",
232 |         "\n",
233 |         "    self.node_conv = torch_geometric.nn.SAGEConv(\n",
234 |         "        node_input_dim, node_hidden_dim\n",
235 |         "    )\n",
236 |         "    self.edge_update_mlps = nn.Sequential(\n",
237 |         "        nn.Linear(2 * node_hidden_dim + edge_input_dim, edge_hidden_dim),\n",
238 |         "        torch.nn.ReLU(),\n",
239 |         "        nn.Linear(edge_hidden_dim, edge_input_dim),\n",
240 |         "        torch.nn.ReLU(),\n",
241 |         "    )\n",
242 |         "\n",
243 |         "  def forward(\n",
244 |         "      self, x: torch.Tensor, edge_attr: torch.Tensor, edge_index: torch.Tensor\n",
245 |         "  ):\n",
246 |         "    x = self.node_conv(x, edge_index)\n",
247 |         "    x_from = x[edge_index[0]]\n",
248 |         "    x_to = x[edge_index[1]]\n",
249 |         "    edge_attr = self.edge_update_mlps(\n",
250 |         "        torch.cat([x_from, x_to, edge_attr], dim=-1)\n",
251 |         "    )\n",
252 |         "    return x, edge_attr"
253 |       ]
254 |     },
255 |     {
256 |       "cell_type": "code",
257 |       "execution_count": null,
258 |       "metadata": {
259 |         "id": "crw9gH_K4Gy9"
260 |       },
261 |       "outputs": [],
262 |       "source": [
263 |         "NUM_EPOCHS = 200  # @param {'type': 'number'}\n",
264 |         "LEARNING_RATE = 0.001  # @param {'type': 'number'}\n",
265 |         "WEIGHT_DECAY = 0.0000001  # @param {'type': 'number'}\n",
266 |         "\n",
267 |         "\n",
268 |         "def train(gnn: torch.nn.Module, graph: Data) -> tuple[list[float], list[float]]:\n",
269 |         "  train_loss, val_loss = [], []\n",
270 |         "  # Puts all of the tensors to the device in use.\n",
271 |         "  x = torch.from_numpy(graph.x).to(device)\n",
272 |         "  edge_attr = torch.from_numpy(graph.edge_attr_train).to(device)\n",
273 |         "  edge_attr_test = torch.from_numpy(graph.edge_attr_test).to(device)\n",
274 |         "  edge_index_train = torch.from_numpy(graph.edge_index_train).to(device)\n",
275 |         "  edge_index_test = torch.from_numpy(graph.edge_index_test).to(device)\n",
276 |         "\n",
277 |         "  optimizer = torch.optim.Adam(\n",
278 |         "      gnn.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY\n",
279 |         "  )\n",
280 |         "  for epoch in range(NUM_EPOCHS):\n",
281 |         "    gnn.train()\n",
282 |         "    optimizer.zero_grad()\n",
283 |         "    out, out_edge = gnn(x, edge_attr, edge_index_train)\n",
284 |         "    loss = F.mse_loss(edge_attr, out_edge)\n",
285 |         "    loss.backward()\n",
286 |         "    optimizer.step()\n",
287 |         "    out.detach().to('cpu')\n",
288 |         "    out_edge.detach().to('cpu')\n",
289 |         "    del out\n",
290 |         "    del out_edge\n",
291 |         "    train_loss.append(loss.item())\n",
292 |         "    with torch.no_grad():\n",
293 |         "      out, out_edge_test = gnn(x, edge_attr_test, edge_index_test)\n",
294 |         "      loss = F.mse_loss(edge_attr_test, out_edge_test)\n",
295 |         "      out.detach().to('cpu')\n",
296 |         "      out_edge_test.detach().to('cpu')\n",
297 |         "      val_loss.append(loss.item())\n",
298 |         "  return train_loss, val_loss"
299 |       ]
300 |     },
301 |     {
302 |       "cell_type": "code",
303 |       "execution_count": null,
304 |       "metadata": {
305 |         "id": "0dUHOXsg1xlI"
306 |       },
307 |       "outputs": [],
308 |       "source": [
309 |         "graph = encode_data(iris_df, train_mask)"
310 |       ]
311 |     },
312 |     {
313 |       "cell_type": "code",
314 |       "execution_count": null,
315 |       "metadata": {
316 |         "id": "JSrA9x8WQT5W"
317 |       },
318 |       "outputs": [],
319 |       "source": [
320 |         "# @title Instantiate the GNN\n",
321 |         "\n",
322 |         "NODE_HIDDEN_DIM = 128  # @param {'type': 'number'}\n",
323 |         "EDGE_HIDDEN_DIM = 128  # @param {'type': 'number'}\n",
324 |         "\n",
325 |         "\n",
326 |         "gnn = Net(\n",
327 |         "    node_input_dim=graph.x.shape[1],\n",
328 |         "    edge_input_dim=graph.edge_attr_train.shape[1],\n",
329 |         "    node_hidden_dim=NODE_HIDDEN_DIM,\n",
330 |         "    edge_hidden_dim=EDGE_HIDDEN_DIM,\n",
331 |         ")\n",
332 |         "gnn = gnn.to(device)"
333 |       ]
334 |     },
335 |     {
336 |       "cell_type": "code",
337 |       "execution_count": null,
338 |       "metadata": {
339 |         "id": "SzgtpPMfQAJ2"
340 |       },
341 |       "outputs": [],
342 |       "source": [
343 |         "train_loss, val_loss = train(gnn, graph)\n",
344 |         "\n",
345 |         "plt.plot((x := list(range(NUM_EPOCHS))), train_loss, label='train loss')\n",
346 |         "plt.plot(x, val_loss, label='val loss')\n",
347 |         "plt.legend()\n",
348 |         "plt.title('Training progress')\n",
349 |         "plt.xlabel('Epoch')\n",
350 |         "plt.ylabel('Loss value')\n",
351 |         "plt.show()"
352 |       ]
353 |     }
354 |   ],
355 |   "metadata": {
356 |     "colab": {
357 |       "private_outputs": true,
358 |       "provenance": []
359 |     },
360 |     "kernelspec": {
361 |       "display_name": "Python 3",
362 |       "name": "python3"
363 |     },
364 |     "language_info": {
365 |       "name": "python"
366 |     }
367 |   },
368 |   "nbformat": 4,
369 |   "nbformat_minor": 0
370 | }


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 Mediterranean Machine Learning summer school
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Mediterranean Machine Learning school 2024 tutorials.
 2 | *Designed for education purposes. Please do not distribute without permission.*
 3 | 
 4 | **Credits:** These tutorials are based on previous year's M2L tutorials as well as the outstanding colabs developed for the [EEML](https://github.com/eemlcommunity/) school and the excellent collection of [summer schools tutorials](https://github.com/deepmind/educational#summer-schools-tutorials) developed by DeepMind.
 5 | 
 6 | You are welcome to reuse this material in other courses or schools, giving credits to the respective authors and with a link to this repository. Please also kindly reach out to organizers@m2lschool.org if you plan to do so, to help us measure the impact of this project.
 7 | 
 8 | Please note that we will only be able to provide limited support, but feel free to open a GitHub issue if you any question on the code.
 9 | 
10 | 
11 | ## License
12 | 
13 | Designed for education purposes. Please do not distribute without permission. Write at organizers@m2lschool.org if you have any question.
14 | 
15 | You are welcome to reuse this material in other courses or schools, but please reach out to organizers@m2lschool.org if you plan to do so. We would appreciate it if you could acknowledge that the materials come from M2L 20202 and give credits to the authors. Also please keep a link in your materials to the original repo, in case updates occur.
16 | 
17 | MIT License
18 | 
19 | Copyright (c) 2020 m2lschool
20 | 
21 | Permission is hereby granted, free of charge, to any person obtaining a copy
22 | of this software and associated documentation files (the "Software"), to deal
23 | in the Software without restriction, including without limitation the rights
24 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
25 | copies of the Software, and to permit persons to whom the Software is
26 | furnished to do so, subject to the following conditions:
27 | 
28 | The above copyright notice and this permission notice shall be included in all
29 | copies or substantial portions of the Software.
30 | 
31 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
32 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
33 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
34 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
35 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
36 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
37 | SOFTWARE.


--------------------------------------------------------------------------------