├── 00_AIM_Quicklinks
│   └── README.md
├── 00_LLME_Challenge
│   ├── LLME_Challenge.ipynb
│   └── README.md
├── 01_Kickoff_Model_Evolution
│   └── README.md
├── 02_The_Transformer
│   ├── Encoder_Decoder_Transformer_from_Scratch_Assignment_Version.ipynb
│   ├── Encoder_Decoder_Transformer_from_Scratch_Hardmode_Assignment_Version.ipynb
│   └── README.md
├── 03_Attention
│   ├── Focusing_in_on_Attention_Transformer_from_Scratch_Assignment_Version.ipynb
│   ├── Focusing_in_on_Attention_Transformer_from_Scratch_Hardmode_Version.ipynb
│   └── README.md
├── 04_Embeddings
│   ├── README.md
│   ├── Training_Emebddings_From_Scratch_&_Fine_tuning_Embedding_Models_Assignment.ipynb
│   └── Training_Emebddings_From_Scratch_&_Fine_tuning_Embedding_Models_Coding_Assignment.ipynb
├── 05_Next-Token-Prediction
│   ├── Decoding_From_Logits_to_the_Speculative_Decoding_and_Guard_Rails.ipynb
│   ├── Decoding_From_Logits_to_the_Speculative_Decoding_and_Guard_Rails_Hardmode.ipynb
│   └── README.md
├── 06_Pre-Training
│   ├── README.md
│   ├── The_Loss_Function_in_LLMs_Cross_Entropy_Assignment.ipynb
│   ├── The_Loss_Function_in_LLMs_Cross_Entropy_Hardmode.ipynb
│   └── Using_Hugging_Face_transformers_to_Load_and_Run_a_Model_AIM_Assignment.ipynb
├── 07_Fine-tuning
│   └── README.md
├── 08_Alignment
│   └── README.md
├── 09_Alignment_II
│   └── README.md
├── 10_Cool_Session
│   └── README.md
└── README.md
/00_AIM_Quicklinks/README.md:
--------------------------------------------------------------------------------
1 | # Quicklinks - AI Engineering
2 |
3 | Quicklinks are to help you **easily navigate individual session information** for `AI Engineering`!
4 |
5 | - We'll follow the [course schedule]() as outlined throughout the cohort!
6 | - You can also access all [course materials]() via the [Notion Home Page]()
7 |
8 | | 📰 Session Sheet | 📽️ YouTube Video | 🖼️ Slides | 👨💻 Repo |
9 | |:-----------------|:-----------------|:-----------------|:-----------------|
10 | | [Session 1: Overview and Intro to LLM Engineering](https://www.notion.so/Session-1-Overview-and-Intro-to-LLM-Engineering-1a7cd547af3d80149a20d822bd0a9280) | [01: Introduction to LLM Engineering](https://www.youtube.com/watch?v=xgcPRrq6NBw&ab_channel=AIMakerspace) | [Session 1: Introduction to LLM Engineering](https://www.canva.com/design/DAGWfLzJLjU/q_swsY0ng_NRpXA-ckn26Q/view?utm_content=DAGWfLzJLjU&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=hea55fde063) | [01_Kickoff_Model_Evolution](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs-Open-Source/tree/main/01_Kickoff_Model_Evolution)
11 | | [Session 2: The Transformer](https://www.notion.so/Session-2-The-Transformer-1a7cd547af3d80079041d5112fb052a8) | [02: The Transformer](https://www.youtube.com/watch?v=LYODbG3X4oI&ab_channel=AIMakerspace) | [Session 2: The Transformer](https://www.canva.com/design/DAGW9drJwtU/d5pIdoSDGNoTHppA3i9Crg/view?utm_content=DAGW9drJwtU&utm_campaign=designshare&utm_medium=link&utm_source=editor) | [02_The_Transformer](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs-Open-Source/tree/main/02_The_Transformer)
12 | | [Session 3: Attention](https://www.notion.so/Session-3-Attention-1a7cd547af3d808fae58ed6252cc3e7f) | [03: Attention & Flash Attention](https://www.youtube.com/watch?v=cE5E1m1cSAU&ab_channel=AIMakerspace) | [Session 3: Attention & Flash Attention](https://www.canva.com/design/DAGXJDsxuyI/TO3MaXqimiS-MjbR8-qm3g/view?utm_content=DAGXJDsxuyI&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=hbdbc7bdd9d) | [03_Attention](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs-Open-Source/tree/main/03_Attention)
13 | | [Session 4: Embeddings](https://www.notion.so/Session-4-Embeddings-1a7cd547af3d80669ea0e47ff2a142f9) | [04: Embeddings](https://www.youtube.com/watch?v=XMJzqxElhfY&ab_channel=AIMakerspace) | [Session 4: Embeddings](https://www.canva.com/design/DAGXnKDginc/-g-2FCMJKDr2yhmUuuvVqg/view?utm_content=DAGXnKDginc&utm_campaign=designshare&utm_medium=link&utm_source=editor) | [04_Embeddings](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs-Open-Source/tree/main/04_Embeddings)
14 | | [Session 5: Next-Token Prediction](https://www.notion.so/Session-5-Next-Token-Prediction-1a7cd547af3d8056bacaf652f6f9e8d9) | [05: Next-Token Prediction ](https://www.youtube.com/watch?v=xNRgycrPQFY&ab_channel=AIMakerspace) | [Session 5: Decoding or Next-Token Prediction](https://www.canva.com/design/DAGYRgCRV2k/3xwuCV92aSKKNG7qpockFw/view?utm_content=DAGYRgCRV2k&utm_campaign=designshare&utm_medium=link&utm_source=editor) | [05_Next-Token-Prediction](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs-Open-Source/tree/main/05_Next-Token-Prediction)
15 | | [Session 6: Pretraining](https://www.notion.so/Session-6-Pretraining-1a7cd547af3d80fbaba5decb2ee8616b) | [06: Pretraining](https://www.youtube.com/watch?v=zU5iIAsqJVU&ab_channel=AIMakerspace) | [Session 6: Pretraining](https://www.canva.com/design/DAGYdUqfwVg/l_9JK-h7dgvP4bseYdzwaQ/view?utm_content=DAGYdUqfwVg&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=h9c74da65a4) | [06_Pre-Training](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs-Open-Source/tree/main/06_Pre-Training)
16 | | [Session 7: Fine-Tuning ](https://www.notion.so/Session-7-Fine-Tuning-1a7cd547af3d80e3adedd44ef63ac992) | [07: Fine-Tuning](https://www.youtube.com/watch?v=ELu2dy2Iccs&ab_channel=AIMakerspace) | [Session 7: Fine-Tuning](https://www.canva.com/design/DAGY7ZxFsRU/wzpT21_Ub_a3RAo3-HVvPQ/view?utm_content=DAGY7ZxFsRU&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=hcadf98dde7) | [07_Fine-tuning](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs-Open-Source/tree/main/07_Fine-tuning)
17 | | [Session 8: Alignment](https://www.notion.so/Session-8-Alignment-1a7cd547af3d800ab391dd8f2ceb9329) | [08: Alignment ](https://www.youtube.com/watch?v=4ehPGFIf91o&ab_channel=AIMakerspace) | [Session 8: Alignment](https://www.canva.com/design/DAGZHXVSNBE/OHkXXiAmsfSXwHL1r2P0bw/view?utm_content=DAGZHXVSNBE&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=h59157a8ad7) | [08_Alignment](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs-Open-Source/tree/main/08_Alignment)
18 | | [Session 9: Alignment II and Merging](https://www.notion.so/Session-9-Alignment-II-and-Merging-1a7cd547af3d805ca7bae487d35073a5) | [09: Alignment II & Merging](https://www.youtube.com/watch?v=VzTujojD1ho&ab_channel=AIMakerspace) | [Session 9: Alignment II & Merging](https://www.canva.com/design/DAGZlbhGppY/Lmr8nwEG4T5p8vsY3pqOmw/view?utm_content=DAGZlbhGppY&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=h580ea460bf) | [09_Alignment_II](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs-Open-Source/tree/main/09_Alignment_II)
19 | | [Session 10: Frontiers](https://www.notion.so/Session-10-Frontiers-1a7cd547af3d80ff9482cee21e65edaf) | [10: Frontiers ](https://www.youtube.com/watch?v=ft8DrEW1ZSc&ab_channel=AIMakerspace) | [Session 10: Frontiers](https://www.canva.com/design/DAGZsTHsSuE/hLmurLxgBsX-D8royq42jA/view?utm_content=DAGZsTHsSuE&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=hf1f2028a93) | [10_Cool_Session](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs-Open-Source/tree/main/10_Cool_Session)
20 |
--------------------------------------------------------------------------------
/00_LLME_Challenge/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 |
7 | # The LLM Engineering Challenge!
8 |
9 | Welcome to the LLM Engineering Challenge!
10 |
11 | Your challenge, should you choose to accept it, is to build and ship a fine-tuned version of Llama 3 - one that's good at summarization!
12 |
13 | You can find the Colab version of the LLME Challenge [here](https://colab.research.google.com/drive/16oeDV8RTpn_3irlZQmZ4yXZjkW_bvxfT?usp=sharing) and see a [step-by-step walkthrough](https://youtu.be/etdAcVJAoao?si=88RxVTtpSYhwkoTZ) on YouTube.
14 |
15 | [![LLM Engineering Challenge Walkthrough](https://img.youtube.com/vi/etdAcVJAoao/0.jpg)](https://youtu.be/etdAcVJAoao?si=88RxVTtpSYhwkoTZ)
16 |
17 | After you ship, it's time to share!
18 |
19 | Post about your model in the [build-ship-share-🏗-🚢-🚀](https://discord.com/channels/1135695983720792216/1135700320517890131) channel on Discord!
20 |
--------------------------------------------------------------------------------
/01_Kickoff_Model_Evolution/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 | # How Models Have Evolved Over Time: "How many 'r's in strawberry" 🍓
7 |
8 | ### [Quicklinks](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs/tree/main/00_AIM_Quicklinks)
9 |
10 | | 📰 Session Sheet | 📽️ YouTube Video | 🖼️ Slides | 👨💻 Repo |
11 | |:-----------------|:-----------------|:-----------------|:-----------------|
12 | | [Session 1: Overview and Intro to LLM Engineering](https://www.notion.so/Session-1-Overview-and-Intro-to-LLM-Engineering-1a7cd547af3d80149a20d822bd0a9280) | [01: Introduction to LLM Engineering](https://www.youtube.com/watch?v=xgcPRrq6NBw&ab_channel=AIMakerspace) | [Session 1: Introduction to LLM Engineering](https://www.canva.com/design/DAGWfLzJLjU/q_swsY0ng_NRpXA-ckn26Q/view?utm_content=DAGWfLzJLjU&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=hea55fde063) | You Are Here! |
13 |
14 | ## Assignment
15 |
16 | Over time, we've changed and improved the way we think about and train models in practice.
17 |
18 | From the earliest days of GPT to the most recent developments like OpenAI's `o1` model - model training processes have undergone rapid transformation.
19 |
20 | ## So How Many 'r's are there?
21 |
22 | Let's go through the eras of model training/post-training methods to see how things have changed with a simple question:
23 |
24 | `How many 'r's are in the word "Strawberry"?`
25 |
26 | ### GPT-2
27 | Starting with [GPT2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) - where pre-training was all we needed - we can see that the model doesn't even follow instructions, it's just a sequence-continuing machine.
28 |
29 | Let's see our input:
30 |
31 | ```
32 | "How many 'r's are in the word "Strawberry"?"
33 | ```
34 |
35 | And our output:
36 |
37 | ```
38 | "How many 'r's are in the word "Strawberry"?
39 |
40 | The answer is, of course, no.
41 |
42 | The word "Strawberry"
43 | ```
44 |
45 | Let's try a model that's a bit better at following instructions.
46 |
47 | ### Llama 3 8B Instruct
48 |
49 | Enter "Instruction Tuning". Introduced in [Cross-Task Generalization via Natural Language Crowdsourcing Instructions](https://aclanthology.org/2022.acl-long.244/), and popularized in the "FLAN paper" ([Finetuned Language Models Are Zero-Shot Learners](https://arxiv.org/abs/2109.01652)), instruction-tuning gave models the ability to follow instructions.
50 |
51 | Since this method was introduced, it became a staple of LLM releases, open-source or not. We'll be looking at [Meta Llama 3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to see how it gets on with our question!
52 |
53 | We also have some additional post-training (alignment, etc.) that has been applied to make the model respond to our instructions *how we want*.
54 |
55 | Our input remains the same, but this time our output is:
56 |
57 | ```
58 | "That's a tasty question!
59 |
60 | Let me count the "r's" for you: S-T-R-A-W-B-E-R-R-Y
61 |
62 | There are 3 "r's" in the word "Strawberry"!"
63 | ```
64 |
65 | Incredibly, we get the correct response!
66 |
67 | ### GenZAI - A Supervised Fine-tuned Llama 3 8B Instruct
68 |
69 | With the advent of things like LoRA, QLoRA, and more, developers are able to leverage supervised fine-tuning to move models toward their desired behaviours - like GenZAI!
70 |
71 | Let's ask it the same question using the supervised fine-tuned model [Gen-Z-Translate-Llama-3-Instruct-v1](https://huggingface.co/ai-maker-space/gen-z-translate-llama-3-instruct-v1):
72 |
73 | ```
74 | "I'ma count 'em for ya! There's like, 2 "r"s in the word "Strawberry", bruh."
75 | ```
76 |
77 | We lose correctness, but gain a GenZ flavour of speech!
78 |
79 | ### Arcee-SuperNova-v1
80 |
81 | Next, we move on to a model that takes post-training to new heights! From alignment, to model merging, [Arcee-SuperNova-v1](https://supernova.arcee.ai/) is a juggernaut.
82 |
83 | Let's ask it the same question we asked the other models, and see what it says:
84 |
85 | ```
86 | The word "Strawberry" contains two 'r's.
87 | ```
88 |
89 | Alas, it tried its best, but didn't get there!
90 |
91 | ### OpenAI `o1`
92 |
93 | Finally, let's end with a model that has undergone extensive post-training (some of which we don't even know about yet!) to improve reasoning capacity.
94 |
95 | Let's ask OpenAI's [`o1`](https://chatgpt.com/) the same question, and see what happens:
96 |
97 | ```
98 | Thought for a couple of seconds
99 | There are three 'r's in the word "Strawberry".
100 | ```
101 |
102 | ## Conclusion
103 |
104 | In the last few years, model training has blossomed into an entire field - one which we will try to cover in our time together. We can finally consistently get the correct answer to life's important questions, like:
105 |
106 | `How many 'r's are in the word "Strawberry"?`
107 |
--------------------------------------------------------------------------------
/02_The_Transformer/Encoder_Decoder_Transformer_from_Scratch_Assignment_Version.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "id": "9ZwKP3f2HGz9"
7 | },
8 | "source": [
9 | "# Encoder-Decoder Transformer Model from Scratch in PyTorch\n",
10 | "\n",
11 | "In today's notebook, we'll be focusing on two major components of transformers:\n",
12 | "\n",
13 | "1. The Building Blocks\n",
14 | "2. Training the Model\n",
15 | "\n",
16 | "As the brunt of the in-class time will be spent on the building blocks, we'll leave the training logic as an assignment to complete. We'll be focusing more specifically on training once we start using decoder-only architectures."
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "id": "RjbGDn9SF49m"
23 | },
24 | "source": [
25 | "# How AIM Does Assignments\n",
26 | "\n",
27 | "Throughout our time together - we'll be providing a number of assignments. Each assignment will be split into two broad categories:\n",
28 | "\n",
29 | "1. Base Assignment - a more conceptual and theory-based assignment focused on locking in specific key concepts and learnings.\n",
30 | "2. Hardmode Assignment - a more programming-focused assignment centered on core code concepts used in transformers.\n",
31 | "\n",
32 | "Each assignment will have a few of the following categories of exercises:\n",
33 | "\n",
34 | "1. ❓Questions - these will be questions that you will be expected to gather the answer to!\n",
35 | "2. 🏗️ Activities - these will be work or coding activities meant to reinforce specific concepts or theory components.\n",
36 | "\n",
37 | "You are expected to complete all of the activities in your selected notebook!"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {
43 | "id": "Z43kfJGvlQyA"
44 | },
45 | "source": [
46 | "# The Building Block Fundamentals of Transformer Architecture\n",
47 | "\n",
48 | "We're going to start with an example of an encoder-decoder model - the kind found in the classic paper:\n",
49 | "\n",
50 | "[Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf).\n",
51 | "\n",
52 | "We'll walk through each step in code - leveraging the [PyTorch]() library heavily - in order to get an idea of how these models work.\n",
53 | "\n",
54 | "While this example notebook could be extended to a serious use case - we'll be using a toy dataset, and we will not fully train the model until it converges (we'll under-train it), as the full training process might take many days!"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {
60 | "id": "_sd2KYudl0Hn"
61 | },
62 | "source": [
63 | "## The Desired Architecture\n",
64 | "\n",
65 | "\n",
66 | "\n",
67 | "We'll skip over the diagram for now, and talk through each component in detail!"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 2,
73 | "metadata": {
74 | "id": "TPUnyvsuovuP"
75 | },
76 | "outputs": [],
77 | "source": [
78 | "import torch\n",
79 | "import torch.nn as nn\n",
80 | "import math"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {
86 | "id": "JR8M3oT0l8fM"
87 | },
88 | "source": [
89 | "## Embedding\n",
90 | "\n",
91 | "\n",
92 | "\n",
93 | "The first step will be to convert our tokenized sequence of inputs into embedding vectors. This allows us to capture a rich amount of information about input sequences and their semantic meanings.\n",
94 | "\n",
95 | "As the embedding layer will be trained alongside the rest of the model - it will allow us to have an excellent vector representation of the tokens in our dataset.\n",
96 | "\n",
97 | "Let's see how it looks in code!"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 3,
103 | "metadata": {
104 | "id": "ddpQjPJWmXZg"
105 | },
106 | "outputs": [],
107 | "source": [
108 | "class InputEmbeddings(nn.Module):\n",
109 | " def __init__(self, d_model: int, vocab_size: int, verbose=False) -> None:\n",
110 | " \"\"\"\n",
111 | " vocab_size - the size of our vocabulary\n",
112 | " d_model - the dimension of our embeddings and the input dimension for our model\n",
113 | " \"\"\"\n",
114 | " super().__init__()\n",
115 | " self.vocab_size = vocab_size\n",
116 | " self.d_model = d_model\n",
117 | " self.embedding = nn.Embedding(vocab_size, d_model)\n",
118 | " self.verbose = verbose\n",
119 | "\n",
120 | " def forward(self, x):\n",
121 | " if self.verbose:\n",
122 | " print(f\"Embedding Vector (1st 5 elements): {self.embedding(x)[:5] * math.sqrt(self.d_model)}\")\n",
123 | " return self.embedding(x) * math.sqrt(self.d_model) # scale embeddings by square root of d_model"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {
129 | "id": "P4Q4WPNALUKR"
130 | },
131 | "source": [
132 | "### ❓Question 1:\n",
133 | "\n",
134 | "Given:\n",
135 | "\n",
136 | "1. Batch Size = `16`\n",
137 | "2. Sequence Length = `350`\n",
138 | "\n",
139 | "What will the output shape of the `InputEmbeddings` layer be?"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {
145 | "id": "ZCx8DYkaxwto"
146 | },
147 | "source": [
148 | "### Test Embedding Layer\n",
149 | "\n",
150 | "We'll set up a sample Embedding Layer and then test that it does what we'd expect!"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 5,
156 | "metadata": {
157 | "id": "xl9VdXuYwfVU"
158 | },
159 | "outputs": [],
160 | "source": [
161 | "def test_input_embeddings_with_example():\n",
162 | " # Create a small embedding layer\n",
163 | " embed = InputEmbeddings(d_model=512, vocab_size=1000)\n",
164 | "\n",
165 | " # Example sentence tokens (simplified)\n",
166 | " tokens = torch.tensor([[1, 2, 3, 4, 5]]) # \"The cat sat down quickly\"\n",
167 | "\n",
168 | " output = embed(tokens)\n",
169 | " print(f\"Input shape: {tokens.shape}\")\n",
170 | " print(f\"Output shape: {output.shape}\")\n",
171 | " print(\"\\nExample shows how words are converted to high-dimensional vectors\")\n",
172 | "\n",
173 | " # Run technical test\n",
174 | " assert output.shape == (1, 5, 512), f\"Expected shape (1, 5, 512), got {output.shape}\"\n",
175 | " print(\"✓ Input Embeddings Test Passed\")"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": 6,
181 | "metadata": {
182 | "colab": {
183 | "base_uri": "https://localhost:8080/"
184 | },
185 | "id": "CBIElPiqxORa",
186 | "outputId": "477638d1-3ff3-42b6-b557-04f970072c06"
187 | },
188 | "outputs": [
189 | {
190 | "name": "stdout",
191 | "output_type": "stream",
192 | "text": [
193 | "Input shape: torch.Size([1, 5])\n",
194 | "Output shape: torch.Size([1, 5, 512])\n",
195 | "\n",
196 | "Example shows how words are converted to high-dimensional vectors\n",
197 | "✓ Input Embeddings Test Passed\n"
198 | ]
199 | }
200 | ],
201 | "source": [
202 | "test_input_embeddings_with_example()"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {
208 | "id": "ra5KCa1KnfrC"
209 | },
210 | "source": [
211 | "## Positional Encoding\n",
212 | "\n",
213 | "\n",
214 | "\n",
215 | "We need to impart information about where each token is in the sequence, but we aren't using any recurrence or convolutions - the easiest way is to inject positional information directly into our input embeddings.\n",
216 | "\n",
217 | "We're going to use the process outlined in the paper to do this - which is to use a specific combination of functions to add positional information to the embeddings."
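,
"\n",
"For reference, these are the sinusoidal functions from the paper, where $pos$ is the token position and $i$ indexes the embedding dimensions:\n",
"\n",
"$$PE_{(pos, 2i)} = \\sin\\left(\\frac{pos}{10000^{2i/d_{model}}}\\right) \\qquad PE_{(pos, 2i+1)} = \\cos\\left(\\frac{pos}{10000^{2i/d_{model}}}\\right)$$"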
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": 10,
223 | "metadata": {
224 | "id": "5K3NXh7MoM5D"
225 | },
226 | "outputs": [],
227 | "source": [
228 | "class PositionalEncoding(nn.Module):\n",
229 | " def __init__(self, d_model: int, seq_len: int, dropout: float, verbose=False) -> None:\n",
230 | " super().__init__()\n",
231 | " self.d_model = d_model\n",
232 | " self.seq_len = seq_len\n",
233 | " self.dropout = nn.Dropout(dropout)\n",
234 | " self.verbose=verbose\n",
235 | "\n",
236 | " positional_embeddings = torch.zeros(seq_len, d_model)\n",
237 | " positional_sequence_vector = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)\n",
238 | " positional_model_vector = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))\n",
239 | " positional_embeddings[:, 0::2] = torch.sin(positional_sequence_vector * positional_model_vector)\n",
240 | " positional_embeddings[:, 1::2] = torch.cos(positional_sequence_vector * positional_model_vector)\n",
241 | " positional_embeddings = positional_embeddings.unsqueeze(0)\n",
242 | "\n",
243 | " self.register_buffer('positional_embeddings', positional_embeddings)\n",
244 | "\n",
245 | " def forward(self, x):\n",
246 | " x = x + (self.positional_embeddings[:, :x.shape[1], :]).requires_grad_(False)\n",
247 | " if self.verbose:\n",
248 | " print(f\"Positional Encodings (1st 5 elements): {x}\")\n",
249 | " return self.dropout(x)"
250 | ]
251 | },
252 | {
253 | "cell_type": "markdown",
254 | "metadata": {
255 | "id": "QKGHvs0eMCoQ"
256 | },
257 | "source": [
258 | "### ❓Question 2:\n",
259 | "\n",
260 | "Given:\n",
261 | "\n",
262 | "1. Batch Size = `16`\n",
263 | "2. Sequence Length = `350`\n",
264 | "\n",
265 | "What will the output shape of the `PositionalEncoding` layer be?"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {
271 | "id": "zkcMmoT2yOZs"
272 | },
273 | "source": [
274 | "### Test Positional Encoding Layer\n",
275 | "\n",
276 | "We'll set up a sample Positional Encoding Layer and then test that it does what we'd expect!"
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "execution_count": 11,
282 | "metadata": {
283 | "id": "ayoWMbSYySBp"
284 | },
285 | "outputs": [],
286 | "source": [
287 | "def test_positional_encoding_with_example():\n",
288 | " pos = PositionalEncoding(d_model=512, seq_len=10, dropout=0.1)\n",
289 | "\n",
290 | " # Create sample embeddings for \"The cat sat\"\n",
291 | " x = torch.randn(1, 3, 512)\n",
292 | "\n",
293 | " output = pos(x)\n",
294 | " print(\"Input tokens position: [1, 2, 3]\")\n",
295 | " print(\"Added position info to each word's embedding\")\n",
296 | " print(f\"Output maintains shape: {output.shape}\")\n",
297 | "\n",
298 | " # Verify position information was added\n",
299 | " assert not torch.allclose(output, x), \"Position information should modify embeddings\"\n",
300 | " print(\"✓ Positional Encoding Test Passed\")"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": 12,
306 | "metadata": {
307 | "colab": {
308 | "base_uri": "https://localhost:8080/"
309 | },
310 | "id": "Z4hDXJaGyU5M",
311 | "outputId": "b16add72-3165-4c11-aa38-3605048f9059"
312 | },
313 | "outputs": [
314 | {
315 | "name": "stdout",
316 | "output_type": "stream",
317 | "text": [
318 | "Input tokens position: [1, 2, 3]\n",
319 | "Added position info to each word's embedding\n",
320 | "Output maintains shape: torch.Size([1, 3, 512])\n",
321 | "✓ Positional Encoding Test Passed\n"
322 | ]
323 | }
324 | ],
325 | "source": [
326 | "test_positional_encoding_with_example()"
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {
332 | "id": "ueiM7LzKpcFY"
333 | },
334 | "source": [
335 | "## Add & Norm\n",
336 | "\n",
337 | "Next we'll tackle the Add & Norm Block of the diagram.\n",
338 | "\n",
339 | ""
340 | ]
341 | },
342 | {
343 | "cell_type": "markdown",
344 | "metadata": {
345 | "id": "_lDABPLSKqOt"
346 | },
347 | "source": [
348 | "### Layer Normalization\n",
349 | "\n",
350 | "The first step is to add layer normalization. You can read more about it [here](https://paperswithcode.com/method/layer-normalization)!\n",
351 | "\n",
352 | "The basic idea is that normalizing each token's features to zero mean and unit variance makes training the model a bit easier, and allows the model to generalize a bit better."
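,
"\n",
"Concretely, the layer below computes the following, with learnable $\\gamma$ and $\\beta$, and with the mean $\\mu$ and standard deviation $\\sigma$ taken over the feature dimension:\n",
"\n",
"$$\\text{LayerNorm}(x) = \\gamma \\cdot \\frac{x - \\mu}{\\sigma + \\epsilon} + \\beta$$"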
353 | ]
354 | },
355 | {
356 | "cell_type": "code",
357 | "execution_count": 18,
358 | "metadata": {
359 | "id": "1Nlv7BH7ruSf"
360 | },
361 | "outputs": [],
362 | "source": [
363 | "class LayerNormalization(nn.Module):\n",
364 | " def __init__(self, features: int, epsilon:float=10**-6) -> None:\n",
365 | " super().__init__()\n",
366 | " self.epsilon = epsilon\n",
367 | " self.gamma = nn.Parameter(torch.ones(features))\n",
368 | " self.beta = nn.Parameter(torch.zeros(features))\n",
369 | "\n",
370 | " def forward(self, x):\n",
371 | " mean = x.mean(dim = -1, keepdim = True)\n",
372 | " standard_deviation = x.std(dim = -1, keepdim = True)\n",
373 | " return self.gamma * (x - mean) / (standard_deviation + self.epsilon) + self.beta"
374 | ]
375 | },
376 | {
377 | "cell_type": "markdown",
378 | "metadata": {
379 | "id": "jg-deMc3MNnM"
380 | },
381 | "source": [
382 | "### ❓Question 3:\n",
383 | "\n",
384 | "What is the purpose of `epsilon` in the above code?\n",
385 | "\n",
386 | "> HINT: Pay special attention to the math in the `return` statement."
387 | ]
388 | },
389 | {
390 | "cell_type": "markdown",
391 | "metadata": {
392 | "id": "vft5kFUBy1aE"
393 | },
394 | "source": [
395 | "### Test Layer Normalization\n",
396 | "\n",
397 | "We'll set up a sample Layer Normalization and then test that it does what we'd expect!"
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": 21,
403 | "metadata": {
404 | "id": "01Llr9cwy7C2"
405 | },
406 | "outputs": [],
407 | "source": [
408 | "def test_layer_normalization_with_example():\n",
409 | " layer_norm = LayerNormalization(features=3) # Smaller feature size for example\n",
410 | "\n",
411 | " # Simulate word embeddings with different magnitudes\n",
412 | " word_embeddings = torch.tensor([\n",
413 | " [2.5, 4.1, -3.2], # \"The\" (high magnitude)\n",
414 | " [0.1, 0.2, -0.1], # \"cat\" (low magnitude)\n",
415 | " [8.2, -6.1, 5.5] # \"sat\" (very high magnitude)\n",
416 | " ]).unsqueeze(0)\n",
417 | "\n",
418 | " normalized = layer_norm(word_embeddings)\n",
419 | "\n",
420 | " print(\"Before normalization (magnitudes vary greatly):\")\n",
421 | " print(word_embeddings[0])\n",
422 | " print(\"\\nAfter normalization (values scaled to similar ranges):\")\n",
423 | " print(normalized[0])\n",
424 | "\n",
425 | " # Verify statistical properties\n",
426 | " mean = normalized.mean(dim=-1)\n",
427 | " var = normalized.var(dim=-1)\n",
428 | " assert torch.allclose(mean, torch.zeros_like(mean), atol=1e-5)\n",
429 | " assert torch.allclose(var, torch.ones_like(var), atol=1e-5)\n",
430 | " print(\"✓ Layer Normalization Test Passed\")"
431 | ]
432 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": 22,
436 | "metadata": {
437 | "colab": {
438 | "base_uri": "https://localhost:8080/"
439 | },
440 | "id": "75hToLCry9mR",
441 | "outputId": "2319b0d9-72a9-43f6-a21c-7fc66f794ccf"
442 | },
443 | "outputs": [
444 | {
445 | "name": "stdout",
446 | "output_type": "stream",
447 | "text": [
448 | "Before normalization (magnitudes vary greatly):\n",
449 | "tensor([[ 2.5000, 4.1000, -3.2000],\n",
450 | " [ 0.1000, 0.2000, -0.1000],\n",
451 | " [ 8.2000, -6.1000, 5.5000]])\n",
452 | "\n",
453 | "After normalization (values scaled to similar ranges):\n",
454 | "tensor([[ 0.3562, 0.7732, -1.1293],\n",
455 | " [ 0.2182, 0.8729, -1.0911],\n",
456 | " [ 0.7459, -1.1363, 0.3905]], grad_fn=)\n",
457 | "✓ Layer Normalization Test Passed\n"
458 | ]
459 | }
460 | ],
461 | "source": [
462 | "test_layer_normalization_with_example()"
463 | ]
464 | },
465 | {
466 | "cell_type": "markdown",
467 | "metadata": {
468 | "id": "rwXt7KKKycrl"
469 | },
470 | "source": [
471 | "### Residual Connection\n",
472 | "\n",
473 | "As another technique that makes model training easier, we add a Residual Connection to the outputs of the Attention Block - this helps to prevent vanishing gradients."
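,
"\n",
"Concretely, the block below computes the following (note that it normalizes the input *before* the sublayer - a \"pre-norm\" variant of the Add & Norm step):\n",
"\n",
"$$\\text{output} = x + \\text{Dropout}(\\text{Sublayer}(\\text{LayerNorm}(x)))$$"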
474 | ]
475 | },
476 | {
477 | "cell_type": "code",
478 | "execution_count": 25,
479 | "metadata": {
480 | "id": "XKwGFD2-yd-Z"
481 | },
482 | "outputs": [],
483 | "source": [
484 | "class ResidualConnection(nn.Module):\n",
485 | " def __init__(self, features: int, dropout: float = 0.1) -> None:\n",
486 | " super().__init__()\n",
487 | " self.dropout = nn.Dropout(dropout)\n",
488 | " self.layernorm = LayerNormalization(features)\n",
489 | "\n",
490 | " def forward(self, x, sublayer):\n",
491 | " return x + self.dropout(sublayer(self.layernorm(x)))"
492 | ]
493 | },
494 | {
495 | "cell_type": "markdown",
496 | "metadata": {
497 | "id": "p668YYgBzc3G"
498 | },
499 | "source": [
500 | "### Testing Residual Connection\n",
501 | "\n",
502 | "We'll set up a sample Residual Connection and then test that it does what we'd expect!"
503 | ]
504 | },
505 | {
506 | "cell_type": "code",
507 | "execution_count": 30,
508 | "metadata": {
509 | "id": "heebV5mIzje4"
510 | },
511 | "outputs": [],
512 | "source": [
513 | "def test_residual_connection_with_example():\n",
514 | " residual = ResidualConnection(features=3, dropout=0.1)\n",
515 | "\n",
516 | " # Original input \"The cat\"\n",
517 | " x = torch.tensor([\n",
518 | " [1.0, 1.0, 1.0],\n",
519 | " [2.0, 2.0, 2.0]\n",
520 | " ]).unsqueeze(0)\n",
521 | "\n",
522 | " # Sublayer that makes meaningful changes\n",
523 | " def sublayer(x):\n",
524 | " return torch.nn.functional.relu(x + 0.5) # Non-linear transformation\n",
525 | "\n",
526 | " output = residual(x, sublayer)\n",
527 | "\n",
528 | " print(\"Original input:\")\n",
529 | " print(x[0])\n",
530 | " print(\"\\nAfter residual connection (combines original + transformed):\")\n",
531 | " print(output[0])\n",
532 | "\n",
533 | " # Verify output changed but maintained shape\n",
534 | " assert output.shape == x.shape\n",
535 | " assert torch.any(torch.abs(output - x) > 1e-6), \"Output should differ from input\"\n",
536 | " print(\"✓ Residual Connection Test Passed\")"
537 | ]
538 | },
539 | {
540 | "cell_type": "code",
541 | "execution_count": 31,
542 | "metadata": {
543 | "colab": {
544 | "base_uri": "https://localhost:8080/"
545 | },
546 | "id": "YZMF65cmzmgx",
547 | "outputId": "a3184e93-1dc0-4272-c50d-399d276a8ae6"
548 | },
549 | "outputs": [
550 | {
551 | "name": "stdout",
552 | "output_type": "stream",
553 | "text": [
554 | "Original input:\n",
555 | "tensor([[1., 1., 1.],\n",
556 | " [2., 2., 2.]])\n",
557 | "\n",
558 | "After residual connection (combines original + transformed):\n",
559 | "tensor([[1.5556, 1.5556, 1.5556],\n",
560 | " [2.5556, 2.5556, 2.5556]], grad_fn=)\n",
561 | "✓ Residual Connection Test Passed\n"
562 | ]
563 | }
564 | ],
565 | "source": [
566 | "test_residual_connection_with_example()"
567 | ]
568 | },
569 | {
570 | "cell_type": "markdown",
571 | "metadata": {
572 | "id": "IIOZp3xhsaXK"
573 | },
574 | "source": [
575 | "## Feed Forward Network\n",
576 | "\n",
577 | "\n",
578 | "\n",
579 | "Moving onto the next component, we have our feed forward network.\n",
580 | "\n",
581 | "The feed forward network serves two purposes in our model:\n",
582 | "\n",
583 | "1. It transforms the attention outputs into a format that works with the next block.\n",
584 | "\n",
585 | "2. It adds complexity, which helps prevent each attention block from acting in a similar fashion."
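,
"\n",
"Concretely, the block below computes the following, where $W_1, b_1$ and $W_2, b_2$ are the weights of `linear_1` and `linear_2`:\n",
"\n",
"$$\\text{FFN}(x) = W_2 \\, \\text{Dropout}(\\max(0, W_1 x + b_1)) + b_2$$"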
586 | ]
587 | },
588 | {
589 | "cell_type": "code",
590 | "execution_count": 34,
591 | "metadata": {
592 | "id": "JueNG2UBszbR"
593 | },
594 | "outputs": [],
595 | "source": [
596 | "class FeedForwardBlock(nn.Module):\n",
597 | " def __init__(self, d_model: int, d_ff: int = 2048, dropout: float = 0.1) -> None:\n",
598 | " \"\"\"\n",
599 | " d_model - dimension of model\n",
600 | " d_ff - dimension of feed forward network\n",
601 | " dropout - regularization measure\n",
602 | " \"\"\"\n",
603 | " super().__init__()\n",
604 | " self.linear_1 = nn.Linear(d_model, d_ff)\n",
605 | " self.dropout = nn.Dropout(dropout)\n",
606 | " self.linear_2 = nn.Linear(d_ff, d_model)\n",
607 | "\n",
608 | " def forward(self, x):\n",
609 | " return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))"
610 | ]
611 | },
612 | {
613 | "cell_type": "markdown",
614 | "metadata": {
615 | "id": "cU-H1m4L0UhY"
616 | },
617 | "source": [
618 | "### Testing the Feed-forward Block\n",
619 | "\n",
620 | "Let's test!"
621 | ]
622 | },
623 | {
624 | "cell_type": "code",
625 | "execution_count": 32,
626 | "metadata": {
627 | "id": "pwtbsJou0fxz"
628 | },
629 | "outputs": [],
630 | "source": [
631 | "def test_feed_forward_block_with_example():\n",
632 | " ff_block = FeedForwardBlock(d_model=3, d_ff=8) # Small dimensions for demonstration\n",
633 | "\n",
634 | " # Input: Word embeddings for \"The cat\"\n",
635 | " x = torch.tensor([\n",
636 | " [1.0, 0.5, 0.2], # \"The\"\n",
637 | " [2.0, -0.3, 1.1] # \"cat\"\n",
638 | " ]).unsqueeze(0)\n",
639 | "\n",
640 | " output = ff_block(x)\n",
641 | "\n",
642 | " print(\"Input embeddings:\")\n",
643 | " print(x[0])\n",
644 | " print(\"\\nAfter feed-forward transformation:\")\n",
645 | " print(output[0])\n",
646 | "\n",
647 | " # First linear layer expands to d_ff dimensions\n",
648 | " # ReLU keeps only positive values\n",
649 | " # Second linear layer projects back to d_model dimensions\n",
650 | " assert output.shape == x.shape\n",
651 | " assert not torch.allclose(output, x)\n",
652 | " print(\"✓ Feed Forward Block Test Passed\")"
653 | ]
654 | },
655 | {
656 | "cell_type": "code",
657 | "execution_count": 35,
658 | "metadata": {
659 | "colab": {
660 | "base_uri": "https://localhost:8080/"
661 | },
662 | "id": "IMHE9fWB0hj3",
663 | "outputId": "7694194a-6268-41c6-b640-1e7dabbc6f6b"
664 | },
665 | "outputs": [
666 | {
667 | "name": "stdout",
668 | "output_type": "stream",
669 | "text": [
670 | "Input embeddings:\n",
671 | "tensor([[ 1.0000, 0.5000, 0.2000],\n",
672 | " [ 2.0000, -0.3000, 1.1000]])\n",
673 | "\n",
674 | "After feed-forward transformation:\n",
675 | "tensor([[ 0.1777, -0.3403, 0.1365],\n",
676 | " [ 0.0666, -0.1123, -0.0746]], grad_fn=)\n",
677 | "✓ Feed Forward Block Test Passed\n"
678 | ]
679 | }
680 | ],
681 | "source": [
682 | "test_feed_forward_block_with_example()"
683 | ]
684 | },
685 | {
686 | "cell_type": "markdown",
687 | "metadata": {
688 | "id": "bJ5EGkvdtP0O"
689 | },
690 | "source": [
691 | "## Multi-Head Attention\n",
692 | "\n",
693 | "\n",
694 | "\n",
695 | "Next up is the heart and soul of the Transformer - Multi-Head Attention.\n",
696 | "\n",
697 | "We'll break it down into the basic building blocks in code in the following section!"
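,
"\n",
"As a reference, here is the multi-head attention computation from the paper that we'll implement below (the projection matrices correspond to `w_q`, `w_k`, `w_v`, and `w_o`):\n",
"\n",
"$$\\text{MultiHead}(Q, K, V) = \\text{Concat}(\\text{head}_1, \\ldots, \\text{head}_h)W^O \\quad \\text{where} \\quad \\text{head}_i = \\text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$"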
698 | ]
699 | },
700 | {
701 | "cell_type": "markdown",
702 | "metadata": {
703 | "id": "-Oto52ZyvX0k"
704 | },
705 | "source": [
706 | "### Multi-Head Attention Class\n",
707 | "\n"
708 | ]
709 | },
710 | {
711 | "cell_type": "code",
712 | "execution_count": 36,
713 | "metadata": {
714 | "id": "mDHkw1f_vmuv"
715 | },
716 | "outputs": [],
717 | "source": [
718 | "class MultiHeadAttention(nn.Module):\n",
719 | " def __init__(self, d_model: int = 512, num_heads: int = 8, dropout: float = 0.1) -> None:\n",
720 | " super().__init__()\n",
721 | " self.d_model = d_model\n",
722 | " self.num_heads = num_heads\n",
723 | " assert d_model % num_heads == 0, \"d_model is not divisible by num_heads\"\n",
724 | "\n",
725 | " self.d_k = d_model // num_heads\n",
726 | "\n",
727 | " self.w_q = nn.Linear(d_model, d_model, bias=False)\n",
728 | " self.w_k = nn.Linear(d_model, d_model, bias=False)\n",
729 | " self.w_v = nn.Linear(d_model, d_model, bias=False)\n",
730 | "\n",
731 | " self.w_o = nn.Linear(d_model, d_model, bias=False)\n",
732 | "\n",
733 | " self.dropout = nn.Dropout(dropout)"
734 | ]
735 | },
736 | {
737 | "cell_type": "markdown",
738 | "metadata": {
739 | "id": "dEm7I6iwMiom"
740 | },
741 | "source": [
742 | "### ❓Question 4:\n",
743 | "\n",
744 | "What do Q, K, V, and O stand for in the above code?"
745 | ]
746 | },
747 | {
748 | "cell_type": "markdown",
749 | "metadata": {
750 | "id": "9dp3eM7H1cRl"
751 | },
752 | "source": [
753 | "### Testing Multi-Head Attention\n",
754 | "\n",
755 | "Let's test it out!"
756 | ]
757 | },
758 | {
759 | "cell_type": "code",
760 | "execution_count": 48,
761 | "metadata": {
762 | "id": "O6R_o-c-1fCu"
763 | },
764 | "outputs": [],
765 | "source": [
766 | "def test_multi_head_attention_with_example():\n",
767 | " mha = MultiHeadAttention(d_model=6, num_heads=2) # Small dimensions for clarity\n",
768 | "\n",
769 | " # Input sequence: [\"The\", \"cat\", \"sat\"]\n",
770 | " query = key = value = torch.tensor([\n",
771 | " [1.0, 1.0, 0.0, 0.0, 0.0, 0.0], # \"The\"\n",
772 | " [0.0, 0.0, 1.0, 1.0, 0.0, 0.0], # \"cat\"\n",
773 | " [0.0, 0.0, 0.0, 0.0, 1.0, 1.0] # \"sat\"\n",
774 | " ]).unsqueeze(0)\n",
775 | "\n",
776 | " # Allow all words to attend to each other\n",
777 | " mask = torch.ones(1, 1, 3, 3)\n",
778 | "\n",
779 | " output = mha(query, key, value, mask)\n",
780 | "\n",
781 | " print(\"Input embeddings (each row is a word):\")\n",
782 | " print(query[0])\n",
783 | " print(\"\\nAttention output (words now contain mixed information from relevant words):\")\n",
784 | " print(output[0])\n",
785 | "\n",
786 | " # Each head processes sequence differently, then results are combined\n",
787 | " assert output.shape == query.shape\n",
788 | " assert not torch.allclose(output, query)\n",
789 | " print(\"✓ Multi-Head Attention Test Passed\")"
790 | ]
791 | },
792 | {
793 | "cell_type": "code",
794 | "execution_count": 49,
795 | "metadata": {
796 | "colab": {
797 | "base_uri": "https://localhost:8080/"
798 | },
799 | "id": "owCjHXhi1gYL",
800 | "outputId": "3ba74aac-b7bf-42fa-c743-a53f335a8e88"
801 | },
802 | "outputs": [
803 | {
804 | "name": "stdout",
805 | "output_type": "stream",
806 | "text": [
807 | "Input embeddings (each row is a word):\n",
808 | "tensor([[1., 1., 0., 0., 0., 0.],\n",
809 | " [0., 0., 1., 1., 0., 0.],\n",
810 | " [0., 0., 0., 0., 1., 1.]])\n",
811 | "\n",
812 | "Attention output (words now contain mixed information from relevant words):\n",
813 | "tensor([[-0.2170, 0.0145, 0.3044, -0.2460, 0.2197, 0.2138],\n",
814 | " [-0.1175, 0.0650, 0.2979, -0.2039, 0.1218, 0.1577],\n",
815 | " [-0.1234, 0.0651, 0.2954, -0.1954, 0.1228, 0.1649]],\n",
816 | " grad_fn=)\n",
817 | "✓ Multi-Head Attention Test Passed\n"
818 | ]
819 | }
820 | ],
821 | "source": [
822 | "test_multi_head_attention_with_example()"
823 | ]
824 | },
825 | {
826 | "cell_type": "markdown",
827 | "metadata": {
828 | "id": "FyD7nAM6tb0K"
829 | },
830 | "source": [
831 | "### Scaled Dot-Product Attention\n",
832 | "\n",
833 | ""
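,
"\n",
"For reference, the function below implements the scaled dot-product attention from the paper:\n",
"\n",
"$$\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V$$"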
834 | ]
835 | },
836 | {
837 | "cell_type": "code",
838 | "execution_count": 37,
839 | "metadata": {
840 | "id": "dKk08SvowJIc"
841 | },
842 | "outputs": [],
843 | "source": [
844 | "def attention(query, key, value, mask, d_k, dropout: nn.Dropout = None):\n",
845 | " attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)\n",
846 | "\n",
847 | " if mask is not None:\n",
848 | " attention_scores.masked_fill_(mask == 0, -1e9)\n",
849 | "\n",
850 | " attention_scores = attention_scores.softmax(dim=-1)\n",
851 | "\n",
852 | " if dropout is not None:\n",
853 | " attention_scores = dropout(attention_scores)\n",
854 | "\n",
855 | " return (attention_scores @ value), attention_scores"
856 | ]
857 | },
858 | {
859 | "cell_type": "markdown",
860 | "metadata": {
861 | "id": "Ac8k7b3SxKOC"
862 | },
863 | "source": [
864 | "### Forward Method\n",
865 | "\n",
866 | "This is the code required to do a forward pass through our Multi-Head Attention module. Note that it calls `attention` as a static method - we'll attach everything to the class in the combined version below."
867 | ]
868 | },
869 | {
870 | "cell_type": "code",
871 | "execution_count": 38,
872 | "metadata": {
873 | "id": "um2Oxv1axWYK"
874 | },
875 | "outputs": [],
876 | "source": [
877 | "def forward(self, query, key, value, mask):\n",
878 | " query = self.w_q(query)\n",
879 | " key = self.w_k(key)\n",
880 | " value = self.w_v(value)\n",
881 | "\n",
882 | " query = query.view(query.shape[0], query.shape[1], self.num_heads, self.d_k).transpose(1, 2)\n",
883 | " key = key.view(key.shape[0], key.shape[1], self.num_heads, self.d_k).transpose(1, 2)\n",
884 | " value = value.view(value.shape[0], value.shape[1], self.num_heads, self.d_k).transpose(1, 2)\n",
885 | "\n",
886 | " x, self.attention_scores = MultiHeadAttention.attention(query, key, value, mask, self.dropout)\n",
887 | "\n",
888 | " x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.num_heads * self.d_k)\n",
889 | "\n",
890 | " return self.w_o(x)"
891 | ]
892 | },
893 | {
894 | "cell_type": "markdown",
895 | "metadata": {
896 | "id": "kgr_D0mmyEgp"
897 | },
898 | "source": [
899 | "### Combining it All Together"
900 | ]
901 | },
902 | {
903 | "cell_type": "code",
904 | "execution_count": 39,
905 | "metadata": {
906 | "id": "odYKdmwMyH4P"
907 | },
908 | "outputs": [],
909 | "source": [
910 | "class MultiHeadAttention(nn.Module):\n",
911 | " def __init__(self, d_model: int = 512, num_heads: int = 8, dropout: float = 0.1) -> None:\n",
912 | " super().__init__()\n",
913 | " self.d_model = d_model\n",
914 | " self.num_heads = num_heads\n",
915 | " assert d_model % num_heads == 0, \"d_model is not divisible by num_heads\"\n",
916 | "\n",
917 | " self.d_k = d_model // num_heads\n",
918 | "\n",
919 | " self.w_q = nn.Linear(d_model, d_model, bias=False)\n",
920 | " self.w_k = nn.Linear(d_model, d_model, bias=False)\n",
921 | " self.w_v = nn.Linear(d_model, d_model, bias=False)\n",
922 | "\n",
923 | " self.w_o = nn.Linear(d_model, d_model, bias=False)\n",
924 | "\n",
925 | " self.dropout = nn.Dropout(dropout)\n",
926 | "\n",
927 | " @staticmethod\n",
928 | " def attention(query, key, value, mask, dropout: nn.Dropout = None):\n",
929 | " d_k = query.shape[-1]\n",
930 | "\n",
931 | " attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)\n",
932 | "\n",
933 | " if mask is not None:\n",
934 | " attention_scores.masked_fill_(mask == 0, -1e9)\n",
935 | "\n",
936 | " attention_scores = attention_scores.softmax(dim=-1)\n",
937 | "\n",
938 | " if dropout is not None:\n",
939 | " attention_scores = dropout(attention_scores)\n",
940 | "\n",
941 | " return (attention_scores @ value), attention_scores\n",
942 | "\n",
943 | " def forward(self, query, key, value, mask):\n",
944 | " query = self.w_q(query)\n",
945 | " key = self.w_k(key)\n",
946 | " value = self.w_v(value)\n",
947 | "\n",
948 | " query = query.view(query.shape[0], query.shape[1], self.num_heads, self.d_k).transpose(1, 2)\n",
949 | " key = key.view(key.shape[0], key.shape[1], self.num_heads, self.d_k).transpose(1, 2)\n",
950 | " value = value.view(value.shape[0], value.shape[1], self.num_heads, self.d_k).transpose(1, 2)\n",
951 | "\n",
952 | " x, self.attention_scores = MultiHeadAttention.attention(query, key, value, mask, self.dropout)\n",
953 | "\n",
954 | " x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.num_heads * self.d_k)\n",
955 | "\n",
956 | " return self.w_o(x)"
957 | ]
958 | },
959 | {
960 | "cell_type": "markdown",
961 | "metadata": {
962 | "id": "qRFqoZyD1-AU"
963 | },
964 | "source": [
965 | "### Testing MultiHeadAttention\n",
966 | "\n",
967 | "Let's test it out!"
968 | ]
969 | },
970 | {
971 | "cell_type": "code",
972 | "execution_count": 52,
973 | "metadata": {
974 | "id": "sY1N60di2CHQ"
975 | },
976 | "outputs": [],
977 | "source": [
978 | "def test_attention_mechanism():\n",
979 | " mha = MultiHeadAttention(d_model=6, num_heads=2)\n",
980 | "\n",
981 | " # Simple sequence: \"The cat sleeps\"\n",
982 | " seq = torch.tensor([\n",
983 | " [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],\n",
984 | " [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],\n",
985 | " [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]\n",
986 | " ]).unsqueeze(0) # [1, 3, 6]\n",
987 | "\n",
988 | " # Mask shape needs to match attention scores [batch, heads, seq_len, seq_len]\n",
989 | " mask = torch.ones(1, 2, 3, 3) # all-ones mask: 2 heads, sequence length 3\n",
990 | "\n",
991 | " print(\"Input sequence shape:\", seq.shape)\n",
992 | " print(\"Input values (each row is a word):\")\n",
993 | " print(seq[0])\n",
994 | "\n",
995 | " output = mha(seq, seq, seq, mask)\n",
996 | " print(\"\\nOutput after attention:\")\n",
997 | " print(output[0])\n",
998 | "\n",
999 | " # Verify output maintains shape but changes values\n",
1000 | " assert output.shape == seq.shape\n",
1001 | " assert not torch.allclose(output, seq)\n",
1002 | " print(\"✓ Multi-Head Attention Test Passed\")"
1003 | ]
1004 | },
1005 | {
1006 | "cell_type": "code",
1007 | "execution_count": 53,
1008 | "metadata": {
1009 | "colab": {
1010 | "base_uri": "https://localhost:8080/"
1011 | },
1012 | "id": "2qEqn4BS2DHt",
1013 | "outputId": "9d8334b1-3063-4f07-bf64-069832add766"
1014 | },
1015 | "outputs": [
1016 | {
1017 | "name": "stdout",
1018 | "output_type": "stream",
1019 | "text": [
1020 | "Input sequence shape: torch.Size([1, 3, 6])\n",
1021 | "Input values (each row is a word):\n",
1022 | "tensor([[1., 1., 0., 0., 0., 0.],\n",
1023 | " [0., 0., 1., 1., 0., 0.],\n",
1024 | " [0., 0., 0., 0., 1., 1.]])\n",
1025 | "\n",
1026 | "Output after attention:\n",
1027 | "tensor([[-0.0902, -0.1290, 0.1568, -0.0584, -0.0283, -0.0413],\n",
1028 | " [ 0.0680, -0.1830, 0.1265, 0.1465, 0.0071, -0.0246],\n",
1029 | " [-0.1006, -0.1287, 0.1628, -0.0680, -0.0278, -0.0481]],\n",
1030 | " grad_fn=)\n",
1031 | "✓ Multi-Head Attention Test Passed\n"
1032 | ]
1033 | }
1034 | ],
1035 | "source": [
1036 | "test_attention_mechanism()"
1037 | ]
1038 | },
1039 | {
1040 | "cell_type": "markdown",
1041 | "metadata": {
1042 | "id": "EkKjLhpiyz5b"
1043 | },
1044 | "source": [
1045 | "## Encoder\n",
1046 | "\n",
1047 | "When we pass information through our model - the first thing we will do is Encode it by passing it through our Encoder Blocks.\n"
1048 | ]
1049 | },
1050 | {
1051 | "cell_type": "markdown",
1052 | "metadata": {
1053 | "id": "f-0edIfMzijj"
1054 | },
1055 | "source": [
1056 | "### Encoder Block\n",
1057 | "\n",
1058 | "\n",
1059 | "\n",
1060 | "The encoder takes in the source language sentence (e.g. English). Each word is converted into a vector representation using an embedding layer. Then a positional encoder adds information about the position of each word. This goes through multiple self-attention layers, where each word vector attends to all other word vectors to build contextual representations."
1061 | ]
1062 | },
1063 | {
1064 | "cell_type": "code",
1065 | "execution_count": 40,
1066 | "metadata": {
1067 | "id": "dMVnZiGDy1PG"
1068 | },
1069 | "outputs": [],
1070 | "source": [
1071 | "class EncoderBlock(nn.Module):\n",
1072 | " def __init__(self, features: int, self_attention_block: MultiHeadAttention, feed_forward_block: FeedForwardBlock, dropout: float) -> None:\n",
1073 | " super().__init__()\n",
1074 | " self.self_attention_block = self_attention_block\n",
1075 | " self.feed_forward_block = feed_forward_block\n",
1076 | " self.residual_connections = nn.ModuleList([ResidualConnection(features, dropout) for _ in range(2)])\n",
1077 | "\n",
1078 | " def forward(self, x, input_mask):\n",
1079 | " x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, input_mask))\n",
1080 | " x = self.residual_connections[1](x, self.feed_forward_block)\n",
1081 | " return x"
1082 | ]
1083 | },
1084 | {
1085 | "cell_type": "markdown",
1086 | "metadata": {
1087 | "id": "_twqqReS28ul"
1088 | },
1089 | "source": [
1090 | "### Testing the EncoderBlock\n",
1091 | "\n",
1092 | "Testing time!"
1093 | ]
1094 | },
1095 | {
1096 | "cell_type": "code",
1097 | "execution_count": 54,
1098 | "metadata": {
1099 | "id": "PP_wq0R02_g0"
1100 | },
1101 | "outputs": [],
1102 | "source": [
1103 | "def test_encoder_block():\n",
1104 | " # Create encoder block with small dimensions\n",
1105 | " mha = MultiHeadAttention(d_model=6, num_heads=2)\n",
1106 | " ff = FeedForwardBlock(d_model=6, d_ff=12)\n",
1107 | " encoder = EncoderBlock(features=6, self_attention_block=mha, feed_forward_block=ff, dropout=0.1)\n",
1108 | "\n",
1109 | " # Input: \"The cat sleeps\"\n",
1110 | " x = torch.tensor([\n",
1111 | " [1.0, 1.0, 0.0, 0.0, 0.0, 0.0], # \"The\"\n",
1112 | " [0.0, 0.0, 1.0, 1.0, 0.0, 0.0], # \"cat\"\n",
1113 | " [0.0, 0.0, 0.0, 0.0, 1.0, 1.0] # \"sleeps\"\n",
1114 | " ]).unsqueeze(0)\n",
1115 | "\n",
1116 | " # Attention mask\n",
1117 | " mask = torch.ones(1, 2, 3, 3) # Allow all connections\n",
1118 | "\n",
1119 | " output = encoder(x, mask)\n",
1120 | "\n",
1121 | " print(\"Input sequence:\")\n",
1122 | " print(x[0])\n",
1123 | " print(\"\\nAfter encoder processing (self-attention + feed-forward):\")\n",
1124 | " print(output[0])\n",
1125 | "\n",
1126 | " assert output.shape == x.shape\n",
1127 | " assert not torch.allclose(output, x)\n",
1128 | " print(\"✓ Encoder Block Test Passed\")"
1129 | ]
1130 | },
1131 | {
1132 | "cell_type": "code",
1133 | "execution_count": 55,
1134 | "metadata": {
1135 | "colab": {
1136 | "base_uri": "https://localhost:8080/"
1137 | },
1138 | "id": "DPAH-NKA3AEN",
1139 | "outputId": "81589cd6-c882-4469-c38f-536be8c32d1b"
1140 | },
1141 | "outputs": [
1142 | {
1143 | "name": "stdout",
1144 | "output_type": "stream",
1145 | "text": [
1146 | "Input sequence:\n",
1147 | "tensor([[1., 1., 0., 0., 0., 0.],\n",
1148 | " [0., 0., 1., 1., 0., 0.],\n",
1149 | " [0., 0., 0., 0., 1., 1.]])\n",
1150 | "\n",
1151 | "After encoder processing (self-attention + feed-forward):\n",
1152 | "tensor([[ 1.5212, 0.8665, -0.8088, 0.6585, 0.7103, -1.0204],\n",
1153 | " [ 0.5185, 0.0000, 0.8880, 1.7109, -0.4607, 0.0312],\n",
1154 | " [-0.4288, 0.2765, -0.0639, 0.2137, 1.1175, 0.4317]],\n",
1155 | " grad_fn=)\n",
1156 | "✓ Encoder Block Test Passed\n"
1157 | ]
1158 | }
1159 | ],
1160 | "source": [
1161 | "test_encoder_block()"
1162 | ]
1163 | },
1164 | {
1165 | "cell_type": "markdown",
1166 | "metadata": {
1167 | "id": "a-AvnoPrzvwu"
1168 | },
1169 | "source": [
1170 | "### Encoder Stack\n",
1171 | "\n",
1172 | "Following along from the original paper - we will organize these blocks into a set of 6.\n",
1173 | "\n",
1174 | "These 6 Encoder Blocks (each with 8 Attention Heads) will comprise our Encoding Stack."
1175 | ]
1176 | },
1177 | {
1178 | "cell_type": "code",
1179 | "execution_count": 56,
1180 | "metadata": {
1181 | "id": "kOaR5SjUzxv7"
1182 | },
1183 | "outputs": [],
1184 | "source": [
1185 | "class EncoderStack(nn.Module):\n",
1186 | " def __init__(self, features: int, layers: nn.ModuleList) -> None:\n",
1187 | " super().__init__()\n",
1188 | " self.layers = layers\n",
1189 | " self.norm = LayerNormalization(features)\n",
1190 | "\n",
1191 | " def forward(self, x, mask):\n",
1192 | " for layer in self.layers:\n",
1193 | " x = layer(x, mask)\n",
1194 | " return self.norm(x)"
1195 | ]
1196 | },
1197 | {
1198 | "cell_type": "markdown",
1199 | "metadata": {
1200 | "id": "fQyujsRTz5NT"
1201 | },
1202 | "source": [
1203 | "## Decoder\n",
1204 | "\n",
1205 | "Next, we will take the encoded sequence and decode it through our Decoder Blocks."
1206 | ]
1207 | },
1208 | {
1209 | "cell_type": "markdown",
1210 | "metadata": {
1211 | "id": "zBYgl77Kz6bx"
1212 | },
1213 | "source": [
1214 | "### Decoder Block\n",
1215 | "\n",
1216 | "\n",
1217 | "\n",
1218 | "The decoder takes in the target language sentence (e.g. Italian). It also converts words to vectors and adds positional info. Then it goes through self-attention layers. Here, a mask is applied so each word can only see the words before it, not after.\n",
1219 | "\n",
1220 | "The decoder also does attention over the encoder output. This allows each Italian word to find relevant connections with the English words."
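,
"\n",
"As an aside on the mask mentioned above: a causal (look-ahead) mask over a 3-token target sequence keeps only the lower triangle, so each position can attend only to itself and earlier positions:\n",
"\n",
"$$\\begin{bmatrix} 1 & 0 & 0 \\\\ 1 & 1 & 0 \\\\ 1 & 1 & 1 \\end{bmatrix}$$"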
1221 | ]
1222 | },
1223 | {
1224 | "cell_type": "code",
1225 | "execution_count": 42,
1226 | "metadata": {
1227 | "id": "SIwafbqzz5-n"
1228 | },
1229 | "outputs": [],
1230 | "source": [
1231 | "class DecoderBlock(nn.Module):\n",
1232 | " def __init__(self, features: int, self_attention_block: MultiHeadAttention, cross_attention_block: MultiHeadAttention, feed_forward_block: FeedForwardBlock, dropout: float) -> None:\n",
1233 | " super().__init__()\n",
1234 | " self.self_attention_block = self_attention_block\n",
1235 | " self.cross_attention_block = cross_attention_block\n",
1236 | " self.feed_forward_block = feed_forward_block\n",
1237 | " self.residual_connections = nn.ModuleList([ResidualConnection(features, dropout) for _ in range(3)])\n",
1238 | "\n",
1239 | " def forward(self, x, encoder_output, input_mask, target_mask):\n",
1240 | " x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, target_mask))\n",
1241 | " x = self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, input_mask))\n",
1242 | " x = self.residual_connections[2](x, self.feed_forward_block)\n",
1243 | " return x"
1244 | ]
1245 | },
1246 | {
1247 | "cell_type": "markdown",
1248 | "metadata": {
1249 | "id": "s0KkwjkM3xe7"
1250 | },
1251 | "source": [
1252 | "### Testing DecoderBlock\n",
1253 | "\n",
1254 | "You know what's up next...testing!"
1255 | ]
1256 | },
1257 | {
1258 | "cell_type": "code",
1259 | "execution_count": 57,
1260 | "metadata": {
1261 | "id": "UnmnWAVl307S"
1262 | },
1263 | "outputs": [],
1264 | "source": [
1265 | "def test_decoder_block():\n",
1266 | " # Initialize components with small dimensions\n",
1267 | " self_attn = MultiHeadAttention(d_model=6, num_heads=2)\n",
1268 | " cross_attn = MultiHeadAttention(d_model=6, num_heads=2)\n",
1269 | " ff = FeedForwardBlock(d_model=6, d_ff=12)\n",
1270 | " decoder = DecoderBlock(features=6, self_attention_block=self_attn,\n",
1271 | " cross_attention_block=cross_attn,\n",
1272 | " feed_forward_block=ff, dropout=0.1)\n",
1273 | "\n",
1274 | " # Input: \"El gato\" (target sequence)\n",
1275 | " x = torch.tensor([\n",
1276 | " [1.0, 1.0, 0.0, 0.0, 0.0, 0.0], # \"El\"\n",
1277 | " [0.0, 0.0, 1.0, 1.0, 0.0, 0.0], # \"gato\"\n",
1278 | " ]).unsqueeze(0)\n",
1279 | "\n",
1280 | " # Encoder output: \"The cat\" (source sequence)\n",
1281 | " encoder_output = torch.tensor([\n",
1282 | " [1.0, 1.0, 0.0, 0.0, 0.0, 0.0], # \"The\"\n",
1283 | " [0.0, 0.0, 1.0, 1.0, 0.0, 0.0], # \"cat\"\n",
1284 | " ]).unsqueeze(0)\n",
1285 | "\n",
1286 | " # Masks\n",
1287 | " src_mask = torch.ones(1, 2, 2, 2) # Can attend to all encoder outputs\n",
1288 | " tgt_mask = torch.tril(torch.ones(1, 2, 2, 2)) # Can only attend to previous words\n",
1289 | "\n",
1290 | " output = decoder(x, encoder_output, src_mask, tgt_mask)\n",
1291 | "\n",
1292 | " print(\"Input target sequence:\")\n",
1293 | " print(x[0])\n",
1294 | " print(\"\\nSource (encoder) sequence:\")\n",
1295 | " print(encoder_output[0])\n",
1296 | " print(\"\\nDecoder output (after self-attention, cross-attention, and feed-forward):\")\n",
1297 | " print(output[0])\n",
1298 | "\n",
1299 | " assert output.shape == x.shape\n",
1300 | " assert not torch.allclose(output, x)\n",
1301 | " print(\"✓ Decoder Block Test Passed\")"
1302 | ]
1303 | },
1304 | {
1305 | "cell_type": "code",
1306 | "execution_count": 58,
1307 | "metadata": {
1308 | "colab": {
1309 | "base_uri": "https://localhost:8080/"
1310 | },
1311 | "id": "COJlCq3033A2",
1312 | "outputId": "6b4ef755-228d-49c3-f8e5-fff106168a88"
1313 | },
1314 | "outputs": [
1315 | {
1316 | "name": "stdout",
1317 | "output_type": "stream",
1318 | "text": [
1319 | "Input target sequence:\n",
1320 | "tensor([[1., 1., 0., 0., 0., 0.],\n",
1321 | " [0., 0., 1., 1., 0., 0.]])\n",
1322 | "\n",
1323 | "Source (encoder) sequence:\n",
1324 | "tensor([[1., 1., 0., 0., 0., 0.],\n",
1325 | " [0., 0., 1., 1., 0., 0.]])\n",
1326 | "\n",
1327 | "Decoder output (after self-attention, cross-attention, and feed-forward):\n",
1328 | "tensor([[ 0.7727, 2.0374, 0.6013, 0.8936, -0.6175, 0.4774],\n",
1329 | " [ 0.0142, 0.0789, 0.7944, 1.5585, -0.4808, -0.0209]],\n",
1330 | " grad_fn=)\n",
1331 | "✓ Decoder Block Test Passed\n"
1332 | ]
1333 | }
1334 | ],
1335 | "source": [
1336 | "test_decoder_block()"
1337 | ]
1338 | },
1339 | {
1340 | "cell_type": "markdown",
1341 | "metadata": {
1342 | "id": "sa4yiTNn0BkA"
1343 | },
1344 | "source": [
1345 | "### Decoder Stack\n",
1346 | "\n",
1347 | "We'll use the same number of Decoder Blocks as we did Encoder Blocks - leaving us with 6 Deocder Blocks in our Decoder Stack."
1348 | ]
1349 | },
1350 | {
1351 | "cell_type": "code",
1352 | "execution_count": 43,
1353 | "metadata": {
1354 | "id": "1QUXzOXk0CcT"
1355 | },
1356 | "outputs": [],
1357 | "source": [
1358 | "class DecoderStack(nn.Module):\n",
1359 | " def __init__(self, features: int, layers: nn.ModuleList) -> None:\n",
1360 | " super().__init__()\n",
1361 | " self.layers = layers\n",
1362 | " self.norm = LayerNormalization(features)\n",
1363 | "\n",
1364 | " def forward(self, x, encoder_output, input_mask, target_mask):\n",
1365 | " for layer in self.layers:\n",
1366 | " x = layer(x, encoder_output, input_mask, target_mask)\n",
1367 | " return self.norm(x)"
1368 | ]
1369 | },
1370 | {
1371 | "cell_type": "markdown",
1372 | "metadata": {
1373 | "id": "kRFiAP580S4b"
1374 | },
1375 | "source": [
1376 | "## Linear Projection Layer\n",
1377 | "\n",
1378 | "After the decoder's self-attention and encoder-decoder attention layers, we have a context vector representing each Italian word. This context vector has a high dimension (e.g. 512 or 1024).\n",
1379 | "\n",
1380 | "We want to take this context vector and generate a probability distribution over the French vocabulary so we can pick the next translated word.\n",
1381 | "\n",
1382 | "The linear projection layer helps with this. It projects the context vector into a much larger vector called the vocabulary distribution - one entry per word in the vocabulary.\n",
1383 | "\n",
1384 | "For example, if our Italian vocabulary has 50,000 words, the vocabulary distribution will have 50,000 dimensions. Each dimension corresponds to the probability of that Italian word being the correct translation."
1385 | ]
1386 | },
1387 | {
1388 | "cell_type": "code",
1389 | "execution_count": 44,
1390 | "metadata": {
1391 | "id": "tkBBMAZK0WLB"
1392 | },
1393 | "outputs": [],
1394 | "source": [
1395 | "class LinearProjectionLayer(nn.Module):\n",
1396 | " def __init__(self, d_model, vocab_size) -> None:\n",
1397 | " super().__init__()\n",
1398 | " self.proj = nn.Linear(d_model, vocab_size)\n",
1399 | "\n",
1400 | " def forward(self, x) -> None:\n",
1401 | " return self.proj(x)"
1402 | ]
1403 | },
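{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the projection layer returns raw scores (logits) rather than probabilities - during training, `nn.CrossEntropyLoss` applies the softmax internally, so we feed it the logits directly. At inference time we can apply the softmax ourselves. Here's a minimal, illustrative sketch (the shapes and vocabulary size below are made up, and it assumes the `LinearProjectionLayer` defined above):\n",
"\n",
"```python\n",
"# Illustrative only: turn projected logits into a probability distribution.\n",
"proj = LinearProjectionLayer(d_model=512, vocab_size=50_000)\n",
"context = torch.randn(1, 10, 512)       # (batch, seq_len, d_model)\n",
"logits = proj(context)                  # (1, 10, 50_000) raw scores (logits)\n",
"probs = torch.softmax(logits, dim=-1)   # probability distribution over the vocabulary\n",
"next_token_id = probs[0, -1].argmax()   # most likely next token at the last position\n",
"```"
]
},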
1404 | {
1405 | "cell_type": "markdown",
1406 | "metadata": {
1407 | "id": "v9ucsRWs0lG9"
1408 | },
1409 | "source": [
1410 | "## The Transformer\n",
1411 | "\n",
1412 | "At this point, all we need to do is create a class that represents our model!"
1413 | ]
1414 | },
1415 | {
1416 | "cell_type": "code",
1417 | "execution_count": 45,
1418 | "metadata": {
1419 | "id": "5ip11mmQ0nMM"
1420 | },
1421 | "outputs": [],
1422 | "source": [
1423 | "class Transformer(nn.Module):\n",
1424 | " def __init__(self, encoder: EncoderBlock, decoder: DecoderBlock, src_embed: InputEmbeddings, tgt_embed: InputEmbeddings, src_pos: PositionalEncoding, tgt_pos: PositionalEncoding, projection_layer: LinearProjectionLayer) -> None:\n",
1425 | " super().__init__()\n",
1426 | " self.encoder = encoder\n",
1427 | " self.decoder = decoder\n",
1428 | " self.src_embed = src_embed\n",
1429 | " self.tgt_embed = tgt_embed\n",
1430 | " self.src_pos = src_pos\n",
1431 | " self.tgt_pos = tgt_pos\n",
1432 | " self.projection_layer = projection_layer\n",
1433 | "\n",
1434 | " def encode(self, src, src_mask):\n",
1435 | " src = self.src_embed(src)\n",
1436 | " src = self.src_pos(src)\n",
1437 | " return self.encoder(src, src_mask)\n",
1438 | "\n",
1439 | " def decode(self, encoder_output: torch.Tensor, src_mask: torch.Tensor, tgt: torch.Tensor, tgt_mask: torch.Tensor):\n",
1440 | " tgt = self.tgt_embed(tgt)\n",
1441 | " tgt = self.tgt_pos(tgt)\n",
1442 | " return self.decoder(tgt, encoder_output, src_mask, tgt_mask)\n",
1443 | "\n",
1444 | " def project(self, x):\n",
1445 | " return self.projection_layer(x)"
1446 | ]
1447 | },
1448 | {
1449 | "cell_type": "markdown",
1450 | "metadata": {
1451 | "id": "4Ssr5nA039--"
1452 | },
1453 | "source": [
1454 | "## Building Our Transformer\n",
1455 | "\n",
1456 | "Now that we have each of our components - we need to construct an actual model!\n",
1457 | "\n",
1458 | "We'll use this helper function to aid in our goal and set up our Encoder/Decoder Stacks!"
1459 | ]
1460 | },
1461 | {
1462 | "cell_type": "code",
1463 | "execution_count": 46,
1464 | "metadata": {
1465 | "id": "kdCt3wNi4EvY"
1466 | },
1467 | "outputs": [],
1468 | "source": [
1469 | "def build_transformer(input_vocab_size: int, target_vocab_size: int, input_seq_len: int, target_seq_len: int, d_model: int=512, N: int=6, num_heads: int=8, dropout: float=0.1, d_ff: int=2048, verbose=True) -> Transformer:\n",
1470 | " input_embeddings = InputEmbeddings(d_model, input_vocab_size, verbose=verbose)\n",
1471 | " target_embeddings = InputEmbeddings(d_model, target_vocab_size)\n",
1472 | "\n",
1473 | " input_position = PositionalEncoding(d_model, input_seq_len, dropout, verbose=verbose)\n",
1474 | " target_position = PositionalEncoding(d_model, target_seq_len, dropout)\n",
1475 | "\n",
1476 | " encoder_blocks = []\n",
1477 | "\n",
1478 | " for _ in range(N):\n",
1479 | " encoder_self_attention_block = MultiHeadAttention(d_model, num_heads, dropout)\n",
1480 | " feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)\n",
1481 | " encoder_block = EncoderBlock(d_model, encoder_self_attention_block, feed_forward_block, dropout)\n",
1482 | " encoder_blocks.append(encoder_block)\n",
1483 | "\n",
1484 | " decoder_blocks = []\n",
1485 | "\n",
1486 | " for _ in range(N):\n",
1487 | " decoder_self_attention_block = MultiHeadAttention(d_model, num_heads, dropout)\n",
1488 | " decoder_cross_attention_block = MultiHeadAttention(d_model, num_heads, dropout)\n",
1489 | " feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)\n",
1490 | " decoder_block = DecoderBlock(d_model, decoder_self_attention_block, decoder_cross_attention_block, feed_forward_block, dropout)\n",
1491 | " decoder_blocks.append(decoder_block)\n",
1492 | "\n",
1493 | " encoder_stack = EncoderStack(d_model, nn.ModuleList(encoder_blocks))\n",
1494 | " decoder_stack = DecoderStack(d_model, nn.ModuleList(decoder_blocks))\n",
1495 | "\n",
1496 | " linear_projection_layer = LinearProjectionLayer(d_model, target_vocab_size)\n",
1497 | "\n",
1498 | " transformer = Transformer(encoder_stack, decoder_stack, input_embeddings, target_embeddings, input_position, target_position, linear_projection_layer)\n",
1499 | "\n",
1500 | " for p in transformer.parameters():\n",
1501 | " if p.dim() > 1:\n",
1502 | " nn.init.xavier_uniform_(p)\n",
1503 | "\n",
1504 | " return transformer"
1505 | ]
1506 | },
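{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, we can call `build_transformer` with small hyperparameters and count the trainable parameters. This is an illustrative sketch only - the vocabulary sizes, sequence lengths, and dimensions below are made up and are not the configuration we'll train with:\n",
"\n",
"```python\n",
"# Illustrative only: build a tiny Transformer and count its trainable parameters.\n",
"tiny_model = build_transformer(\n",
"    input_vocab_size=1_000, target_vocab_size=1_000,\n",
"    input_seq_len=64, target_seq_len=64,\n",
"    d_model=128, N=2, num_heads=4, d_ff=256, verbose=False,\n",
")\n",
"num_params = sum(p.numel() for p in tiny_model.parameters() if p.requires_grad)\n",
"print(f'Tiny model parameters: {num_params:,}')\n",
"```"
]
},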
1507 | {
1508 | "cell_type": "markdown",
1509 | "metadata": {
1510 | "id": "_2jfvmRx4ZNd"
1511 | },
1512 | "source": [
1513 | "# Training Our Transformer!\n",
1514 | "\n",
1515 | "We will be using the resources created in [this](https://github.com/hkproj/pytorch-transformer/tree/main) repository to train our model on a English -> French translation task.\n",
1516 | "\n"
1517 | ]
1518 | },
1519 | {
1520 | "cell_type": "markdown",
1521 | "metadata": {
1522 | "id": "SxF6JLcg5LJp"
1523 | },
1524 | "source": [
1525 | "## Dataset Creation\n",
1526 | "\n",
1527 | "The BilingualDataset is a custom PyTorch dataset for working with translation data. It needs a tokenizer for each language, a dataset of sentence pairs, info on which languages are source and target, and the max sequence length.\n",
1528 | "\n",
1529 | "This class handles tokenizing the sentences, padding them to be the same length, and getting the data into the right format for sequence-to-sequence models. It adds special start, end, and padding tokens so all the inputs and outputs are the same length.\n",
1530 | "\n",
1531 | "When you grab a sample from the dataset, it tokenizes the source and target sentences, pads them, and creates the input tensors the model needs - encoder input, decoder input, and target labels. It also makes masks to show what's real data vs padding, and to make sure the decoder predictions only use previous tokens, not future ones.\n",
1532 | "\n",
1533 | "The BilingualDataset gets the data ready for training seq2seq models in a way that works with the sequential nature of language. The model can only predict the next token based on what came before it, not after."
1534 | ]
1535 | },
1536 | {
1537 | "cell_type": "code",
1538 | "execution_count": 47,
1539 | "metadata": {
1540 | "id": "WLnUZRgk7S-G"
1541 | },
1542 | "outputs": [],
1543 | "source": [
1544 | "from torch.utils.data import Dataset\n",
1545 | "\n",
1546 | "class BilingualDataset(Dataset):\n",
1547 | " def __init__(self, ds, tokenizer_src, tokenizer_tgt, src_lang, tgt_lang, seq_len):\n",
1548 | " super().__init__()\n",
1549 | " self.seq_len = seq_len\n",
1550 | "\n",
1551 | " self.ds = ds\n",
1552 | " self.tokenizer_src = tokenizer_src\n",
1553 | " self.tokenizer_tgt = tokenizer_tgt\n",
1554 | " self.src_lang = src_lang\n",
1555 | " self.tgt_lang = tgt_lang\n",
1556 | "\n",
1557 | " self.sos_token = torch.tensor([tokenizer_tgt.token_to_id(\"[SOS]\")], dtype=torch.int64)\n",
1558 | " self.eos_token = torch.tensor([tokenizer_tgt.token_to_id(\"[EOS]\")], dtype=torch.int64)\n",
1559 | " self.pad_token = torch.tensor([tokenizer_tgt.token_to_id(\"[PAD]\")], dtype=torch.int64)\n",
1560 | "\n",
1561 | " def __len__(self):\n",
1562 | " return len(self.ds)\n",
1563 | "\n",
1564 | " def __getitem__(self, idx):\n",
1565 | " src_target_pair = self.ds[idx]\n",
1566 | " src_text = src_target_pair['translation'][self.src_lang]\n",
1567 | " tgt_text = src_target_pair['translation'][self.tgt_lang]\n",
1568 | "\n",
1569 | " enc_input_tokens = self.tokenizer_src.encode(src_text).ids\n",
1570 | " dec_input_tokens = self.tokenizer_tgt.encode(tgt_text).ids\n",
1571 | "\n",
1572 | " enc_num_padding_tokens = self.seq_len - len(enc_input_tokens) - 2\n",
1573 | " dec_num_padding_tokens = self.seq_len - len(dec_input_tokens) - 1\n",
1574 | "\n",
1575 | " if enc_num_padding_tokens < 0 or dec_num_padding_tokens < 0:\n",
1576 | " raise ValueError(\"Sentence is too long\")\n",
1577 | "\n",
1578 | " encoder_input = torch.cat(\n",
1579 | " [\n",
1580 | " self.sos_token,\n",
1581 | " torch.tensor(enc_input_tokens, dtype=torch.int64),\n",
1582 | " self.eos_token,\n",
1583 | " torch.tensor([self.pad_token] * enc_num_padding_tokens, dtype=torch.int64),\n",
1584 | " ],\n",
1585 | " dim=0,\n",
1586 | " )\n",
1587 | "\n",
1588 | " decoder_input = torch.cat(\n",
1589 | " [\n",
1590 | " self.sos_token,\n",
1591 | " torch.tensor(dec_input_tokens, dtype=torch.int64),\n",
1592 | " torch.tensor([self.pad_token] * dec_num_padding_tokens, dtype=torch.int64),\n",
1593 | " ],\n",
1594 | " dim=0,\n",
1595 | " )\n",
1596 | "\n",
1597 | " label = torch.cat(\n",
1598 | " [\n",
1599 | " torch.tensor(dec_input_tokens, dtype=torch.int64),\n",
1600 | " self.eos_token,\n",
1601 | " torch.tensor([self.pad_token] * dec_num_padding_tokens, dtype=torch.int64),\n",
1602 | " ],\n",
1603 | " dim=0,\n",
1604 | " )\n",
1605 | "\n",
1606 | " assert encoder_input.size(0) == self.seq_len\n",
1607 | " assert decoder_input.size(0) == self.seq_len\n",
1608 | " assert label.size(0) == self.seq_len\n",
1609 | "\n",
1610 | " return {\n",
1611 | " \"encoder_input\": encoder_input,\n",
1612 | " \"decoder_input\": decoder_input,\n",
1613 | " \"encoder_mask\": (encoder_input != self.pad_token).unsqueeze(0).unsqueeze(0).int(),\n",
1614 | " \"decoder_mask\": (decoder_input != self.pad_token).unsqueeze(0).int() & causal_mask(decoder_input.size(0)),\n",
1615 | " \"label\": label,\n",
1616 | " \"src_text\": src_text,\n",
1617 | " \"tgt_text\": tgt_text,\n",
1618 | " }\n",
1619 | "\n",
1620 | "def causal_mask(size):\n",
1621 | " mask = torch.triu(torch.ones((1, size, size)), diagonal=1).type(torch.int)\n",
1622 | " return mask == 0"
1623 | ]
1624 | },
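{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for the `causal_mask` helper, here's a small, illustrative example (the size is arbitrary) - each row `i` is `True` only for columns `<= i`, so a token can attend to itself and to earlier tokens, but never to future ones:\n",
"\n",
"```python\n",
"# Illustrative only: inspect a 4-token causal mask.\n",
"print(causal_mask(4))\n",
"# tensor([[[ True, False, False, False],\n",
"#          [ True,  True, False, False],\n",
"#          [ True,  True,  True, False],\n",
"#          [ True,  True,  True,  True]]])\n",
"```"
]
},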
1625 | {
1626 | "cell_type": "markdown",
1627 | "metadata": {
1628 | "id": "E0hSYI7F8IAI"
1629 | },
1630 | "source": [
1631 | "## Build Tokenizer For Training"
1632 | ]
1633 | },
1634 | {
1635 | "cell_type": "code",
1636 | "execution_count": null,
1637 | "metadata": {
1638 | "colab": {
1639 | "base_uri": "https://localhost:8080/"
1640 | },
1641 | "id": "QfmUwnpo8qou",
1642 | "outputId": "a9c7aaaf-2a26-4729-c0ee-4e4407508b15"
1643 | },
1644 | "outputs": [
1645 | {
1646 | "name": "stdout",
1647 | "output_type": "stream",
1648 | "text": [
1649 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m44.1/44.1 kB\u001b[0m \u001b[31m3.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
1650 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m10.0/10.0 MB\u001b[0m \u001b[31m113.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
1651 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m36.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
1652 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m11.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
1653 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m179.3/179.3 kB\u001b[0m \u001b[31m16.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
1654 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m13.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
1655 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m15.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
1656 | "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
1657 | "gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.\u001b[0m\u001b[31m\n",
1658 | "\u001b[0m"
1659 | ]
1660 | }
1661 | ],
1662 | "source": [
1663 | "!pip install transformers tokenizers datasets -qU"
1664 | ]
1665 | },
1666 | {
1667 | "cell_type": "markdown",
1668 | "metadata": {
1669 | "id": "1TCDOGq0UD1W"
1670 | },
1671 | "source": [
1672 | "This will grab all the sentences from our dataset per language."
1673 | ]
1674 | },
1675 | {
1676 | "cell_type": "code",
1677 | "execution_count": null,
1678 | "metadata": {
1679 | "id": "iAUUEXFz8eGH"
1680 | },
1681 | "outputs": [],
1682 | "source": [
1683 | "def get_all_sentences(ds, lang):\n",
1684 | " for item in ds:\n",
1685 | " yield item['translation'][lang]"
1686 | ]
1687 | },
1688 | {
1689 | "cell_type": "markdown",
1690 | "metadata": {
1691 | "id": "vUDKOCTVUIa7"
1692 | },
1693 | "source": [
1694 | "We'll quickly train a tokenizer on our dataset for both our source and target languages.\n",
1695 | "\n",
1696 | "We'll be sure to add the `[UNK]`, `[PAD]`, `[SOS]`, and `[EOS]` special tokens."
1697 | ]
1698 | },
1699 | {
1700 | "cell_type": "code",
1701 | "execution_count": null,
1702 | "metadata": {
1703 | "id": "4s1ZHzKb8hTN"
1704 | },
1705 | "outputs": [],
1706 | "source": [
1707 | "from datasets import load_dataset\n",
1708 | "from tokenizers import Tokenizer\n",
1709 | "from tokenizers.models import WordLevel\n",
1710 | "from tokenizers.trainers import WordLevelTrainer\n",
1711 | "from tokenizers.pre_tokenizers import Whitespace\n",
1712 | "\n",
1713 | "def build_tokenizer(config, ds, lang):\n",
1714 | " tokenizer = Tokenizer(WordLevel(unk_token=\"[UNK]\"))\n",
1715 | " tokenizer.pre_tokenizer = Whitespace()\n",
1716 | " trainer = WordLevelTrainer(special_tokens=[\"[UNK]\", \"[PAD]\", \"[SOS]\", \"[EOS]\"], min_frequency=2)\n",
1717 | " tokenizer.train_from_iterator(get_all_sentences(ds, lang), trainer=trainer)\n",
1718 | " return tokenizer"
1719 | ]
1720 | },
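{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a small, illustrative sketch of what a trained word-level tokenizer gives us back - the toy sentences below are made up, and the exact ids depend on the trained vocabulary (the special tokens are given the first ids, and words seen fewer than `min_frequency=2` times map to `[UNK]`):\n",
"\n",
"```python\n",
"# Illustrative only: train on a toy corpus and inspect the encoding.\n",
"toy_ds = [{'translation': {'en': 'the cat sat', 'it': 'il gatto si sedette'}}] * 2\n",
"toy_tokenizer = build_tokenizer(config={}, ds=toy_ds, lang='en')\n",
"encoding = toy_tokenizer.encode('the cat sat')\n",
"print(encoding.tokens)  # ['the', 'cat', 'sat']\n",
"print(encoding.ids)     # integer ids assigned after the reserved special tokens\n",
"print(toy_tokenizer.token_to_id('[PAD]'))  # id reserved for the padding token\n",
"```"
]
},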
1721 | {
1722 | "cell_type": "markdown",
1723 | "metadata": {
1724 | "id": "MEUDihkcVRq1"
1725 | },
1726 | "source": [
1727 | "Now we can create our dataset in a format that our model expects and can train with!"
1728 | ]
1729 | },
1730 | {
1731 | "cell_type": "code",
1732 | "execution_count": null,
1733 | "metadata": {
1734 | "id": "yI7xXnLo84cE"
1735 | },
1736 | "outputs": [],
1737 | "source": [
1738 | "from torch.utils.data import DataLoader, random_split\n",
1739 | "\n",
1740 | "def get_ds(config):\n",
1741 | " # It only has the train split, so we divide it overselves\n",
1742 | " ds_raw = load_dataset(f\"{config['datasource']}\", f\"{config['lang_src']}-{config['lang_tgt']}\", split='train')\n",
1743 | "\n",
1744 | " # Build tokenizers\n",
1745 | " tokenizer_src = build_tokenizer(config, ds_raw, config['lang_src'])\n",
1746 | " tokenizer_tgt = build_tokenizer(config, ds_raw, config['lang_tgt'])\n",
1747 | "\n",
1748 | " # Keep 90% for training, 10% for validation\n",
1749 | " train_ds_size = int(0.9 * len(ds_raw))\n",
1750 | " val_ds_size = len(ds_raw) - train_ds_size\n",
1751 | " train_ds_raw, val_ds_raw = random_split(ds_raw, [train_ds_size, val_ds_size])\n",
1752 | "\n",
1753 | " train_ds = BilingualDataset(train_ds_raw, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])\n",
1754 | " val_ds = BilingualDataset(val_ds_raw, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])\n",
1755 | "\n",
1756 | " # Find the maximum length of each sentence in the source and target sentence\n",
1757 | " max_len_src = 0\n",
1758 | " max_len_tgt = 0\n",
1759 | "\n",
1760 | " for item in ds_raw:\n",
1761 | " src_ids = tokenizer_src.encode(item['translation'][config['lang_src']]).ids\n",
1762 | " tgt_ids = tokenizer_tgt.encode(item['translation'][config['lang_tgt']]).ids\n",
1763 | " max_len_src = max(max_len_src, len(src_ids))\n",
1764 | " max_len_tgt = max(max_len_tgt, len(tgt_ids))\n",
1765 | "\n",
1766 | " print(f'Max length of source sentence: {max_len_src}')\n",
1767 | " print(f'Max length of target sentence: {max_len_tgt}')\n",
1768 | "\n",
1769 | "\n",
1770 | " train_dataloader = DataLoader(train_ds, batch_size=config['batch_size'], shuffle=True)\n",
1771 | " val_dataloader = DataLoader(val_ds, batch_size=1, shuffle=True)\n",
1772 | "\n",
1773 | " return train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt"
1774 | ]
1775 | },
1776 | {
1777 | "cell_type": "markdown",
1778 | "metadata": {
1779 | "id": "cr3Ys-ufVW1E"
1780 | },
1781 | "source": [
1782 | "We can build our model with this helper function."
1783 | ]
1784 | },
1785 | {
1786 | "cell_type": "code",
1787 | "execution_count": null,
1788 | "metadata": {
1789 | "id": "FK6k5X829JOo"
1790 | },
1791 | "outputs": [],
1792 | "source": [
1793 | "def get_model(config, vocab_src_len, vocab_tgt_len):\n",
1794 | " model = build_transformer(vocab_src_len, vocab_tgt_len, config[\"seq_len\"], config['seq_len'], d_model=config['d_model'], verbose=False)\n",
1795 | " return model"
1796 | ]
1797 | },
1798 | {
1799 | "cell_type": "code",
1800 | "execution_count": null,
1801 | "metadata": {
1802 | "id": "JCALGXpf9tnv"
1803 | },
1804 | "outputs": [],
1805 | "source": [
1806 | "def get_weights_file_path(config, epoch: str):\n",
1807 | " model_folder = f\"{config['datasource']}_{config['model_folder']}\"\n",
1808 | " model_filename = f\"{config['model_basename']}{epoch}.pt\"\n",
1809 | " return str(Path('.') / model_folder / model_filename)"
1810 | ]
1811 | },
1812 | {
1813 | "cell_type": "code",
1814 | "execution_count": null,
1815 | "metadata": {
1816 | "id": "QPWZgCwD9vVX"
1817 | },
1818 | "outputs": [],
1819 | "source": [
1820 | "def latest_weights_file_path(config):\n",
1821 | " model_folder = f\"{config['datasource']}_{config['model_folder']}\"\n",
1822 | " model_filename = f\"{config['model_basename']}*\"\n",
1823 | " weights_files = list(Path(model_folder).glob(model_filename))\n",
1824 | " if len(weights_files) == 0:\n",
1825 | " return None\n",
1826 | " weights_files.sort()\n",
1827 | " return str(weights_files[-1])"
1828 | ]
1829 | },
1830 | {
1831 | "cell_type": "markdown",
1832 | "metadata": {
1833 | "id": "BNjAS-LvV557"
1834 | },
1835 | "source": [
1836 | "Finally....our training loop!\n",
1837 | "\n",
1838 | "We'll spend more time in following weeks discussing this - for now, we'll quickly walk through what's happening:\n",
1839 | "\n",
1840 | "1. Configure the training device (GPU/CPU) and print details. Set device in PyTorch.\n",
1841 | "\n",
1842 | "2. Create directory for saving model weights based on config.\n",
1843 | "\n",
1844 | "3. Get data loaders, tokenizers, and model. Move model to configured device.\n",
1845 | "\n",
1846 | "4. Initialize Adam optimizer with learning rate and epsilon from config.\n",
1847 | "\n",
1848 | "5. Set up initial training parameters like start epoch and global step.\n",
1849 | "\n",
1850 | "6. Define cross-entropy loss function with label smoothing, ignoring padding.\n",
1851 | "\n",
1852 | "---\n",
1853 | "\n",
1854 | "- Main training loop over epochs:\n",
1855 | "\n",
1856 | " - Clear cache, set model to train mode, initialize progress bar.\n",
1857 | "\n",
1858 | " - For each batch:\n",
1859 | "\n",
1860 | " - Move data to device, run model forward/backward passes.\n",
1861 | " - Compute loss, backprop, update model weights.\n",
1862 | " - Increment global step.\n",
1863 | " - After each epoch, save model and optimizer checkpoint."
1864 | ]
1865 | },
1866 | {
1867 | "cell_type": "code",
1868 | "execution_count": null,
1869 | "metadata": {
1870 | "id": "gL1bH7is9OfS"
1871 | },
1872 | "outputs": [],
1873 | "source": [
1874 | "import warnings\n",
1875 | "from tqdm import tqdm\n",
1876 | "import os\n",
1877 | "from pathlib import Path\n",
1878 | "\n",
1879 | "def train_model(config):\n",
1880 | " # Define the device\n",
1881 | " device = \"cuda\" if torch.cuda.is_available() else \"mps\" if torch.has_mps or torch.backends.mps.is_available() else \"cpu\"\n",
1882 | " print(\"Using device:\", device)\n",
1883 | " if (device == 'cuda'):\n",
1884 | " print(f\"Device name: {torch.cuda.get_device_name(device.index)}\")\n",
1885 | " print(f\"Device memory: {torch.cuda.get_device_properties(device.index).total_memory / 1024 ** 3} GB\")\n",
1886 | " else:\n",
1887 | " print(\"Please ensure you're in a GPU enabled Colab Notebook instance.\")\n",
1888 | " device = torch.device(device)\n",
1889 | "\n",
1890 | " # Make sure the weights folder exists\n",
1891 | " Path(f\"{config['datasource']}_{config['model_folder']}\").mkdir(parents=True, exist_ok=True)\n",
1892 | "\n",
1893 | " train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_ds(config)\n",
1894 | " model = get_model(config, tokenizer_src.get_vocab_size(), tokenizer_tgt.get_vocab_size()).to(device)\n",
1895 | "\n",
1896 | " optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'], eps=1e-9)\n",
1897 | "\n",
1898 | " initial_epoch = 0\n",
1899 | " global_step = 0\n",
1900 | "\n",
1901 | " loss_fn = nn.CrossEntropyLoss(ignore_index=tokenizer_src.token_to_id('[PAD]'), label_smoothing=0.1).to(device)\n",
1902 | "\n",
1903 | " for epoch in range(initial_epoch, config['num_epochs']):\n",
1904 | " torch.cuda.empty_cache()\n",
1905 | " model.train()\n",
1906 | " batch_iterator = tqdm(train_dataloader, desc=f\"Processing Epoch {epoch:02d}\")\n",
1907 | " for batch in batch_iterator:\n",
1908 | " encoder_input = batch['encoder_input'].to(device)\n",
1909 | " decoder_input = batch['decoder_input'].to(device)\n",
1910 | " encoder_mask = batch['encoder_mask'].to(device)\n",
1911 | " decoder_mask = batch['decoder_mask'].to(device)\n",
1912 | "\n",
1913 | " encoder_output = model.encode(encoder_input, encoder_mask)\n",
1914 | " decoder_output = model.decode(encoder_output, encoder_mask, decoder_input, decoder_mask)\n",
1915 | " proj_output = model.project(decoder_output)\n",
1916 | "\n",
1917 | " label = batch['label'].to(device)\n",
1918 | "\n",
1919 | " loss = loss_fn(proj_output.view(-1, tokenizer_tgt.get_vocab_size()), label.view(-1))\n",
1920 | " batch_iterator.set_postfix({\"loss\": f\"{loss.item():6.3f}\"})\n",
1921 | "\n",
1922 | " loss.backward()\n",
1923 | "\n",
1924 | " optimizer.step()\n",
1925 | " optimizer.zero_grad(set_to_none=True)\n",
1926 | "\n",
1927 | " global_step += 1\n",
1928 | "\n",
1929 | " model_filename = get_weights_file_path(config, f\"{epoch:02d}\")\n",
1930 | " torch.save({\n",
1931 | " 'epoch': epoch,\n",
1932 | " 'model_state_dict': model.state_dict(),\n",
1933 | " 'optimizer_state_dict': optimizer.state_dict(),\n",
1934 | " 'global_step': global_step\n",
1935 | " }, model_filename)"
1936 | ]
1937 | },
1938 | {
1939 | "cell_type": "code",
1940 | "execution_count": null,
1941 | "metadata": {
1942 | "id": "f21pxGoS9-gM"
1943 | },
1944 | "outputs": [],
1945 | "source": [
1946 | "config = {\n",
1947 | " \"batch_size\": 64,\n",
1948 | " \"num_epochs\": 6,\n",
1949 | " \"lr\": 1e-4,\n",
1950 | " \"seq_len\": 350,\n",
1951 | " \"d_model\": 512,\n",
1952 | " \"datasource\": 'opus_books',\n",
1953 | " \"lang_src\": \"en\",\n",
1954 | " \"lang_tgt\": \"it\",\n",
1955 | " \"model_folder\": \"trained__en_it_translation_model\",\n",
1956 | " \"model_basename\": \"encoder_decoder_model_\"\n",
1957 | "}"
1958 | ]
1959 | },
1960 | {
1961 | "cell_type": "code",
1962 | "execution_count": null,
1963 | "metadata": {
1964 | "colab": {
1965 | "base_uri": "https://localhost:8080/"
1966 | },
1967 | "id": "hfBcSHUo-MU6",
1968 | "outputId": "dec1db65-a12c-42b1-dc34-394eab453b2a"
1969 | },
1970 | "outputs": [
1971 | {
1972 | "name": "stdout",
1973 | "output_type": "stream",
1974 | "text": [
1975 | "Using device: cuda\n",
1976 | "Device name: NVIDIA A100-SXM4-40GB\n",
1977 | "Device memory: 39.56427001953125 GB\n",
1978 | "Max length of source sentence: 309\n",
1979 | "Max length of target sentence: 274\n"
1980 | ]
1981 | },
1982 | {
1983 | "name": "stderr",
1984 | "output_type": "stream",
1985 | "text": [
1986 | "Processing Epoch 00: 100%|██████████| 455/455 [05:26<00:00, 1.39it/s, loss=6.288]\n",
1987 | "Processing Epoch 01: 100%|██████████| 455/455 [05:27<00:00, 1.39it/s, loss=5.811]\n",
1988 | "Processing Epoch 02: 100%|██████████| 455/455 [05:27<00:00, 1.39it/s, loss=5.545]\n",
1989 | "Processing Epoch 03: 100%|██████████| 455/455 [05:27<00:00, 1.39it/s, loss=5.524]\n",
1990 | "Processing Epoch 04: 100%|██████████| 455/455 [05:27<00:00, 1.39it/s, loss=5.216]\n",
1991 | "Processing Epoch 05: 100%|██████████| 455/455 [05:27<00:00, 1.39it/s, loss=5.005]\n"
1992 | ]
1993 | }
1994 | ],
1995 | "source": [
1996 | "train_model(config)"
1997 | ]
1998 | },
1999 | {
2000 | "cell_type": "code",
2001 | "execution_count": null,
2002 | "metadata": {
2003 | "colab": {
2004 | "base_uri": "https://localhost:8080/"
2005 | },
2006 | "id": "FT_QNpmqzAxe",
2007 | "outputId": "7f679622-22bc-43b7-b848-8ce175635269"
2008 | },
2009 | "outputs": [
2010 | {
2011 | "name": "stdout",
2012 | "output_type": "stream",
2013 | "text": [
2014 | "Max length of source sentence: 309\n",
2015 | "Max length of target sentence: 274\n",
2016 | "Loading weights from opus_books_trained__en_it_translation_model/encoder_decoder_model_05.pt\n"
2017 | ]
2018 | },
2019 | {
2020 | "name": "stderr",
2021 | "output_type": "stream",
2022 | "text": [
2023 | ":12: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
2024 | " state = torch.load(model_filename)\n"
2025 | ]
2026 | },
2027 | {
2028 | "data": {
2029 | "text/plain": [
2030 | "Transformer(\n",
2031 | " (encoder): EncoderStack(\n",
2032 | " (layers): ModuleList(\n",
2033 | " (0-5): 6 x EncoderBlock(\n",
2034 | " (self_attention_block): MultiHeadAttention(\n",
2035 | " (w_q): Linear(in_features=512, out_features=512, bias=False)\n",
2036 | " (w_k): Linear(in_features=512, out_features=512, bias=False)\n",
2037 | " (w_v): Linear(in_features=512, out_features=512, bias=False)\n",
2038 | " (w_o): Linear(in_features=512, out_features=512, bias=False)\n",
2039 | " (dropout): Dropout(p=0.1, inplace=False)\n",
2040 | " )\n",
2041 | " (feed_forward_block): FeedForwardBlock(\n",
2042 | " (linear_1): Linear(in_features=512, out_features=2048, bias=True)\n",
2043 | " (dropout): Dropout(p=0.1, inplace=False)\n",
2044 | " (linear_2): Linear(in_features=2048, out_features=512, bias=True)\n",
2045 | " )\n",
2046 | " (residual_connections): ModuleList(\n",
2047 | " (0-1): 2 x ResidualConnection(\n",
2048 | " (dropout): Dropout(p=0.1, inplace=False)\n",
2049 | " (layernorm): LayerNormalization()\n",
2050 | " )\n",
2051 | " )\n",
2052 | " )\n",
2053 | " )\n",
2054 | " (norm): LayerNormalization()\n",
2055 | " )\n",
2056 | " (decoder): DecoderStack(\n",
2057 | " (layers): ModuleList(\n",
2058 | " (0-5): 6 x DecoderBlock(\n",
2059 | " (self_attention_block): MultiHeadAttention(\n",
2060 | " (w_q): Linear(in_features=512, out_features=512, bias=False)\n",
2061 | " (w_k): Linear(in_features=512, out_features=512, bias=False)\n",
2062 | " (w_v): Linear(in_features=512, out_features=512, bias=False)\n",
2063 | " (w_o): Linear(in_features=512, out_features=512, bias=False)\n",
2064 | " (dropout): Dropout(p=0.1, inplace=False)\n",
2065 | " )\n",
2066 | " (cross_attention_block): MultiHeadAttention(\n",
2067 | " (w_q): Linear(in_features=512, out_features=512, bias=False)\n",
2068 | " (w_k): Linear(in_features=512, out_features=512, bias=False)\n",
2069 | " (w_v): Linear(in_features=512, out_features=512, bias=False)\n",
2070 | " (w_o): Linear(in_features=512, out_features=512, bias=False)\n",
2071 | " (dropout): Dropout(p=0.1, inplace=False)\n",
2072 | " )\n",
2073 | " (feed_forward_block): FeedForwardBlock(\n",
2074 | " (linear_1): Linear(in_features=512, out_features=2048, bias=True)\n",
2075 | " (dropout): Dropout(p=0.1, inplace=False)\n",
2076 | " (linear_2): Linear(in_features=2048, out_features=512, bias=True)\n",
2077 | " )\n",
2078 | " (residual_connections): ModuleList(\n",
2079 | " (0-2): 3 x ResidualConnection(\n",
2080 | " (dropout): Dropout(p=0.1, inplace=False)\n",
2081 | " (layernorm): LayerNormalization()\n",
2082 | " )\n",
2083 | " )\n",
2084 | " )\n",
2085 | " )\n",
2086 | " (norm): LayerNormalization()\n",
2087 | " )\n",
2088 | " (src_embed): InputEmbeddings(\n",
2089 | " (embedding): Embedding(15698, 512)\n",
2090 | " )\n",
2091 | " (tgt_embed): InputEmbeddings(\n",
2092 | " (embedding): Embedding(22463, 512)\n",
2093 | " )\n",
2094 | " (src_pos): PositionalEncoding(\n",
2095 | " (dropout): Dropout(p=0.1, inplace=False)\n",
2096 | " )\n",
2097 | " (tgt_pos): PositionalEncoding(\n",
2098 | " (dropout): Dropout(p=0.1, inplace=False)\n",
2099 | " )\n",
2100 | " (projection_layer): LinearProjectionLayer(\n",
2101 | " (proj): Linear(in_features=512, out_features=22463, bias=True)\n",
2102 | " )\n",
2103 | ")"
2104 | ]
2105 | },
2106 | "execution_count": 51,
2107 | "metadata": {},
2108 | "output_type": "execute_result"
2109 | }
2110 | ],
2111 | "source": [
2112 | "def load_model(config):\n",
2113 | " # Get dataloaders and tokenizers\n",
2114 | " _, _, tokenizer_src, tokenizer_tgt = get_ds(config)\n",
2115 | "\n",
2116 | " # Initialize model\n",
2117 | " model = get_model(config, tokenizer_src.get_vocab_size(), tokenizer_tgt.get_vocab_size())\n",
2118 | "\n",
2119 | " # Load trained weights\n",
2120 | " model_filename = latest_weights_file_path(config)\n",
2121 | " if model_filename:\n",
2122 | " print(f\"Loading weights from {model_filename}\")\n",
2123 | " state = torch.load(model_filename)\n",
2124 | " model.load_state_dict(state['model_state_dict'])\n",
2125 | "\n",
2126 | " device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
2127 | " model.to(device)\n",
2128 | " return model, tokenizer_src, tokenizer_tgt, device\n",
2129 | "\n",
2130 | "def generate(model, tokenizer_src, tokenizer_tgt, src_text, device, max_length=350):\n",
2131 | " model.eval()\n",
2132 | "\n",
2133 | " enc_input = tokenizer_src.encode(src_text).ids\n",
2134 | " enc_input = torch.tensor([tokenizer_src.token_to_id('[SOS]')] + enc_input + [tokenizer_src.token_to_id('[EOS]')]).unsqueeze(0)\n",
2135 | "\n",
2136 | " enc_mask = (enc_input != tokenizer_src.token_to_id('[PAD]')).unsqueeze(0).unsqueeze(0).int()\n",
2137 | "\n",
2138 | " enc_input = enc_input.to(device)\n",
2139 | " enc_mask = enc_mask.to(device)\n",
2140 | "\n",
2141 | " with torch.no_grad():\n",
2142 | " enc_output = model.encode(enc_input, enc_mask)\n",
2143 | " dec_input = torch.tensor([[tokenizer_tgt.token_to_id('[SOS]')]]).to(device)\n",
2144 | "\n",
2145 | " for _ in range(max_length):\n",
2146 | " dec_mask = causal_mask(dec_input.size(1)).to(device)\n",
2147 | "\n",
2148 | " dec_output = model.decode(enc_output, enc_mask, dec_input, dec_mask)\n",
2149 | " proj_output = model.project(dec_output)\n",
2150 | "\n",
2151 | " next_word = proj_output[:, -1].argmax(dim=-1)\n",
2152 | " dec_input = torch.cat([dec_input, next_word.unsqueeze(-1)], dim=1)\n",
2153 | "\n",
2154 | " if next_word.item() == tokenizer_tgt.token_to_id('[EOS]'):\n",
2155 | " break\n",
2156 | "\n",
2157 | " translated_tokens = [tokenizer_tgt.id_to_token(t.item()) for t in dec_input[0]]\n",
2158 | " translated_text = ' '.join([t for t in translated_tokens if t not in ['[SOS]', '[EOS]', '[PAD]']])\n",
2159 | "\n",
2160 | " return translated_text\n",
2161 | "\n",
2162 | "\n",
2163 | "model, tokenizer_src, tokenizer_tgt, device = load_model(config)\n",
2164 | "model.eval()"
2165 | ]
2166 | },
2167 | {
2168 | "cell_type": "code",
2169 | "execution_count": null,
2170 | "metadata": {
2171 | "colab": {
2172 | "base_uri": "https://localhost:8080/"
2173 | },
2174 | "id": "-B1hXBPhsd4p",
2175 | "outputId": "267033cd-46c2-48bd-f172-f5c8b397d913"
2176 | },
2177 | "outputs": [
2178 | {
2179 | "name": "stdout",
2180 | "output_type": "stream",
2181 | "text": [
2182 | "English to Italian Translations:\n",
2183 | "--------------------------------------------------\n",
2184 | "EN: the weather is beautiful today\n",
2185 | "IT: Il giorno è stato , ma è stato bello .\n",
2186 | "--------------------------------------------------\n",
2187 | "EN: how are you?\n",
2188 | "IT: Come siete ?\n",
2189 | "--------------------------------------------------\n"
2190 | ]
2191 | }
2192 | ],
2193 | "source": [
2194 | "test_sentences = [\n",
2195 | " \"the weather is beautiful today\",\n",
2196 | " \"how are you?\"\n",
2197 | " ]\n",
2198 | "\n",
2199 | "print(\"English to Italian Translations:\")\n",
2200 | "print(\"-\" * 50)\n",
2201 | "for sentence in test_sentences:\n",
2202 | " translation = generate(model, tokenizer_src, tokenizer_tgt, sentence, device)\n",
2203 | " print(f\"EN: {sentence}\")\n",
2204 | " print(f\"IT: {translation}\")\n",
2205 | " print(\"-\" * 50)"
2206 | ]
2207 | },
2208 | {
2209 | "cell_type": "markdown",
2210 | "metadata": {
2211 | "id": "Z9Cofqp53bQB"
2212 | },
2213 | "source": [
2214 | "#### Acknowledgements\n",
2215 | "\n",
2216 | "This notebook is heavily adapted from a number of incredible resources on Transformers, including but not limited to:\n",
2217 | "\n",
2218 | "- https://blog.floydhub.com/the-transformer-in-pytorch/\n",
2219 | "- https://arxiv.org/pdf/1706.03762.pdf\n",
2220 | "- https://txt.cohere.com/what-are-transformer-models/\n",
2221 | "- https://jalammar.github.io/illustrated-transformer/"
2222 | ]
2223 | }
2224 | ],
2225 | "metadata": {
2226 | "accelerator": "GPU",
2227 | "colab": {
2228 | "gpuType": "A100",
2229 | "machine_shape": "hm",
2230 | "provenance": [],
2231 | "toc_visible": true
2232 | },
2233 | "kernelspec": {
2234 | "display_name": "Python 3",
2235 | "name": "python3"
2236 | },
2237 | "language_info": {
2238 | "name": "python"
2239 | }
2240 | },
2241 | "nbformat": 4,
2242 | "nbformat_minor": 0
2243 | }
2244 |
--------------------------------------------------------------------------------
/02_The_Transformer/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 | 📜 Session 2: The Transformer
7 |
8 | ### [Quicklinks](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs/tree/main/00_AIM_Quicklinks)
9 |
10 | | 📰 Session Sheet | 📽️ YouTube Video | 🖼️ Slides | 👨💻 Repo |
11 | |:-----------------|:-----------------|:-----------------|:-----------------|
12 | | [Session 2: The Transformer](https://www.notion.so/Session-2-The-Transformer-1a7cd547af3d80079041d5112fb052a8) | [02: The Transformer](https://www.youtube.com/watch?v=LYODbG3X4oI&ab_channel=AIMakerspace) | [Session 2: The Transformer](https://www.canva.com/design/DAGW9drJwtU/d5pIdoSDGNoTHppA3i9Crg/view?utm_content=DAGW9drJwtU&utm_campaign=designshare&utm_medium=link&utm_source=editor) | You Are Here!
13 |
14 | ## A Note on How AIM Does Assignments:
15 |
17 | Throughout our time together - we'll be providing a number of assignments. Each assignment will be split into two broad categories:
18 |
19 | - Base Assignment - a more conceptual and theory based assignment focused on locking in specific key concepts and learnings.
20 | - Hardmode Assignment - a more programming focused assignment focused on core code-concepts used in transformers.
21 |
22 | Each assignment will have a few of the following categories of exercises:
23 |
24 | - ❓Questions - these will be questions that you will be expected to gather the answer to!
25 | - 🏗️ Activities - these will be work or coding activities meant to reinforce specific concepts or theory components.
26 | - 🚧 Advanced Builds - these will only appear in Hardmode assignments, and will require you to build something with little to no help outside of documentation!
27 |
28 | You are expected to complete all of the activities in your selected notebook!
29 |
30 | ## Assignment
31 |
32 | The assignment for Session 2: The Transformer is straightforward: Work through the notebook, answer all questions and complete all activities!
33 |
34 | Below are the links to the Colab Notebooks (this will require GPUs with at least 40GB of GPU Memory) - though you can use the local notebooks if you have access to the same capacity hardware.
35 |
36 | > NOTE: You can reduce the training batch size to 1 or 8 (and reduce the number of epochs correspondingly, since smaller batches increase training time) to run on reduced hardware capacity.
37 |
38 | - [Standard Assignment](https://colab.research.google.com/drive/1gdzYi9PUqZnc80z3M_omIsLOpQ254xtu?usp=sharing): Complete the notebook, including all ❓Questions.
39 | - [Hardmode Assignment](https://colab.research.google.com/drive/1CInxjTSpqG5dSgg234TrYge9XSCshyQN?usp=sharing): Complete the notebook, including all 🏗️ Activities, and ❓Questions.
40 |
--------------------------------------------------------------------------------
/03_Attention/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 | 🧐 Session 3: Attention
7 |
8 | ### [Quicklinks](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs/tree/main/00_AIM_Quicklinks)
9 |
10 | | 📰 Session Sheet | 📽️ YouTube Video | 🖼️ Slides | 👨💻 Repo |
11 | |:-----------------|:-----------------|:-----------------|:-----------------|
12 | | [Session 3: Attention](https://www.notion.so/Session-3-Attention-1a7cd547af3d808fae58ed6252cc3e7f) | [03: Attention & Flash Attention](https://www.youtube.com/watch?v=cE5E1m1cSAU&ab_channel=AIMakerspace) | [Session 3: Attention & Flash Attention](https://www.canva.com/design/DAGXJDsxuyI/TO3MaXqimiS-MjbR8-qm3g/view?utm_content=DAGXJDsxuyI&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=hbdbc7bdd9d) | You Are Here!
13 | ### Assignment:
14 |
15 | Today's assignment is available in Colab:
16 |
17 | - [Base Assignment](https://colab.research.google.com/drive/15N58WHScLs2XSe8fUGhh3cwG8qZs5ZfL?usp=sharing)
18 | - [Hardmode Assignment](https://colab.research.google.com/drive/1IGt121B-t8GFUHgwV8w5G_1QRt5CepNY?usp=sharing)
19 |
20 | You will be required to work through each:
21 |
22 | - 🏗️ Activity
23 | - 👪❓Discussion Question
24 | - ❓Question
25 |
--------------------------------------------------------------------------------
/04_Embeddings/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 | 🔠 Session 4: Embeddings
7 |
8 | ### [Quicklinks](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs/tree/main/00_AIM_Quicklinks)
9 | | 📰 Session Sheet | 📽️ YouTube Video | 🖼️ Slides | 👨💻 Repo |
10 | |:-----------------|:-----------------|:-----------------|:-----------------|
11 | | [Session 4: Embeddings](https://www.notion.so/Session-4-Embeddings-1a7cd547af3d80669ea0e47ff2a142f9) | [04: Embeddings](https://www.youtube.com/watch?v=XMJzqxElhfY&ab_channel=AIMakerspace) | [Session 4: Embeddings](https://www.canva.com/design/DAGXnKDginc/-g-2FCMJKDr2yhmUuuvVqg/view?utm_content=DAGXnKDginc&utm_campaign=designshare&utm_medium=link&utm_source=editor) | You Are Here!
12 |
13 | ### Assignment
14 |
15 | Today's assignment is available in Colab:
16 |
17 | - [Base Assignment](https://colab.research.google.com/drive/1i79_h9hr4f4hEsoRHLU96mOU33l4OKIx?usp=sharing)
18 | - [Hardmode Assignment](https://colab.research.google.com/drive/1copu7fM7C2KH6rX_05nFUhIzGGUsbC1a?usp=sharing)
19 |
20 | Today's Notebook will take us through the following tasks:
21 |
22 | - Breakout Room #1: Training Word2Vec from Scratch
23 | - Task 1: Dependencies
24 | - Task 2: Data Collection
25 | - Task 3: Data Preprocessing
26 | - 🏗️ Activity #1 (Hardmode Only)
27 | - ❓Question #1
28 | - 🏗️ Activity #2 (Hardmode Only)
29 | - 👪❓ Discussion Question #1
30 | - Task 4: Training Word2Vec
31 | - 🏗️ Activity #3
32 | - ❓Question #2
33 | - Breakout Room #2:
34 | - Task 1: Fine-tuning Our Embedding Model
35 | - ❓Question #3
36 | - 🏗️ Activity #4
37 | - Task 2: Evaluating our Embedding Model
38 | - 👪❓Discussion Question #2
39 |
--------------------------------------------------------------------------------
/05_Next-Token-Prediction/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 | 🪙 Session 5: Next Token Prediction
7 |
8 | ### [Quicklinks](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs/tree/main/00_AIM_Quicklinks)
9 |
10 | | 📰 Session Sheet | 📽️ YouTube Video | 🖼️ Slides | 👨💻 Repo |
11 | |:-----------------|:-----------------|:-----------------|:-----------------|
12 | | [Session 5: Next-Token Prediction](https://www.notion.so/Session-5-Next-Token-Prediction-1a7cd547af3d8056bacaf652f6f9e8d9) | [05: Next-Token Prediction ](https://www.youtube.com/watch?v=xNRgycrPQFY&ab_channel=AIMakerspace) | [Session 5: Decoding or Next-Token Prediction](https://www.canva.com/design/DAGYRgCRV2k/3xwuCV92aSKKNG7qpockFw/view?utm_content=DAGYRgCRV2k&utm_campaign=designshare&utm_medium=link&utm_source=editor) | You Are Here! |
13 |
14 |
15 | ### Assignment:
16 |
17 | Today's assignment is available in Colab:
18 | - [Assignment](https://colab.research.google.com/drive/1U1FqxvG1U0mxKoTvJYa3KpUhRuk12LX8?usp=sharing)
19 | - [Hardmode Assignment](https://colab.research.google.com/drive/1mvf-UNbUCIoZlv4atDbEsYNk-RRlBO07?usp=sharing)
20 |
21 | Breakout Room #1: Logits to Tokens
22 | - Task 1: Dependencies
23 | - Task 2: Generating Tokens!
24 | - 🏗️ Activity #1:
25 | - Task 3: Data Preprocessing
26 | - ❓Question #1
27 | - Task 4: Alternate Decoding Examples:
28 | - 👪❓ Discussion Question #1
29 |
30 | Breakout Room #2: Speculative Decoding and Guard Rails
31 | - Task 5: Speculative Decoding
32 | - ❓ Discussion Question #2
33 | - Task 6: Guard Rails
34 | - 👪❓ Discussion Question #2
35 |
--------------------------------------------------------------------------------
/06_Pre-Training/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 | 🚇 Session 6: Pre-Training
7 |
8 | ### [Quicklinks](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs/tree/main/00_AIM_Quicklinks)
9 |
10 | | 📰 Session Sheet | 📽️ YouTube Video | 🖼️ Slides | 👨💻 Repo |
11 | |:-----------------|:-----------------|:-----------------|:-----------------|
12 | | [Session 6: Pretraining](https://www.notion.so/Session-6-Pretraining-1a7cd547af3d80fbaba5decb2ee8616b) | [06: Pretraining](https://www.youtube.com/watch?v=zU5iIAsqJVU&ab_channel=AIMakerspace) | [Session 6: Pretraining](https://www.canva.com/design/DAGYdUqfwVg/l_9JK-h7dgvP4bseYdzwaQ/view?utm_content=DAGYdUqfwVg&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=h9c74da65a4) | You Are Here! |
13 |
14 | ### Assignment:
15 |
16 | Today's assignments are available in Colab:
17 | - Assignment #1:
18 | - [Assignment](https://colab.research.google.com/drive/1gO5Y8QTvF7b2BpyYIXlSh7lMTOCkDtPT?usp=sharing)
19 | - [Hardmode Assignment](https://colab.research.google.com/drive/14IpIKAYtkziYjP1plGpXrNj9i4FwuEOf?usp=sharing)
20 | - Assignment #2:
21 | - [Assignment](https://colab.research.google.com/drive/125A8lgCFLxwlBPqgEMnJpnhj8W0SLXGr?usp=sharing)
22 |
23 | - Breakout #1:
24 | - Task 1: Dependencies
25 | - Task 2: Data Preparation
26 | - 🏗️ Activity #1
27 | - ❓ Question #1
28 | - Task 3: Training Loop
29 | - 👪❓ Discussion Question #1
30 | - Task 4: Training the Model
31 | - ❓ Question #2
32 |
33 | - Breakout Room #2:
34 | - Using 🤗 `transformers`
35 | - 👪❓ Discussion Question #1
36 |
--------------------------------------------------------------------------------
/06_Pre-Training/The_Loss_Function_in_LLMs_Cross_Entropy_Assignment.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "provenance": [],
7 | "machine_shape": "hm",
8 | "gpuType": "L4",
9 | "collapsed_sections": [
10 | "otLV55h9T322"
11 | ]
12 | },
13 | "kernelspec": {
14 | "name": "python3",
15 | "display_name": "Python 3"
16 | },
17 | "language_info": {
18 | "name": "python"
19 | },
20 | "accelerator": "GPU"
21 | },
22 | "cells": [
23 | {
24 | "cell_type": "markdown",
25 | "source": [
26 | "# The Loss Function in LLMs - Cross Entropy- AIMS\n",
27 | "\n",
28 | "- Breakout #1:\n",
29 | " - Task 1: Dependencies\n",
30 | " - Task 2: Data Preparation\n",
31 | " - 🏗️ Activity #1\n",
32 | " - ❓ Question #1\n",
33 | " - Task 3: Training Loop\n",
34 | " - 👪❓ Discussion Question #1\n",
35 | " - Task 4: Training the Model\n",
36 | " - ❓ Question #2\n",
37 | "\n",
38 | "Now that we have a better understanding of what decoder-only transformer based LLMs are doing to predict the next token, let's look at how they train using that prediciton mechanism!\n",
39 | "\n",
40 | "> ⚠ NOTE: This notebook is **NOT** compatible with the T4 Instance. Please ensure you're in the L4 or A100 instance."
41 | ],
42 | "metadata": {
43 | "id": "4QBs8FHvShu6"
44 | }
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "source": [
49 | "## Task 1: Dependencies\n",
50 | "\n",
51 | "We'll start by loading the repository and the requirements not included in Colab by default."
52 | ],
53 | "metadata": {
54 | "id": "-Qtr6wnLTYlR"
55 | }
56 | },
57 | {
58 | "cell_type": "code",
59 | "source": [
60 | "!pip install -qU datasets tiktoken wandb tqdm triton"
61 | ],
62 | "metadata": {
63 | "id": "XcfnV1C3Cz50",
64 | "colab": {
65 | "base_uri": "https://localhost:8080/"
66 | },
67 | "outputId": "fbbf2d62-c16d-48b7-c3ed-ee3c8470f242"
68 | },
69 | "execution_count": null,
70 | "outputs": [
71 | {
72 | "output_type": "stream",
73 | "name": "stdout",
74 | "text": [
75 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m209.5/209.5 MB\u001b[0m \u001b[31m6.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
76 | "\u001b[?25h"
77 | ]
78 | }
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {
85 | "colab": {
86 | "base_uri": "https://localhost:8080/"
87 | },
88 | "id": "4n9tJuC0BKFJ",
89 | "outputId": "297550a1-8561-40b0-ea4f-cea56489763b"
90 | },
91 | "outputs": [
92 | {
93 | "output_type": "stream",
94 | "name": "stdout",
95 | "text": [
96 | "Cloning into 'nanoGPT'...\n",
97 | "remote: Enumerating objects: 682, done.\u001b[K\n",
98 | "remote: Total 682 (delta 0), reused 0 (delta 0), pack-reused 682 (from 1)\u001b[K\n",
99 | "Receiving objects: 100% (682/682), 952.47 KiB | 28.86 MiB/s, done.\n",
100 | "Resolving deltas: 100% (385/385), done.\n"
101 | ]
102 | }
103 | ],
104 | "source": [
105 | "!git clone https://github.com/karpathy/nanoGPT.git"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "source": [
111 | "%cd nanoGPT"
112 | ],
113 | "metadata": {
114 | "colab": {
115 | "base_uri": "https://localhost:8080/"
116 | },
117 | "id": "ucLa1gp3CmB0",
118 | "outputId": "e11d11a1-2978-435b-9e22-ebea93c60e7d"
119 | },
120 | "execution_count": null,
121 | "outputs": [
122 | {
123 | "output_type": "stream",
124 | "name": "stdout",
125 | "text": [
126 | "/content/nanoGPT\n"
127 | ]
128 | }
129 | ]
130 | },
131 | {
132 | "cell_type": "markdown",
133 | "source": [
134 | "## Task 2: Data Preparation\n",
135 | "\n",
136 | "In order to have the correct form of our inputs (tokens) we need to prepare our dataset - let's do this using the script provided by the repository!"
137 | ],
138 | "metadata": {
139 | "id": "kstygtQwVOUZ"
140 | }
141 | },
142 | {
143 | "cell_type": "code",
144 | "source": [
145 | "!python data/shakespeare/prepare.py"
146 | ],
147 | "metadata": {
148 | "colab": {
149 | "base_uri": "https://localhost:8080/"
150 | },
151 | "id": "PNV0Z2uLVVay",
152 | "outputId": "cd8073b2-130f-4ef1-cad7-b6b8add2d9cd"
153 | },
154 | "execution_count": null,
155 | "outputs": [
156 | {
157 | "output_type": "stream",
158 | "name": "stdout",
159 | "text": [
160 | "train has 301,966 tokens\n",
161 | "val has 36,059 tokens\n"
162 | ]
163 | }
164 | ]
165 | },
166 | {
167 | "cell_type": "markdown",
168 | "source": [
169 | "##### 🏗️ Activity #1:\n",
170 | "\n",
171 | "Describe what is happening in the `prepare.py` function in the [repository](https://github.com/karpathy/nanoGPT/blob/master/data/shakespeare/prepare.py) in natural language."
172 | ],
173 | "metadata": {
174 | "id": "OFeTcZOV_a4v"
175 | }
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "source": [
180 | "##### ❓ Question #1:\n",
181 | "\n",
182 | "What kind of tokenization strategy is being used here? (provide some examples of tokens)"
183 | ],
184 | "metadata": {
185 | "id": "yv7GVHU3_2_k"
186 | }
187 | },
188 | {
189 | "cell_type": "markdown",
190 | "source": [
191 | "## Training Loop\n",
192 | "\n",
193 | "We'll leverage the training loop from Karpathy's repository to focus in on the specifics of where we're using loss - how we're using it - and what it means!"
194 | ],
195 | "metadata": {
196 | "id": "hPyDaUl3TgSg"
197 | }
198 | },
199 | {
200 | "cell_type": "markdown",
201 | "source": [
202 | "### Preamble Code"
203 | ],
204 | "metadata": {
205 | "id": "otLV55h9T322"
206 | }
207 | },
208 | {
209 | "cell_type": "code",
210 | "source": [
211 | "import os\n",
212 | "import time\n",
213 | "import math\n",
214 | "import pickle\n",
215 | "from contextlib import nullcontext\n",
216 | "\n",
217 | "import numpy as np\n",
218 | "import torch\n",
219 | "from torch.nn.parallel import DistributedDataParallel as DDP\n",
220 | "from torch.distributed import init_process_group, destroy_process_group\n",
221 | "\n",
222 | "from model import GPTConfig, GPT\n",
223 | "\n",
224 | "# -----------------------------------------------------------------------------\n",
225 | "# default config values designed to train a gpt2 (124M) on OpenWebText\n",
226 | "# I/O\n",
227 | "out_dir = 'out'\n",
228 | "eval_interval = 2000\n",
229 | "log_interval = 1\n",
230 | "eval_iters = 200\n",
231 | "eval_only = False # if True, script exits right after the first eval\n",
232 | "always_save_checkpoint = True # if True, always save a checkpoint after each eval\n",
233 | "init_from = 'scratch' # 'scratch' or 'resume' or 'gpt2*'\n",
234 | "# wandb logging\n",
235 | "wandb_log = False # disabled by default\n",
236 | "wandb_project = 'owt'\n",
237 | "wandb_run_name = 'gpt2' # 'run' + str(time.time())\n",
238 | "# data\n",
239 | "dataset = 'shakespeare'\n",
240 | "gradient_accumulation_steps = 1 # used to simulate larger batch sizes\n",
241 | "batch_size = 12 # if gradient_accumulation_steps > 1, this is the micro-batch size\n",
242 | "block_size = 1024\n",
243 | "# model\n",
244 | "n_layer = 12\n",
245 | "n_head = 12\n",
246 | "n_embd = 768\n",
247 | "dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+\n",
248 | "bias = False # do we use bias inside LayerNorm and Linear layers?\n",
249 | "# adamw optimizer\n",
250 | "learning_rate = 6e-4 # max learning rate\n",
251 | "max_iters = 10 # total number of training iterations\n",
252 | "weight_decay = 1e-1\n",
253 | "beta1 = 0.9\n",
254 | "beta2 = 0.95\n",
255 | "grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0\n",
256 | "# learning rate decay settings\n",
257 | "decay_lr = True # whether to decay the learning rate\n",
258 | "warmup_iters = 10 # how many steps to warm up for\n",
259 | "lr_decay_iters = 600000 # should be ~= max_iters per Chinchilla\n",
260 | "min_lr = 6e-5 # minimum learning rate, should be ~= learning_rate/10 per Chinchilla\n",
261 | "# DDP settings\n",
262 | "backend = 'nccl' # 'nccl', 'gloo', etc.\n",
263 | "# system\n",
264 | "device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks\n",
265 | "dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32', 'bfloat16', or 'float16', the latter will auto implement a GradScaler\n",
266 | "compile = True # use PyTorch 2.0 to compile the model to be faster\n",
267 | "# -----------------------------------------------------------------------------\n",
268 | "config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]\n",
269 | "config = {k: globals()[k] for k in config_keys} # will be useful for logging\n",
270 | "# -----------------------------------------------------------------------------\n",
271 | "\n",
272 | "# various inits, derived attributes, I/O setup\n",
273 | "ddp = int(os.environ.get('RANK', -1)) != -1 # is this a ddp run?\n",
274 | "if ddp:\n",
275 | " init_process_group(backend=backend)\n",
276 | " ddp_rank = int(os.environ['RANK'])\n",
277 | " ddp_local_rank = int(os.environ['LOCAL_RANK'])\n",
278 | " ddp_world_size = int(os.environ['WORLD_SIZE'])\n",
279 | " device = f'cuda:{ddp_local_rank}'\n",
280 | " torch.cuda.set_device(device)\n",
281 | " master_process = ddp_rank == 0 # this process will do logging, checkpointing etc.\n",
282 | " seed_offset = ddp_rank # each process gets a different seed\n",
283 | " # world_size number of processes will be training simultaneously, so we can scale\n",
284 | " # down the desired gradient accumulation iterations per process proportionally\n",
285 | " assert gradient_accumulation_steps % ddp_world_size == 0\n",
286 | " gradient_accumulation_steps //= ddp_world_size\n",
287 | "else:\n",
288 | " # if not ddp, we are running on a single gpu, and one process\n",
289 | " master_process = True\n",
290 | " seed_offset = 0\n",
291 | " ddp_world_size = 1\n",
292 | "tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size\n",
293 | "print(f\"tokens per iteration will be: {tokens_per_iter:,}\")\n",
294 | "\n",
295 | "if master_process:\n",
296 | " os.makedirs(out_dir, exist_ok=True)\n",
297 | "torch.manual_seed(1337 + seed_offset)\n",
298 | "torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul\n",
299 | "torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn\n",
300 | "device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast\n",
301 | "# note: float16 data type will automatically use a GradScaler\n",
302 | "ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]\n",
303 | "ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)\n",
304 | "\n",
305 | "# poor man's data loader\n",
306 | "data_dir = os.path.join('data', dataset)\n",
307 | "def get_batch(split):\n",
308 | " # We recreate np.memmap every batch to avoid a memory leak, as per\n",
309 | " # https://stackoverflow.com/questions/45132940/numpy-memmap-memory-usage-want-to-iterate-once/61472122#61472122\n",
310 | " if split == 'train':\n",
311 | " data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')\n",
312 | " else:\n",
313 | " data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')\n",
314 | " ix = torch.randint(len(data) - block_size, (batch_size,))\n",
315 | " x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])\n",
316 | " y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])\n",
317 | " if device_type == 'cuda':\n",
318 | " # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)\n",
319 | " x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)\n",
320 | " else:\n",
321 | " x, y = x.to(device), y.to(device)\n",
322 | " return x, y\n",
323 | "\n",
324 | "# init these up here, can override if init_from='resume' (i.e. from a checkpoint)\n",
325 | "iter_num = 0\n",
326 | "best_val_loss = 1e9\n",
327 | "\n",
328 | "# attempt to derive vocab_size from the dataset\n",
329 | "meta_path = os.path.join(data_dir, 'meta.pkl')\n",
330 | "meta_vocab_size = None\n",
331 | "if os.path.exists(meta_path):\n",
332 | " with open(meta_path, 'rb') as f:\n",
333 | " meta = pickle.load(f)\n",
334 | " meta_vocab_size = meta['vocab_size']\n",
335 | " print(f\"found vocab_size = {meta_vocab_size} (inside {meta_path})\")\n",
336 | "\n",
337 | "# model init\n",
338 | "model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,\n",
339 | " bias=bias, vocab_size=None, dropout=dropout) # start with model_args from command line\n",
340 | "if init_from == 'scratch':\n",
341 | " # init a new model from scratch\n",
342 | " print(\"Initializing a new model from scratch\")\n",
343 | " # determine the vocab size we'll use for from-scratch training\n",
344 | " if meta_vocab_size is None:\n",
345 | " print(\"defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)\")\n",
346 | " model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304\n",
347 | " gptconf = GPTConfig(**model_args)\n",
348 | " model = GPT(gptconf)\n",
349 | "elif init_from == 'resume':\n",
350 | " print(f\"Resuming training from {out_dir}\")\n",
351 | " # resume training from a checkpoint.\n",
352 | " ckpt_path = os.path.join(out_dir, 'ckpt.pt')\n",
353 | " checkpoint = torch.load(ckpt_path, map_location=device)\n",
354 | " checkpoint_model_args = checkpoint['model_args']\n",
355 | " # force these config attributes to be equal otherwise we can't even resume training\n",
356 | " # the rest of the attributes (e.g. dropout) can stay as desired from command line\n",
357 | " for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:\n",
358 | " model_args[k] = checkpoint_model_args[k]\n",
359 | " # create the model\n",
360 | " gptconf = GPTConfig(**model_args)\n",
361 | " model = GPT(gptconf)\n",
362 | " state_dict = checkpoint['model']\n",
363 | " # fix the keys of the state dictionary :(\n",
364 | " # honestly no idea how checkpoints sometimes get this prefix, have to debug more\n",
365 | " unwanted_prefix = '_orig_mod.'\n",
366 | " for k,v in list(state_dict.items()):\n",
367 | " if k.startswith(unwanted_prefix):\n",
368 | " state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)\n",
369 | " model.load_state_dict(state_dict)\n",
370 | " iter_num = checkpoint['iter_num']\n",
371 | " best_val_loss = checkpoint['best_val_loss']\n",
372 | "elif init_from.startswith('gpt2'):\n",
373 | " print(f\"Initializing from OpenAI GPT-2 weights: {init_from}\")\n",
374 | " # initialize from OpenAI GPT-2 weights\n",
375 | " override_args = dict(dropout=dropout)\n",
376 | " model = GPT.from_pretrained(init_from, override_args)\n",
377 | " # read off the created config params, so we can store them into checkpoint correctly\n",
378 | " for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:\n",
379 | " model_args[k] = getattr(model.config, k)\n",
380 | "# crop down the model block size if desired, using model surgery\n",
381 | "if block_size < model.config.block_size:\n",
382 | " model.crop_block_size(block_size)\n",
383 | " model_args['block_size'] = block_size # so that the checkpoint will have the right value\n",
384 | "model.to(device)\n",
385 | "\n",
386 | "# initialize a GradScaler. If enabled=False scaler is a no-op\n",
387 | "scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))\n",
388 | "\n",
389 | "# optimizer\n",
390 | "optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)\n",
391 | "if init_from == 'resume':\n",
392 | " optimizer.load_state_dict(checkpoint['optimizer'])\n",
393 | "checkpoint = None # free up memory\n",
394 | "\n",
395 | "# compile the model\n",
396 | "if compile:\n",
397 | " print(\"compiling the model... (takes a ~minute)\")\n",
398 | " unoptimized_model = model\n",
399 | " model = torch.compile(model) # requires PyTorch 2.0\n",
400 | "\n",
401 | "# wrap model into DDP container\n",
402 | "if ddp:\n",
403 | " model = DDP(model, device_ids=[ddp_local_rank])\n",
404 | "\n",
405 | "# helps estimate an arbitrarily accurate loss over either split using many batches\n",
406 | "@torch.no_grad()\n",
407 | "def estimate_loss():\n",
408 | " out = {}\n",
409 | " model.eval()\n",
410 | " for split in ['train', 'val']:\n",
411 | " losses = torch.zeros(eval_iters)\n",
412 | " for k in range(eval_iters):\n",
413 | " X, Y = get_batch(split)\n",
414 | " with ctx:\n",
415 | " logits, loss = model(X, Y)\n",
416 | " losses[k] = loss.item()\n",
417 | " out[split] = losses.mean()\n",
418 | " model.train()\n",
419 | " return out\n",
420 | "\n",
421 | "# learning rate decay scheduler (cosine with warmup)\n",
422 | "def get_lr(it):\n",
423 | " # 1) linear warmup for warmup_iters steps\n",
424 | " if it < warmup_iters:\n",
425 | " return learning_rate * it / warmup_iters\n",
426 | " # 2) if it > lr_decay_iters, return min learning rate\n",
427 | " if it > lr_decay_iters:\n",
428 | " return min_lr\n",
429 | " # 3) in between, use cosine decay down to min learning rate\n",
430 | " decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)\n",
431 | " assert 0 <= decay_ratio <= 1\n",
432 | " coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1\n",
433 | " return min_lr + coeff * (learning_rate - min_lr)\n",
434 | "\n",
435 | "# logging\n",
436 | "if wandb_log and master_process:\n",
437 | " import wandb\n",
438 | " wandb.init(project=wandb_project, name=wandb_run_name, config=config)"
439 | ],
440 | "metadata": {
441 | "colab": {
442 | "base_uri": "https://localhost:8080/"
443 | },
444 | "id": "QgyIgEn1T3Te",
445 | "outputId": "752246a0-a771-4d06-aa01-05c91741a0e7"
446 | },
447 | "execution_count": null,
448 | "outputs": [
449 | {
450 | "output_type": "stream",
451 | "name": "stdout",
452 | "text": [
453 | "tokens per iteration will be: 12,288\n",
454 | "Initializing a new model from scratch\n",
455 | "defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)\n",
456 | "number of parameters: 123.59M\n"
457 | ]
458 | },
459 | {
460 | "output_type": "stream",
461 | "name": "stderr",
462 | "text": [
463 | ":177: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.\n",
464 | " scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))\n"
465 | ]
466 | },
467 | {
468 | "output_type": "stream",
469 | "name": "stdout",
470 | "text": [
471 | "num decayed parameter tensors: 50, with 124,354,560 parameters\n",
472 | "num non-decayed parameter tensors: 25, with 19,200 parameters\n",
473 | "using fused AdamW: True\n",
474 | "compiling the model... (takes a ~minute)\n"
475 | ]
476 | }
477 | ]
478 | },
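{
"cell_type": "markdown",
"source": [
"A quick, minimal sanity check of the cosine-with-warmup scheduler (`get_lr`) defined in the preamble above - this assumes the preamble cell has been run, so `get_lr`, `warmup_iters`, and `lr_decay_iters` are in scope. Note that with `warmup_iters = 10` and `max_iters = 10`, this short run spends essentially all of its iterations in the linear warmup phase."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# sketch: print the learning rate at a few representative iterations\n",
"for it in [0, 2, 5, 9, 10, 1000, 300000, 600000]:\n",
"    print(f\"iteration {it:>6}: lr = {get_lr(it):.2e}\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},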
479 | {
480 | "cell_type": "markdown",
481 | "source": [
482 | "### Task 3: Training Loop\n",
483 | "\n",
484 | "Here is where the magic happens!\n",
485 | "\n",
486 | "Before we get straight into training - let's look at our model to obtain some key insights."
487 | ],
488 | "metadata": {
489 | "id": "MFQcSiOYUyBV"
490 | }
491 | },
492 | {
493 | "cell_type": "code",
494 | "source": [
495 | "print(model)"
496 | ],
497 | "metadata": {
498 | "colab": {
499 | "base_uri": "https://localhost:8080/"
500 | },
501 | "id": "27KCZIhtuRFc",
502 | "outputId": "4edd70f7-4d06-4cc4-ad5a-c6d032cad754"
503 | },
504 | "execution_count": null,
505 | "outputs": [
506 | {
507 | "output_type": "stream",
508 | "name": "stdout",
509 | "text": [
510 | "OptimizedModule(\n",
511 | " (_orig_mod): GPT(\n",
512 | " (transformer): ModuleDict(\n",
513 | " (wte): Embedding(50304, 768)\n",
514 | " (wpe): Embedding(1024, 768)\n",
515 | " (drop): Dropout(p=0.0, inplace=False)\n",
516 | " (h): ModuleList(\n",
517 | " (0-11): 12 x Block(\n",
518 | " (ln_1): LayerNorm()\n",
519 | " (attn): CausalSelfAttention(\n",
520 | " (c_attn): Linear(in_features=768, out_features=2304, bias=False)\n",
521 | " (c_proj): Linear(in_features=768, out_features=768, bias=False)\n",
522 | " (attn_dropout): Dropout(p=0.0, inplace=False)\n",
523 | " (resid_dropout): Dropout(p=0.0, inplace=False)\n",
524 | " )\n",
525 | " (ln_2): LayerNorm()\n",
526 | " (mlp): MLP(\n",
527 | " (c_fc): Linear(in_features=768, out_features=3072, bias=False)\n",
528 | " (gelu): GELU(approximate='none')\n",
529 | " (c_proj): Linear(in_features=3072, out_features=768, bias=False)\n",
530 | " (dropout): Dropout(p=0.0, inplace=False)\n",
531 | " )\n",
532 | " )\n",
533 | " )\n",
534 | " (ln_f): LayerNorm()\n",
535 | " )\n",
536 | " (lm_head): Linear(in_features=768, out_features=50304, bias=False)\n",
537 | " )\n",
538 | ")\n"
539 | ]
540 | }
541 | ]
542 | },
543 | {
544 | "cell_type": "markdown",
545 | "source": [
546 | "##### 👪❓ Discussion Question #1:\n",
547 | "\n",
548 | "Describe how this model is different that the traditional Transformer Architecture from the paper \"Attention is All You Need\"."
549 | ],
550 | "metadata": {
551 | "id": "piujJIlnBK22"
552 | }
553 | },
554 | {
555 | "cell_type": "markdown",
556 | "source": [
557 | "Notice our final layer - the `lm_head` - and how it has `out_features=50304`. This number of output features lines up exactly with our vocabulary! When we're making predictions, we're making predictions about which token (in our vocabulary) should be selected next!"
558 | ],
559 | "metadata": {
560 | "id": "NS1oTMdV-oH-"
561 | }
562 | },
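{
"cell_type": "markdown",
"source": [
"To make that concrete, here's a minimal sketch - using small toy tensors rather than the model above - of how a single row of logits over the vocabulary becomes a probability distribution over next tokens via softmax. The highest-probability index is the token a greedy decoder would pick next."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import torch\n",
"import torch.nn.functional as F\n",
"\n",
"# toy logits for a single position over a tiny 5-token vocabulary\n",
"toy_logits = torch.tensor([2.0, -1.0, 0.5, 3.0, 0.0])\n",
"\n",
"probs = F.softmax(toy_logits, dim=-1)  # normalize the logits into probabilities\n",
"print(probs)                           # one probability per token, summing to 1.0\n",
"print(probs.argmax().item())           # index 3 - the greedy choice for the next token\n",
"\n",
"# in the model above this happens at every position, over 50,304 logits instead of 5"
],
"metadata": {},
"execution_count": null,
"outputs": []
},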
563 | {
564 | "cell_type": "markdown",
565 | "source": [
566 | "#### First Batch - What is a Batch?\n",
567 | "\n",
568 | "Let's look at our first batch to see what exactly a \"batch\" is in this context."
569 | ],
570 | "metadata": {
571 | "id": "NX778-01U2SF"
572 | }
573 | },
574 | {
575 | "cell_type": "code",
576 | "source": [
577 | "X, Y = get_batch('train') # fetch the very first batch"
578 | ],
579 | "metadata": {
580 | "id": "Gd6LWL9LU14w"
581 | },
582 | "execution_count": null,
583 | "outputs": []
584 | },
585 | {
586 | "cell_type": "code",
587 | "source": [
588 | "print(f\"X Shape: {X.shape}, Y Shape: {Y.shape}\")"
589 | ],
590 | "metadata": {
591 | "colab": {
592 | "base_uri": "https://localhost:8080/"
593 | },
594 | "id": "Y6cC2iGj-q8i",
595 | "outputId": "887ebc49-29fd-4a44-81fb-50b78f5a4303"
596 | },
597 | "execution_count": null,
598 | "outputs": [
599 | {
600 | "output_type": "stream",
601 | "name": "stdout",
602 | "text": [
603 | "X Shape: torch.Size([12, 1024]), Y Shape: torch.Size([12, 1024])\n"
604 | ]
605 | }
606 | ]
607 | },
608 | {
609 | "cell_type": "markdown",
610 | "source": [
611 | "Let's look at what our X and Y look like!\n",
612 | "\n",
613 | "> NOTE: We'll only look at the last 5 tokens of the last batch to get a sense of what is happening under the hood of the data selection."
614 | ],
615 | "metadata": {
616 | "id": "juIxTM6HVrfk"
617 | }
618 | },
619 | {
620 | "cell_type": "code",
621 | "source": [
622 | "print(X[:][-1][-5:])\n",
623 | "print(Y[:][-1][-5:])"
624 | ],
625 | "metadata": {
626 | "colab": {
627 | "base_uri": "https://localhost:8080/"
628 | },
629 | "id": "lThD-QrrVqlK",
630 | "outputId": "924a2070-b502-44b9-e514-b302e21f5003"
631 | },
632 | "execution_count": null,
633 | "outputs": [
634 | {
635 | "output_type": "stream",
636 | "name": "stdout",
637 | "text": [
638 | "tensor([18719, 11, 351, 465, 21752], device='cuda:0')\n",
639 | "tensor([ 11, 351, 465, 21752, 11], device='cuda:0')\n"
640 | ]
641 | }
642 | ]
643 | },
644 | {
645 | "cell_type": "markdown",
646 | "source": [
647 | "Notice how X and Y are simply shifted by a single index - where Y contains (at every index) the token that *follows* X!\n",
648 | "\n",
649 | "So essentially - Y contains the *labels* (or target) for X!\n",
650 | "\n",
651 | "Let's see how we can leverage this in our training loop!\n",
652 | "\n",
653 | "Before we start training - let's import our decoder so we can see specific text that our model is leveraging!"
654 | ],
655 | "metadata": {
656 | "id": "fWgmihsgWBmN"
657 | }
658 | },
659 | {
660 | "cell_type": "code",
661 | "source": [
662 | "import tiktoken\n",
663 | "\n",
664 | "enc = tiktoken.get_encoding(\"gpt2\")\n",
665 | "encode = lambda s: enc.encode(s, allowed_special={\"<|endoftext|>\"})\n",
666 | "decode = lambda l: enc.decode(l)"
667 | ],
668 | "metadata": {
669 | "id": "bL8uueRjyqL6"
670 | },
671 | "execution_count": null,
672 | "outputs": []
673 | },
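{
"cell_type": "markdown",
"source": [
"As a quick usage check (assuming the `X`, `Y` batch and the `decode` helper above are defined), we can decode the last few token IDs we printed earlier back into text - confirming that Y really is X shifted left by one token."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# decode the last 5 tokens of the last sequence in the batch - Y is X shifted left by one\n",
"print(decode(X[-1][-5:].tolist()))\n",
"print(decode(Y[-1][-5:].tolist()))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},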
674 | {
675 | "cell_type": "markdown",
676 | "source": [
677 | "## Task 4: Training the Model"
678 | ],
679 | "metadata": {
680 | "id": "-Hk5EmQjB3Gb"
681 | }
682 | },
683 | {
684 | "cell_type": "markdown",
685 | "source": [
686 | "The training loop (provided through the NanoGPT repository) has a lot of interesting things going on - but we're going to focus on the specific section of the training loop that relates to the loss and logits."
687 | ],
688 | "metadata": {
689 | "id": "xQXAoLD19w4g"
690 | }
691 | },
692 | {
693 | "cell_type": "code",
694 | "source": [
695 | "t0 = time.time()\n",
696 | "local_iter_num = 0\n",
697 | "raw_model = model.module if ddp else model\n",
698 | "running_mfu = -1.0\n",
699 | "while True:\n",
700 | " lr = get_lr(iter_num) if decay_lr else learning_rate\n",
701 | " for param_group in optimizer.param_groups:\n",
702 | " param_group['lr'] = lr\n",
703 | " if iter_num % eval_interval == 0 and master_process:\n",
704 | " losses = estimate_loss()\n",
705 | " print(f\"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}\")\n",
706 | " if wandb_log:\n",
707 | " wandb.log({\n",
708 | " \"iter\": iter_num,\n",
709 | " \"train/loss\": losses['train'],\n",
710 | " \"val/loss\": losses['val'],\n",
711 | " \"lr\": lr,\n",
712 | " \"mfu\": running_mfu*100,\n",
713 | " })\n",
714 | " if losses['val'] < best_val_loss or always_save_checkpoint:\n",
715 | " best_val_loss = losses['val']\n",
716 | " if iter_num > 0:\n",
717 | " checkpoint = {\n",
718 | " 'model': raw_model.state_dict(),\n",
719 | " 'optimizer': optimizer.state_dict(),\n",
720 | " 'model_args': model_args,\n",
721 | " 'iter_num': iter_num,\n",
722 | " 'best_val_loss': best_val_loss,\n",
723 | " 'config': config,\n",
724 | " }\n",
725 | " print(f\"saving checkpoint to {out_dir}\")\n",
726 | " torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))\n",
727 | " if iter_num == 0 and eval_only:\n",
728 | " break\n",
729 | " for micro_step in range(gradient_accumulation_steps):\n",
730 | " if ddp:\n",
731 | " model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)\n",
732 | " ####### LOGITS AND LOSS #######\n",
733 | " with ctx:\n",
734 | " logits, loss = model(X, Y)\n",
735 | " print(f\"Our inputs (truncated to the last 15 tokens in the last batch) were: {X[:][-1][-15:].cpu().numpy()}\")\n",
736 | " print(f\"Represented in text that is: {[decode(X[:][-1][-15:].cpu().numpy())]}\")\n",
737 | " print(f\"Our targets represented in text were: {[decode(Y[:][-1][-15:].cpu().numpy())]}\")\n",
738 | " print(f\"Our logits are in shape: {logits.shape}\")\n",
739 | " print(f\"The Vocabulary Size is: {logits.shape[-1]}\")\n",
740 | " print(f\"The Sequence Length is: {logits.shape[1]}\")\n",
741 | " print(f\"Our loss was calculated as: {loss}\")\n",
742 | " loss = loss / gradient_accumulation_steps\n",
743 | " X, Y = get_batch('train')\n",
744 | " scaler.scale(loss).backward()\n",
745 | " ###############################\n",
746 | " if grad_clip != 0.0:\n",
747 | " scaler.unscale_(optimizer)\n",
748 | " torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)\n",
749 | " scaler.step(optimizer)\n",
750 | " scaler.update()\n",
751 | " optimizer.zero_grad(set_to_none=True)\n",
752 | " t1 = time.time()\n",
753 | " dt = t1 - t0\n",
754 | " t0 = t1\n",
755 | " if iter_num % log_interval == 0 and master_process:\n",
756 | " lossf = loss.item() * gradient_accumulation_steps\n",
757 | " if local_iter_num >= 5:\n",
758 | " mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)\n",
759 | " running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu\n",
760 | " print(f\"iter {iter_num}: time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%\")\n",
761 | " iter_num += 1\n",
762 | " local_iter_num += 1\n",
763 | " if iter_num > max_iters:\n",
764 | " break\n",
765 | "if ddp:\n",
766 | " destroy_process_group()"
767 | ],
768 | "metadata": {
769 | "colab": {
770 | "base_uri": "https://localhost:8080/"
771 | },
772 | "id": "AhuKbQE6Upm0",
773 | "outputId": "c29a658c-c612-4e88-a307-2f82e4488ea3"
774 | },
775 | "execution_count": null,
776 | "outputs": [
777 | {
778 | "output_type": "stream",
779 | "name": "stdout",
780 | "text": [
781 | "step 0: train loss 11.0040, val loss 10.9976\n",
782 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 9005 4432 528 417 11 198 31056 286 2165 844 18719 11\n",
783 | " 351 465 21752]\n",
784 | "Represented in text that is: [' Prince Florizel,\\nSon of Polixenes, with his princess']\n",
785 | "Our targets represented in text were: [' Florizel,\\nSon of Polixenes, with his princess,']\n",
786 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
787 | "The Vocabulary Size is: 50304\n",
788 | "The Sequence Length is: 1024\n",
789 | "Our loss was calculated as: 10.98572063446045\n",
790 | "iter 0: time 77725.16ms, mfu -100.00%\n",
791 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 25 198 33873 3183 11 1497 351 683 0 198 198 51\n",
792 | " 38409 25 198]\n",
793 | "Represented in text that is: [':\\nSoldiers, away with him!\\n\\nTutor:\\n']\n",
794 | "Our targets represented in text were: ['\\nSoldiers, away with him!\\n\\nTutor:\\nAh']\n",
795 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
796 | "The Vocabulary Size is: 50304\n",
797 | "The Sequence Length is: 1024\n",
798 | "Our loss was calculated as: 11.003988265991211\n",
799 | "iter 1: time 116.64ms, mfu -100.00%\n",
800 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [4776 502 510 329 262 198 3157 395 638 1015 287 1951 437 296\n",
801 | " 13]\n",
802 | "Represented in text that is: [' score me up for the\\nlyingest knave in Christendom.']\n",
803 | "Our targets represented in text were: [' me up for the\\nlyingest knave in Christendom. What']\n",
804 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
805 | "The Vocabulary Size is: 50304\n",
806 | "The Sequence Length is: 1024\n",
807 | "Our loss was calculated as: 9.73719310760498\n",
808 | "iter 2: time 315.76ms, mfu -100.00%\n",
809 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 198 54 1670 290 649 1494 1549 13 198 198 4805 1268 5222 25\n",
810 | " 198]\n",
811 | "Represented in text that is: [\"\\nWarm and new kill'd.\\n\\nPRINCE:\\n\"]\n",
812 | "Our targets represented in text were: [\"Warm and new kill'd.\\n\\nPRINCE:\\nSearch\"]\n",
813 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
814 | "The Vocabulary Size is: 50304\n",
815 | "The Sequence Length is: 1024\n",
816 | "Our loss was calculated as: 9.323944091796875\n",
817 | "iter 3: time 318.78ms, mfu -100.00%\n",
818 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 674 10715 13 198 198 15946 455 25 198 5247 284 11\n",
819 | " 15967 26 345]\n",
820 | "Represented in text that is: [' our mystery.\\n\\nProvost:\\nGo to, sir; you']\n",
821 | "Our targets represented in text were: [' mystery.\\n\\nProvost:\\nGo to, sir; you weigh']\n",
822 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
823 | "The Vocabulary Size is: 50304\n",
824 | "The Sequence Length is: 1024\n",
825 | "Our loss was calculated as: 9.514596939086914\n",
826 | "iter 4: time 332.62ms, mfu -100.00%\n",
827 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 11 4379 705 4246 292 339 326 925 345 284 1207 577 11 198\n",
828 | " 7120]\n",
829 | "Represented in text that is: [\", seeing 'twas he that made you to depose,\\nYour\"]\n",
830 | "Our targets represented in text were: [\" seeing 'twas he that made you to depose,\\nYour oath\"]\n",
831 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
832 | "The Vocabulary Size is: 50304\n",
833 | "The Sequence Length is: 1024\n",
834 | "Our loss was calculated as: 8.966322898864746\n",
835 | "iter 5: time 320.00ms, mfu 10.52%\n",
836 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 43 2885 56 20176 24212 51 25 198 464 661 287 262\n",
837 | " 4675 3960 43989]\n",
838 | "Represented in text that is: ['LADY CAPULET:\\nThe people in the street cry Romeo']\n",
839 | "Our targets represented in text were: ['ADY CAPULET:\\nThe people in the street cry Romeo,']\n",
840 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
841 | "The Vocabulary Size is: 50304\n",
842 | "The Sequence Length is: 1024\n",
843 | "Our loss was calculated as: 8.681777000427246\n",
844 | "iter 6: time 315.95ms, mfu 10.53%\n",
845 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 2236 1282 284 22363 3725 11 326 198 1639 4145 423 7715\n",
846 | " 1549 502 0]\n",
847 | "Represented in text that is: [\" shall come to clearer knowledge, that\\nYou thus have publish'd me!\"]\n",
848 | "Our targets represented in text were: [\" come to clearer knowledge, that\\nYou thus have publish'd me! Gentle\"]\n",
849 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
850 | "The Vocabulary Size is: 50304\n",
851 | "The Sequence Length is: 1024\n",
852 | "Our loss was calculated as: 8.433647155761719\n",
853 | "iter 7: time 323.38ms, mfu 10.52%\n",
854 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 262 1633 198 1870 307 407 4259 1549 287 27666 29079 11\n",
855 | " 198 39 2502]\n",
856 | "Represented in text that is: [\" the air\\nAnd be not fix'd in doom perpetual,\\nHover\"]\n",
857 | "Our targets represented in text were: [\" air\\nAnd be not fix'd in doom perpetual,\\nHover about\"]\n",
858 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
859 | "The Vocabulary Size is: 50304\n",
860 | "The Sequence Length is: 1024\n",
861 | "Our loss was calculated as: 8.148585319519043\n",
862 | "iter 8: time 321.68ms, mfu 10.52%\n",
863 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 3025 8557 7739 550 11 198 2484 439 2245 393 26724 502\n",
864 | " 13 8192 314]\n",
865 | "Represented in text that is: [' whose spiritual counsel had,\\nShall stop or spur me. Have I']\n",
866 | "Our targets represented in text were: [' spiritual counsel had,\\nShall stop or spur me. Have I done']\n",
867 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
868 | "The Vocabulary Size is: 50304\n",
869 | "The Sequence Length is: 1024\n",
870 | "Our loss was calculated as: 7.851408958435059\n",
871 | "iter 9: time 319.71ms, mfu 10.52%\n",
872 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 407 257 1573 286 8716 30 198 4366 4467 11 15849 13\n",
873 | " 198 198 45]\n",
874 | "Represented in text that is: [' not a word of joy?\\nSome comfort, nurse.\\n\\nN']\n",
875 | "Our targets represented in text were: [' a word of joy?\\nSome comfort, nurse.\\n\\nNurse']\n",
876 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
877 | "The Vocabulary Size is: 50304\n",
878 | "The Sequence Length is: 1024\n",
879 | "Our loss was calculated as: 7.475827217102051\n",
880 | "iter 10: time 321.52ms, mfu 10.51%\n"
881 | ]
882 | }
883 | ]
884 | },
885 | {
886 | "cell_type": "markdown",
887 | "source": [
888 | "##### ❓ Question #2:\n",
889 | "\n",
890 | "Describe if, and why, this process is supervised or unsupervised."
891 | ],
892 | "metadata": {
893 | "id": "Vfg_B5rWB6te"
894 | }
895 | }
896 | ]
897 | }
--------------------------------------------------------------------------------
/06_Pre-Training/The_Loss_Function_in_LLMs_Cross_Entropy_Hardmode.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "provenance": [],
7 | "machine_shape": "hm",
8 | "gpuType": "L4",
9 | "collapsed_sections": [
10 | "otLV55h9T322"
11 | ]
12 | },
13 | "kernelspec": {
14 | "name": "python3",
15 | "display_name": "Python 3"
16 | },
17 | "language_info": {
18 | "name": "python"
19 | },
20 | "accelerator": "GPU"
21 | },
22 | "cells": [
23 | {
24 | "cell_type": "markdown",
25 | "source": [
26 | "# The Loss Function in LLMs - Cross Entropy- AIMS\n",
27 | "\n",
28 | "- Breakout #1:\n",
29 | " - Task 1: Dependencies\n",
30 | " - Task 2: Data Preparation\n",
31 | " - 🏗️ Activity #1\n",
32 | " - ❓ Question #1\n",
33 | " - Task 3: Training Loop\n",
34 | " - 👪❓ Discussion Question #1\n",
35 | " - Task 4: Training the Model\n",
36 | " - ❓ Question #2\n",
37 | " - Task 5: Do Inference\n",
38 | " - 🏗️ Activity #2\n",
39 | "\n",
40 | "Now that we have a better understanding of what decoder-only transformer based LLMs are doing to predict the next token, let's look at how they train using that prediciton mechanism!\n",
41 | "\n",
42 | "> ⚠ NOTE: This notebook is **NOT** compatible with the T4 Instance. Please ensure you're in the L4 or A100 instance."
43 | ],
44 | "metadata": {
45 | "id": "4QBs8FHvShu6"
46 | }
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "source": [
51 | "## Task 1: Dependencies\n",
52 | "\n",
53 | "We'll start by loading the repository and the requirements not included in Colab by default."
54 | ],
55 | "metadata": {
56 | "id": "-Qtr6wnLTYlR"
57 | }
58 | },
59 | {
60 | "cell_type": "code",
61 | "source": [
62 | "!pip install -qU datasets tiktoken wandb tqdm triton"
63 | ],
64 | "metadata": {
65 | "id": "XcfnV1C3Cz50",
66 | "colab": {
67 | "base_uri": "https://localhost:8080/"
68 | },
69 | "outputId": "fbbf2d62-c16d-48b7-c3ed-ee3c8470f242"
70 | },
71 | "execution_count": null,
72 | "outputs": [
73 | {
74 | "output_type": "stream",
75 | "name": "stdout",
76 | "text": [
77 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m209.5/209.5 MB\u001b[0m \u001b[31m6.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
78 | "\u001b[?25h"
79 | ]
80 | }
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {
87 | "colab": {
88 | "base_uri": "https://localhost:8080/"
89 | },
90 | "id": "4n9tJuC0BKFJ",
91 | "outputId": "297550a1-8561-40b0-ea4f-cea56489763b"
92 | },
93 | "outputs": [
94 | {
95 | "output_type": "stream",
96 | "name": "stdout",
97 | "text": [
98 | "Cloning into 'nanoGPT'...\n",
99 | "remote: Enumerating objects: 682, done.\u001b[K\n",
100 | "remote: Total 682 (delta 0), reused 0 (delta 0), pack-reused 682 (from 1)\u001b[K\n",
101 | "Receiving objects: 100% (682/682), 952.47 KiB | 28.86 MiB/s, done.\n",
102 | "Resolving deltas: 100% (385/385), done.\n"
103 | ]
104 | }
105 | ],
106 | "source": [
107 | "!git clone https://github.com/karpathy/nanoGPT.git"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "source": [
113 | "%cd nanoGPT"
114 | ],
115 | "metadata": {
116 | "colab": {
117 | "base_uri": "https://localhost:8080/"
118 | },
119 | "id": "ucLa1gp3CmB0",
120 | "outputId": "e11d11a1-2978-435b-9e22-ebea93c60e7d"
121 | },
122 | "execution_count": null,
123 | "outputs": [
124 | {
125 | "output_type": "stream",
126 | "name": "stdout",
127 | "text": [
128 | "/content/nanoGPT\n"
129 | ]
130 | }
131 | ]
132 | },
133 | {
134 | "cell_type": "markdown",
135 | "source": [
136 | "## Task 2: Data Preparation\n",
137 | "\n",
138 | "In order to have the correct form of our inputs (tokens) we need to prepare our dataset - let's do this using the script provided by the repository!"
139 | ],
140 | "metadata": {
141 | "id": "kstygtQwVOUZ"
142 | }
143 | },
144 | {
145 | "cell_type": "code",
146 | "source": [
147 | "!python data/shakespeare/prepare.py"
148 | ],
149 | "metadata": {
150 | "colab": {
151 | "base_uri": "https://localhost:8080/"
152 | },
153 | "id": "PNV0Z2uLVVay",
154 | "outputId": "cd8073b2-130f-4ef1-cad7-b6b8add2d9cd"
155 | },
156 | "execution_count": null,
157 | "outputs": [
158 | {
159 | "output_type": "stream",
160 | "name": "stdout",
161 | "text": [
162 | "train has 301,966 tokens\n",
163 | "val has 36,059 tokens\n"
164 | ]
165 | }
166 | ]
167 | },
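{
"cell_type": "markdown",
"source": [
"If you'd like to peek at what `prepare.py` produced, here's a minimal sketch that memory-maps `data/shakespeare/train.bin` (the script writes it as a flat array of `uint16` GPT-2 token IDs) and decodes the first few tokens back into text."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"import tiktoken\n",
"\n",
"# train.bin is a flat array of uint16 GPT-2 token ids written by prepare.py\n",
"train_ids = np.memmap('data/shakespeare/train.bin', dtype=np.uint16, mode='r')\n",
"print(f\"train.bin holds {len(train_ids):,} tokens\")\n",
"\n",
"enc = tiktoken.get_encoding(\"gpt2\")\n",
"print(enc.decode(train_ids[:20].tolist()))  # the first 20 tokens, decoded back to text"
],
"metadata": {},
"execution_count": null,
"outputs": []
},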
168 | {
169 | "cell_type": "markdown",
170 | "source": [
171 | "##### 🏗️ Activity #1:\n",
172 | "\n",
173 | "Describe what is happening in the `prepare.py` function in the [repository](https://github.com/karpathy/nanoGPT/blob/master/data/shakespeare/prepare.py) in natural language."
174 | ],
175 | "metadata": {
176 | "id": "OFeTcZOV_a4v"
177 | }
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "source": [
182 | "##### ❓ Question #1:\n",
183 | "\n",
184 | "What kind of tokenization strategy is being used here? (provide some examples of tokens)"
185 | ],
186 | "metadata": {
187 | "id": "yv7GVHU3_2_k"
188 | }
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "source": [
193 | "## Training Loop\n",
194 | "\n",
195 | "We'll leverage the training loop from Karpathy's repository to focus in on the specifics of where we're using loss - how we're using it - and what it means!"
196 | ],
197 | "metadata": {
198 | "id": "hPyDaUl3TgSg"
199 | }
200 | },
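{
"cell_type": "markdown",
"source": [
"As a reference for what follows - this is the quantity the training loop minimizes: the average cross-entropy (negative log-likelihood) of the correct next token at every position, over a batch of $B$ sequences of length $T$:\n",
"\n",
"$$\\mathcal{L} = -\\frac{1}{B\\,T}\\sum_{b=1}^{B}\\sum_{t=1}^{T}\\log p_\\theta\\left(y_{b,t} \\mid x_{b,\\le t}\\right)$$\n",
"\n",
"where $p_\\theta$ is the softmax over the model's logits and $y_{b,t}$ is the token that actually follows position $t$ in sequence $b$."
],
"metadata": {}
},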
201 | {
202 | "cell_type": "markdown",
203 | "source": [
204 | "### Preamble Code"
205 | ],
206 | "metadata": {
207 | "id": "otLV55h9T322"
208 | }
209 | },
210 | {
211 | "cell_type": "code",
212 | "source": [
213 | "import os\n",
214 | "import time\n",
215 | "import math\n",
216 | "import pickle\n",
217 | "from contextlib import nullcontext\n",
218 | "\n",
219 | "import numpy as np\n",
220 | "import torch\n",
221 | "from torch.nn.parallel import DistributedDataParallel as DDP\n",
222 | "from torch.distributed import init_process_group, destroy_process_group\n",
223 | "\n",
224 | "from model import GPTConfig, GPT\n",
225 | "\n",
226 | "# -----------------------------------------------------------------------------\n",
227 | "# default config values designed to train a gpt2 (124M) on OpenWebText\n",
228 | "# I/O\n",
229 | "out_dir = 'out'\n",
230 | "eval_interval = 2000\n",
231 | "log_interval = 1\n",
232 | "eval_iters = 200\n",
233 | "eval_only = False # if True, script exits right after the first eval\n",
234 | "always_save_checkpoint = True # if True, always save a checkpoint after each eval\n",
235 | "init_from = 'scratch' # 'scratch' or 'resume' or 'gpt2*'\n",
236 | "# wandb logging\n",
237 | "wandb_log = False # disabled by default\n",
238 | "wandb_project = 'owt'\n",
239 | "wandb_run_name = 'gpt2' # 'run' + str(time.time())\n",
240 | "# data\n",
241 | "dataset = 'shakespeare'\n",
242 | "gradient_accumulation_steps = 1 # used to simulate larger batch sizes\n",
243 | "batch_size = 12 # if gradient_accumulation_steps > 1, this is the micro-batch size\n",
244 | "block_size = 1024\n",
245 | "# model\n",
246 | "n_layer = 12\n",
247 | "n_head = 12\n",
248 | "n_embd = 768\n",
249 | "dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+\n",
250 | "bias = False # do we use bias inside LayerNorm and Linear layers?\n",
251 | "# adamw optimizer\n",
252 | "learning_rate = 6e-4 # max learning rate\n",
253 | "max_iters = 10 # total number of training iterations\n",
254 | "weight_decay = 1e-1\n",
255 | "beta1 = 0.9\n",
256 | "beta2 = 0.95\n",
257 | "grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0\n",
258 | "# learning rate decay settings\n",
259 | "decay_lr = True # whether to decay the learning rate\n",
260 | "warmup_iters = 10 # how many steps to warm up for\n",
261 | "lr_decay_iters = 600000 # should be ~= max_iters per Chinchilla\n",
262 | "min_lr = 6e-5 # minimum learning rate, should be ~= learning_rate/10 per Chinchilla\n",
263 | "# DDP settings\n",
264 | "backend = 'nccl' # 'nccl', 'gloo', etc.\n",
265 | "# system\n",
266 | "device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks\n",
267 | "dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32', 'bfloat16', or 'float16', the latter will auto implement a GradScaler\n",
268 | "compile = True # use PyTorch 2.0 to compile the model to be faster\n",
269 | "# -----------------------------------------------------------------------------\n",
270 | "config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]\n",
271 | "config = {k: globals()[k] for k in config_keys} # will be useful for logging\n",
272 | "# -----------------------------------------------------------------------------\n",
273 | "\n",
274 | "# various inits, derived attributes, I/O setup\n",
275 | "ddp = int(os.environ.get('RANK', -1)) != -1 # is this a ddp run?\n",
276 | "if ddp:\n",
277 | " init_process_group(backend=backend)\n",
278 | " ddp_rank = int(os.environ['RANK'])\n",
279 | " ddp_local_rank = int(os.environ['LOCAL_RANK'])\n",
280 | " ddp_world_size = int(os.environ['WORLD_SIZE'])\n",
281 | " device = f'cuda:{ddp_local_rank}'\n",
282 | " torch.cuda.set_device(device)\n",
283 | " master_process = ddp_rank == 0 # this process will do logging, checkpointing etc.\n",
284 | " seed_offset = ddp_rank # each process gets a different seed\n",
285 | " # world_size number of processes will be training simultaneously, so we can scale\n",
286 | " # down the desired gradient accumulation iterations per process proportionally\n",
287 | " assert gradient_accumulation_steps % ddp_world_size == 0\n",
288 | " gradient_accumulation_steps //= ddp_world_size\n",
289 | "else:\n",
290 | " # if not ddp, we are running on a single gpu, and one process\n",
291 | " master_process = True\n",
292 | " seed_offset = 0\n",
293 | " ddp_world_size = 1\n",
294 | "tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size\n",
295 | "print(f\"tokens per iteration will be: {tokens_per_iter:,}\")\n",
296 | "\n",
297 | "if master_process:\n",
298 | " os.makedirs(out_dir, exist_ok=True)\n",
299 | "torch.manual_seed(1337 + seed_offset)\n",
300 | "torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul\n",
301 | "torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn\n",
302 | "device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast\n",
303 | "# note: float16 data type will automatically use a GradScaler\n",
304 | "ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]\n",
305 | "ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)\n",
306 | "\n",
307 | "# poor man's data loader\n",
308 | "data_dir = os.path.join('data', dataset)\n",
309 | "def get_batch(split):\n",
310 | " # We recreate np.memmap every batch to avoid a memory leak, as per\n",
311 | " # https://stackoverflow.com/questions/45132940/numpy-memmap-memory-usage-want-to-iterate-once/61472122#61472122\n",
312 | " if split == 'train':\n",
313 | " data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')\n",
314 | " else:\n",
315 | " data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')\n",
316 | " ix = torch.randint(len(data) - block_size, (batch_size,))\n",
317 | " x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])\n",
318 | " y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])\n",
319 | " if device_type == 'cuda':\n",
320 | " # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)\n",
321 | " x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)\n",
322 | " else:\n",
323 | " x, y = x.to(device), y.to(device)\n",
324 | " return x, y\n",
325 | "\n",
326 | "# init these up here, can override if init_from='resume' (i.e. from a checkpoint)\n",
327 | "iter_num = 0\n",
328 | "best_val_loss = 1e9\n",
329 | "\n",
330 | "# attempt to derive vocab_size from the dataset\n",
331 | "meta_path = os.path.join(data_dir, 'meta.pkl')\n",
332 | "meta_vocab_size = None\n",
333 | "if os.path.exists(meta_path):\n",
334 | " with open(meta_path, 'rb') as f:\n",
335 | " meta = pickle.load(f)\n",
336 | " meta_vocab_size = meta['vocab_size']\n",
337 | " print(f\"found vocab_size = {meta_vocab_size} (inside {meta_path})\")\n",
338 | "\n",
339 | "# model init\n",
340 | "model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,\n",
341 | " bias=bias, vocab_size=None, dropout=dropout) # start with model_args from command line\n",
342 | "if init_from == 'scratch':\n",
343 | " # init a new model from scratch\n",
344 | " print(\"Initializing a new model from scratch\")\n",
345 | " # determine the vocab size we'll use for from-scratch training\n",
346 | " if meta_vocab_size is None:\n",
347 | " print(\"defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)\")\n",
348 | " model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304\n",
349 | " gptconf = GPTConfig(**model_args)\n",
350 | " model = GPT(gptconf)\n",
351 | "elif init_from == 'resume':\n",
352 | " print(f\"Resuming training from {out_dir}\")\n",
353 | " # resume training from a checkpoint.\n",
354 | " ckpt_path = os.path.join(out_dir, 'ckpt.pt')\n",
355 | " checkpoint = torch.load(ckpt_path, map_location=device)\n",
356 | " checkpoint_model_args = checkpoint['model_args']\n",
357 | " # force these config attributes to be equal otherwise we can't even resume training\n",
358 | " # the rest of the attributes (e.g. dropout) can stay as desired from command line\n",
359 | " for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:\n",
360 | " model_args[k] = checkpoint_model_args[k]\n",
361 | " # create the model\n",
362 | " gptconf = GPTConfig(**model_args)\n",
363 | " model = GPT(gptconf)\n",
364 | " state_dict = checkpoint['model']\n",
365 | " # fix the keys of the state dictionary :(\n",
366 | " # honestly no idea how checkpoints sometimes get this prefix, have to debug more\n",
367 | " unwanted_prefix = '_orig_mod.'\n",
368 | " for k,v in list(state_dict.items()):\n",
369 | " if k.startswith(unwanted_prefix):\n",
370 | " state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)\n",
371 | " model.load_state_dict(state_dict)\n",
372 | " iter_num = checkpoint['iter_num']\n",
373 | " best_val_loss = checkpoint['best_val_loss']\n",
374 | "elif init_from.startswith('gpt2'):\n",
375 | " print(f\"Initializing from OpenAI GPT-2 weights: {init_from}\")\n",
376 | " # initialize from OpenAI GPT-2 weights\n",
377 | " override_args = dict(dropout=dropout)\n",
378 | " model = GPT.from_pretrained(init_from, override_args)\n",
379 | " # read off the created config params, so we can store them into checkpoint correctly\n",
380 | " for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:\n",
381 | " model_args[k] = getattr(model.config, k)\n",
382 | "# crop down the model block size if desired, using model surgery\n",
383 | "if block_size < model.config.block_size:\n",
384 | " model.crop_block_size(block_size)\n",
385 | " model_args['block_size'] = block_size # so that the checkpoint will have the right value\n",
386 | "model.to(device)\n",
387 | "\n",
388 | "# initialize a GradScaler. If enabled=False scaler is a no-op\n",
389 | "scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))\n",
390 | "\n",
391 | "# optimizer\n",
392 | "optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)\n",
393 | "if init_from == 'resume':\n",
394 | " optimizer.load_state_dict(checkpoint['optimizer'])\n",
395 | "checkpoint = None # free up memory\n",
396 | "\n",
397 | "# compile the model\n",
398 | "if compile:\n",
399 | " print(\"compiling the model... (takes a ~minute)\")\n",
400 | " unoptimized_model = model\n",
401 | " model = torch.compile(model) # requires PyTorch 2.0\n",
402 | "\n",
403 | "# wrap model into DDP container\n",
404 | "if ddp:\n",
405 | " model = DDP(model, device_ids=[ddp_local_rank])\n",
406 | "\n",
407 | "# helps estimate an arbitrarily accurate loss over either split using many batches\n",
408 | "@torch.no_grad()\n",
409 | "def estimate_loss():\n",
410 | " out = {}\n",
411 | " model.eval()\n",
412 | " for split in ['train', 'val']:\n",
413 | " losses = torch.zeros(eval_iters)\n",
414 | " for k in range(eval_iters):\n",
415 | " X, Y = get_batch(split)\n",
416 | " with ctx:\n",
417 | " logits, loss = model(X, Y)\n",
418 | " losses[k] = loss.item()\n",
419 | " out[split] = losses.mean()\n",
420 | " model.train()\n",
421 | " return out\n",
422 | "\n",
423 | "# learning rate decay scheduler (cosine with warmup)\n",
424 | "def get_lr(it):\n",
425 | " # 1) linear warmup for warmup_iters steps\n",
426 | " if it < warmup_iters:\n",
427 | " return learning_rate * it / warmup_iters\n",
428 | " # 2) if it > lr_decay_iters, return min learning rate\n",
429 | " if it > lr_decay_iters:\n",
430 | " return min_lr\n",
431 | " # 3) in between, use cosine decay down to min learning rate\n",
432 | " decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)\n",
433 | " assert 0 <= decay_ratio <= 1\n",
434 | " coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1\n",
435 | " return min_lr + coeff * (learning_rate - min_lr)\n",
436 | "\n",
437 | "# logging\n",
438 | "if wandb_log and master_process:\n",
439 | " import wandb\n",
440 | " wandb.init(project=wandb_project, name=wandb_run_name, config=config)"
441 | ],
442 | "metadata": {
443 | "colab": {
444 | "base_uri": "https://localhost:8080/"
445 | },
446 | "id": "QgyIgEn1T3Te",
447 | "outputId": "752246a0-a771-4d06-aa01-05c91741a0e7"
448 | },
449 | "execution_count": null,
450 | "outputs": [
451 | {
452 | "output_type": "stream",
453 | "name": "stdout",
454 | "text": [
455 | "tokens per iteration will be: 12,288\n",
456 | "Initializing a new model from scratch\n",
457 | "defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)\n",
458 | "number of parameters: 123.59M\n"
459 | ]
460 | },
461 | {
462 | "output_type": "stream",
463 | "name": "stderr",
464 | "text": [
465 | ":177: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.\n",
466 | " scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))\n"
467 | ]
468 | },
469 | {
470 | "output_type": "stream",
471 | "name": "stdout",
472 | "text": [
473 | "num decayed parameter tensors: 50, with 124,354,560 parameters\n",
474 | "num non-decayed parameter tensors: 25, with 19,200 parameters\n",
475 | "using fused AdamW: True\n",
476 | "compiling the model... (takes a ~minute)\n"
477 | ]
478 | }
479 | ]
480 | },
481 | {
482 | "cell_type": "markdown",
483 | "source": [
484 | "### Task 3: Training Loop\n",
485 | "\n",
486 | "Here is where the magic happens!\n",
487 | "\n",
488 | "Before we get straight into training - let's look at our model to obtain some key insights."
489 | ],
490 | "metadata": {
491 | "id": "MFQcSiOYUyBV"
492 | }
493 | },
494 | {
495 | "cell_type": "code",
496 | "source": [
497 | "print(model)"
498 | ],
499 | "metadata": {
500 | "colab": {
501 | "base_uri": "https://localhost:8080/"
502 | },
503 | "id": "27KCZIhtuRFc",
504 | "outputId": "4edd70f7-4d06-4cc4-ad5a-c6d032cad754"
505 | },
506 | "execution_count": null,
507 | "outputs": [
508 | {
509 | "output_type": "stream",
510 | "name": "stdout",
511 | "text": [
512 | "OptimizedModule(\n",
513 | " (_orig_mod): GPT(\n",
514 | " (transformer): ModuleDict(\n",
515 | " (wte): Embedding(50304, 768)\n",
516 | " (wpe): Embedding(1024, 768)\n",
517 | " (drop): Dropout(p=0.0, inplace=False)\n",
518 | " (h): ModuleList(\n",
519 | " (0-11): 12 x Block(\n",
520 | " (ln_1): LayerNorm()\n",
521 | " (attn): CausalSelfAttention(\n",
522 | " (c_attn): Linear(in_features=768, out_features=2304, bias=False)\n",
523 | " (c_proj): Linear(in_features=768, out_features=768, bias=False)\n",
524 | " (attn_dropout): Dropout(p=0.0, inplace=False)\n",
525 | " (resid_dropout): Dropout(p=0.0, inplace=False)\n",
526 | " )\n",
527 | " (ln_2): LayerNorm()\n",
528 | " (mlp): MLP(\n",
529 | " (c_fc): Linear(in_features=768, out_features=3072, bias=False)\n",
530 | " (gelu): GELU(approximate='none')\n",
531 | " (c_proj): Linear(in_features=3072, out_features=768, bias=False)\n",
532 | " (dropout): Dropout(p=0.0, inplace=False)\n",
533 | " )\n",
534 | " )\n",
535 | " )\n",
536 | " (ln_f): LayerNorm()\n",
537 | " )\n",
538 | " (lm_head): Linear(in_features=768, out_features=50304, bias=False)\n",
539 | " )\n",
540 | ")\n"
541 | ]
542 | }
543 | ]
544 | },
545 | {
546 | "cell_type": "markdown",
547 | "source": [
548 | "##### 👪❓ Discussion Question #1:\n",
549 | "\n",
550 | "Describe how this model is different that the traditional Transformer Architecture from the paper \"Attention is All You Need\"."
551 | ],
552 | "metadata": {
553 | "id": "piujJIlnBK22"
554 | }
555 | },
556 | {
557 | "cell_type": "markdown",
558 | "source": [
559 | "Notice our final layer - the `lm_head` - and how it has `out_features=50304`. This number of output features lines up exactly with our vocabulary! When we're making predictions, we're making predictions about which token (in our vocabulary) should be selected next!"
560 | ],
561 | "metadata": {
562 | "id": "NS1oTMdV-oH-"
563 | }
564 | },
565 | {
566 | "cell_type": "markdown",
567 | "source": [
568 | "#### First Batch - What is a Batch?\n",
569 | "\n",
570 | "Let's look at our first batch to see what exactly a \"batch\" is in this context."
571 | ],
572 | "metadata": {
573 | "id": "NX778-01U2SF"
574 | }
575 | },
576 | {
577 | "cell_type": "code",
578 | "source": [
579 | "X, Y = get_batch('train') # fetch the very first batch"
580 | ],
581 | "metadata": {
582 | "id": "Gd6LWL9LU14w"
583 | },
584 | "execution_count": null,
585 | "outputs": []
586 | },
587 | {
588 | "cell_type": "code",
589 | "source": [
590 | "print(f\"X Shape: {X.shape}, Y Shape: {Y.shape}\")"
591 | ],
592 | "metadata": {
593 | "colab": {
594 | "base_uri": "https://localhost:8080/"
595 | },
596 | "id": "Y6cC2iGj-q8i",
597 | "outputId": "887ebc49-29fd-4a44-81fb-50b78f5a4303"
598 | },
599 | "execution_count": null,
600 | "outputs": [
601 | {
602 | "output_type": "stream",
603 | "name": "stdout",
604 | "text": [
605 | "X Shape: torch.Size([12, 1024]), Y Shape: torch.Size([12, 1024])\n"
606 | ]
607 | }
608 | ]
609 | },
610 | {
611 | "cell_type": "markdown",
612 | "source": [
613 | "Let's look at what our X and Y look like!\n",
614 | "\n",
615 | "> NOTE: We'll only look at the last 5 tokens of the last batch to get a sense of what is happening under the hood of the data selection."
616 | ],
617 | "metadata": {
618 | "id": "juIxTM6HVrfk"
619 | }
620 | },
621 | {
622 | "cell_type": "code",
623 | "source": [
624 | "print(X[:][-1][-5:])\n",
625 | "print(Y[:][-1][-5:])"
626 | ],
627 | "metadata": {
628 | "colab": {
629 | "base_uri": "https://localhost:8080/"
630 | },
631 | "id": "lThD-QrrVqlK",
632 | "outputId": "924a2070-b502-44b9-e514-b302e21f5003"
633 | },
634 | "execution_count": null,
635 | "outputs": [
636 | {
637 | "output_type": "stream",
638 | "name": "stdout",
639 | "text": [
640 | "tensor([18719, 11, 351, 465, 21752], device='cuda:0')\n",
641 | "tensor([ 11, 351, 465, 21752, 11], device='cuda:0')\n"
642 | ]
643 | }
644 | ]
645 | },
646 | {
647 | "cell_type": "markdown",
648 | "source": [
649 | "Notice how X and Y are simply shifted by a single index - where Y contains (at every index) the token that *follows* X!\n",
650 | "\n",
651 | "So essentially - Y contains the *labels* (or target) for X!\n",
652 | "\n",
653 | "Let's see how we can leverage this in our training loop!\n",
654 | "\n",
655 | "Before we start training - let's import our decoder so we can see specific text that our model is leveraging!"
656 | ],
657 | "metadata": {
658 | "id": "fWgmihsgWBmN"
659 | }
660 | },
661 | {
662 | "cell_type": "code",
663 | "source": [
664 | "import tiktoken\n",
665 | "\n",
666 | "enc = tiktoken.get_encoding(\"gpt2\")\n",
667 | "encode = lambda s: enc.encode(s, allowed_special={\"<|endoftext|>\"})\n",
668 | "decode = lambda l: enc.decode(l)"
669 | ],
670 | "metadata": {
671 | "id": "bL8uueRjyqL6"
672 | },
673 | "execution_count": null,
674 | "outputs": []
675 | },
676 | {
677 | "cell_type": "markdown",
678 | "source": [
679 | "## Task 4: Training the Model"
680 | ],
681 | "metadata": {
682 | "id": "-Hk5EmQjB3Gb"
683 | }
684 | },
685 | {
686 | "cell_type": "markdown",
687 | "source": [
688 | "The training loop (provided through the NanoGPT repository) has a lot of interesting things going on - but we're going to focus on the specific section of the training loop that relates to the loss and logits."
689 | ],
690 | "metadata": {
691 | "id": "xQXAoLD19w4g"
692 | }
693 | },
694 | {
695 | "cell_type": "code",
696 | "source": [
697 | "t0 = time.time()\n",
698 | "local_iter_num = 0\n",
699 | "raw_model = model.module if ddp else model\n",
700 | "running_mfu = -1.0\n",
701 | "while True:\n",
702 | " lr = get_lr(iter_num) if decay_lr else learning_rate\n",
703 | " for param_group in optimizer.param_groups:\n",
704 | " param_group['lr'] = lr\n",
705 | " if iter_num % eval_interval == 0 and master_process:\n",
706 | " losses = estimate_loss()\n",
707 | " print(f\"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}\")\n",
708 | " if wandb_log:\n",
709 | " wandb.log({\n",
710 | " \"iter\": iter_num,\n",
711 | " \"train/loss\": losses['train'],\n",
712 | " \"val/loss\": losses['val'],\n",
713 | " \"lr\": lr,\n",
714 | " \"mfu\": running_mfu*100,\n",
715 | " })\n",
716 | " if losses['val'] < best_val_loss or always_save_checkpoint:\n",
717 | " best_val_loss = losses['val']\n",
718 | " if iter_num > 0:\n",
719 | " checkpoint = {\n",
720 | " 'model': raw_model.state_dict(),\n",
721 | " 'optimizer': optimizer.state_dict(),\n",
722 | " 'model_args': model_args,\n",
723 | " 'iter_num': iter_num,\n",
724 | " 'best_val_loss': best_val_loss,\n",
725 | " 'config': config,\n",
726 | " }\n",
727 | " print(f\"saving checkpoint to {out_dir}\")\n",
728 | " torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))\n",
729 | " if iter_num == 0 and eval_only:\n",
730 | " break\n",
731 | " for micro_step in range(gradient_accumulation_steps):\n",
732 | " if ddp:\n",
733 | " model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)\n",
734 | " ####### LOGITS AND LOSS #######\n",
735 | " with ctx:\n",
736 | " logits, loss = model(X, Y)\n",
737 | " print(f\"Our inputs (truncated to the last 15 tokens in the last batch) were: {X[:][-1][-15:].cpu().numpy()}\")\n",
738 | " print(f\"Represented in text that is: {[decode(X[:][-1][-15:].cpu().numpy())]}\")\n",
739 | " print(f\"Our targets represented in text were: {[decode(Y[:][-1][-15:].cpu().numpy())]}\")\n",
740 | " print(f\"Our logits are in shape: {logits.shape}\")\n",
741 | " print(f\"The Vocabulary Size is: {logits.shape[-1]}\")\n",
742 | " print(f\"The Sequence Length is: {logits.shape[1]}\")\n",
743 | " print(f\"Our loss was calculated as: {loss}\")\n",
744 | " loss = loss / gradient_accumulation_steps\n",
745 | " X, Y = get_batch('train')\n",
746 | " scaler.scale(loss).backward()\n",
747 | " ###############################\n",
748 | " if grad_clip != 0.0:\n",
749 | " scaler.unscale_(optimizer)\n",
750 | " torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)\n",
751 | " scaler.step(optimizer)\n",
752 | " scaler.update()\n",
753 | " optimizer.zero_grad(set_to_none=True)\n",
754 | " t1 = time.time()\n",
755 | " dt = t1 - t0\n",
756 | " t0 = t1\n",
757 | " if iter_num % log_interval == 0 and master_process:\n",
758 | " lossf = loss.item() * gradient_accumulation_steps\n",
759 | " if local_iter_num >= 5:\n",
760 | " mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)\n",
761 | " running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu\n",
762 | " print(f\"iter {iter_num}: time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%\")\n",
763 | " iter_num += 1\n",
764 | " local_iter_num += 1\n",
765 | " if iter_num > max_iters:\n",
766 | " break\n",
767 | "if ddp:\n",
768 | " destroy_process_group()"
769 | ],
770 | "metadata": {
771 | "colab": {
772 | "base_uri": "https://localhost:8080/"
773 | },
774 | "id": "AhuKbQE6Upm0",
775 | "outputId": "c29a658c-c612-4e88-a307-2f82e4488ea3"
776 | },
777 | "execution_count": null,
778 | "outputs": [
779 | {
780 | "output_type": "stream",
781 | "name": "stdout",
782 | "text": [
783 | "step 0: train loss 11.0040, val loss 10.9976\n",
784 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 9005 4432 528 417 11 198 31056 286 2165 844 18719 11\n",
785 | " 351 465 21752]\n",
786 | "Represented in text that is: [' Prince Florizel,\\nSon of Polixenes, with his princess']\n",
787 | "Our targets represented in text were: [' Florizel,\\nSon of Polixenes, with his princess,']\n",
788 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
789 | "The Vocabulary Size is: 50304\n",
790 | "The Sequence Length is: 1024\n",
791 | "Our loss was calculated as: 10.98572063446045\n",
792 | "iter 0: time 77725.16ms, mfu -100.00%\n",
793 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 25 198 33873 3183 11 1497 351 683 0 198 198 51\n",
794 | " 38409 25 198]\n",
795 | "Represented in text that is: [':\\nSoldiers, away with him!\\n\\nTutor:\\n']\n",
796 | "Our targets represented in text were: ['\\nSoldiers, away with him!\\n\\nTutor:\\nAh']\n",
797 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
798 | "The Vocabulary Size is: 50304\n",
799 | "The Sequence Length is: 1024\n",
800 | "Our loss was calculated as: 11.003988265991211\n",
801 | "iter 1: time 116.64ms, mfu -100.00%\n",
802 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [4776 502 510 329 262 198 3157 395 638 1015 287 1951 437 296\n",
803 | " 13]\n",
804 | "Represented in text that is: [' score me up for the\\nlyingest knave in Christendom.']\n",
805 | "Our targets represented in text were: [' me up for the\\nlyingest knave in Christendom. What']\n",
806 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
807 | "The Vocabulary Size is: 50304\n",
808 | "The Sequence Length is: 1024\n",
809 | "Our loss was calculated as: 9.73719310760498\n",
810 | "iter 2: time 315.76ms, mfu -100.00%\n",
811 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 198 54 1670 290 649 1494 1549 13 198 198 4805 1268 5222 25\n",
812 | " 198]\n",
813 | "Represented in text that is: [\"\\nWarm and new kill'd.\\n\\nPRINCE:\\n\"]\n",
814 | "Our targets represented in text were: [\"Warm and new kill'd.\\n\\nPRINCE:\\nSearch\"]\n",
815 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
816 | "The Vocabulary Size is: 50304\n",
817 | "The Sequence Length is: 1024\n",
818 | "Our loss was calculated as: 9.323944091796875\n",
819 | "iter 3: time 318.78ms, mfu -100.00%\n",
820 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 674 10715 13 198 198 15946 455 25 198 5247 284 11\n",
821 | " 15967 26 345]\n",
822 | "Represented in text that is: [' our mystery.\\n\\nProvost:\\nGo to, sir; you']\n",
823 | "Our targets represented in text were: [' mystery.\\n\\nProvost:\\nGo to, sir; you weigh']\n",
824 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
825 | "The Vocabulary Size is: 50304\n",
826 | "The Sequence Length is: 1024\n",
827 | "Our loss was calculated as: 9.514596939086914\n",
828 | "iter 4: time 332.62ms, mfu -100.00%\n",
829 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 11 4379 705 4246 292 339 326 925 345 284 1207 577 11 198\n",
830 | " 7120]\n",
831 | "Represented in text that is: [\", seeing 'twas he that made you to depose,\\nYour\"]\n",
832 | "Our targets represented in text were: [\" seeing 'twas he that made you to depose,\\nYour oath\"]\n",
833 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
834 | "The Vocabulary Size is: 50304\n",
835 | "The Sequence Length is: 1024\n",
836 | "Our loss was calculated as: 8.966322898864746\n",
837 | "iter 5: time 320.00ms, mfu 10.52%\n",
838 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 43 2885 56 20176 24212 51 25 198 464 661 287 262\n",
839 | " 4675 3960 43989]\n",
840 | "Represented in text that is: ['LADY CAPULET:\\nThe people in the street cry Romeo']\n",
841 | "Our targets represented in text were: ['ADY CAPULET:\\nThe people in the street cry Romeo,']\n",
842 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
843 | "The Vocabulary Size is: 50304\n",
844 | "The Sequence Length is: 1024\n",
845 | "Our loss was calculated as: 8.681777000427246\n",
846 | "iter 6: time 315.95ms, mfu 10.53%\n",
847 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 2236 1282 284 22363 3725 11 326 198 1639 4145 423 7715\n",
848 | " 1549 502 0]\n",
849 | "Represented in text that is: [\" shall come to clearer knowledge, that\\nYou thus have publish'd me!\"]\n",
850 | "Our targets represented in text were: [\" come to clearer knowledge, that\\nYou thus have publish'd me! Gentle\"]\n",
851 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
852 | "The Vocabulary Size is: 50304\n",
853 | "The Sequence Length is: 1024\n",
854 | "Our loss was calculated as: 8.433647155761719\n",
855 | "iter 7: time 323.38ms, mfu 10.52%\n",
856 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 262 1633 198 1870 307 407 4259 1549 287 27666 29079 11\n",
857 | " 198 39 2502]\n",
858 | "Represented in text that is: [\" the air\\nAnd be not fix'd in doom perpetual,\\nHover\"]\n",
859 | "Our targets represented in text were: [\" air\\nAnd be not fix'd in doom perpetual,\\nHover about\"]\n",
860 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
861 | "The Vocabulary Size is: 50304\n",
862 | "The Sequence Length is: 1024\n",
863 | "Our loss was calculated as: 8.148585319519043\n",
864 | "iter 8: time 321.68ms, mfu 10.52%\n",
865 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 3025 8557 7739 550 11 198 2484 439 2245 393 26724 502\n",
866 | " 13 8192 314]\n",
867 | "Represented in text that is: [' whose spiritual counsel had,\\nShall stop or spur me. Have I']\n",
868 | "Our targets represented in text were: [' spiritual counsel had,\\nShall stop or spur me. Have I done']\n",
869 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
870 | "The Vocabulary Size is: 50304\n",
871 | "The Sequence Length is: 1024\n",
872 | "Our loss was calculated as: 7.851408958435059\n",
873 | "iter 9: time 319.71ms, mfu 10.52%\n",
874 | "Our inputs (truncated to the last 15 tokens in the last batch) were: [ 407 257 1573 286 8716 30 198 4366 4467 11 15849 13\n",
875 | " 198 198 45]\n",
876 | "Represented in text that is: [' not a word of joy?\\nSome comfort, nurse.\\n\\nN']\n",
877 | "Our targets represented in text were: [' a word of joy?\\nSome comfort, nurse.\\n\\nNurse']\n",
878 | "Our logits are in shape: torch.Size([12, 1024, 50304])\n",
879 | "The Vocabulary Size is: 50304\n",
880 | "The Sequence Length is: 1024\n",
881 | "Our loss was calculated as: 7.475827217102051\n",
882 | "iter 10: time 321.52ms, mfu 10.51%\n"
883 | ]
884 | }
885 | ]
886 | },
887 | {
888 | "cell_type": "markdown",
889 | "source": [
890 | "##### ❓ Question #2:\n",
891 | "\n",
892 | "Describe whether this process is supervised or unsupervised, and why."
893 | ],
894 | "metadata": {
895 | "id": "Vfg_B5rWB6te"
896 | }
897 | },
898 | {
899 | "cell_type": "markdown",
900 | "source": [
901 | "## Inference:"
902 | ],
903 | "metadata": {
904 | "id": "rCKh9ej8D8HW"
905 | }
906 | },
907 | {
908 | "cell_type": "markdown",
909 | "source": [
910 | "##### 🏗️ Activity #2\n",
911 | "\n",
912 | "Leverage the trained model and do some inference!"
913 | ],
914 | "metadata": {
915 | "id": "KQvSTQxpEBw_"
916 | }
917 | },
918 | {
919 | "cell_type": "code",
920 | "source": [
921 | "### YOUR CODE HERE"
922 | ],
923 | "metadata": {
924 | "id": "xJchmShFD_pb"
925 | },
926 | "execution_count": null,
927 | "outputs": []
928 | }
929 | ]
930 | }
--------------------------------------------------------------------------------
/07_Fine-tuning/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 | # 🚇 Session 7: Fine-Tuning
7 |
8 | ### [Quicklinks](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs/tree/main/00_AIM_Quicklinks)
9 |
10 | | 📰 Session Sheet | 📽️ YouTube Video | 🖼️ Slides | 👨💻 Repo |
11 | |:-----------------|:-----------------|:-----------------|:-----------------|
12 | | [Session 7: Fine-Tuning ](https://www.notion.so/Session-7-Fine-Tuning-1a7cd547af3d80e3adedd44ef63ac992) | [07: Fine-Tuning](https://www.youtube.com/watch?v=ELu2dy2Iccs&ab_channel=AIMakerspace) | [Session 7: Fine-Tuning](https://www.canva.com/design/DAGY7ZxFsRU/wzpT21_Ub_a3RAo3-HVvPQ/view?utm_content=DAGY7ZxFsRU&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=hcadf98dde7) | You Are Here! |
13 |
14 | ### Assignment:
15 |
16 | Today's assignments are available in Colab:
17 | - Assignment #1:
18 | - [Assignment](https://colab.research.google.com/drive/18KDy41LCsTKpg6M03A94VkGqRuVJx4yA?usp=sharing)
19 | - [Hardmode Assignment](https://colab.research.google.com/drive/1lXa2jU2_7aEduHI5x2TWZ5x3VawLKSKa?usp=sharing)
20 |
21 | 1. Breakout Room #1:
22 | - Task #1: Loading the Model
23 | - 👪❓ Discussion Question #1
24 | - ❓ Question #1
25 | - ❓ Question #2
26 | 2. Breakout Room #2:
27 | - Task #2: Data and Data Prep.
28 | - 🏗️ Activity #1
29 | - Task #3: Setting up PEFT LoRA
30 | - ❓ Question #3
31 | - Task #4: Training the Model
32 | - ❓ Question #4
33 | - ❓ Question #5
34 | - Task #5: Share Your Model!
35 | - ❓ Question #6
36 |
37 | ### Hardmode:
38 |
39 | Take the [base](https://huggingface.co/meta-llama/Llama-3.1-8B) Llama 3.1 8B model and instruction-tune it using TRL on [this](https://huggingface.co/datasets/yahma/alpaca-cleaned) instruction following dataset.
40 |
41 | Evaluate your final model using Eleuther AI's [`lm-evaluation-harness`](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) on the [`IFEval`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/ifeval/README.md) task.
42 |
43 | Report the baseline Llama 3.1 8B model's performance and your model's performance, then compare and contrast your results with Llama 3.1 8B Instruct.
44 |
45 | > NOTE: This will consume a large volume of compute credits - and take a long time! Only embark on this journey if you really want to get deep into the weeds!
46 |
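47 | For reference, here is a minimal sketch of the Hardmode flow using TRL's `SFTTrainer`. Treat it as a starting point under stated assumptions: exact keyword arguments differ across TRL versions, and the hyperparameters below are placeholders rather than recommendations.
48 | 
49 | ```python
50 | from datasets import load_dataset
51 | from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
52 | from trl import SFTTrainer
53 | 
54 | model_name = "meta-llama/Llama-3.1-8B"  # gated model - request access on Hugging Face first
55 | tokenizer = AutoTokenizer.from_pretrained(model_name)
56 | model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
57 | 
58 | # yahma/alpaca-cleaned rows carry "instruction", "input", and "output" columns;
59 | # flatten each row into a single prompt/response string for causal-LM fine-tuning.
60 | def to_text(row):
61 |     prompt = row["instruction"] + ("\n" + row["input"] if row["input"] else "")
62 |     return {"text": f"### Instruction:\n{prompt}\n\n### Response:\n{row['output']}"}
63 | 
64 | dataset = load_dataset("yahma/alpaca-cleaned", split="train").map(to_text)
65 | 
66 | trainer = SFTTrainer(
67 |     model=model,
68 |     tokenizer=tokenizer,            # newer TRL releases rename this to `processing_class`
69 |     train_dataset=dataset,
70 |     dataset_text_field="text",      # newer TRL releases move this into `SFTConfig`
71 |     args=TrainingArguments(
72 |         output_dir="llama31-alpaca-sft",
73 |         per_device_train_batch_size=1,
74 |         gradient_accumulation_steps=16,
75 |         num_train_epochs=1,
76 |     ),
77 | )
78 | trainer.train()
79 | 
80 | # Then score the baseline and your checkpoint with lm-evaluation-harness, e.g.:
81 | #   lm_eval --model hf --model_args pretrained=llama31-alpaca-sft --tasks ifeval --batch_size 8
82 | ```
83 | 
84 | If you are short on VRAM, add a `peft` LoRA config and 4-bit loading before scaling up.
85 | 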
--------------------------------------------------------------------------------
/08_Alignment/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 | # 🚇 Session 8: Alignment
7 |
8 | ### [Quicklinks](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs/tree/main/00_AIM_Quicklinks)
9 |
10 | | 📰 Session Sheet | 📽️ YouTube Video | 🖼️ Slides | 👨💻 Repo |
11 | |:-----------------|:-----------------|:-----------------|:-----------------|
12 | | [Session 8: Alignment](https://www.notion.so/Session-8-Alignment-1a7cd547af3d800ab391dd8f2ceb9329) | [08: Alignment ](https://www.youtube.com/watch?v=4ehPGFIf91o&ab_channel=AIMakerspace) | [Session 8: Alignment](https://www.canva.com/design/DAGZHXVSNBE/OHkXXiAmsfSXwHL1r2P0bw/view?utm_content=DAGZHXVSNBE&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=h59157a8ad7) |You Are Here! |
13 |
14 | ### Assignment:
15 |
16 | Today's assignments are available in Colab:
17 | - Assignment #1:
18 | - [Notebook #1 Assignment](https://colab.research.google.com/drive/1h4xq7cfBv9Gg_YPWvEPblP6fuCK2vjQy?usp=sharing)
19 | - [Notebook #2 Assignment](https://colab.research.google.com/drive/11qCfcABsxjjde7EihH6nHMH96aNBNONL?usp=sharing)
20 |
21 | ### Hardmode:
22 |
23 | Take the [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model as your base and apply RLHF to it using a dataset of your choosing.
24 |
25 | > ENSURE THERE IS AN EVALUATION METRIC ALIGNED WITH THE GOAL
26 |
27 | Evaluate your model on a specific metric (using the Hugging Face `evaluate` library) to establish a baseline - then measure the delta introduced by the RLHF process.
28 |
29 | > NOTE: This will consume a large volume of compute credits - and take a long time! Only embark on this journey if you really want to get deep into the weeds!
30 |
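31 | As a starting point, here is a minimal sketch of the baselining step with the Hugging Face `evaluate` library. The metric, prompts, and references below are placeholders - choose ones that actually match your RLHF goal, then rerun the identical scoring on your post-RLHF checkpoint to measure the delta.
32 | 
33 | ```python
34 | import evaluate
35 | from transformers import pipeline
36 | 
37 | # Hypothetical evaluation set - swap in prompts/references aligned with your chosen goal.
38 | prompts = ["Summarize the plot of Romeo and Juliet in one sentence."]
39 | references = ["Two young lovers from feuding families die, and their deaths end the feud."]
40 | 
41 | # Score the pre-RLHF model first; repeat with the post-RLHF checkpoint afterwards.
42 | generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct", device_map="auto")
43 | predictions = [
44 |     generator(p, max_new_tokens=64, return_full_text=False)[0]["generated_text"]
45 |     for p in prompts
46 | ]
47 | 
48 | rouge = evaluate.load("rouge")  # placeholder metric - pick one aligned with your goal
49 | baseline = rouge.compute(predictions=predictions, references=references)
50 | print(baseline)
51 | ```
52 | 
53 | The RLHF step itself is typically run with TRL (a reward model plus `PPOTrainer`, or a preference-tuning trainer); whichever you choose, keep the evaluation metric fixed so the before/after comparison stays meaningful.
54 | 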
--------------------------------------------------------------------------------
/09_Alignment_II/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 | # 🚇 Session 9: Alignment II
7 |
8 | ### [Quicklinks](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs/tree/main/00_AIM_Quicklinks)
9 |
10 | | 📰 Session Sheet | 📽️ YouTube Video | 🖼️ Slides | 👨💻 Repo |
11 | |:-----------------|:-----------------|:-----------------|:-----------------|
12 | | [Session 9: Alignment II and Merging](https://www.notion.so/Session-9-Alignment-II-and-Merging-1a7cd547af3d805ca7bae487d35073a5) | [09: Alignment II & Merging](https://www.youtube.com/watch?v=VzTujojD1ho&ab_channel=AIMakerspace) | [Session 9: Alignment II & Merging](https://www.canva.com/design/DAGZlbhGppY/Lmr8nwEG4T5p8vsY3pqOmw/view?utm_content=DAGZlbhGppY&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=h580ea460bf) | You Are Here! |
13 |
14 | ### Assignment:
15 |
16 | Today's assignments are available in Colab:
17 | - Assignment #1:
18 | - [Notebook #1 Assignment](https://colab.research.google.com/drive/1YkkPh1tAGheA39b5DEMs1DalZ5cMsvpV?usp=sharing)
19 | - [Notebook #2 Assignment](https://colab.research.google.com/drive/1-H6vxZda7P-DEmenH2r6HOJBai_fAp1a?usp=sharing)
20 |
21 |
--------------------------------------------------------------------------------
/10_Cool_Session/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 | # 🚇 Session 10: Frontiers
7 |
8 | ### [Quicklinks](https://github.com/AI-Maker-Space/LLM-Engineering-Foundations-to-SLMs/tree/main/00_AIM_Quicklinks)
9 |
10 | | 📰 Session Sheet | 📽️ YouTube Video | 🖼️ Slides | 👨💻 Repo |
11 | |:-----------------|:-----------------|:-----------------|:-----------------|
12 | | [Session 10: Frontiers](https://www.notion.so/Session-10-Frontiers-1a7cd547af3d80ff9482cee21e65edaf) | [10: Frontiers ](https://www.youtube.com/watch?v=ft8DrEW1ZSc&ab_channel=AIMakerspace) | [Session 10: Frontiers](https://www.canva.com/design/DAGZsTHsSuE/hLmurLxgBsX-D8royq42jA/view?utm_content=DAGZsTHsSuE&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=hf1f2028a93) | You Are Here!
13 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
4 |
5 |
6 |
7 | # :wave: Welcome to LLM Engineering - Foundations to SLMs!
8 |
9 |
10 | Large Language Model Engineering (LLM Engineering) refers to the emerging best practices and tools for pretraining, post-training, and optimizing LLMs prior to production deployment.
11 |
12 | Pre- and post-training techniques include unsupervised pretraining, supervised fine-tuning, alignment, model merging, distillation, quantization, and others.
13 |
14 | *LLM Engineering today is done with the GPT-style transformer architecture.
15 |
16 | **Small Language Models (SLMs) of today can be as large as 70B parameters.
17 |
18 | ## Course Modules
19 | This course teaches you the fundamentals of LLMs and will quickly ramp you up to the practical LLM Engineering edge. When you complete this course, you will understand how the latest Large and Small Language Models are built, and you'll be ready to build, ship, and share your very own.
20 | ### Module 1: Transformer: Attention Is All You Need
21 | - 🤖 The Transformer
22 | - 🧐 Attention
23 | - 🔠 Embeddings
24 | ### Module 2: Practical LLM Mechanics
25 | - 🪙 Next-Token Prediction
26 | - 🔡 Embedding Models
27 | ### Module 3: LLM Training, Fine-Tuning, and Alignment
28 | - 🚇 Pretraining
29 | - 🚉 Fine-Tuning
30 | - 🛤️ Alignment
31 | ### Module 4: LLM Engineering Frontiers
32 | - 🥪 Model Merging
33 | - ⚗️ Distillation
34 |
35 | and more from the LLM Edge!
36 |
37 | ## 🙏 Contributions
38 |
39 | We believe in the power of collaboration. Contributions, ideas, and feedback are highly encouraged! Let's build the ultimate resource for LLMEs together. 🤝
40 |
41 | Feel free to reach out with any questions or suggestions. Happy coding! 🚀🔮
42 |
43 | 👤 Follow us on [Twitter](https://twitter.com/AIMakerspace) and [LinkedIn](https://www.linkedin.com/company/ai-maker-space) for the latest news!
44 |
--------------------------------------------------------------------------------