├── .gitignore ├── Attention Mechanism.ipynb ├── Images ├── 1.webp ├── 2.webp ├── 3.webp ├── 4.webp ├── 5.webp ├── 6.webp ├── 7.webp ├── Token Embeddings.webp ├── data sampling with sliding window.webp ├── positional Encoding.webp ├── step 1.webp └── word embeddings.webp ├── LLM Architecture.ipynb ├── README.md ├── Tokenizer.ipynb └── the-verdict.txt /.gitignore: -------------------------------------------------------------------------------- 1 | llm_scratch.ipynb 2 | -------------------------------------------------------------------------------- /Attention Mechanism.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "1ae38945-39dd-45dc-ad4f-da7a4404241f", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "\n", 10 | "\n", 16 | "\n", 19 | "\n", 20 | "
\n", 11 | "\n", 12 | "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka
\n", 13 | "
Code repository: https://github.com/rasbt/LLMs-from-scratch\n", 14 | "
\n", 15 | "
\n", 17 | "\n", 18 | "
\n" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "8bfa70ec-5c4c-40e8-b923-16f8167e3181", 26 | "metadata": {}, 27 | "source": [ 28 | "# Chapter 3: Coding Attention Mechanisms" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "c29bcbe8-a034-43a2-b557-997b03c9882d", 34 | "metadata": {}, 35 | "source": [ 36 | "Packages that are being used in this notebook:" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 1, 42 | "id": "e58f33e8-5dc9-4dd5-ab84-5a011fa11d92", 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "name": "stdout", 47 | "output_type": "stream", 48 | "text": [ 49 | "torch version: 2.4.0\n" 50 | ] 51 | } 52 | ], 53 | "source": [ 54 | "from importlib.metadata import version\n", 55 | "\n", 56 | "print(\"torch version:\", version(\"torch\"))" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "id": "a2a4474d-7c68-4846-8702-37906cf08197", 62 | "metadata": {}, 63 | "source": [ 64 | "- This chapter covers attention mechanisms, the engine of LLMs:" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "id": "02a11208-d9d3-44b1-8e0d-0c8414110b93", 70 | "metadata": {}, 71 | "source": [ 72 | "" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "id": "50e020fd-9690-4343-80df-da96678bef5e", 78 | "metadata": {}, 79 | "source": [ 80 | "" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "id": "ecc4dcee-34ea-4c05-9085-2f8887f70363", 86 | "metadata": {}, 87 | "source": [ 88 | "## 3.1 The problem with modeling long sequences" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "id": "a55aa49c-36c2-48da-b1d9-70f416e46a6a", 94 | "metadata": {}, 95 | "source": [ 96 | "- No code in this section\n", 97 | "- Translating a text word by word isn't feasible due to the differences in grammatical structures between the source and target languages:" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "id": "55c0c433-aa4b-491e-848a-54905ebb05ad", 103 | "metadata": {}, 104 | "source": [ 105 | "" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "id": "db03c48a-3429-48ea-9d4a-2e53b0e516b1", 111 | "metadata": {}, 112 | "source": [ 113 | "- Prior to the introduction of transformer models, encoder-decoder RNNs were commonly used for machine translation tasks\n", 114 | "- In this setup, the encoder processes a sequence of tokens from the source language, using a hidden state—a kind of intermediate layer within the neural network—to generate a condensed representation of the entire input sequence:" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "id": "03d8df2c-c1c2-4df0-9977-ade9713088b2", 120 | "metadata": {}, 121 | "source": [ 122 | "" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "id": "3602c585-b87a-41c7-a324-c5e8298849df", 128 | "metadata": {}, 129 | "source": [ 130 | "## 3.2 Capturing data dependencies with attention mechanisms" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "id": "b6fde64c-6034-421d-81d9-8244932086ea", 136 | "metadata": {}, 137 | "source": [ 138 | "- No code in this section\n", 139 | "- Through an attention mechanism, the text-generating decoder segment of the network is capable of selectively accessing all input tokens, implying that certain input tokens hold more significance than others in the generation of a specific output token:" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "id": "bc4f6293-8ab5-4aeb-a04c-50ee158485b1", 145 | "metadata": {}, 146 | "source": [ 147 | "" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | 
"id": "8044be1f-e6a2-4a1f-a6dd-e325d3bad05e", 153 | "metadata": {}, 154 | "source": [ 155 | "- Self-attention in transformers is a technique designed to enhance input representations by enabling each position in a sequence to engage with and determine the relevance of every other position within the same sequence" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "id": "6565dc9f-b1be-4c78-b503-42ccc743296c", 161 | "metadata": {}, 162 | "source": [ 163 | "" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "id": "5efe05ff-b441-408e-8d66-cde4eb3397e3", 169 | "metadata": {}, 170 | "source": [ 171 | "## 3.3 Attending to different parts of the input with self-attention" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "id": "6d9af516-7c37-4400-ab53-34936d5495a9", 177 | "metadata": {}, 178 | "source": [ 179 | "### 3.3.1 A simple self-attention mechanism without trainable weights" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "id": "d269e9f1-df11-4644-b575-df338cf46cdf", 185 | "metadata": {}, 186 | "source": [ 187 | "- This section explains a very simplified variant of self-attention, which does not contain any trainable weights\n", 188 | "- This is purely for illustration purposes and NOT the attention mechanism that is used in transformers\n", 189 | "- The next section, section 3.3.2, will extend this simple attention mechanism to implement the real self-attention mechanism\n", 190 | "- Suppose we are given an input sequence $x^{(1)}$ to $x^{(T)}$\n", 191 | " - The input is a text (for example, a sentence like \"Your journey starts with one step\") that has already been converted into token embeddings as described in chapter 2\n", 192 | " - For instance, $x^{(1)}$ is a d-dimensional vector representing the word \"Your\", and so forth\n", 193 | "- **Goal:** compute context vectors $z^{(i)}$ for each input sequence element $x^{(i)}$ in $x^{(1)}$ to $x^{(T)}$ (where $z$ and $x$ have the same dimension)\n", 194 | " - A context vector $z^{(i)}$ is a weighted sum over the inputs $x^{(1)}$ to $x^{(T)}$\n", 195 | " - The context vector is \"context\"-specific to a certain input\n", 196 | " - Instead of $x^{(i)}$ as a placeholder for an arbitrary input token, let's consider the second input, $x^{(2)}$\n", 197 | " - And to continue with a concrete example, instead of the placeholder $z^{(i)}$, we consider the second output context vector, $z^{(2)}$\n", 198 | " - The second context vector, $z^{(2)}$, is a weighted sum over all inputs $x^{(1)}$ to $x^{(T)}$ weighted with respect to the second input element, $x^{(2)}$\n", 199 | " - The attention weights are the weights that determine how much each of the input elements contributes to the weighted sum when computing $z^{(2)}$\n", 200 | " - In short, think of $z^{(2)}$ as a modified version of $x^{(2)}$ that also incorporates information about all other input elements that are relevant to a given task at hand" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "id": "fcc7c7a2-b6ab-478f-ae37-faa8eaa8049a", 206 | "metadata": {}, 207 | "source": [ 208 | "\n", 209 | "\n", 210 | "- (Please note that the numbers in this figure are truncated to one\n", 211 | "digit after the decimal point to reduce visual clutter; similarly, other figures may also contain truncated values)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "id": "ff856c58-8382-44c7-827f-798040e6e697", 217 | "metadata": {}, 218 | "source": [ 219 | "- By convention, the unnormalized attention weights are referred to as 
**\"attention scores\"** whereas the normalized attention scores, which sum to 1, are referred to as **\"attention weights\"**\n" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "id": "01b10344-128d-462a-823f-2178dff5fd58", 225 | "metadata": {}, 226 | "source": [ 227 | "- The code below walks through the figure above step by step\n", 228 | "\n", 229 | "
\n", 230 | "\n", 231 | "- **Step 1:** compute unnormalized attention scores $\\omega$\n", 232 | "- Suppose we use the second input token as the query, that is, $q^{(2)} = x^{(2)}$, we compute the unnormalized attention scores via dot products:\n", 233 | " - $\\omega_{21} = x^{(1)} q^{(2)\\top}$\n", 234 | " - $\\omega_{22} = x^{(2)} q^{(2)\\top}$\n", 235 | " - $\\omega_{23} = x^{(3)} q^{(2)\\top}$\n", 236 | " - ...\n", 237 | " - $\\omega_{2T} = x^{(T)} q^{(2)\\top}$\n", 238 | "- Above, $\\omega$ is the Greek letter \"omega\" used to symbolize the unnormalized attention scores\n", 239 | " - The subscript \"21\" in $\\omega_{21}$ means that input sequence element 2 was used as a query against input sequence element 1" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "id": "35e55f7a-f2d0-4f24-858b-228e4fe88fb3", 245 | "metadata": {}, 246 | "source": [ 247 | "- Suppose we have the following input sentence that is already embedded in 3-dimensional vectors as described in chapter 3 (we use a very small embedding dimension here for illustration purposes, so that it fits onto the page without line breaks):" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 2, 253 | "id": "22b9556a-aaf8-4ab4-a5b4-973372b0b2c3", 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "import torch\n", 258 | "\n", 259 | "inputs = torch.tensor(\n", 260 | " [[0.43, 0.15, 0.89], # Your (x^1)\n", 261 | " [0.55, 0.87, 0.66], # journey (x^2)\n", 262 | " [0.57, 0.85, 0.64], # starts (x^3)\n", 263 | " [0.22, 0.58, 0.33], # with (x^4)\n", 264 | " [0.77, 0.25, 0.10], # one (x^5)\n", 265 | " [0.05, 0.80, 0.55]] # step (x^6)\n", 266 | ")" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "id": "299baef3-b1a8-49ba-bad4-f62c8a416d83", 272 | "metadata": {}, 273 | "source": [ 274 | "- (In this book, we follow the common machine learning and deep learning convention where training examples are represented as rows and feature values as columns; in the case of the tensor shown above, each row represents a word, and each column represents an embedding dimension)\n", 275 | "\n", 276 | "- The primary objective of this section is to demonstrate how the context vector $z^{(2)}$\n", 277 | " is calculated using the second input sequence, $x^{(2)}$, as a query\n", 278 | "\n", 279 | "- The figure depicts the initial step in this process, which involves calculating the attention scores ω between $x^{(2)}$\n", 280 | " and all other input elements through a dot product operation" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "id": "5cb3453a-58fa-42c4-b225-86850bc856f8", 286 | "metadata": {}, 287 | "source": [ 288 | "" 289 | ] 290 | }, 291 | { 292 | "cell_type": "markdown", 293 | "id": "77be52fb-82fd-4886-a4c8-f24a9c87af22", 294 | "metadata": {}, 295 | "source": [ 296 | "- We use input sequence element 2, $x^{(2)}$, as an example to compute context vector $z^{(2)}$; later in this section, we will generalize this to compute all context vectors.\n", 297 | "- The first step is to compute the unnormalized attention scores by computing the dot product between the query $x^{(2)}$ and all other input tokens:" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 3, 303 | "id": "6fb5b2f8-dd2c-4a6d-94ef-a0e9ad163951", 304 | "metadata": {}, 305 | "outputs": [ 306 | { 307 | "name": "stdout", 308 | "output_type": "stream", 309 | "text": [ 310 | "tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])\n" 311 | ] 312 | } 313 | ], 314 | "source": [ 315 | "query = 
inputs[1] # 2nd input token is the query\n", 316 | "\n", 317 | "attn_scores_2 = torch.empty(inputs.shape[0])\n", 318 | "for i, x_i in enumerate(inputs):\n", 319 | " attn_scores_2[i] = torch.dot(x_i, query) # dot product (transpose not necessary here since they are 1-dim vectors)\n", 320 | "\n", 321 | "print(attn_scores_2)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "id": "8df09ae0-199f-4b6f-81a0-2f70546684b8", 327 | "metadata": {}, 328 | "source": [ 329 | "- Side note: a dot product is essentially a shorthand for multiplying two vectors elements-wise and summing the resulting products:" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 4, 335 | "id": "9842f39b-1654-410e-88bf-d1b899bf0241", 336 | "metadata": {}, 337 | "outputs": [ 338 | { 339 | "name": "stdout", 340 | "output_type": "stream", 341 | "text": [ 342 | "tensor(0.9544)\n", 343 | "tensor(0.9544)\n" 344 | ] 345 | } 346 | ], 347 | "source": [ 348 | "res = 0.\n", 349 | "\n", 350 | "for idx, element in enumerate(inputs[0]):\n", 351 | " res += inputs[0][idx] * query[idx]\n", 352 | "\n", 353 | "print(res)\n", 354 | "print(torch.dot(inputs[0], query))" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "id": "7d444d76-e19e-4e9a-a268-f315d966609b", 360 | "metadata": {}, 361 | "source": [ 362 | "- **Step 2:** normalize the unnormalized attention scores (\"omegas\", $\\omega$) so that they sum up to 1\n", 363 | "- Here is a simple way to normalize the unnormalized attention scores to sum up to 1 (a convention, useful for interpretation, and important for training stability):" 364 | ] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "id": "dfd965d6-980c-476a-93d8-9efe603b1b3b", 369 | "metadata": {}, 370 | "source": [ 371 | "" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 5, 377 | "id": "e3ccc99c-33ce-4f11-b7f2-353cf1cbdaba", 378 | "metadata": {}, 379 | "outputs": [ 380 | { 381 | "name": "stdout", 382 | "output_type": "stream", 383 | "text": [ 384 | "Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])\n", 385 | "Sum: tensor(1.0000)\n" 386 | ] 387 | } 388 | ], 389 | "source": [ 390 | "attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()\n", 391 | "\n", 392 | "print(\"Attention weights:\", attn_weights_2_tmp)\n", 393 | "print(\"Sum:\", attn_weights_2_tmp.sum())" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "id": "75dc0a57-f53e-41bf-8793-daa77a819431", 399 | "metadata": {}, 400 | "source": [ 401 | "- However, in practice, using the softmax function for normalization, which is better at handling extreme values and has more desirable gradient properties during training, is common and recommended.\n", 402 | "- Here's a naive implementation of a softmax function for scaling, which also normalizes the vector elements such that they sum up to 1:" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": 6, 408 | "id": "07b2e58d-a6ed-49f0-a1cd-2463e8d53a20", 409 | "metadata": {}, 410 | "outputs": [ 411 | { 412 | "name": "stdout", 413 | "output_type": "stream", 414 | "text": [ 415 | "Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])\n", 416 | "Sum: tensor(1.)\n" 417 | ] 418 | } 419 | ], 420 | "source": [ 421 | "def softmax_naive(x):\n", 422 | " return torch.exp(x) / torch.exp(x).sum(dim=0)\n", 423 | "\n", 424 | "attn_weights_2_naive = softmax_naive(attn_scores_2)\n", 425 | "\n", 426 | "print(\"Attention weights:\", attn_weights_2_naive)\n", 427 | "print(\"Sum:\", 
attn_weights_2_naive.sum())" 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "id": "f0a1cbbb-4744-41cb-8910-f5c1355555fb", 433 | "metadata": {}, 434 | "source": [ 435 | "- The naive implementation above can suffer from numerical instability issues for large or small input values due to overflow and underflow issues\n", 436 | "- Hence, in practice, it's recommended to use the PyTorch implementation of softmax instead, which has been highly optimized for performance:" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": 7, 442 | "id": "2d99cac4-45ea-46b3-b3c1-e000ad16e158", 443 | "metadata": {}, 444 | "outputs": [ 445 | { 446 | "name": "stdout", 447 | "output_type": "stream", 448 | "text": [ 449 | "Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])\n", 450 | "Sum: tensor(1.)\n" 451 | ] 452 | } 453 | ], 454 | "source": [ 455 | "attn_weights_2 = torch.softmax(attn_scores_2, dim=0)\n", 456 | "\n", 457 | "print(\"Attention weights:\", attn_weights_2)\n", 458 | "print(\"Sum:\", attn_weights_2.sum())" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "id": "e43e36c7-90b2-427f-94f6-bb9d31b2ab3f", 464 | "metadata": {}, 465 | "source": [ 466 | "- **Step 3**: compute the context vector $z^{(2)}$ by multiplying the embedded input tokens, $x^{(i)}$ with the attention weights and sum the resulting vectors:" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "id": "f1c9f5ac-8d3d-4847-94e3-fd783b7d4d3d", 472 | "metadata": {}, 473 | "source": [ 474 | "" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": 8, 480 | "id": "8fcb96f0-14e5-4973-a50e-79ea7c6af99f", 481 | "metadata": {}, 482 | "outputs": [ 483 | { 484 | "name": "stdout", 485 | "output_type": "stream", 486 | "text": [ 487 | "tensor([0.4419, 0.6515, 0.5683])\n" 488 | ] 489 | } 490 | ], 491 | "source": [ 492 | "query = inputs[1] # 2nd input token is the query\n", 493 | "\n", 494 | "context_vec_2 = torch.zeros(query.shape)\n", 495 | "for i,x_i in enumerate(inputs):\n", 496 | " context_vec_2 += attn_weights_2[i]*x_i\n", 497 | "\n", 498 | "print(context_vec_2)" 499 | ] 500 | }, 501 | { 502 | "cell_type": "markdown", 503 | "id": "5a454262-40eb-430e-9ca4-e43fb8d6cd89", 504 | "metadata": {}, 505 | "source": [ 506 | "### 3.3.2 Computing attention weights for all input tokens" 507 | ] 508 | }, 509 | { 510 | "cell_type": "markdown", 511 | "id": "6a02bb73-fc19-4c88-b155-8314de5d63a8", 512 | "metadata": {}, 513 | "source": [ 514 | "#### Generalize to all input sequence tokens:\n", 515 | "\n", 516 | "- Above, we computed the attention weights and context vector for input 2 (as illustrated in the highlighted row in the figure below)\n", 517 | "- Next, we are generalizing this computation to compute all attention weights and context vectors" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "id": "11c0fb55-394f-42f4-ba07-d01ae5c98ab4", 523 | "metadata": {}, 524 | "source": [ 525 | "\n", 526 | "\n", 527 | "- (Please note that the numbers in this figure are truncated to two\n", 528 | "digits after the decimal point to reduce visual clutter; the values in each row should add up to 1.0 or 100%; similarly, digits in other figures are truncated)" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "id": "b789b990-fb51-4beb-9212-bf58876b5983", 534 | "metadata": {}, 535 | "source": [ 536 | "- In self-attention, the process starts with the calculation of attention scores, which are subsequently normalized to derive attention weights that total 
1\n", 537 | "- These attention weights are then utilized to generate the context vectors through a weighted summation of the inputs" 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "id": "d9bffe4b-56fe-4c37-9762-24bd924b7d3c", 543 | "metadata": {}, 544 | "source": [ 545 | "" 546 | ] 547 | }, 548 | { 549 | "cell_type": "markdown", 550 | "id": "aa652506-f2c8-473c-a905-85c389c842cc", 551 | "metadata": {}, 552 | "source": [ 553 | "- Apply previous **step 1** to all pairwise elements to compute the unnormalized attention score matrix:" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": 9, 559 | "id": "04004be8-07a1-468b-ab33-32e16a551b45", 560 | "metadata": {}, 561 | "outputs": [ 562 | { 563 | "name": "stdout", 564 | "output_type": "stream", 565 | "text": [ 566 | "tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],\n", 567 | " [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],\n", 568 | " [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],\n", 569 | " [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],\n", 570 | " [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],\n", 571 | " [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])\n" 572 | ] 573 | } 574 | ], 575 | "source": [ 576 | "attn_scores = torch.empty(6, 6)\n", 577 | "\n", 578 | "for i, x_i in enumerate(inputs):\n", 579 | " for j, x_j in enumerate(inputs):\n", 580 | " attn_scores[i, j] = torch.dot(x_i, x_j)\n", 581 | "\n", 582 | "print(attn_scores)" 583 | ] 584 | }, 585 | { 586 | "cell_type": "markdown", 587 | "id": "1539187f-1ece-47b7-bc9b-65a97115f1d4", 588 | "metadata": {}, 589 | "source": [ 590 | "- We can achieve the same as above more efficiently via matrix multiplication:" 591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": 10, 596 | "id": "2cea69d0-9a47-45da-8d5a-47ceef2df673", 597 | "metadata": {}, 598 | "outputs": [ 599 | { 600 | "name": "stdout", 601 | "output_type": "stream", 602 | "text": [ 603 | "tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],\n", 604 | " [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],\n", 605 | " [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],\n", 606 | " [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],\n", 607 | " [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],\n", 608 | " [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])\n" 609 | ] 610 | } 611 | ], 612 | "source": [ 613 | "attn_scores = inputs @ inputs.T\n", 614 | "print(attn_scores)" 615 | ] 616 | }, 617 | { 618 | "cell_type": "markdown", 619 | "id": "02c4bac4-acfd-427f-9b11-c436ac71748d", 620 | "metadata": {}, 621 | "source": [ 622 | "- Similar to **step 2** previously, we normalize each row so that the values in each row sum to 1:" 623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": 11, 628 | "id": "fa4ef062-de81-47ee-8415-bfe1708c81b8", 629 | "metadata": {}, 630 | "outputs": [ 631 | { 632 | "name": "stdout", 633 | "output_type": "stream", 634 | "text": [ 635 | "tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],\n", 636 | " [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],\n", 637 | " [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],\n", 638 | " [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],\n", 639 | " [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],\n", 640 | " [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])\n" 641 | ] 642 | } 643 | ], 644 | "source": [ 645 | "attn_weights = torch.softmax(attn_scores, dim=-1)\n", 646 | "print(attn_weights)" 647 | ] 648 | }, 649 | { 650 | "cell_type": "markdown", 651 | "id": 
"3fa6d02b-7f15-4eb4-83a7-0b8a819e7a0c", 652 | "metadata": {}, 653 | "source": [ 654 | "- Quick verification that the values in each row indeed sum to 1:" 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "execution_count": 12, 660 | "id": "112b492c-fb6f-4e6d-8df5-518ae83363d5", 661 | "metadata": {}, 662 | "outputs": [ 663 | { 664 | "name": "stdout", 665 | "output_type": "stream", 666 | "text": [ 667 | "Row 2 sum: 1.0\n", 668 | "All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])\n" 669 | ] 670 | } 671 | ], 672 | "source": [ 673 | "row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])\n", 674 | "print(\"Row 2 sum:\", row_2_sum)\n", 675 | "\n", 676 | "print(\"All row sums:\", attn_weights.sum(dim=-1))" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "id": "138b0b5c-d813-44c7-b373-fde9540ddfd1", 682 | "metadata": {}, 683 | "source": [ 684 | "- Apply previous **step 3** to compute all context vectors:" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": 13, 690 | "id": "ba8eafcf-f7f7-4989-b8dc-61b50c4f81dc", 691 | "metadata": {}, 692 | "outputs": [ 693 | { 694 | "name": "stdout", 695 | "output_type": "stream", 696 | "text": [ 697 | "tensor([[0.4421, 0.5931, 0.5790],\n", 698 | " [0.4419, 0.6515, 0.5683],\n", 699 | " [0.4431, 0.6496, 0.5671],\n", 700 | " [0.4304, 0.6298, 0.5510],\n", 701 | " [0.4671, 0.5910, 0.5266],\n", 702 | " [0.4177, 0.6503, 0.5645]])\n" 703 | ] 704 | } 705 | ], 706 | "source": [ 707 | "all_context_vecs = attn_weights @ inputs\n", 708 | "print(all_context_vecs)" 709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "id": "25b245b8-7732-4fab-aa1c-e3d333195605", 714 | "metadata": {}, 715 | "source": [ 716 | "- As a sanity check, the previously computed context vector $z^{(2)} = [0.4419, 0.6515, 0.5683]$ can be found in the 2nd row in above: " 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": 14, 722 | "id": "2570eb7d-aee1-457a-a61e-7544478219fa", 723 | "metadata": {}, 724 | "outputs": [ 725 | { 726 | "name": "stdout", 727 | "output_type": "stream", 728 | "text": [ 729 | "Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])\n" 730 | ] 731 | } 732 | ], 733 | "source": [ 734 | "print(\"Previous 2nd context vector:\", context_vec_2)" 735 | ] 736 | }, 737 | { 738 | "cell_type": "markdown", 739 | "id": "a303b6fb-9f7e-42bb-9fdb-2adabf0a6525", 740 | "metadata": {}, 741 | "source": [ 742 | "## 3.4 Implementing self-attention with trainable weights" 743 | ] 744 | }, 745 | { 746 | "cell_type": "markdown", 747 | "id": "88363117-93d8-41fb-8240-f7cfe08b14a3", 748 | "metadata": {}, 749 | "source": [ 750 | "- A conceptual framework illustrating how the self-attention mechanism developed in this section integrates into the overall narrative and structure of this book and chapter" 751 | ] 752 | }, 753 | { 754 | "cell_type": "markdown", 755 | "id": "ac9492ba-6f66-4f65-bd1d-87cf16d59928", 756 | "metadata": {}, 757 | "source": [ 758 | "" 759 | ] 760 | }, 761 | { 762 | "cell_type": "markdown", 763 | "id": "2b90a77e-d746-4704-9354-1ddad86e6298", 764 | "metadata": {}, 765 | "source": [ 766 | "### 3.4.1 Computing the attention weights step by step" 767 | ] 768 | }, 769 | { 770 | "cell_type": "markdown", 771 | "id": "46e95a46-1f67-4b71-9e84-8e2db84ab036", 772 | "metadata": {}, 773 | "source": [ 774 | "- In this section, we are implementing the self-attention mechanism that is used in the original transformer architecture, the GPT models, and most other popular LLMs\n", 775 | "- This 
self-attention mechanism is also called \"scaled dot-product attention\"\n", 776 | "- The overall idea is similar to before:\n", 777 | " - We want to compute context vectors as weighted sums over the input vectors specific to a certain input element\n", 778 | " - For the above, we need attention weights\n", 779 | "- As you will see, there are only slight differences compared to the basic attention mechanism introduced earlier:\n", 780 | " - The most notable difference is the introduction of weight matrices that are updated during model training\n", 781 | " - These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce \"good\" context vectors" 782 | ] 783 | }, 784 | { 785 | "cell_type": "markdown", 786 | "id": "59db4093-93e8-4bee-be8f-c8fac8a08cdd", 787 | "metadata": {}, 788 | "source": [ 789 | "" 790 | ] 791 | }, 792 | { 793 | "cell_type": "markdown", 794 | "id": "4d996671-87aa-45c9-b2e0-07a7bcc9060a", 795 | "metadata": {}, 796 | "source": [ 797 | "- Implementing the self-attention mechanism step by step, we will start by introducing the three training weight matrices $W_q$, $W_k$, and $W_v$\n", 798 | "- These three matrices are used to project the embedded input tokens, $x^{(i)}$, into query, key, and value vectors via matrix multiplication:\n", 799 | "\n", 800 | " - Query vector: $q^{(i)} = W_q \\,x^{(i)}$\n", 801 | " - Key vector: $k^{(i)} = W_k \\,x^{(i)}$\n", 802 | " - Value vector: $v^{(i)} = W_v \\,x^{(i)}$\n" 803 | ] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "id": "9f334313-5fd0-477b-8728-04080a427049", 808 | "metadata": {}, 809 | "source": [ 810 | "- The embedding dimensions of the input $x$ and the query vector $q$ can be the same or different, depending on the model's design and specific implementation\n", 811 | "- In GPT models, the input and output dimensions are usually the same, but for illustration purposes, to better follow the computation, we choose different input and output dimensions here:" 812 | ] 813 | }, 814 | { 815 | "cell_type": "code", 816 | "execution_count": 15, 817 | "id": "8250fdc6-6cd6-4c5b-b9c0-8c643aadb7db", 818 | "metadata": {}, 819 | "outputs": [], 820 | "source": [ 821 | "x_2 = inputs[1] # second input element\n", 822 | "d_in = inputs.shape[1] # the input embedding size, d=3\n", 823 | "d_out = 2 # the output embedding size, d=2" 824 | ] 825 | }, 826 | { 827 | "cell_type": "markdown", 828 | "id": "f528cfb3-e226-47dd-b363-cc2caaeba4bf", 829 | "metadata": {}, 830 | "source": [ 831 | "- Below, we initialize the three weight matrices; note that we are setting `requires_grad=False` to reduce clutter in the outputs for illustration purposes, but if we were to use the weight matrices for model training, we would set `requires_grad=True` to update these matrices during model training" 832 | ] 833 | }, 834 | { 835 | "cell_type": "code", 836 | "execution_count": 16, 837 | "id": "bfd7259a-f26c-4cea-b8fc-282b5cae1e00", 838 | "metadata": {}, 839 | "outputs": [], 840 | "source": [ 841 | "torch.manual_seed(123)\n", 842 | "\n", 843 | "W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)\n", 844 | "W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)\n", 845 | "W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)" 846 | ] 847 | }, 848 | { 849 | "cell_type": "markdown", 850 | "id": "abfd0b50-7701-4adb-821c-e5433622d9c4", 851 | "metadata": {}, 852 | "source": [ 853 | "- Next we compute the query, key, and value vectors:" 
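- (Clarifying note added here, not part of the original text: the projection formulas above follow the column-vector convention, $q^{(i)} = W_q\,x^{(i)}$. In the code below, each input $x^{(i)}$ is stored as a row vector, so the equivalent computation is the row-vector–matrix product sketched here.)

$$q^{(i)} = x^{(i)}\,W_q, \qquad k^{(i)} = x^{(i)}\,W_k, \qquad v^{(i)} = x^{(i)}\,W_v, \qquad W_q,\,W_k,\,W_v \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$$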
854 | ] 855 | }, 856 | { 857 | "cell_type": "code", 858 | "execution_count": 17, 859 | "id": "73cedd62-01e1-4196-a575-baecc6095601", 860 | "metadata": {}, 861 | "outputs": [ 862 | { 863 | "name": "stdout", 864 | "output_type": "stream", 865 | "text": [ 866 | "tensor([0.4306, 1.4551])\n" 867 | ] 868 | } 869 | ], 870 | "source": [ 871 | "query_2 = x_2 @ W_query # _2 because it's with respect to the 2nd input element\n", 872 | "key_2 = x_2 @ W_key \n", 873 | "value_2 = x_2 @ W_value\n", 874 | "\n", 875 | "print(query_2)" 876 | ] 877 | }, 878 | { 879 | "cell_type": "markdown", 880 | "id": "9be308b3-aca3-421b-b182-19c3a03b71c7", 881 | "metadata": {}, 882 | "source": [ 883 | "- As we can see below, we successfully projected the 6 input tokens from a 3D onto a 2D embedding space:" 884 | ] 885 | }, 886 | { 887 | "cell_type": "code", 888 | "execution_count": 18, 889 | "id": "8c1c3949-fc08-4d19-a41e-1c235b4e631b", 890 | "metadata": {}, 891 | "outputs": [ 892 | { 893 | "name": "stdout", 894 | "output_type": "stream", 895 | "text": [ 896 | "keys.shape: torch.Size([6, 2])\n", 897 | "values.shape: torch.Size([6, 2])\n" 898 | ] 899 | } 900 | ], 901 | "source": [ 902 | "keys = inputs @ W_key \n", 903 | "values = inputs @ W_value\n", 904 | "\n", 905 | "print(\"keys.shape:\", keys.shape)\n", 906 | "print(\"values.shape:\", values.shape)" 907 | ] 908 | }, 909 | { 910 | "cell_type": "markdown", 911 | "id": "bac5dfd6-ade8-4e7b-b0c1-bed40aa24481", 912 | "metadata": {}, 913 | "source": [ 914 | "- In the next step, **step 2**, we compute the unnormalized attention scores by computing the dot product between the query and each key vector:" 915 | ] 916 | }, 917 | { 918 | "cell_type": "markdown", 919 | "id": "8ed0a2b7-5c50-4ede-90cf-7ad74412b3aa", 920 | "metadata": {}, 921 | "source": [ 922 | "" 923 | ] 924 | }, 925 | { 926 | "cell_type": "code", 927 | "execution_count": 19, 928 | "id": "64cbc253-a182-4490-a765-246979ea0a28", 929 | "metadata": {}, 930 | "outputs": [ 931 | { 932 | "name": "stdout", 933 | "output_type": "stream", 934 | "text": [ 935 | "tensor(1.8524)\n" 936 | ] 937 | } 938 | ], 939 | "source": [ 940 | "keys_2 = keys[1] # Python starts index at 0\n", 941 | "attn_score_22 = query_2.dot(keys_2)\n", 942 | "print(attn_score_22)" 943 | ] 944 | }, 945 | { 946 | "cell_type": "markdown", 947 | "id": "9e9d15c0-c24e-4e6f-a160-6349b418f935", 948 | "metadata": {}, 949 | "source": [ 950 | "- Since we have 6 inputs, we have 6 attention scores for the given query vector:" 951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": 20, 956 | "id": "b14e44b5-d170-40f9-8847-8990804af26d", 957 | "metadata": {}, 958 | "outputs": [ 959 | { 960 | "name": "stdout", 961 | "output_type": "stream", 962 | "text": [ 963 | "tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])\n" 964 | ] 965 | } 966 | ], 967 | "source": [ 968 | "attn_scores_2 = query_2 @ keys.T # All attention scores for given query\n", 969 | "print(attn_scores_2)" 970 | ] 971 | }, 972 | { 973 | "cell_type": "markdown", 974 | "id": "8622cf39-155f-4eb5-a0c0-82a03ce9b999", 975 | "metadata": {}, 976 | "source": [ 977 | "" 978 | ] 979 | }, 980 | { 981 | "cell_type": "markdown", 982 | "id": "e1609edb-f089-461a-8de2-c20c1bb29836", 983 | "metadata": {}, 984 | "source": [ 985 | "- Next, in **step 3**, we compute the attention weights (normalized attention scores that sum up to 1) using the softmax function we used earlier\n", 986 | "- The difference to earlier is that we now scale the attention scores by dividing them by the square root of the 
embedding dimension, $\\sqrt{d_k}$ (i.e., `d_k**0.5`):" 987 | ] 988 | }, 989 | { 990 | "cell_type": "code", 991 | "execution_count": 21, 992 | "id": "146f5587-c845-4e30-9894-c7ed3a248153", 993 | "metadata": {}, 994 | "outputs": [ 995 | { 996 | "name": "stdout", 997 | "output_type": "stream", 998 | "text": [ 999 | "tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])\n" 1000 | ] 1001 | } 1002 | ], 1003 | "source": [ 1004 | "d_k = keys.shape[1]\n", 1005 | "attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)\n", 1006 | "print(attn_weights_2)" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "markdown", 1011 | "id": "b8f61a28-b103-434a-aee1-ae7cbd821126", 1012 | "metadata": {}, 1013 | "source": [ 1014 | "" 1015 | ] 1016 | }, 1017 | { 1018 | "cell_type": "markdown", 1019 | "id": "1890e3f9-db86-4ab8-9f3b-53113504a61f", 1020 | "metadata": {}, 1021 | "source": [ 1022 | "- In **step 4**, we now compute the context vector for input query vector 2:" 1023 | ] 1024 | }, 1025 | { 1026 | "cell_type": "code", 1027 | "execution_count": 22, 1028 | "id": "e138f033-fa7e-4e3a-8764-b53a96b26397", 1029 | "metadata": {}, 1030 | "outputs": [ 1031 | { 1032 | "name": "stdout", 1033 | "output_type": "stream", 1034 | "text": [ 1035 | "tensor([0.3061, 0.8210])\n" 1036 | ] 1037 | } 1038 | ], 1039 | "source": [ 1040 | "context_vec_2 = attn_weights_2 @ values\n", 1041 | "print(context_vec_2)" 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "markdown", 1046 | "id": "9d7b2907-e448-473e-b46c-77735a7281d8", 1047 | "metadata": {}, 1048 | "source": [ 1049 | "### 3.4.2 Implementing a compact SelfAttention class" 1050 | ] 1051 | }, 1052 | { 1053 | "cell_type": "markdown", 1054 | "id": "04313410-3155-4d90-a7a3-2f3386e73677", 1055 | "metadata": {}, 1056 | "source": [ 1057 | "- Putting it all together, we can implement the self-attention mechanism as follows:" 1058 | ] 1059 | }, 1060 | { 1061 | "cell_type": "code", 1062 | "execution_count": 23, 1063 | "id": "51590326-cdbe-4e62-93b1-17df71c11ee4", 1064 | "metadata": {}, 1065 | "outputs": [ 1066 | { 1067 | "name": "stdout", 1068 | "output_type": "stream", 1069 | "text": [ 1070 | "tensor([[0.2996, 0.8053],\n", 1071 | " [0.3061, 0.8210],\n", 1072 | " [0.3058, 0.8203],\n", 1073 | " [0.2948, 0.7939],\n", 1074 | " [0.2927, 0.7891],\n", 1075 | " [0.2990, 0.8040]], grad_fn=)\n" 1076 | ] 1077 | } 1078 | ], 1079 | "source": [ 1080 | "import torch.nn as nn\n", 1081 | "\n", 1082 | "class SelfAttention_v1(nn.Module):\n", 1083 | "\n", 1084 | " def __init__(self, d_in, d_out):\n", 1085 | " super().__init__()\n", 1086 | " self.W_query = nn.Parameter(torch.rand(d_in, d_out))\n", 1087 | " self.W_key = nn.Parameter(torch.rand(d_in, d_out))\n", 1088 | " self.W_value = nn.Parameter(torch.rand(d_in, d_out))\n", 1089 | "\n", 1090 | " def forward(self, x):\n", 1091 | " keys = x @ self.W_key\n", 1092 | " queries = x @ self.W_query\n", 1093 | " values = x @ self.W_value\n", 1094 | " \n", 1095 | " attn_scores = queries @ keys.T # omega\n", 1096 | " attn_weights = torch.softmax(\n", 1097 | " attn_scores / keys.shape[-1]**0.5, dim=-1\n", 1098 | " )\n", 1099 | "\n", 1100 | " context_vec = attn_weights @ values\n", 1101 | " return context_vec\n", 1102 | "\n", 1103 | "torch.manual_seed(123)\n", 1104 | "sa_v1 = SelfAttention_v1(d_in, d_out)\n", 1105 | "print(sa_v1(inputs))" 1106 | ] 1107 | }, 1108 | { 1109 | "cell_type": "markdown", 1110 | "id": "7ee1a024-84a5-425a-9567-54ab4e4ed445", 1111 | "metadata": {}, 1112 | "source": [ 1113 | "" 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "markdown", 1118 
| "id": "048e0c16-d911-4ec8-b0bc-45ceec75c081", 1119 | "metadata": {}, 1120 | "source": [ 1121 | "- We can streamline the implementation above using PyTorch's Linear layers, which are equivalent to a matrix multiplication if we disable the bias units\n", 1122 | "- Another big advantage of using `nn.Linear` over our manual `nn.Parameter(torch.rand(...)` approach is that `nn.Linear` has a preferred weight initialization scheme, which leads to more stable model training" 1123 | ] 1124 | }, 1125 | { 1126 | "cell_type": "code", 1127 | "execution_count": 24, 1128 | "id": "73f411e3-e231-464a-89fe-0a9035e5f839", 1129 | "metadata": {}, 1130 | "outputs": [ 1131 | { 1132 | "name": "stdout", 1133 | "output_type": "stream", 1134 | "text": [ 1135 | "tensor([[-0.0739, 0.0713],\n", 1136 | " [-0.0748, 0.0703],\n", 1137 | " [-0.0749, 0.0702],\n", 1138 | " [-0.0760, 0.0685],\n", 1139 | " [-0.0763, 0.0679],\n", 1140 | " [-0.0754, 0.0693]], grad_fn=)\n" 1141 | ] 1142 | } 1143 | ], 1144 | "source": [ 1145 | "class SelfAttention_v2(nn.Module):\n", 1146 | "\n", 1147 | " def __init__(self, d_in, d_out, qkv_bias=False):\n", 1148 | " super().__init__()\n", 1149 | " self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n", 1150 | " self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)\n", 1151 | " self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n", 1152 | "\n", 1153 | " def forward(self, x):\n", 1154 | " keys = self.W_key(x)\n", 1155 | " queries = self.W_query(x)\n", 1156 | " values = self.W_value(x)\n", 1157 | " \n", 1158 | " attn_scores = queries @ keys.T\n", 1159 | " attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n", 1160 | "\n", 1161 | " context_vec = attn_weights @ values\n", 1162 | " return context_vec\n", 1163 | "\n", 1164 | "torch.manual_seed(789)\n", 1165 | "sa_v2 = SelfAttention_v2(d_in, d_out)\n", 1166 | "print(sa_v2(inputs))" 1167 | ] 1168 | }, 1169 | { 1170 | "cell_type": "markdown", 1171 | "id": "915cd8a5-a895-42c9-8b8e-06b5ae19ffce", 1172 | "metadata": {}, 1173 | "source": [ 1174 | "- Note that `SelfAttention_v1` and `SelfAttention_v2` give different outputs because they use different initial weights for the weight matrices" 1175 | ] 1176 | }, 1177 | { 1178 | "cell_type": "markdown", 1179 | "id": "c5025b37-0f2c-4a67-a7cb-1286af7026ab", 1180 | "metadata": {}, 1181 | "source": [ 1182 | "## 3.5 Hiding future words with causal attention" 1183 | ] 1184 | }, 1185 | { 1186 | "cell_type": "markdown", 1187 | "id": "aef0a6b8-205a-45bf-9d26-8fd77a8a03c3", 1188 | "metadata": {}, 1189 | "source": [ 1190 | "- In causal attention, the attention weights above the diagonal are masked, ensuring that for any given input, the LLM is unable to utilize future tokens while calculating the context vectors with the attention weight" 1191 | ] 1192 | }, 1193 | { 1194 | "cell_type": "markdown", 1195 | "id": "71e91bb5-5aae-4f05-8a95-973b3f988a35", 1196 | "metadata": {}, 1197 | "source": [ 1198 | "" 1199 | ] 1200 | }, 1201 | { 1202 | "cell_type": "markdown", 1203 | "id": "82f405de-cd86-4e72-8f3c-9ea0354946ba", 1204 | "metadata": {}, 1205 | "source": [ 1206 | "### 3.5.1 Applying a causal attention mask" 1207 | ] 1208 | }, 1209 | { 1210 | "cell_type": "markdown", 1211 | "id": "014f28d0-8218-48e4-8b9c-bdc5ce489218", 1212 | "metadata": {}, 1213 | "source": [ 1214 | "- In this section, we are converting the previous self-attention mechanism into a causal self-attention mechanism\n", 1215 | "- Causal self-attention ensures that the model's prediction for a certain position in a sequence is only dependent on the 
known outputs at previous positions, not on future positions\n", 1216 | "- In simpler words, this ensures that each next word prediction should only depend on the preceding words\n", 1217 | "- To achieve this, for each given token, we mask out the future tokens (the ones that come after the current token in the input text):" 1218 | ] 1219 | }, 1220 | { 1221 | "cell_type": "markdown", 1222 | "id": "57f99af3-32bc-48f5-8eb4-63504670ca0a", 1223 | "metadata": {}, 1224 | "source": [ 1225 | "" 1226 | ] 1227 | }, 1228 | { 1229 | "cell_type": "markdown", 1230 | "id": "cbfaec7a-68f2-4157-a4b5-2aeceed199d9", 1231 | "metadata": {}, 1232 | "source": [ 1233 | "- To illustrate and implement causal self-attention, let's work with the attention scores and weights from the previous section: " 1234 | ] 1235 | }, 1236 | { 1237 | "cell_type": "code", 1238 | "execution_count": 25, 1239 | "id": "1933940d-0fa5-4b17-a3ce-388e5314a1bb", 1240 | "metadata": {}, 1241 | "outputs": [ 1242 | { 1243 | "name": "stdout", 1244 | "output_type": "stream", 1245 | "text": [ 1246 | "tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],\n", 1247 | " [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],\n", 1248 | " [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],\n", 1249 | " [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],\n", 1250 | " [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],\n", 1251 | " [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],\n", 1252 | " grad_fn=)\n" 1253 | ] 1254 | } 1255 | ], 1256 | "source": [ 1257 | "# Reuse the query and key weight matrices of the\n", 1258 | "# SelfAttention_v2 object from the previous section for convenience\n", 1259 | "queries = sa_v2.W_query(inputs)\n", 1260 | "keys = sa_v2.W_key(inputs) \n", 1261 | "attn_scores = queries @ keys.T\n", 1262 | "\n", 1263 | "attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n", 1264 | "print(attn_weights)" 1265 | ] 1266 | }, 1267 | { 1268 | "cell_type": "markdown", 1269 | "id": "89020a96-b34d-41f8-9349-98c3e23fd5d6", 1270 | "metadata": {}, 1271 | "source": [ 1272 | "- The simplest way to mask out future attention weights is by creating a mask via PyTorch's tril function with elements below the main diagonal (including the diagonal itself) set to 1 and above the main diagonal set to 0:" 1273 | ] 1274 | }, 1275 | { 1276 | "cell_type": "code", 1277 | "execution_count": 26, 1278 | "id": "43f3d2e3-185b-4184-9f98-edde5e6df746", 1279 | "metadata": {}, 1280 | "outputs": [ 1281 | { 1282 | "name": "stdout", 1283 | "output_type": "stream", 1284 | "text": [ 1285 | "tensor([[1., 0., 0., 0., 0., 0.],\n", 1286 | " [1., 1., 0., 0., 0., 0.],\n", 1287 | " [1., 1., 1., 0., 0., 0.],\n", 1288 | " [1., 1., 1., 1., 0., 0.],\n", 1289 | " [1., 1., 1., 1., 1., 0.],\n", 1290 | " [1., 1., 1., 1., 1., 1.]])\n" 1291 | ] 1292 | } 1293 | ], 1294 | "source": [ 1295 | "context_length = attn_scores.shape[0]\n", 1296 | "mask_simple = torch.tril(torch.ones(context_length, context_length))\n", 1297 | "print(mask_simple)" 1298 | ] 1299 | }, 1300 | { 1301 | "cell_type": "markdown", 1302 | "id": "efce2b08-3583-44da-b3fc-cabdd38761f6", 1303 | "metadata": {}, 1304 | "source": [ 1305 | "- Then, we can multiply the attention weights with this mask to zero out the attention scores above the diagonal:" 1306 | ] 1307 | }, 1308 | { 1309 | "cell_type": "code", 1310 | "execution_count": 27, 1311 | "id": "9f531e2e-f4d2-4fea-a87f-4c132e48b9e7", 1312 | "metadata": {}, 1313 | "outputs": [ 1314 | { 1315 | "name": "stdout", 1316 | "output_type": "stream", 1317 | "text": [ 1318 | 
"tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n", 1319 | " [0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],\n", 1320 | " [0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],\n", 1321 | " [0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],\n", 1322 | " [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],\n", 1323 | " [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],\n", 1324 | " grad_fn=)\n" 1325 | ] 1326 | } 1327 | ], 1328 | "source": [ 1329 | "masked_simple = attn_weights*mask_simple\n", 1330 | "print(masked_simple)" 1331 | ] 1332 | }, 1333 | { 1334 | "cell_type": "markdown", 1335 | "id": "3eb35787-cf12-4024-b66d-e7215e175500", 1336 | "metadata": {}, 1337 | "source": [ 1338 | "- However, if the mask were applied after softmax, like above, it would disrupt the probability distribution created by softmax\n", 1339 | "- Softmax ensures that all output values sum to 1\n", 1340 | "- Masking after softmax would require re-normalizing the outputs to sum to 1 again, which complicates the process and might lead to unintended effects" 1341 | ] 1342 | }, 1343 | { 1344 | "cell_type": "markdown", 1345 | "id": "94db92d7-c397-4e42-bd8a-6a2b3e237e0f", 1346 | "metadata": {}, 1347 | "source": [ 1348 | "- To make sure that the rows sum to 1, we can normalize the attention weights as follows:" 1349 | ] 1350 | }, 1351 | { 1352 | "cell_type": "code", 1353 | "execution_count": 28, 1354 | "id": "6d392083-fd81-4f70-9bdf-8db985e673d6", 1355 | "metadata": {}, 1356 | "outputs": [ 1357 | { 1358 | "name": "stdout", 1359 | "output_type": "stream", 1360 | "text": [ 1361 | "tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n", 1362 | " [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],\n", 1363 | " [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],\n", 1364 | " [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],\n", 1365 | " [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],\n", 1366 | " [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],\n", 1367 | " grad_fn=)\n" 1368 | ] 1369 | } 1370 | ], 1371 | "source": [ 1372 | "row_sums = masked_simple.sum(dim=-1, keepdim=True)\n", 1373 | "masked_simple_norm = masked_simple / row_sums\n", 1374 | "print(masked_simple_norm)" 1375 | ] 1376 | }, 1377 | { 1378 | "cell_type": "markdown", 1379 | "id": "512e7cf4-dc0e-4cec-948e-c7a3c4eb6877", 1380 | "metadata": {}, 1381 | "source": [ 1382 | "- While we are technically done with coding the causal attention mechanism now, let's briefly look at a more efficient approach to achieve the same as above\n", 1383 | "- So, instead of zeroing out attention weights above the diagonal and renormalizing the results, we can mask the unnormalized attention scores above the diagonal with negative infinity before they enter the softmax function:" 1384 | ] 1385 | }, 1386 | { 1387 | "cell_type": "markdown", 1388 | "id": "eb682900-8df2-4767-946c-a82bee260188", 1389 | "metadata": {}, 1390 | "source": [ 1391 | "" 1392 | ] 1393 | }, 1394 | { 1395 | "cell_type": "code", 1396 | "execution_count": 29, 1397 | "id": "a2be2f43-9cf0-44f6-8d8b-68ef2fb3cc39", 1398 | "metadata": {}, 1399 | "outputs": [ 1400 | { 1401 | "name": "stdout", 1402 | "output_type": "stream", 1403 | "text": [ 1404 | "tensor([[0.2899, -inf, -inf, -inf, -inf, -inf],\n", 1405 | " [0.4656, 0.1723, -inf, -inf, -inf, -inf],\n", 1406 | " [0.4594, 0.1703, 0.1731, -inf, -inf, -inf],\n", 1407 | " [0.2642, 0.1024, 0.1036, 0.0186, -inf, -inf],\n", 1408 | " [0.2183, 0.0874, 0.0882, 0.0177, 0.0786, -inf],\n", 1409 | " [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],\n", 1410 | " grad_fn=)\n" 1411 | ] 
1412 | } 1413 | ], 1414 | "source": [ 1415 | "mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)\n", 1416 | "masked = attn_scores.masked_fill(mask.bool(), -torch.inf)\n", 1417 | "print(masked)" 1418 | ] 1419 | }, 1420 | { 1421 | "cell_type": "markdown", 1422 | "id": "91d5f803-d735-4543-b9da-00ac10fb9c50", 1423 | "metadata": {}, 1424 | "source": [ 1425 | "- As we can see below, now the attention weights in each row correctly sum to 1 again:" 1426 | ] 1427 | }, 1428 | { 1429 | "cell_type": "code", 1430 | "execution_count": 30, 1431 | "id": "b1cd6d7f-16f2-43c1-915e-0824f1a4bc52", 1432 | "metadata": {}, 1433 | "outputs": [ 1434 | { 1435 | "name": "stdout", 1436 | "output_type": "stream", 1437 | "text": [ 1438 | "tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n", 1439 | " [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],\n", 1440 | " [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],\n", 1441 | " [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],\n", 1442 | " [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],\n", 1443 | " [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],\n", 1444 | " grad_fn=)\n" 1445 | ] 1446 | } 1447 | ], 1448 | "source": [ 1449 | "attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)\n", 1450 | "print(attn_weights)" 1451 | ] 1452 | }, 1453 | { 1454 | "cell_type": "markdown", 1455 | "id": "7636fc5f-6bc6-461e-ac6a-99ec8e3c0912", 1456 | "metadata": {}, 1457 | "source": [ 1458 | "### 3.5.2 Masking additional attention weights with dropout" 1459 | ] 1460 | }, 1461 | { 1462 | "cell_type": "markdown", 1463 | "id": "ec3dc7ee-6539-4fab-804a-8f31a890c85a", 1464 | "metadata": {}, 1465 | "source": [ 1466 | "- In addition, we also apply dropout to reduce overfitting during training\n", 1467 | "- Dropout can be applied in several places:\n", 1468 | " - for example, after computing the attention weights;\n", 1469 | " - or after multiplying the attention weights with the value vectors\n", 1470 | "- Here, we will apply the dropout mask after computing the attention weights because it's more common\n", 1471 | "\n", 1472 | "- Furthermore, in this specific example, we use a dropout rate of 50%, which means randomly masking out half of the attention weights. 
(When we train the GPT model later, we will use a lower dropout rate, such as 0.1 or 0.2" 1473 | ] 1474 | }, 1475 | { 1476 | "cell_type": "markdown", 1477 | "id": "ee799cf6-6175-45f2-827e-c174afedb722", 1478 | "metadata": {}, 1479 | "source": [ 1480 | "" 1481 | ] 1482 | }, 1483 | { 1484 | "cell_type": "markdown", 1485 | "id": "5a575458-a6da-4e54-8688-83e155f2de06", 1486 | "metadata": {}, 1487 | "source": [ 1488 | "- If we apply a dropout rate of 0.5 (50%), the non-dropped values will be scaled accordingly by a factor of 1/0.5 = 2\n", 1489 | "- The scaling is calculated by the formula 1 / (1 - `dropout_rate`)" 1490 | ] 1491 | }, 1492 | { 1493 | "cell_type": "code", 1494 | "execution_count": 31, 1495 | "id": "0de578db-8289-41d6-b377-ef645751e33f", 1496 | "metadata": {}, 1497 | "outputs": [ 1498 | { 1499 | "name": "stdout", 1500 | "output_type": "stream", 1501 | "text": [ 1502 | "tensor([[2., 2., 0., 2., 2., 0.],\n", 1503 | " [0., 0., 0., 2., 0., 2.],\n", 1504 | " [2., 2., 2., 2., 0., 2.],\n", 1505 | " [0., 2., 2., 0., 0., 2.],\n", 1506 | " [0., 2., 0., 2., 0., 2.],\n", 1507 | " [0., 2., 2., 2., 2., 0.]])\n" 1508 | ] 1509 | } 1510 | ], 1511 | "source": [ 1512 | "torch.manual_seed(123)\n", 1513 | "dropout = torch.nn.Dropout(0.5) # dropout rate of 50%\n", 1514 | "example = torch.ones(6, 6) # create a matrix of ones\n", 1515 | "\n", 1516 | "print(dropout(example))" 1517 | ] 1518 | }, 1519 | { 1520 | "cell_type": "code", 1521 | "execution_count": 32, 1522 | "id": "b16c5edb-942b-458c-8e95-25e4e355381e", 1523 | "metadata": {}, 1524 | "outputs": [ 1525 | { 1526 | "name": "stdout", 1527 | "output_type": "stream", 1528 | "text": [ 1529 | "tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n", 1530 | " [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n", 1531 | " [0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],\n", 1532 | " [0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],\n", 1533 | " [0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],\n", 1534 | " [0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],\n", 1535 | " grad_fn=)\n" 1536 | ] 1537 | } 1538 | ], 1539 | "source": [ 1540 | "torch.manual_seed(123)\n", 1541 | "print(dropout(attn_weights))" 1542 | ] 1543 | }, 1544 | { 1545 | "cell_type": "markdown", 1546 | "id": "269df5c8-3e25-49d0-95d3-bb232287404f", 1547 | "metadata": {}, 1548 | "source": [ 1549 | "- Note that the resulting dropout outputs may look different depending on your operating system; you can read more about this inconsistency [here on the PyTorch issue tracker](https://github.com/pytorch/pytorch/issues/121595)" 1550 | ] 1551 | }, 1552 | { 1553 | "cell_type": "markdown", 1554 | "id": "cdc14639-5f0f-4840-aa9d-8eb36ea90fb7", 1555 | "metadata": {}, 1556 | "source": [ 1557 | "### 3.5.3 Implementing a compact causal self-attention class" 1558 | ] 1559 | }, 1560 | { 1561 | "cell_type": "markdown", 1562 | "id": "09c41d29-1933-43dc-ada6-2dbb56287204", 1563 | "metadata": {}, 1564 | "source": [ 1565 | "- Now, we are ready to implement a working implementation of self-attention, including the causal and dropout masks\n", 1566 | "- One more thing is to implement the code to handle batches consisting of more than one input so that our `CausalAttention` class supports the batch outputs produced by the data loader we implemented in chapter 2\n", 1567 | "- For simplicity, to simulate such batch input, we duplicate the input text example:" 1568 | ] 1569 | }, 1570 | { 1571 | "cell_type": "code", 1572 | "execution_count": 33, 1573 | "id": "977a5fa7-a9d5-4e2e-8a32-8e0331ccfe28", 1574 | "metadata": {}, 
1575 | "outputs": [ 1576 | { 1577 | "name": "stdout", 1578 | "output_type": "stream", 1579 | "text": [ 1580 | "torch.Size([2, 6, 3])\n" 1581 | ] 1582 | } 1583 | ], 1584 | "source": [ 1585 | "batch = torch.stack((inputs, inputs), dim=0)\n", 1586 | "print(batch.shape) # 2 inputs with 6 tokens each, and each token has embedding dimension 3" 1587 | ] 1588 | }, 1589 | { 1590 | "cell_type": "code", 1591 | "execution_count": 34, 1592 | "id": "60d8c2eb-2d8e-4d2c-99bc-9eef8cc53ca0", 1593 | "metadata": {}, 1594 | "outputs": [ 1595 | { 1596 | "name": "stdout", 1597 | "output_type": "stream", 1598 | "text": [ 1599 | "tensor([[[-0.4519, 0.2216],\n", 1600 | " [-0.5874, 0.0058],\n", 1601 | " [-0.6300, -0.0632],\n", 1602 | " [-0.5675, -0.0843],\n", 1603 | " [-0.5526, -0.0981],\n", 1604 | " [-0.5299, -0.1081]],\n", 1605 | "\n", 1606 | " [[-0.4519, 0.2216],\n", 1607 | " [-0.5874, 0.0058],\n", 1608 | " [-0.6300, -0.0632],\n", 1609 | " [-0.5675, -0.0843],\n", 1610 | " [-0.5526, -0.0981],\n", 1611 | " [-0.5299, -0.1081]]], grad_fn=)\n", 1612 | "context_vecs.shape: torch.Size([2, 6, 2])\n" 1613 | ] 1614 | } 1615 | ], 1616 | "source": [ 1617 | "class CausalAttention(nn.Module):\n", 1618 | "\n", 1619 | " def __init__(self, d_in, d_out, context_length,\n", 1620 | " dropout, qkv_bias=False):\n", 1621 | " super().__init__()\n", 1622 | " self.d_out = d_out\n", 1623 | " self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n", 1624 | " self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)\n", 1625 | " self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n", 1626 | " self.dropout = nn.Dropout(dropout) # New\n", 1627 | " self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New\n", 1628 | "\n", 1629 | " def forward(self, x):\n", 1630 | " b, num_tokens, d_in = x.shape # New batch dimension b\n", 1631 | " keys = self.W_key(x)\n", 1632 | " queries = self.W_query(x)\n", 1633 | " values = self.W_value(x)\n", 1634 | "\n", 1635 | " attn_scores = queries @ keys.transpose(1, 2) # Changed transpose\n", 1636 | " attn_scores.masked_fill_( # New, _ ops are in-place\n", 1637 | " self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size\n", 1638 | " attn_weights = torch.softmax(\n", 1639 | " attn_scores / keys.shape[-1]**0.5, dim=-1\n", 1640 | " )\n", 1641 | " attn_weights = self.dropout(attn_weights) # New\n", 1642 | "\n", 1643 | " context_vec = attn_weights @ values\n", 1644 | " return context_vec\n", 1645 | "\n", 1646 | "torch.manual_seed(123)\n", 1647 | "\n", 1648 | "context_length = batch.shape[1]\n", 1649 | "ca = CausalAttention(d_in, d_out, context_length, 0.0)\n", 1650 | "\n", 1651 | "context_vecs = ca(batch)\n", 1652 | "\n", 1653 | "print(context_vecs)\n", 1654 | "print(\"context_vecs.shape:\", context_vecs.shape)" 1655 | ] 1656 | }, 1657 | { 1658 | "cell_type": "markdown", 1659 | "id": "c4333d12-17e4-4bb5-9d83-54b3a32618cd", 1660 | "metadata": {}, 1661 | "source": [ 1662 | "- Note that dropout is only applied during training, not during inference" 1663 | ] 1664 | }, 1665 | { 1666 | "cell_type": "markdown", 1667 | "id": "a554cf47-558c-4f45-84cd-bf9b839a8d50", 1668 | "metadata": {}, 1669 | "source": [ 1670 | "" 1671 | ] 1672 | }, 1673 | { 1674 | "cell_type": "markdown", 1675 | "id": "c8bef90f-cfd4-4289-b0e8-6a00dc9be44c", 1676 | "metadata": {}, 1677 | "source": [ 1678 | "## 3.6 Extending single-head attention to multi-head attention" 1679 | ] 1680 | }, 1681 | { 1682 | 
"cell_type": "markdown", 1683 | "id": "11697757-9198-4a1c-9cee-f450d8bbd3b9", 1684 | "metadata": {}, 1685 | "source": [ 1686 | "### 3.6.1 Stacking multiple single-head attention layers" 1687 | ] 1688 | }, 1689 | { 1690 | "cell_type": "markdown", 1691 | "id": "70766faf-cd53-41d9-8a17-f1b229756a5a", 1692 | "metadata": {}, 1693 | "source": [ 1694 | "- Below is a summary of the self-attention implemented previously (causal and dropout masks not shown for simplicity)\n", 1695 | "\n", 1696 | "- This is also called single-head attention:\n", 1697 | "\n", 1698 | "\n", 1699 | "\n", 1700 | "- We simply stack multiple single-head attention modules to obtain a multi-head attention module:\n", 1701 | "\n", 1702 | "\n", 1703 | "\n", 1704 | "- The main idea behind multi-head attention is to run the attention mechanism multiple times (in parallel) with different, learned linear projections. This allows the model to jointly attend to information from different representation subspaces at different positions." 1705 | ] 1706 | }, 1707 | { 1708 | "cell_type": "code", 1709 | "execution_count": 35, 1710 | "id": "b9a66e11-7105-4bb4-be84-041f1a1f3bd2", 1711 | "metadata": {}, 1712 | "outputs": [ 1713 | { 1714 | "name": "stdout", 1715 | "output_type": "stream", 1716 | "text": [ 1717 | "tensor([[[-0.4519, 0.2216, 0.4772, 0.1063],\n", 1718 | " [-0.5874, 0.0058, 0.5891, 0.3257],\n", 1719 | " [-0.6300, -0.0632, 0.6202, 0.3860],\n", 1720 | " [-0.5675, -0.0843, 0.5478, 0.3589],\n", 1721 | " [-0.5526, -0.0981, 0.5321, 0.3428],\n", 1722 | " [-0.5299, -0.1081, 0.5077, 0.3493]],\n", 1723 | "\n", 1724 | " [[-0.4519, 0.2216, 0.4772, 0.1063],\n", 1725 | " [-0.5874, 0.0058, 0.5891, 0.3257],\n", 1726 | " [-0.6300, -0.0632, 0.6202, 0.3860],\n", 1727 | " [-0.5675, -0.0843, 0.5478, 0.3589],\n", 1728 | " [-0.5526, -0.0981, 0.5321, 0.3428],\n", 1729 | " [-0.5299, -0.1081, 0.5077, 0.3493]]], grad_fn=)\n", 1730 | "context_vecs.shape: torch.Size([2, 6, 4])\n" 1731 | ] 1732 | } 1733 | ], 1734 | "source": [ 1735 | "class MultiHeadAttentionWrapper(nn.Module):\n", 1736 | "\n", 1737 | " def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):\n", 1738 | " super().__init__()\n", 1739 | " self.heads = nn.ModuleList(\n", 1740 | " [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias) \n", 1741 | " for _ in range(num_heads)]\n", 1742 | " )\n", 1743 | "\n", 1744 | " def forward(self, x):\n", 1745 | " return torch.cat([head(x) for head in self.heads], dim=-1)\n", 1746 | "\n", 1747 | "\n", 1748 | "torch.manual_seed(123)\n", 1749 | "\n", 1750 | "context_length = batch.shape[1] # This is the number of tokens\n", 1751 | "d_in, d_out = 3, 2\n", 1752 | "mha = MultiHeadAttentionWrapper(\n", 1753 | " d_in, d_out, context_length, 0.0, num_heads=2\n", 1754 | ")\n", 1755 | "\n", 1756 | "context_vecs = mha(batch)\n", 1757 | "\n", 1758 | "print(context_vecs)\n", 1759 | "print(\"context_vecs.shape:\", context_vecs.shape)" 1760 | ] 1761 | }, 1762 | { 1763 | "cell_type": "markdown", 1764 | "id": "193d3d2b-2578-40ba-b791-ea2d49328e48", 1765 | "metadata": {}, 1766 | "source": [ 1767 | "- In the implementation above, the embedding dimension is 4, because we `d_out=2` as the embedding dimension for the key, query, and value vectors as well as the context vector. 
And since we have 2 attention heads, we have the output embedding dimension 2*2=4" 1768 | ] 1769 | }, 1770 | { 1771 | "cell_type": "markdown", 1772 | "id": "6836b5da-ef82-4b4c-bda1-72a462e48d4e", 1773 | "metadata": {}, 1774 | "source": [ 1775 | "### 3.6.2 Implementing multi-head attention with weight splits" 1776 | ] 1777 | }, 1778 | { 1779 | "cell_type": "markdown", 1780 | "id": "f4b48d0d-71ba-4fa0-b714-ca80cabcb6f7", 1781 | "metadata": {}, 1782 | "source": [ 1783 | "- While the above is an intuitive and fully functional implementation of multi-head attention (wrapping the single-head attention `CausalAttention` implementation from earlier), we can write a stand-alone class called `MultiHeadAttention` to achieve the same\n", 1784 | "\n", 1785 | "- We don't concatenate single attention heads for this stand-alone `MultiHeadAttention` class\n", 1786 | "- Instead, we create single W_query, W_key, and W_value weight matrices and then split those into individual matrices for each attention head:" 1787 | ] 1788 | }, 1789 | { 1790 | "cell_type": "code", 1791 | "execution_count": 36, 1792 | "id": "110b0188-6e9e-4e56-a988-10523c6c8538", 1793 | "metadata": {}, 1794 | "outputs": [ 1795 | { 1796 | "name": "stdout", 1797 | "output_type": "stream", 1798 | "text": [ 1799 | "tensor([[[0.3190, 0.4858],\n", 1800 | " [0.2943, 0.3897],\n", 1801 | " [0.2856, 0.3593],\n", 1802 | " [0.2693, 0.3873],\n", 1803 | " [0.2639, 0.3928],\n", 1804 | " [0.2575, 0.4028]],\n", 1805 | "\n", 1806 | " [[0.3190, 0.4858],\n", 1807 | " [0.2943, 0.3897],\n", 1808 | " [0.2856, 0.3593],\n", 1809 | " [0.2693, 0.3873],\n", 1810 | " [0.2639, 0.3928],\n", 1811 | " [0.2575, 0.4028]]], grad_fn=)\n", 1812 | "context_vecs.shape: torch.Size([2, 6, 2])\n" 1813 | ] 1814 | } 1815 | ], 1816 | "source": [ 1817 | "class MultiHeadAttention(nn.Module):\n", 1818 | " def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):\n", 1819 | " super().__init__()\n", 1820 | " assert (d_out % num_heads == 0), \\\n", 1821 | " \"d_out must be divisible by num_heads\"\n", 1822 | "\n", 1823 | " self.d_out = d_out\n", 1824 | " self.num_heads = num_heads\n", 1825 | " self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim\n", 1826 | "\n", 1827 | " self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n", 1828 | " self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)\n", 1829 | " self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n", 1830 | " self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs\n", 1831 | " self.dropout = nn.Dropout(dropout)\n", 1832 | " self.register_buffer(\n", 1833 | " \"mask\",\n", 1834 | " torch.triu(torch.ones(context_length, context_length),\n", 1835 | " diagonal=1)\n", 1836 | " )\n", 1837 | "\n", 1838 | " def forward(self, x):\n", 1839 | " b, num_tokens, d_in = x.shape\n", 1840 | "\n", 1841 | " keys = self.W_key(x) # Shape: (b, num_tokens, d_out)\n", 1842 | " queries = self.W_query(x)\n", 1843 | " values = self.W_value(x)\n", 1844 | "\n", 1845 | " # We implicitly split the matrix by adding a `num_heads` dimension\n", 1846 | " # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)\n", 1847 | " keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) \n", 1848 | " values = values.view(b, num_tokens, self.num_heads, self.head_dim)\n", 1849 | " queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)\n", 1850 | "\n", 1851 | " # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)\n", 
1852 | " keys = keys.transpose(1, 2)\n", 1853 | " queries = queries.transpose(1, 2)\n", 1854 | " values = values.transpose(1, 2)\n", 1855 | "\n", 1856 | " # Compute scaled dot-product attention (aka self-attention) with a causal mask\n", 1857 | " attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head\n", 1858 | "\n", 1859 | " # Original mask truncated to the number of tokens and converted to boolean\n", 1860 | " mask_bool = self.mask.bool()[:num_tokens, :num_tokens]\n", 1861 | "\n", 1862 | " # Use the mask to fill attention scores\n", 1863 | " attn_scores.masked_fill_(mask_bool, -torch.inf)\n", 1864 | " \n", 1865 | " attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n", 1866 | " attn_weights = self.dropout(attn_weights)\n", 1867 | "\n", 1868 | " # Shape: (b, num_tokens, num_heads, head_dim)\n", 1869 | " context_vec = (attn_weights @ values).transpose(1, 2) \n", 1870 | " \n", 1871 | " # Combine heads, where self.d_out = self.num_heads * self.head_dim\n", 1872 | " context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)\n", 1873 | " context_vec = self.out_proj(context_vec) # optional projection\n", 1874 | "\n", 1875 | " return context_vec\n", 1876 | "\n", 1877 | "torch.manual_seed(123)\n", 1878 | "\n", 1879 | "batch_size, context_length, d_in = batch.shape\n", 1880 | "d_out = 2\n", 1881 | "mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)\n", 1882 | "\n", 1883 | "context_vecs = mha(batch)\n", 1884 | "\n", 1885 | "print(context_vecs)\n", 1886 | "print(\"context_vecs.shape:\", context_vecs.shape)" 1887 | ] 1888 | }, 1889 | { 1890 | "cell_type": "markdown", 1891 | "id": "d334dfb5-2b6c-4c33-82d5-b4e9db5867bb", 1892 | "metadata": {}, 1893 | "source": [ 1894 | "- Note that the above is essentially a rewritten version of `MultiHeadAttentionWrapper` that is more efficient\n", 1895 | "- The resulting output looks a bit different since the random weight initializations differ, but both are fully functional implementations that can be used in the GPT class we will implement in the upcoming chapters\n", 1896 | "- Note that in addition, we added a linear projection layer (`self.out_proj `) to the `MultiHeadAttention` class above. This is simply a linear transformation that doesn't change the dimensions. 
It's a standard convention to use such a projection layer in LLM implementation, but it's not strictly necessary (recent research has shown that it can be removed without affecting the modeling performance; see the further reading section at the end of this chapter)\n" 1897 | ] 1898 | }, 1899 | { 1900 | "cell_type": "markdown", 1901 | "id": "dbe5d396-c990-45dc-9908-2c621461f851", 1902 | "metadata": {}, 1903 | "source": [ 1904 | "" 1905 | ] 1906 | }, 1907 | { 1908 | "cell_type": "markdown", 1909 | "id": "8b0ed78c-e8ac-4f8f-a479-a98242ae8f65", 1910 | "metadata": {}, 1911 | "source": [ 1912 | "- Note that if you are interested in a compact and efficient implementation of the above, you can also consider the [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) class in PyTorch" 1913 | ] 1914 | }, 1915 | { 1916 | "cell_type": "markdown", 1917 | "id": "363701ad-2022-46c8-9972-390d2a2b9911", 1918 | "metadata": {}, 1919 | "source": [ 1920 | "- Since the above implementation may look a bit complex at first glance, let's look at what happens when executing `attn_scores = queries @ keys.transpose(2, 3)`:" 1921 | ] 1922 | }, 1923 | { 1924 | "cell_type": "code", 1925 | "execution_count": 37, 1926 | "id": "e8cfc1ae-78ab-4faa-bc73-98bd054806c9", 1927 | "metadata": {}, 1928 | "outputs": [ 1929 | { 1930 | "name": "stdout", 1931 | "output_type": "stream", 1932 | "text": [ 1933 | "tensor([[[[1.3208, 1.1631, 1.2879],\n", 1934 | " [1.1631, 2.2150, 1.8424],\n", 1935 | " [1.2879, 1.8424, 2.0402]],\n", 1936 | "\n", 1937 | " [[0.4391, 0.7003, 0.5903],\n", 1938 | " [0.7003, 1.3737, 1.0620],\n", 1939 | " [0.5903, 1.0620, 0.9912]]]])\n" 1940 | ] 1941 | } 1942 | ], 1943 | "source": [ 1944 | "# (b, num_heads, num_tokens, head_dim) = (1, 2, 3, 4)\n", 1945 | "a = torch.tensor([[[[0.2745, 0.6584, 0.2775, 0.8573],\n", 1946 | " [0.8993, 0.0390, 0.9268, 0.7388],\n", 1947 | " [0.7179, 0.7058, 0.9156, 0.4340]],\n", 1948 | "\n", 1949 | " [[0.0772, 0.3565, 0.1479, 0.5331],\n", 1950 | " [0.4066, 0.2318, 0.4545, 0.9737],\n", 1951 | " [0.4606, 0.5159, 0.4220, 0.5786]]]])\n", 1952 | "\n", 1953 | "print(a @ a.transpose(2, 3))" 1954 | ] 1955 | }, 1956 | { 1957 | "cell_type": "markdown", 1958 | "id": "0587b946-c8f2-4888-adbf-5a5032fbfd7b", 1959 | "metadata": {}, 1960 | "source": [ 1961 | "- In this case, the matrix multiplication implementation in PyTorch will handle the 4-dimensional input tensor so that the matrix multiplication is carried out between the 2 last dimensions (num_tokens, head_dim) and then repeated for the individual heads \n", 1962 | "\n", 1963 | "- For instance, the following becomes a more compact way to compute the matrix multiplication for each head separately:" 1964 | ] 1965 | }, 1966 | { 1967 | "cell_type": "code", 1968 | "execution_count": 38, 1969 | "id": "053760f1-1a02-42f0-b3bf-3d939e407039", 1970 | "metadata": {}, 1971 | "outputs": [ 1972 | { 1973 | "name": "stdout", 1974 | "output_type": "stream", 1975 | "text": [ 1976 | "First head:\n", 1977 | " tensor([[1.3208, 1.1631, 1.2879],\n", 1978 | " [1.1631, 2.2150, 1.8424],\n", 1979 | " [1.2879, 1.8424, 2.0402]])\n", 1980 | "\n", 1981 | "Second head:\n", 1982 | " tensor([[0.4391, 0.7003, 0.5903],\n", 1983 | " [0.7003, 1.3737, 1.0620],\n", 1984 | " [0.5903, 1.0620, 0.9912]])\n" 1985 | ] 1986 | } 1987 | ], 1988 | "source": [ 1989 | "first_head = a[0, 0, :, :]\n", 1990 | "first_res = first_head @ first_head.T\n", 1991 | "print(\"First head:\\n\", first_res)\n", 1992 | "\n", 1993 | "second_head = a[0, 1, :, 
:]\n", 1994 | "second_res = second_head @ second_head.T\n", 1995 | "print(\"\\nSecond head:\\n\", second_res)" 1996 | ] 1997 | }, 1998 | { 1999 | "cell_type": "markdown", 2000 | "id": "dec671bf-7938-4304-ad1e-75d9920e7f43", 2001 | "metadata": {}, 2002 | "source": [ 2003 | "# Summary and takeaways" 2004 | ] 2005 | }, 2006 | { 2007 | "cell_type": "markdown", 2008 | "id": "fa3e4113-ffca-432c-b3ec-7a50bd15da25", 2009 | "metadata": {}, 2010 | "source": [ 2011 | "- See the [./multihead-attention.ipynb](./multihead-attention.ipynb) code notebook, which is a concise version of the data loader (chapter 2) plus the multi-head attention class that we implemented in this chapter and will need for training the GPT model in upcoming chapters\n", 2012 | "- You can find the exercise solutions in [./exercise-solutions.ipynb](./exercise-solutions.ipynb)" 2013 | ] 2014 | } 2015 | ], 2016 | "metadata": { 2017 | "kernelspec": { 2018 | "display_name": "Python 3 (ipykernel)", 2019 | "language": "python", 2020 | "name": "python3" 2021 | }, 2022 | "language_info": { 2023 | "codemirror_mode": { 2024 | "name": "ipython", 2025 | "version": 3 2026 | }, 2027 | "file_extension": ".py", 2028 | "mimetype": "text/x-python", 2029 | "name": "python", 2030 | "nbconvert_exporter": "python", 2031 | "pygments_lexer": "ipython3", 2032 | "version": "3.11.4" 2033 | } 2034 | }, 2035 | "nbformat": 4, 2036 | "nbformat_minor": 5 2037 | } 2038 | -------------------------------------------------------------------------------- /Images/1.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danyalalam/LLM-from-scratch/f2436e44fd3cc2fe52aebef3015235117ad122c6/Images/1.webp -------------------------------------------------------------------------------- /Images/2.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danyalalam/LLM-from-scratch/f2436e44fd3cc2fe52aebef3015235117ad122c6/Images/2.webp -------------------------------------------------------------------------------- /Images/3.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danyalalam/LLM-from-scratch/f2436e44fd3cc2fe52aebef3015235117ad122c6/Images/3.webp -------------------------------------------------------------------------------- /Images/4.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danyalalam/LLM-from-scratch/f2436e44fd3cc2fe52aebef3015235117ad122c6/Images/4.webp -------------------------------------------------------------------------------- /Images/5.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danyalalam/LLM-from-scratch/f2436e44fd3cc2fe52aebef3015235117ad122c6/Images/5.webp -------------------------------------------------------------------------------- /Images/6.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danyalalam/LLM-from-scratch/f2436e44fd3cc2fe52aebef3015235117ad122c6/Images/6.webp -------------------------------------------------------------------------------- /Images/7.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danyalalam/LLM-from-scratch/f2436e44fd3cc2fe52aebef3015235117ad122c6/Images/7.webp 
-------------------------------------------------------------------------------- /Images/Token Embeddings.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danyalalam/LLM-from-scratch/f2436e44fd3cc2fe52aebef3015235117ad122c6/Images/Token Embeddings.webp -------------------------------------------------------------------------------- /Images/data sampling with sliding window.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danyalalam/LLM-from-scratch/f2436e44fd3cc2fe52aebef3015235117ad122c6/Images/data sampling with sliding window.webp -------------------------------------------------------------------------------- /Images/positional Encoding.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danyalalam/LLM-from-scratch/f2436e44fd3cc2fe52aebef3015235117ad122c6/Images/positional Encoding.webp -------------------------------------------------------------------------------- /Images/step 1.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danyalalam/LLM-from-scratch/f2436e44fd3cc2fe52aebef3015235117ad122c6/Images/step 1.webp -------------------------------------------------------------------------------- /Images/word embeddings.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danyalalam/LLM-from-scratch/f2436e44fd3cc2fe52aebef3015235117ad122c6/Images/word embeddings.webp -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # LLM from Scratch 2 | 3 | Welcome to the **LLM from Scratch** project! This repository contains code and resources for building a Language Model (LLM) from the ground up. 4 | 5 | The mental model below summarizes the contents that will be covered in this repo. 6 | 7 | 8 | 9 | 10 | ## Table of Contents 11 | 12 | - [Introduction](#introduction) 13 | - [Features](#features) 14 | - [Installation](#installation) 15 | - [Usage](#usage) 16 | - [Contributing](#contributing) 17 | - [License](#license) 18 | 19 | ## Introduction 20 | 21 | This project aims to provide a comprehensive guide and implementation for creating a Language Model from scratch. It covers the fundamental concepts, algorithms, and techniques required to build and train a functional LLM. 22 | 23 | ## Features 24 | 25 | - Step-by-step guide to building an LLM 26 | - Sample datasets for training and evaluation 27 | - Modular and extensible codebase 28 | - Detailed documentation and tutorials 29 | 30 | 31 | 32 | ## License 33 | 34 | This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details. 35 | -------------------------------------------------------------------------------- /Tokenizer.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "1-yMc2xt2Z_0" 7 | }, 8 | "source": [ 9 | "## Reading in a short story as text sample into Python." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "id": "gn6vgk0o2Z_2" 16 | }, 17 | "source": [ 18 | "## Step 1: Creating Tokens" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "id": "A8RJlJgy2Z_3" 25 | }, 26 | "source": [ 27 | "
\n", 28 | "\n", 29 | "The print command prints the total number of characters followed by the first 100\n", 30 | "characters of this file for illustration purposes.
" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": { 37 | "collapsed": true, 38 | "id": "_gvhtisW2Z_4", 39 | "outputId": "f13def41-f2d8-467a-d291-86fba141f144" 40 | }, 41 | "outputs": [ 42 | { 43 | "name": "stdout", 44 | "output_type": "stream", 45 | "text": [ 46 | "Total number of character: 20479\n", 47 | "I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no \n" 48 | ] 49 | } 50 | ], 51 | "source": [ 52 | "with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n", 53 | " raw_text = f.read()\n", 54 | "\n", 55 | "print(\"Total number of character:\", len(raw_text))\n", 56 | "print(raw_text[:99])" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": { 62 | "id": "Yu6Qc30p2Z_7" 63 | }, 64 | "source": [ 65 | "
\n", 66 | "\n", 67 | "Our goal is to tokenize this 20,479-character short story into individual words and special\n", 68 | "characters that we can then turn into embeddings for LLM training
" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": { 74 | "id": "5QY4ea_c2Z_7" 75 | }, 76 | "source": [ 77 | "
\n", 78 | "\n", 79 | "Note that it's common to process millions of articles and hundreds of thousands of\n", 80 | "books -- many gigabytes of text -- when working with LLMs. However, for educational\n", 81 | "purposes, it's sufficient to work with smaller text samples like a single book to\n", 82 | "illustrate the main ideas behind the text processing steps and to make it possible to\n", 83 | "run it in reasonable time on consumer hardware.
" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": { 89 | "id": "iASCPKiH2Z_8" 90 | }, 91 | "source": [ 92 | "
\n", 93 | "\n", 94 | "How can we best split this text to obtain a list of tokens? For this, we go on a small\n", 95 | "excursion and use Python's regular expression library re for illustration purposes. (Note\n", 96 | "that you don't have to learn or memorize any regular expression syntax since we will\n", 97 | "transition to a pre-built tokenizer later in this chapter.)
" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": { 103 | "id": "muW-yg2w2Z_9" 104 | }, 105 | "source": [ 106 | "
\n", 107 | "\n", 108 | "Using some simple example text, we can use the re.split command with the following\n", 109 | "syntax to split a text on whitespace characters:
" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": { 116 | "id": "CIjurEIZ2Z_-", 117 | "outputId": "1c96fba0-4be1-419e-f7ba-8c22a701f996" 118 | }, 119 | "outputs": [ 120 | { 121 | "name": "stdout", 122 | "output_type": "stream", 123 | "text": [ 124 | "['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']\n" 125 | ] 126 | } 127 | ], 128 | "source": [ 129 | "import re\n", 130 | "\n", 131 | "text = \"Hello, world. This, is a test.\"\n", 132 | "result = re.split(r'(\\s)', text)\n", 133 | "\n", 134 | "print(result)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": { 140 | "id": "W1zGB2TY2Z__" 141 | }, 142 | "source": [ 143 | "
\n", 144 | "The result is a list of individual words, whitespaces, and punctuation characters:\n", 145 | "
\n" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": { 151 | "id": "glQNv0QE2Z__" 152 | }, 153 | "source": [ 154 | "
\n", 155 | "\n", 156 | "Let's modify the regular expression splits on whitespaces (\\s) and commas, and periods\n", 157 | "([,.]):
" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": { 164 | "id": "XJ1voBWR2Z__", 165 | "outputId": "e7e13092-230c-43ae-d35d-6247f55389ec" 166 | }, 167 | "outputs": [ 168 | { 169 | "name": "stdout", 170 | "output_type": "stream", 171 | "text": [ 172 | "['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']\n" 173 | ] 174 | } 175 | ], 176 | "source": [ 177 | "result = re.split(r'([,.]|\\s)', text)\n", 178 | "\n", 179 | "print(result)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": { 185 | "id": "0hLywFqp2aAA" 186 | }, 187 | "source": [ 188 | "
\n", 189 | "We can see that the words and punctuation characters are now separate list entries just as\n", 190 | "we wanted\n", 191 | "
\n" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": { 197 | "id": "LkN7wK932aAB" 198 | }, 199 | "source": [ 200 | "
\n", 201 | "\n", 202 | "A small remaining issue is that the list still includes whitespace characters. Optionally, we\n", 203 | "can remove these redundant characters safely as follows:
" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "metadata": { 210 | "id": "ylK1jllW2aAB", 211 | "outputId": "d0ea860e-1abd-4227-ba9f-99f55709359b" 212 | }, 213 | "outputs": [ 214 | { 215 | "name": "stdout", 216 | "output_type": "stream", 217 | "text": [ 218 | "['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']\n" 219 | ] 220 | } 221 | ], 222 | "source": [ 223 | "result = [item for item in result if item.strip()]\n", 224 | "print(result)" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": { 230 | "id": "FEVg2umE2aAB" 231 | }, 232 | "source": [ 233 | "
\n", 234 | "\n", 235 | "REMOVING WHITESPACES OR NOT\n", 236 | "\n", 237 | "\n", 238 | "When developing a simple tokenizer, whether we should encode whitespaces as\n", 239 | "separate characters or just remove them depends on our application and its\n", 240 | "requirements. Removing whitespaces reduces the memory and computing\n", 241 | "requirements. However, keeping whitespaces can be useful if we train models that\n", 242 | "are sensitive to the exact structure of the text (for example, Python code, which is\n", 243 | "sensitive to indentation and spacing). Here, we remove whitespaces for simplicity\n", 244 | "and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme\n", 245 | "that includes whitespaces.\n", 246 | "\n", 247 | "
" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": { 253 | "id": "NCeF60eS2aAC" 254 | }, 255 | "source": [ 256 | "
\n", 257 | "\n", 258 | "The tokenization scheme we devised above works well on the simple sample text. Let's\n", 259 | "modify it a bit further so that it can also handle other types of punctuation, such as\n", 260 | "question marks, quotation marks, and the double-dashes we have seen earlier in the first\n", 261 | "100 characters of Edith Wharton's short story, along with additional special characters:
" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": null, 267 | "metadata": { 268 | "id": "1rl31IlP2aAC", 269 | "outputId": "f204c41c-caaf-48ea-c077-1a1c20045045" 270 | }, 271 | "outputs": [ 272 | { 273 | "name": "stdout", 274 | "output_type": "stream", 275 | "text": [ 276 | "['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']\n" 277 | ] 278 | } 279 | ], 280 | "source": [ 281 | "text = \"Hello, world. Is this-- a test?\"\n", 282 | "result = re.split(r'([,.:;?_!\"()\\']|--|\\s)', text)\n", 283 | "result = [item.strip() for item in result if item.strip()]\n", 284 | "print(result)" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": null, 290 | "metadata": { 291 | "id": "bcwGtK_92aAC", 292 | "outputId": "b96d8428-4c80-4a81-f440-3129a3c7dc76" 293 | }, 294 | "outputs": [ 295 | { 296 | "name": "stdout", 297 | "output_type": "stream", 298 | "text": [ 299 | "['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']\n" 300 | ] 301 | } 302 | ], 303 | "source": [ 304 | "# Strip whitespace from each item and then filter out any empty strings.\n", 305 | "result = [item for item in result if item.strip()]\n", 306 | "print(result)" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": null, 312 | "metadata": { 313 | "id": "3EhGnEg12aAD", 314 | "outputId": "3ea1ab0c-5b27-4381-e12b-cd5ff140496e" 315 | }, 316 | "outputs": [ 317 | { 318 | "name": "stdout", 319 | "output_type": "stream", 320 | "text": [ 321 | "['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']\n" 322 | ] 323 | } 324 | ], 325 | "source": [ 326 | "text = \"Hello, world. Is this-- a test?\"\n", 327 | "\n", 328 | "result = re.split(r'([,.:;?_!\"()\\']|--|\\s)', text)\n", 329 | "result = [item.strip() for item in result if item.strip()]\n", 330 | "print(result)" 331 | ] 332 | }, 333 | { 334 | "cell_type": "markdown", 335 | "metadata": { 336 | "id": "hsYCIh_c2aAD" 337 | }, 338 | "source": [ 339 | "
\n", 340 | "\n", 341 | "Now that we got a basic tokenizer working, let's apply it to Edith Wharton's entire short\n", 342 | "story:\n", 343 | "\n", 344 | "
" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": { 351 | "id": "cxlhYDNa2aAE", 352 | "outputId": "afac311e-1615-4d82-873a-19e269aca5e7" 353 | }, 354 | "outputs": [ 355 | { 356 | "name": "stdout", 357 | "output_type": "stream", 358 | "text": [ 359 | "['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "preprocessed = re.split(r'([,.:;?_!\"()\\']|--|\\s)', raw_text)\n", 365 | "preprocessed = [item.strip() for item in preprocessed if item.strip()]\n", 366 | "print(preprocessed[:30])" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": { 373 | "id": "95H7bkzw2aAE", 374 | "outputId": "f52bb6f0-fbff-4edc-dc3f-b8d49b52ab66" 375 | }, 376 | "outputs": [ 377 | { 378 | "name": "stdout", 379 | "output_type": "stream", 380 | "text": [ 381 | "4690\n" 382 | ] 383 | } 384 | ], 385 | "source": [ 386 | "print(len(preprocessed))\n" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": { 392 | "id": "db_vZSmh2aAF" 393 | }, 394 | "source": [ 395 | "## Step 2: Creating Token IDs" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": { 401 | "id": "_o6XYXoR2aAF" 402 | }, 403 | "source": [ 404 | "
\n", 405 | "\n", 406 | "In the previous section, we tokenized Edith Wharton's short story and assigned it to a\n", 407 | "Python variable called preprocessed. Let's now create a list of all unique tokens and sort\n", 408 | "them alphabetically to determine the vocabulary size:
" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": { 415 | "id": "RIai--ET2aAF", 416 | "outputId": "80a60efe-f3a6-4c7e-bb34-6255f6330f8f" 417 | }, 418 | "outputs": [ 419 | { 420 | "name": "stdout", 421 | "output_type": "stream", 422 | "text": [ 423 | "1130\n" 424 | ] 425 | } 426 | ], 427 | "source": [ 428 | "all_words = sorted(set(preprocessed))\n", 429 | "vocab_size = len(all_words)\n", 430 | "\n", 431 | "print(vocab_size)" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": { 437 | "id": "WnizFxBf2aAF" 438 | }, 439 | "source": [ 440 | "
\n", 441 | "\n", 442 | "After determining that the vocabulary size is 1,130 via the above code, we create the\n", 443 | "vocabulary and print its first 51 entries for illustration purposes:\n", 444 | "\n", 445 | "
" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "metadata": { 452 | "id": "VDFQNQQU2aAG" 453 | }, 454 | "outputs": [], 455 | "source": [ 456 | "vocab = {token:integer for integer,token in enumerate(all_words)}\n" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": null, 462 | "metadata": { 463 | "collapsed": true, 464 | "id": "jVbGxlS-2aAG", 465 | "outputId": "11ecd500-8765-4e08-ee94-3f2159256cd0" 466 | }, 467 | "outputs": [ 468 | { 469 | "name": "stdout", 470 | "output_type": "stream", 471 | "text": [ 472 | "('!', 0)\n", 473 | "('\"', 1)\n", 474 | "(\"'\", 2)\n", 475 | "('(', 3)\n", 476 | "(')', 4)\n", 477 | "(',', 5)\n", 478 | "('--', 6)\n", 479 | "('.', 7)\n", 480 | "(':', 8)\n", 481 | "(';', 9)\n", 482 | "('?', 10)\n", 483 | "('A', 11)\n", 484 | "('Ah', 12)\n", 485 | "('Among', 13)\n", 486 | "('And', 14)\n", 487 | "('Are', 15)\n", 488 | "('Arrt', 16)\n", 489 | "('As', 17)\n", 490 | "('At', 18)\n", 491 | "('Be', 19)\n", 492 | "('Begin', 20)\n", 493 | "('Burlington', 21)\n", 494 | "('But', 22)\n", 495 | "('By', 23)\n", 496 | "('Carlo', 24)\n", 497 | "('Chicago', 25)\n", 498 | "('Claude', 26)\n", 499 | "('Come', 27)\n", 500 | "('Croft', 28)\n", 501 | "('Destroyed', 29)\n", 502 | "('Devonshire', 30)\n", 503 | "('Don', 31)\n", 504 | "('Dubarry', 32)\n", 505 | "('Emperors', 33)\n", 506 | "('Florence', 34)\n", 507 | "('For', 35)\n", 508 | "('Gallery', 36)\n", 509 | "('Gideon', 37)\n", 510 | "('Gisburn', 38)\n", 511 | "('Gisburns', 39)\n", 512 | "('Grafton', 40)\n", 513 | "('Greek', 41)\n", 514 | "('Grindle', 42)\n", 515 | "('Grindles', 43)\n", 516 | "('HAD', 44)\n", 517 | "('Had', 45)\n", 518 | "('Hang', 46)\n", 519 | "('Has', 47)\n", 520 | "('He', 48)\n", 521 | "('Her', 49)\n", 522 | "('Hermia', 50)\n" 523 | ] 524 | } 525 | ], 526 | "source": [ 527 | "for i, item in enumerate(vocab.items()):\n", 528 | " print(item)\n", 529 | " if i >= 50:\n", 530 | " break" 531 | ] 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "metadata": { 536 | "id": "u1h7H3ws2aAG" 537 | }, 538 | "source": [ 539 | "
\n", 540 | "As we can see, based on the output above, the dictionary contains individual tokens\n", 541 | "associated with unique integer labels.\n", 542 | "
" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": { 548 | "id": "y09IgDqp2aAH" 549 | }, 550 | "source": [ 551 | "
\n", 552 | "\n", 553 | "Later in this book, when we want to convert the outputs of an LLM from numbers back into\n", 554 | "text, we also need a way to turn token IDs into text.\n", 555 | "\n", 556 | "For this, we can create an inverse\n", 557 | "version of the vocabulary that maps token IDs back to corresponding text tokens.\n", 558 | "\n", 559 | "
" 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": { 565 | "id": "-ApwhDbU2aAH" 566 | }, 567 | "source": [ 568 | "
\n", 569 | "\n", 570 | "Let's implement a complete tokenizer class in Python.\n", 571 | "\n", 572 | "The class will have an encode method that splits\n", 573 | "text into tokens and carries out the string-to-integer mapping to produce token IDs via the\n", 574 | "vocabulary.\n", 575 | "\n", 576 | "In addition, we implement a decode method that carries out the reverse\n", 577 | "integer-to-string mapping to convert the token IDs back into text.\n", 578 | "\n", 579 | "
" 580 | ] 581 | }, 582 | { 583 | "cell_type": "markdown", 584 | "metadata": { 585 | "id": "ijzt__VO2aAI" 586 | }, 587 | "source": [ 588 | "
\n", 589 | " \n", 590 | "Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods\n", 591 | " \n", 592 | "Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens\n", 593 | "\n", 594 | "Step 3: Process input text into token IDs\n", 595 | "\n", 596 | "Step 4: Convert token IDs back into text\n", 597 | "\n", 598 | "Step 5: Replace spaces before the specified punctuation\n", 599 | "\n", 600 | "
\n", 601 | "\n" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": null, 607 | "metadata": { 608 | "id": "7_f0SGhc2aAI" 609 | }, 610 | "outputs": [], 611 | "source": [ 612 | "class SimpleTokenizerV1:\n", 613 | " def __init__(self, vocab):\n", 614 | " self.str_to_int = vocab\n", 615 | " self.int_to_str = {i:s for s,i in vocab.items()}\n", 616 | "\n", 617 | " def encode(self, text):\n", 618 | " preprocessed = re.split(r'([,.:;?_!\"()\\']|--|\\s)', text)\n", 619 | "\n", 620 | " preprocessed = [\n", 621 | " item.strip() for item in preprocessed if item.strip()\n", 622 | " ]\n", 623 | " ids = [self.str_to_int[s] for s in preprocessed]\n", 624 | " return ids\n", 625 | "\n", 626 | " def decode(self, ids):\n", 627 | " text = \" \".join([self.int_to_str[i] for i in ids])\n", 628 | " # Replace spaces before the specified punctuations\n", 629 | " text = re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text)\n", 630 | " return text" 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "metadata": { 636 | "id": "NUsHkEAa2aAI" 637 | }, 638 | "source": [ 639 | "
\n", 640 | "\n", 641 | "Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a\n", 642 | "passage from Edith Wharton's short story to try it out in practice:\n", 643 | "
" 644 | ] 645 | }, 646 | { 647 | "cell_type": "code", 648 | "execution_count": null, 649 | "metadata": { 650 | "id": "KaO9Tjqe2aAJ", 651 | "outputId": "32a61499-39f2-44ea-86fd-6d510e4cd2fe" 652 | }, 653 | "outputs": [ 654 | { 655 | "name": "stdout", 656 | "output_type": "stream", 657 | "text": [ 658 | "[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]\n" 659 | ] 660 | } 661 | ], 662 | "source": [ 663 | "tokenizer = SimpleTokenizerV1(vocab)\n", 664 | "\n", 665 | "text = \"\"\"\"It's the last he painted, you know,\"\n", 666 | " Mrs. Gisburn said with pardonable pride.\"\"\"\n", 667 | "ids = tokenizer.encode(text)\n", 668 | "print(ids)" 669 | ] 670 | }, 671 | { 672 | "cell_type": "markdown", 673 | "metadata": { 674 | "id": "lJ0mJOGV2aAJ" 675 | }, 676 | "source": [ 677 | "
\n", 678 | " \n", 679 | "The code above prints the following token IDs:\n", 680 | "Next, let's see if we can turn these token IDs back into text using the decode method:\n", 681 | "
" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": null, 687 | "metadata": { 688 | "id": "XTUn_bC02aAJ", 689 | "outputId": "a7de0b2f-aa76-4d82-cb3c-1484aa0e906d" 690 | }, 691 | "outputs": [ 692 | { 693 | "data": { 694 | "text/plain": [ 695 | "'\" It\\' s the last he painted, you know,\" Mrs. Gisburn said with pardonable pride.'" 696 | ] 697 | }, 698 | "execution_count": 91, 699 | "metadata": {}, 700 | "output_type": "execute_result" 701 | } 702 | ], 703 | "source": [ 704 | "tokenizer.decode(ids)\n" 705 | ] 706 | }, 707 | { 708 | "cell_type": "markdown", 709 | "metadata": { 710 | "id": "Aa6KgDes2aAK" 711 | }, 712 | "source": [ 713 | "
\n", 714 | " \n", 715 | "Based on the output above, we can see that the decode method successfully converted the\n", 716 | "token IDs back into the original text.\n", 717 | "
" 718 | ] 719 | }, 720 | { 721 | "cell_type": "markdown", 722 | "metadata": { 723 | "id": "RDo2nnqb2aAS" 724 | }, 725 | "source": [ 726 | "
\n", 727 | "\n", 728 | "So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing\n", 729 | "text based on a snippet from the training set.\n", 730 | "\n", 731 | "Let's now apply it to a new text sample that\n", 732 | "is not contained in the training set:\n", 733 | "
" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": null, 739 | "metadata": { 740 | "id": "hwh-jek42aAS", 741 | "outputId": "b843b8f0-44b9-4641-ff5b-5594d96fdc79" 742 | }, 743 | "outputs": [ 744 | { 745 | "ename": "KeyError", 746 | "evalue": "'Hello'", 747 | "output_type": "error", 748 | "traceback": [ 749 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 750 | "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", 751 | "Cell \u001b[0;32mIn[92], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m text \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mHello, do you like tea?\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m----> 2\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[43mtokenizer\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mencode\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtext\u001b[49m\u001b[43m)\u001b[49m)\n", 752 | "Cell \u001b[0;32mIn[89], line 12\u001b[0m, in \u001b[0;36mSimpleTokenizerV1.encode\u001b[0;34m(self, text)\u001b[0m\n\u001b[1;32m 7\u001b[0m preprocessed \u001b[38;5;241m=\u001b[39m re\u001b[38;5;241m.\u001b[39msplit(\u001b[38;5;124mr\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m([,.:;?_!\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m()\u001b[39m\u001b[38;5;130;01m\\'\u001b[39;00m\u001b[38;5;124m]|--|\u001b[39m\u001b[38;5;124m\\\u001b[39m\u001b[38;5;124ms)\u001b[39m\u001b[38;5;124m'\u001b[39m, text)\n\u001b[1;32m 9\u001b[0m preprocessed \u001b[38;5;241m=\u001b[39m [\n\u001b[1;32m 10\u001b[0m item\u001b[38;5;241m.\u001b[39mstrip() \u001b[38;5;28;01mfor\u001b[39;00m item \u001b[38;5;129;01min\u001b[39;00m preprocessed \u001b[38;5;28;01mif\u001b[39;00m item\u001b[38;5;241m.\u001b[39mstrip()\n\u001b[1;32m 11\u001b[0m ]\n\u001b[0;32m---> 12\u001b[0m ids \u001b[38;5;241m=\u001b[39m [\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mstr_to_int\u001b[49m\u001b[43m[\u001b[49m\u001b[43ms\u001b[49m\u001b[43m]\u001b[49m \u001b[38;5;28;01mfor\u001b[39;00m s \u001b[38;5;129;01min\u001b[39;00m preprocessed]\n\u001b[1;32m 13\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m ids\n", 753 | "\u001b[0;31mKeyError\u001b[0m: 'Hello'" 754 | ] 755 | } 756 | ], 757 | "source": [ 758 | "text = \"Hello, do you like tea?\"\n", 759 | "print(tokenizer.encode(text))" 760 | ] 761 | }, 762 | { 763 | "cell_type": "markdown", 764 | "metadata": { 765 | "id": "_kvSmXPw2aAT" 766 | }, 767 | "source": [ 768 | "
\n", 769 | " \n", 770 | "The problem is that the word \"Hello\" was not used in the The Verdict short story.\n", 771 | "\n", 772 | "Hence, it\n", 773 | "is not contained in the vocabulary.\n", 774 | "\n", 775 | "This highlights the need to consider large and diverse\n", 776 | "training sets to extend the vocabulary when working on LLMs.\n", 777 | "\n", 778 | "
" 779 | ] 780 | }, 781 | { 782 | "cell_type": "markdown", 783 | "metadata": { 784 | "id": "UQdtJrny2aAT" 785 | }, 786 | "source": [ 787 | "### ADDING SPECIAL CONTEXT TOKENS\n", 788 | "\n", 789 | "In the previous section, we implemented a simple tokenizer and applied it to a passage\n", 790 | "from the training set.\n", 791 | "\n", 792 | "In this section, we will modify this tokenizer to handle unknown\n", 793 | "words.\n", 794 | "\n", 795 | "\n", 796 | "In particular, we will modify the vocabulary and tokenizer we implemented in the\n", 797 | "previous section, SimpleTokenizerV2, to support two new tokens, <|unk|> and\n", 798 | "<|endoftext|>" 799 | ] 800 | }, 801 | { 802 | "cell_type": "markdown", 803 | "metadata": { 804 | "id": "Ws1p81XP2aAT" 805 | }, 806 | "source": [ 807 | "
\n", 808 | "\n", 809 | "We can modify the tokenizer to use an <|unk|> token if it\n", 810 | "encounters a word that is not part of the vocabulary.\n", 811 | "\n", 812 | "Furthermore, we add a token between\n", 813 | "unrelated texts.\n", 814 | "\n", 815 | "For example, when training GPT-like LLMs on multiple independent\n", 816 | "documents or books, it is common to insert a token before each document or book that\n", 817 | "follows a previous text source\n", 818 | "\n", 819 | "
\n", 820 | "\n" 821 | ] 822 | }, 823 | { 824 | "cell_type": "markdown", 825 | "metadata": { 826 | "id": "O0zBzO_W2aAU" 827 | }, 828 | "source": [ 829 | "
\n", 830 | "\n", 831 | "Let's now modify the vocabulary to include these two special tokens, and\n", 832 | "<|endoftext|>, by adding these to the list of all unique words that we created in the\n", 833 | "previous section:\n", 834 | "
" 835 | ] 836 | }, 837 | { 838 | "cell_type": "code", 839 | "execution_count": null, 840 | "metadata": { 841 | "id": "Ii4DqLk_2aAU" 842 | }, 843 | "outputs": [], 844 | "source": [ 845 | "all_tokens = sorted(list(set(preprocessed)))\n", 846 | "all_tokens.extend([\"<|endoftext|>\", \"<|unk|>\"])\n", 847 | "\n", 848 | "vocab = {token:integer for integer,token in enumerate(all_tokens)}" 849 | ] 850 | }, 851 | { 852 | "cell_type": "code", 853 | "execution_count": null, 854 | "metadata": { 855 | "id": "V6QmmYXW2aAU", 856 | "outputId": "f4d89ab4-7b2b-42be-ea50-d6e9175e68d2" 857 | }, 858 | "outputs": [ 859 | { 860 | "data": { 861 | "text/plain": [ 862 | "1132" 863 | ] 864 | }, 865 | "execution_count": 69, 866 | "metadata": {}, 867 | "output_type": "execute_result" 868 | } 869 | ], 870 | "source": [ 871 | "len(vocab.items())\n" 872 | ] 873 | }, 874 | { 875 | "cell_type": "markdown", 876 | "metadata": { 877 | "id": "rjfYoTY22aAV" 878 | }, 879 | "source": [ 880 | "
\n", 881 | " \n", 882 | "Based on the output of the print statement above, the new vocabulary size is 1132 (the\n", 883 | "vocabulary size in the previous section was 1130).\n", 884 | "\n", 885 | "
\n", 886 | "\n" 887 | ] 888 | }, 889 | { 890 | "cell_type": "markdown", 891 | "metadata": { 892 | "id": "MoM--dPK2aAV" 893 | }, 894 | "source": [ 895 | "
\n", 896 | "\n", 897 | "As an additional quick check, let's print the last 5 entries of the updated vocabulary:\n", 898 | "
" 899 | ] 900 | }, 901 | { 902 | "cell_type": "code", 903 | "execution_count": null, 904 | "metadata": { 905 | "id": "Xmr_BBSY2aAV", 906 | "outputId": "a8e13e98-1a8a-49b2-e8ae-7342789bae9d" 907 | }, 908 | "outputs": [ 909 | { 910 | "name": "stdout", 911 | "output_type": "stream", 912 | "text": [ 913 | "('younger', 1127)\n", 914 | "('your', 1128)\n", 915 | "('yourself', 1129)\n", 916 | "('<|endoftext|>', 1130)\n", 917 | "('<|unk|>', 1131)\n" 918 | ] 919 | } 920 | ], 921 | "source": [ 922 | "for i, item in enumerate(list(vocab.items())[-5:]):\n", 923 | " print(item)" 924 | ] 925 | }, 926 | { 927 | "cell_type": "markdown", 928 | "metadata": { 929 | "id": "rxAZnkW32aAV" 930 | }, 931 | "source": [ 932 | "
\n", 933 | "\n", 934 | "A simple text tokenizer that handles unknown words
\n", 935 | "\n" 936 | ] 937 | }, 938 | { 939 | "cell_type": "markdown", 940 | "metadata": { 941 | "id": "g-usthWV2aAV" 942 | }, 943 | "source": [ 944 | "
\n", 945 | " \n", 946 | "Step 1: Replace unknown words by <|unk|> tokens\n", 947 | " \n", 948 | "Step 2: Replace spaces before the specified punctuations\n", 949 | "\n", 950 | "
\n" 951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": null, 956 | "metadata": { 957 | "id": "cPvEcgPn2aAW" 958 | }, 959 | "outputs": [], 960 | "source": [ 961 | "class SimpleTokenizerV2:\n", 962 | " def __init__(self, vocab):\n", 963 | " self.str_to_int = vocab\n", 964 | " self.int_to_str = { i:s for s,i in vocab.items()}\n", 965 | "\n", 966 | " def encode(self, text):\n", 967 | " preprocessed = re.split(r'([,.:;?_!\"()\\']|--|\\s)', text)\n", 968 | " preprocessed = [item.strip() for item in preprocessed if item.strip()]\n", 969 | " preprocessed = [\n", 970 | " item if item in self.str_to_int\n", 971 | " else \"<|unk|>\" for item in preprocessed\n", 972 | " ]\n", 973 | "\n", 974 | " ids = [self.str_to_int[s] for s in preprocessed]\n", 975 | " return ids\n", 976 | "\n", 977 | " def decode(self, ids):\n", 978 | " text = \" \".join([self.int_to_str[i] for i in ids])\n", 979 | " # Replace spaces before the specified punctuations\n", 980 | " text = re.sub(r'\\s+([,.:;?!\"()\\'])', r'\\1', text)\n", 981 | " return text" 982 | ] 983 | }, 984 | { 985 | "cell_type": "code", 986 | "execution_count": null, 987 | "metadata": { 988 | "id": "kx4lqyL92aAW", 989 | "outputId": "687e8cf8-4691-46b3-c09f-9f014faeb2b3" 990 | }, 991 | "outputs": [ 992 | { 993 | "name": "stdout", 994 | "output_type": "stream", 995 | "text": [ 996 | "Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.\n" 997 | ] 998 | } 999 | ], 1000 | "source": [ 1001 | "tokenizer = SimpleTokenizerV2(vocab)\n", 1002 | "\n", 1003 | "text1 = \"Hello, do you like tea?\"\n", 1004 | "text2 = \"In the sunlit terraces of the palace.\"\n", 1005 | "\n", 1006 | "text = \" <|endoftext|> \".join((text1, text2))\n", 1007 | "\n", 1008 | "print(text)" 1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": "code", 1013 | "execution_count": null, 1014 | "metadata": { 1015 | "id": "KBpoj3YA2aAX", 1016 | "outputId": "a188539c-2269-4ec1-8a6d-8f0ff0f6c063" 1017 | }, 1018 | "outputs": [ 1019 | { 1020 | "data": { 1021 | "text/plain": [ 1022 | "[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]" 1023 | ] 1024 | }, 1025 | "execution_count": 73, 1026 | "metadata": {}, 1027 | "output_type": "execute_result" 1028 | } 1029 | ], 1030 | "source": [ 1031 | "tokenizer.encode(text)\n" 1032 | ] 1033 | }, 1034 | { 1035 | "cell_type": "code", 1036 | "execution_count": null, 1037 | "metadata": { 1038 | "id": "fBvqo0tr2aAX", 1039 | "outputId": "13758089-44ff-431f-be14-07eea7a406e6" 1040 | }, 1041 | "outputs": [ 1042 | { 1043 | "data": { 1044 | "text/plain": [ 1045 | "'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'" 1046 | ] 1047 | }, 1048 | "execution_count": 74, 1049 | "metadata": {}, 1050 | "output_type": "execute_result" 1051 | } 1052 | ], 1053 | "source": [ 1054 | "tokenizer.decode(tokenizer.encode(text))" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "markdown", 1059 | "metadata": { 1060 | "id": "v084jSYM2aAY" 1061 | }, 1062 | "source": [ 1063 | "\n", 1064 | "
\n", 1065 | " \n", 1066 | "Based on comparing the de-tokenized text above with the original input text, we know that\n", 1067 | "the training dataset, Edith Wharton's short story The Verdict, did not contain the words\n", 1068 | "\"Hello\" and \"palace.\"\n", 1069 | "\n", 1070 | "
\n" 1071 | ] 1072 | }, 1073 | { 1074 | "cell_type": "markdown", 1075 | "metadata": { 1076 | "id": "nRaCsnt52aAY" 1077 | }, 1078 | "source": [ 1079 | "
\n", 1080 | "\n", 1081 | "So far, we have discussed tokenization as an essential step in processing text as input to\n", 1082 | "LLMs. Depending on the LLM, some researchers also consider additional special tokens such\n", 1083 | "as the following:\n", 1084 | "\n", 1085 | "[BOS] (beginning of sequence): This token marks the start of a text. It\n", 1086 | "signifies to the LLM where a piece of content begins.\n", 1087 | "\n", 1088 | "[EOS] (end of sequence): This token is positioned at the end of a text,\n", 1089 | "and is especially useful when concatenating multiple unrelated texts,\n", 1090 | "similar to <|endoftext|>. For instance, when combining two different\n", 1091 | "Wikipedia articles or books, the [EOS] token indicates where one article\n", 1092 | "ends and the next one begins.\n", 1093 | "\n", 1094 | "[PAD] (padding): When training LLMs with batch sizes larger than one,\n", 1095 | "the batch might contain texts of varying lengths. To ensure all texts have\n", 1096 | "the same length, the shorter texts are extended or \"padded\" using the\n", 1097 | "[PAD] token, up to the length of the longest text in the batch.\n", 1098 | "\n", 1099 | "
\n" 1100 | ] 1101 | }, 1102 | { 1103 | "cell_type": "markdown", 1104 | "metadata": { 1105 | "id": "aJBsMe7x2aAZ" 1106 | }, 1107 | "source": [ 1108 | "
\n", 1109 | "\n", 1110 | "Note that the tokenizer used for GPT models does not need any of these tokens mentioned\n", 1111 | "above but only uses an <|endoftext|> token for simplicity\n", 1112 | "\n", 1113 | "
" 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "markdown", 1118 | "metadata": { 1119 | "id": "Ybr33L5N2aAZ" 1120 | }, 1121 | "source": [ 1122 | "
\n", 1123 | "\n", 1124 | "the tokenizer used for GPT models also doesn't use an <|unk|> token for outof-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks\n", 1125 | "down words into subword units\n", 1126 | "
" 1127 | ] 1128 | }, 1129 | { 1130 | "cell_type": "markdown", 1131 | "metadata": { 1132 | "id": "mE8F5MTk2aAZ" 1133 | }, 1134 | "source": [ 1135 | "### BYTE PAIR ENCODING\n" 1136 | ] 1137 | }, 1138 | { 1139 | "cell_type": "markdown", 1140 | "metadata": { 1141 | "id": "bm4wZ2ak2aAZ" 1142 | }, 1143 | "source": [ 1144 | "**BPE Tokenizer**" 1145 | ] 1146 | }, 1147 | { 1148 | "cell_type": "code", 1149 | "execution_count": null, 1150 | "metadata": { 1151 | "id": "VpWVQjsx2aAa", 1152 | "outputId": "dd625bdd-8caf-4824-f8e6-377bc5cfcda6" 1153 | }, 1154 | "outputs": [ 1155 | { 1156 | "name": "stdout", 1157 | "output_type": "stream", 1158 | "text": [ 1159 | "Requirement already satisfied: tiktoken in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (0.6.0)\n", 1160 | "Requirement already satisfied: regex>=2022.1.18 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from tiktoken) (2024.4.28)\n", 1161 | "Requirement already satisfied: requests>=2.26.0 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from tiktoken) (2.31.0)\n", 1162 | "Requirement already satisfied: charset-normalizer<4,>=2 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from requests>=2.26.0->tiktoken) (3.3.2)\n", 1163 | "Requirement already satisfied: idna<4,>=2.5 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from requests>=2.26.0->tiktoken) (3.6)\n", 1164 | "Requirement already satisfied: urllib3<3,>=1.21.1 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from requests>=2.26.0->tiktoken) (2.2.1)\n", 1165 | "Requirement already satisfied: certifi>=2017.4.17 in /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages (from requests>=2.26.0->tiktoken) (2024.2.2)\n", 1166 | "\n", 1167 | "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.1.2\u001b[0m\n", 1168 | "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip3 install --upgrade pip\u001b[0m\n" 1169 | ] 1170 | } 1171 | ], 1172 | "source": [ 1173 | "! 
pip3 install tiktoken" 1174 | ] 1175 | }, 1176 | { 1177 | "cell_type": "code", 1178 | "execution_count": null, 1179 | "metadata": { 1180 | "id": "Q1z-g3dA2aAa", 1181 | "outputId": "9e9bb4f2-2446-4dc5-c276-c8b5629d47f1" 1182 | }, 1183 | "outputs": [ 1184 | { 1185 | "name": "stdout", 1186 | "output_type": "stream", 1187 | "text": [ 1188 | "tiktoken version: 0.6.0\n" 1189 | ] 1190 | } 1191 | ], 1192 | "source": [ 1193 | "import importlib\n", 1194 | "import tiktoken\n", 1195 | "\n", 1196 | "print(\"tiktoken version:\", importlib.metadata.version(\"tiktoken\"))" 1197 | ] 1198 | }, 1199 | { 1200 | "cell_type": "code", 1201 | "execution_count": null, 1202 | "metadata": { 1203 | "id": "AITEt15_2aAa" 1204 | }, 1205 | "outputs": [], 1206 | "source": [ 1207 | "tokenizer = tiktoken.get_encoding(\"gpt2\")" 1208 | ] 1209 | }, 1210 | { 1211 | "cell_type": "code", 1212 | "execution_count": null, 1213 | "metadata": { 1214 | "id": "8mSCDaSW2aAb", 1215 | "outputId": "ece2a2f8-f4c5-4037-adcf-186a24390b02" 1216 | }, 1217 | "outputs": [ 1218 | { 1219 | "name": "stdout", 1220 | "output_type": "stream", 1221 | "text": [ 1222 | "[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]\n" 1223 | ] 1224 | } 1225 | ], 1226 | "source": [ 1227 | "text = (\n", 1228 | " \"Hello, do you like tea? <|endoftext|> In the sunlit terraces\"\n", 1229 | " \"of someunknownPlace.\"\n", 1230 | ")\n", 1231 | "\n", 1232 | "integers = tokenizer.encode(text, allowed_special={\"<|endoftext|>\"})\n", 1233 | "\n", 1234 | "print(integers)" 1235 | ] 1236 | }, 1237 | { 1238 | "cell_type": "code", 1239 | "execution_count": null, 1240 | "metadata": { 1241 | "id": "ZXAwo3eA2aAb", 1242 | "outputId": "fda11253-35f7-438b-b57c-991ceeedbac7" 1243 | }, 1244 | "outputs": [ 1245 | { 1246 | "name": "stdout", 1247 | "output_type": "stream", 1248 | "text": [ 1249 | "Hello, do you like tea? 
<|endoftext|> In the sunlit terracesof someunknownPlace.\n" 1250 | ] 1251 | } 1252 | ], 1253 | "source": [ 1254 | "strings = tokenizer.decode(integers)\n", 1255 | "\n", 1256 | "print(strings)" 1257 | ] 1258 | }, 1259 | { 1260 | "cell_type": "markdown", 1261 | "metadata": { 1262 | "id": "L5Jbv8Ig2aAb" 1263 | }, 1264 | "source": [ 1265 | "**Exercise 2.1**" 1266 | ] 1267 | }, 1268 | { 1269 | "cell_type": "code", 1270 | "execution_count": null, 1271 | "metadata": { 1272 | "id": "oCL1Jt1k2aAc", 1273 | "outputId": "8c64cb6f-78a5-49eb-fafa-4a95fb093899" 1274 | }, 1275 | "outputs": [ 1276 | { 1277 | "name": "stdout", 1278 | "output_type": "stream", 1279 | "text": [ 1280 | "[33901, 86, 343, 86, 220, 959]\n", 1281 | "Akwirw ier\n" 1282 | ] 1283 | } 1284 | ], 1285 | "source": [ 1286 | "integers = tokenizer.encode(\"Akwirw ier\")\n", 1287 | "print(integers)\n", 1288 | "\n", 1289 | "strings = tokenizer.decode(integers)\n", 1290 | "print(strings)" 1291 | ] 1292 | }, 1293 | { 1294 | "cell_type": "markdown", 1295 | "metadata": { 1296 | "id": "SdICjQEW2aAc" 1297 | }, 1298 | "source": [ 1299 | "**Data sampling with sliding window**" 1300 | ] 1301 | }, 1302 | { 1303 | "cell_type": "code", 1304 | "execution_count": null, 1305 | "metadata": { 1306 | "id": "Z4TsrbVx2aAc", 1307 | "outputId": "57f7764f-07f9-40e9-d3f5-522cad5c68b3" 1308 | }, 1309 | "outputs": [ 1310 | { 1311 | "name": "stdout", 1312 | "output_type": "stream", 1313 | "text": [ 1314 | "5145\n" 1315 | ] 1316 | } 1317 | ], 1318 | "source": [ 1319 | "with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n", 1320 | " raw_text = f.read()\n", 1321 | "\n", 1322 | "enc_text = tokenizer.encode(raw_text)\n", 1323 | "print(len(enc_text))" 1324 | ] 1325 | }, 1326 | { 1327 | "cell_type": "code", 1328 | "execution_count": null, 1329 | "metadata": { 1330 | "id": "EzV7NyTQ2aAd" 1331 | }, 1332 | "outputs": [], 1333 | "source": [ 1334 | "enc_sample = enc_text[50:]\n" 1335 | ] 1336 | }, 1337 | { 1338 | "cell_type": "code", 1339 | "execution_count": null, 1340 | "metadata": { 1341 | "id": "95WtzSHD2aAd", 1342 | "outputId": "6acef830-5033-4ef4-ddf0-fc3fcc4b7764" 1343 | }, 1344 | "outputs": [ 1345 | { 1346 | "name": "stdout", 1347 | "output_type": "stream", 1348 | "text": [ 1349 | "x: [290, 4920, 2241, 287]\n", 1350 | "y: [4920, 2241, 287, 257]\n" 1351 | ] 1352 | } 1353 | ], 1354 | "source": [ 1355 | "context_size = 4\n", 1356 | "\n", 1357 | "x = enc_sample[:context_size]\n", 1358 | "y = enc_sample[1:context_size+1]\n", 1359 | "\n", 1360 | "print(f\"x: {x}\")\n", 1361 | "print(f\"y: {y}\")" 1362 | ] 1363 | }, 1364 | { 1365 | "cell_type": "code", 1366 | "execution_count": null, 1367 | "metadata": { 1368 | "id": "drio_Bir2aAd", 1369 | "outputId": "8861714e-de41-41ee-fccc-54a67474f6bf" 1370 | }, 1371 | "outputs": [ 1372 | { 1373 | "name": "stdout", 1374 | "output_type": "stream", 1375 | "text": [ 1376 | "[290] ----> 4920\n", 1377 | "[290, 4920] ----> 2241\n", 1378 | "[290, 4920, 2241] ----> 287\n", 1379 | "[290, 4920, 2241, 287] ----> 257\n" 1380 | ] 1381 | } 1382 | ], 1383 | "source": [ 1384 | "for i in range(1, context_size+1):\n", 1385 | " context = enc_sample[:i]\n", 1386 | " desired = enc_sample[i]\n", 1387 | "\n", 1388 | " print(context, \"---->\", desired)" 1389 | ] 1390 | }, 1391 | { 1392 | "cell_type": "code", 1393 | "execution_count": null, 1394 | "metadata": { 1395 | "id": "_muM0-Yy2aAe", 1396 | "outputId": "da1e4e5d-ae8d-46a8-f724-721b7cc25b67" 1397 | }, 1398 | "outputs": [ 1399 | { 1400 | "name": "stdout", 1401 | "output_type": "stream", 1402 | 
"text": [ 1403 | " and ----> established\n", 1404 | " and established ----> himself\n", 1405 | " and established himself ----> in\n", 1406 | " and established himself in ----> a\n" 1407 | ] 1408 | } 1409 | ], 1410 | "source": [ 1411 | "for i in range(1, context_size+1):\n", 1412 | " context = enc_sample[:i]\n", 1413 | " desired = enc_sample[i]\n", 1414 | "\n", 1415 | " print(tokenizer.decode(context), \"---->\", tokenizer.decode([desired]))" 1416 | ] 1417 | }, 1418 | { 1419 | "cell_type": "markdown", 1420 | "metadata": { 1421 | "id": "9osX-dML2aAe" 1422 | }, 1423 | "source": [ 1424 | "**IMPLEMENTING A DATA LOADER**" 1425 | ] 1426 | }, 1427 | { 1428 | "cell_type": "code", 1429 | "execution_count": null, 1430 | "metadata": { 1431 | "id": "-wRoS1Xg2aAe" 1432 | }, 1433 | "outputs": [], 1434 | "source": [ 1435 | "from torch.utils.data import Dataset, DataLoader\n", 1436 | "\n", 1437 | "\n", 1438 | "class GPTDatasetV1(Dataset):\n", 1439 | " def __init__(self, txt, tokenizer, max_length, stride):\n", 1440 | " self.input_ids = []\n", 1441 | " self.target_ids = []\n", 1442 | "\n", 1443 | " # Tokenize the entire text\n", 1444 | " token_ids = tokenizer.encode(txt, allowed_special={\"<|endoftext|>\"})\n", 1445 | "\n", 1446 | " # Use a sliding window to chunk the book into overlapping sequences of max_length\n", 1447 | " for i in range(0, len(token_ids) - max_length, stride):\n", 1448 | " input_chunk = token_ids[i:i + max_length]\n", 1449 | " target_chunk = token_ids[i + 1: i + max_length + 1]\n", 1450 | " self.input_ids.append(torch.tensor(input_chunk))\n", 1451 | " self.target_ids.append(torch.tensor(target_chunk))\n", 1452 | "\n", 1453 | " def __len__(self):\n", 1454 | " return len(self.input_ids)\n", 1455 | "\n", 1456 | " def __getitem__(self, idx):\n", 1457 | " return self.input_ids[idx], self.target_ids[idx]" 1458 | ] 1459 | }, 1460 | { 1461 | "cell_type": "code", 1462 | "execution_count": null, 1463 | "metadata": { 1464 | "id": "zEvgVBWU2aAf" 1465 | }, 1466 | "outputs": [], 1467 | "source": [ 1468 | "def create_dataloader_v1(txt, batch_size=4, max_length=256,\n", 1469 | " stride=128, shuffle=True, drop_last=True,\n", 1470 | " num_workers=0):\n", 1471 | "\n", 1472 | " # Initialize the tokenizer\n", 1473 | " tokenizer = tiktoken.get_encoding(\"gpt2\")\n", 1474 | "\n", 1475 | " # Create dataset\n", 1476 | " dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)\n", 1477 | "\n", 1478 | " # Create dataloader\n", 1479 | " dataloader = DataLoader(\n", 1480 | " dataset,\n", 1481 | " batch_size=batch_size,\n", 1482 | " shuffle=shuffle,\n", 1483 | " drop_last=drop_last,\n", 1484 | " num_workers=num_workers\n", 1485 | " )\n", 1486 | "\n", 1487 | " return dataloader" 1488 | ] 1489 | }, 1490 | { 1491 | "cell_type": "code", 1492 | "execution_count": null, 1493 | "metadata": { 1494 | "id": "Zd43FzyA2aAg" 1495 | }, 1496 | "outputs": [], 1497 | "source": [ 1498 | "with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n", 1499 | " raw_text = f.read()" 1500 | ] 1501 | }, 1502 | { 1503 | "cell_type": "code", 1504 | "execution_count": null, 1505 | "metadata": { 1506 | "id": "pmcR6pQ-2aAg", 1507 | "outputId": "b4aed8d9-bf5d-4c5e-c0f8-0160e45aa0ef" 1508 | }, 1509 | "outputs": [ 1510 | { 1511 | "name": "stdout", 1512 | "output_type": "stream", 1513 | "text": [ 1514 | "PyTorch version: 2.3.0\n", 1515 | "[tensor([[ 40, 367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]\n" 1516 | ] 1517 | } 1518 | ], 1519 | "source": [ 1520 | "import torch\n", 1521 | "print(\"PyTorch version:\", torch.__version__)\n", 
1522 | "dataloader = create_dataloader_v1(\n", 1523 | " raw_text, batch_size=1, max_length=4, stride=1, shuffle=False\n", 1524 | ")\n", 1525 | "\n", 1526 | "data_iter = iter(dataloader)\n", 1527 | "first_batch = next(data_iter)\n", 1528 | "print(first_batch)" 1529 | ] 1530 | }, 1531 | { 1532 | "cell_type": "code", 1533 | "execution_count": null, 1534 | "metadata": { 1535 | "id": "DlHXQQ2v2aAh", 1536 | "outputId": "972c25cc-3e2d-4bcd-9464-241dd84670a6" 1537 | }, 1538 | "outputs": [ 1539 | { 1540 | "name": "stdout", 1541 | "output_type": "stream", 1542 | "text": [ 1543 | "[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]\n" 1544 | ] 1545 | } 1546 | ], 1547 | "source": [ 1548 | "second_batch = next(data_iter)\n", 1549 | "print(second_batch)" 1550 | ] 1551 | }, 1552 | { 1553 | "cell_type": "code", 1554 | "execution_count": null, 1555 | "metadata": { 1556 | "id": "jue_LgoQ2aAh", 1557 | "outputId": "3b83eb5a-667e-47f8-877a-e3b201b139da" 1558 | }, 1559 | "outputs": [ 1560 | { 1561 | "name": "stdout", 1562 | "output_type": "stream", 1563 | "text": [ 1564 | "Inputs:\n", 1565 | " tensor([[ 40, 367, 2885, 1464],\n", 1566 | " [ 1807, 3619, 402, 271],\n", 1567 | " [10899, 2138, 257, 7026],\n", 1568 | " [15632, 438, 2016, 257],\n", 1569 | " [ 922, 5891, 1576, 438],\n", 1570 | " [ 568, 340, 373, 645],\n", 1571 | " [ 1049, 5975, 284, 502],\n", 1572 | " [ 284, 3285, 326, 11]])\n", 1573 | "\n", 1574 | "Targets:\n", 1575 | " tensor([[ 367, 2885, 1464, 1807],\n", 1576 | " [ 3619, 402, 271, 10899],\n", 1577 | " [ 2138, 257, 7026, 15632],\n", 1578 | " [ 438, 2016, 257, 922],\n", 1579 | " [ 5891, 1576, 438, 568],\n", 1580 | " [ 340, 373, 645, 1049],\n", 1581 | " [ 5975, 284, 502, 284],\n", 1582 | " [ 3285, 326, 11, 287]])\n" 1583 | ] 1584 | } 1585 | ], 1586 | "source": [ 1587 | "dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)\n", 1588 | "\n", 1589 | "data_iter = iter(dataloader)\n", 1590 | "inputs, targets = next(data_iter)\n", 1591 | "print(\"Inputs:\\n\", inputs)\n", 1592 | "print(\"\\nTargets:\\n\", targets)" 1593 | ] 1594 | }, 1595 | { 1596 | "cell_type": "markdown", 1597 | "metadata": { 1598 | "id": "DAUQe8IM2aAi" 1599 | }, 1600 | "source": [ 1601 | "**CREATE TOKEN EMBEDDINGS**" 1602 | ] 1603 | }, 1604 | { 1605 | "cell_type": "code", 1606 | "execution_count": null, 1607 | "metadata": { 1608 | "id": "TaUsZMpI2aAi" 1609 | }, 1610 | "outputs": [], 1611 | "source": [ 1612 | "input_ids = torch.tensor([2, 3, 5, 1])\n" 1613 | ] 1614 | }, 1615 | { 1616 | "cell_type": "code", 1617 | "execution_count": null, 1618 | "metadata": { 1619 | "id": "mzJ3X7cc2aAj" 1620 | }, 1621 | "outputs": [], 1622 | "source": [ 1623 | "vocab_size = 6\n", 1624 | "output_dim = 3\n", 1625 | "\n", 1626 | "torch.manual_seed(123)\n", 1627 | "embedding_layer = torch.nn.Embedding(vocab_size, output_dim)" 1628 | ] 1629 | }, 1630 | { 1631 | "cell_type": "code", 1632 | "execution_count": null, 1633 | "metadata": { 1634 | "id": "MGYinkYW2aAj", 1635 | "outputId": "836d6ddc-54c2-4ae2-bc46-8c5aa548fab2" 1636 | }, 1637 | "outputs": [ 1638 | { 1639 | "name": "stdout", 1640 | "output_type": "stream", 1641 | "text": [ 1642 | "Parameter containing:\n", 1643 | "tensor([[ 0.3374, -0.1778, -0.1690],\n", 1644 | " [ 0.9178, 1.5810, 1.3010],\n", 1645 | " [ 1.2753, -0.2010, -0.1606],\n", 1646 | " [-0.4015, 0.9666, -1.1481],\n", 1647 | " [-1.1589, 0.3255, -0.6315],\n", 1648 | " [-2.8400, -0.7849, -1.4096]], requires_grad=True)\n" 1649 | ] 1650 | } 1651 | ], 1652 | "source": [ 1653 | 
"print(embedding_layer.weight)\n" 1654 | ] 1655 | }, 1656 | { 1657 | "cell_type": "code", 1658 | "execution_count": null, 1659 | "metadata": { 1660 | "id": "bI7zVC6K2aAj", 1661 | "outputId": "e42021a1-aff2-4e36-dec4-a9876bdee139" 1662 | }, 1663 | "outputs": [ 1664 | { 1665 | "name": "stdout", 1666 | "output_type": "stream", 1667 | "text": [ 1668 | "tensor([[-0.4015, 0.9666, -1.1481]], grad_fn=)\n" 1669 | ] 1670 | } 1671 | ], 1672 | "source": [ 1673 | "print(embedding_layer(torch.tensor([3])))\n" 1674 | ] 1675 | }, 1676 | { 1677 | "cell_type": "code", 1678 | "execution_count": null, 1679 | "metadata": { 1680 | "id": "jShgG3HC2aAk", 1681 | "outputId": "91302fca-fa03-47e4-eed2-e2d7d4d845c7" 1682 | }, 1683 | "outputs": [ 1684 | { 1685 | "name": "stdout", 1686 | "output_type": "stream", 1687 | "text": [ 1688 | "tensor([[ 1.2753, -0.2010, -0.1606],\n", 1689 | " [-0.4015, 0.9666, -1.1481],\n", 1690 | " [-2.8400, -0.7849, -1.4096],\n", 1691 | " [ 0.9178, 1.5810, 1.3010]], grad_fn=)\n" 1692 | ] 1693 | } 1694 | ], 1695 | "source": [ 1696 | "print(embedding_layer(input_ids))\n" 1697 | ] 1698 | }, 1699 | { 1700 | "cell_type": "markdown", 1701 | "metadata": { 1702 | "id": "OsPiUIKP2aAk" 1703 | }, 1704 | "source": [ 1705 | "**POSITIONAL EMBEDDINGS (ENCODING WORD POSITIONS)**" 1706 | ] 1707 | }, 1708 | { 1709 | "cell_type": "code", 1710 | "execution_count": null, 1711 | "metadata": { 1712 | "id": "ZOKFVX3_2aAk" 1713 | }, 1714 | "outputs": [], 1715 | "source": [ 1716 | "vocab_size = 50257\n", 1717 | "output_dim = 256\n", 1718 | "\n", 1719 | "token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)" 1720 | ] 1721 | }, 1722 | { 1723 | "cell_type": "code", 1724 | "execution_count": null, 1725 | "metadata": { 1726 | "id": "WvtrtbML2aAl" 1727 | }, 1728 | "outputs": [], 1729 | "source": [ 1730 | "max_length = 4\n", 1731 | "dataloader = create_dataloader_v1(\n", 1732 | " raw_text, batch_size=8, max_length=max_length,\n", 1733 | " stride=max_length, shuffle=False\n", 1734 | ")\n", 1735 | "data_iter = iter(dataloader)\n", 1736 | "inputs, targets = next(data_iter)" 1737 | ] 1738 | }, 1739 | { 1740 | "cell_type": "code", 1741 | "execution_count": null, 1742 | "metadata": { 1743 | "id": "mkUAIOiX2aAl", 1744 | "outputId": "6f01c46c-1bda-4a6d-87c9-e7963dc1184d" 1745 | }, 1746 | "outputs": [ 1747 | { 1748 | "name": "stdout", 1749 | "output_type": "stream", 1750 | "text": [ 1751 | "Token IDs:\n", 1752 | " tensor([[ 40, 367, 2885, 1464],\n", 1753 | " [ 1807, 3619, 402, 271],\n", 1754 | " [10899, 2138, 257, 7026],\n", 1755 | " [15632, 438, 2016, 257],\n", 1756 | " [ 922, 5891, 1576, 438],\n", 1757 | " [ 568, 340, 373, 645],\n", 1758 | " [ 1049, 5975, 284, 502],\n", 1759 | " [ 284, 3285, 326, 11]])\n", 1760 | "\n", 1761 | "Inputs shape:\n", 1762 | " torch.Size([8, 4])\n" 1763 | ] 1764 | } 1765 | ], 1766 | "source": [ 1767 | "print(\"Token IDs:\\n\", inputs)\n", 1768 | "print(\"\\nInputs shape:\\n\", inputs.shape)" 1769 | ] 1770 | }, 1771 | { 1772 | "cell_type": "code", 1773 | "execution_count": null, 1774 | "metadata": { 1775 | "id": "uPPszpan2aAl", 1776 | "outputId": "b377f487-9ceb-4a34-8d29-8666787b65ef" 1777 | }, 1778 | "outputs": [ 1779 | { 1780 | "name": "stdout", 1781 | "output_type": "stream", 1782 | "text": [ 1783 | "torch.Size([8, 4, 256])\n" 1784 | ] 1785 | } 1786 | ], 1787 | "source": [ 1788 | "token_embeddings = token_embedding_layer(inputs)\n", 1789 | "print(token_embeddings.shape)" 1790 | ] 1791 | }, 1792 | { 1793 | "cell_type": "code", 1794 | "execution_count": null, 1795 | "metadata": { 1796 | 
"id": "kaMnySut2aAm" 1797 | }, 1798 | "outputs": [], 1799 | "source": [ 1800 | "context_length = max_length\n", 1801 | "pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)" 1802 | ] 1803 | }, 1804 | { 1805 | "cell_type": "code", 1806 | "execution_count": null, 1807 | "metadata": { 1808 | "id": "FhRhOAIC2aAm", 1809 | "outputId": "51f77640-7b5e-4e7f-dd40-922ffe450fde" 1810 | }, 1811 | "outputs": [ 1812 | { 1813 | "name": "stdout", 1814 | "output_type": "stream", 1815 | "text": [ 1816 | "torch.Size([4, 256])\n" 1817 | ] 1818 | } 1819 | ], 1820 | "source": [ 1821 | "pos_embeddings = pos_embedding_layer(torch.arange(max_length))\n", 1822 | "print(pos_embeddings.shape)" 1823 | ] 1824 | }, 1825 | { 1826 | "cell_type": "code", 1827 | "execution_count": null, 1828 | "metadata": { 1829 | "id": "wgnjAD372aAm", 1830 | "outputId": "4e231855-3dfa-4354-925a-640fdbb8edba" 1831 | }, 1832 | "outputs": [ 1833 | { 1834 | "name": "stdout", 1835 | "output_type": "stream", 1836 | "text": [ 1837 | "torch.Size([8, 4, 256])\n" 1838 | ] 1839 | } 1840 | ], 1841 | "source": [ 1842 | "input_embeddings = token_embeddings + pos_embeddings\n", 1843 | "print(input_embeddings.shape)" 1844 | ] 1845 | }, 1846 | { 1847 | "cell_type": "code", 1848 | "execution_count": null, 1849 | "metadata": { 1850 | "id": "pu-URnBc2aAm" 1851 | }, 1852 | "outputs": [], 1853 | "source": [] 1854 | } 1855 | ], 1856 | "metadata": { 1857 | "kernelspec": { 1858 | "display_name": "Python 3 (ipykernel)", 1859 | "language": "python", 1860 | "name": "python3" 1861 | }, 1862 | "language_info": { 1863 | "codemirror_mode": { 1864 | "name": "ipython", 1865 | "version": 3 1866 | }, 1867 | "file_extension": ".py", 1868 | "mimetype": "text/x-python", 1869 | "name": "python", 1870 | "nbconvert_exporter": "python", 1871 | "pygments_lexer": "ipython3", 1872 | "version": "3.12.2" 1873 | }, 1874 | "colab": { 1875 | "provenance": [] 1876 | } 1877 | }, 1878 | "nbformat": 4, 1879 | "nbformat_minor": 0 1880 | } -------------------------------------------------------------------------------- /the-verdict.txt: -------------------------------------------------------------------------------- 1 | I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.) 2 | 3 | "The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing's lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn's "Moon-dancers" to say, with tears in her eyes: "We shall not look upon its like again"? 4 | 5 | Well!--even through the prism of Hermia's tears I felt able to face the fact with equanimity. Poor Jack Gisburn! The women had made him--it was fitting that they should mourn him. Among his own sex fewer regrets were heard, and in his own trade hardly a murmur. Professional jealousy? Perhaps. 
If it were, the honour of the craft was vindicated by little Claude Nutley, who, in all good faith, brought out in the Burlington a very handsome "obituary" on Jack--one of those showy articles stocked with random technicalities that I have heard (I won't say by whom) compared to Gisburn's painting. And so--his resolve being apparently irrevocable--the discussion gradually died out, and, as Mrs. Thwing had predicted, the price of "Gisburns" went up. 6 | 7 | It was not till three years later that, in the course of a few weeks' idling on the Riviera, it suddenly occurred to me to wonder why Gisburn had given up his painting. On reflection, it really was a tempting problem. To accuse his wife would have been too easy--his fair sitters had been denied the solace of saying that Mrs. Gisburn had "dragged him down." For Mrs. Gisburn--as such--had not existed till nearly a year after Jack's resolve had been taken. It might be that he had married her--since he liked his ease--because he didn't want to go on painting; but it would have been hard to prove that he had given up his painting because he had married her. 8 | 9 | Of course, if she had not dragged him down, she had equally, as Miss Croft contended, failed to "lift him up"--she had not led him back to the easel. To put the brush into his hand again--what a vocation for a wife! But Mrs. Gisburn appeared to have disdained it--and I felt it might be interesting to find out why. 10 | 11 | The desultory life of the Riviera lends itself to such purely academic speculations; and having, on my way to Monte Carlo, caught a glimpse of Jack's balustraded terraces between the pines, I had myself borne thither the next day. 12 | 13 | I found the couple at tea beneath their palm-trees; and Mrs. Gisburn's welcome was so genial that, in the ensuing weeks, I claimed it frequently. It was not that my hostess was "interesting": on that point I could have given Miss Croft the fullest reassurance. It was just because she was _not_ interesting--if I may be pardoned the bull--that I found her so. For Jack, all his life, had been surrounded by interesting women: they had fostered his art, it had been reared in the hot-house of their adulation. And it was therefore instructive to note what effect the "deadening atmosphere of mediocrity" (I quote Miss Croft) was having on him. 14 | 15 | I have mentioned that Mrs. Gisburn was rich; and it was immediately perceptible that her husband was extracting from this circumstance a delicate but substantial satisfaction. It is, as a rule, the people who scorn money who get most out of it; and Jack's elegant disdain of his wife's big balance enabled him, with an appearance of perfect good-breeding, to transmute it into objects of art and luxury. To the latter, I must add, he remained relatively indifferent; but he was buying Renaissance bronzes and eighteenth-century pictures with a discrimination that bespoke the amplest resources. 16 | 17 | "Money's only excuse is to put beauty into circulation," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run over from Monte Carlo; and Mrs. Gisburn, beaming on him, added for my enlightenment: "Jack is so morbidly sensitive to every form of beauty." 18 | 19 | Poor Jack! It had always been his fate to have women say such things of him: the fact should be set down in extenuation. What struck me now was that, for the first time, he resented the tone. 
I had seen him, so often, basking under similar tributes--was it the conjugal note that robbed them of their savour? No--for, oddly enough, it became apparent that he was fond of Mrs. Gisburn--fond enough not to see her absurdity. It was his own absurdity he seemed to be wincing under--his own attitude as an object for garlands and incense. 20 | 21 | "My dear, since I've chucked painting people don't say that stuff about me--they say it about Victor Grindle," was his only protest, as he rose from the table and strolled out onto the sunlit terrace. 22 | 23 | I glanced after him, struck by his last word. Victor Grindle was, in fact, becoming the man of the moment--as Jack himself, one might put it, had been the man of the hour. The younger artist was said to have formed himself at my friend's feet, and I wondered if a tinge of jealousy underlay the latter's mysterious abdication. But no--for it was not till after that event that the _rose Dubarry_ drawing-rooms had begun to display their "Grindles." 24 | 25 | I turned to Mrs. Gisburn, who had lingered to give a lump of sugar to her spaniel in the dining-room. 26 | 27 | "Why _has_ he chucked painting?" I asked abruptly. 28 | 29 | She raised her eyebrows with a hint of good-humoured surprise. 30 | 31 | "Oh, he doesn't _have_ to now, you know; and I want him to enjoy himself," she said quite simply. 32 | 33 | I looked about the spacious white-panelled room, with its _famille-verte_ vases repeating the tones of the pale damask curtains, and its eighteenth-century pastels in delicate faded frames. 34 | 35 | "Has he chucked his pictures too? I haven't seen a single one in the house." 36 | 37 | A slight shade of constraint crossed Mrs. Gisburn's open countenance. "It's his ridiculous modesty, you know. He says they're not fit to have about; he's sent them all away except one--my portrait--and that I have to keep upstairs." 38 | 39 | His ridiculous modesty--Jack's modesty about his pictures? My curiosity was growing like the bean-stalk. I said persuasively to my hostess: "I must really see your portrait, you know." 40 | 41 | She glanced out almost timorously at the terrace where her husband, lounging in a hooded chair, had lit a cigar and drawn the Russian deerhound's head between his knees. 42 | 43 | "Well, come while he's not looking," she said, with a laugh that tried to hide her nervousness; and I followed her between the marble Emperors of the hall, and up the wide stairs with terra-cotta nymphs poised among flowers at each landing. 44 | 45 | In the dimmest corner of her boudoir, amid a profusion of delicate and distinguished objects, hung one of the familiar oval canvases, in the inevitable garlanded frame. The mere outline of the frame called up all Gisburn's past! 46 | 47 | Mrs. Gisburn drew back the window-curtains, moved aside a _jardiniere_ full of pink azaleas, pushed an arm-chair away, and said: "If you stand here you can just manage to see it. I had it over the mantel-piece, but he wouldn't let it stay." 48 | 49 | Yes--I could just manage to see it--the first portrait of Jack's I had ever had to strain my eyes over! Usually they had the place of honour--say the central panel in a pale yellow or _rose Dubarry_ drawing-room, or a monumental easel placed so that it took the light through curtains of old Venetian point. 
The more modest place became the picture better; yet, as my eyes grew accustomed to the half-light, all the characteristic qualities came out--all the hesitations disguised as audacities, the tricks of prestidigitation by which, with such consummate skill, he managed to divert attention from the real business of the picture to some pretty irrelevance of detail. Mrs. Gisburn, presenting a neutral surface to work on--forming, as it were, so inevitably the background of her own picture--had lent herself in an unusual degree to the display of this false virtuosity. The picture was one of Jack's "strongest," as his admirers would have put it--it represented, on his part, a swelling of muscles, a congesting of veins, a balancing, straddling and straining, that reminded one of the circus-clown's ironic efforts to lift a feather. It met, in short, at every point the demand of lovely woman to be painted "strongly" because she was tired of being painted "sweetly"--and yet not to lose an atom of the sweetness. 50 | 51 | "It's the last he painted, you know," Mrs. Gisburn said with pardonable pride. "The last but one," she corrected herself--"but the other doesn't count, because he destroyed it." 52 | 53 | "Destroyed it?" I was about to follow up this clue when I heard a footstep and saw Jack himself on the threshold. 54 | 55 | As he stood there, his hands in the pockets of his velveteen coat, the thin brown waves of hair pushed back from his white forehead, his lean sunburnt cheeks furrowed by a smile that lifted the tips of a self-confident moustache, I felt to what a degree he had the same quality as his pictures--the quality of looking cleverer than he was. 56 | 57 | His wife glanced at him deprecatingly, but his eyes travelled past her to the portrait. 58 | 59 | "Mr. Rickham wanted to see it," she began, as if excusing herself. He shrugged his shoulders, still smiling. 60 | 61 | "Oh, Rickham found me out long ago," he said lightly; then, passing his arm through mine: "Come and see the rest of the house." 62 | 63 | He showed it to me with a kind of naive suburban pride: the bath-rooms, the speaking-tubes, the dress-closets, the trouser-presses--all the complex simplifications of the millionaire's domestic economy. And whenever my wonder paid the expected tribute he said, throwing out his chest a little: "Yes, I really don't see how people manage to live without that." 64 | 65 | Well--it was just the end one might have foreseen for him. Only he was, through it all and in spite of it all--as he had been through, and in spite of, his pictures--so handsome, so charming, so disarming, that one longed to cry out: "Be dissatisfied with your leisure!" as once one had longed to say: "Be dissatisfied with your work!" 66 | 67 | But, with the cry on my lips, my diagnosis suffered an unexpected check. 68 | 69 | "This is my own lair," he said, leading me into a dark plain room at the end of the florid vista. It was square and brown and leathery: no "effects"; no bric-a-brac, none of the air of posing for reproduction in a picture weekly--above all, no least sign of ever having been used as a studio. 70 | 71 | The fact brought home to me the absolute finality of Jack's break with his old life. 72 | 73 | "Don't you ever dabble with paint any more?" I asked, still looking about for a trace of such activity. 74 | 75 | "Never," he said briefly. 76 | 77 | "Or water-colour--or etching?" 78 | 79 | His confident eyes grew dim, and his cheeks paled a little under their handsome sunburn. 
80 | 81 | "Never think of it, my dear fellow--any more than if I'd never touched a brush." 82 | 83 | And his tone told me in a flash that he never thought of anything else. 84 | 85 | I moved away, instinctively embarrassed by my unexpected discovery; and as I turned, my eye fell on a small picture above the mantel-piece--the only object breaking the plain oak panelling of the room. 86 | 87 | "Oh, by Jove!" I said. 88 | 89 | It was a sketch of a donkey--an old tired donkey, standing in the rain under a wall. 90 | 91 | "By Jove--a Stroud!" I cried. 92 | 93 | He was silent; but I felt him close behind me, breathing a little quickly. 94 | 95 | "What a wonder! Made with a dozen lines--but on everlasting foundations. You lucky chap, where did you get it?" 96 | 97 | He answered slowly: "Mrs. Stroud gave it to me." 98 | 99 | "Ah--I didn't know you even knew the Strouds. He was such an inflexible hermit." 100 | 101 | "I didn't--till after. . . . She sent for me to paint him when he was dead." 102 | 103 | "When he was dead? You?" 104 | 105 | I must have let a little too much amazement escape through my surprise, for he answered with a deprecating laugh: "Yes--she's an awful simpleton, you know, Mrs. Stroud. Her only idea was to have him done by a fashionable painter--ah, poor Stroud! She thought it the surest way of proclaiming his greatness--of forcing it on a purblind public. And at the moment I was _the_ fashionable painter." 106 | 107 | "Ah, poor Stroud--as you say. Was _that_ his history?" 108 | 109 | "That was his history. She believed in him, gloried in him--or thought she did. But she couldn't bear not to have all the drawing-rooms with her. She couldn't bear the fact that, on varnishing days, one could always get near enough to see his pictures. Poor woman! She's just a fragment groping for other fragments. Stroud is the only whole I ever knew." 110 | 111 | "You ever knew? But you just said--" 112 | 113 | Gisburn had a curious smile in his eyes. 114 | 115 | "Oh, I knew him, and he knew me--only it happened after he was dead." 116 | 117 | I dropped my voice instinctively. "When she sent for you?" 118 | 119 | "Yes--quite insensible to the irony. She wanted him vindicated--and by me!" 120 | 121 | He laughed again, and threw back his head to look up at the sketch of the donkey. "There were days when I couldn't look at that thing--couldn't face it. But I forced myself to put it here; and now it's cured me--cured me. That's the reason why I don't dabble any more, my dear Rickham; or rather Stroud himself is the reason." 122 | 123 | For the first time my idle curiosity about my companion turned into a serious desire to understand him better. 124 | 125 | "I wish you'd tell me how it happened," I said. 126 | 127 | He stood looking up at the sketch, and twirling between his fingers a cigarette he had forgotten to light. Suddenly he turned toward me. 128 | 129 | "I'd rather like to tell you--because I've always suspected you of loathing my work." 130 | 131 | I made a deprecating gesture, which he negatived with a good-humoured shrug. 132 | 133 | "Oh, I didn't care a straw when I believed in myself--and now it's an added tie between us!" 134 | 135 | He laughed slightly, without bitterness, and pushed one of the deep arm-chairs forward. "There: make yourself comfortable--and here are the cigars you like." 136 | 137 | He placed them at my elbow and continued to wander up and down the room, stopping now and then beneath the picture. 138 | 139 | "How it happened? 
I can tell you in five minutes--and it didn't take much longer to happen. . . . I can remember now how surprised and pleased I was when I got Mrs. Stroud's note. Of course, deep down, I had always _felt_ there was no one like him--only I had gone with the stream, echoed the usual platitudes about him, till I half got to think he was a failure, one of the kind that are left behind. By Jove, and he _was_ left behind--because he had come to stay! The rest of us had to let ourselves be swept along or go under, but he was high above the current--on everlasting foundations, as you say. 140 | 141 | "Well, I went off to the house in my most egregious mood--rather moved, Lord forgive me, at the pathos of poor Stroud's career of failure being crowned by the glory of my painting him! Of course I meant to do the picture for nothing--I told Mrs. Stroud so when she began to stammer something about her poverty. I remember getting off a prodigious phrase about the honour being _mine_--oh, I was princely, my dear Rickham! I was posing to myself like one of my own sitters. 142 | 143 | "Then I was taken up and left alone with him. I had sent all my traps in advance, and I had only to set up the easel and get to work. He had been dead only twenty-four hours, and he died suddenly, of heart disease, so that there had been no preliminary work of destruction--his face was clear and untouched. I had met him once or twice, years before, and thought him insignificant and dingy. Now I saw that he was superb. 144 | 145 | "I was glad at first, with a merely aesthetic satisfaction: glad to have my hand on such a 'subject.' Then his strange life-likeness began to affect me queerly--as I blocked the head in I felt as if he were watching me do it. The sensation was followed by the thought: if he _were_ watching me, what would he say to my way of working? My strokes began to go a little wild--I felt nervous and uncertain. 146 | 147 | "Once, when I looked up, I seemed to see a smile behind his close grayish beard--as if he had the secret, and were amusing himself by holding it back from me. That exasperated me still more. The secret? Why, I had a secret worth twenty of his! I dashed at the canvas furiously, and tried some of my bravura tricks. But they failed me, they crumbled. I saw that he wasn't watching the showy bits--I couldn't distract his attention; he just kept his eyes on the hard passages between. Those were the ones I had always shirked, or covered up with some lying paint. And how he saw through my lies! 148 | 149 | "I looked up again, and caught sight of that sketch of the donkey hanging on the wall near his bed. His wife told me afterward it was the last thing he had done--just a note taken with a shaking hand, when he was down in Devonshire recovering from a previous heart attack. Just a note! But it tells his whole history. There are years of patient scornful persistence in every line. A man who had swum with the current could never have learned that mighty up-stream stroke. . . . 150 | 151 | "I turned back to my work, and went on groping and muddling; then I looked at the donkey again. I saw that, when Stroud laid in the first stroke, he knew just what the end would be. He had possessed his subject, absorbed it, recreated it. When had I done that with any of my things? They hadn't been born of me--I had just adopted them. . . . 152 | 153 | "Hang it, Rickham, with that face watching me I couldn't do another stroke. The plain truth was, I didn't know where to put it--_I had never known_. 
Only, with my sitters and my public, a showy splash of colour covered up the fact--I just threw paint into their faces. . . . Well, paint was the one medium those dead eyes could see through--see straight to the tottering foundations underneath. Don't you know how, in talking a foreign language, even fluently, one says half the time not what one wants to but what one can? Well--that was the way I painted; and as he lay there and watched me, the thing they called my 'technique' collapsed like a house of cards. He didn't sneer, you understand, poor Stroud--he just lay there quietly watching, and on his lips, through the gray beard, I seemed to hear the question: 'Are you sure you know where you're coming out?' 154 | 155 | "If I could have painted that face, with that question on it, I should have done a great thing. The next greatest thing was to see that I couldn't--and that grace was given me. But, oh, at that minute, Rickham, was there anything on earth I wouldn't have given to have Stroud alive before me, and to hear him say: 'It's not too late--I'll show you how'? 156 | 157 | "It _was_ too late--it would have been, even if he'd been alive. I packed up my traps, and went down and told Mrs. Stroud. Of course I didn't tell her _that_--it would have been Greek to her. I simply said I couldn't paint him, that I was too moved. She rather liked the idea--she's so romantic! It was that that made her give me the donkey. But she was terribly upset at not getting the portrait--she did so want him 'done' by some one showy! At first I was afraid she wouldn't let me off--and at my wits' end I suggested Grindle. Yes, it was I who started Grindle: I told Mrs. Stroud he was the 'coming' man, and she told somebody else, and so it got to be true. . . . And he painted Stroud without wincing; and she hung the picture among her husband's things. . . ." 158 | 159 | He flung himself down in the arm-chair near mine, laid back his head, and clasping his arms beneath it, looked up at the picture above the chimney-piece. 160 | 161 | "I like to fancy that Stroud himself would have given it to me, if he'd been able to say what he thought that day." 162 | 163 | And, in answer to a question I put half-mechanically--"Begin again?" he flashed out. "When the one thing that brings me anywhere near him is that I knew enough to leave off?" 164 | 165 | He stood up and laid his hand on my shoulder with a laugh. "Only the irony of it is that I _am_ still painting--since Grindle's doing it for me! The Strouds stand alone, and happen once--but there's no exterminating our kind of art." --------------------------------------------------------------------------------