├── LLM_1_Tokenizer.ipynb ├── LLM_2_Byte_Pair_Encoding.ipynb ├── LLM_3_Data_Loader.ipynb ├── LLM_4_Embeddings.ipynb ├── LLM_5_SelfAttention.ipynb ├── LLM_6_Attention_Trainable_Wt.ipynb ├── LLM_6_Attention_Trainable_Wt.pdf ├── LLM_Part_1_Tokenizer.pdf ├── LLM_Part_2_Byte_Pair_Encoding .pdf ├── LLM_Part_3_Data_Loader.pdf ├── LLM_Part_4_Word_Embeddings.pdf ├── LLM_Part_5__Self_Attention.pdf └── README.md /LLM_1_Tokenizer.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "8054aaf0", 6 | "metadata": {}, 7 | "source": [ 8 | "## LLM Part 1 : Tokeniser \n", 9 | "\n", 10 | "Reference text : \n", 11 | "\n", 12 | "https://www.manning.com/books/build-a-large-language-model-from-scratch\n" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "112af175", 18 | "metadata": {}, 19 | "source": [ 20 | "### Problem Statement :\n", 21 | "\n", 22 | "The text we will tokenize for LLM training is a short story by Edith Wharton called The Verdict, which has been released into the public domain and is thus permitted to be used for LLM training tasks. The text is available on Wikisource at https://en.wikisource.org/wiki/The_Verdict,\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "id": "9c5e9846", 28 | "metadata": {}, 29 | "source": [ 30 | "### Step 1: Read input file " 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 26, 36 | "id": "12a540d7", 37 | "metadata": {}, 38 | "outputs": [ 39 | { 40 | "name": "stdout", 41 | "output_type": "stream", 42 | "text": [ 43 | "word count is -> 20479\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "import pandas as pd \n", 49 | "\n", 50 | "file1 = open(\"the-verdict.txt\", \"r+\", encoding=\"utf-8\")\n", 51 | "\n", 52 | "#print(\"Output of Read function is \")\n", 53 | "corpus = file1.read()\n", 54 | "#print(text)\n", 55 | "\n", 56 | "# check count of words \n", 57 | "print(\"word count is -> \", len(corpus))" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "id": "eed7eead", 63 | "metadata": {}, 64 | "source": [ 65 | "### Step 2: Split text to tokens " 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 2, 71 | "id": "dc7d5a04", 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "import re " 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "id": "546e2347", 81 | "metadata": {}, 82 | "source": [ 83 | "#### Strategy 1 : Split on white spaces \n", 84 | "\n", 85 | "**Note**\n", 86 | "\n", 87 | "- The simple tokenization scheme below mostly works for separating the example text into individual words.\n", 88 | "- However, some words are still connected to punctuation characters that we want to have as separate list entries. \n", 89 | "- We also refrain from making all text lowercase because capitalization helps LLMs distinguish between proper nouns and common nouns." 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 3, 95 | "id": "a266d89a", 96 | "metadata": {}, 97 | "outputs": [ 98 | { 99 | "name": "stdout", 100 | "output_type": "stream", 101 | "text": [ 102 | "['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ']\n" 103 | ] 104 | } 105 | ], 106 | "source": [ 107 | "\n", 108 | "text = \"Hello, world. 
This, is a test.\"\n", 109 | "result = re.split(r'(\\s)', text)\n", 110 | "print(result[0:10])" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "id": "589e7400", 116 | "metadata": {}, 117 | "source": [ 118 | "#### Strategy 2 : Split on white space or comma or period.\n", 119 | "\n", 120 | "**Pattern Explanation**\n", 121 | "\n", 122 | "r'([,.]|\\s)': This is a raw string containing the regular expression pattern used to split the text.\n", 123 | "[,.]: Matches a comma , or a period ..\n", 124 | "|: This is the OR operator in regex, meaning that the pattern will match either the part before it or the part after it.\n", 125 | "\\s: Matches any whitespace character (spaces, tabs, newlines, etc.).\n", 126 | "() (parentheses): These are used to create a capturing group. Capturing groups save the matched text, so it appears in the result.\n" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 4, 132 | "id": "469164b5", 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "name": "stdout", 137 | "output_type": "stream", 138 | "text": [ 139 | "['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',']\n" 140 | ] 141 | } 142 | ], 143 | "source": [ 144 | "\n", 145 | "text = \"Hello, world. This, is a test.\"\n", 146 | "\n", 147 | "result = re.split(r'([,.]|\\s)', text)\n", 148 | "\n", 149 | "print(result[0:10])" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "id": "8f9491f7", 155 | "metadata": {}, 156 | "source": [ 157 | "#### Strategy 3 : Split on white space or comma or period.\n", 158 | "\n", 159 | "The tokenization scheme we devised above works well on the simple sample text. Let's modify it a bit further so that it can also handle other types of punctuation, such as question marks, quotation marks, and the double-dashes we have seen earlier in the first 100 characters of Edith Wharton's short story, along with additional special characters:\n", 160 | " \n", 161 | "\n", 162 | "**Pattern Explanation**\n", 163 | "\n", 164 | "[,.:;?_!\"()\\']: A character class that matches any single character inside the square brackets. This includes:\n", 165 | "\n", 166 | "\n", 167 | "- Comma\n", 168 | "- Period\n", 169 | "- Colon\n", 170 | "- Semicolon\n", 171 | "- Question mark \n", 172 | "- Underscore\n", 173 | "- Exclamation mark \n", 174 | "- Parentheses ()\n", 175 | "- Single quote \n", 176 | "- --: Matches the double hyphen --.\n", 177 | "- \\s: Matches any whitespace character (space, tab, newline, etc.).\n", 178 | "\n", 179 | "\n", 180 | "**Notes**\n", 181 | "The parentheses around the pattern create a capture group, meaning the matched delimiters are also included in the result.\n", 182 | "\n", 183 | "\n" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 5, 189 | "id": "c924277e", 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "['Hello', ',', '', ' ', 'world', '.', '', ' ', 'Is', ' ', 'this', '--', '', ' ', 'a', ' ', 'test', '?', '']\n" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "text = \"Hello, world. 
Is this-- a test?\"\n", 202 | "result = re.split(r'([,.:;?_!\"()\\']|--|\\s)', text)\n", 203 | "print(result)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "id": "a1cd9017", 209 | "metadata": {}, 210 | "source": [ 211 | "### Step 3: Data cleaning - remove white space characters\n", 212 | "\n", 213 | "**Note on white space character removal**\n", 214 | "\n", 215 | "- When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. \n", 216 | "- Removing whitespaces reduces the memory and computing requirements. \n", 217 | "- However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing). " 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 6, 223 | "id": "f432a913", 224 | "metadata": {}, 225 | "outputs": [ 226 | { 227 | "name": "stdout", 228 | "output_type": "stream", 229 | "text": [ 230 | "['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']\n" 231 | ] 232 | } 233 | ], 234 | "source": [ 235 | "result = [item for item in result if item.strip()]\n", 236 | "print(result[0:10])" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "id": "a50e2564", 242 | "metadata": {}, 243 | "source": [ 244 | "### Step 4 : Create. function \"text_to_tokens\"\n" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 7, 250 | "id": "31ef569e", 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "from typing import List\n", 255 | "import re\n", 256 | "\n", 257 | "def text_to_tokens(text: str) -> List[str]:\n", 258 | " \"\"\"\n", 259 | " Create an array of tokens from a given input text data. White spaces are removed.\n", 260 | " Split takes care of special characters , which are treated as tokens also. 
\n", 261 | "\n", 262 | " Parameters:\n", 263 | " tokens (text: str ): A text string which needs to be tokenized \n", 264 | "\n", 265 | " Returns:\n", 266 | " List[str]: an list of tokens\n", 267 | " \"\"\"\n", 268 | " \n", 269 | " # split text into tokens\n", 270 | " result = re.split(r'([,.:;?_!\"()\\']|--|\\s)', text)\n", 271 | " \n", 272 | " # remove white spaces \n", 273 | " result = [item for item in result if item.strip()]\n", 274 | " \n", 275 | " return result" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 8, 281 | "id": "f5f33aea", 282 | "metadata": {}, 283 | "outputs": [ 284 | { 285 | "name": "stdout", 286 | "output_type": "stream", 287 | "text": [ 288 | "31\n", 289 | "10\n", 290 | "['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']\n" 291 | ] 292 | } 293 | ], 294 | "source": [ 295 | "# check the function \n", 296 | "\n", 297 | "# function called \n", 298 | "tokenized = text_to_tokens(text)\n", 299 | "\n", 300 | "# check length of original text \n", 301 | "print(len(text))\n", 302 | "\n", 303 | "# check length after tokenization and removal of whitespace characters\n", 304 | "print(len(tokenized))\n", 305 | "\n", 306 | "# display 1st ten tokens \n", 307 | "print(tokenized[0:10])\n", 308 | "\n" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "id": "a230f39e", 314 | "metadata": {}, 315 | "source": [ 316 | "### Step 5 : Build a Vocabulary.\n", 317 | "\n", 318 | "**Key Steps**\n", 319 | "\n", 320 | "- Take the tokenized text as input\n", 321 | "- Sort Alphabetically\n", 322 | "- Remove Duplicates \n", 323 | "- Create a Dictionary mapping individual tokens to a unique numeric ID \n", 324 | "\n", 325 | "### For this we will define a function \"create_vocab\"\n" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 9, 331 | "id": "e973d793", 332 | "metadata": {}, 333 | "outputs": [], 334 | "source": [ 335 | "from typing import List, Dict\n", 336 | "\n", 337 | "def create_vocab(tokens: List[str], ) -> List [int]:\n", 338 | " \"\"\"\n", 339 | " Creates a Dictionary which maps a token to its token ID. 
The token inputs are sorted and duplicates a \n", 340 | " removed before a dictionary is mapped \n", 341 | "\n", 342 | " Parameters:\n", 343 | " tokens (tokens: List[str]): A list of tokens\n", 344 | "\n", 345 | " Returns:\n", 346 | " Dict[str, int]: a vocabulary dictionary which maps a token to a unique tokenid.\n", 347 | "\n", 348 | " \"\"\"\n", 349 | " \n", 350 | " # remove duplicates \n", 351 | " unq_tokens = list(set(tokens))\n", 352 | " \n", 353 | " # sort \n", 354 | " srt_tokens = sorted(unq_tokens)\n", 355 | " \n", 356 | " # create vocabulary\n", 357 | " vocabulary = {token:tokenid for tokenid,token in enumerate(srt_tokens)}\n", 358 | " \n", 359 | " return vocabulary" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": 10, 365 | "id": "8640edd2", 366 | "metadata": {}, 367 | "outputs": [ 368 | { 369 | "name": "stdout", 370 | "output_type": "stream", 371 | "text": [ 372 | "(',', 0)\n", 373 | "('--', 1)\n", 374 | "('.', 2)\n", 375 | "('?', 3)\n", 376 | "('Hello', 4)\n", 377 | "('Is', 5)\n", 378 | "('a', 6)\n", 379 | "('test', 7)\n", 380 | "('this', 8)\n", 381 | "('world', 9)\n" 382 | ] 383 | } 384 | ], 385 | "source": [ 386 | "# call function\n", 387 | "vocab = create_vocab(tokenized)\n", 388 | "\n", 389 | "# Check - print 1st ten items of the vocabulary\n", 390 | "for i, item in enumerate(vocab.items()):\n", 391 | " print(item)\n", 392 | " if i > 10:\n", 393 | " break\n" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "id": "0125cfc5", 399 | "metadata": {}, 400 | "source": [ 401 | "### Step 6 : Create the Encoder \n", 402 | "\n", 403 | "**Key Steps**\n", 404 | "\n", 405 | "- Take any input text string and the pre defined vocabulary as input\n", 406 | "- Split the text to tokens \n", 407 | "- Use the vocabulary to generate tokenid for the input tokens \n", 408 | "- If a token does not exists in the vocabulary encode it with -99 \n", 409 | "\n", 410 | "### For this we will create a function \"encode\"\n" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 11, 416 | "id": "2f6536ad", 417 | "metadata": {}, 418 | "outputs": [], 419 | "source": [ 420 | "import re\n", 421 | "from typing import List, Dict\n", 422 | "\n", 423 | "def encode(text: str, vocabulary: Dict[str, int]) -> List[int]:\n", 424 | " \"\"\"\n", 425 | " Encode the input text into a list of token IDs using the given vocabulary.\n", 426 | "\n", 427 | " Parameters:\n", 428 | " text (str): The input text string of tokens.\n", 429 | " vocabulary (Dict[str, int]): A dictionary mapping tokens to integer values.\n", 430 | "\n", 431 | " Returns:\n", 432 | " List[int]: A list of integers representing the token IDs.\n", 433 | " \"\"\"\n", 434 | " \n", 435 | " # Split the input text into tokens\n", 436 | " result = re.split(r'([,.:;?_!\"()\\']|--|\\s)', text)\n", 437 | " \n", 438 | " # remove white spaces \n", 439 | " tokens = [item for item in result if item.strip()]\n", 440 | " \n", 441 | " \n", 442 | " # Generate the list of token IDs using the vocabulary\n", 443 | " token_ids = []\n", 444 | " for token in tokens:\n", 445 | " if token.strip() and token in vocabulary:\n", 446 | " token_ids.append(vocabulary[token])\n", 447 | " else:\n", 448 | " # Handle unknown tokens if necessary (e.g., append a special token ID or skip)\n", 449 | " # For example, let's append -1 for unknown tokens\n", 450 | " token_ids.append(-99)\n", 451 | " \n", 452 | " return token_ids\n" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": 12, 458 | "id": "14c3e822", 459 | 
"metadata": {}, 460 | "outputs": [ 461 | { 462 | "name": "stdout", 463 | "output_type": "stream", 464 | "text": [ 465 | "[1, 2, 3, 4, 5, 6, 7, -99]\n" 466 | ] 467 | } 468 | ], 469 | "source": [ 470 | "# Example usage\n", 471 | "vocabulary = {\n", 472 | " \"Hello\": 1,\n", 473 | " \"world\": 2,\n", 474 | " \"!\": 3,\n", 475 | " \"This\": 4,\n", 476 | " \"is\": 5,\n", 477 | " \"an\": 6,\n", 478 | " \"example\": 7\n", 479 | "}\n", 480 | "\n", 481 | "text = \"Hello world! This is an example.\"\n", 482 | "encoded_text = encode(text, vocabulary)\n", 483 | "print(encoded_text)" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "id": "9196d482", 489 | "metadata": {}, 490 | "source": [ 491 | "### Step 7: Create the Decoder \n", 492 | "\n", 493 | "**Notes**\n", 494 | "\n", 495 | "- When we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs into text. \n", 496 | "\n", 497 | "- For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens.\n", 498 | "\n", 499 | "- **If an unknown token is passed for encoding - return a -99 for the same** \n", 500 | "\n", 501 | "#### We develop the \"decoder' function as shown below " 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 13, 507 | "id": "800ffe1d", 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [ 511 | "\n", 512 | "from typing import List, Dict\n", 513 | "\n", 514 | "def decode(vocabulary: Dict[str, int], token_ids: List[int]) -> List[str]:\n", 515 | " \"\"\"\n", 516 | " Decode the input list of token IDs into a list of string tokens using the given vocabulary.\n", 517 | "\n", 518 | " Parameters:\n", 519 | " vocabulary (Dict[str, int]): A dictionary mapping tokens to integer values.\n", 520 | " token_ids (List[int]): A list of integers representing the token IDs.\n", 521 | "\n", 522 | " Returns:\n", 523 | " List[str]: A list of string tokens.\n", 524 | " \"\"\"\n", 525 | " \n", 526 | " # Create a reverse dictionary from the vocabulary\n", 527 | " int_to_str = {v: k for k, v in vocabulary.items()}\n", 528 | " \n", 529 | " # Generate the list of string tokens using the reverse dictionary\n", 530 | " tokens = []\n", 531 | " for token_id in token_ids:\n", 532 | " if token_id in int_to_str:\n", 533 | " tokens.append(int_to_str[token_id])\n", 534 | " else:\n", 535 | " # Handle unknown token IDs if necessary (e.g., append a special token or skip)\n", 536 | " # For example, let's append -99 for unknown token IDs\n", 537 | " tokens.append(-99)\n", 538 | " \n", 539 | " return tokens" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": 14, 545 | "id": "dfbfcaed", 546 | "metadata": {}, 547 | "outputs": [ 548 | { 549 | "name": "stdout", 550 | "output_type": "stream", 551 | "text": [ 552 | "['Hello', 'world', '!', 'This', 'is', 'an', 'example']\n" 553 | ] 554 | } 555 | ], 556 | "source": [ 557 | "# Example usage\n", 558 | "\n", 559 | "# define a test vocab\n", 560 | "vocabulary = {\n", 561 | " \"Hello\": 1,\n", 562 | " \"world\": 2,\n", 563 | " \"!\": 3,\n", 564 | " \"This\": 4,\n", 565 | " \"is\": 5,\n", 566 | " \"an\": 6,\n", 567 | " \"example\": 7\n", 568 | "}\n", 569 | "\n", 570 | "\n", 571 | "token_ids = [1, 2, 3, 4, 5, 6, 7]\n", 572 | "decoded_tokens = decode(vocabulary, token_ids)\n", 573 | "print(decoded_tokens)" 574 | ] 575 | }, 576 | { 577 | "cell_type": "markdown", 578 | "id": "d8cf391b", 579 | "metadata": {}, 580 | "source": [ 581 | "### Step 8: Build a final modified vocabulary 
function \n", 582 | "\n", 583 | "- Takes a list of raw strings \n", 584 | "- tokenizes them and removes white spaces \n", 585 | "- sorts the list \n", 586 | "- Adds a special token for unknown token \n", 587 | "- Adds a special token to mark end of text of a particular text source. \n", 588 | "- generates token id list as output \n" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": 22, 594 | "id": "0348020f", 595 | "metadata": {}, 596 | "outputs": [], 597 | "source": [ 598 | "from typing import List, Dict\n", 599 | "\n", 600 | "def create_vocab(rawtext: List[str], ) -> List [int]:\n", 601 | " \"\"\"\n", 602 | " Creates a Dictionary which maps a token to its token ID. \n", 603 | " Takes a list of raw strings \n", 604 | " tokenizes them and removes white spaces \n", 605 | " sorts the list \n", 606 | " adds a special token for unknown token \n", 607 | " adds a special token to mark end of text of a particular text source. \n", 608 | " generates token id list as output \n", 609 | "\n", 610 | "\n", 611 | " Parameters:\n", 612 | " rawtext (rawtext: List[str]): A list of raw text strings\n", 613 | "\n", 614 | " Returns:\n", 615 | " Dict[str, int]: a vocabulary dictionary which maps a token to a unique tokenid.\n", 616 | "\n", 617 | " \"\"\"\n", 618 | " \n", 619 | " # tokenize input text string \n", 620 | " tokens = re.split(r'([,.?_!\"()\\']|--|\\s)', rawtext)\n", 621 | " \n", 622 | " # remove white space \n", 623 | " tokens = [item.strip() for item in tokens if item.strip()]\n", 624 | " \n", 625 | " # remove duplicates \n", 626 | " unq_tokens = list(set(tokens))\n", 627 | " \n", 628 | " # sorted tokens\n", 629 | " srt_tokens = sorted(unq_tokens)\n", 630 | " \n", 631 | " # add special tokens for unknown strings and end of text segment\n", 632 | " srt_tokens.extend([\"<|endoftext|>\", \"<|unk|>\"])\n", 633 | " \n", 634 | " # create vocabulary\n", 635 | " vocabulary = {token:tokenid for tokenid,token in enumerate(srt_tokens)}\n", 636 | " \n", 637 | " return vocabulary" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": 27, 643 | "id": "842ba607", 644 | "metadata": {}, 645 | "outputs": [ 646 | { 647 | "name": "stdout", 648 | "output_type": "stream", 649 | "text": [ 650 | "Hello world! 
This is an example.\n" 651 | ] 652 | } 653 | ], 654 | "source": [ 655 | "# check text \n", 656 | "print(text)" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": 29, 662 | "id": "155c0543", 663 | "metadata": {}, 664 | "outputs": [ 665 | { 666 | "name": "stdout", 667 | "output_type": "stream", 668 | "text": [ 669 | "10\n", 670 | "{'!': 0, '.': 1, 'Hello': 2, 'This': 3, 'an': 4, 'example': 5, 'is': 6, 'world': 7, '<|endoftext|>': 8, '<|unk|>': 9}\n" 671 | ] 672 | } 673 | ], 674 | "source": [ 675 | "# Create Vocabulary from text \n", 676 | "vocab = create_vocab(text)\n", 677 | "\n", 678 | "# check length \n", 679 | "lenvocab = len(vocab)\n", 680 | "print(lenvocab)\n", 681 | "\n", 682 | "# print the vocab dict \n", 683 | "print(vocab)" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "id": "7c09e1b7", 689 | "metadata": {}, 690 | "source": [ 691 | "### Step 9 Build Vocabulary on short story by Edith Wharton called The Verdict" 692 | ] 693 | }, 694 | { 695 | "cell_type": "code", 696 | "execution_count": 34, 697 | "id": "51dfba88", 698 | "metadata": {}, 699 | "outputs": [ 700 | { 701 | "name": "stdout", 702 | "output_type": "stream", 703 | "text": [ 704 | "[('!', 0), ('\"', 1), (\"'\", 2), ('(', 3), (')', 4)]\n", 705 | "[('younger', 1156), ('your', 1157), ('yourself', 1158), ('<|endoftext|>', 1159), ('<|unk|>', 1160)]\n" 706 | ] 707 | } 708 | ], 709 | "source": [ 710 | "# create vocab \n", 711 | "vocab = create_vocab(corpus)\n", 712 | "\n", 713 | "# convert dict to list \n", 714 | "items_list = list(vocab.items())\n", 715 | "\n", 716 | "# Extract the first 5 items\n", 717 | "first_5_items = items_list[:5]\n", 718 | "\n", 719 | "# display and check \n", 720 | "print(first_5_items)\n", 721 | "\n", 722 | "# Extract the last 5 items\n", 723 | "last_5_items = items_list[-5:]\n", 724 | "\n", 725 | "# dispay and check \n", 726 | "print(last_5_items)" 727 | ] 728 | }, 729 | { 730 | "cell_type": "markdown", 731 | "id": "b522e3f2", 732 | "metadata": {}, 733 | "source": [ 734 | "### Step 10 : Create the Tokenizer Class \n", 735 | "(Code Reference - Ch 2 : Build a LLM from Scratch by Sebastian Raschka )\n", 736 | "\n", 737 | "- Here we implement a complete tokenizer class with an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary. \n", 738 | "\n", 739 | "- We add an <|unk|> token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary. 
\n", 740 | "\n", 741 | "- Furthermore, we add an <|endoftext|> token that we can use to separate two unrelated text sources.\n", 742 | "\n", 743 | "- We also implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text.\n", 744 | "\n", 745 | "**Note on the decoder**\n", 746 | "\n", 747 | "We add an extra clean up step as follows \n", 748 | "\n", 749 | "- The code **re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text)**\n", 750 | "\n", 751 | "removes any whitespace characters that appear immediately before specified punctuation marks, effectively tidying up the text by ensuring no spaces precede punctuation.\n" 752 | ] 753 | }, 754 | { 755 | "cell_type": "code", 756 | "execution_count": 62, 757 | "id": "8a58df42", 758 | "metadata": {}, 759 | "outputs": [], 760 | "source": [ 761 | "class TokenizerV1:\n", 762 | " def __init__(self, vocab):\n", 763 | " self.str_to_int = vocab \n", 764 | " self.int_to_str = {tokenid:string for string,tokenid in vocab.items()} \n", 765 | " \n", 766 | " \n", 767 | " def encode(self, text): \n", 768 | " \n", 769 | " # split input text into tokens\n", 770 | " preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n", 771 | " \n", 772 | " # remove white spaces \n", 773 | " preprocessed = [item.strip() for item in preprocessed if item.strip()]\n", 774 | " \n", 775 | " # add special token unknown \n", 776 | " preprocessed = [\n", 777 | " item if item in self.str_to_int \n", 778 | " else \"<|unk|>\" for item in preprocessed\n", 779 | " ]\n", 780 | "\n", 781 | " # Return list of token ids \n", 782 | " ids = [self.str_to_int[s] for s in preprocessed]\n", 783 | " return ids\n", 784 | " \n", 785 | " \n", 786 | " def decode(self, ids): \n", 787 | " \n", 788 | " # join the decoded tokens separated by one space \n", 789 | " text = \" \".join([self.int_to_str[i] for i in ids])\n", 790 | " \n", 791 | " # removes any whitespace characters that appear immediately before specified punctuation marks, effectively tidying up the text by ensuring no spaces precede punctuation.\n", 792 | " text = re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text) \n", 793 | " return text\n", 794 | " " 795 | ] 796 | }, 797 | { 798 | "cell_type": "markdown", 799 | "id": "d1f9d3f5", 800 | "metadata": {}, 801 | "source": [ 802 | "## Test Tokenizer with basic text string " 803 | ] 804 | }, 805 | { 806 | "cell_type": "code", 807 | "execution_count": 73, 808 | "id": "870f9f69", 809 | "metadata": {}, 810 | "outputs": [], 811 | "source": [ 812 | "# define input text \n", 813 | "\n", 814 | "text1 = \"\"\" If no mistake have you made, yet losing you are, a different game you should play. 
\"\"\"" 815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": 64, 820 | "id": "eb0333b0", 821 | "metadata": {}, 822 | "outputs": [], 823 | "source": [ 824 | "# Instantiate tokenizer with vocab \n", 825 | "tokenizer = TokenizerV1(vocab)" 826 | ] 827 | }, 828 | { 829 | "cell_type": "code", 830 | "execution_count": 74, 831 | "id": "a3f6ee45", 832 | "metadata": {}, 833 | "outputs": [ 834 | { 835 | "name": "stdout", 836 | "output_type": "stream", 837 | "text": [ 838 | "[56, 725, 1160, 538, 1155, 669, 5, 1154, 1160, 1155, 174, 5, 119, 1160, 1160, 1155, 904, 1160, 7]\n" 839 | ] 840 | } 841 | ], 842 | "source": [ 843 | "# encode and check \n", 844 | "ids = tokenizer.encode(text1)\n", 845 | "print(ids)" 846 | ] 847 | }, 848 | { 849 | "cell_type": "code", 850 | "execution_count": 75, 851 | "id": "42551b56", 852 | "metadata": {}, 853 | "outputs": [ 854 | { 855 | "name": "stdout", 856 | "output_type": "stream", 857 | "text": [ 858 | "If no <|unk|> have you made, yet <|unk|> you are, a <|unk|> <|unk|> you should <|unk|>.\n" 859 | ] 860 | } 861 | ], 862 | "source": [ 863 | "# decode and check \n", 864 | "print(tokenizer.decode(ids))" 865 | ] 866 | }, 867 | { 868 | "cell_type": "markdown", 869 | "id": "e109949e", 870 | "metadata": {}, 871 | "source": [ 872 | "## Test Tokenizer with compound text string" 873 | ] 874 | }, 875 | { 876 | "cell_type": "code", 877 | "execution_count": 76, 878 | "id": "2255391a", 879 | "metadata": {}, 880 | "outputs": [ 881 | { 882 | "name": "stdout", 883 | "output_type": "stream", 884 | "text": [ 885 | "Hello, do you wish to have coffee? <|endoftext|> In the shade of the large palm trees\n" 886 | ] 887 | } 888 | ], 889 | "source": [ 890 | "# define inout text \n", 891 | "text1 = \"Hello, do you wish to have coffee?\"\n", 892 | "\n", 893 | "text2 = \"In the shade of the large palm trees\"\n", 894 | "\n", 895 | "text = \" <|endoftext|> \".join((text1, text2))\n", 896 | "\n", 897 | "print(text)" 898 | ] 899 | }, 900 | { 901 | "cell_type": "code", 902 | "execution_count": 77, 903 | "id": "863026e4", 904 | "metadata": {}, 905 | "outputs": [ 906 | { 907 | "name": "stdout", 908 | "output_type": "stream", 909 | "text": [ 910 | "[1160, 5, 362, 1155, 1135, 1042, 538, 1160, 10, 1159, 57, 1013, 898, 738, 1013, 1160, 1160, 1160]\n" 911 | ] 912 | } 913 | ], 914 | "source": [ 915 | "# Instantiate tokenizer with vocab \n", 916 | "tokenizer = TokenizerV1(vocab)\n", 917 | "\n", 918 | "print(tokenizer.encode(text))\n" 919 | ] 920 | }, 921 | { 922 | "cell_type": "code", 923 | "execution_count": 78, 924 | "id": "ac849fb3", 925 | "metadata": {}, 926 | "outputs": [ 927 | { 928 | "name": "stdout", 929 | "output_type": "stream", 930 | "text": [ 931 | "<|unk|>, do you wish to have <|unk|>? 
<|endoftext|> In the shade of the <|unk|> <|unk|> <|unk|>\n" 932 | ] 933 | } 934 | ], 935 | "source": [ 936 | "# decode\n", 937 | "print(tokenizer.decode(tokenizer.encode(text)))\n" 938 | ] 939 | }, 940 | { 941 | "cell_type": "markdown", 942 | "id": "7438aca0", 943 | "metadata": {}, 944 | "source": [ 945 | "## End of notebook" 946 | ] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "execution_count": null, 951 | "id": "45c3f9f0", 952 | "metadata": {}, 953 | "outputs": [], 954 | "source": [] 955 | } 956 | ], 957 | "metadata": { 958 | "kernelspec": { 959 | "display_name": "Python 3 (ipykernel)", 960 | "language": "python", 961 | "name": "python3" 962 | }, 963 | "language_info": { 964 | "codemirror_mode": { 965 | "name": "ipython", 966 | "version": 3 967 | }, 968 | "file_extension": ".py", 969 | "mimetype": "text/x-python", 970 | "name": "python", 971 | "nbconvert_exporter": "python", 972 | "pygments_lexer": "ipython3", 973 | "version": "3.10.9" 974 | } 975 | }, 976 | "nbformat": 4, 977 | "nbformat_minor": 5 978 | } 979 | -------------------------------------------------------------------------------- /LLM_2_Byte_Pair_Encoding.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "8054aaf0", 6 | "metadata": {}, 7 | "source": [ 8 | "## LLM Part 2 : Byte Pair Encoding\n", 9 | "\n", 10 | "**Reference text** \n", 11 | "\n", 12 | "https://www.manning.com/books/build-a-large-language-model-from-scratch\n", 13 | "\n", 14 | "**Text Corpus** \n", 15 | "\n", 16 | "The text we will tokenize for LLM training is a short story by Edith Wharton called The Verdict, which has been released into the public domain and is thus permitted to be used for LLM training tasks. The text is available on Wikisource at https://en.wikisource.org/wiki/The_Verdict," 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "112af175", 22 | "metadata": {}, 23 | "source": [ 24 | "### Problem Statement :\n", 25 | "\n", 26 | "In the first notebook, we discussed how to develop a Word Tokenizer step by step. In this notebook we will demonstrate An advanced tokenization method called Byte Pair Encoding\n" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "id": "ea1c2fa3", 32 | "metadata": {}, 33 | "source": [ 34 | "### Additional special tokens that could have been implemented further in the tokenizer developed in Part 1 \n", 35 | "\n", 36 | "- [BOS] (beginning of sequence): This token marks the start of a text. It signifies to the LLM where a piece of content begins.\n", 37 | "\n", 38 | "- [EOS] (end of sequence): This token is positioned at the end of a text, and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For instance, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one begins.\n", 39 | "\n", 40 | "- [PAD] (padding): When training LLMs with batch sizes larger than one,\n", 41 | "the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or \"padded\" using the [PAD] token, up to the length of the longest text in the batch." 
42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "id": "b0b93dda", 47 | "metadata": {}, 48 | "source": [ 49 | "### What does the tokenizer for GPT models use ?\n", 50 | "\n", 51 | "- The tokenizer used for GPT models does not need any of these tokens mentioned above but only uses an <|endoftext|> token for simplicity. \n", 52 | "\n", 53 | "- The <|endoftext|> is analogous to the [EOS] token mentioned above. Also, <|endoftext|> is used for padding as well. However, as we'll explore in subsequent chapters when training on batched inputs, we typically use a mask, meaning we don't attend to padded tokens. Thus, the specific token chosen for padding becomes inconsequential.\n", 54 | "\n", 55 | "- Moreover, the tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words.\n", 56 | "\n", 57 | "- Instead, GPT models use a **byte pair encoding tokenizer**, which breaks down words into subword units, which we will discuss in the next section." 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "id": "06e63639", 63 | "metadata": {}, 64 | "source": [ 65 | "### How is the Byte Pair Encoding used by GPT-2 superior ?\n", 66 | "\n", 67 | "\n", 68 | "- it allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words\n", 69 | "- For instance, if GPT-2's vocabulary doesn't have the word \"unfamiliarword,\" it might tokenize it as [\"unfam\", \"iliar\", \"word\"] or some other subword breakdown, depending on its trained BPE merges\n", 70 | "- The original BPE tokenizer can be found here: [https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)\n", 71 | "\n" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "id": "e00fb133", 77 | "metadata": {}, 78 | "source": [ 79 | "## Concept Note : Byte Pair Encoding \n", 80 | "\n", 81 | "REFERENCE \n", 82 | "https://www.geeksforgeeks.org/byte-pair-encoding-bpe-in-nlp/\n" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "id": "eed7eead", 88 | "metadata": {}, 89 | "source": [ 90 | "#### Concepts related to BPE:\n", 91 | "\n", 92 | "- **Vocabulary:** A set of subword units that can be used to represent a text corpus.\n", 93 | "- **Byte:** A unit of digital information that typically consists of eight bits.\n", 94 | "- **Character:** A symbol that represents a written or printed letter or numeral.\n", 95 | "- **Frequency:** The number of times a byte or character occurs in a text corpus.\n", 96 | "- **Merge:** The process of combining two consecutive bytes or characters to create a new subword unit.\n", 97 | "\n", 98 | "\n", 99 | "### Steps involved in BPE:\n", 100 | "- Initialize the vocabulary with all the bytes or characters in the text corpus\n", 101 | "- Calculate the frequency of each byte or character in the text corpus.\n", 102 | "- Repeat the following steps until the desired vocabulary size is reached:\n", 103 | " - Find the most frequent pair of consecutive bytes or characters in the text corpus\n", 104 | " - Merge the pair to create a new subword unit.\n", 105 | " - Update the frequency counts of all the bytes or characters that contain the merged pair.\n", 106 | " - Add the new subword unit to the vocabulary.\n", 107 | "- Represent the text corpus using the subword units in the vocabulary." 
108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "id": "c6c45af8", 113 | "metadata": {}, 114 | "source": [ 115 | "### How does BPE Work - A simple example \n", 116 | "\n", 117 | "Suppose we have a text corpus with the following four words: “ab”, “bc”, “bcd”, and “cde”. The initial vocabulary consists of all the bytes or characters in the text corpus: {“a”, “b”, “c”, “d”, “e”}" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "id": "546e2347", 123 | "metadata": {}, 124 | "source": [ 125 | "#### Step 1 : Initialize the vocabulary \n", 126 | "\n", 127 | "Vocabulary = {\"a\", \"b\", \"c\", \"d\", \"e\"}\n", 128 | "\n", 129 | "#### Step 2 : Compute Frequency \n", 130 | "\n", 131 | "Frequency = {\"a\": 1, \"b\": 3, \"c\": 3, \"d\": 2, \"e\": 1}\n", 132 | "\n", 133 | "#### Repeat Steps 3 to 5 until the desired vocabulary size is reached.\n", 134 | "\n", 135 | "- #### Step 3 : Find the most frequent pair of two characters\n", 136 | "\n", 137 | " The most frequent pair is \"bc\" with a frequency of 2.\n", 138 | "\n", 139 | "- #### Step 4 : Merge the pair\n", 140 | "\n", 141 | " Merge \"bc\" to create a new subword unit \"bc\"\n", 142 | "\n", 143 | "- #### Step 5: Update frequency counts\n", 144 | "\n", 145 | " Frequency = {\"a\": 1, \"b\": 2, \"c\": 3, \"d\": 2, \"e\": 1, \"bc\": 2}\n", 146 | " \n", 147 | "#### Represent the text corpus using subword units\n", 148 | "\n", 149 | "The resulting vocabulary consists of the following subword units: {\"a\", \"b\", \"c\", \"d\", \"e\", \"bc\", \"cd\", \"de\",\"ab\",\"bcd\",\"cde\"}." 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "id": "e92d6c69", 155 | "metadata": {}, 156 | "source": [ 157 | "## Section A: Python Implementation of Byte Pair Encoding \n", 158 | "\n", 159 | "We will define a series of functions to perfome the byte pair encoding as discussed above\n", 160 | "\n", 161 | "\n", 162 | "#### 1) get_vocab()\n", 163 | "\n", 164 | "The get_vocab function is defined to take a list of strings (data) as input and return a dictionary mapping words (formatted as separated characters with an end token) to their frequency counts.\n", 165 | "\n", 166 | "\n", 167 | "#### 2) get_stats()\n", 168 | "\n", 169 | "The get_stats function is defined to take a dictionary (vocab) as input and return a dictionary mapping tuples of character pairs to their frequency counts.\n", 170 | " \n", 171 | "#### 3) merge_vocab()\n", 172 | "\n", 173 | "The merge_vocab function is defined to take a tuple of characters (pair) and a dictionary (v_in) as input, and return a new dictionary with the specified pair of characters merged.\n", 174 | "\n", 175 | "\n", 176 | "byte pai \n", 177 | "\n", 178 | "The byte_pair_encoding function is defined to take a list of strings (data) and an integer (n) as input, and return a dictionary representing the vocabulary with merged character pairs." 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "id": "589e7400", 184 | "metadata": {}, 185 | "source": [ 186 | "### STEP 1. 
Define function get_vocab()\n", 187 | "\n", 188 | "\n", 189 | "#### Note on creating the vocabulary dictionary**\n", 190 | "\n", 191 | "- **vocab = defaultdict(int):** Initializes a defaultdict with int as the default factory, meaning any new key will have a default value of 0.\n", 192 | "\n", 193 | "- The function iterates through each line in data and then through each word in the line.\n", 194 | "\n", 195 | "- **vocab[' '.join(list(word)) + ' '] += 1:**\n", 196 | "\n", 197 | "- **list(word):** Converts the word into a list of characters.\n", 198 | "\n", 199 | "- **' '.join(list(word)): Joins the characters with spaces.**\n", 200 | "\n", 201 | "- **+ ' ': Adds an end token to the end of the word.**\n", 202 | "\n", 203 | "- The resulting string is used as a key in the vocab dictionary, and its value (frequency count) is incremented by 1." 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 1, 209 | "id": "469164b5", 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "from collections import defaultdict\n", 214 | "from typing import List, Dict\n", 215 | "\n", 216 | "def get_vocab(data: List[str]) -> Dict[str, int]:\n", 217 | " \"\"\"\n", 218 | " Given a list of strings, returns a dictionary of words mapping to their frequency \n", 219 | " count in the data.\n", 220 | "\n", 221 | " Parameters:\n", 222 | " data (List[str]): A list of strings where each string is a line of text.\n", 223 | "\n", 224 | " Returns:\n", 225 | " Dict[str, int]: A dictionary where keys are words with separated characters and\n", 226 | " an end token, and values are their frequency counts.\n", 227 | " \"\"\"\n", 228 | " vocab = defaultdict(int)\n", 229 | " for line in data:\n", 230 | " for word in line.split():\n", 231 | " # Join the characters of the word with spaces and add an end token\n", 232 | " vocab[' '.join(list(word)) + ' '] += 1\n", 233 | " return vocab\n" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 2, 239 | "id": "57a06733", 240 | "metadata": {}, 241 | "outputs": [ 242 | { 243 | "name": "stdout", 244 | "output_type": "stream", 245 | "text": [ 246 | "defaultdict(, {'t h i s ': 3, 'i s ': 3, 'a ': 3, 't e s t ': 4, 'o n l y ': 1})\n" 247 | ] 248 | } 249 | ], 250 | "source": [ 251 | "# Example usage\n", 252 | "data = [\n", 253 | " \"this is a test\",\n", 254 | " \"this test is only a test\",\n", 255 | " \"a test this is\"\n", 256 | "]\n", 257 | "\n", 258 | "vocab = get_vocab(data)\n", 259 | "print(vocab)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "id": "62fdc019", 265 | "metadata": {}, 266 | "source": [ 267 | "### STEP 2: Define function get_stats()\n", 268 | "\n", 269 | "\n", 270 | "\n", 271 | "#### Note on Creating the Pairs Dictionary:\n", 272 | "\n", 273 | "- **pairs = defaultdict(int):** Initializes a defaultdict with int as the default factory, meaning any new key will have a default value of 0.\n", 274 | "\n", 275 | "- The function iterates through each word and its frequency in vocab.\n", 276 | "\n", 277 | "- **symbols = word.split():** Splits the word into its component symbols (characters and end token).\n", 278 | "\n", 279 | "\n", 280 | "- The nested loop iterates through adjacent symbol pairs in the list and increments their frequency count in the pairs dictionary." 
281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": 3, 286 | "id": "754158be", 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "from collections import defaultdict\n", 291 | "from typing import Dict, Tuple\n", 292 | "\n", 293 | "def get_stats(vocab: Dict[str, int]) -> Dict[Tuple[str, str], int]:\n", 294 | " \"\"\"\n", 295 | " Given a vocabulary (dictionary mapping words to frequency counts), returns a \n", 296 | " dictionary of tuples representing the frequency count of pairs of characters \n", 297 | " in the vocabulary.\n", 298 | "\n", 299 | " Parameters:\n", 300 | " vocab (Dict[str, int]): A dictionary where keys are words with separated characters \n", 301 | " and an end token, and values are their frequency counts.\n", 302 | "\n", 303 | " Returns:\n", 304 | " Dict[Tuple[str, str], int]: A dictionary where keys are tuples of character pairs \n", 305 | " and values are their frequency counts in the vocabulary.\n", 306 | " \"\"\"\n", 307 | " pairs = defaultdict(int)\n", 308 | " for word, freq in vocab.items():\n", 309 | " symbols = word.split()\n", 310 | " for i in range(len(symbols) - 1):\n", 311 | " pairs[symbols[i], symbols[i + 1]] += freq\n", 312 | " return pairs\n", 313 | "\n", 314 | "\n" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 4, 320 | "id": "3764c9d5", 321 | "metadata": {}, 322 | "outputs": [ 323 | { 324 | "name": "stdout", 325 | "output_type": "stream", 326 | "text": [ 327 | "defaultdict(, {('t', 'h'): 3, ('h', 'i'): 3, ('i', 's'): 5, ('s', ''): 5, ('a', ''): 2, ('t', 'e'): 2, ('e', 's'): 2, ('s', 't'): 2, ('t', ''): 2})\n" 328 | ] 329 | } 330 | ], 331 | "source": [ 332 | "# Example usage\n", 333 | "vocab = {\n", 334 | " 't h i s ': 3,\n", 335 | " 'i s ': 2,\n", 336 | " 'a ': 2,\n", 337 | " 't e s t ': 2\n", 338 | "}\n", 339 | "\n", 340 | "stats = get_stats(vocab)\n", 341 | "print(stats)" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "id": "f11ef8c6", 347 | "metadata": {}, 348 | "source": [ 349 | "### STEP 3: Define function merge_vocab()\n", 350 | "\n", 351 | "\n", 352 | "**Notes on Creating the New Vocabulary Dictionary**\n", 353 | "\n", 354 | "\n", 355 | "- v_out = {}: Initializes an empty dictionary for the new vocabulary.\n", 356 | "\n", 357 | "- bigram = re.escape(' '.join(pair)): Joins the pair with a space and escapes any special characters for use in a regular expression.\n", 358 | "\n", 359 | "- p = re.compile(r'(? 
Dict[str, int]:\n", 529 | " \"\"\"\n", 530 | " Given a pair of characters and a vocabulary, returns a new vocabulary with the \n", 531 | " pair of characters merged together wherever they appear.\n", 532 | "\n", 533 | " Parameters:\n", 534 | " pair (Tuple[str, str]): A tuple containing two characters to be merged.\n", 535 | " v_in (Dict[str, int]): A dictionary where keys are words with separated characters \n", 536 | " and an end token, and values are their frequency counts.\n", 537 | "\n", 538 | " Returns:\n", 539 | " Dict[str, int]: A new vocabulary dictionary with the pair of characters merged.\n", 540 | " \"\"\"\n", 541 | " v_out = {}\n", 542 | " bigram = re.escape(' '.join(pair))\n", 543 | " p = re.compile(r'(?': 3, 'i s ': 2, 'a ': 2, 't e s t ': 2}\n" 564 | ] 565 | } 566 | ], 567 | "source": [ 568 | "# Example usage\n", 569 | "v_in = {\n", 570 | " 't h i s ': 3,\n", 571 | " 'i s ': 2,\n", 572 | " 'a ': 2,\n", 573 | " 't e s t ': 2\n", 574 | "}\n", 575 | "\n", 576 | "pair = ('t', 'h')\n", 577 | "merged_vocab = merge_vocab(pair, v_in)\n", 578 | "print(merged_vocab)" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "id": "041d079f", 584 | "metadata": {}, 585 | "source": [ 586 | "### STEP 4 : Create the Byte Pair Encoder Function \n", 587 | "\n", 588 | "\n", 589 | "**Putting it all together**\n", 590 | "\n", 591 | "\n", 592 | "**Input Parameters**\n", 593 | "\n", 594 | "- A list of strings \n", 595 | "- an integer n denoting how many merged pairs are to be returned \n", 596 | "\n", 597 | "**Output Parameter(s)**\n", 598 | "\n", 599 | "- A dictionary with the vocabulary post Byte Pair Encoding \n", 600 | "\n", 601 | "\n", 602 | "\n", 603 | "**Key Process Steps**\n", 604 | "\n", 605 | "- vocab = get_vocab(data): Initializes the vocabulary using the get_vocab function.\n", 606 | "- loop through 'n' times, at each iteration:\n", 607 | " - determine frequency dict of character pairs\n", 608 | " - extract the most frequent character pair\n", 609 | " - merge the most frequent pair in the vocab list " 610 | ] 611 | }, 612 | { 613 | "cell_type": "code", 614 | "execution_count": 9, 615 | "id": "b9917e98", 616 | "metadata": {}, 617 | "outputs": [], 618 | "source": [ 619 | "from typing import List, Dict\n", 620 | "\n", 621 | "def byte_pair_encoder(data: List[str], n: int) -> Dict[str, int]:\n", 622 | " \"\"\"\n", 623 | " Given a list of strings and an integer n, returns a list of n merged pairs\n", 624 | " of characters found in the vocabulary of the input data.\n", 625 | "\n", 626 | " Parameters:\n", 627 | " data (List[str]): A list of strings where each string is a line of text.\n", 628 | " n (int): The number of pairs of characters to merge.\n", 629 | "\n", 630 | " Returns:\n", 631 | " Dict[str, int]: A dictionary representing the vocabulary with merged character pairs.\n", 632 | " \"\"\"\n", 633 | " vocab = get_vocab(data)\n", 634 | " for i in range(n):\n", 635 | " pairs = get_stats(vocab)\n", 636 | " best = max(pairs, key=pairs.get)\n", 637 | " vocab = merge_vocab(best, vocab)\n", 638 | " return vocab" 639 | ] 640 | }, 641 | { 642 | "cell_type": "code", 643 | "execution_count": 10, 644 | "id": "d7959892", 645 | "metadata": {}, 646 | "outputs": [], 647 | "source": [ 648 | "# Example usage:\n", 649 | "\n", 650 | "# Set corpus \n", 651 | "\n", 652 | "corpus = '''Tokenization is the process of breaking down \n", 653 | "a sequence of text into smaller units called tokens,\n", 654 | "which can be words, phrases, or even individual characters.\n", 655 | "Tokenization is often the first 
step in natural language processing tasks \n", 656 | "such as text classification, named entity recognition, and sentiment analysis.\n", 657 | "The resulting tokens are typically used as input to further processing steps,\n", 658 | "such as vectorization, where the tokens are converted\n", 659 | "into numerical representations for machine learning models to use.'''\\\n", 660 | "\n", 661 | "\n", 662 | "# split by sentence \n", 663 | "data = corpus.split('.')\n", 664 | "\n", 665 | "\n" 666 | ] 667 | }, 668 | { 669 | "cell_type": "markdown", 670 | "id": "c186f956", 671 | "metadata": {}, 672 | "source": [ 673 | "#### TEST BPE with n = 200" 674 | ] 675 | }, 676 | { 677 | "cell_type": "code", 678 | "execution_count": 11, 679 | "id": "047a0ffc", 680 | "metadata": {}, 681 | "outputs": [ 682 | { 683 | "name": "stdout", 684 | "output_type": "stream", 685 | "text": [ 686 | "{'Tokenization': 2, 'is': 2, 'the': 3, 'process': 1, 'of': 2, 'breaking': 1, 'down': 1, 'a': 1, 'sequence': 1, 'text': 2, 'into': 2, 'smaller': 1, 'units': 1, 'called': 1, 'tokens,': 1, 'which': 1, 'can': 1, 'be': 1, 'words,': 1, 'phrases,': 1, 'or': 1, 'even': 1, 'individual': 1, 'characters': 1, 'often': 1, 'first': 1, 'step': 1, 'in': 1, 'natural': 1, 'language': 1, 'processing': 2, 'tasks': 1, 'such': 2, 'as': 3, 'classification,': 1, 'named': 1, 'entity': 1, 'recognition,': 1, 'and': 1, 'sentiment': 1, 'analysis': 1, 'The': 1, 'resulting': 1, 'tokens': 2, 'are': 2, 'typically': 1, 'used': 1, 'input': 1, 'to': 2, 'further': 1, 'steps,': 1, 'vectorization,': 1, 'where': 1, 'conv er te d': 1, 'n u m er ic al': 1, 're pr es en t ation s': 1, 'f or': 1, 'm a ch in e': 1, 'l e ar n ing': 1, 'm o d e l s': 1, 'us e': 1}\n" 687 | ] 688 | } 689 | ], 690 | "source": [ 691 | "# define output count \n", 692 | "n = 200\n", 693 | "\n", 694 | "# call function \n", 695 | "bpe_pairs = byte_pair_encoder(data, n)\n", 696 | "\n", 697 | "# check \n", 698 | "print(bpe_pairs)" 699 | ] 700 | }, 701 | { 702 | "cell_type": "markdown", 703 | "id": "8cf09d17", 704 | "metadata": {}, 705 | "source": [ 706 | "#### TEST BPE with n = 210" 707 | ] 708 | }, 709 | { 710 | "cell_type": "code", 711 | "execution_count": 12, 712 | "id": "dbd11b0f", 713 | "metadata": {}, 714 | "outputs": [ 715 | { 716 | "name": "stdout", 717 | "output_type": "stream", 718 | "text": [ 719 | "{'Tokenization': 2, 'is': 2, 'the': 3, 'process': 1, 'of': 2, 'breaking': 1, 'down': 1, 'a': 1, 'sequence': 1, 'text': 2, 'into': 2, 'smaller': 1, 'units': 1, 'called': 1, 'tokens,': 1, 'which': 1, 'can': 1, 'be': 1, 'words,': 1, 'phrases,': 1, 'or': 1, 'even': 1, 'individual': 1, 'characters': 1, 'often': 1, 'first': 1, 'step': 1, 'in': 1, 'natural': 1, 'language': 1, 'processing': 2, 'tasks': 1, 'such': 2, 'as': 3, 'classification,': 1, 'named': 1, 'entity': 1, 'recognition,': 1, 'and': 1, 'sentiment': 1, 'analysis': 1, 'The': 1, 'resulting': 1, 'tokens': 2, 'are': 2, 'typically': 1, 'used': 1, 'input': 1, 'to': 2, 'further': 1, 'steps,': 1, 'vectorization,': 1, 'where': 1, 'converted': 1, 'numerical': 1, 'repres en t ation s': 1, 'f or': 1, 'm a ch in e': 1, 'l e ar n ing': 1, 'm o d e l s': 1, 'us e': 1}\n" 720 | ] 721 | } 722 | ], 723 | "source": [ 724 | "# define output count \n", 725 | "n = 210\n", 726 | "\n", 727 | "# call function \n", 728 | "bpe_pairs = byte_pair_encoder(data, n)\n", 729 | "\n", 730 | "# check \n", 731 | "print(bpe_pairs)" 732 | ] 733 | }, 734 | { 735 | "cell_type": "markdown", 736 | "id": "ebf8d6c3", 737 | "metadata": {}, 738 | "source": [ 739 | "## Section B: Using 
Byte Pair Encoder from tiktoken \n", 740 | "\n", 741 | "**Note**\n", 742 | "\n", 743 | "We have seen how complex it is to implement BPE from ground up! and its also operationally expensive in compute sense.\n", 744 | "\n", 745 | "In this section we will use an existing Python open-source library called tiktoken (https://github.com/openai/tiktoken), which implements the BPE algorithm very efficiently based on source code in **Rust**.\n", 746 | "\n", 747 | "\n", 748 | "### installs" 749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": 13, 754 | "id": "8d3036db", 755 | "metadata": {}, 756 | "outputs": [], 757 | "source": [ 758 | "#!pip install tiktoken\n", 759 | "import tiktoken" 760 | ] 761 | }, 762 | { 763 | "cell_type": "markdown", 764 | "id": "b7f7c1d9", 765 | "metadata": {}, 766 | "source": [ 767 | "### Instantiate the BPE tokenizer from tiktoken " 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": 14, 773 | "id": "b1294148", 774 | "metadata": {}, 775 | "outputs": [], 776 | "source": [ 777 | "tokenizer = tiktoken.get_encoding(\"gpt2\")" 778 | ] 779 | }, 780 | { 781 | "cell_type": "markdown", 782 | "id": "267ccb06", 783 | "metadata": {}, 784 | "source": [ 785 | "### Check usage \n", 786 | "\n", 787 | "**Key Steps**\n", 788 | " - tokenize an input text to token ids and check \n", 789 | " - convert token ids back to tokens and check \n", 790 | "\n", 791 | "\n", 792 | "#### encode" 793 | ] 794 | }, 795 | { 796 | "cell_type": "code", 797 | "execution_count": 15, 798 | "id": "a13e1297", 799 | "metadata": {}, 800 | "outputs": [ 801 | { 802 | "name": "stdout", 803 | "output_type": "stream", 804 | "text": [ 805 | "[15496, 11, 466, 345, 765, 617, 6891, 30, 220, 50256, 554, 262, 16187, 286, 1588, 18057, 7150, 1659, 617, 34680, 27271, 13]\n" 806 | ] 807 | } 808 | ], 809 | "source": [ 810 | "# define text \n", 811 | "text = (\n", 812 | " \"Hello, do you want some coffee? <|endoftext|> In the shadows of large palm trees\"\n", 813 | " \"of someunknownPlace.\"\n", 814 | ")\n", 815 | "\n", 816 | "# tokenize \n", 817 | "integers = tokenizer.encode(text, allowed_special={\"<|endoftext|>\"})\n", 818 | "\n", 819 | "# check \n", 820 | "print(integers)\n" 821 | ] 822 | }, 823 | { 824 | "cell_type": "markdown", 825 | "id": "2a1e5c63", 826 | "metadata": {}, 827 | "source": [ 828 | "#### decode " 829 | ] 830 | }, 831 | { 832 | "cell_type": "code", 833 | "execution_count": 16, 834 | "id": "c924277e", 835 | "metadata": {}, 836 | "outputs": [ 837 | { 838 | "name": "stdout", 839 | "output_type": "stream", 840 | "text": [ 841 | "Hello, do you want some coffee? <|endoftext|> In the shadows of large palm treesof someunknownPlace.\n" 842 | ] 843 | } 844 | ], 845 | "source": [ 846 | "strings = tokenizer.decode(integers)\n", 847 | "\n", 848 | "print(strings)" 849 | ] 850 | }, 851 | { 852 | "cell_type": "code", 853 | "execution_count": null, 854 | "id": "ca755ab7", 855 | "metadata": {}, 856 | "outputs": [], 857 | "source": [] 858 | }, 859 | { 860 | "cell_type": "markdown", 861 | "id": "a1cd9017", 862 | "metadata": {}, 863 | "source": [ 864 | "### Observations from usage \n", 865 | "\n", 866 | "- **First** The <|endoftext|> token is assigned a relatively large token ID, namely, 50256. 
In fact, the BPE tokenizer, which was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT, has a total vocabulary size of 50,257, with <|endoftext|> being assigned the largest token ID.\n", 867 | "\n", 868 | "\n", 869 | "- **Second** The BPE tokenizer above encodes and decodes unknown words, such as \"someunknownPlace\" correctly. The BPE tokenizer can handle any unknown word. **How does it achieve this without using <|unk|> tokens?**" 870 | ] 871 | }, 872 | { 873 | "cell_type": "markdown", 874 | "id": "22b68d41", 875 | "metadata": {}, 876 | "source": [ 877 | "#### The Trick is as below\n", 878 | "\n", 879 | "The algorithm underlying BPE breaks down words that aren't in its\n", 880 | "predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words. So, thanks to the BPE algorithm, if the tokenizer encounters an unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or characters, as illustrated in Figure below.\n", 881 | "\n", 882 | "#### Fig reference - Ch2 Reference text \n" 883 | ] 884 | }, 885 | { 886 | "cell_type": "markdown", 887 | "id": "962bbacb", 888 | "metadata": {}, 889 | "source": [ 890 | "\n", 891 | "" 892 | ] 893 | }, 894 | { 895 | "cell_type": "markdown", 896 | "id": "bdc836bc", 897 | "metadata": {}, 898 | "source": [ 899 | "### Let us test the above " 900 | ] 901 | }, 902 | { 903 | "cell_type": "code", 904 | "execution_count": 17, 905 | "id": "d680ad40", 906 | "metadata": {}, 907 | "outputs": [ 908 | { 909 | "name": "stdout", 910 | "output_type": "stream", 911 | "text": [ 912 | "[9084, 86, 343, 86, 220, 959]\n", 913 | " Akwirw ier\n" 914 | ] 915 | } 916 | ], 917 | "source": [ 918 | "text = \" Akwirw ier\"\n", 919 | "\n", 920 | "# tokenize \n", 921 | "integers = tokenizer.encode(text, allowed_special={\"<|endoftext|>\"})\n", 922 | "\n", 923 | "# check \n", 924 | "print(integers)\n", 925 | "\n", 926 | "\n", 927 | "strings = tokenizer.decode(integers)\n", 928 | "\n", 929 | "print(strings)" 930 | ] 931 | }, 932 | { 933 | "cell_type": "markdown", 934 | "id": "a50e2564", 935 | "metadata": {}, 936 | "source": [ 937 | "### Food for thought: - How does it recombine unknown words post byte pair encoding!!!\n", 938 | "\n", 939 | "(guess we need to ask that to some one in open AI!!!)\n" 940 | ] 941 | }, 942 | { 943 | "cell_type": "markdown", 944 | "id": "7438aca0", 945 | "metadata": {}, 946 | "source": [ 947 | "## End of notebook" 948 | ] 949 | }, 950 | { 951 | "cell_type": "code", 952 | "execution_count": null, 953 | "id": "45c3f9f0", 954 | "metadata": {}, 955 | "outputs": [], 956 | "source": [] 957 | } 958 | ], 959 | "metadata": { 960 | "kernelspec": { 961 | "display_name": "Python 3 (ipykernel)", 962 | "language": "python", 963 | "name": "python3" 964 | }, 965 | "language_info": { 966 | "codemirror_mode": { 967 | "name": "ipython", 968 | "version": 3 969 | }, 970 | "file_extension": ".py", 971 | "mimetype": "text/x-python", 972 | "name": "python", 973 | "nbconvert_exporter": "python", 974 | "pygments_lexer": "ipython3", 975 | "version": "3.10.12" 976 | } 977 | }, 978 | "nbformat": 4, 979 | "nbformat_minor": 5 980 | } 981 | -------------------------------------------------------------------------------- /LLM_6_Attention_Trainable_Wt.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anishiisc/Build_LLM_from_Scratch/4638da9ab05edd433370b89856f0b5a9fb04cb46/LLM_6_Attention_Trainable_Wt.pdf 
-------------------------------------------------------------------------------- /LLM_Part_1_Tokenizer.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anishiisc/Build_LLM_from_Scratch/4638da9ab05edd433370b89856f0b5a9fb04cb46/LLM_Part_1_Tokenizer.pdf -------------------------------------------------------------------------------- /LLM_Part_2_Byte_Pair_Encoding .pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anishiisc/Build_LLM_from_Scratch/4638da9ab05edd433370b89856f0b5a9fb04cb46/LLM_Part_2_Byte_Pair_Encoding .pdf -------------------------------------------------------------------------------- /LLM_Part_3_Data_Loader.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anishiisc/Build_LLM_from_Scratch/4638da9ab05edd433370b89856f0b5a9fb04cb46/LLM_Part_3_Data_Loader.pdf -------------------------------------------------------------------------------- /LLM_Part_4_Word_Embeddings.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anishiisc/Build_LLM_from_Scratch/4638da9ab05edd433370b89856f0b5a9fb04cb46/LLM_Part_4_Word_Embeddings.pdf -------------------------------------------------------------------------------- /LLM_Part_5__Self_Attention.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anishiisc/Build_LLM_from_Scratch/4638da9ab05edd433370b89856f0b5a9fb04cb46/LLM_Part_5__Self_Attention.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Build_LLM_from_Scratch 2 | 3 | This repo is a work in progress towards developing a series of tutorial notebooks on building an LLM from scratch. 4 | The primary reference text for this tutorial series is 5 | https://www.manning.com/books/build-a-large-language-model-from-scratch 6 | 7 | 8 | ## The repo contains the following notebooks 9 | 10 | ### LLM_1_Tokenizer 11 | In this notebook we develop, step by step, a tokenizer class. The text we tokenize for LLM training is a short story by Edith Wharton called The Verdict, which has been released into the public domain and is thus permitted to be used for LLM training tasks. The text is available on Wikisource at https://en.wikisource.org/wiki/The_Verdict. 12 | A notebook-based tutorial series on building an LLM from scratch. 13 | 14 | 15 | ### LLM_2_Byte Pair Encoding 16 | In the first notebook, we discussed how to develop a word tokenizer step by step. In this notebook we demonstrate an advanced tokenization method called byte pair encoding (BPE). 17 | 18 | 19 | ### LLM_3_Data Loader 20 | In this notebook we explain the concepts of the Dataset class and DataLoader in PyTorch. We also show, with a series of examples, how the input (X) and target token sequences are generated for building a next-word predictor. 21 | 22 | 23 | ### LLM_4 Embeddings 24 | In this notebook we explain the concept of word embeddings and how both tokens and token positions are represented through embeddings as input to the training process, starting from randomly initialized embedding weights.
25 | 26 | ### LLM_5 Self Attention 27 | In this notebook we describe, step by step, the implementation of a simple self-attention mechanism. 28 | 29 | ### LLM_6 Self Attention with trainable weights 30 | In this notebook we describe, step by step, the implementation of self-attention with trainable weights, showing how Query, Key, and Value vectors are used to compute a context vector; a minimal illustrative sketch follows below. 31 | --------------------------------------------------------------------------------
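As a companion to the LLM_6 summary above, here is a minimal, self-contained sketch of single-head self-attention with trainable weights. It is illustrative only and not taken from the notebooks: the tensor names, dimensions, and the choice of PyTorch are assumptions. The idea it shows is that queries, keys, and values are produced by learned projection matrices, and each token's context vector is a softmax-weighted sum of the value vectors.

```python
import torch

torch.manual_seed(123)

# Toy setup: a sequence of 5 tokens, each with an 8-dimensional embedding.
d_in, d_out, seq_len = 8, 4, 5
x = torch.randn(seq_len, d_in)

# Trainable projection matrices for queries, keys, and values (assumed names).
W_query = torch.nn.Parameter(torch.randn(d_in, d_out))
W_key   = torch.nn.Parameter(torch.randn(d_in, d_out))
W_value = torch.nn.Parameter(torch.randn(d_in, d_out))

queries = x @ W_query   # (seq_len, d_out)
keys    = x @ W_key     # (seq_len, d_out)
values  = x @ W_value   # (seq_len, d_out)

# Attention scores: similarity of every query with every key.
attn_scores = queries @ keys.T                            # (seq_len, seq_len)

# Scale by sqrt(d_out) and normalize each row into attention weights.
attn_weights = torch.softmax(attn_scores / d_out ** 0.5, dim=-1)

# Each context vector is a weighted combination of the value vectors.
context_vectors = attn_weights @ values                   # (seq_len, d_out)

print(context_vectors.shape)  # torch.Size([5, 4])
```

In a full implementation these projections would typically be wrapped in an `nn.Module` and trained jointly with the rest of the model, and a causal mask would be applied before the softmax so that a token cannot attend to positions that come after it.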