├── README.md ├── pages └── Let's build the GPT Tokenizer │ ├── 1. Introduction to Tokenization.md │ ├── 10. Live Demonstration of Tokenization.md │ ├── 11. Tokenization of English Sentences.md │ ├── 12. Tokenization of Arithmetic.md │ ├── 13. Tokenization of Non-English Languages.md │ ├── 14. Tokenization of Programming Languages.md │ ├── 15. Improvements in GPT-4 Tokenizer.md │ ├── 16. Writing Tokenization Code.md │ ├── 17. Understanding Unicode and UTF-8 Encoding.md │ ├── 18. Implementing Byte Pair Encoding.md │ ├── 19. Training the Tokenizer.md │ ├── 2. Naive Tokenization and Its Limitations.md │ ├── 20. Encoding and Decoding with the Tokenizer.md │ ├── 21. Special Tokens and Their Usage.md │ ├── 22. Tokenization in State-of-the-Art LLMs.md │ ├── 23. Using SentencePiece for Tokenization.md │ ├── 24. Recap and Final Thoughts.md │ ├── 3. Character-Level Tokenization.md │ ├── 4. Embedding Table and Token Representation.md │ ├── 5. Advanced Tokenization Schemes.md │ ├── 6. Byte Pair Encoding Algorithm.md │ ├── 7. Tokenization in GPT-2 Paper.md │ ├── 8. Building Our Own Tokenizer.md │ ├── 9. Complexities of Tokenization.md │ └── index.md └── scripts └── convert.py /README.md: -------------------------------------------------------------------------------- 1 | # YouTube to Post 📺 ->📝 2 | 3 | Convert educational videos into equivalent written materials. 4 | 5 | Based on the challenge from [Andrej Karpathy](https://twitter.com/karpathy/status/1760740503614836917): 6 | 7 | > Fun LLM challenge that I'm thinking about: take my 2h13m tokenizer video and translate the video into the format of a 8 | > book chapter (or a blog post) on tokenization. 9 | > Something like: 10 | > 11 | > 1. Whisper the video 12 | > 2. Chop up into segments of aligned images and text 13 | > 3. Prompt engineer an LLM to translate piece by piece 14 | > 4. Export as a page, with links citing parts of original video 15 | > 16 | > More generally, a workflow like this could be applied to any input video and auto-generate "companion guides" for 17 | > various tutorials in a more readable, skimmable, searchable format. Feels tractable but non-trivial. 18 | 19 | 20 | ## Generated posts 21 | [Let's build the GPT Tokenizer](./pages/Let's%20build%20the%20GPT%20Tokenizer/index) 22 | 23 | 24 | ## Prompts & scripts 25 | [This Wordware prompt](https://app.wordware.ai/r/b058e9c3-ffee-4661-a5e3-c788eef0dfbc) takes in the JSON output from 26 | running Whisper (I used [this one](https://replicate.com/vaibhavs10/incredibly-fast-whisper) on 27 | [Replicate](https://replicate.com/)) and processes it into sections of a written lesson. 28 | 29 | The simple script in `scripts/convert.py` turns the output into a set of Markdown files that are then served via GitHub 30 | pages. 31 | 32 | ### Getting the transcript 33 | [Here](https://gist.github.com/wordware-ai/95312691264f66c7a893ab1dfea15807) is the output from running Whisper on the 34 | audio track of [Andrej's Tokenizer video](https://www.youtube.com/watch?v=zduSFxRajkE). 35 | 36 | Alternatively you can run [`youtube-dl`](https://github.com/ytdl-org/youtube-dl) on the video e.g. 37 | ```commandline 38 | youtube-dl --extract-audio --audio-format mp3 "https://www.youtube.com/watch?v=" 39 | ``` 40 | then run it through a transcription model like [this one](https://replicate.com/vaibhavs10/incredibly-fast-whisper). 41 | -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/1. 
Introduction to Tokenization.md: -------------------------------------------------------------------------------- 1 | # Introduction to Tokenization 2 | 3 | ### Introduction to Tokenization in Large Language Models 4 | 5 | Tokenization is a fundamental process in the functioning of large language models (LLMs). It is the step where raw text is converted into a sequence of tokens, which are essentially the building blocks that LLMs understand and manipulate. 6 | 7 | #### The Importance of Tokenization 8 | 9 | Understanding tokenization is crucial when working with LLMs because it is intricately tied to many of the peculiar behaviors and limitations observed in these models. Issues with spelling, handling of non-English languages, and even certain security concerns can often be traced back to how tokenization is handled. 10 | 11 | #### What is Tokenization? 12 | 13 | Tokenization is the process of converting a string of text into a sequence of tokens. These tokens are then used by the LLM to understand and generate text. In a naive tokenization approach, we might simply split text into words or characters, but modern LLMs use more sophisticated methods. 14 | 15 | #### Character-Level Tokenization 16 | 17 | In the simplest form of tokenization, each character in a text is treated as a token. This means that the text "hi there" would be broken down into individual characters 'h', 'i', ' ', 't', 'h', 'e', 'r', 'e', each assigned a unique integer token. This approach, however, is quite limited and inefficient for large language models. 18 | 19 | #### Token Representation with Embedding Tables 20 | 21 | Tokens are not used directly by the LLM. Instead, they are passed through an embedding table that converts each token into a vector of real numbers. These vectors are then used as input for the LLM. The embedding table is a matrix where each row corresponds to a token and contains trainable parameters that are adjusted during the model's learning process. 22 | 23 | #### Advanced Tokenization Schemes 24 | 25 | State-of-the-art LLMs use advanced tokenization schemes that go beyond character-level tokenization. These schemes involve chunking text into larger pieces and using algorithms to construct a vocabulary of these chunks. One such algorithm is Byte Pair Encoding (BPE), which we will explore in detail. 26 | 27 | #### Byte Pair Encoding (BPE) 28 | 29 | BPE is a method used to construct token vocabularies by iteratively merging the most frequently occurring pairs of characters or bytes. For example, if 'aa' is the most common pair in a dataset, it would be merged into a single token, reducing the sequence length. This process is repeated until a desired vocabulary size is reached. 30 | 31 | #### Tokenization in Practice: GPT-2 Example 32 | 33 | The GPT-2 paper introduced BPE in the context of LLMs. It describes how tokenization works in their model, with a vocabulary size of 50,257 tokens and a context size of 1,024 tokens. This means that at any given time, the model can consider a sequence of up to 1,024 tokens to generate the next piece of text. 34 | 35 | #### Building Our Own Tokenizer 36 | 37 | Building a tokenizer from scratch allows for a deeper understanding of the process. By implementing BPE ourselves, we can see exactly how tokens are created and how they influence the performance and behavior of an LLM. 38 | 39 | #### The Complexities of Tokenization 40 | 41 | Tokenization is not without its complexities. 
It can introduce issues such as difficulty in spelling tasks, problems with non-English languages, and challenges in handling programming languages. Understanding these complexities is essential for effectively working with LLMs. 42 | 43 | #### Live Demonstration of Tokenization 44 | 45 | A live demonstration of tokenization can be illustrative. For example, using a web application that tokenizes input text in real-time can show how different strings are broken down into tokens and how this affects the processing by an LLM. 46 | 47 | #### Conclusion 48 | 49 | Tokenization is a critical yet complex part of working with LLMs. It influences many aspects of a model's behavior and performance. By understanding tokenization, we can better grasp the capabilities and limitations of these powerful models and work towards improving them. 50 | 51 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=0) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/10. Live Demonstration of Tokenization.md: -------------------------------------------------------------------------------- 1 | # Live Demonstration of Tokenization 2 | 3 | ### 10. Live Demonstration of Tokenization 4 | 5 | Tokenization is a critical process in the functioning of large language models (LLMs). It involves converting strings of text into sequences of tokens, which are essentially integers that the model can process. This process is not as straightforward as it might seem, and it's filled with nuances and complexities that can significantly affect the performance of LLMs. 6 | 7 | #### Understanding Tokens and Tokenization 8 | 9 | Tokens are the fundamental units in LLMs, serving as the atomic elements that the models perceive and manipulate. The process of tokenization translates text into tokens and vice versa, which is crucial for feeding data into the models and interpreting their outputs. 10 | 11 | #### The Role of Tokenization in LLMs 12 | 13 | Tokenization impacts the behavior of LLMs in various ways. Issues with tokenization can lead to difficulties in tasks such as spelling, string processing, and handling non-English languages or even simple arithmetic. The way tokenization is handled can also affect the model's performance with programming languages, as seen with different versions of GPT. 14 | 15 | #### Demonstration Using a Web Application 16 | 17 | To illustrate how tokenization works in practice, we can use a web application like `tiktokenizer.versal.app`. This app runs tokenization live in your browser, allowing you to input text and see how it gets tokenized using different tokenizer versions, such as GPT-2 or GPT-4. 18 | 19 | For example, inputting an English sentence like "hello world" will show you how the tokenizer breaks it into tokens. You can also see the tokenization of arithmetic expressions and observe how numbers are sometimes tokenized as single units or split into multiple tokens, which can complicate arithmetic processing for LLMs. 20 | 21 | #### Comparing Tokenizers 22 | 23 | Different tokenizers handle text in various ways. GPT-2, for instance, has a tendency to tokenize whitespace as individual tokens, which can be inefficient, particularly for programming languages like Python that use indentation. GPT-4 improves upon this by grouping whitespace more effectively, allowing for denser representation of code and better performance on coding-related tasks. 
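This difference is easy to verify locally with OpenAI's open-source `tiktoken` library rather than the web app. The following is a minimal sketch, assuming `tiktoken` is installed (e.g. via `pip install tiktoken`):

```python
# Compare how the GPT-2 and GPT-4 tokenizers handle indented Python code.
import tiktoken

code = '''for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
'''

gpt2 = tiktoken.get_encoding("gpt2")         # tokenizer used by GPT-2
gpt4 = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4

print("GPT-2 tokens:", len(gpt2.encode(code)))
print("GPT-4 tokens:", len(gpt4.encode(code)))
# The GPT-4 count is typically much smaller here, largely because runs of
# leading spaces are grouped into single tokens instead of one token each.
```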
24 | 25 | #### Building Your Own Tokenizer 26 | 27 | You can also build your own tokenizer using algorithms like Byte Pair Encoding (BPE). By understanding the complexities of tokenization, you can create a tokenizer that suits your specific needs, whether for English text, arithmetic, or other languages. 28 | 29 | #### Challenges and Considerations 30 | 31 | Tokenization is at the heart of many issues in LLMs. From odd behaviors to performance limitations, many problems can be traced back to how tokenization is implemented. It's essential to approach tokenization with a thorough understanding of its implications and to recognize that it's more than just a preliminary step—it's a crucial component that shapes the capabilities and limitations of your language model. 32 | 33 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=351) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/11. Tokenization of English Sentences.md: -------------------------------------------------------------------------------- 1 | # Tokenization of English Sentences 2 | 3 | ### Tokenization of English Sentences 4 | 5 | Tokenization is a critical process in the operation of large language models (LLMs). It is the process of converting strings of text into sequences of tokens, which are essentially the building blocks that LLMs understand and manipulate. These tokens are not just words or characters; they are often complex combinations of characters that models use to encode information efficiently. 6 | 7 | #### The Tokenization Process 8 | 9 | When we tokenize a string in Python, we're converting text into a list of tokens. For instance, the string "hi there" might be tokenized into a sequence of integers, each representing a specific token or character in the model's vocabulary. In a character-level tokenizer, each character in the input string is directly mapped to a corresponding token. 10 | 11 | However, state-of-the-art language models use more sophisticated tokenization schemes that work on chunks of characters, not just individual characters. These chunks are constructed using algorithms like byte pair encoding (BPE), which we will discuss in more detail. 12 | 13 | #### Byte Pair Encoding (BPE) 14 | 15 | BPE is an algorithm used to compress data by replacing common pairs of bytes or characters with a single byte. In the context of tokenization, it helps in creating a vocabulary of tokens based on the frequency of character pairs in the training dataset. For example, if the pair "hi" occurs frequently, it might be replaced with a single token. 16 | 17 | In the GPT-2 paper, the authors use BPE to construct a vocabulary of 50,257 possible tokens. Each token can be thought of as an atom in the LLM, and every operation or prediction made by the model is done in terms of these tokens. 18 | 19 | #### Building Our Own Tokenizer 20 | 21 | When building a tokenizer, we typically start with a large corpus of text, such as a dataset of Shakespeare's works. We then identify all the unique characters in the dataset and assign each a unique token. Using BPE, we iteratively merge the most frequent pairs of tokens, creating a hierarchy of tokens representing various character combinations. 22 | 23 | This process allows us to compress the text data into a shorter sequence of tokens, which is crucial because LLMs have a finite context window. 
By compressing the text, we enable the model to consider a broader context, which is essential for understanding and generating coherent text. 24 | 25 | #### Complexities of Tokenization 26 | 27 | Tokenization is not without its challenges. One of the main issues is that tokenization can introduce oddities and inefficiencies that affect the performance of LLMs. For instance, tokenization can make simple string processing difficult, affect the model's ability to perform arithmetic, and lead to suboptimal performance with non-English languages. These issues often stem from the arbitrary ways in which text is broken down into tokens. 28 | 29 | Furthermore, the way we tokenize text can have a significant impact on the efficiency of the model. For example, tokenizing Python code can be inefficient if the tokenizer treats whitespace as individual tokens, leading to bloated sequences that exhaust the model's context window. 30 | 31 | #### Conclusion 32 | 33 | Tokenization is a foundational aspect of LLMs that directly influences their capabilities and limitations. A well-designed tokenizer can greatly enhance a model's performance, while a poorly designed one can introduce a range of issues. Understanding the intricacies of tokenization is therefore essential for anyone working with LLMs. 34 | 35 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=365) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/12. Tokenization of Arithmetic.md: -------------------------------------------------------------------------------- 1 | # Tokenization of Arithmetic 2 | 3 | ### Tokenization of Arithmetic 4 | 5 | Tokenization is the process by which raw text is converted into a sequence of tokens, which are the basic units that a language model (LM) understands and processes. This process is crucial for the performance of language models, especially when dealing with tasks that involve understanding and manipulating text at a character or symbol level, such as arithmetic. 6 | 7 | #### The Challenge with Arithmetic Tokenization 8 | 9 | Arithmetic poses a unique challenge for tokenization in language models because arithmetic operations are inherently character-level processes. For example, when adding numbers, humans typically align digits by their place value, carry over numbers during addition, and perform operations digit by digit. Language models that tokenize input into larger chunks may struggle with these tasks because the tokenization process can obscure the character-level details necessary for arithmetic. 10 | 11 | #### How Tokenization Affects Arithmetic in LLMs 12 | 13 | Large language models (LLMs) like GPT-2 often tokenize numbers in a way that can be arbitrary and non-intuitive. During the tokenization process, a number like "127" might be tokenized as a single unit, while "677" might be split into two separate tokens, "6" and "77". This inconsistent tokenization can make it difficult for the model to understand and perform arithmetic operations correctly. 14 | 15 | For example, if a model encounters the number "804", it may tokenize it into "8" and "04", treating these as two separate entities. When asked to perform arithmetic involving "804", the model must reconcile the fact that it's working with two tokens that it has learned to interpret as separate concepts, rather than as a single number. 
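To make this concrete, the splits can be inspected directly. Here is a small sketch using OpenAI's open-source `tiktoken` library; the exact splits depend on the tokenizer version, so the output is printed rather than assumed:

```python
# Inspect how the GPT-2 tokenizer splits different numbers.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["127", "677", "804", "1275", "127 + 677 = 804"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]   # the text chunk behind each token
    print(f"{text!r:>20} -> {pieces}")
# Some numbers come out as a single token while others are split into
# arbitrary-looking chunks, which is exactly what makes digit-by-digit
# arithmetic awkward for a language model.
```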
16 | 17 | #### Tokenization Schemes and Arithmetic 18 | 19 | Different tokenization schemes can lead to different levels of performance in arithmetic tasks. For instance, character-level tokenization might be more suitable for arithmetic as it preserves the individual digits and their order. However, character-level tokenization can lead to very long sequences for large texts, which is computationally inefficient. 20 | 21 | Advanced tokenization schemes like Byte Pair Encoding (BPE) aim to strike a balance by creating a vocabulary of frequently occurring character chunks. BPE can help reduce sequence length while still preserving some character-level information. However, the way BPE chunks numbers can still be arbitrary, affecting the model's ability to perform arithmetic. 22 | 23 | #### Improving Arithmetic Tokenization 24 | 25 | The tokenization of arithmetic can be improved in several ways. One approach is to design tokenizers that are more sensitive to the structure of numbers. For example, ensuring that numbers are tokenized consistently, without splitting digits in a way that loses their numerical meaning, can help LLMs better understand and perform arithmetic. 26 | 27 | In the case of GPT-4, the tokenizer was improved to be more efficient in handling numbers and Python code by grouping more whitespace characters into single tokens. This denser representation allows the model to attend to more context, which is especially beneficial for tasks like coding where whitespace is meaningful. 28 | 29 | #### Implementing Arithmetic Tokenization 30 | 31 | When building a tokenizer, one must consider how it will handle different types of text, including arithmetic. Implementing a tokenizer involves defining a vocabulary and creating rules for how text is split into tokens. For arithmetic, it's important to ensure that numbers are tokenized in a way that preserves their mathematical properties. 32 | 33 | For instance, a simple tokenizer might split text into tokens based on whitespace and punctuation, but this approach could be problematic for arithmetic where whitespace isn't always meaningful, and numbers need to be kept intact. A more sophisticated tokenizer might use patterns or algorithms to identify numbers and tokenize them as whole units. 34 | 35 | #### Conclusion 36 | 37 | Tokenization is a foundational aspect of language model performance, with significant implications for tasks like arithmetic. The way a tokenizer handles numbers can either enable or hinder a model's ability to perform arithmetic operations. As such, careful design and implementation of tokenization strategies are essential for building LLMs that are competent in arithmetic and other tasks requiring fine-grained text manipulation. 38 | 39 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=439) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/13. Tokenization of Non-English Languages.md: -------------------------------------------------------------------------------- 1 | # Tokenization of Non-English Languages 2 | 3 | ### Tokenization of Non-English Languages 4 | 5 | Tokenization plays a critical role in the performance of large language models (LLMs) across various languages. It is essential to understand that tokenization schemes significantly impact the model's ability to process and generate text, especially in non-English languages. 
6 | 7 | #### The Challenge with Non-English Tokenization 8 | 9 | Tokenization can be particularly challenging for non-English languages due to several reasons: 10 | 11 | 1. **Training Data Imbalance**: LLMs like GPT-2 or GPT-3 are often trained on datasets predominantly composed of English text. This imbalance leads to a tokenizer that is highly optimized for English, at the expense of other languages. 12 | 13 | 2. **Complex Tokenization**: Non-English languages may have different grammatical structures, scripts, and character sets that complicate the tokenization process. For example, languages like Chinese or Korean do not use spaces to delimit words, and languages like Arabic have script-specific features like cursive writing and diacritical marks. 14 | 15 | 3. **Sequence Length Bloat**: When tokenizing non-English text, the resulting token sequences tend to be longer than their English counterparts. This bloat occurs because the tokenizer, trained predominantly on English data, fails to efficiently chunk non-English text. Consequently, the model has a shorter effective context length for non-English languages due to more tokens being used to represent the same content. 16 | 17 | 4. **Lack of Token Merges**: In English, common phrases or words might be merged into single tokens, reducing the sequence length. However, due to less frequent exposure to non-English phrases or words during tokenizer training, such efficient merges are less likely to occur, leading to longer token sequences for non-English text. 18 | 19 | #### Example: Tokenization of Korean Text 20 | 21 | Consider the Korean greeting "안녕하세요" (annyeonghaseyo), which means "hello." In an LLM trained predominantly on English, this phrase might not be efficiently tokenized. Instead of being represented as a single token or a small number of tokens, it might be broken down into a larger number of tokens, each representing smaller pieces of the phrase. This inefficient tokenization stretches out the representation of the phrase, consuming more of the model's context window. 22 | 23 | #### Addressing Non-English Tokenization Issues 24 | 25 | To improve tokenization for non-English languages, the following approaches can be considered: 26 | 27 | 1. **Balanced Training Sets**: Ensure that the training set for the tokenizer includes a representative mix of languages. This balance allows the tokenizer to learn more efficient token merges for a variety of languages, not just English. 28 | 29 | 2. **Specialized Tokenizers**: Develop language-specific tokenizers that are trained on large datasets of the target language. These tokenizers can better capture the nuances and common patterns of the language. 30 | 31 | 3. **Post-Training Adjustments**: After training a general tokenizer, make adjustments to the vocabulary by adding more tokens specific to non-English languages or by fine-tuning the tokenizer on a more balanced dataset. 32 | 33 | 4. **Increased Context Window**: Design LLMs with a larger context window to mitigate the impact of sequence length bloat for non-English languages. 34 | 35 | 5. **Script-Specific Preprocessing**: Apply preprocessing steps that are tailored to the writing system of the non-English language before tokenization. For example, segmenting text into words for languages that do not use spaces. 36 | 37 | By taking these steps, we can create tokenizers that are more inclusive and performant across diverse languages, leading to LLMs that are truly global in their capabilities. 
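The sequence length bloat described above can be measured directly. A short sketch using OpenAI's `tiktoken` library with the GPT-2 vocabulary, where the Korean phrase is a rough equivalent chosen for illustration:

```python
# Compare token counts for an English phrase and a rough Korean equivalent.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

english = "Hello, how are you?"
korean = "안녕하세요, 어떻게 지내세요?"

print("English tokens:", len(enc.encode(english)))
print("Korean tokens: ", len(enc.encode(korean)))
# The Korean sentence uses noticeably more tokens for roughly the same
# content, so it burns through the model's context window much faster.
```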
38 | 39 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=570) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/14. Tokenization of Programming Languages.md: -------------------------------------------------------------------------------- 1 | # Tokenization of Programming Languages 2 | 3 | ### Tokenization of Programming Languages 4 | 5 | Tokenization is a critical process in understanding and utilizing large language models (LLMs), especially when dealing with programming languages. Programming languages offer a unique set of challenges for tokenization due to their structured syntax and the importance of whitespace and special characters. 6 | 7 | #### The Challenge with Tokenization in Programming Languages 8 | 9 | Programming languages are different from natural languages in that they are designed to be parsed by machines, which means they have a strict syntax and structure. Tokenization of programming languages, therefore, requires careful consideration of these structures to ensure that the language model can understand and generate code effectively. 10 | 11 | Whitespace, for instance, is significant in programming languages like Python, where indentation levels determine the scope of code blocks. Incorrect tokenization of whitespace can lead to code that is syntactically incorrect or has a different meaning than intended. 12 | 13 | #### Example: Python and Whitespace 14 | 15 | Consider a simple Python code snippet for the FizzBuzz problem: 16 | 17 | ```python 18 | for i in range(1, 101): 19 | if i % 3 == 0 and i % 5 == 0: 20 | print("FizzBuzz") 21 | elif i % 3 == 0: 22 | print("Fizz") 23 | elif i % 5 == 0: 24 | print("Buzz") 25 | else: 26 | print(i) 27 | ``` 28 | 29 | In this example, the indentation is critical. If a tokenizer treats each space as an individual token, the resulting token sequence would be unnecessarily long and inefficient. This inefficiency is problematic for LLMs since they have a finite context window in which they can attend to tokens. Excessive tokenization of whitespace can consume this context window rapidly, leaving less room for the actual logical components of the code. 30 | 31 | #### Tokenization Improvements in GPT-4 32 | 33 | Comparing GPT-2 and GPT-4 reveals an improvement in handling whitespace in Python code. GPT-2 tokenizes each space individually, leading to a bloated token sequence. On the other hand, GPT-4 groups multiple spaces into a single token, allowing for a denser and more efficient representation of the code. 34 | 35 | This improvement in GPT-4's tokenizer is a deliberate design choice, optimizing the tokenization process for programming languages and resulting in better performance when generating or understanding code. 36 | 37 | #### Building a Custom Tokenizer for Programming Languages 38 | 39 | When constructing a tokenizer for programming languages, one must consider the following: 40 | 41 | - **Character-Level Tokenization**: Starting with a character-level tokenizer as a base is often not sufficient for programming languages due to the significance of whitespace and the need to recognize multi-character tokens like keywords and operators. 42 | 43 | - **Advanced Schemes**: Utilizing advanced tokenization schemes, such as Byte Pair Encoding (BPE), can help create a more suitable vocabulary for programming languages. 
BPE can merge frequently occurring character sequences into single tokens, which can include language keywords, common variable names, or function calls. 44 | 45 | - **Special Characters**: Programming languages often use special characters like braces, parentheses, and operators. A tokenizer must be able to recognize these characters and give them appropriate significance in the token sequence. 46 | 47 | - **Unicode and UTF-8**: Programming languages can include Unicode characters, especially in string literals or comments. A tokenizer must handle Unicode characters correctly, often using UTF-8 encoding. 48 | 49 | #### Conclusion 50 | 51 | Tokenization of programming languages is a complex task that requires a nuanced approach to handle the structured nature of code. Improvements in tokenizers, such as those seen from GPT-2 to GPT-4, demonstrate the importance of optimizing tokenization for programming contexts. Building a custom tokenizer for programming languages involves considering whitespace handling, advanced tokenization schemes, special character recognition, and Unicode support. By addressing these aspects, one can create a tokenizer that is more aligned with the syntactic and structural requirements of programming languages, leading to better performance in language models that process code. 52 | 53 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=684) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/15. Improvements in GPT-4 Tokenizer.md: -------------------------------------------------------------------------------- 1 | # Improvements in GPT-4 Tokenizer 2 | 3 | ### Section 15: Improvements in GPT-4 Tokenizer 4 | 5 | The evolution of tokenization from GPT-2 to GPT-4 has brought about significant improvements in the way large language models (LLMs) understand and generate text. Tokenization, the process of converting strings of text into sequences of tokens, is a foundational step in training and utilizing LLMs. In this section, we will discuss the advancements made in the tokenizer used for GPT-4. 6 | 7 | #### GPT-2 Tokenizer Limitations 8 | 9 | In GPT-2, tokenization was based on a character-level scheme that often resulted in inefficiencies. For example, the tokenizer would treat whitespace and Python indentation with the same importance as other characters, leading to token bloating. This inefficiency was particularly evident when dealing with programming languages like Python, where indentation is syntactically significant. As a result, GPT-2's ability to understand and generate code was less than optimal. 10 | 11 | #### GPT-4 Tokenizer Enhancements 12 | 13 | GPT-4 introduced a tokenizer with a larger vocabulary size, approximately doubling from GPT-2's 50k tokens to 100k tokens. This expansion allows the model to represent text more densely, effectively doubling the context the model can see and process. With a larger vocabulary, the same text is represented with fewer tokens, which is beneficial for the transformer architecture that GPT models use. 14 | 15 | One of the most notable improvements in GPT-4's tokenizer is its handling of whitespace, particularly in the context of programming languages. The tokenizer now groups multiple whitespace characters into single tokens, which densifies the representation of code and allows the transformer to attend to more relevant parts of the code, improving its ability to predict the next token in a sequence. 
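The vocabulary sizes quoted above are easy to confirm; a minimal check using the `tiktoken` library:

```python
# Confirm the vocabulary sizes of the GPT-2 and GPT-4 tokenizers.
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")
gpt4 = tiktoken.get_encoding("cl100k_base")

print("GPT-2 vocabulary size:", gpt2.n_vocab)  # 50,257
print("GPT-4 vocabulary size:", gpt4.n_vocab)  # roughly 100k
```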
16 | 17 | This change in handling whitespace is a deliberate design choice by OpenAI, reflecting a deeper understanding of how tokenization impacts model performance, especially in specialized tasks like code generation. By optimizing the tokenizer for common patterns in programming languages, GPT-4 shows marked improvements in code-related tasks. 18 | 19 | #### Practical Implications 20 | 21 | The improvements in GPT-4's tokenizer have several practical implications: 22 | 23 | 1. **Efficient Tokenization**: The model can process text more efficiently, with a particular improvement in handling programming languages and structured data. 24 | 2. **Expanded Context**: GPT-4 can consider a larger context when making predictions, which is crucial for tasks that require understanding longer passages of text. 25 | 3. **Specialized Performance**: By optimizing for whitespace in programming languages, GPT-4 becomes more adept at code generation and understanding. 26 | 27 | #### Code Example: Tokenizer Comparison 28 | 29 | To illustrate the difference in tokenization between GPT-2 and GPT-4, consider the following Python code snippet: 30 | 31 | ```python 32 | # Python code snippet for FizzBuzz 33 | for i in range(1, 16): 34 | if i % 3 == 0 and i % 5 == 0: 35 | print("FizzBuzz") 36 | elif i % 3 == 0: 37 | print("Fizz") 38 | elif i % 5 == 0: 39 | print("Buzz") 40 | else: 41 | print(i) 42 | ``` 43 | 44 | In GPT-2, each space would be tokenized separately, leading to a long sequence of tokens. In GPT-4, the tokenizer groups these spaces, resulting in a more compact token sequence that the model can process more effectively. 45 | 46 | #### Conclusion 47 | 48 | The tokenizer is a crucial component of LLMs, and the improvements in GPT-4's tokenizer demonstrate OpenAI's commitment to enhancing the model's understanding of language and code. These advancements not only improve the efficiency of the model but also open up new possibilities for its application, particularly in areas where understanding context and structure is essential. 49 | 50 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=748) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/16. Writing Tokenization Code.md: -------------------------------------------------------------------------------- 1 | # Writing Tokenization Code 2 | 3 | ### Writing Tokenization Code 4 | 5 | In this section, we delve into the process of writing code for tokenization, which is a critical step in preparing text data for use in large language models (LLMs). Tokenization is the conversion of strings of text into sequences of tokens, which are essentially integers that represent chunks of text. This process is not straightforward and comes with complexities that can significantly impact the performance of LLMs. 6 | 7 | #### Initial Character-Level Tokenization 8 | 9 | The initial step in tokenization often starts at the character level, where we create a vocabulary of possible characters in a given dataset. For instance, considering a dataset containing Shakespeare's work, we might have 65 unique characters. We then map each character to a unique integer, creating a one-to-one correspondence between characters and integers. 
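A character-level scheme also makes encoding and decoding almost trivial. The following sketch builds the mapping from a toy string standing in for the full dataset:

```python
# A minimal character-level tokenizer, using a toy string in place of the
# full dataset (the Shakespeare corpus would give 65 unique characters).
text = "hi there"

chars = sorted(set(text))                      # the vocabulary of characters
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer token
itos = {i: ch for ch, i in stoi.items()}       # integer token -> character

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

tokens = encode("hi there")
print(tokens)           # one integer per character
print(decode(tokens))   # round-trips back to "hi there"
```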
10 | 11 | ```python 12 | # Example of character-level tokenization 13 | vocabulary = {'a': 1, 'b': 2, 'c': 3, ...} 14 | ``` 15 | 16 | #### From Characters to Embeddings 17 | 18 | To feed tokens into an LLM, we use an embedding table, which is a matrix where each row corresponds to a token's embedding vector. These embeddings are learned during training and provide a dense representation of tokens that the model can work with. 19 | 20 | ```python 21 | # Example of using an embedding table 22 | embedding_table = [[...], [...], ...] # 65 rows for 65 tokens 23 | ``` 24 | 25 | #### The Need for Advanced Tokenization 26 | 27 | Character-level tokenization is naive and not efficient for state-of-the-art LLMs. Instead, more complex schemes are used to create token vocabularies, often based on larger chunks of text rather than individual characters. This approach reduces the sequence length and captures more meaning in each token. 28 | 29 | #### Byte Pair Encoding (BPE) 30 | 31 | The Byte Pair Encoding (BPE) algorithm is a popular method for creating a subword-level tokenization scheme. BPE iteratively merges the most frequent pairs of characters or character sequences in the dataset, creating a new token for each merged pair. This process continues until a specified vocabulary size is reached. 32 | 33 | #### Implementing BPE 34 | 35 | To implement BPE, we start by encoding text into UTF-8 byte sequences, as most text data is represented in Unicode. After encoding, we apply the BPE algorithm to these byte sequences, compressing them based on frequency of occurrence. The result is a set of tokens that represent commonly occurring character sequences in the text. 36 | 37 | ```python 38 | # Pseudocode for BPE implementation 39 | def byte_pair_encoding(text): 40 | byte_sequence = utf8_encode(text) 41 | vocab = initialize_vocab(byte_sequence) 42 | while vocab_size < desired_size: 43 | pair_to_merge = find_most_frequent_pair(byte_sequence) 44 | byte_sequence = merge_pair(byte_sequence, pair_to_merge) 45 | vocab.add(new_token_for_merged_pair) 46 | return vocab 47 | ``` 48 | 49 | #### Encoding and Decoding 50 | 51 | With a trained tokenizer, we can encode new text into token sequences and decode token sequences back into text. Encoding involves transforming text into a sequence of integers based on the BPE merges, while decoding reverses this process, reconstructing the original text from token sequences. 52 | 53 | ```python 54 | # Encoding text into tokens 55 | encoded_tokens = tokenizer.encode("Example text to encode") 56 | 57 | # Decoding tokens back into text 58 | decoded_text = tokenizer.decode(encoded_tokens) 59 | ``` 60 | 61 | #### Special Tokens 62 | 63 | In addition to tokens generated from text, we often use special tokens to denote specific meanings, such as the start or end of a text. These special tokens are manually added to the vocabulary and handled separately in the tokenization code. 64 | 65 | #### Conclusion 66 | 67 | Tokenization is a foundational component in the preprocessing pipeline for LLMs. Writing tokenization code involves understanding the intricacies of character encoding, the principles behind algorithms like BPE, and the nuances of handling special tokens. By mastering tokenization, we ensure that LLMs receive data in a format that maximizes their potential for learning and generating meaningful text. 
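To tie the pieces above together, here is a compact sketch of what `tokenizer.decode` and `tokenizer.encode` can look like once BPE training has produced a `merges` dictionary. The dictionary is left empty below as a placeholder; it is assumed to map a pair of token ids to the id of the token they were merged into, in training order:

```python
# A sketch of decode/encode on top of a trained byte-level BPE tokenizer.
merges = {}  # placeholder, e.g. {(101, 32): 256, ...}, filled in by training

# Vocabulary: token id -> raw bytes (the 256 byte values, then every merge).
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():           # relies on insertion order
    vocab[idx] = vocab[p0] + vocab[p1]

def decode(ids):
    """Token ids -> text: concatenate the bytes, then interpret as UTF-8."""
    data = b"".join(vocab[idx] for idx in ids)
    return data.decode("utf-8", errors="replace")

def encode(text):
    """Text -> token ids: repeatedly apply the earliest-learned merge."""
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:
        pairs = set(zip(tokens, tokens[1:]))
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                              # no merge applies any more
        idx, out, i = merges[pair], [], 0
        while i < len(tokens):                 # replace every occurrence of the pair
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(idx)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

print(decode(encode("hello 😊")))              # round-trips even with no merges
```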
68 | 69 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=896) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/17. Understanding Unicode and UTF-8 Encoding.md: -------------------------------------------------------------------------------- 1 | # Understanding Unicode and UTF-8 Encoding 2 | 3 | ### Understanding Unicode and UTF-8 Encoding 4 | 5 | In the realm of natural language processing and particularly in the context of large language models (LLMs), understanding how text is represented and processed is crucial. This understanding begins with the concept of Unicode and the UTF-8 encoding. 6 | 7 | #### Unicode Code Points 8 | 9 | Unicode is a standard maintained by the Unicode Consortium, which provides a unique number for every character, regardless of the platform, program, or language. The Unicode standard includes characters from various languages, symbols, and even emojis. As of the latest version, Unicode 15.1 (September 2023), there are over 150,000 characters covering 161 scripts. 10 | 11 | In Python, strings are sequences of Unicode code points. Each character in a string is associated with a Unicode code point, which can be retrieved using the `ord()` function. 12 | 13 | For example: 14 | ```python 15 | print(ord('H')) # Outputs: 104 16 | print(ord('😀')) # Outputs: 128512 17 | ``` 18 | 19 | This `ord()` function gives us the integer representation of a Unicode character. Conversely, the `chr()` function can be used to get the character represented by a Unicode code point. 20 | 21 | #### UTF-8 Encoding 22 | 23 | To store or transmit text represented by Unicode code points, we need a way to encode these into a sequence of bytes. This is where UTF-8 comes into play. UTF-8 is one of the most commonly used encodings for Unicode text because it offers several advantages, including backward compatibility with ASCII and efficient use of space. 24 | 25 | UTF-8 is a variable-length encoding, meaning it uses 1 to 4 bytes to represent a Unicode code point, depending on the character. ASCII characters (U+0000 to U+007F) are encoded in a single byte, while other characters use two, three, or four bytes. 26 | 27 | In Python, we can encode a string to its UTF-8 byte representation using the `.encode()` method: 28 | ```python 29 | string = "hello 😊" 30 | encoded_string = string.encode('utf-8') 31 | print(list(encoded_string)) 32 | # Outputs: [104, 101, 108, 108, 111, 32, 240, 159, 152, 138] 33 | ``` 34 | 35 | Decoding from bytes to a string is done with the `.decode()` method: 36 | ```python 37 | decoded_string = encoded_string.decode('utf-8') 38 | print(decoded_string) # Outputs: "hello 😊" 39 | ``` 40 | 41 | #### Byte Pair Encoding (BPE) 42 | 43 | While UTF-8 is efficient, using raw bytes as tokens for an LLM would result in extremely long sequences and would not be practical. To address this, we use a compression algorithm like Byte Pair Encoding (BPE). 44 | 45 | BPE works by iteratively merging the most frequent pairs of bytes or characters into a single new token, thus reducing the sequence length while slightly increasing the vocabulary size. This process is repeated until a desired vocabulary size or compression ratio is achieved. 46 | 47 | #### Implementing BPE 48 | 49 | To implement BPE in Python, we would start by encoding our text into UTF-8 bytes. We would then identify the most common pair of bytes and replace all occurrences with a new token. 
This process is repeated until we reach the desired vocabulary size. 50 | 51 | Here's a high-level overview of the steps: 52 | 53 | 1. Convert text to UTF-8 bytes. 54 | 2. Count the frequency of each byte pair in the data. 55 | 3. Replace the most frequent pair with a new token. 56 | 4. Update the token frequency count. 57 | 5. Repeat steps 2-4 until the desired vocabulary size is reached. 58 | 59 | #### Special Tokens 60 | 61 | In addition to the tokens generated by BPE, LLMs often use special tokens for specific purposes, such as marking the end of a sentence or segment. These special tokens are manually added to the tokenizer's vocabulary and have specific functions in the processing and generation of text. 62 | 63 | #### Conclusion 64 | 65 | Understanding Unicode, UTF-8 encoding, and BPE is essential for working with LLMs. These concepts underpin how text is converted into a format that can be processed by neural networks. By efficiently compressing text into tokens, we can feed more context into the LLMs, enabling them to generate better and more coherent outputs. 66 | 67 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=899) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/18. Implementing Byte Pair Encoding.md: -------------------------------------------------------------------------------- 1 | # Implementing Byte Pair Encoding 2 | 3 | ### Section 18: Implementing Byte Pair Encoding (BPE) 4 | 5 | Byte Pair Encoding (BPE) is a critical component in the tokenization process for large language models (LLMs), including models like GPT-2 and GPT-4. It allows us to compress a large sequence of text into a smaller sequence of tokens, which can then be used to train and operate these models more efficiently. 6 | 7 | #### Understanding BPE 8 | 9 | BPE is a data compression technique that iteratively replaces the most frequent pair of bytes (or characters) in a text with a single, unused byte. Initially used for data compression, BPE has been adapted for use in NLP for tokenization. The process starts with the raw text and ends with a sequence of tokens that represent the text. 10 | 11 | #### Training BPE 12 | 13 | The training process for BPE involves the following steps: 14 | 15 | 1. **Start with the raw text:** We begin with a large string of text, which is the training set. 16 | 17 | 2. **Convert to bytes:** The text is encoded into UTF-8 bytes, resulting in a sequence of byte tokens. 18 | 19 | 3. **Find the most common pair:** We iterate over the byte sequence to identify the most frequently occurring pair of bytes. 20 | 21 | 4. **Merge the pair:** Once the most common pair is found, we replace all occurrences of this pair with a new token (a new byte that was not used in the original text). 22 | 23 | 5. **Repeat the process:** We continue finding and merging common pairs, adding new tokens for each merged pair, until we reach the desired vocabulary size or until no more merges are possible. 24 | 25 | By doing this, we compress the text and reduce the number of tokens needed to represent it, while also creating a vocabulary that the model can learn from. 
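Before looking at a full implementation, it helps to walk through the classic toy example from the BPE literature, where fresh symbols stand in for the new tokens the algorithm would mint:

```python
# The classic toy example from the BPE literature, one merge per line.
data = "aaabdaaabac"
data = data.replace("aa", "Z")   # "aa" is the most frequent pair -> "ZabdZabac"
data = data.replace("ab", "Y")   # "ab" is now (jointly) most frequent -> "ZYdZYac"
data = data.replace("ZY", "X")   # "ZY" is now most frequent -> "XdXac"
print(data)                      # "XdXac": 11 symbols compressed down to 5
```

The real algorithm counts pair frequencies rather than using string replacement, but the merging idea is exactly this; the next section shows a working implementation.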
26 | 27 | #### Implementing BPE in Python 28 | 29 | Here's a simplified Python implementation of the BPE algorithm: 30 | 31 | ```python 32 | def get_stats(tokens): 33 | pairs = collections.defaultdict(int) 34 | for i in range(len(tokens)-1): 35 | pairs[tokens[i], tokens[i+1]] += 1 36 | return pairs 37 | 38 | def merge_pair(tokens, pair, new_token): 39 | new_tokens = [] 40 | i = 0 41 | while i < len(tokens): 42 | if tokens[i] == pair[0] and i < len(tokens)-1 and tokens[i+1] == pair[1]: 43 | new_tokens.append(new_token) 44 | i += 2 45 | else: 46 | new_tokens.append(tokens[i]) 47 | i += 1 48 | return new_tokens 49 | 50 | # Example usage: 51 | raw_text = "This is an example text for BPE." 52 | tokens = list(raw_text.encode('utf-8')) 53 | vocab_size = 300 # desired vocabulary size 54 | new_token_id = 256 # starting ID for new tokens 55 | 56 | while len(set(tokens)) < vocab_size: 57 | stats = get_stats(tokens) 58 | if not stats: 59 | break 60 | most_common_pair = max(stats, key=stats.get) 61 | tokens = merge_pair(tokens, most_common_pair, new_token_id) 62 | new_token_id += 1 63 | ``` 64 | 65 | In this example, `get_stats` function counts the frequency of each pair of tokens, `merge_pair` function replaces occurrences of the most common pair with a new token, and the loop continues until we reach our desired vocabulary size. 66 | 67 | #### Challenges with BPE 68 | 69 | While BPE is efficient and widely used, it comes with its own complexities: 70 | 71 | - **Tokenization Ambiguity:** The same word or phrase can be tokenized differently depending on context, which can lead to inconsistencies in how the model processes text. 72 | 73 | - **Handling of Rare Words:** Words that are not frequent in the training data may be split into subword tokens, which can lead to less efficient processing by the model. 74 | 75 | - **Special Tokens:** Special tokens (like ``, ``, etc.) are often added to the vocabulary to handle specific cases like the end of a sentence or padding, and they must be managed carefully to ensure the model behaves as expected. 76 | 77 | #### Conclusion 78 | 79 | BPE is a powerful tool for tokenizing text for use in LLMs. It allows us to compress text and create a manageable vocabulary for the model. However, it requires careful implementation and consideration of its limitations and complexities. Understanding BPE is crucial for anyone working with LLMs, as it is often the first step in preparing data for model training and inference. 80 | 81 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=1428) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/19. Training the Tokenizer.md: -------------------------------------------------------------------------------- 1 | # Training the Tokenizer 2 | 3 | ### Training the Tokenizer 4 | 5 | Tokenization is a critical preprocessing step in working with large language models (LLMs). It involves converting raw text into a sequence of tokens that a model can understand. In this section, we focus on training a tokenizer using the Byte Pair Encoding (BPE) algorithm, which is commonly used in state-of-the-art language models like GPT-2 and GPT-4. 6 | 7 | #### Understanding the Need for Tokenization 8 | 9 | Text data comes in the form of strings, which are sequences of Unicode code points. Before feeding this text into an LLM, we must convert it into a sequence of integers representing tokens. 
Directly using Unicode code points as tokens isn't practical due to their vast number and variability. Instead, we use tokenization algorithms to create a more manageable and stable set of tokens. 10 | 11 | #### Byte Pair Encoding (BPE) Algorithm 12 | 13 | The BPE algorithm works by iteratively merging the most frequent pairs of tokens in the training data. Initially, the tokens are individual characters or bytes from the UTF-8 encoding of the text. As the algorithm proceeds, it creates new tokens representing frequent pairs of characters, thereby compressing the text and reducing the sequence length. 14 | 15 | #### Preparing the Data 16 | 17 | For our example, we use a dataset containing English text, such as a collection of Shakespeare's works. We start by encoding this text into UTF-8, resulting in a sequence of bytes. We then convert these bytes into a list of integers to facilitate processing. 18 | 19 | #### Finding the Most Frequent Pairs 20 | 21 | We define a function `getStats` to count the frequency of each pair of tokens in our sequence. Using this function, we identify the most common pair to merge next. 22 | 23 | #### Merging Pairs 24 | 25 | Once we have the most frequent pair, we replace every occurrence of this pair with a new token. We define a function to perform this replacement, which takes the list of current tokens and the pair to merge, creating a new list with the pair replaced by the new token index. 26 | 27 | #### Iterative Merging 28 | 29 | We set a target vocabulary size and iteratively perform the merging process until we reach this size. Each merge reduces the sequence length and adds a new token to our vocabulary. 30 | 31 | #### Training the Tokenizer 32 | 33 | The training process involves applying the BPE algorithm to our entire dataset. We keep track of the merges in a dictionary, effectively building a binary tree representing the tokenization process. This tree will later allow us to encode new text into tokens and decode token sequences back into text. 34 | 35 | #### Encoding and Decoding Functions 36 | 37 | With the tokenizer trained, we implement functions to encode raw text into token sequences and decode token sequences back into text. These functions rely on the merges dictionary and the vocabulary we created during training. 38 | 39 | #### Special Tokens 40 | 41 | In addition to the tokens created through BPE, we can introduce special tokens to represent specific concepts or delimiters, such as the end-of-text token used in GPT-2. These special tokens are manually added to the vocabulary and handled separately from the BPE-generated tokens. 42 | 43 | #### Final Thoughts 44 | 45 | Tokenization is a nuanced process with many complexities. It affects the performance and capabilities of LLMs in various tasks, including spelling, arithmetic, and handling non-English languages. Understanding tokenization is essential for anyone working with LLMs, as it influences the model's behavior and its interaction with different types of data. 46 | 47 | [Video link](findSectionTimestamps is not a function or its return value is not iterable) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/2. 
Naive Tokenization and Its Limitations.md: -------------------------------------------------------------------------------- 1 | # Naive Tokenization and Its Limitations 2 | 3 | ### Naive Tokenization and Its Limitations 4 | 5 | Tokenization is the foundational process of converting raw text into a sequence of tokens that a language model can interpret. While it might initially seem straightforward, naive tokenization approaches often lead to numerous complications and limitations when working with large language models (LLMs). 6 | 7 | #### What is Naive Tokenization? 8 | 9 | Naive tokenization is the simple process of splitting text into smaller pieces or tokens based on a predefined set of rules. For example, one might split a text into tokens by using spaces and punctuation as delimiters. This rudimentary method is easy to implement but quickly encounters limitations when applied to the complex requirements of LLMs. 10 | 11 | #### Limitations of Naive Tokenization 12 | 13 | 1. **Fixed Vocabulary Size**: Naive tokenization typically relies on a fixed vocabulary, often created by listing all unique characters or words in a dataset. This approach is inherently limited by the vocabulary size and fails to handle new words or characters that were not present in the initial dataset. 14 | 15 | 2. **Inefficient Representation**: Using character-level tokenization results in very long sequences for even moderately sized strings. For instance, a 1000-character text would result in a sequence of 1000 tokens, each representing a single character. This inefficiency becomes problematic when dealing with the fixed context sizes of attention mechanisms in transformers. 16 | 17 | 3. **Lack of Generalization**: Naive tokenization does not generalize well across different datasets or languages. It treats each character or word independently, ignoring the possibility of shared subword units across different words or languages, which could be used for more efficient representation. 18 | 19 | 4. **Handling of Unseen Text**: When a model encounters text that was not present in the training data, it struggles to tokenize and understand it, leading to poor model performance. This is especially problematic for LLMs expected to handle a wide variety of inputs. 20 | 21 | 5. **Complexity with Non-English Languages**: Naive tokenization schemes often fail to account for the complexities of non-English languages, which may not conform to the simple delimiters like spaces and punctuation used in English. This results in poor tokenization and, subsequently, poor model performance on non-English text. 22 | 23 | #### Example of Naive Tokenization 24 | 25 | Consider the text "hi there". A naive tokenizer with a character-level approach would tokenize this as a sequence of individual characters: ['h', 'i', ' ', 't', 'h', 'e', 'r', 'e'], each mapped to a unique integer in a lookup table. While this works for short texts, it becomes highly inefficient for larger texts and fails to capture the linguistic structure present in the text. 26 | 27 | #### Conclusion 28 | 29 | Naive tokenization is a starting point for understanding the tokenization process. However, it is inadequate for the needs of advanced LLMs. The limitations of naive tokenization necessitate the development of more sophisticated tokenization schemes that can handle variable-length sequences, generalize across languages, and efficiently represent the input text. 
Advanced tokenization techniques, such as Byte Pair Encoding (BPE), offer solutions to these challenges and are crucial for the development of state-of-the-art LLMs. 30 | 31 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=27) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/20. Encoding and Decoding with the Tokenizer.md: -------------------------------------------------------------------------------- 1 | # Encoding and Decoding with the Tokenizer 2 | 3 | ### Encoding and Decoding with the Tokenizer 4 | 5 | #### Understanding the Role of Tokenization in LLMs 6 | 7 | Tokenization is a critical preprocessing step in the workflow of large language models (LLMs). It involves converting raw text into a sequence of tokens that the model can process. Each token corresponds to an integer, which is then used to retrieve a vector from an embedding table. This vector representation of the token is what feeds into the transformer layers of the LLM. 8 | 9 | #### The Encoding Process 10 | 11 | Encoding is the process of converting text into tokens. The encoding process typically starts with the raw text as a string of Unicode code points. This string is then encoded into bytes using a UTF-8 encoding scheme, resulting in a byte stream that represents the original text. 12 | 13 | The byte stream is then tokenized into integers using a tokenizer that has been trained using an algorithm like Byte Pair Encoding (BPE). BPE works by iteratively merging the most frequent pairs of bytes or characters in the training data until the desired vocabulary size is reached. 14 | 15 | Here's a simplified example of how encoding might work in Python: 16 | 17 | ```python 18 | # Assuming 'text' is our input string and 'tokenizer' is our trained tokenizer 19 | encoded_tokens = tokenizer.encode(text) 20 | ``` 21 | 22 | The `encoded_tokens` would be a list of integers representing the tokenized version of the input string. 23 | 24 | #### The Decoding Process 25 | 26 | Decoding is the reverse process of encoding, where a sequence of tokens is converted back into a string. This process involves looking up each token in the embedding table to retrieve the corresponding byte or character sequence. These sequences are then concatenated and decoded from UTF-8 back into a Unicode string. 27 | 28 | Here's an example of how decoding might work: 29 | 30 | ```python 31 | # Assuming 'tokens' is our sequence of integers representing tokens 32 | decoded_text = tokenizer.decode(tokens) 33 | ``` 34 | 35 | The `decoded_text` would be the original string reconstructed from the tokenized representation. 36 | 37 | #### Special Tokens 38 | 39 | In addition to the tokens derived from the text, LLMs often use special tokens to indicate the start or end of a sequence, padding, unknown words, or other control features. These special tokens are added to the tokenizer's vocabulary and are used to structure the input and output in a way that the model can understand. 40 | 41 | #### Encoding and Decoding in Practice 42 | 43 | The encoding and decoding processes are fundamental to training and using LLMs. They allow the model to handle text as discrete units of information, which are then used for various tasks such as text generation, translation, or classification. 44 | 45 | When building a tokenizer, it's essential to consider the complexity of the tokenization process, the desired vocabulary size, and how the tokenizer will handle unseen or rare characters. 
The tokenizer's performance can significantly impact the overall effectiveness of the LLM. 46 | 47 | #### Live Demonstration of Tokenization 48 | 49 | A live demonstration of tokenization can be insightful for understanding how different strings are tokenized. Using web applications like `tiktokenizer.vercel.app`, users can interactively type text and observe how it is tokenized in real time. These demonstrations can reveal how English sentences, arithmetic, non-English languages, and programming languages are tokenized differently by the model, highlighting the importance of a well-designed tokenizer in the performance of LLMs. 50 | 51 | #### Conclusion 52 | 53 | Encoding and decoding are not just technical details but critical components that can significantly influence the performance and capabilities of LLMs. A deep understanding of tokenization can help in designing better models and improving their ability to handle a wide range of language processing tasks. 54 | 55 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=2563) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/21. Special Tokens and Their Usage.md: -------------------------------------------------------------------------------- 1 | # Special Tokens and Their Usage 2 | 3 | ### Special Tokens and Their Usage 4 | 5 | Special tokens are unique identifiers in tokenization that serve specific purposes beyond representing the usual chunks of text. They are an essential aspect of tokenization, particularly when dealing with large language models (LLMs) like GPT-2 and GPT-4. These tokens are used to denote boundaries, signal specific actions, or represent abstract concepts within the data fed into LLMs. 6 | 7 | #### Purpose of Special Tokens 8 | 9 | The primary reason for using special tokens is to provide the LLM with structured information that can help it understand the context or perform certain tasks. For instance, special tokens can: 10 | 11 | - Indicate the start and end of a sentence or a document. 12 | - Separate different segments of text. 13 | - Represent padding in sequences of uneven length. 14 | - Signal the model to perform a specific action, such as generating a response or translating text. 15 | - Encode metadata that provides additional context to the model. 16 | 17 | #### Common Special Tokens 18 | 19 | Some of the commonly used special tokens include: 20 | 21 | - `<eos>`: End of Sentence or End of String token, used to signify the end of a text segment. 22 | - `<bos>`: Beginning of Sentence token, marking the start of a text segment. 23 | - `<pad>`: Padding token, used to fill in sequences to a uniform length. 24 | - `<unk>`: Unknown token, representing words or characters not found in the training vocabulary. 25 | 26 | #### Special Tokens in GPT Models 27 | 28 | In GPT-2, a notable special token is the end-of-text token, `<|endoftext|>`. It is used to indicate the end of a document, allowing the model to differentiate between separate pieces of text. This token is particularly important during the training phase, where it helps the model learn when one input ends and another begins. 29 | 30 | GPT-4 introduces additional special tokens to handle more complex structures and functionalities. For example, its tokenizer includes "fill in the middle" (FIM) tokens (`<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`) to mark sections of text that require completion, along with an `<|endofprompt|>` token that marks the end of a prompt.
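To make this concrete, here is a minimal sketch using the `tiktoken` library (assuming it is installed); it shows that special tokens are just integer ids reserved outside the learned BPE merges, and that `encode` only accepts them when they are explicitly allowed:

```python
import tiktoken

# GPT-2's tokenizer defines a single special token, <|endoftext|>.
enc_gpt2 = tiktoken.get_encoding("gpt2")
ids = enc_gpt2.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)  # [50256] -- the last id in the 50,257-token vocabulary

# GPT-4's cl100k_base tokenizer reserves ids for <|endoftext|>, the FIM tokens,
# and <|endofprompt|>; by default they are rejected in ordinary input text.
enc_gpt4 = tiktoken.get_encoding("cl100k_base")
print(enc_gpt4.encode("<|endoftext|>", allowed_special="all"))  # [100257]
```

Requiring special tokens to be explicitly allowed is a useful guard: it prevents control tokens from being smuggled into ordinary user text and triggering behavior the model was never meant to expose.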
31 | 32 | #### Implementing Special Tokens 33 | 34 | When adding new special tokens to a tokenizer, it's crucial to perform model surgery carefully. This involves extending the embedding matrix and the output layer of the model to accommodate the new tokens. The embeddings for these new tokens are usually initialized with small random values and trained during the fine-tuning process. 35 | 36 | #### Considerations and Best Practices 37 | 38 | - Use special tokens judiciously to avoid bloating the model with unnecessary complexity. 39 | - Ensure that the addition of new tokens aligns with the model's architecture and training objectives. 40 | - When fine-tuning, consider freezing the base model parameters and only training the embeddings for the new special tokens. 41 | - Be aware of the potential security and AI safety implications of special tokens, as they can introduce unexpected behavior if mishandled. 42 | 43 | In summary, special tokens are a powerful tool in the tokenization process, enabling LLMs to handle a wide range of tasks with greater precision and context awareness. Understanding their usage and implications is key to harnessing the full potential of tokenization in state-of-the-art language models. 44 | 45 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=4706) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/22. Tokenization in State-of-the-Art LLMs.md: -------------------------------------------------------------------------------- 1 | # Tokenization in State-of-the-Art LLMs 2 | 3 | ### 22. Tokenization in State-of-the-Art LLMs 4 | 5 | In the realm of large language models (LLMs), tokenization plays a critical role, serving as the bridge between raw text and the numerical representations that models can understand and process. Despite its complexity and the pitfalls it presents, a thorough comprehension of tokenization is indispensable for working with LLMs. 6 | 7 | #### What is Tokenization? 8 | 9 | Tokenization is the process of converting strings of text into sequences of tokens, which are essentially atomic units that LLMs can interpret. Each token is an integer that represents a piece of text, such as a character, word, or subword. These tokens are then used to look up embeddings in a table, which are fed into the transformer model of the LLM. 10 | 11 | #### Character-Level Tokenization 12 | 13 | In a naive approach, we might start with character-level tokenization, where every character in the text is mapped to a unique integer token. For example, the string "hi there" could be tokenized into a sequence of integers representing each character. However, character-level tokenization is limited and doesn't scale well with the complexity of language, leading to inefficiencies in the model's understanding and generation capabilities. 14 | 15 | #### Advanced Tokenization Schemes 16 | 17 | State-of-the-art LLMs employ more sophisticated tokenization schemes that operate at the subword or chunk level. These schemes allow the model to handle a wider variety of words and phrases, including those that may not have been seen during training. One such algorithm used for constructing these token vocabularies is Byte Pair Encoding (BPE). 18 | 19 | #### Byte Pair Encoding (BPE) 20 | 21 | BPE is an algorithm that iteratively merges the most frequent pairs of tokens in the training data. 
It starts with individual characters or bytes as tokens and progressively merges them into larger chunks that represent common subwords or sequences of characters. This process allows the model to effectively handle common phrases and linguistic patterns while maintaining the ability to break down and understand rare or unseen words. 22 | 23 | #### Tokenization in GPT-2 and GPT-4 24 | 25 | The GPT-2 paper introduced byte-level BPE tokenization for LLMs. It established a fixed vocabulary size (e.g., 50,257 tokens) and a maximum context size (e.g., 1024 tokens) for its transformer neural network. In this architecture, each token attends to the previous tokens in the sequence, making tokens the fundamental unit of LLMs. 26 | 27 | GPT-4 further improved upon this by increasing the number of tokens in its tokenizer, effectively reducing the token count for the same text. This densification allows the model to attend to a larger context, as each token now represents a larger chunk of text. 28 | 29 | #### Building Your Own Tokenizer 30 | 31 | One can build a tokenizer from scratch using the BPE algorithm. The process involves identifying and merging frequent pairs of tokens in the training data, creating a vocabulary, and then using this vocabulary to encode and decode text. This task requires careful consideration of the complexities involved in tokenization, such as the handling of special characters, different languages, and programming code. 32 | 33 | #### Challenges and Considerations 34 | 35 | Tokenization is not without its challenges. Issues such as handling spelling tasks, string processing, and simple arithmetic can be traced back to the tokenization process. Non-English languages and programming languages often suffer in performance due to the tokenization approach, as tokens may not align well with the syntactic and semantic structures of these languages. 36 | 37 | Tokenization also introduces potential security risks. Special tokens, if not handled properly, can be exploited to produce unexpected or undesired behavior in LLMs. 38 | 39 | #### Conclusion 40 | 41 | Tokenization is a foundational aspect of working with LLMs. Despite its challenges, it is essential for transforming raw text into a form that models can process. As the field advances, there is a continuous effort to improve tokenization methods to enhance the performance, efficiency, and security of LLMs. 42 | 43 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=3451) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/23. Using SentencePiece for Tokenization.md: -------------------------------------------------------------------------------- 1 | # Using SentencePiece for Tokenization 2 | 3 | ### Using SentencePiece for Tokenization 4 | 5 | #### Introduction to SentencePiece 6 | 7 | SentencePiece is a widely used library for tokenization in the context of large language models (LLMs) because it supports both training and inference efficiently. It can handle multiple algorithms, including the Byte Pair Encoding (BPE) algorithm, which we have explored in detail. SentencePiece is employed by various models, including the Llama and Mistral series, among others. 8 | 9 | Unlike the approach taken by tiktoken, which first converts text to UTF-8 bytes before tokenizing, SentencePiece operates directly on Unicode code points. It merges these code points using the BPE algorithm.
For rare code points that may not appear frequently enough in the training data (as controlled by the `character_coverage` hyperparameter), SentencePiece either maps them to a special unknown (`unk`) token or, if `byte_fallback` is enabled, encodes them into UTF-8 bytes and uses special byte tokens added to the vocabulary. 10 | 11 | #### Working with SentencePiece 12 | 13 | To understand how SentencePiece handles tokenization, let's walk through an example using the SentencePiece Python library. 14 | 15 | First, install and import SentencePiece: 16 | 17 | ```python 18 | import sentencepiece as spm 19 | ``` 20 | 21 | Next, create a toy dataset and save it to a text file. This dataset will be used to train the tokenizer: 22 | 23 | ```python 24 | # Toy dataset content 25 | dataset_content = "Example content for training the SentencePiece tokenizer." 26 | 27 | # Save to a file 28 | with open('toy.txt', 'w') as f: 29 | f.write(dataset_content) 30 | ``` 31 | 32 | Configure SentencePiece with various options, including the algorithm to use (BPE in this case), the desired vocabulary size, and other preprocessing and normalization rules. It's crucial to disable many normalization options to preserve the raw form of the data, especially for LLM applications. 33 | 34 | Train the SentencePiece model: 35 | 36 | ```python 37 | spm.SentencePieceTrainer.train( 38 | input='toy.txt', 39 | model_prefix='sp_toy', 40 | vocab_size=400, 41 | model_type='bpe', 42 | # Additional configurations... 43 | ) 44 | ``` 45 | 46 | Once the model is trained, load it and inspect the vocabulary: 47 | 48 | ```python 49 | sp = spm.SentencePieceProcessor() 50 | sp.load('sp_toy.model') 51 | 52 | # Inspect vocabulary 53 | for id in range(sp.get_piece_size()): 54 | print(sp.id_to_piece(id), sp.is_unknown(id)) 55 | ``` 56 | 57 | The vocabulary will include special tokens (`unk`, `begin_of_sentence`, `end_of_sentence`), byte tokens, merge tokens, and individual code point tokens. 58 | 59 | Use the model to tokenize and detokenize text: 60 | 61 | ```python 62 | # Tokenize text 63 | tokens = sp.encode_as_ids("Hello 안녕하세요") 64 | 65 | # Detokenize tokens 66 | detokenized_text = sp.decode_ids(tokens) 67 | ``` 68 | 69 | In this example, the Korean phrase "안녕하세요" is tokenized into individual byte tokens because it wasn't part of the training set and `byte_fallback` is enabled. 70 | 71 | #### Considerations When Using SentencePiece 72 | 73 | - **Character Coverage**: Controls how SentencePiece treats rare characters. A low value may lead to more unknown tokens, while a high value includes more characters in the vocabulary. 74 | - **Byte Fallback**: Determines whether rare characters are encoded as bytes when they are not included in the vocabulary. 75 | - **Add Dummy Prefix**: Adds a dummy whitespace at the beginning of the text to treat words consistently regardless of their position. 76 | - **Special Tokens**: SentencePiece includes predefined special tokens and allows for additional ones to be specified. 77 | 78 | #### Conclusion 79 | 80 | SentencePiece offers a robust and flexible approach to tokenization, with many options to tailor the process to specific needs. However, its complexity and the necessity to carefully calibrate its numerous settings can be challenging. It's important to understand the implications of each option to avoid potential issues, such as inadvertently cropping sentences or misrepresenting data. 
For those who require training their own tokenization models, SentencePiece is a valuable tool, provided its features and configurations are managed with precision. 81 | 82 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=6596) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/24. Recap and Final Thoughts.md: -------------------------------------------------------------------------------- 1 | # Recap and Final Thoughts 2 | 3 | In this final section, we recap the key points covered in the lesson on tokenization and offer some concluding thoughts on the subject. 4 | 5 | Tokenization is a critical and complex process in the functioning of large language models (LLMs). It involves converting raw text into tokens, which are the fundamental units that LLMs operate on. The complexity of tokenization arises from the need to handle various types of data, including different languages, programming code, and even arithmetic. 6 | 7 | Throughout the lesson, we explored several aspects of tokenization: 8 | 9 | 1. **Naive Tokenization**: We started with a simple character-level tokenization that assigns a unique token to each character. This method is limited and inefficient for LLMs. 10 | 11 | 2. **Advanced Tokenization Schemes**: We discussed more sophisticated tokenization methods, such as byte pair encoding (BPE), which groups frequently occurring character pairs into single tokens, thus optimizing the tokenization process. 12 | 13 | 3. **Embedding Tables**: We examined how tokens are used to look up embeddings from an embedding table, which are then fed into the transformer model. 14 | 15 | 4. **Building a Tokenizer**: We went through the process of building our own tokenizer using the BPE algorithm. 16 | 17 | 5. **Tokenization Challenges**: We delved into the complexities and peculiarities of tokenization, such as handling whitespace, punctuation, and the encoding of non-English languages. 18 | 19 | 6. **Live Demonstration**: A live demonstration showed the dynamic nature of tokenization and how different inputs are tokenized. 20 | 21 | 7. **Tokenization of Various Data Types**: We saw how English sentences, arithmetic expressions, non-English languages, and programming languages are tokenized. 22 | 23 | 8. **Improvements in GPT-4**: We noted the advancements in the GPT-4 tokenizer, which more efficiently handles whitespace and other aspects, improving upon the limitations of GPT-2's tokenizer. 24 | 25 | 9. **Writing Tokenization Code**: We covered the essentials of writing code for tokenization, including handling Unicode and UTF-8 encoding. 26 | 27 | 10. **Special Tokens**: The use of special tokens in tokenization was discussed, highlighting their importance in representing specific data structures within LLMs. 28 | 29 | 11. **State-of-the-Art LLMs**: We explored tokenization in cutting-edge LLMs, observing the evolution and refinement of tokenization techniques. 30 | 31 | 12. **SentencePiece**: The SentencePiece library was introduced as a tool for tokenization, with its own set of features and configurations. 32 | 33 | 13. **Design Considerations**: We touched on the considerations for setting the vocabulary size of a tokenizer and the implications of adding new tokens to an existing model. 34 | 35 | 14. **Multimodal Tokenization**: We briefly mentioned the expansion of tokenization beyond text to include other modalities like images and audio. 36 | 37 | 15. 
**Security and Safety**: We discussed the potential security and safety issues related to tokenization, such as unexpected model behavior when encountering certain token sequences. 38 | 39 | In conclusion, tokenization is a nuanced and foundational component of LLMs that requires careful consideration and understanding. It has a direct impact on the performance, efficiency, and capabilities of LLMs. As we continue to push the boundaries of what LLMs can do, the role of tokenization will undoubtedly evolve, presenting both challenges and opportunities for innovation. 40 | 41 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=7818) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/3. Character-Level Tokenization.md: -------------------------------------------------------------------------------- 1 | # Character-Level Tokenization 2 | 3 | ### Character-Level Tokenization 4 | 5 | Character-level tokenization is a fundamental process in the preparation of text data for use in large language models (LLMs). It involves converting a string of text into a sequence of tokens, where each token represents a single character. 6 | 7 | #### The Basics of Character-Level Tokenization 8 | 9 | In character-level tokenization, every character in the text is treated as a separate token. This includes letters, numbers, punctuation marks, and other symbols. The tokenizer creates a vocabulary that consists of all the unique characters found in the dataset. Each character is then assigned a unique integer ID. For example, the character 'a' might be assigned the ID 1, 'b' the ID 2, and so on. 10 | 11 | #### Implementing Character-Level Tokenization 12 | 13 | To implement character-level tokenization, one starts by identifying all unique characters in the dataset. Here's a simple Python example: 14 | 15 | ```python 16 | # Given text 17 | text = "hi there" 18 | 19 | # Unique characters as vocabulary 20 | vocabulary = sorted(set(text)) 21 | 22 | # Character to index mapping 23 | char_to_index = {char: index for index, char in enumerate(vocabulary)} 24 | ``` 25 | 26 | In this example, `vocabulary` would be a list of unique characters, and `char_to_index` would be a dictionary mapping each character to its corresponding token (integer). 27 | 28 | #### Encoding and Decoding 29 | 30 | Encoding is the process of converting raw text into a sequence of tokens. Decoding is the reverse process, converting a sequence of tokens back into a string of text. Here's how one might encode and decode a string: 31 | 32 | ```python 33 | # Encoding text into tokens 34 | encoded_text = [char_to_index[char] for char in text] 35 | 36 | # Decoding tokens back into text 37 | decoded_text = ''.join(vocabulary[index] for index in encoded_text) 38 | ``` 39 | 40 | #### Limitations of Character-Level Tokenization 41 | 42 | While character-level tokenization is simple and straightforward, it has several limitations: 43 | 44 | 1. **Vocabulary Size**: The vocabulary size can be quite large, even though it's limited to the set of characters used in the text. 45 | 46 | 2. **Long Sequences**: Since each character is a separate token, sequences can become very long, which can be computationally expensive for LLMs. 47 | 48 | 3. **Semantic Understanding**: Character-level tokenization does not capture the meaning of words or phrases, as it treats each character independently. 
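To see the second limitation concretely, it helps to compare the length of a character-level encoding with the length of a subword encoding of the same text. A minimal sketch, assuming the `tiktoken` package is available to provide the GPT-2 BPE vocabulary:

```python
import tiktoken

text = "Tokenization is at the heart of much weirdness in large language models."

char_tokens = list(text)                                  # character-level: one token per character
bpe_tokens = tiktoken.get_encoding("gpt2").encode(text)   # GPT-2's BPE tokens

print(len(char_tokens), len(bpe_tokens))                  # the BPE sequence is several times shorter
```

For ordinary English text, GPT-2's BPE averages roughly four characters per token, so a character-level sequence consumes the model's fixed context window several times faster.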
49 | 50 | #### Use in LLMs 51 | 52 | In practice, character-level tokenization is often too naive for state-of-the-art LLMs. These models typically use more sophisticated tokenization schemes that operate on larger chunks of text, such as words or subwords, to better capture the semantic relationships within the text. Advanced schemes like Byte Pair Encoding (BPE) are commonly used to construct more efficient token vocabularies that balance the trade-offs between vocabulary size and sequence length. 53 | 54 | However, understanding character-level tokenization is crucial, as it forms the basis for more advanced tokenization techniques and highlights the importance of careful consideration in the tokenization process. 55 | 56 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=78) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/4. Embedding Table and Token Representation.md: -------------------------------------------------------------------------------- 1 | # Embedding Table and Token Representation 2 | 3 | ### Embedding Table and Token Representation 4 | 5 | In the context of large language models (LLMs), understanding the role of the embedding table and token representation is crucial. These concepts are essential for converting text into a format that LLMs can process and learn from. 6 | 7 | #### Tokenization and the Embedding Table 8 | 9 | Tokenization is the process of converting raw text into a sequence of tokens, which are essentially atomic units that the model can understand. Each unique token is associated with an integer in a process known as indexing. For example, in a simple character-level tokenization, each character in the text is assigned a unique integer. 10 | 11 | After tokenization, we use an embedding table to map these integer tokens to high-dimensional vectors. The embedding table is essentially a matrix where each row corresponds to a token's vector representation. The number of rows in the embedding table is equal to the vocabulary size, which is the number of unique tokens we have. Each row is a trainable parameter vector that the model will adjust during the learning process through backpropagation. 12 | 13 | #### The Role of Token Embeddings in Transformers 14 | 15 | In a transformer architecture, the embedding vectors replace the raw tokens as the input. These vectors capture the semantic and syntactic properties of the tokens, allowing the transformer to understand and generate language patterns. 16 | 17 | For example, consider a vocabulary with 65 characters and a corresponding embedding table with 65 rows. When processing a tokenized string, each token (integer) is used to look up its vector in the embedding table. These vectors are then fed into the transformer model. 18 | 19 | #### Advanced Tokenization Schemes 20 | 21 | While character-level tokenization is straightforward, it is not efficient for large-scale language models. Advanced tokenization schemes, such as Byte Pair Encoding (BPE), are used to create a more compact and informative token vocabulary. BPE works by iteratively merging the most frequent pairs of characters or tokens to form new tokens, reducing the sequence length and allowing the model to process text more efficiently. 22 | 23 | #### The Impact of Token Representation 24 | 25 | Token representation has a significant impact on the model's performance. 
Poorly designed tokenization can lead to issues such as difficulty in spelling tasks, handling non-English languages, and performing simple arithmetic. It is important to ensure that the tokenization process captures the necessary linguistic information without introducing inefficiencies or ambiguities. 26 | 27 | #### Building a Custom Tokenizer 28 | 29 | Building a custom tokenizer involves selecting a tokenization algorithm, creating a vocabulary, and constructing an embedding table. The tokenizer must be able to handle the complexities of language, including edge cases and rare words. It is a delicate process that requires careful consideration and testing. 30 | 31 | #### Conclusion 32 | 33 | The embedding table and token representation are foundational components of LLMs. They translate raw text into a numerical format that models can process, enabling them to learn and generate human-like language. Understanding and optimizing these components are key to building effective language models. 34 | 35 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=97) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/5. Advanced Tokenization Schemes.md: -------------------------------------------------------------------------------- 1 | # Advanced Tokenization Schemes 2 | 3 | # Advanced Tokenization Schemes 4 | 5 | ## Introduction 6 | 7 | Tokenization is a fundamental process in the handling of data for large language models (LLMs). While we have previously covered basic character-level tokenization, state-of-the-art LLMs employ more sophisticated schemes to construct their token vocabularies. In this section, we will explore the complexities and motivations behind these advanced tokenization methods. 8 | 9 | ## The Need for Advanced Tokenization 10 | 11 | The naive character-level tokenization, where each character is mapped to a unique token, is not efficient for large-scale text data. It results in long sequences of tokens for processing, which is computationally expensive and limits the context that a language model can consider. To address this, advanced tokenization techniques aim to reduce sequence length by representing frequent character combinations or even whole words as single tokens. 12 | 13 | ## Byte Pair Encoding (BPE) 14 | 15 | A popular algorithm for constructing advanced token vocabularies is Byte Pair Encoding (BPE). BPE iteratively merges the most frequent pair of tokens into a single new token. This process continues until a predefined vocabulary size is reached. BPE effectively compresses the training data by reducing the number of tokens needed to represent it. 16 | 17 | For example, the GPT-2 paper introduced byte-level BPE as a tokenization method. GPT-2 uses a vocabulary of 50,257 possible tokens, and the BPE algorithm is applied on the byte-level representation of UTF-8 encoded text. This allows GPT-2 to encode a large amount of text data efficiently, with a context size of up to 1024 tokens. 18 | 19 | ## Implementing BPE 20 | 21 | To build our own tokenizer using BPE, we start by encoding our training data into UTF-8 bytes. We then apply the BPE algorithm to this byte stream. The algorithm looks for the most common pairs of bytes and replaces them with a new token, effectively compressing the data. The process is repeated iteratively, each time adding a new token to the vocabulary until the desired vocabulary size is achieved. 
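As a rough sketch of that loop (the toy text, the helper names `count_pairs` and `merge_pair`, and the tiny vocabulary size below are illustrative choices, not a fixed API), the core of byte-level BPE training fits in a few lines of Python:

```python
def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    counts = {}
    for a, b in zip(ids, ids[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def merge_pair(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"                 # toy training data
ids = list(text.encode("utf-8"))     # start from raw bytes, ids 0..255
vocab_size = 259                     # 256 byte tokens + 3 merges, for illustration

merges = {}
for new_id in range(256, vocab_size):
    stats = count_pairs(ids)
    if not stats:
        break
    pair = max(stats, key=stats.get)   # most frequent adjacent pair
    ids = merge_pair(ids, pair, new_id)
    merges[pair] = new_id              # remember the merge rule
```

Encoding new text then replays the recorded merges in the same order, and decoding concatenates the byte sequences behind each token id and interprets the result as UTF-8.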
22 | 23 | ## Complexities of Tokenization 24 | 25 | Tokenization is not straightforward and comes with its own set of complexities. For instance, the choice of vocabulary size is crucial. A vocabulary that is too small may not capture the nuances of the language, while one that is too large may lead to inefficiency and sparsity issues. Moreover, tokenization can impact the performance of LLMs in various tasks, such as spelling, arithmetic, and handling non-English languages. 26 | 27 | Tokenization also affects the way LLMs process programming languages. For instance, Python code that uses indentation can lead to a bloated token sequence, making it difficult for the model to maintain the necessary context. This was a problem in earlier versions of GPT, which was later addressed in GPT-4 by improving the efficiency of whitespace tokenization. 28 | 29 | ## Conclusion 30 | 31 | Advanced tokenization schemes like BPE are essential for the effective functioning of LLMs. They allow models to process large texts more efficiently and contribute significantly to the performance of LLMs in various tasks. Understanding the intricacies of tokenization is crucial for anyone working with LLMs, as it has a profound impact on the model's capabilities and limitations. 32 | 33 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=143) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/6. Byte Pair Encoding Algorithm.md: -------------------------------------------------------------------------------- 1 | # Byte Pair Encoding Algorithm 2 | 3 | ### Byte Pair Encoding (BPE) Algorithm 4 | 5 | The Byte Pair Encoding (BPE) algorithm is a tokenization method used in Natural Language Processing (NLP), particularly in the context of Large Language Models (LLMs) such as GPT-2 and GPT-4. Understanding BPE is crucial as it influences the way LLMs process and generate text. BPE is a middle ground between character-level tokenization, which can be inefficient, and word-level tokenization, which can struggle with large vocabularies and out-of-vocabulary words. 6 | 7 | #### The Need for BPE 8 | 9 | When training LLMs, we need to convert raw text into a format that the model can understand, typically a sequence of integers or tokens. A naive approach might involve using character-level tokenization, but this can lead to long sequences that are computationally expensive for the model to process. On the other end, word-level tokenization can lead to a vast vocabulary with many rare words that the model will rarely see during training. BPE addresses these issues by creating a vocabulary of subword units, which are more frequent than rare words but more meaningful than individual characters. 10 | 11 | #### How BPE Works 12 | 13 | BPE operates by iteratively merging the most frequent pairs of characters or tokens in the training data. Starting with a base vocabulary of individual characters, BPE looks for the most common adjacent pairs of tokens and merges them into a new token, adding it to the vocabulary. This process is repeated until a desired vocabulary size is reached or until no more merges can improve the model. 14 | 15 | Here's a step-by-step breakdown of the BPE algorithm: 16 | 17 | 1. **Initialize the Vocabulary**: Begin with a vocabulary containing every unique character in the dataset. 18 | 2. **Count Pairs**: Count the frequency of each adjacent pair of tokens in the dataset. 19 | 3. 
**Merge Pairs**: Identify the most frequent pair of tokens and merge them into a new single token. 20 | 4. **Update the Dataset**: Replace all instances of the identified pair in the dataset with the new token. 21 | 5. **Repeat**: Continue the process of counting and merging until the vocabulary reaches a predetermined size or no more beneficial merges can be made. 22 | 23 | #### BPE in Practice 24 | 25 | To illustrate BPE in practice, consider the following Python code snippet that demonstrates a simple implementation of the BPE algorithm: 26 | 27 | ```python 28 | # Represent each word as a space-separated sequence of characters (symbols) 29 | data = "this is a simple example of how BPE works" 30 | vocab = [' '.join(word) for word in data.split()] 31 | 32 | # Count how often each adjacent pair of symbols occurs 33 | def get_stats(vocab): 34 | pairs = {} 35 | for word in vocab: 36 | symbols = word.split() 37 | for i in range(len(symbols) - 1): 38 | pair = (symbols[i], symbols[i + 1]) 39 | pairs[pair] = pairs.get(pair, 0) + 1 40 | return pairs 41 | 42 | # Merge every occurrence of the given pair into a single new symbol 43 | def merge_vocab(pair, vocab): 44 | new_vocab = [] 45 | bigram = ' '.join(pair) 46 | replacement = ''.join(pair) 47 | for word in vocab: 48 | new_word = word.replace(bigram, replacement) 49 | new_vocab.append(new_word) 50 | return new_vocab 51 | 52 | # BPE loop 53 | num_merges = 10 # for example 54 | for i in range(num_merges): 55 | pairs = get_stats(vocab) 56 | if not pairs: 57 | break 58 | best_pair = max(pairs, key=pairs.get) 59 | vocab = merge_vocab(best_pair, vocab) 60 | print(f"Merge #{i + 1}: {best_pair} -> {''.join(best_pair)}") 61 | ``` 62 | 63 | In this example, we start with a simple string and build the initial vocabulary by splitting each word into a sequence of individual characters. We then define functions to count the frequency of adjacent symbol pairs (`get_stats`) and to merge the most frequent pair into a single symbol (`merge_vocab`). The BPE loop iterates a specified number of times, each time merging the most frequent pair and updating the vocabulary. (A production implementation would also guard the merge with word-boundary checks, as shown later with a regular expression.) 64 | 65 | #### Considerations and Complexities 66 | 67 | While BPE is powerful, it introduces complexities: 68 | 69 | - **Token Ambiguity**: BPE may lead to ambiguous tokens where the same string can be tokenized differently depending on its context. 70 | - **Tokenization of Non-English Text**: BPE may not perform equally well on non-English text due to differences in character usage and frequency. 71 | - **Special Tokens**: LLMs may use special tokens (e.g., end-of-text) that require careful handling during tokenization. 72 | - **Vocabulary Size**: The choice of vocabulary size is a trade-off between model complexity and tokenization granularity. 73 | 74 | In summary, BPE is a tokenization scheme that helps LLMs handle a wide range of text data efficiently. It balances the need for a manageable vocabulary size with the ability to represent a diverse set of words and subwords, thus enabling more effective language modeling. Understanding and implementing BPE is a critical step in the development and training of LLMs. 75 | 76 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=155) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/7.
Tokenization in GPT-2 Paper.md: -------------------------------------------------------------------------------- 1 | # Tokenization in GPT-2 Paper 2 | 3 | ### Tokenization in GPT-2 Paper 4 | 5 | Tokenization is a crucial pre-processing step in the use of large language models (LLMs), and it plays a significant role in how these models perceive and generate text. The GPT-2 paper introduced byte-level byte-pair encoding (BPE) as a method for tokenization, which is a more advanced approach compared to the naive tokenization methods. 6 | 7 | #### Understanding Tokenization in GPT-2 8 | 9 | The GPT-2 tokenizer operates on a byte level, meaning it takes the raw text data, which is initially a sequence of Unicode code points, and encodes it using UTF-8 to create a byte stream. The BPE algorithm is then applied to this byte stream, compressing it by iteratively merging the most frequent pairs of bytes and adding them as new tokens to the vocabulary. This process continues until a predefined vocabulary size is reached. 10 | 11 | In GPT-2, the vocabulary size is set to 50,257 tokens, and the model's context size is 1,024 tokens. This means that during attention operations within the transformer architecture, each token can attend to up to 1,024 previous tokens, making token sequences the fundamental unit of information processing in LLMs. 12 | 13 | #### The Byte-Pair Encoding (BPE) Algorithm 14 | 15 | The BPE algorithm used in GPT-2 is a core component of the tokenization process. It starts with the raw byte stream of the text and looks for the most frequently occurring pair of bytes. This pair is then replaced with a new token, effectively reducing the length of the sequence while expanding the vocabulary. The process repeats, finding the next most common pair, merging them, and continuing until the desired vocabulary size is reached. 16 | 17 | #### Practical Implications of Tokenization 18 | 19 | The tokenization process has several practical implications for the performance of LLMs. For instance, issues with spelling, string processing, non-English languages, and simple arithmetic can often be traced back to the way tokenization is handled. The tokenizer's vocabulary and the way it chunks text into tokens can significantly affect the model's ability to understand and generate text accurately. 20 | 21 | #### Building and Using a Tokenizer 22 | 23 | In the video, the process of building a tokenizer from scratch using the BPE algorithm is demonstrated. This involves creating a vocabulary from a training set, establishing an embedding table for the tokens, and training the tokenizer to convert strings into sequences of tokens and vice versa. 24 | 25 | #### Conclusion 26 | 27 | Tokenization is a complex but essential aspect of working with LLMs. The GPT-2 paper's introduction of byte-level BPE for tokenization marked a significant advancement in the field. Understanding the intricacies of the tokenization process is vital for anyone working with LLMs, as it influences the model's capabilities and limitations in processing and generating human language. 28 | 29 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=172) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/8. 
Building Our Own Tokenizer.md: -------------------------------------------------------------------------------- 1 | # Building Our Own Tokenizer 2 | 3 | ### Building Our Own Tokenizer 4 | 5 | In this section, we will explore the construction of a tokenizer, which is an essential component in working with large language models (LLMs). Tokenization is the process of converting strings of text into sequences of tokens, which are essentially atomic units that the model can understand and process. 6 | 7 | #### Why Tokenization Matters 8 | 9 | Tokenization might not be the most exciting aspect of working with LLMs, but it is crucial. It influences the performance of the model significantly. Issues that may seem related to the architecture of the neural network often trace back to tokenization. For instance, difficulties in spelling, string processing, handling non-English languages, or even simple arithmetic can often be attributed to the way text is tokenized. 10 | 11 | #### The Naive Approach to Tokenization 12 | 13 | Previously, we tokenized text in a very simplistic way—character-level tokenization. Each character in the text was mapped to a unique integer token. However, this approach is not practical for state-of-the-art models due to its limitations in efficiency and expressiveness. 14 | 15 | #### Advanced Tokenization Schemes 16 | 17 | Instead of character-level tokenization, modern LLMs use more sophisticated methods. These methods tokenize text at the chunk level, where chunks are sequences of characters that are often merged based on their frequency of co-occurrence in the training data. 18 | 19 | One such method is Byte Pair Encoding (BPE), which iteratively merges the most frequent pairs of tokens (or characters in the initial iteration) to create a new token. This process continues until a predefined vocabulary size is reached. 20 | 21 | #### Implementing Byte Pair Encoding 22 | 23 | To implement BPE, we start by identifying the most frequent pair of tokens in our dataset. We then replace all occurrences of this pair with a new token. This process is repeated, each time identifying and merging the next most frequent pair, until we have built a vocabulary of desired size. 24 | 25 | Here's a simple Python function that finds the most common pair: 26 | 27 | ```python 28 | def get_stats(vocab): 29 | pairs = collections.defaultdict(int) 30 | for word, freq in vocab.items(): 31 | symbols = word.split() 32 | for i in range(len(symbols) - 1): 33 | pairs[symbols[i], symbols[i + 1]] += freq 34 | return pairs 35 | ``` 36 | 37 | And another function that merges the pair throughout the vocabulary: 38 | 39 | ```python 40 | def merge_vocab(pair, v_in): 41 | v_out = {} 42 | bigram = re.escape(' '.join(pair)) 43 | p = re.compile(r'(? 23: 21 | break 22 | # print(list(section['YT->BP/Write section'].keys())) 23 | title = section['YT->BP/Write section']['get_section_title']['logs'] 24 | print(title) 25 | content = section['YT->BP/Write section']['content'] 26 | print(content) 27 | yt_url = section['YT->BP/Get time stamped URL']['time']['output'] 28 | print(yt_url) 29 | 30 | section_path = output_folder / f"{i + 1}. {title}.md" 31 | sections.append({"path": section_path, "name": title}) 32 | with section_path.open('w') as f: 33 | f.write(f"# {title}\n\n{content}\n\n[Video link]({yt_url})") 34 | 35 | with (output_folder / "index.md").open('w') as f: 36 | paths = "\n\n".join([f"[{i+1}. 
{s['name']}]({quote(str(s['path'].relative_to(output_folder).with_suffix('')))})" for i, s in enumerate(sections)]) 37 | f.write(f"# {project_name}\n\n{paths}") 38 | 39 | 40 | if __name__ == '__main__': 41 | main() 42 | --------------------------------------------------------------------------------