├── scripts └── convert.py ├── README.md └── pages └── Let's build the GPT Tokenizer ├── index.md ├── 7. Tokenization in GPT-2 Paper.md ├── 10. Live Demonstration of Tokenization.md ├── 3. Character-Level Tokenization.md ├── 4. Embedding Table and Token Representation.md ├── 21. Special Tokens and Their Usage.md ├── 5. Advanced Tokenization Schemes.md ├── 2. Naive Tokenization and Its Limitations.md ├── 24. Recap and Final Thoughts.md ├── 11. Tokenization of English Sentences.md ├── 19. Training the Tokenizer.md ├── 13. Tokenization of Non-English Languages.md ├── 15. Improvements in GPT-4 Tokenizer.md ├── 8. Building Our Own Tokenizer.md ├── 20. Encoding and Decoding with the Tokenizer.md ├── 9. Complexities of Tokenization.md ├── 1. Introduction to Tokenization.md ├── 17. Understanding Unicode and UTF-8 Encoding.md ├── 22. Tokenization in State-of-the-Art LLMs.md ├── 16. Writing Tokenization Code.md ├── 23. Using SentencePiece for Tokenization.md ├── 18. Implementing Byte Pair Encoding.md ├── 14. Tokenization of Programming Languages.md ├── 12. Tokenization of Arithmetic.md └── 6. Byte Pair Encoding Algorithm.md /scripts/convert.py: -------------------------------------------------------------------------------- 1 | import json 2 | from pathlib import Path 3 | from urllib.parse import quote 4 | 5 | 6 | def main(): 7 | # This script converts the output of running the Wordware prompt into a set of markdown files that can be served as 8 | # GitHub pages 9 | project_name = "Let's build the GPT Tokenizer" 10 | input_path = Path("~/Documents/Random/Karpathy Tokenizer").expanduser() / "karpathy_tokenizer.json" 11 | 12 | with input_path.open() as f: 13 | data = json.load(f) 14 | 15 | output_folder = Path("../pages") / project_name 16 | output_folder.mkdir(exist_ok=True) 17 | 18 | sections = [] 19 | for i, section in enumerate(data['loop_sections']['generations']): 20 | if i > 23: 21 | break 22 | # print(list(section['YT->BP/Write section'].keys())) 23 | title = section['YT->BP/Write section']['get_section_title']['logs'] 24 | print(title) 25 | content = section['YT->BP/Write section']['content'] 26 | print(content) 27 | yt_url = section['YT->BP/Get time stamped URL']['time']['output'] 28 | print(yt_url) 29 | 30 | section_path = output_folder / f"{i + 1}. {title}.md" 31 | sections.append({"path": section_path, "name": title}) 32 | with section_path.open('w') as f: 33 | f.write(f"# {title}\n\n{content}\n\n[Video link]({yt_url})") 34 | 35 | with (output_folder / "index.md").open('w') as f: 36 | paths = "\n\n".join([f"[{i+1}. {s['name']}]({quote(str(s['path'].relative_to(output_folder).with_suffix('')))})" for i, s in enumerate(sections)]) 37 | f.write(f"# {project_name}\n\n{paths}") 38 | 39 | 40 | if __name__ == '__main__': 41 | main() 42 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # YouTube to Post 📺 ->📝 2 | 3 | Convert educational videos into equivalent written materials. 4 | 5 | Based on the challenge from [Andrej Karpathy](https://twitter.com/karpathy/status/1760740503614836917): 6 | 7 | > Fun LLM challenge that I'm thinking about: take my 2h13m tokenizer video and translate the video into the format of a 8 | > book chapter (or a blog post) on tokenization. 9 | > Something like: 10 | > 11 | > 1. Whisper the video 12 | > 2. Chop up into segments of aligned images and text 13 | > 3. Prompt engineer an LLM to translate piece by piece 14 | > 4. 
Export as a page, with links citing parts of original video 15 | > 16 | > More generally, a workflow like this could be applied to any input video and auto-generate "companion guides" for 17 | > various tutorials in a more readable, skimmable, searchable format. Feels tractable but non-trivial. 18 | 19 | 20 | ## Generated posts 21 | [Let's build the GPT Tokenizer](./pages/Let's%20build%20the%20GPT%20Tokenizer/index) 22 | 23 | 24 | ## Prompts & scripts 25 | [This Wordware prompt](https://app.wordware.ai/r/b058e9c3-ffee-4661-a5e3-c788eef0dfbc) takes in the JSON output from 26 | running Whisper (I used [this one](https://replicate.com/vaibhavs10/incredibly-fast-whisper) on 27 | [Replicate](https://replicate.com/)) and processes it into sections of a written lesson. 28 | 29 | The simple script in `scripts/convert.py` turns the output into a set of Markdown files that are then served via GitHub 30 | pages. 31 | 32 | ### Getting the transcript 33 | [Here](https://gist.github.com/wordware-ai/95312691264f66c7a893ab1dfea15807) is the output from running Whisper on the 34 | audio track of [Andrej's Tokenizer video](https://www.youtube.com/watch?v=zduSFxRajkE). 35 | 36 | Alternatively you can run [`youtube-dl`](https://github.com/ytdl-org/youtube-dl) on the video e.g. 37 | ```commandline 38 | youtube-dl --extract-audio --audio-format mp3 "https://www.youtube.com/watch?v=" 39 | ``` 40 | then run it through a transcription model like [this one](https://replicate.com/vaibhavs10/incredibly-fast-whisper). 41 | -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/index.md: -------------------------------------------------------------------------------- 1 | # Let's build the GPT Tokenizer 2 | 3 | [1. Introduction to Tokenization](1.%20Introduction%20to%20Tokenization) 4 | 5 | [2. Naive Tokenization and Its Limitations](2.%20Naive%20Tokenization%20and%20Its%20Limitations) 6 | 7 | [3. Character-Level Tokenization](3.%20Character-Level%20Tokenization) 8 | 9 | [4. Embedding Table and Token Representation](4.%20Embedding%20Table%20and%20Token%20Representation) 10 | 11 | [5. Advanced Tokenization Schemes](5.%20Advanced%20Tokenization%20Schemes) 12 | 13 | [6. Byte Pair Encoding Algorithm](6.%20Byte%20Pair%20Encoding%20Algorithm) 14 | 15 | [7. Tokenization in GPT-2 Paper](7.%20Tokenization%20in%20GPT-2%20Paper) 16 | 17 | [8. Building Our Own Tokenizer](8.%20Building%20Our%20Own%20Tokenizer) 18 | 19 | [9. Complexities of Tokenization](9.%20Complexities%20of%20Tokenization) 20 | 21 | [10. Live Demonstration of Tokenization](10.%20Live%20Demonstration%20of%20Tokenization) 22 | 23 | [11. Tokenization of English Sentences](11.%20Tokenization%20of%20English%20Sentences) 24 | 25 | [12. Tokenization of Arithmetic](12.%20Tokenization%20of%20Arithmetic) 26 | 27 | [13. Tokenization of Non-English Languages](13.%20Tokenization%20of%20Non-English%20Languages) 28 | 29 | [14. Tokenization of Programming Languages](14.%20Tokenization%20of%20Programming%20Languages) 30 | 31 | [15. Improvements in GPT-4 Tokenizer](15.%20Improvements%20in%20GPT-4%20Tokenizer) 32 | 33 | [16. Writing Tokenization Code](16.%20Writing%20Tokenization%20Code) 34 | 35 | [17. Understanding Unicode and UTF-8 Encoding](17.%20Understanding%20Unicode%20and%20UTF-8%20Encoding) 36 | 37 | [18. Implementing Byte Pair Encoding](18.%20Implementing%20Byte%20Pair%20Encoding) 38 | 39 | [19. Training the Tokenizer](19.%20Training%20the%20Tokenizer) 40 | 41 | [20. 
Encoding and Decoding with the Tokenizer](20.%20Encoding%20and%20Decoding%20with%20the%20Tokenizer) 42 | 43 | [21. Special Tokens and Their Usage](21.%20Special%20Tokens%20and%20Their%20Usage) 44 | 45 | [22. Tokenization in State-of-the-Art LLMs](22.%20Tokenization%20in%20State-of-the-Art%20LLMs) 46 | 47 | [23. Using SentencePiece for Tokenization](23.%20Using%20SentencePiece%20for%20Tokenization) 48 | 49 | [24. Recap and Final Thoughts](24.%20Recap%20and%20Final%20Thoughts) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/7. Tokenization in GPT-2 Paper.md: -------------------------------------------------------------------------------- 1 | # Tokenization in GPT-2 Paper 2 | 3 | ### Tokenization in GPT-2 Paper 4 | 5 | Tokenization is a crucial pre-processing step in the use of large language models (LLMs), and it plays a significant role in how these models perceive and generate text. The GPT-2 paper introduced byte-level byte-pair encoding (BPE) as a method for tokenization, which is a more advanced approach compared to the naive tokenization methods. 6 | 7 | #### Understanding Tokenization in GPT-2 8 | 9 | The GPT-2 tokenizer operates on a byte level, meaning it takes the raw text data, which is initially a sequence of Unicode code points, and encodes it using UTF-8 to create a byte stream. The BPE algorithm is then applied to this byte stream, compressing it by iteratively merging the most frequent pairs of bytes and adding them as new tokens to the vocabulary. This process continues until a predefined vocabulary size is reached. 10 | 11 | In GPT-2, the vocabulary size is set to 50,257 tokens, and the model's context size is 1,024 tokens. This means that during attention operations within the transformer architecture, each token can attend to up to 1,024 previous tokens, making token sequences the fundamental unit of information processing in LLMs. 12 | 13 | #### The Byte-Pair Encoding (BPE) Algorithm 14 | 15 | The BPE algorithm used in GPT-2 is a core component of the tokenization process. It starts with the raw byte stream of the text and looks for the most frequently occurring pair of bytes. This pair is then replaced with a new token, effectively reducing the length of the sequence while expanding the vocabulary. The process repeats, finding the next most common pair, merging them, and continuing until the desired vocabulary size is reached. 16 | 17 | #### Practical Implications of Tokenization 18 | 19 | The tokenization process has several practical implications for the performance of LLMs. For instance, issues with spelling, string processing, non-English languages, and simple arithmetic can often be traced back to the way tokenization is handled. The tokenizer's vocabulary and the way it chunks text into tokens can significantly affect the model's ability to understand and generate text accurately. 20 | 21 | #### Building and Using a Tokenizer 22 | 23 | In the video, the process of building a tokenizer from scratch using the BPE algorithm is demonstrated. This involves creating a vocabulary from a training set, establishing an embedding table for the tokens, and training the tokenizer to convert strings into sequences of tokens and vice versa. 24 | 25 | #### Conclusion 26 | 27 | Tokenization is a complex but essential aspect of working with LLMs. The GPT-2 paper's introduction of byte-level BPE for tokenization marked a significant advancement in the field. 
Understanding the intricacies of the tokenization process is vital for anyone working with LLMs, as it influences the model's capabilities and limitations in processing and generating human language. 28 | 29 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=172) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/10. Live Demonstration of Tokenization.md: -------------------------------------------------------------------------------- 1 | # Live Demonstration of Tokenization 2 | 3 | ### 10. Live Demonstration of Tokenization 4 | 5 | Tokenization is a critical process in the functioning of large language models (LLMs). It involves converting strings of text into sequences of tokens, which are essentially integers that the model can process. This process is not as straightforward as it might seem, and it's filled with nuances and complexities that can significantly affect the performance of LLMs. 6 | 7 | #### Understanding Tokens and Tokenization 8 | 9 | Tokens are the fundamental units in LLMs, serving as the atomic elements that the models perceive and manipulate. The process of tokenization translates text into tokens and vice versa, which is crucial for feeding data into the models and interpreting their outputs. 10 | 11 | #### The Role of Tokenization in LLMs 12 | 13 | Tokenization impacts the behavior of LLMs in various ways. Issues with tokenization can lead to difficulties in tasks such as spelling, string processing, and handling non-English languages or even simple arithmetic. The way tokenization is handled can also affect the model's performance with programming languages, as seen with different versions of GPT. 14 | 15 | #### Demonstration Using a Web Application 16 | 17 | To illustrate how tokenization works in practice, we can use a web application like `tiktokenizer.versal.app`. This app runs tokenization live in your browser, allowing you to input text and see how it gets tokenized using different tokenizer versions, such as GPT-2 or GPT-4. 18 | 19 | For example, inputting an English sentence like "hello world" will show you how the tokenizer breaks it into tokens. You can also see the tokenization of arithmetic expressions and observe how numbers are sometimes tokenized as single units or split into multiple tokens, which can complicate arithmetic processing for LLMs. 20 | 21 | #### Comparing Tokenizers 22 | 23 | Different tokenizers handle text in various ways. GPT-2, for instance, has a tendency to tokenize whitespace as individual tokens, which can be inefficient, particularly for programming languages like Python that use indentation. GPT-4 improves upon this by grouping whitespace more effectively, allowing for denser representation of code and better performance on coding-related tasks. 24 | 25 | #### Building Your Own Tokenizer 26 | 27 | You can also build your own tokenizer using algorithms like Byte Pair Encoding (BPE). By understanding the complexities of tokenization, you can create a tokenizer that suits your specific needs, whether for English text, arithmetic, or other languages. 28 | 29 | #### Challenges and Considerations 30 | 31 | Tokenization is at the heart of many issues in LLMs. From odd behaviors to performance limitations, many problems can be traced back to how tokenization is implemented. 
It's essential to approach tokenization with a thorough understanding of its implications and to recognize that it's more than just a preliminary step—it's a crucial component that shapes the capabilities and limitations of your language model. 32 | 33 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=351) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/3. Character-Level Tokenization.md: -------------------------------------------------------------------------------- 1 | # Character-Level Tokenization 2 | 3 | ### Character-Level Tokenization 4 | 5 | Character-level tokenization is a fundamental process in the preparation of text data for use in large language models (LLMs). It involves converting a string of text into a sequence of tokens, where each token represents a single character. 6 | 7 | #### The Basics of Character-Level Tokenization 8 | 9 | In character-level tokenization, every character in the text is treated as a separate token. This includes letters, numbers, punctuation marks, and other symbols. The tokenizer creates a vocabulary that consists of all the unique characters found in the dataset. Each character is then assigned a unique integer ID. For example, the character 'a' might be assigned the ID 1, 'b' the ID 2, and so on. 10 | 11 | #### Implementing Character-Level Tokenization 12 | 13 | To implement character-level tokenization, one starts by identifying all unique characters in the dataset. Here's a simple Python example: 14 | 15 | ```python 16 | # Given text 17 | text = "hi there" 18 | 19 | # Unique characters as vocabulary 20 | vocabulary = sorted(set(text)) 21 | 22 | # Character to index mapping 23 | char_to_index = {char: index for index, char in enumerate(vocabulary)} 24 | ``` 25 | 26 | In this example, `vocabulary` would be a list of unique characters, and `char_to_index` would be a dictionary mapping each character to its corresponding token (integer). 27 | 28 | #### Encoding and Decoding 29 | 30 | Encoding is the process of converting raw text into a sequence of tokens. Decoding is the reverse process, converting a sequence of tokens back into a string of text. Here's how one might encode and decode a string: 31 | 32 | ```python 33 | # Encoding text into tokens 34 | encoded_text = [char_to_index[char] for char in text] 35 | 36 | # Decoding tokens back into text 37 | decoded_text = ''.join(vocabulary[index] for index in encoded_text) 38 | ``` 39 | 40 | #### Limitations of Character-Level Tokenization 41 | 42 | While character-level tokenization is simple and straightforward, it has several limitations: 43 | 44 | 1. **Vocabulary Size**: The vocabulary size can be quite large, even though it's limited to the set of characters used in the text. 45 | 46 | 2. **Long Sequences**: Since each character is a separate token, sequences can become very long, which can be computationally expensive for LLMs. 47 | 48 | 3. **Semantic Understanding**: Character-level tokenization does not capture the meaning of words or phrases, as it treats each character independently. 49 | 50 | #### Use in LLMs 51 | 52 | In practice, character-level tokenization is often too naive for state-of-the-art LLMs. These models typically use more sophisticated tokenization schemes that operate on larger chunks of text, such as words or subwords, to better capture the semantic relationships within the text. 
Advanced schemes like Byte Pair Encoding (BPE) are commonly used to construct more efficient token vocabularies that balance the trade-offs between vocabulary size and sequence length. 53 | 54 | However, understanding character-level tokenization is crucial, as it forms the basis for more advanced tokenization techniques and highlights the importance of careful consideration in the tokenization process. 55 | 56 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=78) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/4. Embedding Table and Token Representation.md: -------------------------------------------------------------------------------- 1 | # Embedding Table and Token Representation 2 | 3 | ### Embedding Table and Token Representation 4 | 5 | In the context of large language models (LLMs), understanding the role of the embedding table and token representation is crucial. These concepts are essential for converting text into a format that LLMs can process and learn from. 6 | 7 | #### Tokenization and the Embedding Table 8 | 9 | Tokenization is the process of converting raw text into a sequence of tokens, which are essentially atomic units that the model can understand. Each unique token is associated with an integer in a process known as indexing. For example, in a simple character-level tokenization, each character in the text is assigned a unique integer. 10 | 11 | After tokenization, we use an embedding table to map these integer tokens to high-dimensional vectors. The embedding table is essentially a matrix where each row corresponds to a token's vector representation. The number of rows in the embedding table is equal to the vocabulary size, which is the number of unique tokens we have. Each row is a trainable parameter vector that the model will adjust during the learning process through backpropagation. 12 | 13 | #### The Role of Token Embeddings in Transformers 14 | 15 | In a transformer architecture, the embedding vectors replace the raw tokens as the input. These vectors capture the semantic and syntactic properties of the tokens, allowing the transformer to understand and generate language patterns. 16 | 17 | For example, consider a vocabulary with 65 characters and a corresponding embedding table with 65 rows. When processing a tokenized string, each token (integer) is used to look up its vector in the embedding table. These vectors are then fed into the transformer model. 18 | 19 | #### Advanced Tokenization Schemes 20 | 21 | While character-level tokenization is straightforward, it is not efficient for large-scale language models. Advanced tokenization schemes, such as Byte Pair Encoding (BPE), are used to create a more compact and informative token vocabulary. BPE works by iteratively merging the most frequent pairs of characters or tokens to form new tokens, reducing the sequence length and allowing the model to process text more efficiently. 22 | 23 | #### The Impact of Token Representation 24 | 25 | Token representation has a significant impact on the model's performance. Poorly designed tokenization can lead to issues such as difficulty in spelling tasks, handling non-English languages, and performing simple arithmetic. It is important to ensure that the tokenization process captures the necessary linguistic information without introducing inefficiencies or ambiguities. 
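To make the lookup described above concrete, here is a minimal sketch using PyTorch's `nn.Embedding` (an illustrative choice of framework; any library with an embedding layer behaves the same way). The vocabulary size of 65 matches the character-level example earlier in this section, and the embedding width of 32 is an arbitrary value chosen for the sketch.

```python
import torch
import torch.nn as nn

vocab_size = 65   # number of unique tokens, as in the character-level example above
n_embd = 32       # embedding width; an arbitrary, model-dependent choice

# One trainable row per token; backpropagation adjusts these rows during training.
embedding_table = nn.Embedding(vocab_size, n_embd)

# A tokenized string is just a sequence of integer ids...
tokens = torch.tensor([[1, 5, 8, 2]])   # shape (batch=1, sequence_length=4)

# ...and the lookup turns each id into its vector, ready to feed the transformer.
vectors = embedding_table(tokens)       # shape (1, 4, n_embd)
print(vectors.shape)                    # torch.Size([1, 4, 32])
```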
26 | 27 | #### Building a Custom Tokenizer 28 | 29 | Building a custom tokenizer involves selecting a tokenization algorithm, creating a vocabulary, and constructing an embedding table. The tokenizer must be able to handle the complexities of language, including edge cases and rare words. It is a delicate process that requires careful consideration and testing. 30 | 31 | #### Conclusion 32 | 33 | The embedding table and token representation are foundational components of LLMs. They translate raw text into a numerical format that models can process, enabling them to learn and generate human-like language. Understanding and optimizing these components are key to building effective language models. 34 | 35 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=97) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/21. Special Tokens and Their Usage.md: -------------------------------------------------------------------------------- 1 | # Special Tokens and Their Usage 2 | 3 | ### Special Tokens and Their Usage 4 | 5 | Special tokens are unique identifiers in tokenization that serve specific purposes beyond representing the usual chunks of text. They are an essential aspect of tokenization, particularly when dealing with large language models (LLMs) like GPT-2 and GPT-4. These tokens are used to denote boundaries, signal specific actions, or represent abstract concepts within the data fed into LLMs. 6 | 7 | #### Purpose of Special Tokens 8 | 9 | The primary reason for using special tokens is to provide the LLM with structured information that can help it understand the context or perform certain tasks. For instance, special tokens can: 10 | 11 | - Indicate the start and end of a sentence or a document. 12 | - Separate different segments of text. 13 | - Represent padding in sequences of uneven length. 14 | - Signal the model to perform a specific action, such as generating a response or translating text. 15 | - Encode metadata that provides additional context to the model. 16 | 17 | #### Common Special Tokens 18 | 19 | Some of the commonly used special tokens include: 20 | 21 | - ``: End of Sentence or End of String token, used to signify the end of a text segment. 22 | - ``: Beginning of Sentence token, marking the start of a text segment. 23 | - ``: Padding token, used to fill in sequences to a uniform length. 24 | - ``: Unknown token, representing words or characters not found in the training vocabulary. 25 | 26 | #### Special Tokens in GPT Models 27 | 28 | In GPT-2, a notable special token is the "end of text" token. It is used to indicate the end of a document, allowing the model to differentiate between separate pieces of text. This token is particularly important during the training phase, where it helps the model learn when one input ends, and another begins. 29 | 30 | GPT-4 introduces additional special tokens to handle more complex structures and functionalities. For example, it uses "fill in the middle" (FIM) tokens to mark sections of text that require completion, and a "SERP" token, likely used to handle search engine result pages or similar structured data. 31 | 32 | #### Implementing Special Tokens 33 | 34 | When adding new special tokens to a tokenizer, it's crucial to perform model surgery carefully. This involves extending the embedding matrix and the output layer of the model to accommodate the new tokens. 
The embeddings for these new tokens are usually initialized with small random values and trained during the fine-tuning process. 35 | 36 | #### Considerations and Best Practices 37 | 38 | - Use special tokens judiciously to avoid bloating the model with unnecessary complexity. 39 | - Ensure that the addition of new tokens aligns with the model's architecture and training objectives. 40 | - When fine-tuning, consider freezing the base model parameters and only training the embeddings for the new special tokens. 41 | - Be aware of the potential security and AI safety implications of special tokens, as they can introduce unexpected behavior if mishandled. 42 | 43 | In summary, special tokens are a powerful tool in the tokenization process, enabling LLMs to handle a wide range of tasks with greater precision and context awareness. Understanding their usage and implications is key to harnessing the full potential of tokenization in state-of-the-art language models. 44 | 45 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=4706) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/5. Advanced Tokenization Schemes.md: -------------------------------------------------------------------------------- 1 | # Advanced Tokenization Schemes 2 | 3 | # Advanced Tokenization Schemes 4 | 5 | ## Introduction 6 | 7 | Tokenization is a fundamental process in the handling of data for large language models (LLMs). While we have previously covered basic character-level tokenization, state-of-the-art LLMs employ more sophisticated schemes to construct their token vocabularies. In this section, we will explore the complexities and motivations behind these advanced tokenization methods. 8 | 9 | ## The Need for Advanced Tokenization 10 | 11 | The naive character-level tokenization, where each character is mapped to a unique token, is not efficient for large-scale text data. It results in long sequences of tokens for processing, which is computationally expensive and limits the context that a language model can consider. To address this, advanced tokenization techniques aim to reduce sequence length by representing frequent character combinations or even whole words as single tokens. 12 | 13 | ## Byte Pair Encoding (BPE) 14 | 15 | A popular algorithm for constructing advanced token vocabularies is Byte Pair Encoding (BPE). BPE iteratively merges the most frequent pair of tokens into a single new token. This process continues until a predefined vocabulary size is reached. BPE effectively compresses the training data by reducing the number of tokens needed to represent it. 16 | 17 | For example, the GPT-2 paper introduced byte-level BPE as a tokenization method. GPT-2 uses a vocabulary of 50,257 possible tokens, and the BPE algorithm is applied on the byte-level representation of UTF-8 encoded text. This allows GPT-2 to encode a large amount of text data efficiently, with a context size of up to 1024 tokens. 18 | 19 | ## Implementing BPE 20 | 21 | To build our own tokenizer using BPE, we start by encoding our training data into UTF-8 bytes. We then apply the BPE algorithm to this byte stream. The algorithm looks for the most common pairs of bytes and replaces them with a new token, effectively compressing the data. The process is repeated iteratively, each time adding a new token to the vocabulary until the desired vocabulary size is achieved. 
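The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the exact GPT-2 implementation: the toy corpus, the number of merges, and the helper names are all chosen for readability.

```python
from collections import Counter

def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "hello hello hello world"      # tiny illustrative corpus
ids = list(text.encode("utf-8"))      # byte-level start: every id is in 0..255
num_merges = 5                        # real tokenizers perform tens of thousands
merges = {}                           # (id, id) -> new token id
for j in range(num_merges):
    stats = get_stats(ids)
    top_pair = max(stats, key=stats.get)   # most frequent adjacent pair
    new_id = 256 + j                       # new ids start after the 256 raw bytes
    ids = merge(ids, top_pair, new_id)
    merges[top_pair] = new_id

print(f"{len(text.encode('utf-8'))} bytes -> {len(ids)} tokens after {num_merges} merges")
```

The `merges` dictionary is effectively the trained tokenizer: applying the same merges in the same order to new text encodes it, and expanding them in reverse decodes it.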
22 | 23 | ## Complexities of Tokenization 24 | 25 | Tokenization is not straightforward and comes with its own set of complexities. For instance, the choice of vocabulary size is crucial. A vocabulary that is too small may not capture the nuances of the language, while one that is too large may lead to inefficiency and sparsity issues. Moreover, tokenization can impact the performance of LLMs in various tasks, such as spelling, arithmetic, and handling non-English languages. 26 | 27 | Tokenization also affects the way LLMs process programming languages. For instance, Python code that uses indentation can lead to a bloated token sequence, making it difficult for the model to maintain the necessary context. This was a problem in earlier versions of GPT, which was later addressed in GPT-4 by improving the efficiency of whitespace tokenization. 28 | 29 | ## Conclusion 30 | 31 | Advanced tokenization schemes like BPE are essential for the effective functioning of LLMs. They allow models to process large texts more efficiently and contribute significantly to the performance of LLMs in various tasks. Understanding the intricacies of tokenization is crucial for anyone working with LLMs, as it has a profound impact on the model's capabilities and limitations. 32 | 33 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=143) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/2. Naive Tokenization and Its Limitations.md: -------------------------------------------------------------------------------- 1 | # Naive Tokenization and Its Limitations 2 | 3 | ### Naive Tokenization and Its Limitations 4 | 5 | Tokenization is the foundational process of converting raw text into a sequence of tokens that a language model can interpret. While it might initially seem straightforward, naive tokenization approaches often lead to numerous complications and limitations when working with large language models (LLMs). 6 | 7 | #### What is Naive Tokenization? 8 | 9 | Naive tokenization is the simple process of splitting text into smaller pieces or tokens based on a predefined set of rules. For example, one might split a text into tokens by using spaces and punctuation as delimiters. This rudimentary method is easy to implement but quickly encounters limitations when applied to the complex requirements of LLMs. 10 | 11 | #### Limitations of Naive Tokenization 12 | 13 | 1. **Fixed Vocabulary Size**: Naive tokenization typically relies on a fixed vocabulary, often created by listing all unique characters or words in a dataset. This approach is inherently limited by the vocabulary size and fails to handle new words or characters that were not present in the initial dataset. 14 | 15 | 2. **Inefficient Representation**: Using character-level tokenization results in very long sequences for even moderately sized strings. For instance, a 1000-character text would result in a sequence of 1000 tokens, each representing a single character. This inefficiency becomes problematic when dealing with the fixed context sizes of attention mechanisms in transformers. 16 | 17 | 3. **Lack of Generalization**: Naive tokenization does not generalize well across different datasets or languages. It treats each character or word independently, ignoring the possibility of shared subword units across different words or languages, which could be used for more efficient representation. 18 | 19 | 4. 
**Handling of Unseen Text**: When a model encounters text that was not present in the training data, it struggles to tokenize and understand it, leading to poor model performance. This is especially problematic for LLMs expected to handle a wide variety of inputs. 20 | 21 | 5. **Complexity with Non-English Languages**: Naive tokenization schemes often fail to account for the complexities of non-English languages, which may not conform to the simple delimiters like spaces and punctuation used in English. This results in poor tokenization and, subsequently, poor model performance on non-English text. 22 | 23 | #### Example of Naive Tokenization 24 | 25 | Consider the text "hi there". A naive tokenizer with a character-level approach would tokenize this as a sequence of individual characters: ['h', 'i', ' ', 't', 'h', 'e', 'r', 'e'], each mapped to a unique integer in a lookup table. While this works for short texts, it becomes highly inefficient for larger texts and fails to capture the linguistic structure present in the text. 26 | 27 | #### Conclusion 28 | 29 | Naive tokenization is a starting point for understanding the tokenization process. However, it is inadequate for the needs of advanced LLMs. The limitations of naive tokenization necessitate the development of more sophisticated tokenization schemes that can handle variable-length sequences, generalize across languages, and efficiently represent the input text. Advanced tokenization techniques, such as Byte Pair Encoding (BPE), offer solutions to these challenges and are crucial for the development of state-of-the-art LLMs. 30 | 31 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=27) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/24. Recap and Final Thoughts.md: -------------------------------------------------------------------------------- 1 | # Recap and Final Thoughts 2 | 3 | In this final section, we recap the key points covered in the lesson on tokenization and offer some concluding thoughts on the subject. 4 | 5 | Tokenization is a critical and complex process in the functioning of large language models (LLMs). It involves converting raw text into tokens, which are the fundamental units that LLMs operate on. The complexity of tokenization arises from the need to handle various types of data, including different languages, programming code, and even arithmetic. 6 | 7 | Throughout the lesson, we explored several aspects of tokenization: 8 | 9 | 1. **Naive Tokenization**: We started with a simple character-level tokenization that assigns a unique token to each character. This method is limited and inefficient for LLMs. 10 | 11 | 2. **Advanced Tokenization Schemes**: We discussed more sophisticated tokenization methods, such as byte pair encoding (BPE), which groups frequently occurring character pairs into single tokens, thus optimizing the tokenization process. 12 | 13 | 3. **Embedding Tables**: We examined how tokens are used to look up embeddings from an embedding table, which are then fed into the transformer model. 14 | 15 | 4. **Building a Tokenizer**: We went through the process of building our own tokenizer using the BPE algorithm. 16 | 17 | 5. **Tokenization Challenges**: We delved into the complexities and peculiarities of tokenization, such as handling whitespace, punctuation, and the encoding of non-English languages. 18 | 19 | 6. 
**Live Demonstration**: A live demonstration showed the dynamic nature of tokenization and how different inputs are tokenized. 20 | 21 | 7. **Tokenization of Various Data Types**: We saw how English sentences, arithmetic expressions, non-English languages, and programming languages are tokenized. 22 | 23 | 8. **Improvements in GPT-4**: We noted the advancements in the GPT-4 tokenizer, which more efficiently handles whitespace and other aspects, improving upon the limitations of GPT-2's tokenizer. 24 | 25 | 9. **Writing Tokenization Code**: We covered the essentials of writing code for tokenization, including handling Unicode and UTF-8 encoding. 26 | 27 | 10. **Special Tokens**: The use of special tokens in tokenization was discussed, highlighting their importance in representing specific data structures within LLMs. 28 | 29 | 11. **State-of-the-Art LLMs**: We explored tokenization in cutting-edge LLMs, observing the evolution and refinement of tokenization techniques. 30 | 31 | 12. **SentencePiece**: The SentencePiece library was introduced as a tool for tokenization, with its own set of features and configurations. 32 | 33 | 13. **Design Considerations**: We touched on the considerations for setting the vocabulary size of a tokenizer and the implications of adding new tokens to an existing model. 34 | 35 | 14. **Multimodal Tokenization**: We briefly mentioned the expansion of tokenization beyond text to include other modalities like images and audio. 36 | 37 | 15. **Security and Safety**: We discussed the potential security and safety issues related to tokenization, such as unexpected model behavior when encountering certain token sequences. 38 | 39 | In conclusion, tokenization is a nuanced and foundational component of LLMs that requires careful consideration and understanding. It has a direct impact on the performance, efficiency, and capabilities of LLMs. As we continue to push the boundaries of what LLMs can do, the role of tokenization will undoubtedly evolve, presenting both challenges and opportunities for innovation. 40 | 41 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=7818) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/11. Tokenization of English Sentences.md: -------------------------------------------------------------------------------- 1 | # Tokenization of English Sentences 2 | 3 | ### Tokenization of English Sentences 4 | 5 | Tokenization is a critical process in the operation of large language models (LLMs). It is the process of converting strings of text into sequences of tokens, which are essentially the building blocks that LLMs understand and manipulate. These tokens are not just words or characters; they are often complex combinations of characters that models use to encode information efficiently. 6 | 7 | #### The Tokenization Process 8 | 9 | When we tokenize a string in Python, we're converting text into a list of tokens. For instance, the string "hi there" might be tokenized into a sequence of integers, each representing a specific token or character in the model's vocabulary. In a character-level tokenizer, each character in the input string is directly mapped to a corresponding token. 10 | 11 | However, state-of-the-art language models use more sophisticated tokenization schemes that work on chunks of characters, not just individual characters. These chunks are constructed using algorithms like byte pair encoding (BPE), which we will discuss in more detail. 
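A convenient way to see this in practice is OpenAI's `tiktoken` library, which ships the vocabularies used by the GPT family. The snippet below is a small illustration and assumes the package is installed; the exact integer ids you get depend on the vocabulary in use.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # the byte-level BPE vocabulary used by GPT-2

ids = enc.encode("hi there")   # string -> sequence of integer tokens
print(ids)                     # a short list of integers (the ids depend on the vocabulary)

print(enc.decode(ids))         # tokens -> string; round-trips back to "hi there"
```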
12 | 13 | #### Byte Pair Encoding (BPE) 14 | 15 | BPE is an algorithm used to compress data by replacing common pairs of bytes or characters with a single byte. In the context of tokenization, it helps in creating a vocabulary of tokens based on the frequency of character pairs in the training dataset. For example, if the pair "hi" occurs frequently, it might be replaced with a single token. 16 | 17 | In the GPT-2 paper, the authors use BPE to construct a vocabulary of 50,257 possible tokens. Each token can be thought of as an atom in the LLM, and every operation or prediction made by the model is done in terms of these tokens. 18 | 19 | #### Building Our Own Tokenizer 20 | 21 | When building a tokenizer, we typically start with a large corpus of text, such as a dataset of Shakespeare's works. We then identify all the unique characters in the dataset and assign each a unique token. Using BPE, we iteratively merge the most frequent pairs of tokens, creating a hierarchy of tokens representing various character combinations. 22 | 23 | This process allows us to compress the text data into a shorter sequence of tokens, which is crucial because LLMs have a finite context window. By compressing the text, we enable the model to consider a broader context, which is essential for understanding and generating coherent text. 24 | 25 | #### Complexities of Tokenization 26 | 27 | Tokenization is not without its challenges. One of the main issues is that tokenization can introduce oddities and inefficiencies that affect the performance of LLMs. For instance, tokenization can make simple string processing difficult, affect the model's ability to perform arithmetic, and lead to suboptimal performance with non-English languages. These issues often stem from the arbitrary ways in which text is broken down into tokens. 28 | 29 | Furthermore, the way we tokenize text can have a significant impact on the efficiency of the model. For example, tokenizing Python code can be inefficient if the tokenizer treats whitespace as individual tokens, leading to bloated sequences that exhaust the model's context window. 30 | 31 | #### Conclusion 32 | 33 | Tokenization is a foundational aspect of LLMs that directly influences their capabilities and limitations. A well-designed tokenizer can greatly enhance a model's performance, while a poorly designed one can introduce a range of issues. Understanding the intricacies of tokenization is therefore essential for anyone working with LLMs. 34 | 35 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=365) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/19. Training the Tokenizer.md: -------------------------------------------------------------------------------- 1 | # Training the Tokenizer 2 | 3 | ### Training the Tokenizer 4 | 5 | Tokenization is a critical preprocessing step in working with large language models (LLMs). It involves converting raw text into a sequence of tokens that a model can understand. In this section, we focus on training a tokenizer using the Byte Pair Encoding (BPE) algorithm, which is commonly used in state-of-the-art language models like GPT-2 and GPT-4. 6 | 7 | #### Understanding the Need for Tokenization 8 | 9 | Text data comes in the form of strings, which are sequences of Unicode code points. Before feeding this text into an LLM, we must convert it into a sequence of integers representing tokens. 
Directly using Unicode code points as tokens isn't practical due to their vast number and variability. Instead, we use tokenization algorithms to create a more manageable and stable set of tokens. 10 | 11 | #### Byte Pair Encoding (BPE) Algorithm 12 | 13 | The BPE algorithm works by iteratively merging the most frequent pairs of tokens in the training data. Initially, the tokens are individual characters or bytes from the UTF-8 encoding of the text. As the algorithm proceeds, it creates new tokens representing frequent pairs of characters, thereby compressing the text and reducing the sequence length. 14 | 15 | #### Preparing the Data 16 | 17 | For our example, we use a dataset containing English text, such as a collection of Shakespeare's works. We start by encoding this text into UTF-8, resulting in a sequence of bytes. We then convert these bytes into a list of integers to facilitate processing. 18 | 19 | #### Finding the Most Frequent Pairs 20 | 21 | We define a function `getStats` to count the frequency of each pair of tokens in our sequence. Using this function, we identify the most common pair to merge next. 22 | 23 | #### Merging Pairs 24 | 25 | Once we have the most frequent pair, we replace every occurrence of this pair with a new token. We define a function to perform this replacement, which takes the list of current tokens and the pair to merge, creating a new list with the pair replaced by the new token index. 26 | 27 | #### Iterative Merging 28 | 29 | We set a target vocabulary size and iteratively perform the merging process until we reach this size. Each merge reduces the sequence length and adds a new token to our vocabulary. 30 | 31 | #### Training the Tokenizer 32 | 33 | The training process involves applying the BPE algorithm to our entire dataset. We keep track of the merges in a dictionary, effectively building a binary tree representing the tokenization process. This tree will later allow us to encode new text into tokens and decode token sequences back into text. 34 | 35 | #### Encoding and Decoding Functions 36 | 37 | With the tokenizer trained, we implement functions to encode raw text into token sequences and decode token sequences back into text. These functions rely on the merges dictionary and the vocabulary we created during training. 38 | 39 | #### Special Tokens 40 | 41 | In addition to the tokens created through BPE, we can introduce special tokens to represent specific concepts or delimiters, such as the end-of-text token used in GPT-2. These special tokens are manually added to the vocabulary and handled separately from the BPE-generated tokens. 42 | 43 | #### Final Thoughts 44 | 45 | Tokenization is a nuanced process with many complexities. It affects the performance and capabilities of LLMs in various tasks, including spelling, arithmetic, and handling non-English languages. Understanding tokenization is essential for anyone working with LLMs, as it influences the model's behavior and its interaction with different types of data. 46 | 47 | [Video link](findSectionTimestamps is not a function or its return value is not iterable) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/13. 
Tokenization of Non-English Languages.md: -------------------------------------------------------------------------------- 1 | # Tokenization of Non-English Languages 2 | 3 | ### Tokenization of Non-English Languages 4 | 5 | Tokenization plays a critical role in the performance of large language models (LLMs) across various languages. It is essential to understand that tokenization schemes significantly impact the model's ability to process and generate text, especially in non-English languages. 6 | 7 | #### The Challenge with Non-English Tokenization 8 | 9 | Tokenization can be particularly challenging for non-English languages due to several reasons: 10 | 11 | 1. **Training Data Imbalance**: LLMs like GPT-2 or GPT-3 are often trained on datasets predominantly composed of English text. This imbalance leads to a tokenizer that is highly optimized for English, at the expense of other languages. 12 | 13 | 2. **Complex Tokenization**: Non-English languages may have different grammatical structures, scripts, and character sets that complicate the tokenization process. For example, languages like Chinese or Korean do not use spaces to delimit words, and languages like Arabic have script-specific features like cursive writing and diacritical marks. 14 | 15 | 3. **Sequence Length Bloat**: When tokenizing non-English text, the resulting token sequences tend to be longer than their English counterparts. This bloat occurs because the tokenizer, trained predominantly on English data, fails to efficiently chunk non-English text. Consequently, the model has a shorter effective context length for non-English languages due to more tokens being used to represent the same content. 16 | 17 | 4. **Lack of Token Merges**: In English, common phrases or words might be merged into single tokens, reducing the sequence length. However, due to less frequent exposure to non-English phrases or words during tokenizer training, such efficient merges are less likely to occur, leading to longer token sequences for non-English text. 18 | 19 | #### Example: Tokenization of Korean Text 20 | 21 | Consider the Korean greeting "안녕하세요" (annyeonghaseyo), which means "hello." In an LLM trained predominantly on English, this phrase might not be efficiently tokenized. Instead of being represented as a single token or a small number of tokens, it might be broken down into a larger number of tokens, each representing smaller pieces of the phrase. This inefficient tokenization stretches out the representation of the phrase, consuming more of the model's context window. 22 | 23 | #### Addressing Non-English Tokenization Issues 24 | 25 | To improve tokenization for non-English languages, the following approaches can be considered: 26 | 27 | 1. **Balanced Training Sets**: Ensure that the training set for the tokenizer includes a representative mix of languages. This balance allows the tokenizer to learn more efficient token merges for a variety of languages, not just English. 28 | 29 | 2. **Specialized Tokenizers**: Develop language-specific tokenizers that are trained on large datasets of the target language. These tokenizers can better capture the nuances and common patterns of the language. 30 | 31 | 3. **Post-Training Adjustments**: After training a general tokenizer, make adjustments to the vocabulary by adding more tokens specific to non-English languages or by fine-tuning the tokenizer on a more balanced dataset. 32 | 33 | 4. 
**Increased Context Window**: Design LLMs with a larger context window to mitigate the impact of sequence length bloat for non-English languages. 34 | 35 | 5. **Script-Specific Preprocessing**: Apply preprocessing steps that are tailored to the writing system of the non-English language before tokenization. For example, segmenting text into words for languages that do not use spaces. 36 | 37 | By taking these steps, we can create tokenizers that are more inclusive and performant across diverse languages, leading to LLMs that are truly global in their capabilities. 38 | 39 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=570) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/15. Improvements in GPT-4 Tokenizer.md: -------------------------------------------------------------------------------- 1 | # Improvements in GPT-4 Tokenizer 2 | 3 | ### Section 15: Improvements in GPT-4 Tokenizer 4 | 5 | The evolution of tokenization from GPT-2 to GPT-4 has brought about significant improvements in the way large language models (LLMs) understand and generate text. Tokenization, the process of converting strings of text into sequences of tokens, is a foundational step in training and utilizing LLMs. In this section, we will discuss the advancements made in the tokenizer used for GPT-4. 6 | 7 | #### GPT-2 Tokenizer Limitations 8 | 9 | In GPT-2, tokenization was based on a character-level scheme that often resulted in inefficiencies. For example, the tokenizer would treat whitespace and Python indentation with the same importance as other characters, leading to token bloating. This inefficiency was particularly evident when dealing with programming languages like Python, where indentation is syntactically significant. As a result, GPT-2's ability to understand and generate code was less than optimal. 10 | 11 | #### GPT-4 Tokenizer Enhancements 12 | 13 | GPT-4 introduced a tokenizer with a larger vocabulary size, approximately doubling from GPT-2's 50k tokens to 100k tokens. This expansion allows the model to represent text more densely, effectively doubling the context the model can see and process. With a larger vocabulary, the same text is represented with fewer tokens, which is beneficial for the transformer architecture that GPT models use. 14 | 15 | One of the most notable improvements in GPT-4's tokenizer is its handling of whitespace, particularly in the context of programming languages. The tokenizer now groups multiple whitespace characters into single tokens, which densifies the representation of code and allows the transformer to attend to more relevant parts of the code, improving its ability to predict the next token in a sequence. 16 | 17 | This change in handling whitespace is a deliberate design choice by OpenAI, reflecting a deeper understanding of how tokenization impacts model performance, especially in specialized tasks like code generation. By optimizing the tokenizer for common patterns in programming languages, GPT-4 shows marked improvements in code-related tasks. 18 | 19 | #### Practical Implications 20 | 21 | The improvements in GPT-4's tokenizer have several practical implications: 22 | 23 | 1. **Efficient Tokenization**: The model can process text more efficiently, with a particular improvement in handling programming languages and structured data. 24 | 2. 
**Expanded Context**: GPT-4 can consider a larger context when making predictions, which is crucial for tasks that require understanding longer passages of text. 25 | 3. **Specialized Performance**: By optimizing for whitespace in programming languages, GPT-4 becomes more adept at code generation and understanding. 26 | 27 | #### Code Example: Tokenizer Comparison 28 | 29 | To illustrate the difference in tokenization between GPT-2 and GPT-4, consider the following Python code snippet: 30 | 31 | ```python 32 | # Python code snippet for FizzBuzz 33 | for i in range(1, 16): 34 | if i % 3 == 0 and i % 5 == 0: 35 | print("FizzBuzz") 36 | elif i % 3 == 0: 37 | print("Fizz") 38 | elif i % 5 == 0: 39 | print("Buzz") 40 | else: 41 | print(i) 42 | ``` 43 | 44 | In GPT-2, each space would be tokenized separately, leading to a long sequence of tokens. In GPT-4, the tokenizer groups these spaces, resulting in a more compact token sequence that the model can process more effectively. 45 | 46 | #### Conclusion 47 | 48 | The tokenizer is a crucial component of LLMs, and the improvements in GPT-4's tokenizer demonstrate OpenAI's commitment to enhancing the model's understanding of language and code. These advancements not only improve the efficiency of the model but also open up new possibilities for its application, particularly in areas where understanding context and structure is essential. 49 | 50 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=748) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/8. Building Our Own Tokenizer.md: -------------------------------------------------------------------------------- 1 | # Building Our Own Tokenizer 2 | 3 | ### Building Our Own Tokenizer 4 | 5 | In this section, we will explore the construction of a tokenizer, which is an essential component in working with large language models (LLMs). Tokenization is the process of converting strings of text into sequences of tokens, which are essentially atomic units that the model can understand and process. 6 | 7 | #### Why Tokenization Matters 8 | 9 | Tokenization might not be the most exciting aspect of working with LLMs, but it is crucial. It influences the performance of the model significantly. Issues that may seem related to the architecture of the neural network often trace back to tokenization. For instance, difficulties in spelling, string processing, handling non-English languages, or even simple arithmetic can often be attributed to the way text is tokenized. 10 | 11 | #### The Naive Approach to Tokenization 12 | 13 | Previously, we tokenized text in a very simplistic way—character-level tokenization. Each character in the text was mapped to a unique integer token. However, this approach is not practical for state-of-the-art models due to its limitations in efficiency and expressiveness. 14 | 15 | #### Advanced Tokenization Schemes 16 | 17 | Instead of character-level tokenization, modern LLMs use more sophisticated methods. These methods tokenize text at the chunk level, where chunks are sequences of characters that are often merged based on their frequency of co-occurrence in the training data. 18 | 19 | One such method is Byte Pair Encoding (BPE), which iteratively merges the most frequent pairs of tokens (or characters in the initial iteration) to create a new token. This process continues until a predefined vocabulary size is reached. 
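The idea is easiest to see on a toy string. The illustration below uses plain string replacement and the invented symbols `Z` and `Y` to stand in for new tokens; a real implementation works on token ids rather than characters, as shown in the next section.

```python
# Two BPE merge steps on a toy string, with letters standing in for new tokens.
s = "aaabdaaabac"
s = s.replace("aa", "Z")   # "aa" is the most frequent pair -> merge into Z
print(s)                   # ZabdZabac
s = s.replace("Za", "Y")   # "Za" is now tied for most frequent -> merge into Y
print(s)                   # YbdYbac
```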
20 | 21 | #### Implementing Byte Pair Encoding 22 | 23 | To implement BPE, we start by identifying the most frequent pair of tokens in our dataset. We then replace all occurrences of this pair with a new token. This process is repeated, each time identifying and merging the next most frequent pair, until we have built a vocabulary of desired size. 24 | 25 | Here's a simple Python function that finds the most common pair: 26 | 27 | ```python 28 | def get_stats(vocab): 29 | pairs = collections.defaultdict(int) 30 | for word, freq in vocab.items(): 31 | symbols = word.split() 32 | for i in range(len(symbols) - 1): 33 | pairs[symbols[i], symbols[i + 1]] += freq 34 | return pairs 35 | ``` 36 | 37 | And another function that merges the pair throughout the vocabulary: 38 | 39 | ```python 40 | def merge_vocab(pair, v_in): 41 | v_out = {} 42 | bigram = re.escape(' '.join(pair)) 43 | p = re.compile(r'(?`, ``, etc.) are often added to the vocabulary to handle specific cases like the end of a sentence or padding, and they must be managed carefully to ensure the model behaves as expected. 76 | 77 | #### Conclusion 78 | 79 | BPE is a powerful tool for tokenizing text for use in LLMs. It allows us to compress text and create a manageable vocabulary for the model. However, it requires careful implementation and consideration of its limitations and complexities. Understanding BPE is crucial for anyone working with LLMs, as it is often the first step in preparing data for model training and inference. 80 | 81 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE?t=1428) -------------------------------------------------------------------------------- /pages/Let's build the GPT Tokenizer/14. Tokenization of Programming Languages.md: -------------------------------------------------------------------------------- 1 | # Tokenization of Programming Languages 2 | 3 | ### Tokenization of Programming Languages 4 | 5 | Tokenization is a critical process in understanding and utilizing large language models (LLMs), especially when dealing with programming languages. Programming languages offer a unique set of challenges for tokenization due to their structured syntax and the importance of whitespace and special characters. 6 | 7 | #### The Challenge with Tokenization in Programming Languages 8 | 9 | Programming languages are different from natural languages in that they are designed to be parsed by machines, which means they have a strict syntax and structure. Tokenization of programming languages, therefore, requires careful consideration of these structures to ensure that the language model can understand and generate code effectively. 10 | 11 | Whitespace, for instance, is significant in programming languages like Python, where indentation levels determine the scope of code blocks. Incorrect tokenization of whitespace can lead to code that is syntactically incorrect or has a different meaning than intended. 12 | 13 | #### Example: Python and Whitespace 14 | 15 | Consider a simple Python code snippet for the FizzBuzz problem: 16 | 17 | ```python 18 | for i in range(1, 101): 19 | if i % 3 == 0 and i % 5 == 0: 20 | print("FizzBuzz") 21 | elif i % 3 == 0: 22 | print("Fizz") 23 | elif i % 5 == 0: 24 | print("Buzz") 25 | else: 26 | print(i) 27 | ``` 28 | 29 | In this example, the indentation is critical. If a tokenizer treats each space as an individual token, the resulting token sequence would be unnecessarily long and inefficient. 
This inefficiency is problematic for LLMs since they have a finite context window in which they can attend to tokens. Excessive tokenization of whitespace can consume this context window rapidly, leaving less room for the actual logical components of the code.
30 | 
31 | #### Tokenization Improvements in GPT-4
32 | 
33 | Comparing GPT-2 and GPT-4 reveals an improvement in handling whitespace in Python code. GPT-2 tokenizes each space individually, leading to a bloated token sequence. On the other hand, GPT-4 groups multiple spaces into a single token, allowing for a denser and more efficient representation of the code.
34 | 
35 | This improvement in GPT-4's tokenizer is a deliberate design choice, optimizing the tokenization process for programming languages and resulting in better performance when generating or understanding code.
36 | 
37 | #### Building a Custom Tokenizer for Programming Languages
38 | 
39 | When constructing a tokenizer for programming languages, one must consider the following:
40 | 
41 | - **Character-Level Tokenization**: Starting with a character-level tokenizer as a base is often not sufficient for programming languages due to the significance of whitespace and the need to recognize multi-character tokens like keywords and operators.
42 | 
43 | - **Advanced Schemes**: Utilizing advanced tokenization schemes, such as Byte Pair Encoding (BPE), can help create a more suitable vocabulary for programming languages. BPE can merge frequently occurring character sequences into single tokens, which can include language keywords, common variable names, or function calls.
44 | 
45 | - **Special Characters**: Programming languages often use special characters like braces, parentheses, and operators. A tokenizer must be able to recognize these characters and give them appropriate significance in the token sequence.
46 | 
47 | - **Unicode and UTF-8**: Programming languages can include Unicode characters, especially in string literals or comments. A tokenizer must handle Unicode characters correctly, often using UTF-8 encoding.
48 | 
49 | #### Conclusion
50 | 
51 | Tokenization of programming languages is a complex task that requires a nuanced approach to handle the structured nature of code. Improvements in tokenizers, such as those seen from GPT-2 to GPT-4, demonstrate the importance of optimizing tokenization for programming contexts. Building a custom tokenizer for programming languages involves considering whitespace handling, advanced tokenization schemes, special character recognition, and Unicode support. By addressing these aspects, one can create a tokenizer that is more aligned with the syntactic and structural requirements of programming languages, leading to better performance in language models that process code.
52 | 
53 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE&t=684)
--------------------------------------------------------------------------------
/pages/Let's build the GPT Tokenizer/12. Tokenization of Arithmetic.md:
--------------------------------------------------------------------------------
1 | # Tokenization of Arithmetic
2 | 
3 | ### Tokenization of Arithmetic
4 | 
5 | Tokenization is the process by which raw text is converted into a sequence of tokens, which are the basic units that a language model (LM) understands and processes. This process is crucial for the performance of language models, especially when dealing with tasks that involve understanding and manipulating text at a character or symbol level, such as arithmetic.
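As a quick illustration of why this matters, the sketch below inspects how a GPT-2-style BPE tokenizer chunks a few digit strings. It assumes the third-party `tiktoken` package is installed; the exact splits vary between vocabularies and are not hard-coded here.

```python
# Sketch: how a GPT-2-style BPE tokenizer splits digit strings.
# Assumes the third-party `tiktoken` package is installed.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for number in ["127", "677", "804", "123456789"]:
    pieces = [enc.decode([t]) for t in enc.encode(number)]
    print(number, "->", pieces)
```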
6 | 
7 | #### The Challenge with Arithmetic Tokenization
8 | 
9 | Arithmetic poses a unique challenge for tokenization in language models because arithmetic operations are inherently character-level processes. For example, when adding numbers, humans typically align digits by their place value, carry over numbers during addition, and perform operations digit by digit. Language models that tokenize input into larger chunks may struggle with these tasks because the tokenization process can obscure the character-level details necessary for arithmetic.
10 | 
11 | #### How Tokenization Affects Arithmetic in LLMs
12 | 
13 | Large language models (LLMs) like GPT-2 often tokenize numbers in a way that can be arbitrary and non-intuitive. During the tokenization process, a number like "127" might be tokenized as a single unit, while "677" might be split into two separate tokens, "6" and "77". This inconsistent tokenization can make it difficult for the model to understand and perform arithmetic operations correctly.
14 | 
15 | For example, if a model encounters the number "804", it may tokenize it into "8" and "04", treating these as two separate entities. When asked to perform arithmetic involving "804", the model must reconcile the fact that it's working with two tokens that it has learned to interpret as separate concepts, rather than as a single number.
16 | 
17 | #### Tokenization Schemes and Arithmetic
18 | 
19 | Different tokenization schemes can lead to different levels of performance in arithmetic tasks. For instance, character-level tokenization might be more suitable for arithmetic as it preserves the individual digits and their order. However, character-level tokenization can lead to very long sequences for large texts, which is computationally inefficient.
20 | 
21 | Advanced tokenization schemes like Byte Pair Encoding (BPE) aim to strike a balance by creating a vocabulary of frequently occurring character chunks. BPE can help reduce sequence length while still preserving some character-level information. However, the way BPE chunks numbers can still be arbitrary, affecting the model's ability to perform arithmetic.
22 | 
23 | #### Improving Arithmetic Tokenization
24 | 
25 | The tokenization of arithmetic can be improved in several ways. One approach is to design tokenizers that are more sensitive to the structure of numbers. For example, ensuring that numbers are tokenized consistently, without splitting digits in a way that loses their numerical meaning, can help LLMs better understand and perform arithmetic.
26 | 
27 | In the case of GPT-4, the tokenizer was improved to be more efficient in handling numbers and Python code by grouping more whitespace characters into single tokens. This denser representation allows the model to attend to more context, which is especially beneficial for tasks like coding where whitespace is meaningful.
28 | 
29 | #### Implementing Arithmetic Tokenization
30 | 
31 | When building a tokenizer, one must consider how it will handle different types of text, including arithmetic. Implementing a tokenizer involves defining a vocabulary and creating rules for how text is split into tokens. For arithmetic, it's important to ensure that numbers are tokenized in a way that preserves their mathematical properties.
32 | 
33 | For instance, a simple tokenizer might split text into tokens based on whitespace and punctuation, but this approach could be problematic for arithmetic where whitespace isn't always meaningful, and numbers need to be kept intact.
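As a purely illustrative sketch (not the approach used by any production tokenizer), here is what a whitespace-and-punctuation splitter does to numbers that contain a decimal point or a thousands separator:

```python
# Sketch of a naive whitespace/punctuation tokenizer (illustrative only).
import re

def naive_tokenize(text):
    # Split on every non-word character, keeping the separators,
    # then drop the purely-whitespace pieces.
    return [tok for tok in re.split(r"(\W)", text) if tok.strip()]

print(naive_tokenize("add 3.14 and 1,000"))
# -> ['add', '3', '.', '14', 'and', '1', ',', '000']
```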
A more sophisticated tokenizer might use patterns or algorithms to identify numbers and tokenize them as whole units.
34 | 
35 | #### Conclusion
36 | 
37 | Tokenization is a foundational aspect of language model performance, with significant implications for tasks like arithmetic. The way a tokenizer handles numbers can either enable or hinder a model's ability to perform arithmetic operations. As such, careful design and implementation of tokenization strategies are essential for building LLMs that are competent in arithmetic and other tasks requiring fine-grained text manipulation.
38 | 
39 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE&t=439)
--------------------------------------------------------------------------------
/pages/Let's build the GPT Tokenizer/6. Byte Pair Encoding Algorithm.md:
--------------------------------------------------------------------------------
1 | # Byte Pair Encoding Algorithm
2 | 
3 | ### Byte Pair Encoding (BPE) Algorithm
4 | 
5 | The Byte Pair Encoding (BPE) algorithm is a tokenization method used in Natural Language Processing (NLP), particularly in the context of Large Language Models (LLMs) such as GPT-2 and GPT-4. Understanding BPE is crucial as it influences the way LLMs process and generate text. BPE is a middle ground between character-level tokenization, which can be inefficient, and word-level tokenization, which can struggle with large vocabularies and out-of-vocabulary words.
6 | 
7 | #### The Need for BPE
8 | 
9 | When training LLMs, we need to convert raw text into a format that the model can understand, typically a sequence of integers or tokens. A naive approach might involve using character-level tokenization, but this can lead to long sequences that are computationally expensive for the model to process. On the other end, word-level tokenization can lead to a vast vocabulary with many rare words that the model will rarely see during training. BPE addresses these issues by creating a vocabulary of subword units, which are more frequent than rare words but more meaningful than individual characters.
10 | 
11 | #### How BPE Works
12 | 
13 | BPE operates by iteratively merging the most frequent pairs of characters or tokens in the training data. Starting with a base vocabulary of individual characters, BPE looks for the most common adjacent pairs of tokens and merges them into a new token, adding it to the vocabulary. This process is repeated until a desired vocabulary size is reached or until no more merges can improve the model.
14 | 
15 | Here's a step-by-step breakdown of the BPE algorithm (a short worked example follows the list):
16 | 
17 | 1. **Initialize the Vocabulary**: Begin with a vocabulary containing every unique character in the dataset.
18 | 2. **Count Pairs**: Count the frequency of each adjacent pair of tokens in the dataset.
19 | 3. **Merge Pairs**: Identify the most frequent pair of tokens and merge them into a new single token.
20 | 4. **Update the Dataset**: Replace all instances of the identified pair in the dataset with the new token.
21 | 5. **Repeat**: Continue the process of counting and merging until the vocabulary reaches a predetermined size or no more beneficial merges can be made.
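Here is a tiny worked example of these steps on the string "aaabdaaabac". It is a sketch for intuition only; real tokenizers such as GPT-2's operate on bytes and track a proper vocabulary and merge table.

```python
# Toy walk-through of the BPE steps on a short string (illustrative sketch).
tokens = list("aaabdaaabac")

def count_pairs(tokens):
    # Step 2: count each adjacent pair of tokens.
    counts = {}
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(tokens, pair):
    # Steps 3-4: replace every occurrence of `pair` with one concatenated token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

for _ in range(3):  # Step 5: repeat (three merges are enough to see the effect)
    counts = count_pairs(tokens)
    pair = max(counts, key=counts.get)
    tokens = merge(tokens, pair)
    print(pair, "->", "".join(pair), tokens)
```

Each merge shortens the sequence and adds one new, longer token to the working vocabulary; the next section shows the same idea written out as a fuller implementation.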
22 | 
23 | #### BPE in Practice
24 | 
25 | To illustrate BPE in practice, consider the following Python code snippet that demonstrates a simple implementation of the BPE algorithm:
26 | 
27 | ```python
28 | # Define the initial data and vocabulary.
29 | # Each word is represented as its characters separated by spaces,
30 | # together with how often the word occurs in the data.
31 | data = "this is a simple example of how BPE works"
32 | vocab = {}
33 | for word in data.split():
34 |     spaced = ' '.join(word)
35 |     vocab[spaced] = vocab.get(spaced, 0) + 1
36 | 
37 | # Define a function to count token pairs
38 | def get_stats(vocab):
39 |     pairs = {}
40 |     for word, freq in vocab.items():
41 |         symbols = word.split()
42 |         for i in range(len(symbols) - 1):
43 |             pair = (symbols[i], symbols[i + 1])
44 |             pairs[pair] = pairs.get(pair, 0) + freq
45 |     return pairs
46 | 
47 | # Define a function to merge the most frequent pair
48 | def merge_vocab(pair, vocab):
49 |     new_vocab = {}
50 |     bigram = ' '.join(pair)
51 |     replacement = ''.join(pair)
52 |     for word, freq in vocab.items():
53 |         new_word = word.replace(bigram, replacement)
54 |         new_vocab[new_word] = freq
55 |     return new_vocab
56 | 
57 | # BPE loop
58 | num_merges = 10  # for example
59 | for i in range(num_merges):
60 |     pairs = get_stats(vocab)
61 |     if not pairs:
62 |         break
63 |     best_pair = max(pairs, key=pairs.get)
64 |     vocab = merge_vocab(best_pair, vocab)
65 |     print(f"Merge #{i + 1}: {best_pair} -> {''.join(best_pair)}")
66 | ```
67 | 
68 | In this example, we start with a simple string and an initial vocabulary in which each word is split into space-separated characters and mapped to its frequency. We then define functions to count the frequency of adjacent pairs (`get_stats`) and to merge the most frequent pair (`merge_vocab`). The BPE loop iterates a specified number of times, each time merging the most frequent pair and updating the vocabulary.
69 | 
70 | #### Considerations and Complexities
71 | 
72 | While BPE is powerful, it introduces complexities:
73 | 
74 | - **Token Ambiguity**: BPE may lead to ambiguous tokens where the same string can be tokenized differently depending on its context.
75 | - **Tokenization of Non-English Text**: BPE may not perform equally well on non-English text due to differences in character usage and frequency.
76 | - **Special Tokens**: LLMs may use special tokens (e.g., end-of-text) that require careful handling during tokenization.
77 | - **Vocabulary Size**: The choice of vocabulary size is a trade-off between model complexity and tokenization granularity.
78 | 
79 | In summary, BPE is a tokenization scheme that helps LLMs handle a wide range of text data efficiently. It balances the need for a manageable vocabulary size with the ability to represent a diverse set of words and subwords, thus enabling more effective language modeling. Understanding and implementing BPE is a critical step in the development and training of LLMs.
80 | 
81 | [Video link](https://www.youtube.com/watch?v=zduSFxRajkE&t=155)
--------------------------------------------------------------------------------