├── .gitattributes ├── 1-intro ├── README.md └── tok_pipeline.png ├── 2-bpe ├── README.md ├── ex_corpus.txt ├── openai_bpe_viz.py ├── orig_bpe.py └── walkthrough.ipynb ├── 3-hf-tokenizer ├── README.md ├── bpe.py ├── hf_slow.png ├── minimal_hf_tok.py ├── save_hf.py ├── vocab.json └── walkthrough.ipynb ├── 4-tokenization-is-hard ├── README.md └── toxicity_detection_nllb.png ├── 5-puzzles ├── README.md ├── get_stack_subset.py ├── get_token_counts.py ├── paul_graham_essay_scraper.py └── tok_pipeline.png ├── 6-postprocessing-and-more ├── README.md ├── benchmark.py ├── image-1.png ├── image-2.png ├── image-3.png ├── image.png └── tokenizer_shrink.py ├── 7-galactica ├── README.md ├── corpus.png ├── image-1.png ├── image-2.png ├── image-3.png ├── image-4.png └── image.png ├── 8-chat-templates └── README.md ├── LICENSE └── README.md /.gitattributes: -------------------------------------------------------------------------------- 1 | *.ipynb linguist-vendored=true 2 | -------------------------------------------------------------------------------- /1-intro/README.md: -------------------------------------------------------------------------------- 1 | # Table of Contents 2 | 3 | 4 | 5 | - [What does a Tokenizer do?](#what-does-a-tokenizer-do) 6 | - [What are Tokens?](#what-are-tokens) 7 | - [Main approaches to tokenization](#main-approaches-to-tokenization) 8 | * [Popular subword tokenization algorithms](#popular-subword-tokenization-algorithms) 9 | * [Byte-pair encoding (BPE)](#byte-pair-encoding-bpe) 10 | + [Test-Time Tokenization](#test-time-tokenization) 11 | * [WordPiece Tokenization](#wordpiece-tokenization) 12 | + [Test-Time Tokenization](#test-time-tokenization-1) 13 | * [Unigram Tokenization](#unigram-tokenization) 14 | + [Test-Time Tokenization](#test-time-tokenization-2) 15 | - [SentencePiece](#sentencepiece) 16 | - [The Tokenization Pipeline](#the-tokenization-pipeline) 17 | - [Next Chapter](#next-chapter) 18 | - [Further reading](#further-reading) 19 | 20 | 21 | 22 | # What does a Tokenizer do? 23 | Let's see a 🤗 tokenizer in action first: 24 | ``` 25 | from transformers import AutoTokenizer 26 | tokenizer = AutoTokenizer.from_pretrained("gpt2") 27 | print(tokenizer.encode("Let's understand tokens")) 28 | # Output: [5756, 338, 1833, 16326] 29 | print(tokenizer.batch_decode(tokenizer.encode("Let's understand tokens"))) # convert token ids to tokens 30 | # Output: ['Let', "'s", ' understand', ' tokens'] 31 | ``` 32 | The tokenizer encodes a piece of text into a sequence of _token ids_. These token ids are fed into our neural network. The neural network has a special layer in the beginning called the _embedding_ layer. Corresponding to each _token id_, the embedding layer stores a unique embedding vector. Given the input sequence of token ids, the embedding layer effectively performs a per-token-id lookup to output a sequence of embedding vectors. Before we get any further, we should ask: What are tokens? How do you decide where to break up a piece of text? What are the different ways in which you can break up text? 33 | 34 | 35 | # What are Tokens? 36 | 37 | In essence, a token is an atomic piece of text. Tokenization/segmentation of text is the process of breaking up text into smaller, meaningful units, and it is one of the oldest basic steps in natural language processing. These tokens should ideally be _word forms_, or in simple terms, some variation/derivative of a word. 
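For instance, a trained subword tokenizer will often split a longer word into a stem-like piece plus affix-like pieces. Here's a quick, purely illustrative check with GPT-2's tokenizer (the exact splits depend entirely on the learned vocabulary, so treat the output as indicative rather than guaranteed):

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Derived words tend to break into root-like and suffix-like pieces
print(tokenizer.tokenize("tokenization"))
print(tokenizer.tokenize("unbelievably"))
```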
Adding some history: the Morphological Annotation Framework (MAF) ISO standard defines typographical units or tokens as a "non-empty contiguous sequence of graphemes or phonemes in a document". The typographic separator we're all familiar with is whitespace (which _separates_ words), but this is script-dependent (Latin scripts, for example, use whitespace, but the Japanese script does not). Previously, one would have to make a set of arbitrary rules to come up with a mapping between words and tokens (for example, "don't" -> "do n't"). 38 | 39 | > Side note: Morphosyntactic is a fancy way of saying that you are marking up text to indicate various attributes of words, such as their parts of speech, gender, number, case, etc. MAF was a proposal that came up in 2005 (interestingly, you can still find their slides [here](http://atoll.inria.fr/RNIL/Jeju04.pdf)). A grapheme is the smallest unit of a writing system of any given language. A phoneme is the smallest unit of sound in a language that can distinguish one word from another. There are schools of thought on what constitutes a grapheme and what doesn't, and I don't want to get into that mess here. 40 | 41 | As expected, there's quite a bit of history about the evolution of tokenization/segmentation. Initially, this was based solely on meaningful typographic units (for the English language, this would be words and special characters separated by whitespace), and we've now moved towards a more fine-grained, sub-word level. An excellent study from [Mielke *et al*](https://arxiv.org/abs/2112.10508) provides an in-depth look. 42 | 43 | 44 | # Main approaches to tokenization 45 | Let's quickly go over different kinds of tokenization. The two extremes are character-based and word-based tokenization. With character tokenization, the vocabulary is small, but the splits are too fine-grained, leading to very long tokenized sequences. Further, this does not provide enough meaningful language representation for the model to springboard from. With word tokenization, you get meaningful units, but such tokenizers are *closed-vocabulary* - they are unable to deal with rare/novel words that weren't seen during training (out-of-vocabulary words). Even if you did assemble a gigantic corpus of all possible words, the resulting vocabulary would be too large to deal with, because words can contain declensions, punctuation, misspellings, etc. The most popular form of tokenization is the middle ground: sub-word tokenization. The optimal way of breaking down text into different component sub-words is learned from a relevant text corpus. 46 | 47 | With subword tokenization, it is pretty non-trivial to figure out the best way to split words. Notable examples are words such as don't (a contraction, where a combination of the words `do` and `not` is shortened to don't), or code (an example with type-specific grammar). Further, as you'll see, the statistical nature of tokenization models also gives you weird results - such as the numbers 450 and 448 being represented with a different number of tokens. (This is, in essence, a form of overfitting in your tokenizer.) 48 | 49 | ## Popular subword tokenization algorithms 50 | If you're just starting out with learning about different subword tokenization algorithms, [HuggingFace's NLP course](https://huggingface.co/learn/nlp-course/chapter6/1) is a must-watch. Their videos have crisp animations that provide a good introduction, and it's very hard to top that in writing. 
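As a quick aside, here is one way to poke at the "weird results" mentioned above, such as visually similar numbers getting different token counts. This is just a sketch of how to check; the actual counts depend on the tokenizer you load:

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
for num in ["450", "448", " 450", " 448"]:  # leading spaces matter for BPE-style tokenizers
    ids = tokenizer.encode(num)
    print(repr(num), ids, tokenizer.convert_ids_to_tokens(ids))
```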
That said, here's a brief summary of the most popular subword tokenization algorithms: 51 | ## Byte-pair encoding (BPE) 52 | Byte-pair encoding/digram coding comes from information theory, and was first proposed in 1994 ([Web archive](https://web.archive.org/web/20160326130908/http://www.csse.monash.edu.au/cluster/RJK/Compress/problem.html)). BPE tokenization first performs a character-level tokenization of the given corpus. Then, the most frequent pair of adjacent characters is merged into a new, 2-character-long token, and all instances of the pair are replaced by this new token. This process is repeated (now with variable-length tokens, not just characters) until you achieve your desired vocabulary size. At each step, whenever merges happen, the tokenizer records these as "merge rules", to be used post-training while tokenizing a given piece of text. Going back to the information-theoretic meaning: given a text file, you iteratively replace the most frequent pair of consecutive bytes in your data with an unused byte (a byte that didn't appear in your text file). This is where you get the term "byte pair encoding". (There are, as expected, various ways BPE has been used. For example, GPT-2 uses a byte-level BPE, where BPE is applied to raw bytes, not characters.) 53 | 54 | ### Test-Time Tokenization 55 | Given a piece of text, the BPE algorithm first performs a character-level tokenization, and then uses the merge rules learnt during training to iteratively merge tokens until you can no longer reduce the tokenized sequence further. The exact time complexity depends on the implementation, but the SentencePiece implementation (which is very popular, and integrated into 🤗 tokenizers) takes $O(N \log N)$, where $N$ is the length of the input sentence. 56 | 57 | ## WordPiece Tokenization 58 | WordPiece is another popular tokenization strategy, used in BERT, DistilBERT, ELECTRA, etc. WordPiece tokenization is very similar to BPE, but it differs in the way pairs are selected during the merging steps. In the case of BPE, pairs are ranked simply by frequency, and the most frequent pair is selected each time. With WordPiece, a custom score is computed for each pair as follows: 59 | 60 | $$\text{score} = \frac{\text{freq(pair)}}{\text{freq(first element)} \times \text{freq(second element)}}$$ 61 | 62 | The pair with the highest score is merged at each iteration. By normalizing the frequency of the pair by the individual token frequencies, the algorithm prefers merging pairs whose individual parts are not already frequent in the vocabulary. 63 | 64 | ### Test-Time Tokenization 65 | Given a piece of text, the WordPiece algorithm uses a left-to-right, longest-match-first strategy. (This seems to be linear-time processing, though I need to dig into the implementation.) 66 | 67 | ## Unigram Tokenization 68 | Unigram tokenization uses a statistical language model to model token probabilities. The basic assumption made by the unigram model is that the occurrence of each word is independent of its previous word (hence, it is a "unigram" language model). This is, of course, not appropriate for any serious model you'd want for generation, but we're looking at tokenization. Compared to BPE, Unigram's training strategy proceeds in the opposite direction: a large vocabulary is iteratively reduced in size by removing certain tokens. The training process can be summarized as follows: you first start with a large vocabulary, for example by considering all possible substrings or by training BPE and using its learnt vocabulary. 
At each step of the training, the algorithm computes a loss over the training corpus for the current vocabulary. Then, you reduce the vocabulary size by removing some $x$ percent of the tokens that cause the loss to increase the least. 69 | ### Test-Time Tokenization 70 | Given a word, we look at all the possible segmentations of that word into tokens and compute the probability of each according to the trained Unigram model. We pick the tokenization/segmentation with the highest probability. Segmentations with more tokens usually end up with lower total probability (because you multiply more small numbers together), and thus the model favours segmentations with a smaller number of tokens, similar to what we expect. 71 | 72 | # SentencePiece 73 | This is NOT a tokenization algorithm. SentencePiece is a framework for tokenization and detokenization. It is widely used because it has some desirable properties for an end-to-end text processing system, such as being purely data-driven and not relying on pre-tokenization steps (i.e., you don't have to send in text separated by whitespace). It is also language-independent, and supports both the BPE and Unigram language model algorithms. When you see "a SentencePiece model" in any literature, this simply means that the model was trained using the SentencePiece library, and that the configuration/parameters of the model are available via the SentencePiece model abstraction (roughly speaking, this is like saying "PyTorch model", or "model trained with 🤗Trainer"). The tokenization algorithm itself can be BPE (the most likely algorithm) or Unigram (or a custom variant, if mentioned). 74 | 75 | # The Tokenization Pipeline 76 | The full tokenization pipeline is below. I won't bother defining these steps precisely, because you'll find good definitions in a number of introductory materials on tokenization. 77 | 78 | 1. Normalization: Cleaning - lower-casing, removing accents, etc. 79 | 2. Pre-tokenization: Optional stage of splitting up text into words based on whitespace (if applicable for that language) 80 | 3. Model: The tokenization algorithm that takes in text/a list of words and spits out a list of tokens 81 | 4. Post-processor: Adds special tokens like sequence separators, beginning- and end-of-sequence tokens, etc. 82 | 83 | An excellent visualization for the BERT tokenizer, from HuggingFace's NLP course: 84 | 85 | ![Alt text](tok_pipeline.png) 86 | 87 | # [Next Chapter](/2-bpe/) 88 | In the next chapter, we'll dive into the BPE algorithm, and train a simple BPE model from scratch. 89 | 90 | 91 | # Further reading 92 | - ChatGPT has Never Seen a SINGLE Word (Despite Reading Most of The Internet). 
Meet LLM Tokenizers: https://youtu.be/uSinkCeUg9U?si=5pVRed3GG0oP8B-b 93 | - Tokenizers chapter, The HuggingFace NLP course: https://huggingface.co/learn/nlp-course/chapter6/1 94 | - Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP : https://arxiv.org/abs/2112.10508 95 | - Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Unigram): 96 | https://arxiv.org/abs/1804.10959 97 | - Neural Machine Translation of Rare Words with Subword Units (BPE): https://arxiv.org/abs/1508.07909 98 | 99 | 100 | -------------------------------------------------------------------------------- /1-intro/tok_pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/1-intro/tok_pipeline.png -------------------------------------------------------------------------------- /2-bpe/README.md: -------------------------------------------------------------------------------- 1 | # Table of Contents 2 | 3 | 4 | 5 | - [Agenda](#agenda) 6 | - [Byte-Pair Encoding](#byte-pair-encoding) 7 | * [Why is subword tokenization so popular?](#why-is-subword-tokenization-so-popular) 8 | * [Training](#training) 9 | * [Test time](#test-time) 10 | - [Implementation](#implementation) 11 | - [Step into the walkthrough](#step-into-the-walkthrough) 12 | - [Next Chapter](#next-chapter) 13 | 14 | 15 | 16 | # Agenda 17 | In this chapter, we take a closer look at the Byte-Pair Encoding tokenization algorithm. We'll also go over a minimal implementation for training a BPE model. 18 | 19 | # Byte-Pair Encoding 20 | Byte-Pair Encoding (BPE) is perhaps the most popular tokenization algorithm right now, used by GPT, OPT, BLOOM, Llama, Falcon, etc. Byte-pair encoding/ digram coding is a _compression algorithm_ that comes from information theory, and was first proposed in 1994 (Web archive). The original BPE algorithm iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. (In the sense that a sequence which only contains bytes 00000000, 00000001 and 00000010 might get compressed by using bytes like 00000011). [Sennrich et al.](https://arxiv.org/abs/1508.07909) proposed to use BPE for tokenization, where you apply the algorithm to merge characters/ character sequences. Their work is now considered to be a breakthrough moment for subword tokenization, quoting Mielke _et al_. 21 | 22 | Let's now go over the training and the test time algorithm for BPE. The focus in this chapter will be on _training_ a BPE model. We'll dive deeper into the implementation for merging at test time when we implement a GPT2 tokenizer (almost) from scratch in [chapter-3](/3-hf-tokenizer/). 23 | 24 | 25 | ## Why is subword tokenization so popular? 26 | A quick digression. Why is subword tokenization is the dominant one? Let's revisit the downsides of character and word-based tokenization methods. A character-based tokenization algorithm has good generalization capabilities - note that we're referring to the tokenizer - in that the training corpus can be widely different from the test corpus, and you'll rarely have out-of-vocabulary issues (happens only when a new character comes along), and the _fertility_ (the number of tokens a word is split into) is pretty predictable. On the other hand, imagine the input to the neural network - these are the embeddings corresponding to each character. 
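To make this concrete, here's a tiny comparison of the same text as a character sequence vs. GPT-2's subword sequence (purely illustrative; the subword count depends on the learned vocabulary):

```
from transformers import AutoTokenizer

text = "Character embeddings carry very little word-level information"
char_tokens = list(text)  # character-level: one symbol per character
subword_tokens = AutoTokenizer.from_pretrained("gpt2").tokenize(text)
print(len(char_tokens), len(subword_tokens))  # the character sequence is several times longer
```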
The network has to learn the meanings for different words based on the sequence of character embeddings. Each embedding will likely contain very little word-specific information, because, well, the same character appears in a lot of words. Thus, a shortcoming is that this is not an informative _input representation_. On the other hand, you have word-based tokenization. The upside is that the neural network can learn a good embedding for each word that can summarize the meaning and the context in which the word appears accurately. However, you're stuck with dealing with a large vocabulary, because of all the different variations for each word. Some words are also naturally split into multiple meaningful sub-words in many languages. An example from Sennrich _et al._ is the German word "Abwasser|behandlungs|anlange", which means ("a sewage water treatment plant"). In such cases, it makes more sense to have a sequence of embeddings of the subwords to represent the word, instead of one vector. The fancy term for languages which extensively use such compounds is [agglutinative](https://en.wikipedia.org/wiki/Agglutinative_language) (Ex: Japanese, Turkish, etc). Thus, with this motivation, we want a subword tokenization algorithm because they can (1) represent unseen data atleast with a character-level tokenization (2) learn meaningful subwords that can be useful for the neural network you train. 27 | 28 | ## Training 29 | The steps for training a BPE model are as follows: 30 | 1. Extract a list of words from the training corpus. In Sennrich et al, this pre-tokenization step simply removed whitespace information. They also added an end-of-word token (like ``) for each word to get word boundaries. (A simple extension used in implementations like in Sentencepiece makes sure to preserve whitespace, etc) 31 | 2. Make a word counter dictionary with keys being words and values being frequencies in the training corpus. 32 | 3. Keep a vocabulary of symbols (variable length strings), initialized with unique characters present in the training corpus. 33 | 4. Initialize a list of _merge rules_ to be an empty. Each merge rule is a tuple of symbol/token to merge at test time. 34 | 4. Iteratively do the following: 35 | - Get the most frequent pair of symbols in the current vocabulary, by going over the word counter. (Ex: `("l", "e")`) 36 | - Merge the two symbols into a new symbol, and _add_ this to the vocabulary. This is like a new byte in the original BPE except we have variable length strings (_character n-gram_). 37 | - Add our tuple of symbols to the list of merge rules. 38 | - Replace all occurences of the pair of symbols with the new symbol. For example, a word, segmented as `("l", "e", "a", "r", "n")` becomes `("le", "a", "r", "n")`. 39 | - Repeat until you reach the target vocabulary size. (a hyperparameter) 40 | 41 | This is it! The full algorithm is pretty simple. 42 | 43 | ## Test time 44 | At test time, the algorithm is very similar to training, except you're doing lookups in the set of merge rules: 45 | 1. Perform character-level tokenization for input text. 46 | 2. Find all pairs of symbols/tokens in the current words. 47 | 3. Start merging pairs by going in order of merge rules: merges learnt earlier in the training process have higher priority, and are performed earlier. 48 | 4. Repeat until you can't merge anymore. 49 | 50 | 51 | # Implementation 52 | A minimal implementation for training a BPE tokenizer is in `orig_bpe.py`. 
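Before that, here's a compact sketch of the test-time merging loop described above. It assumes the learned merges are stored as a rank dictionary (lower rank = learned earlier = higher priority), which is how the GPT-2-style implementation in chapter 3 stores them; this is illustrative, not the exact code from this repo:

```
def bpe_encode(word, bpe_ranks):
    """Tokenize a single word, given merge rules as {(first, second): rank}."""
    symbols = list(word)  # 1. start from a character-level segmentation
    while len(symbols) > 1:
        pairs = set(zip(symbols, symbols[1:]))  # 2. all adjacent pairs
        best = min(pairs, key=lambda p: bpe_ranks.get(p, float("inf")))  # 3. highest-priority merge
        if best not in bpe_ranks:
            break  # 4. no learned merge applies anymore
        first, second = best
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == first and symbols[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols
```

Back to `orig_bpe.py`: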
The code is almost the same as the one in the original paper, with minor edits for clarity and for loading our own text corpus. To run the training on your machine, `cd` into the current directory and run 53 | ``` 54 | python orig_bpe.py 55 | ``` 56 | Now, as mentioned, we'd ideally like to keep whitespace information, but that is a detail that can be distracting while doing a minimal implementation. The BPE tokenizer implementated in [chapter-3](/3-hf-tokenizer/) will work with all special characters, so we'll ignore this detail for now. 57 | 58 | # Step into the walkthrough 59 | Head over to [walkthrough.ipynb](walkthrough.ipynb) for a simple guide to training a BPE model. This is a notebook version of the code in `orig_bpe.py`, and should be easier to digest. 60 | 61 | # [Next Chapter](/3-hf-tokenizer/) 62 | We'll take a close look at the Python implementation for a 🤗 tokenizer and implement a minimal version of GPT2's tokenizer ourselves! 63 | 64 | **References** 65 | - Neural Machine Translation of Rare Words with Subword Units (BPE): https://arxiv.org/abs/1508.07909 66 | - Lei Mao's BPE guide: https://leimao.github.io/blog/Byte-Pair-Encoding/ . The code here is also from the original paper. 67 | -------------------------------------------------------------------------------- /2-bpe/ex_corpus.txt: -------------------------------------------------------------------------------- 1 | Trying to learn about BPE 2 | I'm learning about byte-pair encoding 3 | My friend learnt that digram coding and byte pair encoding mean the same 4 | I love Jacques Cousteau -------------------------------------------------------------------------------- /2-bpe/openai_bpe_viz.py: -------------------------------------------------------------------------------- 1 | """ 2 | BPE Tokenizer Visualisation from OpenAI: https://github.com/openai/tiktoken 3 | """ 4 | from tiktoken._educational import * 5 | 6 | # Visualise how the GPT-4 encoder encodes text 7 | enc = SimpleBytePairEncoding.from_tiktoken("cl100k_base") 8 | a = enc.encode("encode this string pleeeease") 9 | print("Final tokens: ", [enc.decode([i]) for i in a]) # see the final tokens -------------------------------------------------------------------------------- /2-bpe/orig_bpe.py: -------------------------------------------------------------------------------- 1 | """ 2 | Minimal implementation of BPE (Byte Pair Encoding) 3 | Simple extension of the original code in "Neural Machine Translation of Rare Words with Subword Units" 4 | """ 5 | import re 6 | from collections import defaultdict 7 | 8 | EOW_TOKEN = '' 9 | def get_initial_words(filename): 10 | segmented_word_to_freq = defaultdict(int) # {"w o r d ": 1, ...} 11 | with open(filename, 'r', encoding='utf-8') as f: 12 | for line in f: 13 | words = line.strip().split() # words are split on whitespace; this is our pre-tokenization step 14 | for word in words: 15 | # Separate each character with spaces, to show tokens for clarity. 
"word" -> "w o r d " 16 | segmented_word = ' '.join(list(word)) + " " + EOW_TOKEN 17 | segmented_word_to_freq[segmented_word] += 1 18 | return segmented_word_to_freq 19 | 20 | def get_tokens(word_to_freq): 21 | tokens = defaultdict(int) 22 | for word in word_to_freq.keys(): 23 | for token in word.split(): 24 | if token not in tokens: 25 | tokens[token] = len(tokens) 26 | return tokens 27 | 28 | def get_stats(word_to_freq: dict): 29 | pairs = defaultdict(int) 30 | for word, freq in word_to_freq.items(): 31 | symbols = word.split() 32 | for i in range(len(symbols)-1): 33 | pairs[symbols[i],symbols[i+1]] += freq 34 | return pairs 35 | 36 | def merge_word_splits(pair, v_in: dict): 37 | v_out = {} 38 | bigram = re.escape(' '.join(pair)) 39 | p = re.compile(r'(?'\n", 55 | "def get_initial_words(filename):\n", 56 | " segmented_word_to_freq = defaultdict(int) # {\"w o r d \": 1, ...}\n", 57 | " with open(filename, 'r', encoding='utf-8') as f:\n", 58 | " for line in f:\n", 59 | " words = line.strip().split() # words are split on whitespace; this is our pre-tokenization step\n", 60 | " for word in words:\n", 61 | " # Separate each character with spaces, to show tokens for clarity. \"word\" -> \"w o r d \"\n", 62 | " segmented_word = ' '.join(list(word)) + \" \" + EOW_TOKEN\n", 63 | " segmented_word_to_freq[segmented_word] += 1\n", 64 | " return segmented_word_to_freq" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 3, 70 | "metadata": {}, 71 | "outputs": [ 72 | { 73 | "name": "stdout", 74 | "output_type": "stream", 75 | "text": [ 76 | "defaultdict(, {'T r y i n g ': 1, 't o ': 1, 'l e a r n ': 1, 'a b o u t ': 2, 'B P E ': 1, \"I ' m \": 1, 'l e a r n i n g ': 1, 'b y t e - p a i r ': 1, 'e n c o d i n g ': 2, 'M y ': 1, 'f r i e n d ': 1, 'l e a r n t ': 1, 't h a t ': 1, 'd i g r a m ': 1, 'c o d i n g ': 1, 'a n d ': 1, 'b y t e ': 1, 'p a i r ': 1, 'm e a n ': 1, 't h e ': 1, 's a m e ': 1, 'I ': 1, 'l o v e ': 1, 'J a c q u e s ': 1, 'C o u s t e a u ': 1})\n" 77 | ] 78 | } 79 | ], 80 | "source": [ 81 | "word_to_freq = get_initial_words('ex_corpus.txt')\n", 82 | "print(word_to_freq)" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "All the words are actually represented as a space-separated string of symbols/tokens (i.e the dictionary has _segmented_ words)- in this case, we start off with simple character-level tokenization, and add an end of word token as well to match the original BPE algorithm." 
90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "## Get initial vocabulary" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 4, 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "name": "stdout", 106 | "output_type": "stream", 107 | "text": [ 108 | "defaultdict(, {'T': 0, 'r': 1, 'y': 2, 'i': 3, 'n': 4, 'g': 5, '': 6, 't': 7, 'o': 8, 'l': 9, 'e': 10, 'a': 11, 'b': 12, 'u': 13, 'B': 14, 'P': 15, 'E': 16, 'I': 17, \"'\": 18, 'm': 19, '-': 20, 'p': 21, 'c': 22, 'd': 23, 'M': 24, 'f': 25, 'h': 26, 's': 27, 'v': 28, 'J': 29, 'q': 30, 'C': 31})\n" 109 | ] 110 | } 111 | ], 112 | "source": [ 113 | "def get_tokens(word_to_freq):\n", 114 | " tokens = defaultdict(int)\n", 115 | " for word in word_to_freq.keys():\n", 116 | " for token in word.split():\n", 117 | " if token not in tokens:\n", 118 | " tokens[token] = len(tokens)\n", 119 | " return tokens\n", 120 | "\n", 121 | "vocab = get_tokens(word_to_freq)\n", 122 | "print(vocab)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "Our vocabulary is a mapping from token to a unique token ID" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "## Revisit the training algorithm\n", 137 | "\n", 138 | "1. Extract a list of words from the training corpus. (**Done**)\n", 139 | "2. Make a word counter dictionary with keys being words and values being frequencies in the training corpus. (**Done**)\n", 140 | "3. Keep a vocabulary of symbols (variable length strings), initialized with unique characters present in the training corpus. (**Done**)\n", 141 | "\n", 142 | "We now proceed to the core part of the algorithm:\n", 143 | "\n", 144 | "4. Initialize a list of _merge rules_ to be an empty. Each merge rule is a tuple of symbol/token to merge at test time.\n", 145 | "4. Iteratively do the following:\n", 146 | " - Get the most frequent pair of symbols in the current vocabulary, by going over the word counter. (Ex: `(\"l\", \"e\")`)\n", 147 | " - Merge the two symbols into a new symbol, and _add_ this to the vocabulary. This is like a new byte in the original BPE except we have variable length strings (_character n-gram_). \n", 148 | " - Add our tuple of symbols to the list of merge rules.\n", 149 | " - Replace all occurences of the pair of symbols with the new symbol. For example, a word, segmented as `(\"l\", \"e\", \"a\", \"r\", \"n\")` becomes `(\"le\", \"a\", \"r\", \"n\")`. \n", 150 | " - Repeat until you reach the target vocabulary size. 
(a hyperparameter)" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "The training algorithm should look something like this:" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 5, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "# merges = []\n", 167 | "# num_merges = N\n", 168 | "# for i in range(num_merges):\n", 169 | "# pairs = \n", 170 | "# best_pair = max(pairs, key=pairs.get) -----> Do argmax for pairs, get the most frequent pair\n", 171 | "# word_to_freq = merge_word_splits(best_pair, word_to_freq) -----> Merge the most frequent pair and get the new words\n", 172 | "# new_token = ''.join(best_pair) -----> Get the new token\n", 173 | "# vocab[new_token] = len(vocab) -----> Add the new token to the vocab" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "The full algorithm is below:" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 6, 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "name": "stdout", 190 | "output_type": "stream", 191 | "text": [ 192 | "Initial vocab: defaultdict(, {'T': 0, 'r': 1, 'y': 2, 'i': 3, 'n': 4, 'g': 5, '': 6, 't': 7, 'o': 8, 'l': 9, 'e': 10, 'a': 11, 'b': 12, 'u': 13, 'B': 14, 'P': 15, 'E': 16, 'I': 17, \"'\": 18, 'm': 19, '-': 20, 'p': 21, 'c': 22, 'd': 23, 'M': 24, 'f': 25, 'h': 26, 's': 27, 'v': 28, 'J': 29, 'q': 30, 'C': 31})\n", 193 | "##################\n", 194 | "Iteration 1\n", 195 | "Best pair: ('i', 'n')\n", 196 | "New token: in\n", 197 | "All words: ['T r y in g ', 't o ', 'l e a r n ', 'a b o u t ', 'B P E ', \"I ' m \", 'l e a r n in g ', 'b y t e - p a i r ', 'e n c o d in g ', 'M y ', 'f r i e n d ', 'l e a r n t ', 't h a t ', 'd i g r a m ', 'c o d in g ', 'a n d ', 'b y t e ', 'p a i r ', 'm e a n ', 't h e ', 's a m e ', 'I ', 'l o v e ', 'J a c q u e s ', 'C o u s t e a u ']\n", 198 | "##################\n", 199 | "Iteration 2\n", 200 | "Best pair: ('in', 'g')\n", 201 | "New token: ing\n", 202 | "All words: ['T r y ing ', 't o ', 'l e a r n ', 'a b o u t ', 'B P E ', \"I ' m \", 'l e a r n ing ', 'b y t e - p a i r ', 'e n c o d ing ', 'M y ', 'f r i e n d ', 'l e a r n t ', 't h a t ', 'd i g r a m ', 'c o d ing ', 'a n d ', 'b y t e ', 'p a i r ', 'm e a n ', 't h e ', 's a m e ', 'I ', 'l o v e ', 'J a c q u e s ', 'C o u s t e a u ']\n", 203 | "##################\n", 204 | "Iteration 3\n", 205 | "Best pair: ('ing', '')\n", 206 | "New token: ing\n", 207 | "All words: ['T r y ing', 't o ', 'l e a r n ', 'a b o u t ', 'B P E ', \"I ' m \", 'l e a r n ing', 'b y t e - p a i r ', 'e n c o d ing', 'M y ', 'f r i e n d ', 'l e a r n t ', 't h a t ', 'd i g r a m ', 'c o d ing', 'a n d ', 'b y t e ', 'p a i r ', 'm e a n ', 't h e ', 's a m e ', 'I ', 'l o v e ', 'J a c q u e s ', 'C o u s t e a u ']\n", 208 | "##################\n", 209 | "Iteration 4\n", 210 | "Best pair: ('e', 'a')\n", 211 | "New token: ea\n", 212 | "All words: ['T r y ing', 't o ', 'l ea r n ', 'a b o u t ', 'B P E ', \"I ' m \", 'l ea r n ing', 'b y t e - p a i r ', 'e n c o d ing', 'M y ', 'f r i e n d ', 'l ea r n t ', 't h a t ', 'd i g r a m ', 'c o d ing', 'a n d ', 'b y t e ', 'p a i r ', 'm ea n ', 't h e ', 's a m e ', 'I ', 'l o v e ', 'J a c q u e s ', 'C o u s t ea u ']\n", 213 | "##################\n", 214 | "Iteration 5\n", 215 | "Best pair: ('t', '')\n", 216 | "New token: t\n", 217 | "All words: ['T r y ing', 't o ', 'l ea r n ', 'a b o u t', 'B P E ', \"I ' m \", 'l ea r n 
ing', 'b y t e - p a i r ', 'e n c o d ing', 'M y ', 'f r i e n d ', 'l ea r n t', 't h a t', 'd i g r a m ', 'c o d ing', 'a n d ', 'b y t e ', 'p a i r ', 'm ea n ', 't h e ', 's a m e ', 'I ', 'l o v e ', 'J a c q u e s ', 'C o u s t ea u ']\n", 218 | "##################\n", 219 | "Iteration 6 done\n", 220 | "Iteration 7 done\n", 221 | "Iteration 8 done\n", 222 | "Iteration 9 done\n", 223 | "Iteration 10 done\n", 224 | "Iteration 11 done\n", 225 | "Iteration 12 done\n", 226 | "Iteration 13 done\n", 227 | "Iteration 14 done\n", 228 | "Iteration 15 done\n", 229 | "Final vocab: defaultdict(, {'T': 0, 'r': 1, 'y': 2, 'i': 3, 'n': 4, 'g': 5, '': 6, 't': 7, 'o': 8, 'l': 9, 'e': 10, 'a': 11, 'b': 12, 'u': 13, 'B': 14, 'P': 15, 'E': 16, 'I': 17, \"'\": 18, 'm': 19, '-': 20, 'p': 21, 'c': 22, 'd': 23, 'M': 24, 'f': 25, 'h': 26, 's': 27, 'v': 28, 'J': 29, 'q': 30, 'C': 31, 'in': 32, 'ing': 33, 'ing': 34, 'ea': 35, 't': 36, 'e': 37, 'lea': 38, 'lear': 39, 'learn': 40, 'ou': 41, 'en': 42, 'co': 43, 'cod': 44, 'coding': 45, 'ab': 46})\n", 230 | "All merges: [('i', 'n'), ('in', 'g'), ('ing', ''), ('e', 'a'), ('t', ''), ('e', ''), ('l', 'ea'), ('lea', 'r'), ('lear', 'n'), ('o', 'u'), ('e', 'n'), ('c', 'o'), ('co', 'd'), ('cod', 'ing'), ('a', 'b')]\n" 231 | ] 232 | } 233 | ], 234 | "source": [ 235 | "import re\n", 236 | "def get_stats(word_to_freq: dict):\n", 237 | " pairs = defaultdict(int)\n", 238 | " for word, freq in word_to_freq.items():\n", 239 | " symbols = word.split()\n", 240 | " for i in range(len(symbols)-1):\n", 241 | " pairs[symbols[i],symbols[i+1]] += freq \n", 242 | " return pairs\n", 243 | "\n", 244 | "def merge_word_splits(pair, v_in: dict):\n", 245 | " v_out = {}\n", 246 | " bigram = re.escape(' '.join(pair))\n", 247 | " p = re.compile(r'(?', 't o ', 'learn ', 'ab ou t', 'B P E ', \"I ' m \", 'learn ing', 'b y t e - p a i r ', 'en coding', 'M y ', 'f r i en d ', 'learn t', 't h a t', 'd i g r a m ', 'coding', 'a n d ', 'b y t e', 'p a i r ', 'm ea n ', 't h e', 's a m e', 'I ', 'l o v e', 'J a c q u e s ', 'C ou s t ea u ']\n" 294 | ] 295 | } 296 | ], 297 | "source": [ 298 | "print(list(word_to_freq.keys()))" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "Observe what happens to the last line in the text: This is an out-of-place sentence compared to the above three. You're getting almost character-level tokenization for all the words here(\"love\" gets 4 tokens), while words that are repeated (\"learn\", \"encoding\") get represented with 1/2 tokens. This is the essence of BPE: _outliers_ do _not_ get _compressed_ as much. When you have a large training corpus, it's very rare to get \"outlier\" English sentences, but you can imagine other data domains (code for example) that wasn't represented well in the training corpus, to be segmentd into a lot of tokens." 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "# Test time\n", 313 | "Revisit the algorithm at test-time:\n", 314 | "1. Perform character-level tokenization for input text.\n", 315 | "2. Find all pairs of symbols/tokens in the current words.\n", 316 | "3. Start merging pairs by going in order of merge rules: merges learnt earlier in the training process have higher priority, and are performed earlier.\n", 317 | "4. Repeat until you can't merge anymore.\n", 318 | "\n", 319 | "We'll be going over this implementation for the BPE tokenizer in [chapter-3](../3-hf-tokenizer/README.md). 
For now, here's a nice colour-coded visualization of merges from OpenAI:" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 8, 325 | "metadata": {}, 326 | "outputs": [ 327 | { 328 | "name": "stdout", 329 | "output_type": "stream", 330 | "text": [ 331 | "\u001b[48;5;167me\u001b[48;5;179mn\u001b[48;5;185mc\u001b[48;5;77mo\u001b[48;5;80md\u001b[48;5;68me\u001b[0m\n", 332 | "\u001b[48;5;167men\u001b[48;5;185mc\u001b[48;5;77mo\u001b[48;5;80md\u001b[48;5;68me\u001b[0m\n", 333 | "\u001b[48;5;167men\u001b[48;5;185mc\u001b[48;5;77mod\u001b[48;5;68me\u001b[0m\n", 334 | "\u001b[48;5;167men\u001b[48;5;185mc\u001b[48;5;77mode\u001b[0m\n", 335 | "\u001b[48;5;167men\u001b[48;5;185mcode\u001b[0m\n", 336 | "\n", 337 | "\u001b[48;5;167m \u001b[48;5;179mt\u001b[48;5;185mh\u001b[48;5;77mi\u001b[48;5;80ms\u001b[0m\n", 338 | "\u001b[48;5;167m t\u001b[48;5;185mh\u001b[48;5;77mi\u001b[48;5;80ms\u001b[0m\n", 339 | "\u001b[48;5;167m t\u001b[48;5;185mh\u001b[48;5;77mis\u001b[0m\n", 340 | "\u001b[48;5;167m th\u001b[48;5;77mis\u001b[0m\n", 341 | "\u001b[48;5;167m this\u001b[0m\n", 342 | "\n", 343 | "\u001b[48;5;167m \u001b[48;5;179ms\u001b[48;5;185mt\u001b[48;5;77mr\u001b[48;5;80mi\u001b[48;5;68mn\u001b[48;5;134mg\u001b[0m\n", 344 | "\u001b[48;5;167m \u001b[48;5;179ms\u001b[48;5;185mt\u001b[48;5;77mr\u001b[48;5;80min\u001b[48;5;134mg\u001b[0m\n", 345 | "\u001b[48;5;167m s\u001b[48;5;185mt\u001b[48;5;77mr\u001b[48;5;80min\u001b[48;5;134mg\u001b[0m\n", 346 | "\u001b[48;5;167m s\u001b[48;5;185mt\u001b[48;5;77mr\u001b[48;5;80ming\u001b[0m\n", 347 | "\u001b[48;5;167m st\u001b[48;5;77mr\u001b[48;5;80ming\u001b[0m\n", 348 | "\u001b[48;5;167m str\u001b[48;5;80ming\u001b[0m\n", 349 | "\u001b[48;5;167m string\u001b[0m\n", 350 | "\n", 351 | "\u001b[48;5;167m \u001b[48;5;179mp\u001b[48;5;185ml\u001b[48;5;77me\u001b[48;5;80me\u001b[48;5;68me\u001b[48;5;134me\u001b[48;5;167ma\u001b[48;5;179ms\u001b[48;5;185me\u001b[0m\n", 352 | "\u001b[48;5;167m p\u001b[48;5;185ml\u001b[48;5;77me\u001b[48;5;80me\u001b[48;5;68me\u001b[48;5;134me\u001b[48;5;167ma\u001b[48;5;179ms\u001b[48;5;185me\u001b[0m\n", 353 | "\u001b[48;5;167m p\u001b[48;5;185ml\u001b[48;5;77me\u001b[48;5;80me\u001b[48;5;68me\u001b[48;5;134me\u001b[48;5;167mas\u001b[48;5;185me\u001b[0m\n", 354 | "\u001b[48;5;167m p\u001b[48;5;185mle\u001b[48;5;80me\u001b[48;5;68me\u001b[48;5;134me\u001b[48;5;167mas\u001b[48;5;185me\u001b[0m\n", 355 | "\u001b[48;5;167m p\u001b[48;5;185mle\u001b[48;5;80me\u001b[48;5;68me\u001b[48;5;134me\u001b[48;5;167mase\u001b[0m\n", 356 | "\u001b[48;5;167m p\u001b[48;5;185mle\u001b[48;5;80mee\u001b[48;5;134me\u001b[48;5;167mase\u001b[0m\n", 357 | "\u001b[48;5;167m ple\u001b[48;5;80mee\u001b[48;5;134me\u001b[48;5;167mase\u001b[0m\n", 358 | "\n", 359 | "Final tokens: ['en', 'code', ' this', ' string', ' ple', 'ee', 'e', 'ase']\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "from tiktoken._educational import SimpleBytePairEncoding\n", 365 | "\n", 366 | "# Visualise how the GPT-2 encoder encodes text\n", 367 | "enc = SimpleBytePairEncoding.from_tiktoken(\"gpt2\")\n", 368 | "a = enc.encode(\"encode this string pleeeease\")\n", 369 | "print(\"Final tokens: \", [enc.decode([i]) for i in a]) # see the final tokens" 370 | ] 371 | } 372 | ], 373 | "metadata": { 374 | "kernelspec": { 375 | "display_name": "huggingface", 376 | "language": "python", 377 | "name": "python3" 378 | }, 379 | "language_info": { 380 | "codemirror_mode": { 381 | "name": "ipython", 382 | "version": 3 383 | }, 384 | "file_extension": ".py", 385 | "mimetype": 
"text/x-python", 386 | "name": "python", 387 | "nbconvert_exporter": "python", 388 | "pygments_lexer": "ipython3", 389 | "version": "3.11.5" 390 | }, 391 | "orig_nbformat": 4 392 | }, 393 | "nbformat": 4, 394 | "nbformat_minor": 2 395 | } 396 | -------------------------------------------------------------------------------- /3-hf-tokenizer/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | - [Agenda](#agenda) 6 | - [Diving into the HuggingFace tokenizer](#diving-into-the-huggingface-tokenizer) 7 | * [What makes up a HuggingFace tokenizer?](#what-makes-up-a-huggingface-tokenizer) 8 | + [BPE Tokenizer](#bpe-tokenizer) 9 | + [WordPiece tokenizer](#wordpiece-tokenizer) 10 | * [Data Structures and Methods](#data-structures-and-methods) 11 | + [`__call__`](#__call__) 12 | + [`decode`](#decode) 13 | + [`add_tokens`](#add_tokens) 14 | - [A minimal implementation](#a-minimal-implementation) 15 | - [Step-by-step walkthrough](#step-by-step-walkthrough) 16 | - [Next Chapter](#next-chapter) 17 | 18 | 19 | 20 | # Agenda 21 | The internals of HuggingFace tokenizers! We look at state (what's saved by a tokenizer), data structures (how does it store what it saves), and methods (what functionality do you get). We also implement a minimal <200 line version of the 🤗 Tokenizer in Python for GPT2. 22 | 23 | # Diving into the HuggingFace tokenizer 24 | ## What makes up a HuggingFace tokenizer? 25 | Well, let's first think about state: what information does a tokenizer need to save? 26 | Before we dive in, it's helpful to - you won't believe this - actually check out the saved tokenizers for different models. For example, here's the [GPT-2 Tokenizer](https://huggingface.co/SumanthRH/gpt2-tokenizer/tree/main). This is actually saved in the older format, so we can take a look at, for example, [Falcon's tokenizer](https://huggingface.co/SumanthRH/falcon-tokenizer/tree/main). Make sure to scroll through the large `tokenizer.json` files to get an idea for what's in there. 27 | ### BPE Tokenizer 28 | Let's consider a BPE tokenizer. In HuggingFace, you can save a tokenizer by calling the `save_pretained` method. Typically, you will see the following files for a BPE tokenizer: 29 | - [DEPR] `added_tokens.json`: Part of the older format for saving HF tokenizers. A little hard to figure out what this is for, since we have an "added_tokens" entry in the tokenizer.json file itself. Further, this doesn't actually have all the [AddedTokens](https://huggingface.co/docs/tokenizers/api/added-tokens) of your tokenizer (this inc. special tokens for some tokenizers like DeBERTa, Llama). 30 | - [DEPR] `merges.txt` : Saved in the older format for BPE tokenizers. Contains a list of BPE merge rules to be used while encoding a text sequence. 31 | - `special_tokens_map.json`: A dictionary of special token attribute names ("bos_token", etc) and their values ("\") and some metadata. What makes special tokens so special? These are commonly used tokens that are not a part of the corpus but have certain important designations (BOS- beginning of sequence, EOS-end of sequence, etc). All of these special tokens are accesible as attributes of the tokenizer directly i.e you can call `tokenizer.eos_token` for any HF tokenizer, since they all subclass the [`SpecialTokensMixin`](https://github.com/huggingface/transformers/blob/ced9fd86f55ebb6b656c273f6e23f8ba50652f83/src/transformers/tokenization_utils_base.py#L795) class. 
Maintaining such additional information is a good idea for obvious reasons- none of these are actually a part of your training corpus. You'd also want to add certain special tokens when you encode a piece of text by default (EOS or BOS+EOS, etc). This is the postprocessing step, covered in [chapter-6](/6-postprocessing-and-more/). In 🤗Tokenizers, you can also add `additional_special_tokens`, which can be tokens you use in the model's prompt templates (like `[INSTR]`, etc). 32 | - `tokenizer_config.json` : Some tokenizer specific config parameters such as max sequence length the model was trained on (`model_max_length`), some information on special tokens, etc. 33 | - `tokenizer.json`: Some notable entries: 34 | - `add_bos_token`: State for whether to add BOS token by default when you call the tokenizer. Caveats on this later. 35 | - `added_tokens`: a list of new tokens added via `tokenizer.add_tokens`/ tokens in `additional_special_tokens`. When you call `tokenizer.add_tokens`, the new token added is, by default, maintained as an [AddedToken](https://huggingface.co/docs/tokenizers/api/added-tokens) object, and not just a string. The difference is that an `AddedToken` can have special behaviour - you might match both ` ` and `` (note the left whitespace) to be the same token, specify whether the token should be matched in a normalized version of the text, etc. 36 | - `model`: Information about the tokenizer architecture/ algorithm ("type" -> "BPE" for ex). Also includes the vocabulary (mapping tokens -> token ids), and additional state such as merge rules for BPE. Each merge rule is really just a tuple of tokens to merge. 🤗 stores this tuple as one string, space-separated ex: "i am". 37 | - `normalizer`: Normalizer to use before segmentation. `null` for GPT2 and Falcon. 38 | - [DEPR] `vocab.json` : Saved in the older format. Contains a dictionary mapping tokens to token ids. This information is now stored in `tokenizer.json`. 39 | 40 | ### WordPiece tokenizer 41 | `raise NotImplementedError` 42 | 43 | ## Data Structures and Methods 44 | Let's take a look at how a HF tokenizer stores the vocabulary, added tokens, etc along with the different functionality it provides (as always, these are tightly coupled). For simplicity, I am only going to look into the slow tokenizers, implemented in Python, as opposed to the fast tokenizers implemented in Rust, as I basically haven't learnt Rust yet (my apologies to the Cargo cult). Here's what the initialization looks like: 45 | 46 | ![HF Slow tokenizer](hf_slow.png) 47 | 48 | So are all the tokens stored in a [prefix tree/ Trie](https://en.wikipedia.org/wiki/Trie)? No! This is only for `added_tokens`. For example, with GPT2, this trie will only store one token by default: `<|endoftext|>`. For some custom tokenizers like ByT5, the number of added tokens is in the hundreds, and so using a Trie makes a difference. This becomes useful when you are customizing your tokenizer by adding new tokens with the `tokenizer.add_tokens` method. ([Reference](https://github.com/huggingface/transformers/pull/13220)). The `added_tokens` Trie has two methods: 49 | - `trie.add(word)` : Adds a word to the prefix tree. 50 | - `trie.split(text)`: Splits a string into chunks, separated at the boundaries of tokens in the trie. 51 | Ex: `This is <|myspecialtoken|>` -> `["This is ", "<|myspecialtoken|>"]` 52 | 53 | To look at the other attributes/data structures stored, we'd need to move away from the parent class and actually go to the model-specific tokenizer. 
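As a quick aside, you can play with this trie directly; it is importable from `transformers` (the same import used by `minimal_hf_tok.py` later in this chapter). The second token below is made up for illustration:

```
from transformers.tokenization_utils import Trie

trie = Trie()
trie.add("<|endoftext|>")
trie.add("<|myspecialtoken|>")  # hypothetical added token
# split() cuts the text at added-token boundaries, keeping the tokens as separate chunks
print(trie.split("Hello<|myspecialtoken|> world<|endoftext|>"))
# expected: something like ['Hello', '<|myspecialtoken|>', ' world', '<|endoftext|>']
```

With that aside done, back to the model-specific tokenizer.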
Here, this is `GPT2Tokenizer`. Some of the attributes are: 54 | - `encoder` - Vocabulary, keeping token -> token_id mappings 55 | - `decoder` - Inverse of the `encoder`, keeping token_id -> token mappings 56 | - `bpe_ranks` - Mapping between merge rule `token_1 token_2` and priority/rank. Merges which happened earlier in training have a lower rank, and thus higher priority i.e these merges should happen earlier than later while tokenizing a string. 57 | 58 | There are some more details here, but left for later. Let's quickly go over the summary for important methods first. 59 | 60 | ### `__call__` 61 | Okay, so what happens when you do call `tokenizer(text)`? An example with `gpt2`: 62 | ``` 63 | tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False) # get the slow tokenizer 64 | print(tokenizer("The slow tokenizer")) # Output: {'input_ids': [464, 3105, 11241, 7509], 'attention_mask': [1, 1, 1, 1]} 65 | ``` 66 | You can see that the result is in fact a dictionary. `input_ids` are the token ids for the input sequence. If you decode the above sequence to get the actual tokens, you get `['The', ' slow', ' token', 'izer']`. Let's look at what happens inside the `__call__` method to get this result. The slow tokenizer class `PreTrainedTokenizer` derives the `__call__` method from the parent class `PreTrainedTokenizerBase`, in which [`__call__`](https://github.com/huggingface/transformers/blob/25b0f2033ba23e354ef2f665764248fcbb3f49ba/src/transformers/tokenization_utils_base.py#L2729) basically parses input arguments to make a call to the `encode_plus` function. HuggingFace tokenizers have two methods for encoding: `.encode()`, which gives you just a list of input_ids, and `encode_plus()`, which returns a dictionary with some additional information (`attention_mask`, `token_type_ids` to [mark sequence boundaries](https://huggingface.co/docs/transformers/glossary#token-type-ids), etc). The `encode_plus` implementation for the slow tokenizer (in reality, this is `_encode_plus`) is as follows(This is): 67 | 1. Normalize and pre-tokenize input text. With GPT2, the pre-tokenization involves breaking up the text on whitespace, contractions, punctuations, etc. 68 | 2. Tokenize input string/strings to get a list of tokens for each input string. This is handled by the `.tokenize()` method. (_Segmentation_) 69 | 3. Convert tokens to token ids using `.convert_tokens_to_ids()` method. (_Numericalization_) 70 | 4. Send in the token ids and other kwargs to `.prepare_for_model()`, which finally returns a dictionary with `attention_mask` and other keys if needed. 71 | 72 | This is the simple explanation. There's one important detail though: When you have `added_tokens` or special tokens, there are no merge rules for these tokens! And you can't make up ad-hoc merge rules without messing up the tokenization of other strings. So, we need to handle this in the pre-tokenization step - Along with splitting on whitespace, punctuations, etc, we will also split at the boundaries of `added_tokens`. 73 | 74 | ### `decode` 75 | When you run `tok.decode(token_ids)`, there are three operations: 76 | 1. Convert ids to tokens using the `id_to_token` mapping from `tok.bpe`. 77 | 2. Join all the tokens 78 | 3. Replace unicode symbols with normal characters 79 | 80 | ### `add_tokens` 81 | Another important feature is that you can add new tokens to your tokenizer. This needs to be handled carefully, as these tokens are not learned bottom up during training. 
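As a quick illustration with the real GPT-2 tokenizer (the token name here is invented for the example):

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
num_added = tokenizer.add_tokens(["<|myspecialtoken|>"])  # hypothetical new token
print(num_added)  # 1
print(tokenizer.tokenize("Hi<|myspecialtoken|> there"))
# the added token comes out as a single piece, never split by BPE
# if you pair this with a model, remember to call model.resize_token_embeddings(len(tokenizer))
```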
We'll look at how exactly this works with our minimal implementation below. 82 | 83 | # A minimal implementation 84 | This folder contains two `.py` files: 85 | - `bpe.py`: Implements a simple `BPE` class that tokenizes a string according to GPT-2's byte-level BPE algorithm (a simple change to standard BPE). 86 | - `minimal_hf_tok.py`: Implements `MySlowTokenizer`, <100 line implementation for the basic features of HuggingFace's `GPT2Tokenizer` (the slow version). 87 | 88 | # Step-by-step walkthrough 89 | Head over to [walkthrough.ipynb](/3-hf-tokenizer/walkthrough.ipynb) for details on: 90 | - Implementing the merging algorithm for `BPE` 91 | - Implementing the different methods for encoding, decoding, added tokens etc. in `MySlowTokenizer` to match `GPT2Tokenizer`. 92 | 93 | # [Next Chapter](/4-tokenization-is-hard/) 94 | We'll be going over the challenges with tokenizing different types of data - numbers, other languages, etc. 95 | 96 | -------------------------------------------------------------------------------- /3-hf-tokenizer/bpe.py: -------------------------------------------------------------------------------- 1 | """ 2 | A simple BPE tokenizer implementation 3 | """ 4 | import json 5 | from typing import Any 6 | import regex as re # regex is cooler than re 7 | import warnings 8 | import os 9 | from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode 10 | 11 | # a hacky custom warning formatter to avoid full path being shown 12 | def custom_formatwarning(msg, category, filename, lineno, line=None): 13 | filename = os.path.basename(filename) 14 | # Format the warning message 15 | return f'{filename}:{lineno}: {category.__name__}: {msg}\n' 16 | 17 | warnings.formatwarning = custom_formatwarning 18 | 19 | def get_pairs(word): 20 | """ 21 | Return set of symbol pairs in a word. 22 | 23 | Word is represented as tuple of symbols (symbols being variable-length strings). 24 | Reference: HF's GPT-2 tokenizer 25 | """ 26 | pairs = set() 27 | prev_char = word[0] 28 | for char in word[1:]: 29 | pairs.add((prev_char, char)) 30 | prev_char = char 31 | return pairs 32 | 33 | class BPE: 34 | def __init__(self, vocab_file: str): 35 | self.token_to_id = {} # vocab. Called `encoder` in `GPT2Tokenizer` 36 | self.id_to_token = {} # called `decoder` in `GPT2Tokenizer` 37 | self.merges = [] 38 | self.bpe_ranks = dict() 39 | self.byte_encoder = bytes_to_unicode() # maps bytes to unicode strings 40 | self.byte_decoder = {v: k for k, v in self.byte_encoder.items()} 41 | self.load_vocab(vocab_file) 42 | 43 | def load_vocab(self, vocab_file: str): 44 | with open(vocab_file, 'r') as f: 45 | vocab_data = json.load(f) 46 | self.token_to_id = vocab_data["vocab"] 47 | self.id_to_token = {v: k for k, v in self.token_to_id.items()} 48 | self.merges = vocab_data["merges"] 49 | for i, merge in enumerate(self.merges): 50 | pair = tuple(merge.split()) # merge is repr as "a b". Works because we split on whitespace in pre-tok step 51 | # i is the index of the merge in the merges list, also the rank. lower rank means merge happens earlier 52 | self.bpe_ranks[pair] = i 53 | 54 | def __call__(self, word: str, dont_byte_encode: bool = False) -> Any: 55 | if " " in word and not dont_byte_encode: 56 | warnings.warn("Word contains whitespaces. 
Encoding to unicode strings...") 57 | word = "".join([self.byte_encoder[b] for b in word.encode("utf-8")]) 58 | pairs = get_pairs(word) # "obobc" -> set([("o", "b"), ("b", "o"), ("b", "c")]) 59 | while True: 60 | # get pair of chars/tokens with lowest rank and merge 61 | bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf"))) 62 | if bigram not in self.bpe_ranks: 63 | break 64 | first, second = bigram 65 | new_word = [] 66 | i = 0 67 | while i < len(word): 68 | try: 69 | j = word.index(first, i) # find index of occurence of `first` in word[i:] 70 | except ValueError: 71 | new_word.extend(word[i:]) 72 | break 73 | else: 74 | new_word.extend(word[i:j]) 75 | i = j 76 | 77 | if word[i] == first and i < len(word) - 1 and word[i + 1] == second: 78 | new_word.append(first + second) 79 | i += 2 80 | else: 81 | new_word.append(word[i]) 82 | i += 1 83 | new_word = tuple(new_word) 84 | word = new_word 85 | if len(word) == 1: 86 | break 87 | else: 88 | pairs = get_pairs(word) 89 | word = " ".join(word) 90 | return word 91 | 92 | def __repr__(self) -> str: 93 | return f"BPE(vocab_size={len(self.token_to_id)})" 94 | 95 | def add_token(self, token: str): 96 | if token in self.token_to_id: 97 | raise ValueError(f"Token {token} already in vocabulary.") 98 | self.token_to_id[token] = len(self.token_to_id) 99 | self.id_to_token[len(self.id_to_token)] = token 100 | 101 | 102 | if __name__ == "__main__": 103 | 104 | bpe = BPE("vocab.json") 105 | from transformers import AutoTokenizer 106 | gpt2 = AutoTokenizer.from_pretrained("gpt2", use_fast=False) 107 | gpt2.encode(" worda") 108 | import pdb; pdb.set_trace() -------------------------------------------------------------------------------- /3-hf-tokenizer/hf_slow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/3-hf-tokenizer/hf_slow.png -------------------------------------------------------------------------------- /3-hf-tokenizer/minimal_hf_tok.py: -------------------------------------------------------------------------------- 1 | from transformers.tokenization_utils import Trie 2 | from transformers import AutoTokenizer 3 | import json 4 | from typing import Dict, Tuple, Union, List, Any 5 | import regex as re # regex is cooler than re 6 | from bpe import BPE 7 | 8 | 9 | EOS_TOKEN = "<|endoftext|>" 10 | class MyTrie(Trie): 11 | """ 12 | HF's Trie implementation for added tokens. Shown here with minor changes for clarity. 
13 | """ 14 | def __init__(self, *args, **kwargs): 15 | super().__init__(*args, **kwargs) 16 | # self.data = {} - our trie graph, stored as a dict of dicts 17 | # self._tokens = set() - set of tokens in the trie 18 | 19 | def add(self, word: str): 20 | """ 21 | Adds a word to the trie 22 | """ 23 | return super().add(word) 24 | 25 | def split(self, word: str): 26 | """ 27 | Splits a word into chunks based on the trie 28 | """ 29 | return super().split(word) 30 | 31 | def __repr__(self) -> str: 32 | # format data dict into a json 33 | return json.dumps(self.data, indent=4) 34 | 35 | 36 | class MySlowTokenizer: 37 | """ 38 | A minimal implementation of HF's slow tokenizer, based on GPT2's tokenizer 39 | References: 40 | https://github.com/huggingface/transformers/blob/8aca43bdb3cb9a5020f6d57589d85679dc873b1c/src/transformers/models/gpt2/tokenization_gpt2.py 41 | """ 42 | def __init__(self, init_vocab_file: str = None): 43 | self.added_tokens_trie = MyTrie() # trie for added tokens only 44 | self.bpe = BPE(init_vocab_file) 45 | self.vocab = self.bpe.token_to_id # nice to have vocab accessible here 46 | self.byte_encoder = self.bpe.byte_encoder 47 | self.byte_decoder = {v: k for k, v in self.byte_encoder.items()} 48 | self.unk_token = EOS_TOKEN 49 | 50 | # Regex for pre-tokenization - breaking up a piece of text into words by splitting at whitespaces, contractions, etc. Borrowed from GPT-2 51 | self.pattern_for_splitting = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""") 52 | self._load_added_tokens() 53 | 54 | def _load_added_tokens(self): 55 | # loads added tokens from json and adds them to the trie 56 | # Hard coded for GPT2 demonstration 57 | self.added_tokens_trie.add(EOS_TOKEN) 58 | 59 | def __call__(self, *args: Any, **kwargs: Any) -> Any: 60 | self.encode(*args, **kwargs) 61 | 62 | def encode(self, text: str, **kwargs: Any) -> Any: 63 | text, kwargs = self.prepare_for_tokenization(text, **kwargs) 64 | 65 | # 1. Split text into chunks at the boundaries of added_tokens. Can be thought of as a pre-tokenization step. 66 | # "This isn't<|endoftext|> what you think" -> ["This isn't", "<|endoftext|>", " what you think"] 67 | chunks = self.added_tokens_trie.split(text) 68 | bpe_tokens = [] 69 | for chunk in chunks: 70 | if chunk in self.added_tokens_trie._tokens: 71 | # if chunk is an added token, directly add it to bpe_tokens 72 | bpe_tokens.append(chunk) 73 | else: 74 | # 2. Tokenize each chunk 75 | tokens = self._tokenize(chunk) 76 | bpe_tokens.extend(tokens) 77 | # 3. Convert tokens to ids 78 | bpe_tokens = [self.convert_token_to_id(token) for token in bpe_tokens] 79 | return bpe_tokens 80 | 81 | def decode(self, ids: List[int], **kwargs: Any) -> str: 82 | # 1. Convert ids to tokens 83 | tokens = [self.convert_id_to_token(id_) for id_ in ids] 84 | text = "".join(tokens) # join tokens 85 | # replace unicode symbols with normal characters 86 | text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8") 87 | return text 88 | 89 | def _tokenize(self, text: str) -> List[str]: 90 | all_tokens = [] 91 | # Pre-tokenization: split text into words based on regex. "This isn't" -> ["This", " isn", "'t"] 92 | words = self.pre_tokenize(text) 93 | for word in words: 94 | # Unicode string encoding. 
" isn" -> bytes object -> "Ġisn" 95 | word = "".join([self.byte_encoder[b] for b in word.encode("utf-8")]) 96 | tokens = self.bpe(word, dont_byte_encode=True).split(" ") # we already encoded the chunk to unicode strings 97 | all_tokens.extend(tokens) 98 | return all_tokens 99 | 100 | def pre_tokenize(self, text: str) -> List[str]: 101 | return self.pattern_for_splitting.findall(text) 102 | 103 | def prepare_for_tokenization(self, text: str, **kwargs: Any) -> Tuple[str, Dict[str, Any]]: 104 | """ 105 | In HF, this method performs any pre-processing needed before tokenization. Dummy function for now. 106 | """ 107 | # returns text and kwargs 108 | return (text, kwargs) 109 | 110 | def to_dict(self): 111 | dict1 = {} 112 | dict1["model"]["type"] = "BPE" 113 | dict1["model"]["vocab"] = self.vocab 114 | dict1["model"]["merges"] = self.bpe.merges 115 | dict1["added_tokens"] = list(self.added_tokens_trie._tokens) 116 | dict1["special_tokens_map"] = {"unk_token": self.unk_token} 117 | return dict1 118 | 119 | def add_tokens(self, new_tokens: Union[str, List[str]]): 120 | """ 121 | Adds new tokens to the tokenizer. 122 | """ 123 | if isinstance(new_tokens, str): 124 | new_tokens = [new_tokens] 125 | for token in new_tokens: 126 | self.bpe.add_token(token) # add to vocab first 127 | self.added_tokens_trie.add(token) 128 | print(f"Added {token} to the vocabulary.") 129 | 130 | def convert_token_to_id(self, token: str) -> int: 131 | """ 132 | Converts a token to its id. Returns unk token id if token is not in vocab. 133 | Fancy word: Numericalization 134 | """ 135 | return self.vocab.get(token, self.vocab[self.unk_token]) 136 | 137 | def convert_id_to_token(self, index: int) -> str: 138 | """ 139 | Converts an id to its token. Returns unk token if id is not in vocab. 
140 | """ 141 | return self.bpe.id_to_token.get(index, self.unk_token) 142 | 143 | def __repr__(self) -> str: 144 | string = "MySlowTokenizer(" 145 | string += f"vocab_size={self.nvocab}, unk_token={self.unk_token}, added_tokens={str(self.added_tokens_trie._tokens)}, " 146 | string += f"bpe={str(self.bpe)})" 147 | return string 148 | 149 | # make vocab_size a property because it should change with added tokens 150 | @property 151 | def nvocab(self) -> int: 152 | return len(self.vocab) 153 | 154 | if __name__ == "__main__": 155 | input_text = "This isn't<|myspecialtoken|> that simple\n\t" 156 | new_token = "<|myspecialtoken|>" 157 | my_tokenizer = MySlowTokenizer("vocab.json") 158 | gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False) 159 | my_tokenizer.add_tokens(new_token) 160 | gpt2_tokenizer.add_tokens(new_token) 161 | my_enc = my_tokenizer.encode(input_text) 162 | gpt2_enc = gpt2_tokenizer.encode(input_text) 163 | 164 | print("Input text:", input_text) 165 | print("MySlowTokenizer encoding:", my_enc) 166 | print("GPT2Tokenizer encoding:", gpt2_enc) 167 | 168 | print("MySlowTokenizer decoding:", my_tokenizer.decode(my_enc)) 169 | print("GPT2Tokenizer decoding:", gpt2_tokenizer.decode(gpt2_enc)) 170 | -------------------------------------------------------------------------------- /3-hf-tokenizer/save_hf.py: -------------------------------------------------------------------------------- 1 | # Simple script to save gpt2, BERT and Llama tokenizers 2 | from transformers import AutoTokenizer 3 | 4 | gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2") 5 | 6 | bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") 7 | 8 | llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf") 9 | 10 | gpt2_tokenizer.save_pretrained("gpt2") 11 | bert_tokenizer.save_pretrained("bert-base-uncased") 12 | llama_tokenizer.save_pretrained("meta-llama/Llama-2-7b-hf") -------------------------------------------------------------------------------- /3-hf-tokenizer/walkthrough.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# A simple walkthrough for a minimal HuggingFace Tokenizer implementation" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "For the jupyter notebook fans out there, this is a scrappy walkthrough for all the different parts in the implementation for the `BPE` and the `MySlowTokenizer` classes implemented here. If you haven't gone through the chapter README yet, please do! The gist is that HuggingFace's [PreTokenizer class](https://huggingface.co/docs/transformers/main_classes/tokenizer) implements a \"slow\" tokenizer in python. `PreTokenizer` inherits from `PreTokenizerBase`, and a `SpecialTokensMixin` class. Thus, if you're trying to understand the slow tokenizer for GPT-2, you have a whole class genealogy to figure out what is implemented where:\n", 15 | "```\n", 16 | "PreTokenizerBase SpecialTokensMixin\n", 17 | " \\ /\n", 18 | " \\ /\n", 19 | " PreTrainedTokenizer\n", 20 | " |\n", 21 | " |\n", 22 | " GPT2Tokenizer\n", 23 | "```\n", 24 | "This is not pretty to navigate. So, I've tried to make things simple with just 2 classes:\n", 25 | "\n", 26 | "1. `BPE` : Meant to replicate what the BPE algorithm does in GPT-2's tokenizer. This is not a complete tokenizer in itself, it's got some weird kinks (with unicode symbols, etc) that we'll see soon.\n", 27 | "2. 
`MySlowTokenizer` : This is meant to a minimal implementation that can match all the basic features for HuggingFace's GPT2 tokenizer (the slow version).\n", 28 | "\n", 29 | "We're going to completely ignore the special tokens handling for now, as this is related to postprocessing that is in fact easier to understand later. \n" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "## Get GPT-2 Vocab\n", 37 | "We'll be using the vocabulary from the GPT-2 tokenizer. " 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 1, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "from transformers import AutoTokenizer\n", 47 | "import json\n", 48 | "tokenizer = AutoTokenizer.from_pretrained(\"gpt2\")\n", 49 | "tokenizer_json = json.loads(tokenizer._tokenizer.to_str()) # get tokenizer.json state\n", 50 | "with open(\"vocab.json\", \"w\") as f:\n", 51 | " json.dump(tokenizer_json[\"model\"], f) # just need the model details" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "# BPE\n", 59 | "Let's see our implementation in action first." 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 2, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "from bpe import BPE \n", 69 | "bpe = BPE(\"vocab.json\") # make sure you're in the current directory of the notebook" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "## The `__call__` method\n", 77 | "Let's see what the `__call__` method does" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 3, 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "name": "stderr", 87 | "output_type": "stream", 88 | "text": [ 89 | "bpe.py:56: UserWarning: Word contains whitespaces. Encoding to unicode strings...\n" 90 | ] 91 | }, 92 | { 93 | "data": { 94 | "text/plain": [ 95 | "'Ġword aaa'" 96 | ] 97 | }, 98 | "execution_count": 3, 99 | "metadata": {}, 100 | "output_type": "execute_result" 101 | } 102 | ], 103 | "source": [ 104 | "bpe(\" wordaaa\")" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "The output for \" wordaaa\" from the BPE tokenizer is the strange string \"Ġword aaa\". Well, here's what it does:\n", 112 | "1. Converts input string into bytes and encodes each byte with a unicode symbol. Specifically, a space \" \" becomes \"Ġ\". This is how GPT-2 does it. \n", 113 | "2. Tokenizes the string into characters \n", 114 | "3. Applies the BPE merge algorithm to iteratively combine intermediate tokens until you can't reduce it further.\n", 115 | "4. 
Join the list of tokens with a whitespace and return the resultant string\n", 116 | "\n", 117 | "Before we dive into the implementation for the merge algorithm, let's see what the list of tokens are when you use GPT2's tokenizer from HuggingFace" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 4, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "name": "stdout", 127 | "output_type": "stream", 128 | "text": [ 129 | "HF's tokens: [' word', 'aaa']\n", 130 | "My tokens: ['Ġword', 'aaa']\n" 131 | ] 132 | } 133 | ], 134 | "source": [ 135 | "from transformers import AutoTokenizer\n", 136 | "gpt2_tokenizer = AutoTokenizer.from_pretrained(\"gpt2\", use_fast=False)\n", 137 | "text = \" wordaaa\"\n", 138 | "hf_tokens = gpt2_tokenizer.batch_decode(gpt2_tokenizer.encode(text))\n", 139 | "my_tokens = bpe(text).split(\" \")\n", 140 | "print(\"HF's tokens: \", hf_tokens)\n", 141 | "print(\"My tokens: \", my_tokens)" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "The two outputs should infact be exactly the same, except for special characters (whitespace, etc) which we'll not get into later. (these two classes aren't exactly comparable as I mentioned, but it's good to see that this work)" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "## The algorithm\n", 156 | "Here's the complete merge algorithm with bpe, showing as a standalone function (with unicode symbols decoded)" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 5, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "from bpe import get_pairs\n", 166 | "from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode\n", 167 | "\n", 168 | "byte_encoder = bytes_to_unicode()\n", 169 | "byte_decoder = {v: k for k, v in byte_encoder.items()}\n", 170 | "bpe_ranks = bpe.bpe_ranks\n", 171 | "\n", 172 | "def split_into_tokens(word: str):\n", 173 | " word = \"\".join([byte_encoder[b] for b in word.encode(\"utf-8\")])\n", 174 | " pairs = get_pairs(word) # \"obobc\" -> set([(\"o\", \"b\"), (\"b\", \"o\"), (\"b\", \"c\")])\n", 175 | " iter = 0\n", 176 | " while True:\n", 177 | " # get pair of chars/tokens with lowest rank and merge\n", 178 | " bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float(\"inf\")))\n", 179 | " if bigram not in bpe_ranks:\n", 180 | " break # no more mergeable pairs\n", 181 | " first, second = bigram\n", 182 | " print(f\"{iter=} : {first=} {second=}\")\n", 183 | " new_word = []\n", 184 | " i = 0\n", 185 | " while i < len(word):\n", 186 | " try:\n", 187 | " j = word.index(first, i) # find index of occurence of `first` token in word[i:]\n", 188 | " except ValueError:\n", 189 | " print(f\"\\t -> {iter=} : {first=} not found in {word[i:]=}\")\n", 190 | " new_word.extend(word[i:]) \n", 191 | " break\n", 192 | " else:\n", 193 | " print(f\"\\t -> {iter=} : Found {first=} in {word[i:]=}. 
Skipping previous tokens {word[i:j]}\")\n", 194 | " new_word.extend(word[i:j])\n", 195 | " i = j\n", 196 | "\n", 197 | " if word[i] == first and i < len(word) - 1 and word[i + 1] == second:\n", 198 | " print(f\"\\t -> {iter=} : Merging {first=} and {second=}\")\n", 199 | " new_word.append(first + second)\n", 200 | " i += 2\n", 201 | " else:\n", 202 | " new_word.append(word[i])\n", 203 | " i += 1\n", 204 | " new_word = tuple(new_word)\n", 205 | " word = new_word\n", 206 | " if len(word) == 1: # merged into a single token\n", 207 | " break\n", 208 | " else:\n", 209 | " pairs = get_pairs(word)\n", 210 | " print(f\"{iter=} : Updated {word=}, {pairs=}\")\n", 211 | " iter += 1\n", 212 | " word = \" \".join(word)\n", 213 | " return word" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": 6, 224 | "metadata": {}, 225 | "outputs": [ 226 | { 227 | "name": "stdout", 228 | "output_type": "stream", 229 | "text": [ 230 | "iter=0 : first='Ġ' second='w'\n", 231 | "\t -> iter=0 : Found first='Ġ' in word[i:]='Ġwordaaa'. Skipping previous tokens \n", 232 | "\t -> iter=0 : Merging first='Ġ' and second='w'\n", 233 | "\t -> iter=0 : first='Ġ' not found in word[i:]='ordaaa'\n", 234 | "iter=0 : Updated word=('Ġw', 'o', 'r', 'd', 'a', 'a', 'a'), pairs={('d', 'a'), ('o', 'r'), ('Ġw', 'o'), ('r', 'd'), ('a', 'a')}\n", 235 | "iter=1 : first='o' second='r'\n", 236 | "\t -> iter=1 : Found first='o' in word[i:]=('Ġw', 'o', 'r', 'd', 'a', 'a', 'a'). Skipping previous tokens ('Ġw',)\n", 237 | "\t -> iter=1 : Merging first='o' and second='r'\n", 238 | "\t -> iter=1 : first='o' not found in word[i:]=('d', 'a', 'a', 'a')\n", 239 | "iter=1 : Updated word=('Ġw', 'or', 'd', 'a', 'a', 'a'), pairs={('a', 'a'), ('or', 'd'), ('Ġw', 'or'), ('d', 'a')}\n", 240 | "iter=2 : first='Ġw' second='or'\n", 241 | "\t -> iter=2 : Found first='Ġw' in word[i:]=('Ġw', 'or', 'd', 'a', 'a', 'a'). Skipping previous tokens ()\n", 242 | "\t -> iter=2 : Merging first='Ġw' and second='or'\n", 243 | "\t -> iter=2 : first='Ġw' not found in word[i:]=('d', 'a', 'a', 'a')\n", 244 | "iter=2 : Updated word=('Ġwor', 'd', 'a', 'a', 'a'), pairs={('a', 'a'), ('d', 'a'), ('Ġwor', 'd')}\n", 245 | "iter=3 : first='Ġwor' second='d'\n", 246 | "\t -> iter=3 : Found first='Ġwor' in word[i:]=('Ġwor', 'd', 'a', 'a', 'a'). Skipping previous tokens ()\n", 247 | "\t -> iter=3 : Merging first='Ġwor' and second='d'\n", 248 | "\t -> iter=3 : first='Ġwor' not found in word[i:]=('a', 'a', 'a')\n", 249 | "iter=3 : Updated word=('Ġword', 'a', 'a', 'a'), pairs={('a', 'a'), ('Ġword', 'a')}\n", 250 | "iter=4 : first='a' second='a'\n", 251 | "\t -> iter=4 : Found first='a' in word[i:]=('Ġword', 'a', 'a', 'a'). Skipping previous tokens ('Ġword',)\n", 252 | "\t -> iter=4 : Merging first='a' and second='a'\n", 253 | "\t -> iter=4 : Found first='a' in word[i:]=('a',). Skipping previous tokens ()\n", 254 | "iter=4 : Updated word=('Ġword', 'aa', 'a'), pairs={('Ġword', 'aa'), ('aa', 'a')}\n", 255 | "iter=5 : first='aa' second='a'\n", 256 | "\t -> iter=5 : Found first='aa' in word[i:]=('Ġword', 'aa', 'a'). 
Skipping previous tokens ('Ġword',)\n", 257 | "\t -> iter=5 : Merging first='aa' and second='a'\n", 258 | "iter=5 : Updated word=('Ġword', 'aaa'), pairs={('Ġword', 'aaa')}\n" 259 | ] 260 | }, 261 | { 262 | "data": { 263 | "text/plain": [ 264 | "'Ġword aaa'" 265 | ] 266 | }, 267 | "execution_count": 6, 268 | "metadata": {}, 269 | "output_type": "execute_result" 270 | } 271 | ], 272 | "source": [ 273 | "split_into_tokens(\" wordaaa\")" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "You can see how with each iteration, BPE merges the pair of tokens with the lowest rank / highest priority. It has to merge all occurences of the pair, and thus you have two while loops. The brief summary of what happened:\n", 281 | "1. Obtain all unique bigrams (pairs of adjacent symbols) from the word. (before the while loop)\n", 282 | "2. State: `word` is initially a string, but it is converted into a tuple in later iterations. `word` represents the current segmentation of the original word, represented as a tuple of tokens.\n", 283 | "3. The algorithm then enters a loop where it looks for the lowest-ranked bigram (the first merge that BPE learnt while training). This represents the most frequent pair to be merged, or rather, the best compression step that BPE knows for this word.\n", 284 | "4. If the bigram is not in `bpe_ranks`, it means that no more merges are possible, and the process terminates.\n", 285 | " - Why? Observe how the minimum has been calculated. If the bigram with minimum rank is not in `bpe_ranks`, then it has a rank of `inf`, which means none of the other bigrams are in `bpe_ranks` / merge list.\n", 286 | "5. Otherwise, the function reconstructs the word by merging instances of the bigram.\n", 287 | " - This is the inner while loop. We iteratively find the first occurence of `first` in `word`, record the index, and break up the word into two at this index. Keep proceeding till the end of `word`.\n", 288 | "6. This process continues until the word cannot be further merged, at which point the final tokenized word is returned by joining all the tokens with a whitespace." 289 | ] 290 | }, 291 | { 292 | "cell_type": "markdown", 293 | "metadata": {}, 294 | "source": [ 295 | "What if we flipped the order of the tokens? How do the merges look then? Let's see:" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 7, 301 | "metadata": {}, 302 | "outputs": [ 303 | { 304 | "name": "stdout", 305 | "output_type": "stream", 306 | "text": [ 307 | "iter=0 : first='Ġ' second='w'\n", 308 | "\t -> iter=0 : Found first='Ġ' in word[i:]='aaaĠword'. Skipping previous tokens aaa\n", 309 | "\t -> iter=0 : Merging first='Ġ' and second='w'\n", 310 | "\t -> iter=0 : first='Ġ' not found in word[i:]='ord'\n", 311 | "iter=0 : Updated word=('a', 'a', 'a', 'Ġw', 'o', 'r', 'd'), pairs={('o', 'r'), ('Ġw', 'o'), ('r', 'd'), ('a', 'a'), ('a', 'Ġw')}\n", 312 | "iter=1 : first='o' second='r'\n", 313 | "\t -> iter=1 : Found first='o' in word[i:]=('a', 'a', 'a', 'Ġw', 'o', 'r', 'd'). Skipping previous tokens ('a', 'a', 'a', 'Ġw')\n", 314 | "\t -> iter=1 : Merging first='o' and second='r'\n", 315 | "\t -> iter=1 : first='o' not found in word[i:]=('d',)\n", 316 | "iter=1 : Updated word=('a', 'a', 'a', 'Ġw', 'or', 'd'), pairs={('a', 'a'), ('Ġw', 'or'), ('a', 'Ġw'), ('or', 'd')}\n", 317 | "iter=2 : first='Ġw' second='or'\n", 318 | "\t -> iter=2 : Found first='Ġw' in word[i:]=('a', 'a', 'a', 'Ġw', 'or', 'd'). 
Skipping previous tokens ('a', 'a', 'a')\n", 319 | "\t -> iter=2 : Merging first='Ġw' and second='or'\n", 320 | "\t -> iter=2 : first='Ġw' not found in word[i:]=('d',)\n", 321 | "iter=2 : Updated word=('a', 'a', 'a', 'Ġwor', 'd'), pairs={('a', 'a'), ('a', 'Ġwor'), ('Ġwor', 'd')}\n", 322 | "iter=3 : first='Ġwor' second='d'\n", 323 | "\t -> iter=3 : Found first='Ġwor' in word[i:]=('a', 'a', 'a', 'Ġwor', 'd'). Skipping previous tokens ('a', 'a', 'a')\n", 324 | "\t -> iter=3 : Merging first='Ġwor' and second='d'\n", 325 | "iter=3 : Updated word=('a', 'a', 'a', 'Ġword'), pairs={('a', 'Ġword'), ('a', 'a')}\n", 326 | "iter=4 : first='a' second='a'\n", 327 | "\t -> iter=4 : Found first='a' in word[i:]=('a', 'a', 'a', 'Ġword'). Skipping previous tokens ()\n", 328 | "\t -> iter=4 : Merging first='a' and second='a'\n", 329 | "\t -> iter=4 : Found first='a' in word[i:]=('a', 'Ġword'). Skipping previous tokens ()\n", 330 | "\t -> iter=4 : first='a' not found in word[i:]=('Ġword',)\n", 331 | "iter=4 : Updated word=('aa', 'a', 'Ġword'), pairs={('a', 'Ġword'), ('aa', 'a')}\n", 332 | "iter=5 : first='aa' second='a'\n", 333 | "\t -> iter=5 : Found first='aa' in word[i:]=('aa', 'a', 'Ġword'). Skipping previous tokens ()\n", 334 | "\t -> iter=5 : Merging first='aa' and second='a'\n", 335 | "\t -> iter=5 : first='aa' not found in word[i:]=('Ġword',)\n", 336 | "iter=5 : Updated word=('aaa', 'Ġword'), pairs={('aaa', 'Ġword')}\n" 337 | ] 338 | }, 339 | { 340 | "data": { 341 | "text/plain": [ 342 | "'aaa Ġword'" 343 | ] 344 | }, 345 | "execution_count": 7, 346 | "metadata": {}, 347 | "output_type": "execute_result" 348 | } 349 | ], 350 | "source": [ 351 | "split_into_tokens(\"aaa word\")" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "It's almost exactly the same steps, except that the higest priority tokens seem to keep falling towards the end of the tuple `word`, and thus the tokens at the end keep getting merged (Notice how the first three tokens `('a','a','a')` keep getting skipped in the first few iterations)" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "# MySlowTokenizer\n", 366 | "\n", 367 | "Let's now move on to the slow tokenizer's implementation. Let's see what you can _do_ with MySlowTokenizer first. Features implemented:\n", 368 | "- `tokenizer(text)` : Tokenizes a piece of text and returns a list of token ids. Equivalent to `tokenizer.encode(text)`\n", 369 | "- `tokenizer.decode(token_ids)`: Decodes a list of token ids and returns a stitched up string.\n", 370 | "- `tokenizer.add_tokens(my_new_tokens)`: Add new tokens to the tokenizer's vocabulary. If a token is already present, this errors out. \n", 371 | "- `tokenizer.convert_token_to_id(token)` : Self-evident\n", 372 | "- `tokenizer.convert_id_to_token(token_id)` : Self-evident\n", 373 | "- `tokenizer.pre_tokenize(text)`: Pretokenizes a string by splitting at whitespaces, contractions (Ex: `don't`), etc. This isn't a method in HuggingFace, but I find this convenient. \n", 374 | "- `tokenizer.to_dict()` : Export tokenizer state into a dictionary. 
(Sadly, a `from_dict` hasn't been implemented, as I think this might be overkill)" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": 8, 380 | "metadata": {}, 381 | "outputs": [ 382 | { 383 | "data": { 384 | "text/plain": [ 385 | "MySlowTokenizer(vocab_size=50257, unk_token=<|endoftext|>, added_tokens={'<|endoftext|>'}, bpe=BPE(vocab_size=50257))" 386 | ] 387 | }, 388 | "execution_count": 8, 389 | "metadata": {}, 390 | "output_type": "execute_result" 391 | } 392 | ], 393 | "source": [ 394 | "from minimal_hf_tok import MySlowTokenizer\n", 395 | "from transformers import AutoTokenizer\n", 396 | "gpt2_tokenizer = AutoTokenizer.from_pretrained(\"gpt2\", use_fast=False)\n", 397 | "my_tokenizer = MySlowTokenizer(\"vocab.json\")\n", 398 | "my_tokenizer" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 9, 404 | "metadata": {}, 405 | "outputs": [ 406 | { 407 | "name": "stdout", 408 | "output_type": "stream", 409 | "text": [ 410 | "Added <|myspecialtoken|> to the vocabulary.\n", 411 | "Added <|myspecialspecialtoken|> to the vocabulary.\n" 412 | ] 413 | }, 414 | { 415 | "data": { 416 | "text/plain": [ 417 | "MySlowTokenizer(vocab_size=50259, unk_token=<|endoftext|>, added_tokens={'<|myspecialtoken|>', '<|myspecialspecialtoken|>', '<|endoftext|>'}, bpe=BPE(vocab_size=50259))" 418 | ] 419 | }, 420 | "execution_count": 9, 421 | "metadata": {}, 422 | "output_type": "execute_result" 423 | } 424 | ], 425 | "source": [ 426 | "input_text = \"This isn't<|myspecialtoken|> that simple\"\n", 427 | "new_token1 = \"<|myspecialtoken|>\"\n", 428 | "new_token2 = \"<|myspecialspecialtoken|>\"\n", 429 | "my_tokenizer.add_tokens([new_token1, new_token2])\n", 430 | "gpt2_tokenizer.add_tokens([new_token1, new_token2])\n", 431 | "my_tokenizer" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 10, 437 | "metadata": {}, 438 | "outputs": [ 439 | { 440 | "name": "stdout", 441 | "output_type": "stream", 442 | "text": [ 443 | "Input text: This isn't<|myspecialtoken|> that simple\n", 444 | "My tokenizer encoding: [1212, 2125, 470, 50257, 326, 220, 220, 2829]\n", 445 | "GPT2 tokenizer encoding: [1212, 2125, 470, 50257, 326, 220, 220, 2829]\n", 446 | "My tokenizer decoding: This isn't<|myspecialtoken|> that simple\n", 447 | "GPT2 tokenizer decoding: This isn't <|myspecialtoken|> that simple\n" 448 | ] 449 | } 450 | ], 451 | "source": [ 452 | "my_enc = my_tokenizer.encode(input_text)\n", 453 | "gpt2_enc = gpt2_tokenizer.encode(input_text)\n", 454 | "\n", 455 | "print(\"Input text:\", input_text)\n", 456 | "print(\"My tokenizer encoding:\", my_enc)\n", 457 | "print(\"GPT2 tokenizer encoding:\", gpt2_enc)\n", 458 | "\n", 459 | "print(\"My tokenizer decoding:\", my_tokenizer.decode(my_enc))\n", 460 | "print(\"GPT2 tokenizer decoding:\", gpt2_tokenizer.decode(gpt2_enc))" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "Our tokenizer gives the same result as HuggingFace's tokenizer, and supports adding a new token!" 
468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "metadata": {}, 473 | "source": [ 474 | "# Pouring over the implementation\n", 475 | "I'm copying over the code from `minimal_hf_tok.py` here, because, well, this wouldn't be much of a walkthrough otherwise" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 11, 481 | "metadata": {}, 482 | "outputs": [], 483 | "source": [ 484 | "from typing import Dict, Tuple, Union, List, Any\n", 485 | "import regex as re # regex is cooler than re\n", 486 | "from bpe import BPE\n", 487 | "from minimal_hf_tok import EOS_TOKEN, MyTrie\n", 488 | "\n", 489 | "class MySlowTokenizer:\n", 490 | " \"\"\"\n", 491 | " A minimal implementation of HF's slow tokenizer, based on GPT2's tokenizer\n", 492 | " References:\n", 493 | " https://github.com/huggingface/transformers/blob/8aca43bdb3cb9a5020f6d57589d85679dc873b1c/src/transformers/models/gpt2/tokenization_gpt2.py\n", 494 | " \"\"\"\n", 495 | " def __init__(self, init_vocab_file: str = None):\n", 496 | " self.added_tokens_trie = MyTrie() # trie for added tokens only\n", 497 | " self.bpe = BPE(init_vocab_file)\n", 498 | " self.vocab = self.bpe.token_to_id # nice to have vocab accessible here\n", 499 | " self.byte_encoder = self.bpe.byte_encoder\n", 500 | " self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}\n", 501 | " self.unk_token = EOS_TOKEN\n", 502 | "\n", 503 | " # Regex for pre-tokenization - breaking up a piece of text into words by splitting at whitespaces, contractions, etc. Borrowed from GPT-2\n", 504 | " self.pattern_for_splitting = re.compile(r\"\"\"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+\"\"\")\n", 505 | " self._load_added_tokens()\n", 506 | " \n", 507 | " def _load_added_tokens(self):\n", 508 | " # loads added tokens from json and adds them to the trie\n", 509 | " # Hard coded for GPT2 demonstration\n", 510 | " self.added_tokens_trie.add(EOS_TOKEN)\n", 511 | " \n", 512 | " def __call__(self, *args: Any, **kwargs: Any) -> Any:\n", 513 | " self.encode(*args, **kwargs)\n", 514 | " \n", 515 | " def encode(self, text: str, **kwargs: Any) -> Any:\n", 516 | " text, kwargs = self.prepare_for_tokenization(text, **kwargs)\n", 517 | "\n", 518 | " # 1. Split text into chunks at the boundaries of added_tokens. Can be thought of as a pre-tokenization step.\n", 519 | " # \"This isn't<|endoftext|> what you think\" -> [\"This isn't\", \"<|endoftext|>\", \" what you think\"] \n", 520 | " chunks = self.added_tokens_trie.split(text)\n", 521 | " bpe_tokens = []\n", 522 | " for chunk in chunks:\n", 523 | " if chunk in self.added_tokens_trie._tokens:\n", 524 | " # if chunk is an added token, directly add it to bpe_tokens\n", 525 | " bpe_tokens.append(chunk)\n", 526 | " else:\n", 527 | " # 2. Tokenize each chunk\n", 528 | " tokens = self._tokenize(chunk)\n", 529 | " bpe_tokens.extend(tokens)\n", 530 | " # 3. Convert tokens to ids\n", 531 | " bpe_tokens = [self.convert_token_to_id(token) for token in bpe_tokens]\n", 532 | " return bpe_tokens\n", 533 | "\n", 534 | " def decode(self, ids: List[int], **kwargs: Any) -> str:\n", 535 | " # 1. Convert ids to tokens\n", 536 | " tokens = [self.convert_id_to_token(id_) for id_ in ids] \n", 537 | " # 2. Join tokens\n", 538 | " text = \"\".join(tokens)\n", 539 | " # 3. 
Replace unicode symbols with normal characters\n", 540 | " text = bytearray([self.byte_decoder[c] for c in text]).decode(\"utf-8\")\n", 541 | " return text\n", 542 | " \n", 543 | " def _tokenize(self, text: str) -> List[str]:\n", 544 | " all_tokens = []\n", 545 | " # Pre-tokenization: split text into words based on regex. \"This isn't\" -> [\"This\", \" isn\", \"'t\"]\n", 546 | " words = self.pre_tokenize(text)\n", 547 | " for word in words:\n", 548 | " # Unicode string encoding. \" isn\" -> bytes object -> \"Ġisn\"\n", 549 | " word = \"\".join([self.byte_encoder[b] for b in word.encode(\"utf-8\")]) \n", 550 | " tokens = self.bpe(word, dont_byte_encode=True).split(\" \") # we already encoded the chunk to unicode strings\n", 551 | " all_tokens.extend(tokens)\n", 552 | " return all_tokens\n", 553 | "\n", 554 | " def pre_tokenize(self, text: str) -> List[str]:\n", 555 | " return self.pattern_for_splitting.findall(text)\n", 556 | " \n", 557 | " def prepare_for_tokenization(self, text: str, **kwargs: Any) -> Tuple[str, Dict[str, Any]]:\n", 558 | " \"\"\"\n", 559 | " In HF, this method performs any pre-processing needed before tokenization. Dummy function for now.\n", 560 | " \"\"\"\n", 561 | " # returns text and kwargs\n", 562 | " return (text, kwargs)\n", 563 | "\n", 564 | " def to_dict(self):\n", 565 | " dict1 = {}\n", 566 | " dict1[\"model\"][\"type\"] = \"BPE\"\n", 567 | " dict1[\"model\"][\"vocab\"] = self.vocab\n", 568 | " dict1[\"model\"][\"merges\"] = self.bpe.merges\n", 569 | " dict1[\"added_tokens\"] = list(self.added_tokens_trie._tokens)\n", 570 | " dict1[\"special_tokens_map\"] = {\"unk_token\": self.unk_token}\n", 571 | " return dict1\n", 572 | " \n", 573 | " def add_tokens(self, new_tokens: Union[str, List[str]]):\n", 574 | " \"\"\"\n", 575 | " Adds new tokens to the tokenizer.\n", 576 | " \"\"\"\n", 577 | " if isinstance(new_tokens, str):\n", 578 | " new_tokens = [new_tokens]\n", 579 | " for token in new_tokens:\n", 580 | " self.bpe.add_token(token) # add to vocab first\n", 581 | " self.added_tokens_trie.add(token)\n", 582 | " print(f\"Added {token} to the vocabulary.\")\n", 583 | " \n", 584 | " def convert_token_to_id(self, token: str) -> int:\n", 585 | " \"\"\"\n", 586 | " Converts a token to its id. Returns unk token id if token is not in vocab.\n", 587 | " Fancy word: Numericalization\n", 588 | " \"\"\"\n", 589 | " return self.vocab.get(token, self.vocab[self.unk_token])\n", 590 | "\n", 591 | " def convert_id_to_token(self, index: int) -> str:\n", 592 | " \"\"\"\n", 593 | " Converts an id to its token. Returns unk token if id is not in vocab.\n", 594 | " \"\"\"\n", 595 | " return self.bpe.id_to_token.get(index, self.unk_token)\n", 596 | "\n", 597 | " def __repr__(self) -> str:\n", 598 | " string = \"MySlowTokenizer(\"\n", 599 | " string += f\"vocab_size={self.nvocab}, unk_token={self.unk_token}, added_tokens={str(self.added_tokens_trie._tokens)}, \"\n", 600 | " string += f\"bpe={str(self.bpe)})\"\n", 601 | " return string\n", 602 | " \n", 603 | " # make vocab_size a property because it should change with added tokens\n", 604 | " @property\n", 605 | " def nvocab(self) -> int:\n", 606 | " return len(self.vocab)\n" 607 | ] 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "metadata": {}, 612 | "source": [ 613 | "## `__init__`\n", 614 | "\n", 615 | "1. Initializes the tokenizer with a vocabulary file if provided.\n", 616 | "2. Sets up the `MyTrie` structure for added tokens.\n", 617 | "3. Prepares the byte pair encoding (BPE) mechanism.\n", 618 | "4. 
Creates a byte encoder and decoder for character-level representation.\n", 619 | "5. Sets a pattern for pre-tokenization, which splits text into chunks based on specified rules." 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "metadata": {}, 625 | "source": [ 626 | "## `encode`\n", 627 | "\n", 628 | "Let's break down how to implement the `encode` method. When you do `tok.encode(text)`, there are three operations:\n", 629 | "1. Normalize and pre-tokenize input text. With GPT2, the pre-tokenization involves breaking up the text on whitespace, contractions, punctuations, etc.\n", 630 | "2. Tokenize input string/strings to get a list of tokens for each word/chunk. This is handled by the `._tokenize()` method.\n", 631 | "3. Convert tokens to token ids using `.convert_tokens_to_ids()` method.\n", 632 | "\n", 633 | "\n", 634 | "Important detail on `added_tokens`: These are really tokens added to the BPE vocabulary after the model was trained - can you really just let the BPE model tokenize your string directly? Think about this:\n", 635 | "1. You added a new token `<|myspecialtoken|>` to the vocabulary and gave it a new ID.\n", 636 | "2. You tokenize a string `Hello there <|myspecialtoken|>`. \n", 637 | "What happens if you directly use the BPE model to tokenize this string? First, BPE will split this up into characters, and then it will iteratively merge neighbouring pairs of tokens/characters until you can't merge anymore. The problem is: we don't have any merge rules for this new token! Indeed, you can see this below" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": 12, 643 | "metadata": {}, 644 | "outputs": [ 645 | { 646 | "name": "stdout", 647 | "output_type": "stream", 648 | "text": [ 649 | "['<', '|', 'mys', 'pe', 'cial', 'token', '|', '>']\n", 650 | "50257\n" 651 | ] 652 | } 653 | ], 654 | "source": [ 655 | "print(my_tokenizer._tokenize(new_token1)) # pre-tokenize and then tokenize using bpe\n", 656 | "print(my_tokenizer.vocab.get(new_token1, \"NOTPRESENT\")) # check if new_token is in vocab" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": 13, 662 | "metadata": {}, 663 | "outputs": [ 664 | { 665 | "name": "stdout", 666 | "output_type": "stream", 667 | "text": [ 668 | "[50257]\n" 669 | ] 670 | } 671 | ], 672 | "source": [ 673 | "print(my_tokenizer.encode(new_token1))" 674 | ] 675 | }, 676 | { 677 | "cell_type": "markdown", 678 | "metadata": {}, 679 | "source": [ 680 | "The difference between `._tokenize` and `.encode` is because of handling the added tokens as a pre-tokenization step" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": {}, 686 | "source": [ 687 | "## `decode`\n", 688 | "This is almost the same as the HuggingFace implementation, except that they have some code for handling added tokens, etc. When you run `tok.decode(token_ids)`, there are three operations, as discussed before:\n", 689 | "1. Convert ids to tokens using the `id_to_token` mapping from `tok.bpe`. \n", 690 | "2. Join all the tokens\n", 691 | "3. 
Replace unicode symbols with normal characters" 692 | ] 693 | }, 694 | { 695 | "cell_type": "markdown", 696 | "metadata": {}, 697 | "source": [ 698 | "# Visualize the Trie" 699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": 14, 704 | "metadata": {}, 705 | "outputs": [ 706 | { 707 | "name": "stdout", 708 | "output_type": "stream", 709 | "text": [ 710 | "{\n", 711 | " \"<\": {\n", 712 | " \"|\": {\n", 713 | " \"e\": {\n", 714 | " \"n\": {\n", 715 | " \"d\": {\n", 716 | " \"o\": {\n", 717 | " \"f\": {\n", 718 | " \"t\": {\n", 719 | " \"e\": {\n", 720 | " \"x\": {\n", 721 | " \"t\": {\n", 722 | " \"|\": {\n", 723 | " \">\": {\n", 724 | " \"\": 1\n", 725 | " }\n", 726 | " }\n", 727 | " }\n", 728 | " }\n", 729 | " }\n", 730 | " }\n", 731 | " }\n", 732 | " }\n", 733 | " }\n", 734 | " }\n", 735 | " },\n", 736 | " \"m\": {\n", 737 | " \"y\": {\n", 738 | " \"s\": {\n", 739 | " \"p\": {\n", 740 | " \"e\": {\n", 741 | " \"c\": {\n", 742 | " \"i\": {\n", 743 | " \"a\": {\n", 744 | " \"l\": {\n", 745 | " \"t\": {\n", 746 | " \"o\": {\n", 747 | " \"k\": {\n", 748 | " \"e\": {\n", 749 | " \"n\": {\n", 750 | " \"|\": {\n", 751 | " \">\": {\n", 752 | " \"\": 1\n", 753 | " }\n", 754 | " }\n", 755 | " }\n", 756 | " }\n", 757 | " }\n", 758 | " }\n", 759 | " },\n", 760 | " \"s\": {\n", 761 | " \"p\": {\n", 762 | " \"e\": {\n", 763 | " \"c\": {\n", 764 | " \"i\": {\n", 765 | " \"a\": {\n", 766 | " \"l\": {\n", 767 | " \"t\": {\n", 768 | " \"o\": {\n", 769 | " \"k\": {\n", 770 | " \"e\": {\n", 771 | " \"n\": {\n", 772 | " \"|\": {\n", 773 | " \">\": {\n", 774 | " \"\": 1\n", 775 | " }\n", 776 | " }\n", 777 | " }\n", 778 | " }\n", 779 | " }\n", 780 | " }\n", 781 | " }\n", 782 | " }\n", 783 | " }\n", 784 | " }\n", 785 | " }\n", 786 | " }\n", 787 | " }\n", 788 | " }\n", 789 | " }\n", 790 | " }\n", 791 | " }\n", 792 | " }\n", 793 | " }\n", 794 | " }\n", 795 | " }\n", 796 | " }\n", 797 | " }\n", 798 | " }\n", 799 | " }\n", 800 | "}\n" 801 | ] 802 | } 803 | ], 804 | "source": [ 805 | "print(my_tokenizer.added_tokens_trie) # a very bad visualization of the trie as a json. 
" 806 | ] 807 | }, 808 | { 809 | "cell_type": "code", 810 | "execution_count": null, 811 | "metadata": {}, 812 | "outputs": [], 813 | "source": [] 814 | } 815 | ], 816 | "metadata": { 817 | "kernelspec": { 818 | "display_name": "huggingface", 819 | "language": "python", 820 | "name": "python3" 821 | }, 822 | "language_info": { 823 | "codemirror_mode": { 824 | "name": "ipython", 825 | "version": 3 826 | }, 827 | "file_extension": ".py", 828 | "mimetype": "text/x-python", 829 | "name": "python", 830 | "nbconvert_exporter": "python", 831 | "pygments_lexer": "ipython3", 832 | "version": "3.11.5" 833 | }, 834 | "orig_nbformat": 4 835 | }, 836 | "nbformat": 4, 837 | "nbformat_minor": 2 838 | } 839 | -------------------------------------------------------------------------------- /4-tokenization-is-hard/README.md: -------------------------------------------------------------------------------- 1 | # Table of Contents 2 | 3 | 4 | 5 | - [Agenda](#agenda) 6 | - [Number Tokenization](#number-tokenization) 7 | * [Effects of non-uniform tokenization](#effects-of-non-uniform-tokenization) 8 | - [Tokenization and metrics](#tokenization-and-metrics) 9 | - [Tokenization for non-English languages](#tokenization-for-non-english-languages) 10 | - [Tokens for Multilingual models](#tokens-for-multilingual-models) 11 | * [No Language Left Behind](#no-language-left-behind) 12 | + [Why tokenization matters](#why-tokenization-matters) 13 | - [How does temperature sampling help again?](#how-does-temperature-sampling-help-again) 14 | + [Tokenization affects evaluation](#tokenization-affects-evaluation) 15 | - [Low Resource => More Costly](#low-resource--more-costly) 16 | - [Next Chapter](#next-chapter) 17 | 18 | 19 | 20 | # Agenda 21 | This section is all about challenges with tokenization. The main idea is to get a sense of cases, data modalities, etc where "good" tokenization is hard - by looking at what "good" tokenization should be, how current tokenizers perform, and the downstream effects of "bad" performance. We'll look at integer tokenization, tokenization for non-English languages and going multilingual, with a focus on the recent No Language Left Behind (NLLB) effort from Meta. 22 | 23 | # Number Tokenization 24 | 25 | What is an ideal way to tokenize numbers? Well, the main reference we have is the way we represent numbers. In the decimal number system, we assign unique symbols to numbers 0 to 9, and then all other numbers can be represented using these symbols (along with the "." for fractional parts). So, one expectation you could have is that tokenization should also follow this uniformity (along with a special token for continuity of a number, like "##" for BERT). That is far from the case in practice! Let's take T5, for example. The number "410" gets segmented as one token, but the numbers "411" or "490" are segmented as two tokens, "4" and "11"/ "90". There are a bunch of weird patterns in how many tokens different numbers get. The reason is pretty simple if you think about what BPE does - your training corpus contains a bunch of different numbers, and a tokenizer like BPE, while training on this corpus, would try to find the best way to compress all of these numbers. Your training data is likely to contain round numbers more ("400-odd students", "100s of people missing"). Strings with higher frequencies end up with lower number of tokens with BPE (recall that BPE's objective is specifically for good compression). 
Thus, round numbers like 400 get segmented with fewer tokens and you are likely to find them as a part of the vocabulary (i.e. they get a single token), while a more random sequence of digits like 4134 might get 2/3/4 tokens. 26 | 27 | This non-uniform way of representing numbers might be nice for compression's sake, but it has some nasty downstream effects. The embedding for two numbers close to each other on the number line can be very different because of almost arbitrary splits for the two numbers (like in the above case for "410" and "411"). Thus, recent models explicitly ensure some uniformity. For example, with Llama, all numbers are tokenized as individual digits (both "410" and "411" are segmented as 3 tokens). With Falcon, all numbers from 0-999 are segmented as 1 token, although strangely enough 957 is the only exception that gets 2 tokens (why??). Larger numbers are tokenized using smaller 3-digit/2-digit/1-digit tokens. As far as I can tell, even GPT-4's tokenizer (available in [tiktoken](https://github.com/openai/tiktoken)) does the same: all integers from 0-999 (inclusive) are present in the vocab and larger numbers get split based on these smaller numbers. 28 | 29 | ## Effects of non-uniform tokenization 30 | Let's think a little deeper into why uniformity might be nice to have. What do we want with a language model anyway? Maybe we want a language model to have some "understanding" of what integers are. However, we're not even passing integers directly: the language model gets access to an embedding vector or a sequence of embedding vectors. Thus, the right view is the _functional_ view: if the language model can do all the operations we want on integers, then that is as good as "understanding" integers. Key operations we'd want are: 31 | 1. Arithmetic ops: Addition (+), subtraction (-), multiplication (*) and division (/) 32 | 2. Ordering ops: Equality (`==`), Greater than (`>`), Lesser than (`<`) 33 | 34 | Our language model has to learn all of these different operations, but now as a mapping from a sequence of embedding vectors to another sequence of embedding vectors. Thinking in this way, you can see how even an increment operation ("410 + 1") can get messy with non-uniform tokenization. This, of course, doesn't mean that tokenizing numbers as individual characters will solve our issues with arithmetic. But it certainly looks to be a better input representation. 35 | 36 | **Further reading** 37 | 38 | - LLaMA: Open and Efficient Foundation Language Models: https://arxiv.org/abs/2302.13971 39 | 40 | 41 | # Tokenization and metrics 42 | Differences in text pre-processing/tokenization schemes can show up in your metrics. A prime example is the aptly named package _sacrebleu_ from [Post _et al_](https://aclanthology.org/W18-6319/) (2018). BLEU (BiLingual Evaluation Understudy) is a popular metric in NLP for machine translation. However, the score itself doesn't specify the exact format in which reference and machine-translated text are compared, and thus different users use their own tokenization and normalization schemes. There can be significant differences across such formats, with Post _et al._ showing that the changes can be as high as 1.8 BLEU (which could be the difference between your model being state-of-the-art or not). [SacreBLEU](https://github.com/mjpost/sacrebleu) supports scoring detokenized outputs using a common, standard tokenizer so that results are consistent.
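Here's a tiny illustration of why this matters (a minimal sketch, assuming the `sacrebleu` package is installed; the hypothesis/reference pair below is made up for demonstration). The only thing that changes between runs is the tokenizer applied before n-gram matching, and the reported BLEU changes with it:

```
import sacrebleu

# One hypothesis and one reference; note the stray space before the final period
hypotheses = ["The cat sat on the mat ."]
references = [["The cat sat on the mat."]]

# "none" assumes pre-tokenized input, "13a" is sacrebleu's default, "intl" also handles unicode punctuation
for tok in ["none", "13a", "intl"]:
    bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize=tok)
    print(f"tokenize={tok}: BLEU = {bleu.score:.2f}")
```

With `tokenize="none"`, "mat ." and "mat." are treated as different tokens and the score drops; with the standard tokenizers, the period gets split off and the two sentences match. This is exactly the kind of discrepancy that SacreBLEU's standardized tokenization is meant to remove.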
43 | 44 | **Further reading** 45 | 46 | - A Call for Clarity in Reporting BLEU Scores: https://aclanthology.org/W18-6319/ 47 | 48 | # Tokenization for non-English languages 49 | 50 | Being able to process text in more than one language is an essential component of many NLP applications. For example, Whisper, OpenAI's speech-to-text model (which is a plain old encoder-decoder transformer model) [can process English and Chinese speech flawlessly](https://x.com/jeremyphoward/status/1721696652506100175?s=20). Of course, this is not a plain text-in -> text-out application, but the point is that tokenization in other languages, especially low-resource languages, comes with its own set of challenges. The first, and probably biggest, challenge is the lack of high-quality training data for many languages. Languages like English, Chinese, Russian, etc. have a lot more content available on the internet than, say, Kannada and Swahili. There are also fundamental differences across languages, such as the absence of a typographic separator (Ex: a whitespace in English) in some languages like Chinese and Japanese. Some languages also have complex script rules, specifically with how consonants and vowels combine. In Kannada, for example, the word for 'language' is ಭಾಷೆ. ಭಾ is a combination of the consonant 'ಭ' and the vowel sign 'ಾ'. You would want a tokenizer that recognizes ಭಾ as a meaningful unit and a distinct token, and doesn't break it down further into consonant and vowel signs. 51 | 52 | # Tokens for Multilingual models 53 | 54 | Firstly, how do you build a multilingual system? If you're building a machine-translation model, then one approach can be to learn one tokenizer per language. For example, you can imagine an encoder-decoder transformer trained to translate English to French. In this case, your input text will be tokenized and numericalized based on, say, a BPE tokenizer trained on English text. You would retrieve the appropriate embeddings for each token and pass them through the Transformer. At the decoder's output, you select the most probable class/ID for each position in the output sequence, where IDs are based on the vocabulary of a BPE tokenizer trained on French. When decoded tokens are fed back into the decoder, you make use of a different, output embedding layer that maps decoded tokens to embeddings which are passed to the decoder. 55 | 56 | Of course, one would want to build multilingual models that can translate between a lot more than 2 languages (and more than just 1-way translation). In this case, a simple approach would be to mix all available data for all the languages. The problem with this is that there is always significant disparity in the amount of data you have for each language. This disparity shows up in the number of tokens dedicated to each language in the vocabulary (i.e. very low data -> close to character-level tokenization). Issues with complex script rules also hurt performance. Let's now look at one paper which built a massive multilingual system, and the world of multilingual tokens. 57 | 58 | ## No Language Left Behind 59 | 60 | [No Language Left Behind (NLLB)](https://ai.meta.com/research/no-language-left-behind/) was a massive effort from Meta AI to improve machine translation models for low-resource languages. This is the first time we crossed the 200-language count in terms of datasets and models available. Key contributions include new datasets, models and benchmarks, focusing on languages never targeted at scale before.
To break this down, they first conducted surveys of native speakers of different low-resource languages to understand their needs regarding machine translation, then developed an automatic data generation pipeline focusing on said languages. They utilized smart data mining techniques (essentially, an improved version of [bitext mining](https://paperswithcode.com/task/cross-lingual-bitext-mining)) to collect quality training data. Using this mined data along with human-translated seed data, they trained multilingual Mixture-of-Experts models. 61 | 62 | The full NLLB paper is 192 pages long! Needless to say, even a good summary of their contributions is going to be pretty long. Let's get back to our centre of discussion: tokenization. 63 | 64 | ### Why tokenization matters 65 | Well, how did they train their tokenizer, given that you're dealing with this extremely disproportionate amount of data per language? Here's a snippet from page 91 of the paper: 66 | 67 | > ....we trained a new SentencePiece model....To train this SentencePiece model, we sample a total of 100M sentences from primary bitext corpora. Given that most of the languages in NLLB are low-resource languages (150), uniform sampling would over-represent high-resource languages and under-represent low-resource languages, leading to too much fragmentation of low-resource language text. To mitigate this, we apply temperature sampling (with temperature T = 5), which effectively downsamples high-resource languages and upsamples low-resource languages. This results in a more balanced distribution of samples over all languages. 68 | 69 | #### How does temperature sampling help again? 70 | Here's a quick refresher! Consider the case with 2 languages $En$ and $Kn$, with 1000 and 100 training examples respectively. With uniform sampling for the combined dataset, here are the probabilities that a random example belongs to a given language: 71 | $P(En) = 1000/1100 \approx 91\%$ and $P(Kn) = 100/1100 \approx 9\%$. Now with temperature sampling with T=5, you get 72 | 73 | $P(En) = \dfrac{(1000)^{1/5}}{(100)^{1/5} + (1000)^{1/5}} = 61.31\%$ 74 | 75 | $P(Kn) = \dfrac{(100)^{1/5}}{(100)^{1/5} + (1000)^{1/5}} = 38.69\%$ 76 | 77 | You can clearly see the difference! Of course, when you've got a mixture of 200 languages with many of them low-resource, you will still have some imbalance. The idea of over-sampling low-resource languages to get a more balanced vocabulary is certainly not new. But I find it interesting that even at the scale of 200 languages, all you need (or all we know of?) is temperature sampling. 78 | 79 | ### Tokenization affects evaluation 80 | Another interesting side effect of poor tokenization from the paper: lower performance on toxicity detection. In NLLB, they evaluated their machine-translation system by trying to do toxicity detection on the translations. I don't want to bloat this section up with too many details, so here's the gist. Toxicity detection is important for a translation system, because *added toxicity* terms, i.e., translated content containing toxic words that were *not* present in the original text, are (1) a sign of bad/low-quality translation and (2) something that degrades user trust in the system (like translating "I'm bad at this" to "Ich bin scheiße darin"/"I'm shit at this").
So, the main purpose here is: 81 | > ...to improve translation safety through minimizing the probability of catastrophic mistranslations 82 | 83 | The pipeline for a language like Hindi looks as follows: 84 | English text -> Translation Model -> Hindi tokens -> Hindi text -> Exact match vs a list of known toxic words. 85 | 86 | The results for different languages are given below (the details don't matter for this discussion; just note that smaller numbers mean worse performance, and observe which languages perform worse). 87 | 88 | ![Alt text](toxicity_detection_nllb.png) 89 | 90 | Here are the authors' comments on the languages with poor performance: 91 | 92 | > The results for the languages with the lowest performance in Figure 28, i.e., Hindi (hin_Deva), Kannada (kan_Knda), Maithili (mai_Deva), Telugu (tel_Telu), and Magahi (mag_Deva), may be partially explained by the fact that the scripts in which these languages are written are not always adequately tokenized by our detectors. 93 | 94 | While tokenization might not be the sole reason, it's important to note how "poor" tokenization can affect *everything*. Another question to think about is: what's special about these specific languages (and why are they all Indic)? Well, many Indic languages have complex script rules, in that different consonants and vowels fuse together to change gender, tense, mood, plurality, etc. Typically, tokenizer vocabularies are not large enough to adequately represent the various meaningful units in these languages, and fine-grained tokenization can lead to a loss of information. For example, in [IndicNLP](https://aclanthology.org/2020.findings-emnlp.445.pdf), they trained a tokenizer on 11 Indian languages, but used a vocabulary size of 200k to accommodate the large vocabularies of Indic languages. Similarly, in [IndicTrans2](https://arxiv.org/abs/2305.16307), the authors used a vocabulary of size 128K while training on just 22 languages. Compare that to the 256K vocabulary used in NLLB while training on 200 languages, and you can see how this starts to matter. 95 | 96 | **Further reading** 97 | - No Language Left Behind: Scaling Human-Centered Machine Translation - https://arxiv.org/abs/2207.04672 98 | - IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages - https://aclanthology.org/2020.findings-emnlp.445.pdf 99 | 100 | # Low Resource => More Costly 101 | One artifact of having an imbalanced mixture of different languages in your training corpus (for the tokenizer) is that your costs for text completions in low-resource languages can shoot up - simply because the text sequences get encoded with more tokens (i.e. there is less _compression_ since a _smaller_ part of the vocabulary is _allocated_ for that language). For example, one user found that API calls in Hindi are 8 times more expensive than those in English: https://www.reddit.com/r/OpenAI/comments/124v2oi/hindi_8_times_more_expensive_than_english_the/ 102 | 103 | # [Next Chapter](/5-puzzles/) 104 | We'll look at two simple puzzles to get you thinking about pre-tokenization and the impact of vocabulary size on tokenizer _fertility_ (number of tokens per word).
-------------------------------------------------------------------------------- /4-tokenization-is-hard/toxicity_detection_nllb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/4-tokenization-is-hard/toxicity_detection_nllb.png -------------------------------------------------------------------------------- /5-puzzles/README.md: -------------------------------------------------------------------------------- 1 | # Table of Contents 2 | 3 | 4 | 5 | - [Applying what we've learned](#applying-what-weve-learned) 6 | - [Tokenizer Puzzle 1: White Spaces](#tokenizer-puzzle-1-white-spaces) 7 | - [Tokenizer Puzzle 2: Vocabulary Size](#tokenizer-puzzle-2-vocabulary-size) 8 | * [Testing on code](#testing-on-code) 9 | - [Next Chapter](#next-chapter) 10 | 11 | 12 | 13 | # Applying what we've learned 14 | Let's apply what we've learned through two simple puzzles! We'll look at two puzzles to get you thinking about pre-tokenization, vocabulary size, etc. 15 | 16 | # Tokenizer Puzzle 1: White Spaces 17 | Here's a simple puzzle to test your knowledge of tokenization. Consider the case where you have a sequence of English words all stuck together i.e whitespace between them has been removed. Here's a sample: 18 | 19 | ``` 20 | Myfirsttokenizerpuzzle 21 | ``` 22 | 23 | The question is: 24 | > Can you recover the original words without whitespace? 25 | 26 | 27 | This is, in general, ill-posed, but let's say that in the case we're interested in, there is only one possible set of words. This is a modification of [Leetcode-style questions](https://leetcode.com/problems/word-break/) you might have seen. The focus here is, of course, to get a better understanding of tokenization. So, the natural question is: what if you tokenize the sequence? What do you end up with? 28 | 29 | Let's use the BERT tokenizer, because you can clearly see the difference with and without. We get: 30 | 31 | ``` 32 | ['My', '##fi', '##rst', '##tok', '##eni', '##zer', '##pu', '##zzle'] 33 | ``` 34 | 35 | To try this out yourself, you can simply run: 36 | ``` 37 | from transformers import AutoTokenizer 38 | tok = AutoTokenizer.from_pretrained("bert-base-cased") 39 | tok.batch_decode(tok.encode("Myfirsttokenizerpuzzle", add_special_tokens=False)) # omit postprocessing 40 | ``` 41 | 42 | Here "##" denotes that this is infact a continuation from the previous token. If you had spaces ("My first tokenizer puzzle"), you'll see 43 | 44 | ``` 45 | ['My', 'first', 'token', '##izer', 'puzzle'] 46 | ``` 47 | 48 | You can clearly see the differences with and without whitespaces. In general, the sequence lengths can vary a lot with and without spaces. This is because the tokenizer sees one giant chunk of text, and in fact a large word it hasn't seen before. WordPiece uses a left-to-right longest-match first strategy to find the longest tokens to span the full length of the given text. This is different from the case with whitespaces. To better see the differences between the two cases, let's visit the full tokenization pipeline: 49 | 50 | ![Alt text](tok_pipeline.png) 51 | 52 | The image above is from the [🤗 NLP course](https://huggingface.co/learn/nlp-course/chapter6/4), showing the full pipeline for the uncased BERT tokenizer (which is why normalization converts everything to lower case). 
With our example, let's see what happens at each stage with the _cased_ BERT tokenizer: 53 | 54 | ``` 55 | Myfirsttokenizerpuzzle 56 | ↓ ········································ (Normalization) 57 | Myfirsttokenizerpuzzle 58 | ↓ ········································ (Pre-tokenization) 59 | [('Myfirsttokenizerpuzzle', (0, 22))] 60 | ↓ ········································ (Model/Tokenization) 61 | ['My', '##fi', '##rst', '##tok', '##eni', '##zer', '##pu', '##zzle'] 62 | ``` 63 | 64 | ``` 65 | My first tokenizer puzzle 66 | ↓ ········································ (Normalization) 67 | My first tokenizer puzzle 68 | ↓ ········································ (Pre-tokenization) 69 | [('My', (0, 2)), ('first', (3, 8)), ('tokenizer', (9, 18)), ('puzzle', (19, 25))] 70 | ↓ ········································ (Model/Tokenization) 71 | ['My', 'first', 'token', '##izer', 'puzzle'] 72 | ``` 73 | 74 | I've omitted post-processing here. You can see that the normalization step in this case did not change anything (I believe for BERT-cased, even extra whitespaces, etc. are preserved, but, for example, tabs are converted to whitespaces, so it's not exactly doing nothing). In the pre-tokenization step, the words are split based on whitespace, but with BERT, whitespace information is _implicit_ with the use of `##`. Recall that with GPT2, the whitespace information is preserved in the tokens by adding a unicode symbol Ġ. Also, in this case, the output also contains _offset mappings_: these are the offsets from the start of the text for the token boundaries. Looking at our example without whitespaces, you can see that there's really nothing happening in the normalization and pre-tokenization steps (which will be true for most modern tokenizers like Llama and GPT4). You can clearly see the difference in the inputs to the WordPiece model (i.e. the third step). After pre-tokenization (post-pre-tokenization?), the WordPiece model will find the best (rather, longest) match for the list of words. Note that in general, with sub-word tokenizers, tokens are usually sub-words, but a token could well be a _multi-word token_. While some language models [like Baichuan](https://x.com/suchenzang/status/1697862650053660721?s=20) have this for Chinese text, I am yet to see a multi-word token for English in models like GPT-4 and Llama. I believe with English text, pre-tokenization generally splits on whitespace, and thus the inputs for the tokenization model are only single words, which might get split further. Anyways, back to our tokenization pipeline: now that we've gone through all the steps for both cases, let's come back to the original question: 75 | 76 | > Can you recover the original words without whitespace? 77 | 78 | You can see that the tokenizer gives 8 vs 5 tokens for the two cases (without spaces and with spaces, resp.). Clearly, finding a mapping between these two tokenized sequences is non-trivial in general. The one thing that you can say is that a good _sequence encoder_ model should place both these sequences very close together in the embedding space. Do you see where I'm going with this? One answer to the problem, of course, would be to build a dictionary with as many English words as possible (we would need to deal with all possibilities with special characters, punctuation, etc.) and then have some optimized dynamic programming strategy (because this is ill-posed, with multiple solutions in general, and many answers/splits won't be grammatically sound).
But the best answer for the problem is to feed the sequence into a language model! A powerful-enough language model will be pretty much flawless in recovering the original words. (except, maybe in cases with [glitch tokens](../6-postprocessing-and-more/glitch_tok.md)) 79 | 80 | 81 | 82 | # Tokenizer Puzzle 2: Vocabulary Size 83 | 84 | Inference and Training speed/throughput are often reported in tokens/ sec. However, the number of tokens depends on the models' vocabulary size - and this can differ widely - GPT 2 has a 50k vocab size, Llama 2 has a 32k vocab size, and GPT4 has a whopping 100k vocab size. This means that comparisons based solely on token counts might not make sense. You might ask: How _exactly_ does this affect sequence length: Are there heuristics to predict sequence lengths based on vocab sizes or other data? Well, that's a very hard question. Firstly, a lot of current tokenizers are BPE-based, so let's say we're only looking at BPE tokenizers. Now, vocabulary sizes are _chosen_ while training tokenizers. But vocab size is not the only component here. One training corpus might include a lot of code, and thus the vocabulary would have a bunch of code-specific tokens, while another might include very little, and you might end up with only character-level tokens for code. With such variability, it's not easy to just look at vocabulary size and say that for this dataset, I will get x times more tokens with LLama 2 vs GPT4. Indeed, you can see the same from [Thomas Wolf's tokenizer puzzle](https://twitter.com/Thom_Wolf/status/1700812382392516936): 85 | 86 | > Sunday small guessing puzzle 87 | > Let's say I have 3 tokenizers: 88 | > - llama2: 32k vocab 89 | > - falcon: 65k vocab 90 | > - GPT4: 100k vocab 91 | I take ~2M random documents from the web (let’s say 10 random parquet files from RefinedWeb from https://huggingface.co/datasets/tiiuae/falcon-refinedweb roughly 1B tokens). I tokenize them with the tokenizers. 92 | What will be the relative fertilities of these 3 tokenizers? ie. how many more tokens with falcon and llama2 versus gpt4 for instance would you expect. And why does this matter? 93 | 94 | A general heuristic is that a bigger vocab will lead to fewer tokens. Having more tokens in the vocabulary means that longer character sequences might get represented with the additional tokens present, and you can get a shorter overall sequence length using the additional token ids. With BPE, you can simply say that you're definitely making more merges, and thus overall sequence length reduces (Of course, there is some more nuance here - you need to have more tokens dedicated for the given domain/ corpus you're dealing with). 95 | 96 | Here's the [answer](https://x.com/Thom_Wolf/status/1701206627859206450?s=20): 97 | > Running on 1B tokens from the RefinedWeb dataset. 98 | > - GPT4 tokenizer (100k vocab) gives you 0.997B tokens 99 | > - Falcon tokenizer (64k vocab) gives you ~5% more tokens (1.04B) 100 | > - Llama2 tokenizer (32k vocab) gives you ~20% more tokens (1.18B) 101 | 102 | These numbers are... definitely non-trivial to see. Notice that the absurdly large differences in vocab size do not get reflected as much in the number of tokens. Of course, one would have to go through GPT4's vocab to see what the representation for different data domains (code? other languages?) are like. 
For example, if the difference between Falcon and Llama2 tokenizers is that the extra tokens in Falcon's vocab were all for code, then you shouldn't expect to see a big difference in tokenized sequence length when you use a corpus of English text. To test this out, let's try this: We'll use a small corpus of plain English text - Paul Graham's essays - and see the differences in number of tokens. 103 | 104 | The script `paul_graham_essay_scraper.py` will scrape all the text from Paul Graham's essays. I'm not adding the processed, combined plain text file with all essays since that file is large, but this is a small peak at what it looks like: 105 | 106 | ``` 107 | February 2007A few days ago I finally figured out something I've wondered about 108 | for 25 years: the relationship between wisdom and intelligence. 109 | Anyone can see they're not the same by the number of people who are 110 | smart, but not very wise. And yet intelligence and wisdom do seem 111 | related. How?What is wisdom? I'd say it's knowing what to do in a lot of 112 | situations. I'm not trying to make a deep point here about the 113 | true nature of wisdom, just to figure out how we use the word. A 114 | wise person is someone who usually knows the right thing to do.And yet isn't being smart also knowing what to do in certain 115 | situations? For example, knowing what to do when the teacher tells 116 | your elementary school class to add all the numbers from 1 to 100? 117 | [1]Some say wisdom and intelligence apply to different types of 118 | problems—wisdom to human problems and intelligence to abstract 119 | ``` 120 | 121 | It's definitely not cleaned up (wrt separators, random new lines, etc), but it'll do. Locally, once you run 122 | ``` 123 | python paul_graham_essay_scraper.py 124 | ``` 125 | You'll have a `all_pg_essays.txt` file. Then, run 126 | 127 | ``` 128 | python get_token_counts.py 129 | ``` 130 | 131 | To get the following counts (I've added GPT2 as well, because why not): 132 | 133 | ``` 134 | Number of tokens for GPT2: 716928 (716K) 135 | Number of tokens for GPT4: 697456 (697K) 136 | Number of tokens for Llama: 791441 (791K) 137 | Number of tokens for Falcon: 732809 (732K) 138 | ``` 139 | The vocab sizes for GPT2, GPT4, Llama and Falcon are 50K, 100K, 32K, 64K respectively. You can see the trend in number of tokens follow the inverse of vocab sizes roughly: LLama > Falcon > GPT2 > GPT4. The exact details of merges (all are BPE tokenizers), the training corpus used, as well as the nature of the test corpus used, etc can affect the numbers you see. ( Paul Graham's essays certainly differs in characteristics from datasets like CommonCrawl which are used to train these tokenizers). But you can see that the numbers are actually very close to each other! If all of the extra tokens in GPT-4 were dedicated for English text, then you would definitely see a much bigger decrease in number of tokens. (I haven't found a heuristic yet to say increasing vocab from 50k -> 100K on the _same training corpus_ implies x% lesser tokens. If you do find one, please make a PR) 140 | 141 | 142 | ## Testing on code 143 | Let's now look at the number of tokens produced for a dataset of code. We'll look at the Stack dataset, and just about 1.5GB of data (5 files from the python subset). If you wish to try this out locally, run 144 | ``` 145 | python get_stack_subset.py 146 | ``` 147 | This will download a subset of the Stack dataset and then tokenize the data using GPT2, GPT4, LLama and Falcon tokenizers. 
The results are below: 148 | 149 | ``` 150 | Number of GPT2 tokens: 2,262,766,749 151 | Number of GPT4 tokens: 1,193,965,250 152 | Number of Llama tokens: 1,629,580,815 153 | Number of Falcon tokens: 1,496,059,289 154 | ``` 155 | 156 | Aha! We see clear differences across tokenizers now. GPT2 gives a whopping 2x the number of tokens as GPT4's tokenizer. To understand this better, let's look back at vocab sizes again: 157 | 158 | GPT4 (100K) > Falcon (65K) > GPT2 (50K) > Llama (32K) 159 | 160 | GPT2 has a _larger_ vocabulary size than LLama, but the number of tokens is in fact _1.3x more_. Why? Well, because it's not general vocabulary size that matters, of course! It's the vocabulary dedicated for the relevant characters/tokens in your test corpus that matters! If you notice, Llama actually gives you more tokens on English text (the previous experiment) than GPT2, so a clear tradeoff is visible. BPE is after all, a compression algorithm, and the training corpus for the Llama tokenizer had a good representation of code-related data, and thus more tokens were dedicated to code to _effectively compress_ that data. A simple demonstration: The encoding for 4 spaces (` ` - pretty common in code) is 1 single token with Llama's tokenizer, but 4 separate tokens with GPT2. 161 | 162 | # [Next Chapter](/6-postprocessing-and-more/) 163 | We'll be looking into the postprocessing step of tokenization (a special section on some special tokens), along with a look at a strange behaviour with "glitch tokens", and why you might want to make a tiny tokenizer. -------------------------------------------------------------------------------- /5-puzzles/get_stack_subset.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | from transformers import AutoTokenizer 3 | import tiktoken 4 | from functools import partial 5 | from dataclasses import dataclass 6 | 7 | @dataclass 8 | class Tokenizers: 9 | gpt2_tokenizer: AutoTokenizer 10 | gpt4_tokenizer: tiktoken.Encoding 11 | llama_tokenizer: AutoTokenizer 12 | falcon_tokenizer: AutoTokenizer 13 | 14 | def tokenized_lengths(examples, tokenizers: Tokenizers): 15 | examples["gpt2_length"] = [len(t) for t in tokenizers.gpt2_tokenizer(examples["content"])["input_ids"]] 16 | 17 | examples["gpt4_length"] = [len(t) for t in tokenizers.gpt4_tokenizer.encode_batch(examples["content"], disallowed_special=())] 18 | 19 | examples["llama_length"] = [len(t) for t in tokenizers.llama_tokenizer(examples["content"])["input_ids"]] 20 | 21 | examples["falcon_length"] = [len(t) for t in tokenizers.falcon_tokenizer(examples["content"])["input_ids"]] 22 | return examples 23 | 24 | 25 | 26 | if __name__ == "__main__": 27 | data_files = [f"data/python/train-0000{i}-of-00206.parquet" for i in range(5)] 28 | dataset = load_dataset("bigcode/the-stack",data_files=data_files) 29 | dataset = dataset["train"] # by default a train split is created 30 | print("Number of examples:", len(dataset)) 31 | gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2") 32 | gpt4_tokenizer = tiktoken.encoding_for_model("gpt-4") 33 | llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf") 34 | falcon_tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b") 35 | 36 | tokenizers = Tokenizers(gpt2_tokenizer, gpt4_tokenizer, llama_tokenizer, falcon_tokenizer) 37 | 38 | mapper = partial(tokenized_lengths, tokenizers=tokenizers) 39 | 40 | dataset = dataset.map(mapper, batched=True) 41 | gpt2_tokens = sum(dataset["gpt2_length"]) 
42 | print(f"Number of GPT2 tokens: {gpt2_tokens:,}") 43 | 44 | gpt4_tokens = sum(dataset["gpt4_length"]) 45 | print(f"Number of GPT4 tokens: {gpt4_tokens:,}") 46 | 47 | llama_tokens = sum(dataset["llama_length"]) 48 | print(f"Number of Llama tokens: {llama_tokens:,}") 49 | 50 | falcon_tokens = sum(dataset["falcon_length"]) 51 | print(f"Number of Falcon tokens: {falcon_tokens:,}") -------------------------------------------------------------------------------- /5-puzzles/get_token_counts.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoTokenizer 2 | import argparse 3 | import tiktoken 4 | 5 | parser = argparse.ArgumentParser() 6 | parser.add_argument("--file_path" , type=str, default="all_pg_essays.txt") 7 | 8 | if __name__ == "__main__": 9 | gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2") 10 | gpt4_tokenizer = tiktoken.encoding_for_model("gpt-4") 11 | llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf") 12 | falcon_tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b") 13 | 14 | args = parser.parse_args() 15 | file_path = args.file_path 16 | with open(file_path, "r") as f: 17 | data = f.read() 18 | 19 | gpt2_tokens = len(gpt2_tokenizer.encode(data)) 20 | gpt4_tokens = len(gpt4_tokenizer.encode(data)) 21 | llama_tokens = len(llama_tokenizer.encode(data)) 22 | falcon_tokens = len(falcon_tokenizer.encode(data)) 23 | print(f"Number of tokens for GPT2: {gpt2_tokens}") 24 | print(f"Number of tokens for GPT4: {gpt4_tokens}") 25 | print(f"Number of tokens for Llama: {llama_tokens}") 26 | print(f"Number of tokens for Falcon: {falcon_tokens}") 27 | -------------------------------------------------------------------------------- /5-puzzles/paul_graham_essay_scraper.py: -------------------------------------------------------------------------------- 1 | """ 2 | A simply essay scraper for Paul Graham's essays. 3 | """ 4 | from bs4 import BeautifulSoup 5 | 6 | import requests 7 | import pprint 8 | from tqdm import tqdm 9 | 10 | save_path = "all_essays.txt" 11 | 12 | def get_article_text(url): 13 | r = requests.get(url) 14 | soup = BeautifulSoup(r.text, "html.parser") 15 | # Main text is within tags 16 | # and the second table contains the article text. 
17 |     if len(soup.find_all('table')) < 2:
18 |         print("bad layout")
19 |         return ""  # bad layout
20 |     article_text = soup.find_all('table')[1].get_text()
21 |     return article_text
22 | 
23 | 
24 | pp = pprint.PrettyPrinter(indent=4)
25 | base_url = "http://www.paulgraham.com/"
26 | r = requests.get("http://www.paulgraham.com/articles.html")
27 | data = r.text
28 | articles = {}
29 | 
30 | soup = BeautifulSoup(data, "html.parser")
31 | 
32 | all_articles_text = ""
33 | 
34 | # Extract article URLs
35 | for link in tqdm(soup.select('font > a')):
36 |     article_url = base_url + link.get('href')
37 |     all_articles_text += get_article_text(article_url) + "\n\n"
38 | 
39 | 
40 | # Save all articles' text into one file
41 | with open("all_pg_essays.txt", "w") as file:
42 |     file.write(all_articles_text)
--------------------------------------------------------------------------------
/5-puzzles/tok_pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/5-puzzles/tok_pipeline.png
--------------------------------------------------------------------------------
/6-postprocessing-and-more/README.md:
--------------------------------------------------------------------------------
1 | # Table of Contents
2 | 
3 | 
4 | 
5 | - [Agenda](#agenda)
6 | - [The PostProcessor: Be careful with special tokens](#the-postprocessor-be-careful-with-special-tokens)
7 | * [What is your task?](#what-is-your-task)
8 | * [Causal language modelling](#causal-language-modelling)
9 | * [Instruction-tuning](#instruction-tuning)
10 | + [Is this code snippet correct?](#is-this-code-snippet-correct)
11 | * [What's so special about the EOS token anyway?](#whats-so-special-about-the-eos-token-anyway)
12 | * [More on special tokens](#more-on-special-tokens)
13 | - [Modifying the tokenizer](#modifying-the-tokenizer)
14 | - [Glitch Tokens](#glitch-tokens)
15 | - [Tiny Tokenizers](#tiny-tokenizers)
16 | - [Next Chapter](#next-chapter)
17 | 
18 | 
19 | 
20 | # Agenda
21 | Some miscellaneous topics that we haven't covered yet! We'll finally look at the postprocessing step in tokenization and the gotchas with special tokens. We'll also look at problematic "glitch tokens", studying the behaviour of GPT-3.5, and, if you're a developer, why you might want to shrink your tokenizer for fast iteration cycles.
22 | 
23 | # The PostProcessor: Be careful with special tokens
24 | One aspect we didn't get into until now was the postprocessor: typically, certain special tokens are added at the beginning or end (or both) of your sequence. For example, if you encode "Hello there!" with BERT, you get:
25 | 
26 | ```
27 | ['[CLS]', 'Hello', 'there', '!', '[SEP]']
28 | ```
29 | 
30 | With T5, you get:
31 | 
32 | ```
33 | ['Hello', 'there', '!', '</s>']
34 | ```
35 | Here, `</s>` is the EOS token. If you do the same with Llama, you get:
36 | 
37 | ```
38 | ['<s>', 'Hello', 'there', '!']
39 | ```
40 | 
41 | Here, `<s>` is the BOS token. Well, why are there such differences between different tokenizers, even though you're just doing the same `tokenizer.encode()` call?
42 | 
43 | ## What is your task?
44 | 
45 | The difference in default behaviour is simply because of the task each model was trained on, and of course the settings for special tokens used (T5, for example, doesn't use a BOS token).
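If you want to check these defaults yourself, every 🤗 tokenizer exposes its special-token settings. A small sketch (the T5 and Llama checkpoint names here are assumptions, and the printed maps are abbreviated and depend on the exact checkpoint):

```
from transformers import AutoTokenizer

# Compare the default special-token settings of the three tokenizers above.
for name in ["bert-base-cased", "t5-small", "meta-llama/Llama-2-7b-hf"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.special_tokens_map)
# bert-base-cased -> {'cls_token': '[CLS]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', ...}
# t5-small        -> {'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', ...}
# Llama-2-7b-hf   -> {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
```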
46 | 
47 | BERT is an encoder for which the CLS token acts as a start-of-sequence token (and for classification tasks, the hidden state produced for this token is used as the sequence embedding). The SEP token (`tok.sep_token`) serves as the separator between two sequences/sentences as well as an end-of-sequence token. For example, for predicting entailment on the [RTE dataset](https://huggingface.co/datasets/glue/viewer/rte), your input would look like `[CLS] sentA [SEP] sentB [SEP]`.
48 | 
49 | For the task of causal language modelling, you're working with autoregressive models like Llama. Fundamentally, what is it that you want? You provide an input prompt to a Llama model, and you expect your Llama to provide a good completion to your prompt by autoregressively generating next tokens. When you see this, it becomes obvious why the default behaviour should not include the EOS token: you are looking for a completion of the current sequence of text, and the last thing you want to do is add an end-of-sequence token here!
50 | 
51 | Thus, you have different behaviours depending on the pre-training task and the specific model. That said, let's take a look at what you want to be doing during _fine-tuning_. I'll only cover two popular cases: causal language modelling and instruction-tuning.
52 | 
53 | ## Causal language modelling
54 | 
55 | Here, for each element/piece of text in your dataset, you would want to add an EOS token at the very end. Note that when you chunk your data, you do NOT want to be adding EOS tokens for every output chunk!
56 | 
57 | ## Instruction-tuning
58 | With instruction-tuning for sequence-to-sequence models, the postprocessing is simple enough: encode the input prompt and the expected response individually with the tokenizer. However, with causal language models, we have an extra step: we concatenate the prompt and response together and then pass that through the model. What do we want here? Well, causal language models are trained to be blabbering machines, so we should definitely add an EOS token at the end of the response.
59 | 
60 | ### Is this code snippet correct?
61 | Here's a small snippet of code I'm writing and testing with the GPT2 tokenizer, for preprocessing a text-based classification dataset:
62 | 
63 | ```
64 | from transformers import AutoTokenizer
65 | input_prompt = "Tweet text : @HMRCcustomers No this is my first job. Label: "
66 | label = "neutral"
67 | tokenizer = AutoTokenizer.from_pretrained("gpt2")
68 | concat_seq = tokenizer(input_prompt)["input_ids"] + tokenizer(label)["input_ids"]
69 | concat_seq_decoded = tokenizer.decode(concat_seq)
70 | print(concat_seq_decoded)
71 | # Output - "Tweet text : @HMRCcustomers No this is my first job. Label: neutral"
72 | ```
73 | This code snippet seems to perform correctly for GPT2. But is this right? No! Convenient abstractions can unfortunately hide bugs in your code. If you run the same code for `meta-llama/Llama-2-7b-hf`, you get `<s> Tweet text : @HMRCcustomers No this is my first job Label : <s> neutral` - note the BOS token `<s>` appearing in the middle of the sequence. With some models, the BOS and EOS tokens are the same, so this becomes a dangerous bug then. What we should do is simply encode the strings _without postprocessing_ (use `add_special_tokens=False` with 🤗 Tokenizers) and then postprocess the string ourselves. (I came across this bug first in a 🤗 PEFT example, which led to this PR [here](https://github.com/huggingface/peft/pull/926))
74 | 
75 | ## What's so special about the EOS token anyway?
76 | What is it that makes the EOS token special?
Sometimes it almost seems like the EOS token is magical - a bad no-no word you should use with caution. The simple fact is that every special token is special because of its designation during training. Let's look at causal language models. The EOS token is special because it acts as a separator between documents. Thus, the presence of an EOS token means that the next token should likely be the start of a new document or topic (like "The"). So, if the model encounters an EOS token in the _middle_ of your text, you're fighting against the model's pre-training and asking it to output a _completion_ to the _preceding_ token, while in pre-training it is conditioned to start afresh and "forget" the preceding text. That is all, really. It's not even that performance will completely drop if you use the EOS token in weird places - it is just that performance will be a bit worse.
77 | 
78 | 
79 | ## More on special tokens
80 | Special tokens are also widely used in machine translation. Typically, the target language (or both the source and target languages) is provided as a special token to the model at the start of the input sequence. For example, in Google's seminal paper on multi-lingual Neural Machine Translation (NMT), they preprocess each input sequence by adding the token for the target language (ex: "<2es>" for Spanish) at the beginning.
81 | 
82 | **Further reading**
83 | 
84 | Google's Multilingual Neural Machine Translation System:
85 | Enabling Zero-Shot Translation: https://aclanthology.org/Q17-1024.pdf
86 | 
87 | 
88 | # Modifying the tokenizer
89 | When would you modify the tokenizer? We've already seen one way to modify a tokenizer with `tokenizer.add_tokens`. You have two options with a 🤗 tokenizer:
90 | 1. Add a fixed set of _known_ tokens via `tokenizer.add_tokens`
91 | 2. _Learn_ a _new_ set of tokens via `tokenizer.train_new_from_iterator`
92 | 
93 | Option 1 is a bit ad hoc, and you might do this when you're experimenting with a new special token. Or perhaps you're finetuning your model on a new dataset which has a fixed set of output words, and adding these words to the tokenizer would help. Note that modifying the tokenizer is only one part of the story. When you modify the tokenizer, you have to change the embedding layer of the model as well. With 🤗 Transformers, you can do
94 | 
95 | ```
96 | model.resize_token_embeddings(len(tokenizer))
97 | ```
98 | 
99 | The new embedding vectors have some default initialization. You need to now _finetune_ your model on your data to _learn_ a good embedding for these new tokens.
100 | 
101 | With Option 2, you need to have a representative training corpus. You would use this only when your corpus is _sufficiently different_ from the pre-training corpus used to train the tokenizer and model. Example scenarios are a new language, new characters, a new domain (like protein sequences) or a new style (language from a different century). For example, let's say you want to make BERT work with the Kannada language. In this case, you want the tokenizer to _learn_ a good vocabulary/list of subwords for tokenizing Kannada text. Thus, Option 1 has no use here and you'll use Option 2. A short sketch of both options in code is given below.
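Here's roughly what both options look like with 🤗 Transformers (the added tokens, the placeholder corpus and the vocab size below are made up for illustration):

```
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Option 1: add a fixed set of known tokens, then resize the model's embedding layer.
tokenizer.add_tokens(["[START_FUNC]", "[END_FUNC]"])  # hypothetical new tokens
model.resize_token_embeddings(len(tokenizer))  # new rows get a default init; finetune to learn them

# Option 2: learn a brand-new vocabulary from a representative corpus.
corpus = ["a placeholder sentence", "another placeholder sentence"]  # stand-in for a real Kannada corpus
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=32000)
new_tokenizer.save_pretrained("my-new-tokenizer")
```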
102 | 
103 | **Further reading**
104 | 
105 | Training a New Tokenizer from an Old One, The 🤗 NLP course: https://huggingface.co/learn/nlp-course/chapter6/2
106 | 
107 | # Glitch Tokens
108 | Another artifact you might have come across if you've been playing around with ChatGPT models: there are certain "glitch tokens" that make ChatGPT hallucinate in strange ways. Here are two cases that were first seen in [June 2023](https://twitter.com/goodside/status/1666598580319035392), but work even now in November 2023:
109 | 
110 | ![Alt text](image-2.png)
111 | 
112 | ![Alt text](image.png)
113 | 
114 | ![Alt text](image-1.png)
115 | 
116 | 
117 | The first image is with GPT-3.5 and the second is with GPT-4. Further, the chat summary for the first image says the following:
118 | 
119 | ![Alt text](image-3.png)
120 | 
121 | Firstly, one reason this happens is that there's a token in the GPT-4 vocabulary dedicated to " davidjl", which seems to be a part of the Reddit username davidjl123. From [Simon Willison's](https://simonwillison.net/2023/Jun/8/gpt-tokenizers/) blog:
122 | 
123 | > It looks likely that this token refers to user davidjl123 on Reddit, a keen member of the /r/counting subreddit. He's posted incremented numbers there well over 163,000 times.
124 | 
125 | You can verify this yourself by loading GPT-4's tokenizer from `tiktoken`. The interesting thing is not just that there's a weird token, but that this token gets confused with other tokens!
126 | 
127 | A detailed explanation from a [user on HackerNews](https://news.ycombinator.com/item?id=36245187):
128 | 
129 | > These glitch tokens are all near the centroid of the token embedding space. That means that the model cannot really differentiate between these tokens and the others equally near the center of the embedding space, and therefore when asked to 'repeat' them, gets the wrong one.
130 | >
131 | > That happened because the tokens were on the internet many millions of times (the davidjl user has 163,000 posts on reddit simply counting increasing numbers), yet the tokens themselves were never hard to predict (and therefore while training, the gradients became nearly zero, and the embedding vectors decayed to zero, which some optimizers will do when normalizing weights).
132 | 
133 | 
134 | This makes perfect sense, in the same way my 90+ year old grandpa looks young if you close your left eye and squint really hard with your right eye. In all seriousness, I'm not sure about the second paragraph above, so let's just ignore it. The first seems roughly right. You can probably say that the model hasn't made meaningful updates to the embedding vector for this token during the training process, and given a sequence with this token, while trying to simply repeat the token, it gets confused and outputs a different token, likely one with a similar embedding vector (is this the centroid, or just close to the initialization?). If you have a better explanation, let me know!
135 | 
136 | **Further reading**
137 | 
138 | Understanding GPT tokenizers: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
139 | 
140 | # Tiny Tokenizers
141 | 
142 | This tiny section on tiny tokenizers is from [Stas Bekman's engineering blog](https://github.com/stas00/ml-engineering). It is possibly the greatest resource on ML engineering and debugging, and, if GitHub had semantic search, all ML engineering doubts should probably go through this repo first.
143 | 
144 | The motivation for going tiny is that you'd want to shrink your neural network for faster debugging cycles.
If you want to launch a training run for Falcon-40b on two 8xA100 nodes, simply loading the model and starting training can itself take 15-20+ minutes. You don't want to find out only after all that time that a part of your code doesn't work for the Falcon model. Thus, we want to test our code, if possible locally on a laptop, with a _tiny_ version of the model. For example, you can shrink GPT2 (the smallest, ~124M-parameter version), which has 12 layers, 12 attention heads and an embedding dimension of 768, to a version with just 5 layers, 5 heads and an embedding dimension of 32 (this is what HuggingFace uses [internally](https://huggingface.co/hf-internal-testing/tiny-random-gpt2) for testing). However, the dominant component in your model weights is now the embedding layer: with GPT2, for example, you have a vocabulary size of around 50K. To get a truly tiny model, you need to shrink this as well. For example, with the above tiny configuration (5 layers, 5 heads, hidden dimension 32), you'll have a model weights file of size 6.4MB. If you shrink the vocabulary to 1000, then you get a 0.5MB file! This difference becomes even larger for bigger models with bigger vocabs.
145 | 
146 | Thus, we have to shrink the vocabulary for really tiny models so that we can iterate faster. Well, if we have to shrink the vocabulary, we have to shrink the tokenizer first. I've added one such shrinking recipe from Stas Bekman in `tokenizer_shrink.py`, and have added a few comments for clarity. It's pretty simple: you keep only the first few tokens in your vocabulary, and then handle other model-specific data (such as merges) appropriately.
147 | 
148 | One more point: the shrinking of the tokenizer is mainly about the vocabulary size. It will have very little effect on the time taken to tokenize a dataset, especially with a fast implementation (the default with 🤗 tokenizers). This is because a vocabulary lookup, roughly speaking, doesn't change much when you shrink from 50K to 1K.
149 | 
150 | **Further reading**
151 | [Faster debug and development with tiny models, tokenizers and datasets](https://github.com/stas00/ml-engineering/blob/33561a45d122e7fdb3f3bc42e21b0e4aa3815702/transformers/make-tiny-models.md), Stas Bekman's engineering blog.
152 | 
153 | # [Next Chapter](/7-galactica/)
154 | We'll now dive into the Galactica paper to understand how you can design a tokenizer.
-------------------------------------------------------------------------------- /6-postprocessing-and-more/benchmark.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | from transformers import AutoTokenizer 3 | import time 4 | from functools import partial 5 | tiny_tokenizer = AutoTokenizer.from_pretrained(".") 6 | mname = "microsoft/deberta-base" 7 | tokenizer = AutoTokenizer.from_pretrained(mname, use_fast=True) 8 | 9 | dataset = load_dataset("imdb")["train"] 10 | 11 | def get_example(ex, tokenizer): 12 | inps = tokenizer(ex["text"]) 13 | return inps 14 | 15 | start_time = time.time() 16 | func = partial(get_example, tokenizer=tokenizer) 17 | mapped = dataset.map(func, batched=True, load_from_cache_file=False) 18 | end_time = time.time() 19 | print("time: ", end_time - start_time) 20 | 21 | 22 | 23 | start_time = time.time() 24 | func = partial(get_example, tokenizer=tiny_tokenizer) 25 | mapped = dataset.map(func, batched=True, load_from_cache_file=False) 26 | end_time = time.time() 27 | print("time: ", end_time - start_time) -------------------------------------------------------------------------------- /6-postprocessing-and-more/image-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/6-postprocessing-and-more/image-1.png -------------------------------------------------------------------------------- /6-postprocessing-and-more/image-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/6-postprocessing-and-more/image-2.png -------------------------------------------------------------------------------- /6-postprocessing-and-more/image-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/6-postprocessing-and-more/image-3.png -------------------------------------------------------------------------------- /6-postprocessing-and-more/image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/6-postprocessing-and-more/image.png -------------------------------------------------------------------------------- /6-postprocessing-and-more/tokenizer_shrink.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tokenizer shrinking recipe from Stas Bekman and Anthony Moi. 3 | Reference: https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564 4 | For more details, please see: https://github.com/stas00/ml-engineering/ 5 | """ 6 | 7 | import json 8 | from transformers import AutoTokenizer 9 | from tokenizers import Tokenizer 10 | 11 | vocab_keep_items = 5000 12 | mname = "microsoft/deberta-base" 13 | 14 | 15 | tokenizer = AutoTokenizer.from_pretrained(mname, use_fast=True) 16 | assert tokenizer.is_fast, "This only works for fast tokenizers." 
17 | tokenizer_json = json.loads(tokenizer._tokenizer.to_str()) 18 | vocab = tokenizer_json["model"]["vocab"] 19 | # Iterate over the vocabulary for different models and keep only the first `vocab_keep_items` tokens 20 | if tokenizer_json["model"]["type"] == "BPE": 21 | new_vocab = { token: i for token, i in vocab.items() if i < vocab_keep_items } 22 | merges = tokenizer_json["model"]["merges"] 23 | new_merges = [] 24 | # handle merges: keep only the merge rules for which the (merging) pair of tokens and the merged token are in the new vocab 25 | for i in range(len(merges)): 26 | a, b = merges[i].split() 27 | new_token = "".join((a, b)) 28 | if a in new_vocab and b in new_vocab and new_token in new_vocab: 29 | new_merges.append(merges[i]) 30 | tokenizer_json["model"]["merges"] = new_merges 31 | elif tokenizer_json["model"]["type"] == "Unigram": 32 | new_vocab = vocab[:vocab_keep_items] 33 | elif tokenizer_json["model"]["type"] == "WordPiece" or tokenizer_json["model"]["type"] == "WordLevel": 34 | new_vocab = { token: i for token, i in vocab.items() if i < vocab_keep_items } 35 | else: 36 | raise ValueError(f"don't know how to handle {tokenizer_json['model']['type']}") 37 | 38 | # a hack for GPT2, since a special token is at the END of the vocab - most tokenizers have it at the beginning 39 | if "gpt2" in mname: 40 | new_vocab = { token: i for token, i in vocab.items() if i < vocab_keep_items-1 } 41 | new_vocab["<|endoftext|>"] = vocab_keep_items-1 42 | else: 43 | new_vocab = { token: i for token, i in vocab.items() if i < vocab_keep_items } 44 | 45 | tokenizer_json["model"]["vocab"] = new_vocab 46 | tokenizer._tokenizer = Tokenizer.from_str(json.dumps(tokenizer_json)) 47 | save_name = mname.split("/")[-1] + "_tiny" 48 | tokenizer.save_pretrained(save_name) -------------------------------------------------------------------------------- /7-galactica/README.md: -------------------------------------------------------------------------------- 1 | # Table of Contents 2 | 3 | 4 | 5 | - [Tokenizer Design: Diving Deep Into Galactica](#tokenizer-design-diving-deep-into-galactica) 6 | - [The data](#the-data) 7 | * [Prompt pre-training](#prompt-pre-training) 8 | - [Tokens and input format in Galactica](#tokens-and-input-format-in-galactica) 9 | * [Generating References](#generating-references) 10 | * [Working memory](#working-memory) 11 | * [So Many Special Tokens](#so-many-special-tokens) 12 | * [Tokenizer training](#tokenizer-training) 13 | * [Takeways on tokenizer design](#takeways-on-tokenizer-design) 14 | - [Model Training and Results](#model-training-and-results) 15 | * [Results](#results) 16 | - [Further reading](#further-reading) 17 | 18 | 19 | 20 | # Tokenizer Design: Diving Deep Into Galactica 21 | This section is on tokenizer *design*: if you're looking to train a tokenizer from scratch for your use-case, *how* might one go about this? Well, first of all, let's take a step back and ask, *when* should you train a new tokenizer? The answer is straightforward (and perhaps underwhelming): when your dataset is different from the training corpus of the pretrained model, and you wish to *pretrain* a new model. It's hard to imagine companies pretraining a Llama-like model with trillions of tokens. However, the trend is that compute is going to become cheaper, the GPU hardware market will become more leveled, the amount of data and really use-case specific data each company has will only increase. 
Thus, one *can* make the case that the current budget for pretraining would be within hand for a bunch of companies in the near future. With the research community also squeezing out better and better performance with smaller models and datasets, some companies pretraining small models on private data(especially multimodal data) is not a crazy future. Anyways, back to tokenizer design: This section will dive deep into the Galactica model, breaking down the use case, dataset, model, etc (this is important context) with some focus on tokens. 22 | 23 | The Galactica model from [Taylor et al](https://arxiv.org/abs/2211.09085) is one of the most important models to have come about in recent times. It is an excellent study in training a domain-specific language model. The authors have thought deeply about every aspect of the training process, with special attention to tokenization and input representation. In summary, Galactica is a language model for science, and is "trained on a large and curated corpus of humanity’s scientific knowledge". The model was trained to serve as an powerful copilot for scientists, helping them write equations, find references based on contributions ("Find the E=mc^2 paper"), translate natural language text to code and vice versa,etc. 24 | 25 | # The data 26 | An important feature of Galactica is that it was trained on a large scientific corpus with various scientific modalities such as the chemical structure of molecules, amino acid sequences of proteins, etc. The exact split up for the various datasets from the paper is below: 27 | 28 | ![Alt text](corpus.png) 29 | 30 | In summary, the dataset consists of 106 billion tokens comprising of scientific literature (sources: arXiv, PubMed Abstracts, bioRxiv, ChemRxiv, etc), code (sources: GitHub repositories linked in PapersWithCode), reference material (Wikipedia, StackExchange, Papers With Code, etc), knowledge bases (sources: PubChem Compound, UniProt, RefSeq Genome, etc), filtered CommonCrawl (from scientific and academic domains), prompt datasets (basically a bunch of curated NLP datasets like CommonSenseQA, BoolQ, OpenBookQA,etc in a prompt format). 31 | 32 | The different kinds of knowledge/ modalities used are: 33 | 1. Text 34 | 2. LaTeX 35 | 3. Code 36 | 4. SMILES ([Simplified molecular input line entry system (SMILES)](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) notation for writing molecule structure in text) 37 | 5. Amino Acid Sequence 38 | 6. DNA Sequence 39 | 40 | 41 | The first three are pretty common in most pretraining datasets. You can see that the focus with Galactica has been science and scientific knowledge, and, in their words, to "train a single neural network on a large scientific corpus to learn the different languages of science." 42 | 43 | 44 | To appreciate this better, here are some of their [official examples](https://github.com/paperswithcode/galai) for: 45 | 1. 
Generating Molecules: 46 | ``` 47 | galactica.generate("[START_I_SMILES]", max_length=200) 48 | # Output: [START_I_SMILES]CCC1=CC=C(C=C1)C(=O)NC2=CC=CC(=C2)C(=O)NC3=CC=C(C=C3)S(=O)(=O)N[END_I_SMILES]\n\n### Molecular Formula\n\nC22H21N3O4S\n\n## Chemical and Physical Properties\n\nThe following are chemical properties for 3-[[3-(4-ethylphenyl)-3-oxo-propanoyl]amino]-N-(4-sulfamoylphenyl)benzamide.\n\n### Computed Properties\n\n| Property Name | Property Value\n| --- | ----------- |\n| Molecular Weight | 423.5\n| XLogP3-AA Log P | 3.2\n| Hydrogen Bond Donor Count | 3\n| Hydrogen Bond Acceptor Count 49 | ``` 50 | 2. Generating Annotations for Protein Structures: 51 | ``` 52 | galactica.generate("[START_AMINO]GHMQSITAGQKVISKHKNGRFYQCEVVRLTTETFYEVNFDDGSFSDNLYPEDIVSQDCLQFGPPAEGEVVQVRWTDGQVYGAKFVASHPIQMYQVEFEDGSQLVVKRDDVYTLDEELP[END_AMINO] ## Keywords", max_length=200) 53 | # [START_AMINO]GHMQSITAGQKVISKHKNGRFYQCEVVRLTTETFYEVNFDDGSFSDNLYPEDIVSQDCLQFGPPAEGEVVQVRWTDGQVYGAKFVASHPIQMYQVEFEDGSQLVVKRDDVYTLDEELP[END_AMINO] ## Keywords\n\nCytoplasm, Methyltransferase, rRNA processing, S-adenosyl-L-methionine, Transferase\n\n## References\n\nQuestion: What are some articles for Ribosomal RNA small subunit methyltransferase H?\n\nAnswer: \n\n[START_REF] Comparative Genomics of 28 Salmonella enterica Isolates: Evidence for CRISPR-Mediated Adaptive Sublineage Evolution, Fricke[END_REF]\n\n 54 | ``` 55 | 56 | ## Prompt pre-training 57 | Galactica has a pre-training corpus which is a mixture of pre-training datasets (the documents we saw earlier), and modern NLP datasets in an instruction/prompt format (ex: like in FLAN). THe set of prompt datasets used, from the paper: 58 | 59 | ![Alt text](image-4.png) 60 | 61 | Some examples for datasets used are BoolQ, CommonsenseQA, SciTail, Thermosol (chemical property prediction), etc. 62 | 63 | Mixing these datasets in prompt-formats along with the pre-training corpus with just documents on different modalities is what Galactica calls "prompt pre-training". 64 | 65 | 66 | 67 | # Tokens and input format in Galactica 68 | Tokenization design in Galactica is really at the heart of the paper. The key motivation is that you're dealing with these giant blocks of text with all the different modalities mixed in, and there are different kinds of tokenization appropriate for each. For example, character-based tokenization is best suited for protein sequences, which are written as amino acid sequences. Different modalities are also wrapped in their own special tokens, which aids in learning and gives different demarcated abilities for the model, which we'll get to later. Here are the full list of tokenization steps from the paper (it's hard to summarize this better): 69 | 70 | > 1. Citations: we wrap citations with special reference tokens [START_REF] and [END_REF]. 71 | > 2. Step-by-Step Reasoning: we wrap step-by-step reasoning with a working memory token , 72 | mimicking an internal working memory context. 73 | > 3. Mathematics: for mathematical content, with or without LaTeX, we split ASCII operations into individual characters. Parentheses are treated like digits. The rest of the operations allow for unsplit repetitions. Operation characters are !"#$%&’*+,-./:;<=>?\^_‘| and parentheses are ()[]{}. 74 | > 4. Numbers: we split digits into individual tokens. For example 737612.62 -> 7,3,7,6,1,2,.,6,2. 75 | > 5. SMILES formula: we wrap sequences with [START_SMILES] and [END_SMILES]and apply character-based tokenization. 
Similarly we use [START_I_SMILES] and [END_I_SMILES] where isomeric SMILES is denoted. For example, C(C(=O)O)N → C,(,C,(,=,O,),O,),N. 76 | > 6. Amino acid sequences: we wrap sequences with [START_AMINO] and [END_AMINO] and apply character-based tokenization, treating each amino acid character as a single token. For example, MIRLGAPQTL -> M,I,R,L,G,A,P,Q,T,L. 77 | > 7. DNA sequences: we also apply a character-based tokenization, treating each nucleotide base as a token, where the start tokens are [START_DNA] and [END_DNA]. For example, CGGTACCCTC -> C, G, G, T, A, C, C, C, T, C. 78 | 79 | 80 | One example for the processed text with a protein sequence, from the paper: 81 | 82 | ![Text](image.png) 83 | 84 | *Example for an annotated protein sequence with accompanying text* 85 | 86 | 87 | ![Alt text](image-1.png) 88 | 89 | *Example for an annotated block of text from scientific literature with citations* 90 | 91 | 92 | As mentioned above, having custom tokens defining boundaries and demarcating different modes of data (citation, working memory, etc) also gives your model new abilities in generation. 93 | 94 | ## Generating References 95 | 96 | Sometimes you might want to just look up a reference, in which case, you can use the special start and end tokens [START_REF] and [END_REF] to reliably get just the text you want (instead of ad-hoc prompt engineering; or waste tokens with few shot examples in every API call). The [official](https://github.com/paperswithcode/galai) repository implements this with a `generate_reference` method: 97 | 98 | > model.generate_reference("The paper introducing the formula for the $n$-th digit of $\\pi$ in base $16$") 99 | 100 | Going over the [implementation](https://github.com/paperswithcode/galai/blob/3a724f562af1a0c8ff97a096c5fbebe579e2160f/galai/model.py#L331) for `generate_reference`, it's pretty simple under-the-hood: 101 | 1. Add [START_REF] token at the end of the input text if not already present 102 | 2. Generate new tokens until you hit a [END_REF] token. 103 | 104 | ## Working memory 105 | Working memory is a special designation for a scratchpad for the model to write down intermediate steps and function calls. Note that the paper came out pre-ChatGPT, and is perhaps the first clear demonstration of combining chain of thought and function calling (like writing a python snippet to calculate the final answer) 106 | 107 | 108 | Clear usage of token can separate chain-of-thought mode and the direct answer mode. From the [official examples](https://github.com/paperswithcode/galai/blob/3a724f562af1a0c8ff97a096c5fbebe579e2160f/notebooks/Introduction%20to%20Galactica%20Models.ipynb): 109 | ![Alt text](image-2.png) 110 | 111 | ![Alt text](image-3.png) 112 | 113 | The authors also say this about the usage of working memory and, the difference from chain-of-thought: 114 | 115 | > There are two limitations with chain-of-thought. First, it relies on prompt discovery to find a prompt that elicits robust step-by-step reasoning; Not only does this require finding a robust prompt that works in all cases, but it also often relies on few-shot examples which take up context space........Secondly, chain-of-thought prompting uses the neural network to perform tasks that it is arguably not best suited to doing; for example, arithmetic......Given that classical computers are specialized for tasks like arithmetic, one strategy is to offload these tasks from the neural network to external modules 116 | 117 | 118 | There you have it! 
The use of and <\work\> tokens makes CoT-like behaviour more controllable than relying on prompt engineering to switch between modes (or passing few shot examples every API call). Further, you can combine this with function calling by writing code that can be executed on a computer. Note that this came about pre-ChatGPT, so it is a very unique contribution from the paper. The advantages/disadvantages, in the post-ChatGPT era (where GPT-4 does everything without explicit special tokens) hasn't been studied well yet. One thing is clear though - ChatGPT likely has a similar "working" memory type training, with special tokens to indicate when to switch over to code (for the Code Interpreter model), and do function calling with plugins. 119 | 120 | ## So Many Special Tokens 121 | A summary of all the special tokens used, [taken directly from Galactica's example notebook](https://github.com/paperswithcode/galai/blob/3a724f562af1a0c8ff97a096c5fbebe579e2160f/notebooks/Introduction%20to%20Galactica%20Models.ipynb): 122 | 123 | >`` - reserved. 124 | > 125 | > `` - reserved. 126 | > 127 | > `` - end-of-document token used to split documents during trainig. Prepending this token to prompt (see `new_doc` parameter in `Model.generate`) biases a model into generating a new document. 128 | > 129 | > `` - a standard padding token to align sequences in a batch. 130 | > 131 | > `[START_REF]` and `[END_REF]` - markers denoting a reference to a paper. Each paper is represented as `Title, First author name`. F.e., `[START_REF] Backpropagation Applied to Handwritten Zip Code Recognition, LeCun[END_REF]`. 132 | > 133 | > `[IMAGE]` - a placeholder for an image removed from a text. 134 | > 135 | > `` and `` - markers denoting fragments in FragmentedGlass dataset. 136 | > 137 | > `` and `` - markers denoting step-by-step reasoning (see Step-by-Step Reasoning Section). 138 | > 139 | > `[START_SUP]`, `[END_SUP]`, `[START_SUB]` and `[END_SUB]` - markers used to protect superscript and subscript digits from NFKC normaliziation. Our tokenizer uses the standard NFKC rules, which means that `x²⁵` would be tokenized in the same way as `x25`. To prevent this, we encode `x²⁵` as `x[START_SUP]25[END_SUP]`. 140 | > 141 | > `[START_DNA]`, `[END_DNA]`, `[START_AMINO]`, `[END_AMINO]`, `[START_SMILES]`, `[END_SMILES]`, `[START_I_SMILES]` and `[END_I_SMILES]` - markers denoting special sequences, respectively: nucleic acids sequences, amino acids sequeqnces, canonical simplified molecular-input line-entry system (SMILES) strings and isometric SMILES strings. Besides marking a sequence of a given type, these tokens force a special tokenization mode in which each character is represented as a single token. F.e., `GATTACA` is tokenized as `G|ATT|ACA`, while `[START_DNA]GATTACA[END_DNA]` is tokenized as `[START_DNA]|G|A|T|T|A|C|A|[END_DNA]`. Note that for this to work you need to transform your prompt with `galai.utils.escape_custom_split_sequence`. All standard text generation functions of `galai.model.Model` do this automatically. 142 | 143 | 144 | Now that's a lot of special tokens! The exact way in which you should preprocess your text which has different modalities so that they are tokenized appropriately, is an implementation detail we won't consider here. This kind of preprocessing is more or less the standard now when dealing with code, function calling, etc. 
The key difference of course, would be that with Galactica, you can so things like (1) `model.generate_with_steps` and (2) `model.generate` such that (1) provides a long answer with steps, possibly code and (2) provides a direct answer. Having a "shut up and give me the final answer" mode baked in, using special tokens would have been a nice ChatGPT feature. 145 | 146 | > _Task_ : Go to the HuggingFace repo for Galactica and have a look at the tokenizer config files to see where these special tokens go. 147 | 148 | > _Question_: How do you think you can make a simple change to the BPE algorithm in [chapter-2](../2-bpe/) so that integers always undergo character-level tokenization? Think about making a change in the step where we select the best pair of symbols. 149 | 150 | ## Tokenizer training 151 | The Tokenizer was trained on 2% of the training data, randomly selected. They trained a BPE tokenizer with a vocabulary of size 50K. Note that for all the other special modalities used, the authors have chosen a character-level tokenization (numbers, chemical structure, protein sequences, etc). The fundamental imbalance wrt English text/code vs scientific modalities doesn't matter much in terms of the compression you get, because you're 152 | not really compressing other modalities. (recall our discussion about low-resource languages from [chapter-4](../4-tokenization-is-hard/)) 153 | 154 | ## Takeways on tokenizer design 155 | Something to think about while designing your tokenizer. From Galactica,the important lesson, in my opinion, which has become the norm really, is that you can separate out different data modalities with their own special tokens to switch between different modes of generation. Note that training a tokenizer from scratch is not always needed. For example, if you want to have some custom function calling behaviour with Llama 2 (as far as I know, no such training was performed for the base model), then you don't have to train the tokenizer from scratch. Your function call will be represented in code-like syntax, which Llama has been trained on, and what you would need to do is add special tokens, like `[START_FUNC]` and `[END_FUNC]`, resize token embeddings for the model, and then finetune on some function calling data you have. 156 | 157 | Also, these special tokens don't _solve_ our issues with generating data according to our schema i.e we'll still have problems with schema validation. You can use a `[START_DNA]` token in your prompt to _better condition_ the model to generate a DNA-like sequence, but you might still end up with an _invalid_ DNA sequence. With more complicated schemas like JSONs with multiple fields/nesting, you can imagine the problems an LLM might have. Using special tokens helps, and should give better performance, but you'll need to do schema validation. 158 | 159 | # Model Training and Results 160 | This section will deviate slightly from tokenization, just to provide some context for other aspects in Galactica. 161 | 162 | They trained a whole family of models, from size 125M to 120B parameters. The models were trained with a 2048 context length, and they chose to use no biases like most modern LLMs (if you understand the precise reason/study for this, I'd love to know). Optimization-wise they used AdamW with weight decay and gradient clipping with 1.0 as the max global norm (I don't know where this magical 1.0 comes from again). 163 | 164 | They also trained the models for multiple epochs, specifically 4.25 epochs. 
This "4" epoch number also comes up in a later study on repeating tokens, [Scaling Data-Constrained Langauge Models](https://arxiv.org/abs/2305.16264), co-authored by the usual suspects at HuggingFace. 165 | 166 | ## Results 167 | Some interesting points from the results, focusing on input format: 168 | - Zero-shot + Working memory (``) Galactica-120B is better than 5-few shot prompting for OPT (175B), BLOOM (176B) and Gopher (280B). Not sure why they didn't compare with PaLM-540B here. 169 | - Zero-shot + working memory Galactica-120B is better than few-shot CoT prompting PaLM-540B. They also compared Galactica-120B with 5-shot CoT vs token, but this seems to be a prompt template setting used only in evaluation. That is, if the pre-trained model for both settings was trained with the token, then this comparsion doesn't seem fair. 170 | 171 | There are a number of great improvements over multiple tasks like reasoning, question answering, and the core idea, which is performance on scientific modalities, which you can find in the paper. 172 | 173 | # Further reading 174 | Galactica: A Large Language Model for Science: https://arxiv.org/abs/2211.09085 175 | -------------------------------------------------------------------------------- /7-galactica/corpus.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/7-galactica/corpus.png -------------------------------------------------------------------------------- /7-galactica/image-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/7-galactica/image-1.png -------------------------------------------------------------------------------- /7-galactica/image-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/7-galactica/image-2.png -------------------------------------------------------------------------------- /7-galactica/image-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/7-galactica/image-3.png -------------------------------------------------------------------------------- /7-galactica/image-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/7-galactica/image-4.png -------------------------------------------------------------------------------- /7-galactica/image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SumanthRH/tokenization/705cfa4490130f5711788ce608adc87428f04f19/7-galactica/image.png -------------------------------------------------------------------------------- /8-chat-templates/README.md: -------------------------------------------------------------------------------- 1 | # Chat Templates 2 | 3 | This is a short section on chat templates and some tokenization gotchas you should be aware of while fine-tuning. Let's focus on two of the most popular open-source models: Mistral-8B and Llama-3. 
Chat templating is simple: Given a conversation between the user and the assistant i.e a list of messages , you want to apply a template to convert this into plain text input for the language model. Each "template" consists of special indicators/tags for different roles along with your regular BOS and EOS tokens. 4 | 5 | Let's consider the example chat template for `mistralai/Mistral-7B-Instruct-v0.1` (from the[ 🤗 docs](https://huggingface.co/docs/transformers/main/en/chat_templating)): 6 | 7 | ``` 8 | from transformers import AutoTokenizer 9 | tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1") 10 | 11 | chat = [ 12 | {"role": "user", "content": "Hello, how are you?"}, 13 | {"role": "assistant", "content": "I'm doing great. How can I help you today?"}, 14 | {"role": "user", "content": "I'd like to show off how chat templating works!"}, 15 | ] 16 | 17 | print(tokenizer.apply_chat_template(chat, tokenize=False)) 18 | # [INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST] 19 | ``` 20 | 21 | The output is as follows: 22 | 23 | <s>[INST] Hello, how are you? [/INST]</s> I'm doing great. How can I help you today? <s>[INST] I'd like to show off how chat templating works! [/INST]</s> 24 | 25 | 26 | I've colored characters added by the template in light blue. Coming back to tokenization, the central issue is as follows: 27 | - Typically, we want to have different `labels` for different roles in the message. While fine-tuning a chat model, you'd want to ignore all the messages from the user/system and compute the loss for the model only on the assistant messages. Specifically, for building the `labels` sequence for the given input ids, you want to use the label ignore token -100 for user/system tokens while copying assistant tokens, so that Pytorch's cross entropy loss function will ignore user/system tokens. Unfortunately, `tokenizer.apply_chat_template` is not enough (as of June 2024) since the default behaviour just tokenizes the input after application of the chat template, which means you lose role information. 28 | - This means that you have to apply the chat template youself. You'll have to format the input conversation messsage by message and compute the `labels` entry for each message in parallel. 29 | 30 | 31 | ## Applying the chat template message by message 32 | The issue with doing this message by message formatting yourself is that there are many places where you can make a mistake. But first, we need to go over two simple properties of tokenization. 33 | 34 | **Tokenization is not invertible**: This is especially important to understand while trying to design tests and comparing your implementation to a reference. The obvious issue is normalization and pre-tokenization steps that can lead to loss of information (whitespaces, accents, etc). 
Thus, when talking about "correctness", the equation to have in mind is something like this: 35 | ``` 36 | my_token_ids == tokenizer.encode(reference_text, add_special_tokens=False) 37 | ``` 38 | 39 | instead of something like this: 40 | 41 | ``` 42 | tokenizer.decode(my_token_ids) == reference_text 43 | ``` 44 | 45 | Fun fact: For Mistral (and Llama-2), there's an interesting bug where even without loss of information in normalization/pre-tokenization, the second equality doesn't hold: 46 | 47 | ``` 48 | tokenizer.decode(tokenizer.encode("[INST]", add_special_tokens=False)) == " [INST]" 49 | ``` 50 | Note the extra space added at the beginning of the decoded text (happens with `legacy=True` as well for Llama-2, btw). 51 | 52 | **Tokenizing message by message and then concatenating is not the same as tokenizing the concatenated messages**: This follows from a more general rule: tokenizing segments of text and then concatenating them is not the same as tokenizing the combined text. A simple example of the general rule: 53 | ``` 54 | tokenizer.tokenize("man") + tokenizer.tokenize("go") != tokenizer.tokenize("mango") 55 | ``` 56 | 57 | The first sequence will be `["_man", "_go"]` while the second yields `["_m", "ango"]` (using the same Mistral tokenizer). Here, when the two segments are joined, characters at the border combine to yield different tokens. Coming back to the case with messages, while a case like the above is rare, you will still notice the following with Llama-2 and Mistral: 58 | 59 | `tokenizer.tokenize("[INST]") + tokenizer.tokenize("hi") == tokenizer.tokenize("[INST] hi")` (Notice the extra space in the middle) 60 | 61 | When you combine the tokenized sequences, an additional space effectively gets added in the text space here. This means that your code for chat templating (when you do it message by message) will be wrong if you're not careful! 62 | 63 | 64 | 65 | 66 | 67 | 68 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Sumanth R Hegde 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # Everything About Tokenization 4 | 5 | Tokenization is an oft-neglected part of natural language processing.
With the recent blow-up of interest in language models, it might be good to step back and really get into the guts of what tokenization is. This repo is meant to serve as a deep dive into different aspects of tokenization. It's been organized as bite-size chapters for easy navigation, with some code samples and (poorly designed) walkthrough notebooks. This is NOT meant to be a complete reference in itself, and is meant to accompany other excellent resources like [HuggingFace's NLP course](https://huggingface.co/learn/nlp-course/chapter6/1). The following topics are covered: 6 | 7 | 1. [Intro](/1-intro/): A quick introduction on tokens and the different tokenization algorithms out there. 8 | 2. [BPE](/2-bpe/): A closer look at the Byte-Pair Encoding tokenization algorithm. We'll also go over a minimal implementation for training a BPE model. 9 | 3. [🤗 Tokenizer](/3-hf-tokenizer/): The internals of HuggingFace tokenizers! We look at state (what's saved by a tokenizer), data structures (how does it store what it saves), and methods (what functionality do you get). We also implement a minimal <200 line version of the 🤗 Tokenizer in Python for GPT2. 10 | 4. [Challenges with Tokenization](/4-tokenization-is-hard/): Challenges with integer tokenization, tokenization for non-English languages and going multilingual, with a focus on the recent No Language Left Behind (NLLB) effort from Meta. 11 | 5. [Puzzles](/5-puzzles/): Some simple puzzles to get you thinking about pre-tokenization, vocabulary size, etc. 12 | 6. [PostProcessing and more](/6-postprocessing-and-more/): A look at special tokens and postprocessing, glitch tokens and why you might want to shrink your tokenizer. 13 | 7. [Galactica](/7-galactica/): Thinking about tokenizer design by diving into the Galactica paper. 14 | 8. [Chat templates](/8-chat-templates/): Some tokenization tips and tricks while dealing with chat-templating for chat models. 15 | 16 | ## Requirements 17 | To run the notebooks in the repo, you only need two libraries: `transformers` and `tiktoken`: 18 | 19 | ``` 20 | pip install transformers tiktoken 21 | ``` 22 | 23 | Code has been tested with `transformers==4.35.0` and `tiktoken==0.5.1`. 24 | 25 | ## Recommended Prerequisites 26 | A basic understanding of language models and tokenization is a must: 27 | - [A Hackers' Guide to Language Models](https://youtu.be/jkrNMKz9pWU?si=y06_GUgoaG8_ASyd) by Prof. Jeremy Howard. 28 | - [What makes LLM tokenizers different from each other?](https://youtu.be/rT6wVLEDC_w?si=v58zCYEIf0pheaEo) by Jay Alammar. 29 | - [ChatGPT has Never Seen a SINGLE Word (Despite Reading Most of The Internet). Meet LLM Tokenizers.](https://youtu.be/uSinkCeUg9U?si=P25RHVkMKlm-Qtd6) by Jay Alammar. 30 | - [Optional] [Chapter on tokenizers from The 🤗 NLP Course](https://huggingface.co/learn/nlp-course/chapter6/1) 31 | 32 | ## Contributing 33 | If you notice any mistake/bug, or feel you could make an improvement to any section of the repo, please open an issue or make a PR 🙏 34 | --------------------------------------------------------------------------------