├── .gitattributes
├── 1-intro
│   ├── README.md
│   └── tok_pipeline.png
├── 2-bpe
│   ├── README.md
│   ├── ex_corpus.txt
│   ├── openai_bpe_viz.py
│   ├── orig_bpe.py
│   └── walkthrough.ipynb
├── 3-hf-tokenizer
│   ├── README.md
│   ├── bpe.py
│   ├── hf_slow.png
│   ├── minimal_hf_tok.py
│   ├── save_hf.py
│   ├── vocab.json
│   └── walkthrough.ipynb
├── 4-tokenization-is-hard
│   ├── README.md
│   └── toxicity_detection_nllb.png
├── 5-puzzles
│   ├── README.md
│   ├── get_stack_subset.py
│   ├── get_token_counts.py
│   ├── paul_graham_essay_scraper.py
│   └── tok_pipeline.png
├── 6-postprocessing-and-more
│   ├── README.md
│   ├── benchmark.py
│   ├── image-1.png
│   ├── image-2.png
│   ├── image-3.png
│   ├── image.png
│   └── tokenizer_shrink.py
├── 7-galactica
│   ├── README.md
│   ├── corpus.png
│   ├── image-1.png
│   ├── image-2.png
│   ├── image-3.png
│   ├── image-4.png
│   └── image.png
├── 8-chat-templates
│   └── README.md
├── LICENSE
└── README.md

/.gitattributes:
--------------------------------------------------------------------------------
*.ipynb linguist-vendored=true
--------------------------------------------------------------------------------
/1-intro/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/1-intro/README.md
--------------------------------------------------------------------------------
/1-intro/tok_pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/1-intro/tok_pipeline.png
--------------------------------------------------------------------------------
/2-bpe/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/2-bpe/README.md
--------------------------------------------------------------------------------
/2-bpe/ex_corpus.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/2-bpe/ex_corpus.txt
--------------------------------------------------------------------------------
/2-bpe/openai_bpe_viz.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/2-bpe/openai_bpe_viz.py
--------------------------------------------------------------------------------
/2-bpe/orig_bpe.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/2-bpe/orig_bpe.py
--------------------------------------------------------------------------------
/2-bpe/walkthrough.ipynb:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/2-bpe/walkthrough.ipynb
--------------------------------------------------------------------------------
/3-hf-tokenizer/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/3-hf-tokenizer/README.md
--------------------------------------------------------------------------------
/3-hf-tokenizer/bpe.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/3-hf-tokenizer/bpe.py
--------------------------------------------------------------------------------
/3-hf-tokenizer/hf_slow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/3-hf-tokenizer/hf_slow.png
--------------------------------------------------------------------------------
/3-hf-tokenizer/minimal_hf_tok.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/3-hf-tokenizer/minimal_hf_tok.py
--------------------------------------------------------------------------------
/3-hf-tokenizer/save_hf.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/3-hf-tokenizer/save_hf.py
--------------------------------------------------------------------------------
/3-hf-tokenizer/vocab.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/3-hf-tokenizer/vocab.json
--------------------------------------------------------------------------------
/3-hf-tokenizer/walkthrough.ipynb:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/3-hf-tokenizer/walkthrough.ipynb
--------------------------------------------------------------------------------
/4-tokenization-is-hard/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/4-tokenization-is-hard/README.md
--------------------------------------------------------------------------------
/4-tokenization-is-hard/toxicity_detection_nllb.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/4-tokenization-is-hard/toxicity_detection_nllb.png
--------------------------------------------------------------------------------
/5-puzzles/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/5-puzzles/README.md
--------------------------------------------------------------------------------
/5-puzzles/get_stack_subset.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/5-puzzles/get_stack_subset.py
--------------------------------------------------------------------------------
/5-puzzles/get_token_counts.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/5-puzzles/get_token_counts.py
--------------------------------------------------------------------------------
/5-puzzles/paul_graham_essay_scraper.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/5-puzzles/paul_graham_essay_scraper.py
--------------------------------------------------------------------------------
/5-puzzles/tok_pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/5-puzzles/tok_pipeline.png
--------------------------------------------------------------------------------
/6-postprocessing-and-more/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/6-postprocessing-and-more/README.md
--------------------------------------------------------------------------------
/6-postprocessing-and-more/benchmark.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/6-postprocessing-and-more/benchmark.py
--------------------------------------------------------------------------------
/6-postprocessing-and-more/image-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/6-postprocessing-and-more/image-1.png
--------------------------------------------------------------------------------
/6-postprocessing-and-more/image-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/6-postprocessing-and-more/image-2.png
--------------------------------------------------------------------------------
/6-postprocessing-and-more/image-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/6-postprocessing-and-more/image-3.png
--------------------------------------------------------------------------------
/6-postprocessing-and-more/image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/6-postprocessing-and-more/image.png
--------------------------------------------------------------------------------
/6-postprocessing-and-more/tokenizer_shrink.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/6-postprocessing-and-more/tokenizer_shrink.py
--------------------------------------------------------------------------------
/7-galactica/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/7-galactica/README.md
--------------------------------------------------------------------------------
/7-galactica/corpus.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/7-galactica/corpus.png
--------------------------------------------------------------------------------
/7-galactica/image-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/7-galactica/image-1.png
--------------------------------------------------------------------------------
/7-galactica/image-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/7-galactica/image-2.png
--------------------------------------------------------------------------------
/7-galactica/image-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/7-galactica/image-3.png
--------------------------------------------------------------------------------
/7-galactica/image-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/7-galactica/image-4.png
--------------------------------------------------------------------------------
/7-galactica/image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/7-galactica/image.png
--------------------------------------------------------------------------------
/8-chat-templates/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/8-chat-templates/README.md
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/LICENSE
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SumanthRH/tokenization/HEAD/README.md
--------------------------------------------------------------------------------