├── ch04 ├── 02_performance-analysis │ ├── requirements-extra.txt │ ├── README.md │ └── flops-analysis.ipynb ├── README.md └── 01_main-chapter-code │ ├── README.md │ ├── tests.py │ └── previous_chapters.py ├── ch07 ├── 03_model-evaluation │ ├── requirements-extra.txt │ ├── config.json │ ├── scores │ │ ├── llama3-8b-model-2-response.json │ │ ├── gpt4-model-2-response.json │ │ ├── llama3-8b-model-1-response.json │ │ └── gpt4-model-1-response.json │ └── README.md ├── 02_dataset-utilities │ ├── requirements-extra.txt │ ├── config.json │ ├── README.md │ └── find-near-duplicates.py ├── 05_dataset-generation │ └── README.md ├── 04_preference-tuning-with-dpo │ └── README.md ├── 01_main-chapter-code │ ├── tests.py │ ├── README.md │ ├── ollama_evaluate.py │ ├── gpt_download.py │ └── load-finetuned-model.ipynb └── README.md ├── ch02 ├── 02_bonus_bytepair-encoder │ ├── requirements-extra.txt │ ├── README.md │ └── bpe_openai_gpt2.py ├── 04_bonus_dataloader-intuition │ └── README.md ├── 03_bonus_embedding-vs-matmul │ └── README.md ├── 01_main-chapter-code │ ├── README.md │ └── dataloader.ipynb └── README.md ├── ch06 ├── 03_bonus_imdb-classification │ ├── requirements-extra.txt │ ├── train_sklearn_logreg.py │ ├── download_prepare_dataset.py │ ├── README.md │ └── gpt_download.py ├── README.md ├── 01_main-chapter-code │ ├── tests.py │ ├── README.md │ ├── exercise-solutions.ipynb │ ├── gpt_download.py │ └── load-finetuned-model.ipynb └── 02_bonus_additional-experiments │ └── gpt_download.py ├── appendix-E ├── README.md └── 01_main-chapter-code │ └── gpt_download.py ├── appendix-D └── README.md ├── appendix-A ├── 02_setup-recommendations │ └── README.md └── 01_main-chapter-code │ ├── exercise-solutions.ipynb │ └── DDP-script.py ├── setup ├── 03_optional-docker-environment │ ├── .devcontainer │ │ ├── README.md │ │ ├── Dockerfile │ │ └── devcontainer.json │ └── README.md ├── .vscode │ └── extensions.json ├── 02_installing-python-libraries │ ├── tests.py │ ├── python_environment_check.ipynb │ ├── README.md │ └── python_environment_check.py ├── 01_optional-python-setup-preferences │ └── README.md └── README.md ├── .github ├── ISSUE_TEMPLATE │ ├── ask-a-question.md │ └── bug-report.yaml └── workflows │ ├── pep8-linter.yml │ ├── check-spelling-errors.yml │ ├── check-links.yml │ ├── basic-tests-linux.yml │ ├── basic-tests-macos.yml │ ├── basic-tests-old-pytorch.yml │ └── basic-tests-windows.yml ├── ch03 ├── 01_main-chapter-code │ ├── README.md │ └── small-text-sample.txt ├── 02_bonus_efficient-multihead-attention │ ├── README.md │ └── ch03.py ├── 03_understanding-buffers │ └── README.md └── README.md ├── ch05 ├── 02_alternative_weight_loading │ └── README.md ├── 05_bonus_hparam_tuning │ ├── README.md │ └── hparam_search.py ├── 04_learning_rate_schedulers │ └── README.md ├── README.md ├── 03_bonus_pretraining_on_gutenberg │ ├── tests.py │ └── prepare_dataset.py └── 01_main-chapter-code │ ├── README.md │ ├── tests.py │ └── gpt_download.py ├── ch01 └── README.md ├── requirements.txt └── .gitignore /ch04/02_performance-analysis/requirements-extra.txt: -------------------------------------------------------------------------------- 1 | thop -------------------------------------------------------------------------------- /ch07/03_model-evaluation/requirements-extra.txt: -------------------------------------------------------------------------------- 1 | openai>=1.30.3 2 | tqdm>=4.65.0 3 | -------------------------------------------------------------------------------- 
/ch02/02_bonus_bytepair-encoder/requirements-extra.txt: -------------------------------------------------------------------------------- 1 | requests 2 | tqdm 3 | transformers>=4.33.2 4 | -------------------------------------------------------------------------------- /ch06/03_bonus_imdb-classification/requirements-extra.txt: -------------------------------------------------------------------------------- 1 | transformers>=4.33.2 2 | scikit-learn>=1.3.0 -------------------------------------------------------------------------------- /ch07/02_dataset-utilities/requirements-extra.txt: -------------------------------------------------------------------------------- 1 | openai>=1.30.3 2 | scikit-learn>=1.3.1 3 | tqdm>=4.65.0 -------------------------------------------------------------------------------- /appendix-E/README.md: -------------------------------------------------------------------------------- 1 | # Appendix E: Parameter-efficient Finetuning with LoRA 2 | 3 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code. -------------------------------------------------------------------------------- /appendix-D/README.md: -------------------------------------------------------------------------------- 1 | # Appendix D: Adding Bells and Whistles to the Training Loop 2 | 3 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code. -------------------------------------------------------------------------------- /ch07/02_dataset-utilities/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "OPENAI_API_KEY": "sk-...", 3 | "_comment": "Enter your API key from https://platform.openai.com/api-keys" 4 | } 5 | -------------------------------------------------------------------------------- /ch07/03_model-evaluation/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "OPENAI_API_KEY": "sk-...", 3 | "_comment": "Enter your API key from https://platform.openai.com/api-keys" 4 | } 5 | -------------------------------------------------------------------------------- /ch02/04_bonus_dataloader-intuition/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 2: Working with Text Data 2 | 3 | - [dataloader-intuition.ipynb](dataloader-intuition.ipynb) contains optional (bonus) code to explain the data loader more intuitively with simple numbers rather than text. 4 | -------------------------------------------------------------------------------- /appendix-A/02_setup-recommendations/README.md: -------------------------------------------------------------------------------- 1 | ## Python and Environment Setup Recommendations 2 | 3 | 4 | 5 | Please see the [README.md](../../setup/README.md) in the [setup](../../setup) directory for Python installation and setup recommendations. 6 | 7 | 8 | 9 | -------------------------------------------------------------------------------- /ch02/03_bonus_embedding-vs-matmul/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 2: Working with Text Data 2 | 3 | - [embeddings-and-linear-layers.ipynb](embeddings-and-linear-layers.ipynb) contains optional (bonus) code to explain that embedding layers and fully connected layers applied to one-hot encoded vectors are equivalent. 
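In a nutshell, the equivalence looks like this (a minimal sketch with toy layer sizes; the notebook walks through it step by step):

```python
import torch

torch.manual_seed(123)

num_tokens, embed_dim = 5, 3
embedding = torch.nn.Embedding(num_tokens, embed_dim)
linear = torch.nn.Linear(num_tokens, embed_dim, bias=False)

# Share the same weights so the two layers are directly comparable
linear.weight = torch.nn.Parameter(embedding.weight.T.detach())

token_ids = torch.tensor([2, 4])
onehot = torch.nn.functional.one_hot(token_ids, num_classes=num_tokens).float()

# The embedding lookup and the one-hot matrix multiplication agree exactly
assert torch.allclose(embedding(token_ids), linear(onehot))
```

The embedding layer is simply the more efficient way to obtain the same result, since it skips the multiplications with the zero entries of the one-hot vectors.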
4 | -------------------------------------------------------------------------------- /ch02/02_bonus_bytepair-encoder/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 2: Working with Text Data 2 | 3 | 4 | 5 | - [compare-bpe-tiktoken.ipynb](compare-bpe-tiktoken.ipynb) benchmarks various byte pair encoding implementations 6 | - [bpe_openai_gpt2.py](bpe_openai_gpt2.py) is the original bytepair encoder code used by OpenAI 7 | 8 | -------------------------------------------------------------------------------- /setup/03_optional-docker-environment/.devcontainer/README.md: -------------------------------------------------------------------------------- 1 | # Optional Docker Environment 2 | 3 | This is an optional Docker environment for those users who prefer Docker. In case you are interested in using this Docker DevContainer, please see the *Using Docker DevContainers* section in the [../../README.md](../../README.md) for more information. -------------------------------------------------------------------------------- /ch07/05_dataset-generation/README.md: -------------------------------------------------------------------------------- 1 | # Generating a Dataset for Instruction Finetuning 2 | 3 | This folder contains utility code that can be used for generating a dataset for instruction finetuning. 4 | 5 | - [llama3-ollama.ipynb](llama3-ollama.ipynb): A notebook that creates a synthetic instruction finetuning dataset using Llama 3 and Ollama 6 | 7 | -------------------------------------------------------------------------------- /ch02/01_main-chapter-code/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 2: Working with Text Data 2 | 3 | ### Main Chapter Code 4 | 5 | - [ch02.ipynb](ch02.ipynb) contains all the code as it appears in the chapter 6 | 7 | ### Optional Code 8 | 9 | - [dataloader.ipynb](dataloader.ipynb) is a minimal notebook with the main data loading pipeline implemented in this chapter 10 | -------------------------------------------------------------------------------- /setup/.vscode/extensions.json: -------------------------------------------------------------------------------- 1 | { 2 | "recommendations": [ 3 | "ms-python.python", 4 | "ms-toolsai.jupyter", 5 | "ms-azuretools.vscode-docker", 6 | "ms-vscode-remote.vscode-remote-extensionpack", 7 | "yahyabatulu.vscode-markdown-alert", 8 | "tomoki1207.pdf", 9 | "mechatroner.rainbow-csv" 10 | ] 11 | } -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/ask-a-question.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Ask a Question 3 | about: Ask questions related to the book 4 | title: '' 5 | labels: [question] 6 | assignees: rasbt 7 | 8 | --- 9 | 10 | If you have a question that is not a bug, please consider asking it in this GitHub repository's [discussion forum](https://github.com/rasbt/LLMs-from-scratch/discussions). 
11 | -------------------------------------------------------------------------------- /ch03/01_main-chapter-code/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 3: Coding Attention Mechanisms 2 | 3 | ### Main Chapter Code 4 | 5 | - [ch03.ipynb](ch03.ipynb) contains all the code as it appears in the chapter 6 | 7 | ### Optional Code 8 | 9 | - [multihead-attention.ipynb](multihead-attention.ipynb) is a minimal notebook with the main multi-head attention implementation from this chapter 10 | 11 | -------------------------------------------------------------------------------- /ch04/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 4: Implementing a GPT Model from Scratch to Generate Text 2 | 3 | ## Main Chapter Code 4 | 5 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code. 6 | 7 | ## Optional Code 8 | 9 | - [02_performance-analysis](02_performance-analysis) contains optional code analyzing the performance of the GPT model(s) implemented in the main chapter. 10 | 11 | -------------------------------------------------------------------------------- /ch05/02_alternative_weight_loading/README.md: -------------------------------------------------------------------------------- 1 | # Alternative Approaches to Loading Pretrained Weights 2 | 3 | This folder contains alternative weight loading strategies in case the weights become unavailable from OpenAI. 4 | 5 | - [weight-loading-hf-transformers.ipynb](weight-loading-hf-transformers.ipynb): contains code to load the weights from the Hugging Face Model Hub via the `transformers` library 6 | -------------------------------------------------------------------------------- /ch01/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 1: Understanding Large Language Models 2 | 3 | There is no code in this chapter. 4 | 5 |
6 | As optional bonus material, below is a video tutorial where I explain the LLM development lifecycle covered in this book: 7 | 8 |
9 |
10 | 11 | [![Link to the video](https://img.youtube.com/vi/kPGTx4wcm_w/0.jpg)](https://www.youtube.com/watch?v=kPGTx4wcm_w) 12 | 13 | -------------------------------------------------------------------------------- /ch03/02_bonus_efficient-multihead-attention/README.md: -------------------------------------------------------------------------------- 1 | # More Efficient Multi-Head Attention Implementations 2 | 3 | - [mha-implementations.ipynb](mha-implementations.ipynb) contains and compares different implementations of multi-head attention 4 | 5 | -------------------------------------------------------------------------------- /ch07/04_preference-tuning-with-dpo/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 7: Finetuning to Follow Instructions 2 | 3 | - [create-preference-data-ollama.ipynb](create-preference-data-ollama.ipynb): A notebook that creates a synthetic dataset for preference finetuning dataset using Llama 3.1 and Ollama 4 | 5 | - [dpo-from-scratch.ipynb](dpo-from-scratch.ipynb): This notebook implements Direct Preference Optimization (DPO) for LLM alignment 6 | 7 | 8 | -------------------------------------------------------------------------------- /setup/03_optional-docker-environment/.devcontainer/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime 2 | 3 | RUN apt-get update && \ 4 | apt-get upgrade -y && \ 5 | apt-get install -y rsync && \ 6 | apt-get install -y git && \ 7 | apt-get install -y curl && \ 8 | rm -rf /var/lib/apt/lists/* 9 | 10 | COPY requirements.txt requirements.txt 11 | 12 | RUN pip install --no-cache-dir -r requirements.txt 13 | -------------------------------------------------------------------------------- /ch07/03_model-evaluation/scores/llama3-8b-model-2-response.json: -------------------------------------------------------------------------------- 1 | [76, 85, 67, 90, 20, 98, 22, 96, 40, 80, 40, 20, 90, 98, 80, 92, 98, 98, 95, 99, 55, 99, 80, 90, 20, 4, 98, 4, 40, 95, 14, 44, 95, 44, 80, 4, 4, 40, 95, 80, 98, 95, 92, 98, 68, 20, 20, 60, 95, 90, 98, 0, 20, 80, 20, 80, 92, 98, 98, 20, 95, 100, 95, 85, 98, 4, 40, 98, 98, 65, 20, 76, 100, 67, 44, 92, 75, 97, 27, 98, 20, 60, 90, 96, 67, 98, 80, 10, 80, 98, 100, 40, 92, 98, 20, 98, 98, 20, 20] -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | torch >= 2.0.1 # all 2 | jupyterlab >= 4.0 # all 3 | tiktoken >= 0.5.1 # ch02; ch04; ch05 4 | matplotlib >= 3.7.1 # ch04; ch05 5 | tensorflow >= 2.15.0 # ch05 6 | tqdm >= 4.66.1 # ch05; ch07 7 | numpy >= 1.25, < 2.0 # dependency of several other libraries like torch and pandas 8 | pandas >= 2.2.1 # ch06 9 | psutil >= 5.9.5 # ch07; already installed automatically as dependency of torch 10 | -------------------------------------------------------------------------------- /ch07/03_model-evaluation/scores/gpt4-model-2-response.json: -------------------------------------------------------------------------------- 1 | [0, 100, 0, 100, 0, 100, 0, 100, 0, 0, 50, 0, 100, 100, 100, 100, 100, 100, 100, 95, 0, 50, 100, 100, 0, 0, 100, 0, 0, 100, 0, 0, 100, 0, 67, 0, 0, 0, 100, 100, 95, 100, 100, 100, 0, 0, 0, 0, 100, 100, 100, 0, 55, 100, 0, 100, 65, 100, 100, 0, 100, 100, 100, 0, 100, 0, 85, 100, 100, 85, 0, 75, 100, 0, 0, 100, 100, 100, 0, 100, 0, 50, 100, 100, 0, 100, 0, 0, 100, 85, 100, 0, 
100, 100, 0, 100, 100, 0, 0, 0] -------------------------------------------------------------------------------- /ch07/03_model-evaluation/scores/llama3-8b-model-1-response.json: -------------------------------------------------------------------------------- 1 | [20, 92, 85, 90, 20, 90, 22, 97, 60, 96, 20, 20, 98, 95, 90, 98, 95, 20, 98, 98, 92, 20, 96, 96, 100, 98, 98, 95, 20, 95, 98, 20, 85, 95, 80, 97, 40, 21, 100, 85, 95, 98, 92, 98, 69, 98, 80, 60, 60, 20, 80, 68, 80, 96, 96, 68, 80, 95, 80, 20, 95, 98, 80, 98, 94, 20, 40, 98, 100, 85, 98, 90, 95, 85, 95, 80, 98, 98, 25, 98, 40, 92, 95, 82, 87, 98, 80, 90, 95, 4, 90, 90, 80, 98, 20, 98, 98, 40, 92, 98] -------------------------------------------------------------------------------- /ch07/03_model-evaluation/scores/gpt4-model-1-response.json: -------------------------------------------------------------------------------- 1 | [0, 50, 20, 100, 0, 100, 0, 100, 100, 100, 55, 0, 100, 100, 100, 100, 100, 0, 98, 100, 100, 0, 100, 100, 100, 100, 100, 100, 0, 100, 100, 0, 100, 100, 85, 100, 0, 0, 100, 100, 100, 100, 100, 100, 0, 100, 100, 95, 20, 50, 85, 100, 100, 100, 100, 55, 100, 100, 100, 0, 100, 98, 100, 100, 100, 0, 85, 100, 100, 98, 100, 100, 100, 0, 100, 100, 100, 100, 0, 100, 0, 100, 100, 0, 0, 100, 50, 100, 100, 10, 100, 100, 100, 100, 0, 100, 100, 25, 100, 30] -------------------------------------------------------------------------------- /ch03/03_understanding-buffers/README.md: -------------------------------------------------------------------------------- 1 | # Understanding PyTorch Buffers 2 | 3 | - [understanding-buffers.ipynb](understanding-buffers.ipynb) explains the idea behind PyTorch buffers, which are used to implement the causal attention mechanism in chapter 3 4 | 5 | 6 |
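For a quick taste of the idea, here is a minimal sketch (simplified shapes; the notebook develops the full causal attention example):

```python
import torch


class CausalMaskStub(torch.nn.Module):
    def __init__(self, context_length):
        super().__init__()
        # A buffer is non-trainable state: it is excluded from the optimizer but
        # still moves with the module in .to(device) calls and is saved in the state_dict
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, attn_scores):
        # Hide future positions before the softmax
        return attn_scores.masked_fill(self.mask.bool(), -torch.inf)
```

Because `mask` is registered as a buffer rather than stored as a plain attribute, `module.to("cuda")` transfers it along with the parameters, which is exactly the kind of detail the notebook examines.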
7 | Below is a hands-on video tutorial I recorded to explain the code: 8 | 9 |
10 |
11 | 12 | [![Link to the video](https://img.youtube.com/vi/PetlIokI9Ao/0.jpg)](https://www.youtube.com/watch?v=PetlIokI9Ao) 13 | 14 | -------------------------------------------------------------------------------- /ch05/05_bonus_hparam_tuning/README.md: -------------------------------------------------------------------------------- 1 | # Optimizing Hyperparameters for Pretraining 2 | 3 | The [hparam_search.py](hparam_search.py) script, based on the extended training function in [Appendix D: Adding Bells and Whistles to the Training Loop](../../appendix-D/01_main-chapter-code/appendix-D.ipynb), is designed to find optimal hyperparameters via grid search. 4 | 5 | > [!NOTE] 6 | > This script will take a long time to run. You may want to reduce the number of hyperparameter configurations explored in the `HPARAM_GRID` dictionary at the top. -------------------------------------------------------------------------------- /ch03/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 3: Coding Attention Mechanisms 2 | 3 | ## Main Chapter Code 4 | 5 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code. 6 | 7 | ## Bonus Materials 8 | 9 | - [02_bonus_efficient-multihead-attention](02_bonus_efficient-multihead-attention) implements and compares different implementation variants of multi-head attention 10 | - [03_understanding-buffers](03_understanding-buffers) explains the idea behind PyTorch buffers, which are used to implement the causal attention mechanism in chapter 3 -------------------------------------------------------------------------------- /ch05/04_learning_rate_schedulers/README.md: -------------------------------------------------------------------------------- 1 | # Adding Bells and Whistles to the Training Loop 2 | 3 | The main chapter used a relatively simple training function to keep the code readable and fit Chapter 5 within the page limits. Optionally, we can add a linear warm-up, a cosine decay schedule, and gradient clipping to improve the training stability and convergence. 4 | 5 | You can find the code for this more sophisticated training function in [Appendix D: Adding Bells and Whistles to the Training Loop](../../appendix-D/01_main-chapter-code/appendix-D.ipynb). -------------------------------------------------------------------------------- /setup/02_installing-python-libraries/tests.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | # File for internal use (unit tests) 7 | 8 | from python_environment_check import main 9 | 10 | 11 | def test_main(capsys): 12 | main() 13 | captured = capsys.readouterr() 14 | assert "FAIL" not in captured.out 15 | -------------------------------------------------------------------------------- /ch06/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 6: Finetuning for Classification 2 | 3 | 4 | ## Main Chapter Code 5 | 6 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code 7 | 8 | ## Bonus Materials 9 | 10 | - [02_bonus_additional-experiments](02_bonus_additional-experiments) includes additional experiments (e.g., training the last vs. first token, extending the input length, etc.) 11 | - [03_bonus_imdb-classification](03_bonus_imdb-classification) compares the LLM from chapter 6 with other models on a 50k IMDB movie review sentiment classification dataset -------------------------------------------------------------------------------- /setup/03_optional-docker-environment/.devcontainer/devcontainer.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "LLMs From Scratch", 3 | "build": { 4 | "context": "..", 5 | "dockerfile": "Dockerfile" 6 | }, 7 | "runArgs": ["--runtime=nvidia", "--gpus=all"], 8 | "customizations": { 9 | "vscode": { 10 | "extensions": [ 11 | "ms-python.python", 12 | "ms-azuretools.vscode-docker", 13 | "ms-toolsai.jupyter", 14 | "yahyabatulu.vscode-markdown-alert", 15 | "tomoki1207.pdf", 16 | "mechatroner.rainbow-csv" 17 | ] 18 | } 19 | } 20 | } -------------------------------------------------------------------------------- /.github/workflows/pep8-linter.yml: -------------------------------------------------------------------------------- 1 | name: PEP8 Style checks 2 | 3 | on: 4 | push: 5 | branches: [ main ] 6 | pull_request: 7 | branches: [ main ] 8 | 9 | jobs: 10 | flake8: 11 | runs-on: ubuntu-latest 12 | steps: 13 | - uses: actions/checkout@v4 14 | - name: Set up Python 15 | uses: actions/setup-python@v5 16 | with: 17 | python-version: '3.10' 18 | - name: Install flake8 19 | run: pip install flake8 20 | - name: Run flake8 with exceptions 21 | run: flake8 . --max-line-length=140 --ignore=W504,E402,E731,C406,E741,E722,E226 22 | -------------------------------------------------------------------------------- /ch04/02_performance-analysis/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 4: Implementing a GPT Model from Scratch To Generate Text 2 | 3 | - [flops-analysis.ipynb](flops-analysis.ipynb) analyzes the floating point operations (FLOPs) of the GPT model(s) implemented in the main chapter. 4 | - [previous_chapters.py](previous_chapters.py) is a Python module containing the `GPTModel` code we implemented in chapter 4 and other code implemented in previous chapters, which we import in the analysis notebook. 5 | - `requirements-extra.txt` includes additional Python libraries that need to be installed (via `pip install -r requirements-extra.txt`).
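For reference, the core measurement in the notebook boils down to a few lines of `thop` usage along the following lines (a sketch: the configuration values below are the GPT-2 124M sizes, shown for illustration, and the exact setup in the notebook may differ):

```python
import torch
from thop import profile

from previous_chapters import GPTModel

BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of transformer blocks
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True,        # Query-key-value bias
}

model = GPTModel(BASE_CONFIG)
input_tensor = torch.randint(0, BASE_CONFIG["vocab_size"], (2, BASE_CONFIG["context_length"]))

# thop counts multiply-accumulate operations (MACs); FLOPs is roughly 2 * MACs
macs, params = profile(model, inputs=(input_tensor,), verbose=False)
print(f"Approximate FLOPs: {2 * macs:.2e}")
```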
-------------------------------------------------------------------------------- /ch04/01_main-chapter-code/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 4: Implementing a GPT Model from Scratch To Generate Text 2 | 3 | ### Main Chapter Code 4 | 5 | - [ch04.ipynb](ch04.ipynb) contains all the code as it appears in the chapter 6 | - [previous_chapters.py](previous_chapters.py) is a Python module that contains the `MultiHeadAttention` module from the previous chapter, which we import in [ch04.ipynb](ch04.ipynb) to create the GPT model 7 | 8 | ### Optional Code 9 | 10 | - [gpt.py](gpt.py) is a standalone Python script file with the code that we implemented thus far, including the GPT model we coded in this chapter 11 | 12 | -------------------------------------------------------------------------------- /ch06/01_main-chapter-code/tests.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | # File for internal use (unit tests) 7 | 8 | 9 | import subprocess 10 | 11 | 12 | def test_gpt_class_finetune(): 13 | command = ["python", "ch06/01_main-chapter-code/gpt_class_finetune.py", "--test_mode"] 14 | 15 | result = subprocess.run(command, capture_output=True, text=True) 16 | assert result.returncode == 0, f"Script exited with errors: {result.stderr}" 17 | -------------------------------------------------------------------------------- /ch07/01_main-chapter-code/tests.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 
2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | # File for internal use (unit tests) 7 | 8 | 9 | import subprocess 10 | 11 | 12 | def test_gpt_class_finetune(): 13 | command = ["python", "ch06/01_main-chapter-code/gpt_class_finetune.py", "--test_mode"] 14 | 15 | result = subprocess.run(command, capture_output=True, text=True) 16 | assert result.returncode == 0, f"Script exited with errors: {result.stderr}" 17 | -------------------------------------------------------------------------------- /.github/workflows/check-spelling-errors.yml: -------------------------------------------------------------------------------- 1 | name: Spell Check 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | pull_request: 8 | branches: 9 | - main 10 | 11 | jobs: 12 | spellcheck: 13 | runs-on: ubuntu-latest 14 | 15 | steps: 16 | - uses: actions/checkout@v4 17 | 18 | - name: Set up Python 19 | uses: actions/setup-python@v5 20 | with: 21 | python-version: '3.10' 22 | 23 | - name: Install codespell 24 | run: | 25 | python -m pip install --upgrade pip 26 | pip install codespell 27 | 28 | - name: Run codespell 29 | run: | 30 | codespell -L "ocassion,occassion,ot,te,tje" **/*.{txt,md,py,ipynb} 31 | -------------------------------------------------------------------------------- /ch02/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 2: Working with Text Data 2 | 3 | 4 | ## Main Chapter Code 5 | 6 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code and exercise solutions 7 | 8 | ## Bonus Materials 9 | 10 | - [02_bonus_bytepair-encoder](02_bonus_bytepair-encoder) contains optional code to benchmark different byte pair encoder implementations 11 | 12 | - [03_bonus_embedding-vs-matmul](03_bonus_embedding-vs-matmul) contains optional (bonus) code to explain that embedding layers and fully connected layers applied to one-hot encoded vectors are equivalent. 13 | 14 | - [04_bonus_dataloader-intuition](04_bonus_dataloader-intuition) contains optional (bonus) code to explain the data loader more intuitively with simple numbers rather than text. 
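As a taste of the tokenizers compared in the byte pair encoding bonus material, the `tiktoken` baseline takes only a few lines (a minimal sketch; the bonus notebook benchmarks several BPE implementations against each other):

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, do you like tea?"
token_ids = tokenizer.encode(text)

print(token_ids)                            # e.g., [15496, 11, 466, 345, 588, 8887, 30]
print(tokenizer.decode(token_ids) == text)  # True; BPE round-trips losslessly
```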
15 | -------------------------------------------------------------------------------- /ch07/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 7: Finetuning to Follow Instructions 2 | 3 | ## Main Chapter Code 4 | 5 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code and exercise solutions 6 | 7 | ## Bonus Materials 8 | 9 | - [02_dataset-utilities](02_dataset-utilities) contains utility code that can be used for preparing an instruction dataset 10 | 11 | - [03_model-evaluation](03_model-evaluation) contains utility code for evaluating instruction responses using a local Llama 3 model and the GPT-4 API 12 | 13 | - [04_preference-tuning-with-dpo](04_preference-tuning-with-dpo) implements code for preference finetuning with Direct Preference Optimization (DPO) 14 | 15 | - [05_dataset-generation](05_dataset-generation) contains code to generate synthetic datasets for instruction finetuning 16 | -------------------------------------------------------------------------------- /ch05/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 5: Pretraining on Unlabeled Data 2 | 3 | ## Main Chapter Code 4 | 5 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code 6 | 7 | ## Bonus Materials 8 | 9 | - [02_alternative_weight_loading](02_alternative_weight_loading) contains code to load the GPT model weights from alternative places in case the model weights become unavailable from OpenAI 10 | - [03_bonus_pretraining_on_gutenberg](03_bonus_pretraining_on_gutenberg) contains code to pretrain the LLM longer on the whole corpus of books from Project Gutenberg 11 | - [04_learning_rate_schedulers](04_learning_rate_schedulers) contains code implementing a more sophisticated training function including learning rate schedulers and gradient clipping 12 | - [05_bonus_hparam_tuning](05_bonus_hparam_tuning) contains an optional hyperparameter tuning script 13 | 14 | -------------------------------------------------------------------------------- /ch06/01_main-chapter-code/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 6: Finetuning for Classification 2 | 3 | ### Main Chapter Code 4 | 5 | - [ch06.ipynb](ch06.ipynb) contains all the code as it appears in the chapter 6 | - [previous_chapters.py](previous_chapters.py) is a Python module that contains the GPT model we coded and trained in previous chapters, alongside many utility functions, which we reuse in this chapter 7 | - [gpt_download.py](gpt_download.py) contains the utility functions for downloading the pretrained GPT model weights 8 | - [exercise-solutions.ipynb](exercise-solutions.ipynb) contains the exercise solutions for this chapter 9 | 10 | ### Optional Code 11 | 12 | [load-finetuned-model.ipynb](load-finetuned-model.ipynb) is a standalone Jupyter notebook to load the finetuned model we created in this chapter 13 | 14 | 15 | - [gpt_class_finetune.py](gpt_class_finetune.py) is a standalone Python script file with the code that we implemented in [ch06.ipynb](ch06.ipynb) to finetune the GPT model (you can think of it as a chapter summary) 16 | 17 | -------------------------------------------------------------------------------- /ch05/03_bonus_pretraining_on_gutenberg/tests.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 
2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | # File for internal use (unit tests) 7 | 8 | from pathlib import Path 9 | import os 10 | import subprocess 11 | 12 | 13 | def test_pretraining(): 14 | 15 | sequence = "a b c d" 16 | repetitions = 1000 17 | content = sequence * repetitions 18 | 19 | folder_path = Path("gutenberg") / "data" 20 | file_name = "repeated_sequence.txt" 21 | 22 | os.makedirs(folder_path, exist_ok=True) 23 | 24 | with open(folder_path/file_name, "w") as file: 25 | file.write(content) 26 | 27 | result = subprocess.run( 28 | ["python", "pretraining_simple.py", "--debug", "true"], 29 | capture_output=True, text=True 30 | ) 31 | print(result.stdout) 32 | assert "Maximum GPU memory allocated" in result.stdout 33 | -------------------------------------------------------------------------------- /.github/workflows/check-links.yml: -------------------------------------------------------------------------------- 1 | name: Check hyperlinks 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | pull_request: 8 | branches: 9 | - main 10 | 11 | jobs: 12 | test: 13 | runs-on: ubuntu-latest 14 | 15 | steps: 16 | - uses: actions/checkout@v4 17 | 18 | - name: Set up Python 19 | uses: actions/setup-python@v5 20 | with: 21 | python-version: '3.10' 22 | 23 | - name: Install dependencies 24 | run: | 25 | python -m pip install --upgrade pip 26 | pip install pytest pytest-check-links 27 | # Current version of retry doesn't work well if there are broken non-URL links 28 | # pip install pytest pytest-check-links pytest-retry 29 | 30 | - name: Check links 31 | run: | 32 | pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://arena.lmsys.org" 33 | # pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://arena.lmsys.org" --retries 2 --retry-delay 5 34 | 35 | -------------------------------------------------------------------------------- /ch05/01_main-chapter-code/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 5: Pretraining on Unlabeled Data 2 | 3 | ### Main Chapter Code 4 | 5 | - [ch05.ipynb](ch05.ipynb) contains all the code as it appears in the chapter 6 | - [previous_chapters.py](previous_chapters.py) is a Python module that contains the `MultiHeadAttention` module and `GPTModel` class from the previous chapters, which we import in [ch05.ipynb](ch05.ipynb) to pretrain the GPT model 7 | - [gpt_download.py](gpt_download.py) contains the utility functions for downloading the pretrained GPT model weights 8 | - [exercise-solutions.ipynb](exercise-solutions.ipynb) contains the exercise solutions for this chapter 9 | 10 | ### Optional Code 11 | 12 | - [gpt_train.py](gpt_train.py) is a standalone Python script file with the code that we implemented in [ch05.ipynb](ch05.ipynb) to train the GPT model (you can think of it as a code file summarizing this chapter) 13 | - [gpt_generate.py](gpt_generate.py) is a standalone Python script file with the code that we implemented in [ch05.ipynb](ch05.ipynb) to load and use the pretrained model weights from OpenAI 14 | 15 | -------------------------------------------------------------------------------- /ch07/03_model-evaluation/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 7: Finetuning to 
Follow Instructions 2 | 3 | This folder contains utility code that can be used for model evaluation. 4 | 5 | 6 | 7 |   8 | ## Evaluating Instruction Responses Using the OpenAI API 9 | 10 | 11 | - The [llm-instruction-eval-openai.ipynb](llm-instruction-eval-openai.ipynb) notebook uses OpenAI's GPT-4 to evaluate responses generated by instruction finetuned models. It works with a JSON file in the following format: 12 | 13 | ```python 14 | { 15 | "instruction": "What is the atomic number of helium?", 16 | "input": "", 17 | "output": "The atomic number of helium is 2.", # <-- The target given in the test set 18 | "model 1 response": "\nThe atomic number of helium is 2.0.", # <-- Response by an LLM 19 | "model 2 response": "\nThe atomic number of helium is 3." # <-- Response by a 2nd LLM 20 | }, 21 | ``` 22 | 23 |   24 | ## Evaluating Instruction Responses Locally Using Ollama 25 | 26 | - The [llm-instruction-eval-ollama.ipynb](llm-instruction-eval-ollama.ipynb) notebook offers an alternative to the one above, utilizing a locally downloaded Llama 3 model via Ollama. -------------------------------------------------------------------------------- /ch04/01_main-chapter-code/tests.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | # File for internal use (unit tests) 7 | 8 | from gpt import main 9 | 10 | expected = """ 11 | ================================================== 12 | IN 13 | ================================================== 14 | 15 | Input text: Hello, I am 16 | Encoded input text: [15496, 11, 314, 716] 17 | encoded_tensor.shape: torch.Size([1, 4]) 18 | 19 | 20 | ================================================== 21 | OUT 22 | ================================================== 23 | 24 | Output: tensor([[15496, 11, 314, 716, 27018, 24086, 47843, 30961, 42348, 7267, 25 | 49706, 43231, 47062, 34657]]) 26 | Output length: 14 27 | Output text: Hello, I am Featureiman Byeswickattribute argue logger Normandy Compton analogous 28 | """ 29 | 30 | 31 | def test_main(capsys): 32 | main() 33 | captured = capsys.readouterr() 34 | 35 | # Normalize line endings and strip trailing whitespace from each line 36 | normalized_expected = '\n'.join(line.rstrip() for line in expected.splitlines()) 37 | normalized_output = '\n'.join(line.rstrip() for line in captured.out.splitlines()) 38 | 39 | # Compare normalized strings 40 | assert normalized_output == normalized_expected 41 | -------------------------------------------------------------------------------- /.github/workflows/basic-tests-linux.yml: -------------------------------------------------------------------------------- 1 | name: Code tests (Linux) 2 | 3 | on: 4 | push: 5 | branches: [ main ] 6 | paths: 7 | - '**/*.py' # Run workflow for changes in Python files 8 | - '**/*.ipynb' 9 | - '**/*.yaml' 10 | - '**/*.yml' 11 | - '**/*.sh' 12 | pull_request: 13 | branches: [ main ] 14 | paths: 15 | - '**/*.py' 16 | - '**/*.ipynb' 17 | - '**/*.yaml' 18 | - '**/*.yml' 19 | - '**/*.sh' 20 | 21 | jobs: 22 | test: 23 | runs-on: ubuntu-latest 24 | 25 | steps: 26 | - uses: actions/checkout@v4 27 | 28 | - name: Set up Python 29 | uses: actions/setup-python@v5 30 | with: 31 | python-version: "3.10" 32 | 33 | - name: Install dependencies 34 | run: | 35 
| python -m pip install --upgrade pip 36 | pip install pytest nbval 37 | if [ -f requirements.txt ]; then pip install -r requirements.txt; fi 38 | 39 | - name: Test Selected Python Scripts 40 | run: | 41 | pytest setup/02_installing-python-libraries/tests.py 42 | pytest ch04/01_main-chapter-code/tests.py 43 | pytest ch05/01_main-chapter-code/tests.py 44 | pytest ch06/01_main-chapter-code/tests.py 45 | 46 | - name: Validate Selected Jupyter Notebooks 47 | run: | 48 | pytest --nbval ch02/01_main-chapter-code/dataloader.ipynb 49 | pytest --nbval ch03/01_main-chapter-code/multihead-attention.ipynb 50 | pytest --nbval ch02/04_bonus_dataloader-intuition/dataloader-intuition.ipynb 51 | -------------------------------------------------------------------------------- /.github/workflows/basic-tests-macos.yml: -------------------------------------------------------------------------------- 1 | name: Code tests (macOS) 2 | 3 | on: 4 | push: 5 | branches: [ main ] 6 | paths: 7 | - '**/*.py' # Run workflow for changes in Python files 8 | - '**/*.ipynb' 9 | - '**/*.yaml' 10 | - '**/*.yml' 11 | - '**/*.sh' 12 | pull_request: 13 | branches: [ main ] 14 | paths: 15 | - '**/*.py' 16 | - '**/*.ipynb' 17 | - '**/*.yaml' 18 | - '**/*.yml' 19 | - '**/*.sh' 20 | 21 | jobs: 22 | test: 23 | runs-on: macos-latest 24 | 25 | steps: 26 | - uses: actions/checkout@v4 27 | 28 | - name: Set up Python 29 | uses: actions/setup-python@v5 30 | with: 31 | python-version: "3.10" 32 | 33 | - name: Install dependencies 34 | run: | 35 | python -m pip install --upgrade pip 36 | pip install pytest nbval 37 | if [ -f requirements.txt ]; then pip install -r requirements.txt; fi 38 | 39 | - name: Test Selected Python Scripts 40 | run: | 41 | pytest setup/02_installing-python-libraries/tests.py 42 | pytest ch04/01_main-chapter-code/tests.py 43 | pytest ch05/01_main-chapter-code/tests.py 44 | pytest ch06/01_main-chapter-code/tests.py 45 | 46 | - name: Validate Selected Jupyter Notebooks 47 | run: | 48 | pytest --nbval ch02/01_main-chapter-code/dataloader.ipynb 49 | pytest --nbval ch03/01_main-chapter-code/multihead-attention.ipynb 50 | pytest --nbval ch02/04_bonus_dataloader-intuition/dataloader-intuition.ipynb 51 | -------------------------------------------------------------------------------- /.github/workflows/basic-tests-old-pytorch.yml: -------------------------------------------------------------------------------- 1 | name: Test PyTorch 2.0 and 2.4 2 | 3 | on: 4 | push: 5 | branches: [ main ] 6 | paths: 7 | - '**/*.py' # Run workflow for changes in Python files 8 | - '**/*.ipynb' 9 | - '**/*.yaml' 10 | - '**/*.yml' 11 | - '**/*.sh' 12 | pull_request: 13 | branches: [ main ] 14 | paths: 15 | - '**/*.py' 16 | - '**/*.ipynb' 17 | - '**/*.yaml' 18 | - '**/*.yml' 19 | - '**/*.sh' 20 | 21 | jobs: 22 | test: 23 | runs-on: ubuntu-latest 24 | strategy: 25 | matrix: 26 | pytorch-version: [ 2.0.1, 2.4.0 ] 27 | 28 | steps: 29 | - uses: actions/checkout@v4 30 | 31 | - name: Set up Python 32 | uses: actions/setup-python@v5 33 | with: 34 | python-version: "3.10" 35 | 36 | - name: Install dependencies 37 | run: | 38 | python -m pip install --upgrade pip 39 | pip install pytest nbval 40 | if [ -f requirements.txt ]; then pip install -r requirements.txt; fi 41 | pip install torch==${{ matrix.pytorch-version }} 42 | 43 | - name: Test Selected Python Scripts 44 | run: | 45 | pytest setup/02_installing-python-libraries/tests.py 46 | pytest ch04/01_main-chapter-code/tests.py 47 | pytest ch05/01_main-chapter-code/tests.py 48 | pytest 
ch06/01_main-chapter-code/tests.py 49 | 50 | - name: Validate Selected Jupyter Notebooks 51 | run: | 52 | pytest --nbval ch02/01_main-chapter-code/dataloader.ipynb 53 | pytest --nbval ch03/01_main-chapter-code/multihead-attention.ipynb 54 | pytest --nbval ch02/04_bonus_dataloader-intuition/dataloader-intuition.ipynb 55 | -------------------------------------------------------------------------------- /.github/workflows/basic-tests-windows.yml: -------------------------------------------------------------------------------- 1 | name: Code tests (Windows) 2 | 3 | on: 4 | push: 5 | branches: [ main ] 6 | paths: 7 | - '**/*.py' # Run workflow for changes in Python files 8 | - '**/*.ipynb' 9 | - '**/*.yaml' 10 | - '**/*.yml' 11 | - '**/*.sh' 12 | pull_request: 13 | branches: [ main ] 14 | paths: 15 | - '**/*.py' 16 | - '**/*.ipynb' 17 | - '**/*.yaml' 18 | - '**/*.yml' 19 | - '**/*.sh' 20 | 21 | jobs: 22 | test: 23 | runs-on: windows-latest 24 | 25 | steps: 26 | - name: Checkout Code 27 | uses: actions/checkout@v4 28 | 29 | - name: Set up Python 30 | uses: actions/setup-python@v5 31 | with: 32 | python-version: '3.10' 33 | 34 | - name: Install dependencies 35 | shell: bash 36 | run: | 37 | python -m pip install --upgrade pip 38 | pip install pytest nbval 39 | if [ -f requirements.txt ]; then pip install -r requirements.txt; fi 40 | pip install matplotlib==3.9.0 41 | 42 | - name: Test Selected Python Scripts 43 | shell: bash 44 | run: | 45 | pytest setup/02_installing-python-libraries/tests.py 46 | pytest ch04/01_main-chapter-code/tests.py 47 | pytest ch05/01_main-chapter-code/tests.py 48 | pytest ch06/01_main-chapter-code/tests.py 49 | 50 | - name: Validate Selected Jupyter Notebooks 51 | shell: bash 52 | run: | 53 | pytest --nbval ch02/01_main-chapter-code/dataloader.ipynb 54 | pytest --nbval ch03/01_main-chapter-code/multihead-attention.ipynb 55 | pytest --nbval ch02/04_bonus_dataloader-intuition/dataloader-intuition.ipynb 56 | -------------------------------------------------------------------------------- /ch03/01_main-chapter-code/small-text-sample.txt: -------------------------------------------------------------------------------- 1 | Once upon a time in a quiet village nestled among rolling hills and whispering forests, there lived a young girl named Elara. Elara was known for her boundless curiosity and her love for the stars. Every night, she would climb to the highest hill near her home to gaze at the glittering sky, dreaming of distant worlds and galaxies. 2 | 3 | In the heart of the village, there was an ancient library, tended by an old, wise librarian named Mr. Bramwell. This library was a treasure trove of books on every subject, but most importantly, it housed a collection of old star maps and celestial guides. Elara, fascinated by these books, spent countless hours with Mr. Bramwell, learning about constellations, planets, and the mysteries of the universe. 4 | 5 | One evening, while studying an old star map, Elara noticed a small, uncharted star that twinkled differently. She shared this discovery with Mr. Bramwell, who was equally intrigued. They decided to observe this star every night, noting its unique patterns and movements. This small, mysterious star, which they named "Elara's Star," became the center of their nightly adventures. 6 | 7 | As days turned into weeks, the villagers began to take notice of Elara's star. The uncharted star brought the community together, with people of all ages joining Elara and Mr. Bramwell on the hill each night to gaze at the sky. 
The nightly gatherings turned into a festival of stars, where stories were shared, friendships were formed, and the mysteries of the cosmos were contemplated. 8 | 9 | The story of Elara and her star spread far and wide, attracting astronomers and dreamers from distant lands. The once quiet village became a beacon of wonder, a place where the sky seemed a little closer and the stars a bit friendlier. Elara's curiosity had not only unveiled a hidden star but had also brought her community together, reminding everyone that sometimes, the most extraordinary discoveries are waiting just above us, in the starlit sky. -------------------------------------------------------------------------------- /setup/02_installing-python-libraries/python_environment_check.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "c31e08b0-f551-4d67-b95e-41f49de3b392", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "Supplementary code for \"Build a Large Language Model From Scratch\": https://www.manning.com/books/build-a-large-language-model-from-scratch by Sebastian Raschka
\n", 10 | "Code repository: https://github.com/rasbt/LLMs-from-scratch\n", 11 | "
" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "id": "67f6f7ed-b67d-465b-bf6f-a99b0d996930", 18 | "metadata": {}, 19 | "outputs": [ 20 | { 21 | "name": "stdout", 22 | "output_type": "stream", 23 | "text": [ 24 | "[OK] Your Python version is 3.10.12\n", 25 | "[OK] numpy 1.26.0\n", 26 | "[OK] matplotlib 3.8.2\n", 27 | "[OK] jupyterlab 4.0.6\n", 28 | "[OK] tensorflow 2.15.0\n", 29 | "[OK] torch 2.2.1\n", 30 | "[OK] tqdm 4.66.1\n", 31 | "[OK] tiktoken 0.5.1\n" 32 | ] 33 | } 34 | ], 35 | "source": [ 36 | "from python_environment_check import check_packages, get_requirements_dict\n", 37 | "\n", 38 | "d = get_requirements_dict()\n", 39 | "check_packages(d)" 40 | ] 41 | } 42 | ], 43 | "metadata": { 44 | "kernelspec": { 45 | "display_name": "Python 3 (ipykernel)", 46 | "language": "python", 47 | "name": "python3" 48 | }, 49 | "language_info": { 50 | "codemirror_mode": { 51 | "name": "ipython", 52 | "version": 3 53 | }, 54 | "file_extension": ".py", 55 | "mimetype": "text/x-python", 56 | "name": "python", 57 | "nbconvert_exporter": "python", 58 | "pygments_lexer": "ipython3", 59 | "version": "3.10.6" 60 | } 61 | }, 62 | "nbformat": 4, 63 | "nbformat_minor": 5 64 | } 65 | -------------------------------------------------------------------------------- /ch07/02_dataset-utilities/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 7: Finetuning to Follow Instructions 2 | 3 | This folder contains utility code that can be used for preparing an instruction dataset. 4 | 5 | Install the additional package requirements via: 6 | 7 | ```bash 8 | pip install -r requirements-extra.txt 9 | ``` 10 | 11 | 12 | 13 | 14 | 15 | ### Finding Near Duplicates 16 | 17 | The `find-near-duplicates.py` function can be used to identify duplicates and near-duplicates in an instruction dataset. For example, 18 | 19 | 20 | 21 | ```bash 22 | python find-near-duplicates.py --json_file instruction-examples.json 23 | ``` 24 | 25 | ``` 26 | scikit-learn version: 1.3.1 27 | 28 | 29 | ================================================== 30 | Searching 'instruction' for duplicates ... 31 | ================================================== 32 | Duplicate pair found with similarity 0.94: 33 | 1. Edit the following sentence to make it more formal. 34 | 2. Edit the sentence to make it more formal. 35 | 36 | Duplicate pair found with similarity 1.00: 37 | 1. Name a dwarf planet in our solar system. 38 | 2. Name a dwarf planet in our solar system. 39 | 40 | Duplicate pair found with similarity 0.91: 41 | 1. Change the sentences from active voice to passive voice. 42 | 2. Change the sentence from passive to active voice. 43 | 44 | 45 | 46 | ================================================== 47 | Searching 'input' for duplicates ... 48 | ================================================== 49 | No duplicates found 50 | 51 | 52 | ================================================== 53 | Searching 'output' for duplicates ... 54 | ================================================== 55 | Duplicate pair found with similarity 1.00: 56 | 1. One dwarf planet in our solar system is Pluto. 57 | 2. One dwarf planet in our solar system is Pluto. 58 | 59 | 60 | ``` 61 | 62 |   63 | You can use the `--threshold` setting with a value between 0 and 1 to decrease or increase the sensitivity. 64 | The default threshold is 0.9. 
65 | 66 | 67 | 68 |   69 | ## Creating Passive Voice Entries 70 | 71 | - The [create-passive-voice-entries.ipynb](create-passive-voice-entries.ipynb) notebook uses OpenAI's GPT-4 to create "passive voice" entries for an instruction dataset, as shown in the example below 72 | 73 | ```python 74 | { 75 | 'instruction': 'Identify the verb in the following sentence', 76 | 'input': 'The cat sleeps on the couch.', 77 | 'output': 'The verb in the sentence is "sleeps."', 78 | 'output_2': 'The sentence is "sleeps."' # <---- Newly created entry 79 | } 80 | ``` 81 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug-report.yaml: -------------------------------------------------------------------------------- 1 | name: Bug Report 2 | description: Report errors related to the book content or code 3 | title: "Description" 4 | labels: [bug] 5 | assignees: rasbt 6 | body: 7 | - type: markdown 8 | attributes: 9 | value: | 10 | Thank you for taking the time to report an issue. Please fill out the details below to help resolve it. 11 | 12 | - type: textarea 13 | id: bug_description 14 | attributes: 15 | label: Bug description 16 | description: A description of the issue. 17 | placeholder: | 18 | Please provide a description of what the bug or issue is. 19 | validations: 20 | required: true 21 | 22 | - type: dropdown 23 | id: operating_system 24 | attributes: 25 | label: What operating system are you using? 26 | description: If applicable, please select the operating system where you experienced this issue. 27 | options: 28 | - "Unknown" 29 | - "macOS" 30 | - "Linux" 31 | - "Windows" 32 | validations: 33 | required: False 34 | 35 | - type: dropdown 36 | id: compute_environment 37 | attributes: 38 | label: Where do you run your code? 39 | description: Please select the computing environment where you ran this code. 40 | options: 41 | - "Local (laptop, desktop)" 42 | - "Lightning AI Studio" 43 | - "Google Colab" 44 | - "Other cloud environment (AWS, Azure, GCP)" 45 | validations: 46 | required: False 47 | 48 | - type: textarea 49 | id: environment 50 | attributes: 51 | label: Environment 52 | description: | 53 | Please provide details about your Python environment via the environment collection script or notebook located at 54 | https://github.com/rasbt/LLMs-from-scratch/tree/main/setup/02_installing-python-libraries. 55 | For your convenience, you can download and run the script from your terminal as follows: 56 | 57 | ```bash 58 | curl --ssl-no-revoke -O https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/setup/02_installing-python-libraries/python_environment_check.py \ 59 | -O https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/requirements.txt 60 | 61 | python python_environment_check.py 62 | ``` 63 | 64 | The script will print your Python environment information in the following format 65 | ```console 66 | [OK] Your Python version is 3.11.4 67 | [OK] torch 2.3.1 68 | [OK] jupyterlab 4.2.2 69 | [OK] tiktoken 0.7.0 70 | [OK] matplotlib 3.9.0 71 | [OK] numpy 1.26.4 72 | [OK] tensorflow 2.16.1 73 | [OK] tqdm 4.66.4 74 | [OK] pandas 2.2.2 75 | [OK] psutil 5.9.8 76 | ``` 77 | You can simply copy and paste the outputs of this script below. 
78 | value: | 79 | ``` 80 | 81 | 82 | 83 | ``` 84 | validations: 85 | required: false 86 | -------------------------------------------------------------------------------- /ch06/03_bonus_imdb-classification/train_sklearn_logreg.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | import pandas as pd 7 | from sklearn.feature_extraction.text import CountVectorizer 8 | from sklearn.linear_model import LogisticRegression 9 | from sklearn.metrics import accuracy_score 10 | # from sklearn.metrics import balanced_accuracy_score 11 | from sklearn.dummy import DummyClassifier 12 | 13 | 14 | def load_dataframes(): 15 | df_train = pd.read_csv("train.csv") 16 | df_val = pd.read_csv("validation.csv") 17 | df_test = pd.read_csv("test.csv") 18 | 19 | return df_train, df_val, df_test 20 | 21 | 22 | def eval(model, X_train, y_train, X_val, y_val, X_test, y_test): 23 | # Making predictions 24 | y_pred_train = model.predict(X_train) 25 | y_pred_val = model.predict(X_val) 26 | y_pred_test = model.predict(X_test) 27 | 28 | # Calculating accuracy and balanced accuracy 29 | accuracy_train = accuracy_score(y_train, y_pred_train) 30 | # balanced_accuracy_train = balanced_accuracy_score(y_train, y_pred_train) 31 | 32 | accuracy_val = accuracy_score(y_val, y_pred_val) 33 | # balanced_accuracy_val = balanced_accuracy_score(y_val, y_pred_val) 34 | 35 | accuracy_test = accuracy_score(y_test, y_pred_test) 36 | # balanced_accuracy_test = balanced_accuracy_score(y_test, y_pred_test) 37 | 38 | # Printing the results 39 | print(f"Training Accuracy: {accuracy_train*100:.2f}%") 40 | print(f"Validation Accuracy: {accuracy_val*100:.2f}%") 41 | print(f"Test Accuracy: {accuracy_test*100:.2f}%") 42 | 43 | # print(f"\nTraining Balanced Accuracy: {balanced_accuracy_train*100:.2f}%") 44 | # print(f"Validation Balanced Accuracy: {balanced_accuracy_val*100:.2f}%") 45 | # print(f"Test Balanced Accuracy: {balanced_accuracy_test*100:.2f}%") 46 | 47 | 48 | if __name__ == "__main__": 49 | df_train, df_val, df_test = load_dataframes() 50 | 51 | ######################################### 52 | # Convert text into bag-of-words model 53 | vectorizer = CountVectorizer() 54 | ######################################### 55 | 56 | X_train = vectorizer.fit_transform(df_train["text"]) 57 | X_val = vectorizer.transform(df_val["text"]) 58 | X_test = vectorizer.transform(df_test["text"]) 59 | y_train, y_val, y_test = df_train["label"], df_val["label"], df_test["label"] 60 | 61 | ##################################### 62 | # Model training and evaluation 63 | ##################################### 64 | 65 | # Create a dummy classifier with the strategy to predict the most frequent class 66 | dummy_clf = DummyClassifier(strategy="most_frequent") 67 | dummy_clf.fit(X_train, y_train) 68 | 69 | print("Dummy classifier:") 70 | eval(dummy_clf, X_train, y_train, X_val, y_val, X_test, y_test) 71 | 72 | print("\n\nLogistic regression classifier:") 73 | model = LogisticRegression(max_iter=1000) 74 | model.fit(X_train, y_train) 75 | eval(model, X_train, y_train, X_val, y_val, X_test, y_test) 76 | -------------------------------------------------------------------------------- /setup/02_installing-python-libraries/README.md: 
-------------------------------------------------------------------------------- 1 | # Installing Python Packages and Libraries Used In This Book 2 | 3 | This document provides more information on double-checking your installed Python version and packages. (Please see the [../01_optional-python-setup-preferences](../01_optional-python-setup-preferences) folder for more information on installing Python and Python packages.) 4 | 5 | I used the libraries listed [here](https://github.com/rasbt/LLMs-from-scratch/blob/main/requirements.txt) for this book. Newer versions of these libraries are likely compatible as well. However, if you experience any problems with the code, you can try these library versions as a fallback. 6 | 7 | To install these requirements most conveniently, you can use the `requirements.txt` file in the root directory of this code repository and execute the following command: 8 | 9 | ```bash 10 | pip install -r requirements.txt 11 | ``` 12 | 13 | Alternatively, you can install the requirements directly from the GitHub URL as follows: 14 | 15 | ```bash 16 | pip install -r https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/requirements.txt 17 | ``` 18 | 19 | 20 | Then, after completing the installation, please check that all the packages are installed and up to date using: 21 | 22 | ```bash 23 | python python_environment_check.py 24 | ``` 25 | 26 | 27 | 28 | It's also recommended to check the versions in JupyterLab by running the `python_environment_check.ipynb` notebook in this directory, which should ideally give you the same results as above. 29 | 30 | 31 | 32 | If the check reports issues here, it's likely that your JupyterLab instance is connected to the wrong conda environment: 33 | 34 | 35 | 36 | In this case, you may want to use `watermark` to check if you opened the JupyterLab instance in the right conda environment using the `--conda` flag: 37 | 38 | 39 | 40 | 41 |
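For example, in a notebook cell (assuming the `watermark` package is installed, e.g., via `pip install watermark`):

```
%load_ext watermark
%watermark --conda
```

The output should list the conda environment the notebook kernel is running in, which you can compare against the environment where you installed the requirements.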
42 |
43 | 44 | 45 | ## Installing PyTorch 46 | 47 | PyTorch can be installed just like any other Python library or package using pip. For example: 48 | 49 | ```bash 50 | pip install torch==2.0.1 51 | ``` 52 | 53 | However, since PyTorch is a comprehensive library featuring CPU- and GPU-compatible code, the installation may require additional settings and explanation (see section *A.1.3, Installing PyTorch*, in the book for more information). 54 | 55 | It's also highly recommended to consult the installation guide menu on the official PyTorch website at [https://pytorch.org](https://pytorch.org). 56 | 57 | 58 | 59 | 60 | 61 | --- 62 | 63 | 64 | 65 | 66 | Any questions? Please feel free to reach out in the [Discussion Forum](https://github.com/rasbt/LLMs-from-scratch/discussions). 67 | -------------------------------------------------------------------------------- /ch05/01_main-chapter-code/tests.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | # File for internal use (unit tests) 7 | 8 | import pytest 9 | from gpt_train import main 10 | import http.client 11 | from urllib.parse import urlparse 12 | 13 | 14 | @pytest.fixture 15 | def gpt_config(): 16 | return { 17 | "vocab_size": 50257, 18 | "context_length": 12, # small for testing efficiency 19 | "emb_dim": 32, # small for testing efficiency 20 | "n_heads": 4, # small for testing efficiency 21 | "n_layers": 2, # small for testing efficiency 22 | "drop_rate": 0.1, 23 | "qkv_bias": False 24 | } 25 | 26 | 27 | @pytest.fixture 28 | def other_settings(): 29 | return { 30 | "learning_rate": 5e-4, 31 | "num_epochs": 1, # small for testing efficiency 32 | "batch_size": 2, 33 | "weight_decay": 0.1 34 | } 35 | 36 | 37 | def test_main(gpt_config, other_settings): 38 | train_losses, val_losses, tokens_seen, model = main(gpt_config, other_settings) 39 | 40 | assert len(train_losses) == 39, "Unexpected number of training losses" 41 | assert len(val_losses) == 39, "Unexpected number of validation losses" 42 | assert len(tokens_seen) == 39, "Unexpected number of tokens seen" 43 | 44 | 45 | def check_file_size(url, expected_size): 46 | parsed_url = urlparse(url) 47 | if parsed_url.scheme == "https": 48 | conn = http.client.HTTPSConnection(parsed_url.netloc) 49 | else: 50 | conn = http.client.HTTPConnection(parsed_url.netloc) 51 | 52 | conn.request("HEAD", parsed_url.path) 53 | response = conn.getresponse() 54 | if response.status != 200: 55 | return False, f"{url} not accessible" 56 | size = response.getheader("Content-Length") 57 | if size is None: 58 | return False, "Content-Length header is missing" 59 | size = int(size) 60 | if size != expected_size: 61 | return False, f"{url} file has expected size {expected_size}, but got {size}" 62 | return True, f"{url} file size is correct" 63 | 64 | 65 | def test_model_files(): 66 | base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models" 67 | 68 | model_size = "124M" 69 | files = { 70 | "checkpoint": 77, 71 | "encoder.json": 1042301, 72 | "hparams.json": 90, 73 | "model.ckpt.data-00000-of-00001": 497759232, 74 | "model.ckpt.index": 5215, 75 | "model.ckpt.meta": 471155, 76 | "vocab.bpe": 456318 77 | } 78 | 79 | for file_name, expected_size in files.items(): 80 | url = 
f"{base_url}/{model_size}/{file_name}" 81 | valid, message = check_file_size(url, expected_size) 82 | assert valid, message 83 | 84 | model_size = "355M" 85 | files = { 86 | "checkpoint": 77, 87 | "encoder.json": 1042301, 88 | "hparams.json": 91, 89 | "model.ckpt.data-00000-of-00001": 1419292672, 90 | "model.ckpt.index": 10399, 91 | "model.ckpt.meta": 926519, 92 | "vocab.bpe": 456318 93 | } 94 | 95 | for file_name, expected_size in files.items(): 96 | url = f"{base_url}/{model_size}/{file_name}" 97 | valid, message = check_file_size(url, expected_size) 98 | assert valid, message 99 | -------------------------------------------------------------------------------- /ch06/03_bonus_imdb-classification/download_prepare_dataset.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | import os 7 | import sys 8 | import tarfile 9 | import time 10 | import urllib.request 11 | import pandas as pd 12 | 13 | 14 | def reporthook(count, block_size, total_size): 15 | global start_time 16 | if count == 0: 17 | start_time = time.time() 18 | else: 19 | duration = time.time() - start_time 20 | progress_size = int(count * block_size) 21 | percent = count * block_size * 100 / total_size 22 | 23 | speed = int(progress_size / (1024 * duration)) if duration else 0 24 | sys.stdout.write( 25 | f"\r{int(percent)}% | {progress_size / (1024**2):.2f} MB " 26 | f"| {speed:.2f} MB/s | {duration:.2f} sec elapsed" 27 | ) 28 | sys.stdout.flush() 29 | 30 | 31 | def download_and_extract_dataset(dataset_url, target_file, directory): 32 | if not os.path.exists(directory): 33 | if os.path.exists(target_file): 34 | os.remove(target_file) 35 | urllib.request.urlretrieve(dataset_url, target_file, reporthook) 36 | print("\nExtracting dataset ...") 37 | with tarfile.open(target_file, "r:gz") as tar: 38 | tar.extractall() 39 | else: 40 | print(f"Directory `{directory}` already exists. 
Skipping download.") 41 | 42 | 43 | def load_dataset_to_dataframe(basepath="aclImdb", labels={"pos": 1, "neg": 0}): 44 | data_frames = [] # List to store each chunk of DataFrame 45 | for subset in ("test", "train"): 46 | for label in ("pos", "neg"): 47 | path = os.path.join(basepath, subset, label) 48 | for file in sorted(os.listdir(path)): 49 | with open(os.path.join(path, file), "r", encoding="utf-8") as infile: 50 | # Create a DataFrame for each file and add it to the list 51 | data_frames.append(pd.DataFrame({"text": [infile.read()], "label": [labels[label]]})) 52 | # Concatenate all DataFrame chunks together 53 | df = pd.concat(data_frames, ignore_index=True) 54 | df = df.sample(frac=1, random_state=123).reset_index(drop=True) # Shuffle the DataFrame 55 | return df 56 | 57 | 58 | def partition_and_save(df, sizes=(35000, 5000, 10000)): 59 | # Shuffle the DataFrame 60 | df_shuffled = df.sample(frac=1, random_state=123).reset_index(drop=True) 61 | 62 | # Get indices for where to split the data 63 | train_end = sizes[0] 64 | val_end = sizes[0] + sizes[1] 65 | 66 | # Split the DataFrame 67 | train = df_shuffled.iloc[:train_end] 68 | val = df_shuffled.iloc[train_end:val_end] 69 | test = df_shuffled.iloc[val_end:] 70 | 71 | # Save to CSV files 72 | train.to_csv("train.csv", index=False) 73 | val.to_csv("validation.csv", index=False) 74 | test.to_csv("test.csv", index=False) 75 | 76 | 77 | if __name__ == "__main__": 78 | dataset_url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz" 79 | print("Downloading dataset ...") 80 | download_and_extract_dataset(dataset_url, "aclImdb_v1.tar.gz", "aclImdb") 81 | print("Creating data frames ...") 82 | df = load_dataset_to_dataframe() 83 | print("Partitioning and saving data frames ...") 84 | partition_and_save(df) 85 | -------------------------------------------------------------------------------- /ch07/01_main-chapter-code/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 7: Finetuning to Follow Instructions 2 | 3 | ### Main Chapter Code 4 | 5 | - [ch07.ipynb](ch07.ipynb) contains all the code as it appears in the chapter 6 | - [previous_chapters.py](previous_chapters.py) is a Python module that contains the GPT model we coded and trained in previous chapters, alongside many utility functions, which we reuse in this chapter 7 | - [gpt_download.py](gpt_download.py) contains the utility functions for downloading the pretrained GPT model weights 8 | - [exercise-solutions.ipynb](exercise-solutions.ipynb) contains the exercise solutions for this chapter 9 | 10 | 11 | ### Optional Code 12 | 13 | - [load-finetuned-model.ipynb](load-finetuned-model.ipynb) is a standalone Jupyter notebook to load the instruction finetuned model we created in this chapter 14 | 15 | - [gpt_instruction_finetuning.py](gpt_instruction_finetuning.py) is a standalone Python script to instruction finetune the model as described in the main chapter (think of it as a chapter summary focused on the finetuning parts) 16 | 17 | Usage: 18 | 19 | ```bash 20 | python gpt_instruction_finetuning.py 21 | ``` 22 | 23 | ``` 24 | matplotlib version: 3.9.0 25 | tiktoken version: 0.7.0 26 | torch version: 2.3.1 27 | tqdm version: 4.66.4 28 | tensorflow version: 2.16.1 29 | -------------------------------------------------- 30 | Training set length: 935 31 | Validation set length: 55 32 | Test set length: 110 33 | -------------------------------------------------- 34 | Device: cpu 35 | 
-------------------------------------------------- 36 | File already exists and is up-to-date: gpt2/355M/checkpoint 37 | File already exists and is up-to-date: gpt2/355M/encoder.json 38 | File already exists and is up-to-date: gpt2/355M/hparams.json 39 | File already exists and is up-to-date: gpt2/355M/model.ckpt.data-00000-of-00001 40 | File already exists and is up-to-date: gpt2/355M/model.ckpt.index 41 | File already exists and is up-to-date: gpt2/355M/model.ckpt.meta 42 | File already exists and is up-to-date: gpt2/355M/vocab.bpe 43 | Loaded model: gpt2-medium (355M) 44 | -------------------------------------------------- 45 | Initial losses 46 | Training loss: 3.839039182662964 47 | Validation loss: 3.7619192123413088 48 | Ep 1 (Step 000000): Train loss 2.611, Val loss 2.668 49 | Ep 1 (Step 000005): Train loss 1.161, Val loss 1.131 50 | Ep 1 (Step 000010): Train loss 0.939, Val loss 0.973 51 | ... 52 | Training completed in 15.66 minutes. 53 | Plot saved as loss-plot-standalone.pdf 54 | -------------------------------------------------- 55 | Generating responses 56 | 100%|█████████████████████████████████████████████████████████| 110/110 [06:57<00:00, 3.80s/it] 57 | Responses saved as instruction-data-with-response-standalone.json 58 | Model saved as gpt2-medium355M-sft-standalone.pth 59 | ``` 60 | 61 | - [ollama_evaluate.py](ollama_evaluate.py) is a standalone Python script to evaluate the responses of the finetuned model as described in the main chapter (think of it as a chapter summary focused on the evaluation parts) 62 | 63 | Usage: 64 | 65 | ```bash 66 | python ollama_evaluate.py --file_path instruction-data-with-response-standalone.json 67 | ``` 68 | 69 | ``` 70 | Ollama running: True 71 | Scoring entries: 100%|███████████████████████████████████████| 110/110 [01:08<00:00, 1.62it/s] 72 | Number of scores: 110 of 110 73 | Average score: 51.75 74 | ``` 75 | 76 | - [exercise_experiments.py](exercise_experiments.py) is an optional script that implements the exercise solutions; for more details see [exercise-solutions.ipynb](exercise-solutions.ipynb) 77 | -------------------------------------------------------------------------------- /ch05/03_bonus_pretraining_on_gutenberg/prepare_dataset.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | """ 7 | Script that processes the Project Gutenberg files into fewer larger files. 8 | """ 9 | 10 | import argparse 11 | import os 12 | import re 13 | from tqdm import tqdm 14 | from gutenberg.src.cleanup import strip_headers 15 | 16 | 17 | def is_english(text, threshold=0.9): 18 | ascii_chars = sum(1 for c in text if ord(c) < 128) 19 | return ascii_chars / len(text) > threshold 20 | 21 | 22 | def combine_files(file_paths, target_dir, max_size_mb=500, separator="<|endoftext|>", fallback_encoding="latin1"): 23 | if not os.path.exists(target_dir): 24 | os.makedirs(target_dir) 25 | 26 | current_content = [] 27 | current_size = 0 28 | file_counter = 1 29 | 30 | for file_path in tqdm(file_paths): 31 | try: 32 | with open(file_path, "r", encoding="utf-8") as file: 33 | content = file.read() 34 | except UnicodeDecodeError: 35 | # Attempt to read the file with a fallback encoding 36 | tqdm.write(f"Warning: UnicodeDecodeError encountered. 
Trying fallback encoding for {file_path}") 37 | with open(file_path, "r", encoding=fallback_encoding) as file: 38 | content = file.read() 39 | 40 | if not is_english(content): 41 | tqdm.write(f"Skipping {file_path} as it does not contain primarily English text.") 42 | continue 43 | content = strip_headers(content) 44 | 45 | # Regular expression to replace multiple blank lines with a single blank line 46 | content = re.sub(r'\n\s*\n', '\n\n', content) 47 | estimated_size = len(content.encode("utf-8")) 48 | 49 | if current_size + estimated_size > max_size_mb * 1024 * 1024: 50 | target_file_path = os.path.join(target_dir, f"combined_{file_counter}.txt") 51 | with open(target_file_path, "w", encoding="utf-8") as target_file: 52 | target_file.write(separator.join(current_content)) 53 | file_counter += 1 54 | current_content = [content] 55 | current_size = estimated_size 56 | else: 57 | current_content.append(content) 58 | current_size += estimated_size 59 | 60 | if current_content: 61 | target_file_path = os.path.join(target_dir, f"combined_{file_counter}.txt") 62 | with open(target_file_path, "w", encoding="utf-8") as target_file: 63 | target_file.write(separator.join(current_content)) 64 | return file_counter 65 | 66 | 67 | if __name__ == "__main__": 68 | 69 | parser = argparse.ArgumentParser(description="Preprocess and combine text files for pretraining") 70 | 71 | parser.add_argument("--data_dir", type=str, default="gutenberg/data/raw", 72 | help="Directory containing the downloaded raw training data") 73 | parser.add_argument("--max_size_mb", type=int, default=500, 74 | help="The maximum file size for each concatenated file in megabytes") 75 | parser.add_argument("--output_dir", type=str, default="gutenberg_preprocessed", 76 | help="Directory where the preprocessed data will be saved") 77 | 78 | args = parser.parse_args() 79 | 80 | all_files = [os.path.join(path, name) for path, subdirs, files in os.walk(args.data_dir) 81 | for name in files if name.endswith((".txt", ".txt.utf8"))] 82 | 83 | print(f"{len(all_files)} file(s) to process.") 84 | file_counter = combine_files(all_files, args.output_dir, max_size_mb=args.max_size_mb) 85 | print(f"{file_counter} file(s) saved in {os.path.abspath(args.output_dir)}") 86 | -------------------------------------------------------------------------------- /setup/02_installing-python-libraries/python_environment_check.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 
2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | from importlib.metadata import PackageNotFoundError, import_module 7 | import importlib.metadata 8 | from os.path import dirname, exists, join, realpath 9 | from packaging.version import parse as version_parse 10 | import platform 11 | import sys 12 | 13 | if version_parse(platform.python_version()) < version_parse("3.9"): 14 | print("[FAIL] We recommend Python 3.9 or newer but" 15 | " found version %s" % (sys.version)) 16 | else: 17 | print("[OK] Your Python version is %s" % (platform.python_version())) 18 | 19 | 20 | def get_packages(pkgs): 21 | versions = [] 22 | for p in pkgs: 23 | try: 24 | imported = import_module(p) 25 | try: 26 | version = (getattr(imported, "__version__", None) or 27 | getattr(imported, "version", None) or 28 | getattr(imported, "version_info", None)) 29 | if version is None: 30 | # If common attributes don't exist, use importlib.metadata 31 | version = importlib.metadata.version(p) 32 | versions.append(version) 33 | except PackageNotFoundError: 34 | # Handle case where package is not installed 35 | versions.append("0.0") 36 | except ImportError: 37 | # Fallback if importlib.import_module fails for unexpected reasons 38 | versions.append("0.0") 39 | return versions 40 | 41 | 42 | def get_requirements_dict(): 43 | PROJECT_ROOT = dirname(realpath(__file__)) 44 | PROJECT_ROOT_UP_TWO = dirname(dirname(PROJECT_ROOT)) 45 | REQUIREMENTS_FILE = join(PROJECT_ROOT_UP_TWO, "requirements.txt") 46 | if not exists(REQUIREMENTS_FILE): 47 | REQUIREMENTS_FILE = join(PROJECT_ROOT, "requirements.txt") 48 | 49 | d = {} 50 | with open(REQUIREMENTS_FILE) as f: 51 | for line in f: 52 | if not line.strip(): 53 | continue 54 | if "," in line: 55 | left, right = line.split(",") 56 | lower = right.split("#")[0].strip() 57 | package, _, upper = left.split(" ") 58 | package = package.strip() 59 | _, lower = lower.split(" ") 60 | lower = lower.strip() 61 | upper = upper.strip() 62 | d[package] = (upper, lower) 63 | else: 64 | line = line.split("#")[0].strip() 65 | line = line.split(" ") 66 | line = [ln.strip() for ln in line] 67 | d[line[0]] = line[-1] 68 | return d 69 | 70 | 71 | def check_packages(d): 72 | versions = get_packages(d.keys()) 73 | 74 | for (pkg_name, suggested_ver), actual_ver in zip(d.items(), versions): 75 | if isinstance(suggested_ver, tuple): 76 | lower, upper = suggested_ver[0], suggested_ver[1] 77 | else: 78 | lower = suggested_ver 79 | upper = None 80 | if actual_ver == "N/A": 81 | continue 82 | actual_ver = version_parse(actual_ver) 83 | lower = version_parse(lower) 84 | if upper is not None: 85 | upper = version_parse(upper) 86 | if actual_ver < lower and upper is None: 87 | print(f"[FAIL] {pkg_name} {actual_ver}, please upgrade to >= {lower}") 88 | elif actual_ver < lower: 89 | print(f"[FAIL] {pkg_name} {actual_ver}, please upgrade to >= {lower} and < {upper}") 90 | elif upper is not None and actual_ver >= upper: 91 | print(f"[FAIL] {pkg_name} {actual_ver}, please downgrade to >= {lower} and < {upper}") 92 | else: 93 | print(f"[OK] {pkg_name} {actual_ver}") 94 | 95 | 96 | def main(): 97 | d = get_requirements_dict() 98 | check_packages(d) 99 | 100 | 101 | if __name__ == "__main__": 102 | main() 103 | -------------------------------------------------------------------------------- /ch06/03_bonus_imdb-classification/README.md: 
-------------------------------------------------------------------------------- 1 | # Additional Experiments Classifying the Sentiment of 50k IMDB Movie Reviews 2 | 3 | &nbsp; 4 | ## Step 1: Install Dependencies 5 | 6 | Install the extra dependencies via 7 | 8 | ```bash 9 | pip install -r requirements-extra.txt 10 | ``` 11 | 12 | &nbsp; 13 | ## Step 2: Download Dataset 14 | 15 | The code uses the 50k movie reviews from IMDb ([dataset source](https://ai.stanford.edu/~amaas/data/sentiment/)) to predict whether a movie review is positive or negative. 16 | 17 | Run the following code to create the `train.csv`, `validation.csv`, and `test.csv` datasets: 18 | 19 | ```bash 20 | python download_prepare_dataset.py 21 | ``` 22 | 23 | 24 | &nbsp; 25 | ## Step 3: Run Models 26 | 27 | The 124M GPT-2 model used in the main chapter, starting from the pretrained weights and finetuning all weights: 28 | 29 | ```bash 30 | python train_gpt.py --trainable_layers "all" --num_epochs 1 31 | ``` 32 | 33 | ``` 34 | Ep 1 (Step 000000): Train loss 3.706, Val loss 3.853 35 | Ep 1 (Step 000050): Train loss 0.682, Val loss 0.706 36 | ... 37 | Ep 1 (Step 004300): Train loss 0.199, Val loss 0.285 38 | Ep 1 (Step 004350): Train loss 0.188, Val loss 0.208 39 | Training accuracy: 95.62% | Validation accuracy: 95.00% 40 | Training completed in 9.48 minutes. 41 | 42 | Evaluating on the full datasets ... 43 | 44 | Training accuracy: 95.64% 45 | Validation accuracy: 92.32% 46 | Test accuracy: 91.88% 47 | ``` 48 | 49 | 50 |
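If you want to sanity-check the generated CSV files before training, a quick look with pandas works; the expected column names and split sizes below follow the `download_prepare_dataset.py` script shown earlier:

```python
import pandas as pd

df_train = pd.read_csv("train.csv")
print(df_train.shape)                    # expected: (35000, 2)
print(df_train.columns.tolist())         # expected: ['text', 'label']
print(df_train["label"].value_counts())  # the two classes should be roughly balanced
```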
51 | 52 | --- 53 | 54 |
55 | 56 | A 340M parameter encoder-style [BERT](https://arxiv.org/abs/1810.04805) model: 57 | 58 | ```bash 59 | python train_bert_hf.py --trainable_layers "all" --num_epochs 1 --model "bert" 60 | ``` 61 | 62 | ``` 63 | Ep 1 (Step 000000): Train loss 0.848, Val loss 0.775 64 | Ep 1 (Step 000050): Train loss 0.655, Val loss 0.682 65 | ... 66 | Ep 1 (Step 004300): Train loss 0.146, Val loss 0.318 67 | Ep 1 (Step 004350): Train loss 0.204, Val loss 0.217 68 | Training accuracy: 92.50% | Validation accuracy: 88.75% 69 | Training completed in 7.65 minutes. 70 | 71 | Evaluating on the full datasets ... 72 | 73 | Training accuracy: 94.35% 74 | Validation accuracy: 90.74% 75 | Test accuracy: 90.89% 76 | ``` 77 | 78 |
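For context, loading such an encoder-style classifier via the Hugging Face `transformers` API looks roughly like the sketch below; the checkpoint name is an assumption (`bert-large-uncased` is the standard ~340M-parameter checkpoint), and this is not necessarily the exact code in `train_bert_hf.py`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")  # assumed checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2  # binary sentiment classification
)
```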
79 | 80 | --- 81 | 82 |
83 | 84 | A 66M parameter encoder-style [DistilBERT](https://arxiv.org/abs/1910.01108) model (distilled down from a 340M parameter BERT model), starting from the pretrained weights and finetuning all weights (note the `--trainable_layers "all"` setting in the command below): 85 | 86 | 87 | 88 | ```bash 89 | python train_bert_hf.py --trainable_layers "all" --num_epochs 1 --model "distilbert" 90 | ``` 91 | 92 | ``` 93 | Ep 1 (Step 000000): Train loss 0.693, Val loss 0.688 94 | Ep 1 (Step 000050): Train loss 0.452, Val loss 0.460 95 | ... 96 | Ep 1 (Step 004300): Train loss 0.179, Val loss 0.272 97 | Ep 1 (Step 004350): Train loss 0.199, Val loss 0.182 98 | Training accuracy: 95.62% | Validation accuracy: 91.25% 99 | Training completed in 4.26 minutes. 100 | 101 | Evaluating on the full datasets ... 102 | 103 | Training accuracy: 95.30% 104 | Validation accuracy: 91.12% 105 | Test accuracy: 91.40% 106 | ``` 107 |
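To see why DistilBERT is cheaper to finetune, you can compare parameter counts directly. This is a sketch using standard Hub checkpoints as stand-ins for the 340M and 66M models mentioned above; the actual checkpoints used by `train_bert_hf.py` may differ:

```python
from transformers import AutoModelForSequenceClassification

for name in ("bert-large-uncased", "distilbert-base-uncased"):  # assumed checkpoints
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```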
108 | 109 | --- 110 | 111 |
112 | 113 | A 355M parameter encoder-style [RoBERTa](https://arxiv.org/abs/1907.11692) model, starting from the pretrained weights and only training the last transformer block plus output layers: 114 | 115 | 116 | ```bash 117 | python train_bert_hf.py --trainable_layers "last_block" --num_epochs 1 --model "roberta" 118 | ``` 119 | 120 | ``` 121 | Ep 1 (Step 000000): Train loss 0.695, Val loss 0.698 122 | Ep 1 (Step 000050): Train loss 0.670, Val loss 0.690 123 | ... 124 | Ep 1 (Step 004300): Train loss 0.126, Val loss 0.149 125 | Ep 1 (Step 004350): Train loss 0.211, Val loss 0.138 126 | Training accuracy: 92.50% | Validation accuracy: 94.38% 127 | Training completed in 7.20 minutes. 128 | 129 | Evaluating on the full datasets ... 130 | 131 | Training accuracy: 93.44% 132 | Validation accuracy: 93.02% 133 | Test accuracy: 92.95% 134 | ``` 135 | 136 | 137 |
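For reference, freezing everything except the last transformer block and the output layers can be done along the following lines; this is a minimal sketch based on the public `transformers` API, not the actual `train_bert_hf.py` implementation, and the checkpoint name is an assumption:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=2  # assumed ~355M-parameter checkpoint
)

# Freeze all parameters first
for param in model.parameters():
    param.requires_grad = False

# Then unfreeze the last transformer block and the classification head
for param in model.roberta.encoder.layer[-1].parameters():
    param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True
```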
138 | 139 | --- 140 | 141 |
142 | 143 | A scikit-learn logistic regression classifier as a baseline: 144 | 145 | 146 | ```bash 147 | python train_sklearn_logreg.py 148 | ``` 149 | 150 | ``` 151 | Dummy classifier: 152 | Training Accuracy: 50.01% 153 | Validation Accuracy: 50.14% 154 | Test Accuracy: 49.91% 155 | 156 | 157 | Logistic regression classifier: 158 | Training Accuracy: 99.80% 159 | Validation Accuracy: 88.62% 160 | Test Accuracy: 88.85% 161 | ``` 162 | -------------------------------------------------------------------------------- /ch07/01_main-chapter-code/ollama_evaluate.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | # 6 | # A minimal model-response evaluation file based on the code in chapter 7 7 | 8 | import json 9 | import psutil 10 | from tqdm import tqdm 11 | import urllib.request 12 | 13 | 14 | def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"): 15 | # Create the data payload as a dictionary 16 | data = { 17 | "model": model, 18 | "messages": [ 19 | {"role": "user", "content": prompt} 20 | ], 21 | "options": { # Settings below are required for deterministic responses 22 | "seed": 123, 23 | "temperature": 0, 24 | "num_ctx": 2048 25 | } 26 | } 27 | 28 | # Convert the dictionary to a JSON formatted string and encode it to bytes 29 | payload = json.dumps(data).encode("utf-8") 30 | 31 | # Create a request object, setting the method to POST and adding necessary headers 32 | request = urllib.request.Request(url, data=payload, method="POST") 33 | request.add_header("Content-Type", "application/json") 34 | 35 | # Send the request and capture the response 36 | response_data = "" 37 | with urllib.request.urlopen(request) as response: 38 | # Read and decode the response 39 | while True: 40 | line = response.readline().decode("utf-8") 41 | if not line: 42 | break 43 | response_json = json.loads(line) 44 | response_data += response_json["message"]["content"] 45 | 46 | return response_data 47 | 48 | 49 | def check_if_running(process_name): 50 | running = False 51 | for proc in psutil.process_iter(["name"]): 52 | if process_name in proc.info["name"]: 53 | running = True 54 | break 55 | return running 56 | 57 | 58 | def format_input(entry): 59 | instruction_text = ( 60 | f"Below is an instruction that describes a task. " 61 | f"Write a response that appropriately completes the request." 62 | f"\n\n### Instruction:\n{entry['instruction']}" 63 | ) 64 | 65 | input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else "" 66 | 67 | return instruction_text + input_text 68 | 69 | 70 | def main(file_path): 71 | ollama_running = check_if_running("ollama") 72 | 73 | if not ollama_running: 74 | raise RuntimeError("Ollama not running. 
Launch ollama before proceeding.") 75 | print("Ollama running:", check_if_running("ollama")) 76 | 77 | with open(file_path, "r") as file: 78 | test_data = json.load(file) 79 | 80 | model = "llama3" 81 | scores = generate_model_scores(test_data, "model_response", model) 82 | print(f"Number of scores: {len(scores)} of {len(test_data)}") 83 | print(f"Average score: {sum(scores)/len(scores):.2f}\n") 84 | 85 | 86 | def generate_model_scores(json_data, json_key, model="llama3"): 87 | scores = [] 88 | for entry in tqdm(json_data, desc="Scoring entries"): 89 | if entry[json_key] == "": 90 | scores.append(0) 91 | else: 92 | prompt = ( 93 | f"Given the input `{format_input(entry)}` " 94 | f"and correct output `{entry['output']}`, " 95 | f"score the model response `{entry[json_key]}`" 96 | f" on a scale from 0 to 100, where 100 is the best score. " 97 | f"Respond with the integer number only." 98 | ) 99 | score = query_model(prompt, model) 100 | try: 101 | scores.append(int(score)) 102 | except ValueError: 103 | print(f"Could not convert score: {score}") 104 | continue 105 | 106 | return scores 107 | 108 | 109 | if __name__ == "__main__": 110 | 111 | import argparse 112 | 113 | parser = argparse.ArgumentParser( 114 | description="Evaluate model responses with ollama" 115 | ) 116 | parser.add_argument( 117 | "--file_path", 118 | required=True, 119 | help=( 120 | "The path to the test dataset `.json` file with the" 121 | " `'output'` and `'model_response'` keys" 122 | ) 123 | ) 124 | args = parser.parse_args() 125 | 126 | main(file_path=args.file_path) 127 | -------------------------------------------------------------------------------- /ch04/01_main-chapter-code/previous_chapters.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 
2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | import tiktoken 7 | import torch 8 | import torch.nn as nn 9 | from torch.utils.data import Dataset, DataLoader 10 | 11 | 12 | class GPTDatasetV1(Dataset): 13 | def __init__(self, txt, tokenizer, max_length, stride): 14 | self.input_ids = [] 15 | self.target_ids = [] 16 | 17 | # Tokenize the entire text 18 | token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"}) 19 | 20 | # Use a sliding window to chunk the book into overlapping sequences of max_length 21 | for i in range(0, len(token_ids) - max_length, stride): 22 | input_chunk = token_ids[i:i + max_length] 23 | target_chunk = token_ids[i + 1: i + max_length + 1] 24 | self.input_ids.append(torch.tensor(input_chunk)) 25 | self.target_ids.append(torch.tensor(target_chunk)) 26 | 27 | def __len__(self): 28 | return len(self.input_ids) 29 | 30 | def __getitem__(self, idx): 31 | return self.input_ids[idx], self.target_ids[idx] 32 | 33 | 34 | def create_dataloader_v1(txt, batch_size=4, max_length=256, 35 | stride=128, shuffle=True, drop_last=True, num_workers=0): 36 | # Initialize the tokenizer 37 | tokenizer = tiktoken.get_encoding("gpt2") 38 | 39 | # Create dataset 40 | dataset = GPTDatasetV1(txt, tokenizer, max_length, stride) 41 | 42 | # Create dataloader 43 | dataloader = DataLoader( 44 | dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers) 45 | 46 | return dataloader 47 | 48 | 49 | class MultiHeadAttention(nn.Module): 50 | def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False): 51 | super().__init__() 52 | assert d_out % num_heads == 0, "d_out must be divisible by num_heads" 53 | 54 | self.d_out = d_out 55 | self.num_heads = num_heads 56 | self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim 57 | 58 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 59 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) 60 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 61 | self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs 62 | self.dropout = nn.Dropout(dropout) 63 | self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) 64 | 65 | def forward(self, x): 66 | b, num_tokens, d_in = x.shape 67 | 68 | keys = self.W_key(x) # Shape: (b, num_tokens, d_out) 69 | queries = self.W_query(x) 70 | values = self.W_value(x) 71 | 72 | # We implicitly split the matrix by adding a `num_heads` dimension 73 | # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim) 74 | keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 75 | values = values.view(b, num_tokens, self.num_heads, self.head_dim) 76 | queries = queries.view(b, num_tokens, self.num_heads, self.head_dim) 77 | 78 | # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim) 79 | keys = keys.transpose(1, 2) 80 | queries = queries.transpose(1, 2) 81 | values = values.transpose(1, 2) 82 | 83 | # Compute scaled dot-product attention (aka self-attention) with a causal mask 84 | attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head 85 | 86 | # Original mask truncated to the number of tokens and converted to boolean 87 | mask_bool = self.mask.bool()[:num_tokens, :num_tokens] 88 | 89 | # Use the mask to fill attention scores 90 | 
attn_scores.masked_fill_(mask_bool, -torch.inf) 91 | 92 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 93 | attn_weights = self.dropout(attn_weights) 94 | 95 | # Shape: (b, num_tokens, num_heads, head_dim) 96 | context_vec = (attn_weights @ values).transpose(1, 2) 97 | 98 | # Combine heads, where self.d_out = self.num_heads * self.head_dim 99 | context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out) 100 | context_vec = self.out_proj(context_vec) # optional projection 101 | 102 | return context_vec 103 | -------------------------------------------------------------------------------- /setup/01_optional-python-setup-preferences/README.md: -------------------------------------------------------------------------------- 1 | # Python Setup Tips 2 | 3 | 4 | 5 | There are several different ways you can install Python and set up your computing environment. Here, I am illustrating my personal preference. 6 | 7 | (I am using computers running macOS, but this workflow is similar for Linux machines and may work for other operating systems as well.) 8 | 9 | 10 |
11 |
12 | 13 | 14 | ## 1. Download and install Miniforge 15 | 16 | Download miniforge from the GitHub repository [here](https://github.com/conda-forge/miniforge). 17 | 18 | download 19 | 20 | Depending on your operating system, this should download either an `.sh` (macOS, Linux) or `.exe` file (Windows). 21 | 22 | For the `.sh` file, open your command line terminal and execute the following command 23 | 24 | ```bash 25 | sh ~/Desktop/Miniforge3-MacOSX-arm64.sh 26 | ``` 27 | 28 | where `Desktop/` is the folder where the Miniforge installer was downloaded to. On your computer, you may have to replace it with `Downloads/`. 29 | 30 | miniforge-install 31 | 32 | Next, step through the download instructions, confirming with "Enter". 33 | 34 | 35 | 36 | If you work with many packages, Conda can be slow because of its thorough but complex dependency resolution process and the handling of large package indexes and metadata. To speed up Conda, you can use the following setting, which switches to a more efficient Rust reimplementation for solving dependencies: 37 | 38 | ``` 39 | conda config --set solver libmamba 40 | ``` 41 | 42 |
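To confirm that the setting took effect, you can inspect the configuration; the exact output format may vary slightly across conda versions:

```bash
conda config --show solver
# expected output: solver: libmamba
```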
43 |
44 | 45 | 46 | ## 2. Create a new virtual environment 47 | 48 | After the installation was successfully completed, I recommend creating a new virtual environment called `LLMs`, which you can do by executing 49 | 50 | ```bash 51 | conda create -n LLMs python=3.10 52 | ``` 53 | 54 | new-env 55 | 56 | > Many scientific computing libraries do not immediately support the newest version of Python. Therefore, when installing PyTorch, it's advisable to use a version of Python that is one or two releases older. For instance, if the latest version of Python is 3.13, using Python 3.10 or 3.11 is recommended. 57 | 58 | Next, activate your new virtual environment (you have to do it every time you open a new terminal window or tab): 59 | 60 | ```bash 61 | conda activate LLMs 62 | ``` 63 | 64 | activate-env 65 | 66 |
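A quick way to double-check that the environment was created and is active:

```bash
conda env list    # the active environment is marked with an asterisk
python --version  # should report a Python 3.10.x version inside the LLMs environment
```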
67 |
68 | 69 | ## Optional: styling your terminal 70 | 71 | If you want to style your terminal similar to mine so that you can see which virtual environment is active, check out the [Oh My Zsh](https://github.com/ohmyzsh/ohmyzsh) project. 72 | 73 |
74 |
75 | 76 | ## 3. Install new Python libraries 77 | 78 | 79 | 80 | To install new Python libraries, you can now use the `conda` package installer. For example, you can install [JupyterLab](https://jupyter.org/install) and [watermark](https://github.com/rasbt/watermark) as follows: 81 | 82 | ```bash 83 | conda install jupyterlab watermark 84 | ``` 85 | 86 | conda-install 87 | 88 | 89 | 90 | You can also still use `pip` to install libraries. By default, `pip` should be linked to your new `LLMs` conda environment: 91 | 92 | check-pip 93 | 94 |
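You can verify this on the command line; the example path below is what a typical Miniforge setup might show, so yours may differ:

```bash
which pip     # e.g., /Users/<you>/miniforge3/envs/LLMs/bin/pip
pip --version
```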
95 |
96 | 97 | ## 4. Install PyTorch 98 | 99 | PyTorch can be installed just like any other Python library or package using pip. For example: 100 | 101 | ```bash 102 | pip install torch==2.0.1 103 | ``` 104 | 105 | However, since PyTorch is a comprehensive library featuring CPU- and GPU-compatible code, the installation may require additional settings and explanation (see section *A.1.3, Installing PyTorch*, in the book for more information). 106 | 107 | It's also highly recommended to consult the installation guide menu on the official PyTorch website at [https://pytorch.org](https://pytorch.org). 108 | 109 | 110 | 111 | 112 | 113 | --- 114 | 115 | 116 | 117 | 118 | Any questions? Please feel free to reach out in the [Discussion Forum](https://github.com/rasbt/LLMs-from-scratch/discussions). -------------------------------------------------------------------------------- /ch03/02_bonus_efficient-multihead-attention/ch03.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | # 6 | # This file contains the relevant code from chapter 3 that is going to be used 7 | # in forthcoming chapters. 8 | 9 | import torch 10 | import torch.nn as nn 11 | 12 | 13 | class CausalAttention(nn.Module): 14 | 15 | def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False): 16 | super().__init__() 17 | self.d_out = d_out 18 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 19 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) 20 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 21 | self.dropout = nn.Dropout(dropout) # New 22 | self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New 23 | 24 | def forward(self, x): 25 | b, num_tokens, d_in = x.shape # New batch dimension b 26 | keys = self.W_key(x) 27 | queries = self.W_query(x) 28 | values = self.W_value(x) 29 | 30 | attn_scores = queries @ keys.transpose(1, 2) # Changed transpose 31 | attn_scores.masked_fill_( # New, _ ops are in-place 32 | self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) 33 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 34 | attn_weights = self.dropout(attn_weights) # New 35 | 36 | context_vec = attn_weights @ values 37 | return context_vec 38 | 39 | 40 | class MultiHeadAttentionWrapper(nn.Module): 41 | 42 | def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False): 43 | super().__init__() 44 | self.heads = nn.ModuleList( 45 | [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias) 46 | for _ in range(num_heads)] 47 | ) 48 | self.out_proj = nn.Linear(d_out*num_heads, d_out*num_heads) 49 | 50 | def forward(self, x): 51 | context_vec = torch.cat([head(x) for head in self.heads], dim=-1) 52 | return self.out_proj(context_vec) 53 | 54 | 55 | class MultiHeadAttention(nn.Module): 56 | def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False): 57 | super().__init__() 58 | assert d_out % num_heads == 0, "d_out must be divisible by num_heads" 59 | 60 | self.d_out = d_out 61 | self.num_heads = num_heads 62 | self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim 63 | 64 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 65 | self.W_key = 
nn.Linear(d_in, d_out, bias=qkv_bias) 66 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 67 | self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs 68 | self.dropout = nn.Dropout(dropout) 69 | self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) 70 | 71 | def forward(self, x): 72 | b, num_tokens, d_in = x.shape 73 | 74 | keys = self.W_key(x) # Shape: (b, num_tokens, d_out) 75 | queries = self.W_query(x) 76 | values = self.W_value(x) 77 | 78 | # We implicitly split the matrix by adding a `num_heads` dimension 79 | # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim) 80 | keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 81 | values = values.view(b, num_tokens, self.num_heads, self.head_dim) 82 | queries = queries.view(b, num_tokens, self.num_heads, self.head_dim) 83 | 84 | # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim) 85 | keys = keys.transpose(1, 2) 86 | queries = queries.transpose(1, 2) 87 | values = values.transpose(1, 2) 88 | 89 | # Compute scaled dot-product attention (aka self-attention) with a causal mask 90 | attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head 91 | 92 | # Original mask truncated to the number of tokens and converted to boolean 93 | mask_bool = self.mask.bool()[:num_tokens, :num_tokens] 94 | 95 | # Use the mask to fill attention scores 96 | attn_scores.masked_fill_(mask_bool, -torch.inf) 97 | 98 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 99 | attn_weights = self.dropout(attn_weights) 100 | 101 | # Shape: (b, num_tokens, num_heads, head_dim) 102 | context_vec = (attn_weights @ values).transpose(1, 2) 103 | 104 | # Combine heads, where self.d_out = self.num_heads * self.head_dim 105 | context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out) 106 | context_vec = self.out_proj(context_vec) # optional projection 107 | 108 | return context_vec 109 | -------------------------------------------------------------------------------- /ch04/02_performance-analysis/flops-analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "\n", 9 | "\n", 15 | "\n", 18 | "\n", 19 | "
\n", 10 | "\n", 11 | "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka
\n", 12 | "
Code repository: https://github.com/rasbt/LLMs-from-scratch\n", 13 | "
\n", 14 | "
\n", 16 | "\n", 17 | "
" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "## FLOPS Analysis" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "- FLOPs (Floating Point Operations Per Second) measure the computational complexity of neural network models by counting the number of floating-point operations executed\n", 34 | "- High FLOPs indicate more intensive computation and energy consumption" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 1, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "# pip install -r requirements-extra.txt" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 2, 49 | "metadata": {}, 50 | "outputs": [ 51 | { 52 | "name": "stdout", 53 | "output_type": "stream", 54 | "text": [ 55 | "thop version: 0.1.1-2209072238\n", 56 | "torch version: 2.2.2\n", 57 | "tiktoken version: 0.5.1\n" 58 | ] 59 | } 60 | ], 61 | "source": [ 62 | "from importlib.metadata import version\n", 63 | "\n", 64 | "import matplotlib\n", 65 | "import torch\n", 66 | "\n", 67 | "print(\"thop version:\", version(\"thop\"))\n", 68 | "print(\"torch version:\", version(\"torch\"))" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": { 75 | "colab": { 76 | "base_uri": "https://localhost:8080/" 77 | }, 78 | "id": "GerIdRMXd6g9", 79 | "outputId": "ccdd5c71-d221-4a84-f9bc-09557e77162d" 80 | }, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "gpt-small (124M) : 5.1e+11 FLOPS\n", 87 | "gpt-medium (355M) : 1.4e+12 FLOPS\n", 88 | "gpt-large (774M) : 3.2e+12 FLOPS\n", 89 | "gpt-xl (1558M) : 6.4e+12 FLOPS\n" 90 | ] 91 | } 92 | ], 93 | "source": [ 94 | "import torch\n", 95 | "from thop import profile\n", 96 | "\n", 97 | "from previous_chapters import GPTModel\n", 98 | "\n", 99 | "\n", 100 | "BASE_CONFIG = {\n", 101 | " \"vocab_size\": 50257, # Vocabulary size\n", 102 | " \"context_length\": 1024, # Context length\n", 103 | " \"drop_rate\": 0.0, # Dropout rate\n", 104 | " \"qkv_bias\": True # Query-key-value bias\n", 105 | "}\n", 106 | "\n", 107 | "model_configs = {\n", 108 | " \"gpt-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n", 109 | " \"gpt-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n", 110 | " \"gpt-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n", 111 | " \"gpt-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n", 112 | "}\n", 113 | "\n", 114 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", 115 | "input_tensor = torch.randint(0, 50257, (2, 1024)).to(device)\n", 116 | "\n", 117 | "for size in model_configs:\n", 118 | " BASE_CONFIG.update(model_configs[size])\n", 119 | " \n", 120 | " model = GPTModel(BASE_CONFIG).bfloat16()\n", 121 | " model.to(device)\n", 122 | "\n", 123 | " # MACS = multiply-accumulate operations\n", 124 | " # MACS are typically counted as two FLOPS (one multiply and one accumulate)\n", 125 | " macs, params = profile(model, inputs=(input_tensor,), verbose=False)\n", 126 | " flops = 2*macs\n", 127 | " print(f\"{size:18}: {flops:.1e} FLOPS\")\n", 128 | " \n", 129 | " del model\n", 130 | " torch.cuda.empty_cache()" 131 | ] 132 | } 133 | ], 134 | "metadata": { 135 | "accelerator": "GPU", 136 | "colab": { 137 | "gpuType": "A100", 138 | "machine_shape": "hm", 139 | "provenance": [] 140 | }, 141 | "kernelspec": { 142 | "display_name": "Python 3 (ipykernel)", 
143 | "language": "python", 144 | "name": "python3" 145 | }, 146 | "language_info": { 147 | "codemirror_mode": { 148 | "name": "ipython", 149 | "version": 3 150 | }, 151 | "file_extension": ".py", 152 | "mimetype": "text/x-python", 153 | "name": "python", 154 | "nbconvert_exporter": "python", 155 | "pygments_lexer": "ipython3", 156 | "version": "3.11.4" 157 | } 158 | }, 159 | "nbformat": 4, 160 | "nbformat_minor": 4 161 | } 162 | -------------------------------------------------------------------------------- /setup/README.md: -------------------------------------------------------------------------------- 1 | # Optional Setup Instructions 2 | 3 | 4 | This document lists different approaches for setting up your machine and using the code in this repository. I recommend browsing through the different sections from top to bottom and then deciding which approach best suits your needs. 5 | 6 |   7 | 8 | ## Quickstart 9 | 10 | If you already have a Python installation on your machine, the quickest way to get started is to install the package requirements from the [../requirements.txt](../requirements.txt) file by executing the following pip installation command from the root directory of this code repository: 11 | 12 | ```bash 13 | pip install -r requirements.txt 14 | ``` 15 | 16 |   17 | 18 | # Local Setup 19 | 20 | This section provides recommendations for running the code in this book locally. Note that the code in the main chapters of this book is designed to run on conventional laptops within a reasonable timeframe and does not require specialized hardware. I tested all main chapters on an M3 MacBook Air laptop. Additionally, if your laptop or desktop computer has an NVIDIA GPU, the code will automatically take advantage of it. 21 | 22 |   23 | ## Setting up Python 24 | 25 | If you don't have Python set up on your machine yet, I have written about my personal Python setup preferences in the following directories: 26 | 27 | - [01_optional-python-setup-preferences](./01_optional-python-setup-preferences) 28 | - [02_installing-python-libraries](./02_installing-python-libraries) 29 | 30 | The *Using DevContainers* section below outlines an alternative approach for installing project dependencies on your machine. 31 | 32 |   33 | 34 | ## Using Docker DevContainers 35 | 36 | As an alternative to the *Setting up Python* section above, if you prefer a development setup that isolates a project's dependencies and configurations, using Docker is a highly effective solution. This approach eliminates the need to manually install software packages and libraries and ensures a consistent development environment. You can find more instructions for setting up Docker and using a DevContainer: 37 | 38 | - [03_optional-docker-environment](03_optional-docker-environment) 39 | 40 |   41 | 42 | ## Visual Studio Code Editor 43 | 44 | There are many good options for code editors. My preferred choice is the popular open-source [Visual Studio Code (VSCode)](https://code.visualstudio.com) editor, which can be easily enhanced with many useful plugins and extensions (see the *VSCode Extensions* section below for more information). Download instructions for macOS, Linux, and Windows can be found on the [main VSCode website](https://code.visualstudio.com). 45 | 46 |   47 | 48 | ## VSCode Extensions 49 | 50 | If you are using Visual Studio Code (VSCode) as your primary code editor, you can find recommended extensions in the `.vscode` subfolder. 
To install these, open the `extensions.json` file in VSCode and click the "Install" button in the pop-up menu on the lower right. 51 | 52 |   53 | 54 | # Cloud Resources 55 | 56 | This section describes cloud alternatives for running the code presented in this book. 57 | 58 | While the code can run on conventional laptops and desktop computers without a dedicated GPU, cloud platforms with NVIDIA GPUs can substantially improve the runtime of the code, especially in chapters 5 to 7. 59 | 60 |   61 | 62 | ## Using Lightning Studio 63 | 64 | For a smooth development experience in the cloud, I recommend the [Lightning AI Studio](https://lightning.ai/) platform, which allows users to set up a persistent environment and use both VSCode and Jupyter Lab on cloud CPUs and GPUs. 65 | 66 | Once you start a new Studio, you can open the terminal and execute the following setup steps to clone the repository and install the dependencies: 67 | 68 | ```bash 69 | git clone https://github.com/rasbt/LLMs-from-scratch.git 70 | cd LLMs-from-scratch 71 | pip install -r requirements.txt 72 | ``` 73 | 74 | (In contrast to Google Colab, these only need to be executed once since the Lightning AI Studio environments are persistent, even if you switch between CPU and GPU machines.) 75 | 76 | Then, navigate to the Python script or Jupyter Notebook you want to run. Optionally, you can also easily connect a GPU to accelerate the code's runtime, for example, when you are pretraining the LLM in chapter 5 or finetuning it in chapters 6 and 7. 77 | 78 | 1 79 | 80 |   81 | 82 | ## Using Google Colab 83 | 84 | To use a Google Colab environment in the cloud, head over to [https://colab.research.google.com/](https://colab.research.google.com/) and open the respective chapter notebook from the GitHub menu or by dragging the notebook into the *Upload* field as shown in the figure below. 85 | 86 | 1 87 | 88 | 89 | Also make sure you upload the relevant files (dataset files and .py files the notebook is importing from) to the Colab environment as well, as shown below. 90 | 91 | 2 92 | 93 | 94 | You can optionally run the code on a GPU by changing the *Runtime* as illustrated in the figure below. 95 | 96 | 3 97 | 98 | 99 |   100 | 101 | # Questions? 102 | 103 | If you have any questions, please don't hesitate to reach out via the [Discussions](https://github.com/rasbt/LLMs-from-scratch/discussions) forum in this GitHub repository. 104 | -------------------------------------------------------------------------------- /ch07/02_dataset-utilities/find-near-duplicates.py: -------------------------------------------------------------------------------- 1 | 2 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 3 | # Source for "Build a Large Language Model From Scratch" 4 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 5 | # Code: https://github.com/rasbt/LLMs-from-scratch 6 | 7 | import argparse 8 | import json 9 | import re 10 | from sklearn import __version__ as sklearn_version 11 | from sklearn.feature_extraction.text import TfidfVectorizer 12 | from sklearn.metrics.pairwise import cosine_similarity 13 | 14 | 15 | # Sample JSON dataset 16 | example_data = [ 17 | {"instruction": "What is the capital of Italy?", 18 | "input": "", "output": "The capital of Italy is Rome." 19 | }, 20 | {"instruction": "What's the capital city of Italy?", 21 | "input": "", "output": "The capital city is Rome." 
22 | }, 23 | {"instruction": "Identify the main verb in the sentence: 'The cat sleeps on the couch.'", 24 | "input": "", "output": "The verb is 'sleeps'." 25 | }, 26 | {"instruction": "Identify the verb in the following sentence: The cat sleeps on the couch.", 27 | "input": "", "output": "The verb in the sentence is \"sleeps.\"" 28 | }, 29 | # ... 30 | ] 31 | 32 | 33 | def preprocess_text(text): 34 | # Lowercase the text 35 | text = text.lower() 36 | # Remove punctuation 37 | text = re.sub(r'[^\w\s]', '', text) 38 | return text 39 | 40 | 41 | def find_near_duplicates(json_data, threshold=0.75, key="instruction"): 42 | """The higher the threshold, the more similar the texts have to be to match""" 43 | 44 | # Extract instructions 45 | text = [preprocess_text(item[key]) for item in json_data if item[key]] 46 | near_duplicates = [] 47 | indices_to_remove = set() 48 | 49 | if not text: 50 | return {}, near_duplicates 51 | 52 | # Vectorize the text data 53 | vectorizer = TfidfVectorizer(stop_words=None, analyzer='char', ngram_range=(1, 3)) 54 | tfidf_matrix = vectorizer.fit_transform(text) 55 | 56 | # Compute cosine similarity between each pair of entries 57 | cos_sim_matrix = cosine_similarity(tfidf_matrix) 58 | 59 | # Find pairs of near-duplicate instructions based on the threshold 60 | 61 | for i in range(len(cos_sim_matrix)): 62 | for j in range(i+1, len(cos_sim_matrix)): 63 | if cos_sim_matrix[i, j] > threshold: 64 | if len(json_data[i][key]) <= 1 or len(json_data[j][key]) <= 1: 65 | continue 66 | near_duplicates.append((json_data[i], json_data[j], cos_sim_matrix[i, j])) 67 | if key in ("input", "output"): # Don't remove duplicates based on the instruction 68 | indices_to_remove.add(j) # Mark the second entry for removal 69 | 70 | # Remove the near-duplicate entries 71 | filtered_json_data = [item for index, item in enumerate(json_data) if index not in indices_to_remove] 72 | 73 | return filtered_json_data, near_duplicates 74 | 75 | 76 | def find_print_and_remove_near_duplicates(json_data, remove_duplicates=False, threshold=0.75): 77 | """ 78 | Searches each key in the first JSON object for duplicates across a list of JSON objects. 79 | Prints the duplicates if found. 80 | """ 81 | for key in json_data[0].keys(): 82 | 83 | if remove_duplicates: 84 | json_data, near_duplicates = find_near_duplicates(json_data, key=key, threshold=threshold) 85 | else: 86 | _, near_duplicates = find_near_duplicates(json_data, key=key, threshold=threshold) 87 | separator = 50 * '=' 88 | print(f"\n\n{separator}\nSearching '{key}' for duplicates ...\n{separator}") 89 | if not near_duplicates: 90 | print("No duplicates found") 91 | else: 92 | for dup in near_duplicates: 93 | print( 94 | f"Duplicate pair found with similarity {dup[2]:.2f}:\n" 95 | f"1. {dup[0][key]}\n2. 
{dup[1][key]}\n" 96 | ) 97 | return json_data 98 | 99 | 100 | if __name__ == "__main__": 101 | print("scikit-learn version:", sklearn_version) 102 | 103 | parser = argparse.ArgumentParser() 104 | parser.add_argument( 105 | "--json_file", 106 | type=str, 107 | help=("Path to the dataset JSON file") 108 | ) 109 | parser.add_argument( 110 | "--threshold", 111 | type=float, 112 | default=0.9, 113 | help=("A sensitivity threshold between 0 and 1 where 1 is strictest") 114 | ) 115 | parser.add_argument( 116 | "--remove_duplicates", 117 | action='store_true', 118 | default=False, 119 | help=( 120 | "Removes duplicates based on the 'input' or 'output' keys " 121 | " (but not the 'instruction') and saves the cleaned JSON file as --json_output_file" 122 | ) 123 | ) 124 | parser.add_argument( 125 | "--json_output_file", 126 | type=str, 127 | help=("Path to the dataset JSON file") 128 | ) 129 | 130 | args = parser.parse_args() 131 | 132 | if args.remove_duplicates and not args.json_output_file: 133 | raise ValueError( 134 | "Provide an output file via --json_output_file " 135 | "to save the cleaned JSON data." 136 | ) 137 | 138 | if not args.json_file: 139 | json_data = example_data 140 | 141 | else: 142 | with open(args.json_file, "r") as file: 143 | json_data = json.load(file) 144 | 145 | json_data = find_print_and_remove_near_duplicates( 146 | json_data=json_data, 147 | remove_duplicates=args.remove_duplicates, 148 | threshold=args.threshold 149 | ) 150 | 151 | if args.remove_duplicates: 152 | with open(args.json_output_file, "w") as file: 153 | json.dump(json_data, file, indent=4) 154 | -------------------------------------------------------------------------------- /ch06/01_main-chapter-code/exercise-solutions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "ba450fb1-8a26-4894-ab7a-5d7bfefe90ce", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "\n", 10 | "\n", 16 | "\n", 19 | "\n", 20 | "
\n", 11 | "\n", 12 | "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka
\n", 13 | "
Code repository: https://github.com/rasbt/LLMs-from-scratch\n", 14 | "
\n", 15 | "
\n", 17 | "\n", 18 | "
" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "51c9672d-8d0c-470d-ac2d-1271f8ec3f14", 26 | "metadata": {}, 27 | "source": [ 28 | "# Chapter 6 Exercise solutions" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "5fea8be3-30a1-4623-a6d7-b095c6c1092e", 34 | "metadata": {}, 35 | "source": [ 36 | "## Exercise 6.1: Increasing the context length" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "5860ba9f-2db3-4480-b96b-4be1c68981eb", 42 | "metadata": {}, 43 | "source": [ 44 | "We can pad the inputs to the maximum number of tokens the model supports by setting the max length to 1024:\n", 45 | "\n", 46 | "```python\n", 47 | "max_length = 1024\n", 48 | "\n", 49 | "train_dataset = SpamDataset(base_path / \"train.csv\", max_length=max_length, tokenizer=tokenizer)\n", 50 | "val_dataset = SpamDataset(base_path / \"validation.csv\", max_length=max_length, tokenizer=tokenizer)\n", 51 | "test_dataset = SpamDataset(base_path / \"test.csv\", max_length=max_length, tokenizer=tokenizer)\n", 52 | "```\n", 53 | "\n", 54 | "or, equivalently, we can define the `max_length` via:\n", 55 | "\n", 56 | "```python\n", 57 | "max_length = model.pos_emb.weight.shape[0]\n", 58 | "```\n", 59 | "\n", 60 | "or\n", 61 | "\n", 62 | "```python\n", 63 | "max_length = BASE_CONFIG[\"context_length\"]\n", 64 | "```" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "id": "2b0f4d5d-17fd-4265-93d8-ea08a22fdaf8", 70 | "metadata": {}, 71 | "source": [ 72 | "For convenience, you can run this experiment via\n", 73 | "\n", 74 | "```bash\n", 75 | "python additional-experiments.py --context_length \"model_context_length\"\n", 76 | "```\n", 77 | "\n", 78 | "using the code in the [../02_bonus_additional-experiments](../02_bonus_additional-experiments) folder, which results in a substantially worse test accuracy of 78.33% (versus the 95.67% in the main chapter)." 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "id": "5a780455-f52a-48d1-ab82-6afd40bcad8b", 84 | "metadata": {}, 85 | "source": [ 86 | "## Exercise 6.2: Finetuning the whole model" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "id": "56aa5208-aa29-4165-a0ec-7480754e2a18", 92 | "metadata": {}, 93 | "source": [ 94 | "Instead of finetuning just the final transformer block, we can finetune the entire model by removing the following lines from the code:\n", 95 | "\n", 96 | "```python\n", 97 | "for param in model.parameters():\n", 98 | " param.requires_grad = False\n", 99 | "```\n", 100 | "\n", 101 | "For convenience, you can run this experiment via\n", 102 | "\n", 103 | "```bash\n", 104 | "python additional-experiments.py --trainable_layers all\n", 105 | "```\n", 106 | "\n", 107 | "using the code in the [../02_bonus_additional-experiments](../02_bonus_additional-experiments) folder, which results in a 1% improved test accuracy of 96.67% (versus the 95.67% in the main chapter)." 
108 |    ]
109 |   },
110 |   {
111 |    "cell_type": "markdown",
112 |    "id": "2269bce3-f2b5-4a76-a692-5977c75a57b6",
113 |    "metadata": {},
114 |    "source": [
115 |     "## Exercise 6.3: Finetuning the first versus last token"
116 |    ]
117 |   },
118 |   {
119 |    "cell_type": "markdown",
120 |    "id": "7418a629-51b6-4aa2-83b7-bc0261bc370f",
121 |    "metadata": {},
122 |    "source": [
123 |     "Rather than finetuning the last output token, we can finetune the first output token by changing \n",
124 |     "\n",
125 |     "```python\n",
126 |     "model(input_batch)[:, -1, :]\n",
127 |     "```\n",
128 |     "\n",
129 |     "to\n",
130 |     "\n",
131 |     "```python\n",
132 |     "model(input_batch)[:, 0, :]\n",
133 |     "```\n",
134 |     "\n",
135 |     "everywhere in the code.\n",
136 |     "\n",
137 |     "For convenience, you can run this experiment via\n",
138 |     "\n",
139 |     "```bash\n",
140 |     "python additional-experiments.py --trainable_token first\n",
141 |     "```\n",
142 |     "\n",
143 |     "using the code in the [../02_bonus_additional-experiments](../02_bonus_additional-experiments) folder, which results in a substantially worse test accuracy of 75.00% (versus the 95.67% in the main chapter)."
144 |    ]
145 |   }
146 |  ],
147 |  "metadata": {
148 |   "kernelspec": {
149 |    "display_name": "Python 3 (ipykernel)",
150 |    "language": "python",
151 |    "name": "python3"
152 |   },
153 |   "language_info": {
154 |    "codemirror_mode": {
155 |     "name": "ipython",
156 |     "version": 3
157 |    },
158 |    "file_extension": ".py",
159 |    "mimetype": "text/x-python",
160 |    "name": "python",
161 |    "nbconvert_exporter": "python",
162 |    "pygments_lexer": "ipython3",
163 |    "version": "3.10.11"
164 |   }
165 |  },
166 |  "nbformat": 4,
167 |  "nbformat_minor": 5
168 | }
169 | 
--------------------------------------------------------------------------------
/appendix-A/01_main-chapter-code/exercise-solutions.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
\n", 10 | "\n", 11 | "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka
\n", 12 | "
Code repository: https://github.com/rasbt/LLMs-from-scratch\n", 13 | "
\n", 14 | "
\n", 16 | "\n", 17 | "
\n" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "## Exercise A.1" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "The [Python Setup Tips](../../setup/01_optional-python-setup-preferences/README.md) document in this repository contains additional recommendations and tips to set up your Python environment.\n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Exercise A.2" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "The [Installing Libraries Used In This Book document](../../setup/02_installing-python-libraries/README.md) and [directory](../../setup/02_installing-python-libraries/) contains utilities to check whether your environment is set up correctly." 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "## Exercise A.3" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 2, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "import torch\n", 64 | "\n", 65 | "class NeuralNetwork(torch.nn.Module):\n", 66 | " def __init__(self, num_inputs, num_outputs):\n", 67 | " super().__init__()\n", 68 | "\n", 69 | " self.layers = torch.nn.Sequential(\n", 70 | " \n", 71 | " # 1st hidden layer\n", 72 | " torch.nn.Linear(num_inputs, 30),\n", 73 | " torch.nn.ReLU(),\n", 74 | "\n", 75 | " # 2nd hidden layer\n", 76 | " torch.nn.Linear(30, 20),\n", 77 | " torch.nn.ReLU(),\n", 78 | "\n", 79 | " # output layer\n", 80 | " torch.nn.Linear(20, num_outputs),\n", 81 | " )\n", 82 | "\n", 83 | " def forward(self, x):\n", 84 | " logits = self.layers(x)\n", 85 | " return logits" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 3, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | "Total number of trainable model parameters: 752\n" 98 | ] 99 | } 100 | ], 101 | "source": [ 102 | "model = NeuralNetwork(2, 2)\n", 103 | "\n", 104 | "num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n", 105 | "print(\"Total number of trainable model parameters:\", num_params)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "## Exercise A.4" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 1, 118 | "metadata": { 119 | "id": "qGgnamiyLJxp" 120 | }, 121 | "outputs": [], 122 | "source": [ 123 | "import torch\n", 124 | "\n", 125 | "a = torch.rand(100, 200)\n", 126 | "b = torch.rand(200, 300)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 2, 132 | "metadata": { 133 | "colab": { 134 | "base_uri": "https://localhost:8080/" 135 | }, 136 | "id": "CvGvIeVkLzXE", 137 | "outputId": "44d027be-0787-4348-9c06-4e559d94d0e1" 138 | }, 139 | "outputs": [ 140 | { 141 | "name": "stdout", 142 | "output_type": "stream", 143 | "text": [ 144 | "63.8 µs ± 8.7 µs per loop (mean ± std. dev. 
145 |      ]
146 |     }
147 |    ],
148 |    "source": [
149 |     "%timeit a @ b"
150 |    ]
151 |   },
152 |   {
153 |    "cell_type": "code",
154 |    "execution_count": 3,
155 |    "metadata": {
156 |     "id": "OmRtZLa9L2ZG"
157 |    },
158 |    "outputs": [],
159 |    "source": [
160 |     "a, b = a.to(\"cuda\"), b.to(\"cuda\")"
161 |    ]
162 |   },
163 |   {
164 |    "cell_type": "code",
165 |    "execution_count": 4,
166 |    "metadata": {
167 |     "colab": {
168 |      "base_uri": "https://localhost:8080/"
169 |     },
170 |     "id": "duLEhXDPL6k0",
171 |     "outputId": "3486471d-fd62-446f-9855-2d01f41fd101"
172 |    },
173 |    "outputs": [
174 |     {
175 |      "name": "stdout",
176 |      "output_type": "stream",
177 |      "text": [
178 |       "13.8 µs ± 425 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
179 |      ]
180 |     }
181 |    ],
182 |    "source": [
183 |     "%timeit a @ b"
184 |    ]
185 |   }
186 |  ],
187 |  "metadata": {
188 |   "accelerator": "GPU",
189 |   "colab": {
190 |    "gpuType": "V100",
191 |    "machine_shape": "hm",
192 |    "provenance": []
193 |   },
194 |   "kernelspec": {
195 |    "display_name": "Python 3 (ipykernel)",
196 |    "language": "python",
197 |    "name": "python3"
198 |   },
199 |   "language_info": {
200 |    "codemirror_mode": {
201 |     "name": "ipython",
202 |     "version": 3
203 |    },
204 |    "file_extension": ".py",
205 |    "mimetype": "text/x-python",
206 |    "name": "python",
207 |    "nbconvert_exporter": "python",
208 |    "pygments_lexer": "ipython3",
209 |    "version": "3.10.6"
210 |   }
211 |  },
212 |  "nbformat": 4,
213 |  "nbformat_minor": 4
214 | }
215 | 
--------------------------------------------------------------------------------
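A note on the two `%timeit` measurements in the notebook above: CUDA kernels are launched asynchronously, so timings of unsynchronized GPU code can partly reflect kernel-launch overhead rather than the actual compute time. The snippet below is a minimal sketch (not part of the original notebook) of the same matrix-multiplication benchmark using CUDA events, which time the work on the GPU itself:

```python
import torch

a = torch.rand(100, 200, device="cuda")
b = torch.rand(200, 300, device="cuda")

# Warm-up so that one-time CUDA initialization cost is excluded
for _ in range(10):
    _ = a @ b
torch.cuda.synchronize()

# CUDA events record timestamps on the GPU's own timeline
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(1000):
    _ = a @ b
end.record()

# Wait until all queued GPU work has finished before reading the timers
torch.cuda.synchronize()
print(f"Average time per matmul: {start.elapsed_time(end) / 1000:.4f} ms")
```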
/appendix-A/01_main-chapter-code/DDP-script.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | #   - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 | 
6 | # Appendix A: Introduction to PyTorch (Part 3)
7 | 
8 | import torch
9 | import torch.nn.functional as F
10 | from torch.utils.data import Dataset, DataLoader
11 | 
12 | # NEW imports:
13 | import os
14 | import torch.multiprocessing as mp
15 | from torch.utils.data.distributed import DistributedSampler
16 | from torch.nn.parallel import DistributedDataParallel as DDP
17 | from torch.distributed import init_process_group, destroy_process_group
18 | 
19 | 
20 | # NEW: function to initialize a distributed process group (1 process / GPU)
21 | # this allows communication among processes
22 | def ddp_setup(rank, world_size):
23 |     """
24 |     Arguments:
25 |         rank: a unique process ID
26 |         world_size: total number of processes in the group
27 |     """
28 |     # address of the machine running the rank:0 process;
29 |     # here, we assume all GPUs are on the same machine
30 |     os.environ["MASTER_ADDR"] = "localhost"
31 |     # any free port on the machine
32 |     os.environ["MASTER_PORT"] = "12345"
33 | 
34 |     # initialize process group
35 |     # Windows users may have to use "gloo" instead of "nccl" as backend
36 |     # nccl: NVIDIA Collective Communication Library
37 |     init_process_group(backend="nccl", rank=rank, world_size=world_size)
38 |     torch.cuda.set_device(rank)
39 | 
40 | 
41 | class ToyDataset(Dataset):
42 |     def __init__(self, X, y):
43 |         self.features = X
44 |         self.labels = y
45 | 
46 |     def __getitem__(self, index):
47 |         one_x = self.features[index]
48 |         one_y = self.labels[index]
49 |         return one_x, one_y
50 | 
51 |     def __len__(self):
52 |         return self.labels.shape[0]
53 | 
54 | 
55 | class NeuralNetwork(torch.nn.Module):
56 |     def __init__(self, num_inputs, num_outputs):
57 |         super().__init__()
58 | 
59 |         self.layers = torch.nn.Sequential(
60 |             # 1st hidden layer
61 |             torch.nn.Linear(num_inputs, 30),
62 |             torch.nn.ReLU(),
63 | 
64 |             # 2nd hidden layer
65 |             torch.nn.Linear(30, 20),
66 |             torch.nn.ReLU(),
67 | 
68 |             # output layer
69 |             torch.nn.Linear(20, num_outputs),
70 |         )
71 | 
72 |     def forward(self, x):
73 |         logits = self.layers(x)
74 |         return logits
75 | 
76 | 
77 | def prepare_dataset():
78 |     X_train = torch.tensor([
79 |         [-1.2, 3.1],
80 |         [-0.9, 2.9],
81 |         [-0.5, 2.6],
82 |         [2.3, -1.1],
83 |         [2.7, -1.5]
84 |     ])
85 |     y_train = torch.tensor([0, 0, 0, 1, 1])
86 | 
87 |     X_test = torch.tensor([
88 |         [-0.8, 2.8],
89 |         [2.6, -1.6],
90 |     ])
91 |     y_test = torch.tensor([0, 1])
92 | 
93 |     train_ds = ToyDataset(X_train, y_train)
94 |     test_ds = ToyDataset(X_test, y_test)
95 | 
96 |     train_loader = DataLoader(
97 |         dataset=train_ds,
98 |         batch_size=2,
99 |         shuffle=False,  # NEW: False because of DistributedSampler below
100 |         pin_memory=True,
101 |         drop_last=True,
102 |         # NEW: chunk batches across GPUs without overlapping samples:
103 |         sampler=DistributedSampler(train_ds)  # NEW
104 |     )
105 |     test_loader = DataLoader(
106 |         dataset=test_ds,
107 |         batch_size=2,
108 |         shuffle=False,
109 |     )
110 |     return train_loader, test_loader
111 | 
112 | 
113 | # NEW: wrapper
114 | def main(rank, world_size, num_epochs):
115 | 
116 |     ddp_setup(rank, world_size)  # NEW: initialize process groups
117 | 
118 |     train_loader, test_loader = prepare_dataset()
119 |     model = NeuralNetwork(num_inputs=2, num_outputs=2)
120 |     model.to(rank)
121 |     optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
122 | 
123 |     model = DDP(model, device_ids=[rank])  # NEW: wrap model with DDP
124 |     # the core model is now accessible as model.module
125 | 
126 |     for epoch in range(num_epochs):
127 | 
128 |         model.train()
129 |         for features, labels in train_loader:
130 | 
131 |             features, labels = features.to(rank), labels.to(rank)  # NEW: use rank as the device ID
132 |             logits = model(features)
133 |             loss = F.cross_entropy(logits, labels)  # Loss function
134 | 
135 |             optimizer.zero_grad()
136 |             loss.backward()
137 |             optimizer.step()
138 | 
139 |             # LOGGING
140 |             print(f"[GPU{rank}] Epoch: {epoch+1:03d}/{num_epochs:03d}"
141 |                   f" | Batchsize {labels.shape[0]:03d}"
142 |                   f" | Train Loss: {loss:.2f}")
143 | 
144 |     model.eval()
145 |     train_acc = compute_accuracy(model, train_loader, device=rank)
146 |     print(f"[GPU{rank}] Training accuracy", train_acc)
147 |     test_acc = compute_accuracy(model, test_loader, device=rank)
148 |     print(f"[GPU{rank}] Test accuracy", test_acc)
149 | 
150 |     destroy_process_group()  # NEW: cleanly exit distributed mode
151 | 
152 | 
153 | def compute_accuracy(model, dataloader, device):
154 |     model = model.eval()
155 |     correct = 0.0
156 |     total_examples = 0
157 | 
158 |     for idx, (features, labels) in enumerate(dataloader):
159 |         features, labels = features.to(device), labels.to(device)
160 | 
161 |         with torch.no_grad():
162 |             logits = model(features)
163 |         predictions = torch.argmax(logits, dim=1)
164 |         compare = labels == predictions
165 |         correct += torch.sum(compare)
166 |         total_examples += len(compare)
167 |     return (correct / total_examples).item()
168 | 
169 | 
170 | if __name__ == "__main__":
171 |     print("PyTorch version:", torch.__version__)
172 |     print("CUDA available:", torch.cuda.is_available())
173 |     print("Number of GPUs available:", torch.cuda.device_count())
174 | 
175 |     torch.manual_seed(123)
176 | 
177 |     # NEW: spawn new processes
178 |     # note that spawn will automatically pass the rank
179 |     num_epochs = 3
180 |     world_size = torch.cuda.device_count()
181 |     mp.spawn(main, args=(world_size, num_epochs), nprocs=world_size)
182 |     # nprocs=world_size spawns one process per GPU
183 | 
--------------------------------------------------------------------------------
/ch05/01_main-chapter-code/gpt_download.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | 7 | import os 8 | import urllib.request 9 | 10 | # import requests 11 | import json 12 | import numpy as np 13 | import tensorflow as tf 14 | from tqdm import tqdm 15 | 16 | 17 | def download_and_load_gpt2(model_size, models_dir): 18 | # Validate model size 19 | allowed_sizes = ("124M", "355M", "774M", "1558M") 20 | if model_size not in allowed_sizes: 21 | raise ValueError(f"Model size not in {allowed_sizes}") 22 | 23 | # Define paths 24 | model_dir = os.path.join(models_dir, model_size) 25 | base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models" 26 | filenames = [ 27 | "checkpoint", "encoder.json", "hparams.json", 28 | "model.ckpt.data-00000-of-00001", "model.ckpt.index", 29 | "model.ckpt.meta", "vocab.bpe" 30 | ] 31 | 32 | # Download files 33 | os.makedirs(model_dir, exist_ok=True) 34 | for filename in filenames: 35 | file_url = os.path.join(base_url, model_size, filename) 36 | file_path = os.path.join(model_dir, filename) 37 | download_file(file_url, file_path) 38 | 39 | # Load settings and params 40 | tf_ckpt_path = tf.train.latest_checkpoint(model_dir) 41 | settings = json.load(open(os.path.join(model_dir, "hparams.json"))) 42 | params = load_gpt2_params_from_tf_ckpt(tf_ckpt_path, settings) 43 | 44 | return settings, params 45 | 46 | 47 | def download_file(url, destination): 48 | # Send a GET request to download the file 49 | 50 | try: 51 | with urllib.request.urlopen(url) as response: 52 | # Get the total file size from headers, defaulting to 0 if not present 53 | file_size = int(response.headers.get("Content-Length", 0)) 54 | 55 | # Check if file exists and has the same size 56 | if os.path.exists(destination): 57 | file_size_local = os.path.getsize(destination) 58 | if file_size == file_size_local: 59 | print(f"File already exists and is up-to-date: {destination}") 60 | return 61 | 62 | # Define the block size for reading the file 63 | block_size = 1024 # 1 Kilobyte 64 | 65 | # Initialize the progress bar with total file size 66 | progress_bar_description = os.path.basename(url) # Extract filename from URL 67 | with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar: 68 | # Open the destination file in binary write mode 69 | with open(destination, "wb") as file: 70 | # Read the file in chunks and write to destination 71 | while True: 72 | chunk = response.read(block_size) 73 | if not chunk: 74 | break 75 | file.write(chunk) 76 | progress_bar.update(len(chunk)) # Update progress bar 77 | except urllib.error.HTTPError: 78 | s = ( 79 | f"The specified URL ({url}) is incorrect, the internet connection cannot be established," 80 | "\nor the requested file is temporarily unavailable.\nPlease visit the following website" 81 | " for help: https://github.com/rasbt/LLMs-from-scratch/discussions/273") 82 | print(s) 83 | 84 | 85 | # Alternative way using `requests` 86 | """ 87 | def download_file(url, destination): 88 | # Send a GET request to download the file in streaming mode 89 | response = requests.get(url, stream=True) 90 | 91 | # Get the total file size from headers, defaulting to 0 if not present 92 | file_size = int(response.headers.get("content-length", 0)) 93 | 94 | # Check if file exists and has the same size 95 | if os.path.exists(destination): 96 | file_size_local = os.path.getsize(destination) 97 | if file_size == 
file_size_local: 98 | print(f"File already exists and is up-to-date: {destination}") 99 | return 100 | 101 | # Define the block size for reading the file 102 | block_size = 1024 # 1 Kilobyte 103 | 104 | # Initialize the progress bar with total file size 105 | progress_bar_description = url.split("/")[-1] # Extract filename from URL 106 | with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar: 107 | # Open the destination file in binary write mode 108 | with open(destination, "wb") as file: 109 | # Iterate over the file data in chunks 110 | for chunk in response.iter_content(block_size): 111 | progress_bar.update(len(chunk)) # Update progress bar 112 | file.write(chunk) # Write the chunk to the file 113 | """ 114 | 115 | 116 | def load_gpt2_params_from_tf_ckpt(ckpt_path, settings): 117 | # Initialize parameters dictionary with empty blocks for each layer 118 | params = {"blocks": [{} for _ in range(settings["n_layer"])]} 119 | 120 | # Iterate over each variable in the checkpoint 121 | for name, _ in tf.train.list_variables(ckpt_path): 122 | # Load the variable and remove singleton dimensions 123 | variable_array = np.squeeze(tf.train.load_variable(ckpt_path, name)) 124 | 125 | # Process the variable name to extract relevant parts 126 | variable_name_parts = name.split("/")[1:] # Skip the 'model/' prefix 127 | 128 | # Identify the target dictionary for the variable 129 | target_dict = params 130 | if variable_name_parts[0].startswith("h"): 131 | layer_number = int(variable_name_parts[0][1:]) 132 | target_dict = params["blocks"][layer_number] 133 | 134 | # Recursively access or create nested dictionaries 135 | for key in variable_name_parts[1:-1]: 136 | target_dict = target_dict.setdefault(key, {}) 137 | 138 | # Assign the variable array to the last key 139 | last_key = variable_name_parts[-1] 140 | target_dict[last_key] = variable_array 141 | 142 | return params 143 | -------------------------------------------------------------------------------- /ch06/01_main-chapter-code/gpt_download.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 
2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | 7 | import os 8 | import urllib.request 9 | 10 | # import requests 11 | import json 12 | import numpy as np 13 | import tensorflow as tf 14 | from tqdm import tqdm 15 | 16 | 17 | def download_and_load_gpt2(model_size, models_dir): 18 | # Validate model size 19 | allowed_sizes = ("124M", "355M", "774M", "1558M") 20 | if model_size not in allowed_sizes: 21 | raise ValueError(f"Model size not in {allowed_sizes}") 22 | 23 | # Define paths 24 | model_dir = os.path.join(models_dir, model_size) 25 | base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models" 26 | filenames = [ 27 | "checkpoint", "encoder.json", "hparams.json", 28 | "model.ckpt.data-00000-of-00001", "model.ckpt.index", 29 | "model.ckpt.meta", "vocab.bpe" 30 | ] 31 | 32 | # Download files 33 | os.makedirs(model_dir, exist_ok=True) 34 | for filename in filenames: 35 | file_url = os.path.join(base_url, model_size, filename) 36 | file_path = os.path.join(model_dir, filename) 37 | download_file(file_url, file_path) 38 | 39 | # Load settings and params 40 | tf_ckpt_path = tf.train.latest_checkpoint(model_dir) 41 | settings = json.load(open(os.path.join(model_dir, "hparams.json"))) 42 | params = load_gpt2_params_from_tf_ckpt(tf_ckpt_path, settings) 43 | 44 | return settings, params 45 | 46 | 47 | def download_file(url, destination): 48 | # Send a GET request to download the file 49 | 50 | try: 51 | with urllib.request.urlopen(url) as response: 52 | # Get the total file size from headers, defaulting to 0 if not present 53 | file_size = int(response.headers.get("Content-Length", 0)) 54 | 55 | # Check if file exists and has the same size 56 | if os.path.exists(destination): 57 | file_size_local = os.path.getsize(destination) 58 | if file_size == file_size_local: 59 | print(f"File already exists and is up-to-date: {destination}") 60 | return 61 | 62 | # Define the block size for reading the file 63 | block_size = 1024 # 1 Kilobyte 64 | 65 | # Initialize the progress bar with total file size 66 | progress_bar_description = os.path.basename(url) # Extract filename from URL 67 | with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar: 68 | # Open the destination file in binary write mode 69 | with open(destination, "wb") as file: 70 | # Read the file in chunks and write to destination 71 | while True: 72 | chunk = response.read(block_size) 73 | if not chunk: 74 | break 75 | file.write(chunk) 76 | progress_bar.update(len(chunk)) # Update progress bar 77 | except urllib.error.HTTPError: 78 | s = ( 79 | f"The specified URL ({url}) is incorrect, the internet connection cannot be established," 80 | "\nor the requested file is temporarily unavailable.\nPlease visit the following website" 81 | " for help: https://github.com/rasbt/LLMs-from-scratch/discussions/273") 82 | print(s) 83 | 84 | 85 | # Alternative way using `requests` 86 | """ 87 | def download_file(url, destination): 88 | # Send a GET request to download the file in streaming mode 89 | response = requests.get(url, stream=True) 90 | 91 | # Get the total file size from headers, defaulting to 0 if not present 92 | file_size = int(response.headers.get("content-length", 0)) 93 | 94 | # Check if file exists and has the same size 95 | if os.path.exists(destination): 96 | file_size_local = os.path.getsize(destination) 97 | if file_size == 
file_size_local: 98 | print(f"File already exists and is up-to-date: {destination}") 99 | return 100 | 101 | # Define the block size for reading the file 102 | block_size = 1024 # 1 Kilobyte 103 | 104 | # Initialize the progress bar with total file size 105 | progress_bar_description = url.split("/")[-1] # Extract filename from URL 106 | with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar: 107 | # Open the destination file in binary write mode 108 | with open(destination, "wb") as file: 109 | # Iterate over the file data in chunks 110 | for chunk in response.iter_content(block_size): 111 | progress_bar.update(len(chunk)) # Update progress bar 112 | file.write(chunk) # Write the chunk to the file 113 | """ 114 | 115 | 116 | def load_gpt2_params_from_tf_ckpt(ckpt_path, settings): 117 | # Initialize parameters dictionary with empty blocks for each layer 118 | params = {"blocks": [{} for _ in range(settings["n_layer"])]} 119 | 120 | # Iterate over each variable in the checkpoint 121 | for name, _ in tf.train.list_variables(ckpt_path): 122 | # Load the variable and remove singleton dimensions 123 | variable_array = np.squeeze(tf.train.load_variable(ckpt_path, name)) 124 | 125 | # Process the variable name to extract relevant parts 126 | variable_name_parts = name.split("/")[1:] # Skip the 'model/' prefix 127 | 128 | # Identify the target dictionary for the variable 129 | target_dict = params 130 | if variable_name_parts[0].startswith("h"): 131 | layer_number = int(variable_name_parts[0][1:]) 132 | target_dict = params["blocks"][layer_number] 133 | 134 | # Recursively access or create nested dictionaries 135 | for key in variable_name_parts[1:-1]: 136 | target_dict = target_dict.setdefault(key, {}) 137 | 138 | # Assign the variable array to the last key 139 | last_key = variable_name_parts[-1] 140 | target_dict[last_key] = variable_array 141 | 142 | return params 143 | -------------------------------------------------------------------------------- /ch07/01_main-chapter-code/gpt_download.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 
2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | 7 | import os 8 | import urllib.request 9 | 10 | # import requests 11 | import json 12 | import numpy as np 13 | import tensorflow as tf 14 | from tqdm import tqdm 15 | 16 | 17 | def download_and_load_gpt2(model_size, models_dir): 18 | # Validate model size 19 | allowed_sizes = ("124M", "355M", "774M", "1558M") 20 | if model_size not in allowed_sizes: 21 | raise ValueError(f"Model size not in {allowed_sizes}") 22 | 23 | # Define paths 24 | model_dir = os.path.join(models_dir, model_size) 25 | base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models" 26 | filenames = [ 27 | "checkpoint", "encoder.json", "hparams.json", 28 | "model.ckpt.data-00000-of-00001", "model.ckpt.index", 29 | "model.ckpt.meta", "vocab.bpe" 30 | ] 31 | 32 | # Download files 33 | os.makedirs(model_dir, exist_ok=True) 34 | for filename in filenames: 35 | file_url = os.path.join(base_url, model_size, filename) 36 | file_path = os.path.join(model_dir, filename) 37 | download_file(file_url, file_path) 38 | 39 | # Load settings and params 40 | tf_ckpt_path = tf.train.latest_checkpoint(model_dir) 41 | settings = json.load(open(os.path.join(model_dir, "hparams.json"))) 42 | params = load_gpt2_params_from_tf_ckpt(tf_ckpt_path, settings) 43 | 44 | return settings, params 45 | 46 | 47 | def download_file(url, destination): 48 | # Send a GET request to download the file 49 | 50 | try: 51 | with urllib.request.urlopen(url) as response: 52 | # Get the total file size from headers, defaulting to 0 if not present 53 | file_size = int(response.headers.get("Content-Length", 0)) 54 | 55 | # Check if file exists and has the same size 56 | if os.path.exists(destination): 57 | file_size_local = os.path.getsize(destination) 58 | if file_size == file_size_local: 59 | print(f"File already exists and is up-to-date: {destination}") 60 | return 61 | 62 | # Define the block size for reading the file 63 | block_size = 1024 # 1 Kilobyte 64 | 65 | # Initialize the progress bar with total file size 66 | progress_bar_description = os.path.basename(url) # Extract filename from URL 67 | with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar: 68 | # Open the destination file in binary write mode 69 | with open(destination, "wb") as file: 70 | # Read the file in chunks and write to destination 71 | while True: 72 | chunk = response.read(block_size) 73 | if not chunk: 74 | break 75 | file.write(chunk) 76 | progress_bar.update(len(chunk)) # Update progress bar 77 | except urllib.error.HTTPError: 78 | s = ( 79 | f"The specified URL ({url}) is incorrect, the internet connection cannot be established," 80 | "\nor the requested file is temporarily unavailable.\nPlease visit the following website" 81 | " for help: https://github.com/rasbt/LLMs-from-scratch/discussions/273") 82 | print(s) 83 | 84 | 85 | # Alternative way using `requests` 86 | """ 87 | def download_file(url, destination): 88 | # Send a GET request to download the file in streaming mode 89 | response = requests.get(url, stream=True) 90 | 91 | # Get the total file size from headers, defaulting to 0 if not present 92 | file_size = int(response.headers.get("content-length", 0)) 93 | 94 | # Check if file exists and has the same size 95 | if os.path.exists(destination): 96 | file_size_local = os.path.getsize(destination) 97 | if file_size == 
file_size_local: 98 | print(f"File already exists and is up-to-date: {destination}") 99 | return 100 | 101 | # Define the block size for reading the file 102 | block_size = 1024 # 1 Kilobyte 103 | 104 | # Initialize the progress bar with total file size 105 | progress_bar_description = url.split("/")[-1] # Extract filename from URL 106 | with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar: 107 | # Open the destination file in binary write mode 108 | with open(destination, "wb") as file: 109 | # Iterate over the file data in chunks 110 | for chunk in response.iter_content(block_size): 111 | progress_bar.update(len(chunk)) # Update progress bar 112 | file.write(chunk) # Write the chunk to the file 113 | """ 114 | 115 | 116 | def load_gpt2_params_from_tf_ckpt(ckpt_path, settings): 117 | # Initialize parameters dictionary with empty blocks for each layer 118 | params = {"blocks": [{} for _ in range(settings["n_layer"])]} 119 | 120 | # Iterate over each variable in the checkpoint 121 | for name, _ in tf.train.list_variables(ckpt_path): 122 | # Load the variable and remove singleton dimensions 123 | variable_array = np.squeeze(tf.train.load_variable(ckpt_path, name)) 124 | 125 | # Process the variable name to extract relevant parts 126 | variable_name_parts = name.split("/")[1:] # Skip the 'model/' prefix 127 | 128 | # Identify the target dictionary for the variable 129 | target_dict = params 130 | if variable_name_parts[0].startswith("h"): 131 | layer_number = int(variable_name_parts[0][1:]) 132 | target_dict = params["blocks"][layer_number] 133 | 134 | # Recursively access or create nested dictionaries 135 | for key in variable_name_parts[1:-1]: 136 | target_dict = target_dict.setdefault(key, {}) 137 | 138 | # Assign the variable array to the last key 139 | last_key = variable_name_parts[-1] 140 | target_dict[last_key] = variable_array 141 | 142 | return params 143 | -------------------------------------------------------------------------------- /appendix-E/01_main-chapter-code/gpt_download.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 
2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | 7 | import os 8 | import urllib.request 9 | 10 | # import requests 11 | import json 12 | import numpy as np 13 | import tensorflow as tf 14 | from tqdm import tqdm 15 | 16 | 17 | def download_and_load_gpt2(model_size, models_dir): 18 | # Validate model size 19 | allowed_sizes = ("124M", "355M", "774M", "1558M") 20 | if model_size not in allowed_sizes: 21 | raise ValueError(f"Model size not in {allowed_sizes}") 22 | 23 | # Define paths 24 | model_dir = os.path.join(models_dir, model_size) 25 | base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models" 26 | filenames = [ 27 | "checkpoint", "encoder.json", "hparams.json", 28 | "model.ckpt.data-00000-of-00001", "model.ckpt.index", 29 | "model.ckpt.meta", "vocab.bpe" 30 | ] 31 | 32 | # Download files 33 | os.makedirs(model_dir, exist_ok=True) 34 | for filename in filenames: 35 | file_url = os.path.join(base_url, model_size, filename) 36 | file_path = os.path.join(model_dir, filename) 37 | download_file(file_url, file_path) 38 | 39 | # Load settings and params 40 | tf_ckpt_path = tf.train.latest_checkpoint(model_dir) 41 | settings = json.load(open(os.path.join(model_dir, "hparams.json"))) 42 | params = load_gpt2_params_from_tf_ckpt(tf_ckpt_path, settings) 43 | 44 | return settings, params 45 | 46 | 47 | def download_file(url, destination): 48 | # Send a GET request to download the file 49 | 50 | try: 51 | with urllib.request.urlopen(url) as response: 52 | # Get the total file size from headers, defaulting to 0 if not present 53 | file_size = int(response.headers.get("Content-Length", 0)) 54 | 55 | # Check if file exists and has the same size 56 | if os.path.exists(destination): 57 | file_size_local = os.path.getsize(destination) 58 | if file_size == file_size_local: 59 | print(f"File already exists and is up-to-date: {destination}") 60 | return 61 | 62 | # Define the block size for reading the file 63 | block_size = 1024 # 1 Kilobyte 64 | 65 | # Initialize the progress bar with total file size 66 | progress_bar_description = os.path.basename(url) # Extract filename from URL 67 | with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar: 68 | # Open the destination file in binary write mode 69 | with open(destination, "wb") as file: 70 | # Read the file in chunks and write to destination 71 | while True: 72 | chunk = response.read(block_size) 73 | if not chunk: 74 | break 75 | file.write(chunk) 76 | progress_bar.update(len(chunk)) # Update progress bar 77 | except urllib.error.HTTPError: 78 | s = ( 79 | f"The specified URL ({url}) is incorrect, the internet connection cannot be established," 80 | "\nor the requested file is temporarily unavailable.\nPlease visit the following website" 81 | " for help: https://github.com/rasbt/LLMs-from-scratch/discussions/273") 82 | print(s) 83 | 84 | 85 | # Alternative way using `requests` 86 | """ 87 | def download_file(url, destination): 88 | # Send a GET request to download the file in streaming mode 89 | response = requests.get(url, stream=True) 90 | 91 | # Get the total file size from headers, defaulting to 0 if not present 92 | file_size = int(response.headers.get("content-length", 0)) 93 | 94 | # Check if file exists and has the same size 95 | if os.path.exists(destination): 96 | file_size_local = os.path.getsize(destination) 97 | if file_size == 
file_size_local: 98 | print(f"File already exists and is up-to-date: {destination}") 99 | return 100 | 101 | # Define the block size for reading the file 102 | block_size = 1024 # 1 Kilobyte 103 | 104 | # Initialize the progress bar with total file size 105 | progress_bar_description = url.split("/")[-1] # Extract filename from URL 106 | with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar: 107 | # Open the destination file in binary write mode 108 | with open(destination, "wb") as file: 109 | # Iterate over the file data in chunks 110 | for chunk in response.iter_content(block_size): 111 | progress_bar.update(len(chunk)) # Update progress bar 112 | file.write(chunk) # Write the chunk to the file 113 | """ 114 | 115 | 116 | def load_gpt2_params_from_tf_ckpt(ckpt_path, settings): 117 | # Initialize parameters dictionary with empty blocks for each layer 118 | params = {"blocks": [{} for _ in range(settings["n_layer"])]} 119 | 120 | # Iterate over each variable in the checkpoint 121 | for name, _ in tf.train.list_variables(ckpt_path): 122 | # Load the variable and remove singleton dimensions 123 | variable_array = np.squeeze(tf.train.load_variable(ckpt_path, name)) 124 | 125 | # Process the variable name to extract relevant parts 126 | variable_name_parts = name.split("/")[1:] # Skip the 'model/' prefix 127 | 128 | # Identify the target dictionary for the variable 129 | target_dict = params 130 | if variable_name_parts[0].startswith("h"): 131 | layer_number = int(variable_name_parts[0][1:]) 132 | target_dict = params["blocks"][layer_number] 133 | 134 | # Recursively access or create nested dictionaries 135 | for key in variable_name_parts[1:-1]: 136 | target_dict = target_dict.setdefault(key, {}) 137 | 138 | # Assign the variable array to the last key 139 | last_key = variable_name_parts[-1] 140 | target_dict[last_key] = variable_array 141 | 142 | return params 143 | -------------------------------------------------------------------------------- /ch06/02_bonus_additional-experiments/gpt_download.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 
2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | 7 | import os 8 | import urllib.request 9 | 10 | # import requests 11 | import json 12 | import numpy as np 13 | import tensorflow as tf 14 | from tqdm import tqdm 15 | 16 | 17 | def download_and_load_gpt2(model_size, models_dir): 18 | # Validate model size 19 | allowed_sizes = ("124M", "355M", "774M", "1558M") 20 | if model_size not in allowed_sizes: 21 | raise ValueError(f"Model size not in {allowed_sizes}") 22 | 23 | # Define paths 24 | model_dir = os.path.join(models_dir, model_size) 25 | base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models" 26 | filenames = [ 27 | "checkpoint", "encoder.json", "hparams.json", 28 | "model.ckpt.data-00000-of-00001", "model.ckpt.index", 29 | "model.ckpt.meta", "vocab.bpe" 30 | ] 31 | 32 | # Download files 33 | os.makedirs(model_dir, exist_ok=True) 34 | for filename in filenames: 35 | file_url = os.path.join(base_url, model_size, filename) 36 | file_path = os.path.join(model_dir, filename) 37 | download_file(file_url, file_path) 38 | 39 | # Load settings and params 40 | tf_ckpt_path = tf.train.latest_checkpoint(model_dir) 41 | settings = json.load(open(os.path.join(model_dir, "hparams.json"))) 42 | params = load_gpt2_params_from_tf_ckpt(tf_ckpt_path, settings) 43 | 44 | return settings, params 45 | 46 | 47 | def download_file(url, destination): 48 | # Send a GET request to download the file 49 | 50 | try: 51 | with urllib.request.urlopen(url) as response: 52 | # Get the total file size from headers, defaulting to 0 if not present 53 | file_size = int(response.headers.get("Content-Length", 0)) 54 | 55 | # Check if file exists and has the same size 56 | if os.path.exists(destination): 57 | file_size_local = os.path.getsize(destination) 58 | if file_size == file_size_local: 59 | print(f"File already exists and is up-to-date: {destination}") 60 | return 61 | 62 | # Define the block size for reading the file 63 | block_size = 1024 # 1 Kilobyte 64 | 65 | # Initialize the progress bar with total file size 66 | progress_bar_description = os.path.basename(url) # Extract filename from URL 67 | with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar: 68 | # Open the destination file in binary write mode 69 | with open(destination, "wb") as file: 70 | # Read the file in chunks and write to destination 71 | while True: 72 | chunk = response.read(block_size) 73 | if not chunk: 74 | break 75 | file.write(chunk) 76 | progress_bar.update(len(chunk)) # Update progress bar 77 | except urllib.error.HTTPError: 78 | s = ( 79 | f"The specified URL ({url}) is incorrect, the internet connection cannot be established," 80 | "\nor the requested file is temporarily unavailable.\nPlease visit the following website" 81 | " for help: https://github.com/rasbt/LLMs-from-scratch/discussions/273") 82 | print(s) 83 | 84 | 85 | # Alternative way using `requests` 86 | """ 87 | def download_file(url, destination): 88 | # Send a GET request to download the file in streaming mode 89 | response = requests.get(url, stream=True) 90 | 91 | # Get the total file size from headers, defaulting to 0 if not present 92 | file_size = int(response.headers.get("content-length", 0)) 93 | 94 | # Check if file exists and has the same size 95 | if os.path.exists(destination): 96 | file_size_local = os.path.getsize(destination) 97 | if file_size == 
file_size_local: 98 | print(f"File already exists and is up-to-date: {destination}") 99 | return 100 | 101 | # Define the block size for reading the file 102 | block_size = 1024 # 1 Kilobyte 103 | 104 | # Initialize the progress bar with total file size 105 | progress_bar_description = url.split("/")[-1] # Extract filename from URL 106 | with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar: 107 | # Open the destination file in binary write mode 108 | with open(destination, "wb") as file: 109 | # Iterate over the file data in chunks 110 | for chunk in response.iter_content(block_size): 111 | progress_bar.update(len(chunk)) # Update progress bar 112 | file.write(chunk) # Write the chunk to the file 113 | """ 114 | 115 | 116 | def load_gpt2_params_from_tf_ckpt(ckpt_path, settings): 117 | # Initialize parameters dictionary with empty blocks for each layer 118 | params = {"blocks": [{} for _ in range(settings["n_layer"])]} 119 | 120 | # Iterate over each variable in the checkpoint 121 | for name, _ in tf.train.list_variables(ckpt_path): 122 | # Load the variable and remove singleton dimensions 123 | variable_array = np.squeeze(tf.train.load_variable(ckpt_path, name)) 124 | 125 | # Process the variable name to extract relevant parts 126 | variable_name_parts = name.split("/")[1:] # Skip the 'model/' prefix 127 | 128 | # Identify the target dictionary for the variable 129 | target_dict = params 130 | if variable_name_parts[0].startswith("h"): 131 | layer_number = int(variable_name_parts[0][1:]) 132 | target_dict = params["blocks"][layer_number] 133 | 134 | # Recursively access or create nested dictionaries 135 | for key in variable_name_parts[1:-1]: 136 | target_dict = target_dict.setdefault(key, {}) 137 | 138 | # Assign the variable array to the last key 139 | last_key = variable_name_parts[-1] 140 | target_dict[last_key] = variable_array 141 | 142 | return params 143 | -------------------------------------------------------------------------------- /ch06/03_bonus_imdb-classification/gpt_download.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 
2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | 7 | import os 8 | import urllib.request 9 | 10 | # import requests 11 | import json 12 | import numpy as np 13 | import tensorflow as tf 14 | from tqdm import tqdm 15 | 16 | 17 | def download_and_load_gpt2(model_size, models_dir): 18 | # Validate model size 19 | allowed_sizes = ("124M", "355M", "774M", "1558M") 20 | if model_size not in allowed_sizes: 21 | raise ValueError(f"Model size not in {allowed_sizes}") 22 | 23 | # Define paths 24 | model_dir = os.path.join(models_dir, model_size) 25 | base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models" 26 | filenames = [ 27 | "checkpoint", "encoder.json", "hparams.json", 28 | "model.ckpt.data-00000-of-00001", "model.ckpt.index", 29 | "model.ckpt.meta", "vocab.bpe" 30 | ] 31 | 32 | # Download files 33 | os.makedirs(model_dir, exist_ok=True) 34 | for filename in filenames: 35 | file_url = os.path.join(base_url, model_size, filename) 36 | file_path = os.path.join(model_dir, filename) 37 | download_file(file_url, file_path) 38 | 39 | # Load settings and params 40 | tf_ckpt_path = tf.train.latest_checkpoint(model_dir) 41 | settings = json.load(open(os.path.join(model_dir, "hparams.json"))) 42 | params = load_gpt2_params_from_tf_ckpt(tf_ckpt_path, settings) 43 | 44 | return settings, params 45 | 46 | 47 | def download_file(url, destination): 48 | # Send a GET request to download the file 49 | 50 | try: 51 | with urllib.request.urlopen(url) as response: 52 | # Get the total file size from headers, defaulting to 0 if not present 53 | file_size = int(response.headers.get("Content-Length", 0)) 54 | 55 | # Check if file exists and has the same size 56 | if os.path.exists(destination): 57 | file_size_local = os.path.getsize(destination) 58 | if file_size == file_size_local: 59 | print(f"File already exists and is up-to-date: {destination}") 60 | return 61 | 62 | # Define the block size for reading the file 63 | block_size = 1024 # 1 Kilobyte 64 | 65 | # Initialize the progress bar with total file size 66 | progress_bar_description = os.path.basename(url) # Extract filename from URL 67 | with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar: 68 | # Open the destination file in binary write mode 69 | with open(destination, "wb") as file: 70 | # Read the file in chunks and write to destination 71 | while True: 72 | chunk = response.read(block_size) 73 | if not chunk: 74 | break 75 | file.write(chunk) 76 | progress_bar.update(len(chunk)) # Update progress bar 77 | except urllib.error.HTTPError: 78 | s = ( 79 | f"The specified URL ({url}) is incorrect, the internet connection cannot be established," 80 | "\nor the requested file is temporarily unavailable.\nPlease visit the following website" 81 | " for help: https://github.com/rasbt/LLMs-from-scratch/discussions/273") 82 | print(s) 83 | 84 | 85 | # Alternative way using `requests` 86 | """ 87 | def download_file(url, destination): 88 | # Send a GET request to download the file in streaming mode 89 | response = requests.get(url, stream=True) 90 | 91 | # Get the total file size from headers, defaulting to 0 if not present 92 | file_size = int(response.headers.get("content-length", 0)) 93 | 94 | # Check if file exists and has the same size 95 | if os.path.exists(destination): 96 | file_size_local = os.path.getsize(destination) 97 | if file_size == 
file_size_local:
98 |             print(f"File already exists and is up-to-date: {destination}")
99 |             return
100 | 
101 |     # Define the block size for reading the file
102 |     block_size = 1024  # 1 Kilobyte
103 | 
104 |     # Initialize the progress bar with total file size
105 |     progress_bar_description = url.split("/")[-1]  # Extract filename from URL
106 |     with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar:
107 |         # Open the destination file in binary write mode
108 |         with open(destination, "wb") as file:
109 |             # Iterate over the file data in chunks
110 |             for chunk in response.iter_content(block_size):
111 |                 progress_bar.update(len(chunk))  # Update progress bar
112 |                 file.write(chunk)  # Write the chunk to the file
113 | """
114 | 
115 | 
116 | def load_gpt2_params_from_tf_ckpt(ckpt_path, settings):
117 |     # Initialize parameters dictionary with empty blocks for each layer
118 |     params = {"blocks": [{} for _ in range(settings["n_layer"])]}
119 | 
120 |     # Iterate over each variable in the checkpoint
121 |     for name, _ in tf.train.list_variables(ckpt_path):
122 |         # Load the variable and remove singleton dimensions
123 |         variable_array = np.squeeze(tf.train.load_variable(ckpt_path, name))
124 | 
125 |         # Process the variable name to extract relevant parts
126 |         variable_name_parts = name.split("/")[1:]  # Skip the 'model/' prefix
127 | 
128 |         # Identify the target dictionary for the variable
129 |         target_dict = params
130 |         if variable_name_parts[0].startswith("h"):
131 |             layer_number = int(variable_name_parts[0][1:])
132 |             target_dict = params["blocks"][layer_number]
133 | 
134 |         # Recursively access or create nested dictionaries
135 |         for key in variable_name_parts[1:-1]:
136 |             target_dict = target_dict.setdefault(key, {})
137 | 
138 |         # Assign the variable array to the last key
139 |         last_key = variable_name_parts[-1]
140 |         target_dict[last_key] = variable_array
141 | 
142 |     return params
143 | 
--------------------------------------------------------------------------------
/setup/03_optional-docker-environment/README.md:
--------------------------------------------------------------------------------
1 | # Docker Environment Setup Guide
2 | 
3 | If you prefer a development setup that isolates a project's dependencies and configurations, using Docker is a highly effective solution. This approach eliminates the need to manually install software packages and libraries and ensures a consistent development environment.
4 | 
5 | This guide will walk you through the process of setting up an optional Docker environment for this book if you prefer it over the conda approach explained in [../01_optional-python-setup-preferences](../01_optional-python-setup-preferences) and [../02_installing-python-libraries](../02_installing-python-libraries).
6 | 
7 | <br>
8 | 9 | ## Downloading and installing Docker 10 | 11 | The easiest way to get started with Docker is by installing [Docker Desktop](https://docs.docker.com/desktop/) for your relevant platform. 12 | 13 | Linux (Ubuntu) users may prefer to install the [Docker Engine](https://docs.docker.com/engine/install/ubuntu/) instead and follow the [post-installation](https://docs.docker.com/engine/install/linux-postinstall/) steps. 14 | 15 |
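To confirm that the installation succeeded, you can run the following standard Docker commands in a terminal (shown here as a quick sanity check, not as part of the official setup steps):

```bash
docker --version        # prints the installed Docker version
docker run hello-world  # pulls and runs a minimal test image
```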
16 | 
17 | ## Using a Docker DevContainer in Visual Studio Code
18 | 
19 | A Docker DevContainer, or Development Container, is a tool that allows developers to use Docker containers as a fully-fledged development environment. This approach ensures that users can quickly get up and running with a consistent development environment, regardless of their local machine setup.
20 | 
21 | While DevContainers also work with other IDEs, a commonly used IDE/editor for working with DevContainers is Visual Studio Code (VS Code). The guide below explains how to use the DevContainer for this book within a VS Code context, but a similar process should also apply to PyCharm. [Install](https://code.visualstudio.com/download) VS Code if you don't have it and want to use it.
22 | 
23 | 1. Clone this GitHub repository and `cd` into the project root directory.
24 | 
25 | ```bash
26 | git clone https://github.com/rasbt/LLMs-from-scratch.git
27 | cd LLMs-from-scratch
28 | ```
29 | 
30 | 2. Move the `.devcontainer` folder from `setup/03_optional-docker-environment/` to the current directory (project root).
31 | 
32 | ```bash
33 | mv setup/03_optional-docker-environment/.devcontainer ./
34 | ```
35 | 
36 | 3. In Docker Desktop, make sure that the **_desktop-linux_ builder** is running and will be used to build the Docker container (see _Docker Desktop_ -> _Change settings_ -> _Builders_ -> _desktop-linux_ -> _..._ -> _Use_).
37 | 
38 | 4. If you have a [CUDA-supported GPU](https://developer.nvidia.com/cuda-gpus), you can speed up the training and inference:
39 | 
40 |    4.1 Install the **NVIDIA Container Toolkit** as described [here](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-apt). Note that NVIDIA Container Toolkit support on WSL 2 is documented [here](https://docs.nvidia.com/cuda/wsl-user-guide/index.html#nvidia-compute-software-support-on-wsl-2).
41 | 
42 |    4.2 Add _nvidia_ as a runtime in the Docker Engine daemon config (see _Docker Desktop_ -> _Change settings_ -> _Docker Engine_). Add these lines to your config:
43 | 
44 |    ```json
45 |    "runtimes": {
46 |      "nvidia": {
47 |        "path": "nvidia-container-runtime",
48 |        "runtimeArgs": []
49 |      }
50 |    }
51 |    ```
52 |    For example, the full Docker Engine daemon config JSON should look like this:
53 |    ```json
54 |    {
55 |      "builder": {
56 |        "gc": {
57 |          "defaultKeepStorage": "20GB",
58 |          "enabled": true
59 |        }
60 |      },
61 |      "experimental": false,
62 |      "runtimes": {
63 |        "nvidia": {
64 |          "path": "nvidia-container-runtime",
65 |          "runtimeArgs": []
66 |        }
67 |      }
68 |    }
69 |    ```
70 | 
71 |    and restart Docker Desktop.
72 | 
73 | 5. Type `code .` in the terminal to open the project in VS Code. Alternatively, you can launch VS Code and select the project to open from the UI.
74 | 
75 | 6. Install the **Remote Development** extension from the VS Code _Extensions_ menu on the left-hand side.
76 | 
77 | 7. Open the DevContainer.
78 | 
79 |    Since the `.devcontainer` folder is present in the main `LLMs-from-scratch` directory (folders starting with `.` may be invisible in your OS depending on your settings), VS Code should automatically detect it and ask whether you would like to open the project in a devcontainer. If it doesn't, simply press `Ctrl + Shift + P` to open the command palette and start typing `dev containers` to see a list of all DevContainer-specific options.
80 | 
81 | 8. Select **Reopen in Container**.
82 | 
83 |    Docker will now begin the process of building the Docker image specified in the `.devcontainer` configuration if it hasn't been built before, or pull the image if it's available from a registry.
84 | 
85 |    The entire process is automated and might take a few minutes, depending on your system and internet speed. Optionally click on "Starting Dev Container (show log)" in the lower right corner of VS Code to see the current build progress.
86 | 
87 |    Once completed, VS Code will automatically connect to the container and reopen the project within the newly created Docker development environment. You will be able to write, execute, and debug code as if it were running on your local machine, but with the added benefits of Docker's isolation and consistency.
88 | 
89 | > [!WARNING]
90 | > If you are encountering an error during the build process, this is likely because your machine does not support the NVIDIA Container Toolkit, for example, because it doesn't have a compatible GPU. In this case, edit the `devcontainer.json` file to remove the `"runArgs": ["--runtime=nvidia", "--gpus=all"],` line and run the "Reopen Dev Container" procedure again.
91 | 
92 | 9. Finished.
93 | 
94 |    Once the image has been pulled and built, you should have your project mounted inside the container with all the packages installed, ready for development.
95 | 
96 | <br>
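If you enabled GPU support via the NVIDIA Container Toolkit as described above, you can optionally check that the GPU is visible from a terminal inside the DevContainer. This is a sketch that assumes PyTorch is installed in the container image, as it is for this book's environment:

```bash
nvidia-smi  # should list your GPU(s) if the NVIDIA runtime is active
python -c "import torch; print(torch.cuda.is_available())"  # should print True
```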
97 | 98 | ## Uninstalling the Docker Image 99 | 100 | Below are instructions for uninstalling or removing a Docker container and image if you no longer plan to use it. This process does not remove Docker itself from your system but rather cleans up the project-specific Docker artifacts. 101 | 102 | 1. List all Docker images to find the one associated with your DevContainer: 103 | 104 | ```bash 105 | docker image ls 106 | ``` 107 | 108 | 2. Remove the Docker image using its image ID or name: 109 | 110 | ```bash 111 | docker image rm [IMAGE_ID_OR_NAME] 112 | ``` 113 | 114 |
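Note that `docker image rm` will refuse to delete an image that is still referenced by a container. If that happens, you can remove the leftover containers first using standard Docker commands:

```bash
docker ps -a                       # list all containers, including stopped ones
docker rm [CONTAINER_ID_OR_NAME]   # remove the container(s) based on the image
```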
115 | 116 | ## Uninstalling Docker 117 | 118 | If you decide that Docker is not for you and wish to uninstall it, see the official documentation [here](https://docs.docker.com/desktop/uninstall/) that outlines the steps for your specific operating system. 119 | -------------------------------------------------------------------------------- /ch02/01_main-chapter-code/dataloader.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "6e2a4891-c257-4d6b-afb3-e8fef39d0437", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "\n", 10 | "\n", 16 | "\n", 19 | "\n", 20 | "
\n", 11 | "\n", 12 | "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka
\n", 13 | "
Code repository: https://github.com/rasbt/LLMs-from-scratch\n", 14 | "
\n", 15 | "
\n", 17 | "\n", 18 | "
\n" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "6f678e62-7bcb-4405-86ae-dce94f494303", 26 | "metadata": {}, 27 | "source": [ 28 | "# The Main Data Loading Pipeline Summarized" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "070000fc-a7b7-4c56-a2c0-a938d413a790", 34 | "metadata": {}, 35 | "source": [ 36 | "The complete chapter code is located in [ch02.ipynb](./ch02.ipynb).\n", 37 | "\n", 38 | "This notebook contains the main takeaway, the data loading pipeline without the intermediate steps." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "id": "2b4e8f2d-cb81-41a3-8780-a70b382e18ae", 44 | "metadata": {}, 45 | "source": [ 46 | "Packages that are being used in this notebook:" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 1, 52 | "id": "c7ed6fbe-45ac-40ce-8ea5-4edb212565e1", 53 | "metadata": {}, 54 | "outputs": [ 55 | { 56 | "name": "stdout", 57 | "output_type": "stream", 58 | "text": [ 59 | "torch version: 2.4.0\n", 60 | "tiktoken version: 0.7.0\n" 61 | ] 62 | } 63 | ], 64 | "source": [ 65 | "# NBVAL_SKIP\n", 66 | "from importlib.metadata import version\n", 67 | "\n", 68 | "print(\"torch version:\", version(\"torch\"))\n", 69 | "print(\"tiktoken version:\", version(\"tiktoken\"))" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 2, 75 | "id": "0ed4b7db-3b47-4fd3-a4a6-5f4ed5dd166e", 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "import tiktoken\n", 80 | "import torch\n", 81 | "from torch.utils.data import Dataset, DataLoader\n", 82 | "\n", 83 | "\n", 84 | "class GPTDatasetV1(Dataset):\n", 85 | " def __init__(self, txt, tokenizer, max_length, stride):\n", 86 | " self.input_ids = []\n", 87 | " self.target_ids = []\n", 88 | "\n", 89 | " # Tokenize the entire text\n", 90 | " token_ids = tokenizer.encode(txt, allowed_special={\"<|endoftext|>\"})\n", 91 | "\n", 92 | " # Use a sliding window to chunk the book into overlapping sequences of max_length\n", 93 | " for i in range(0, len(token_ids) - max_length, stride):\n", 94 | " input_chunk = token_ids[i:i + max_length]\n", 95 | " target_chunk = token_ids[i + 1: i + max_length + 1]\n", 96 | " self.input_ids.append(torch.tensor(input_chunk))\n", 97 | " self.target_ids.append(torch.tensor(target_chunk))\n", 98 | "\n", 99 | " def __len__(self):\n", 100 | " return len(self.input_ids)\n", 101 | "\n", 102 | " def __getitem__(self, idx):\n", 103 | " return self.input_ids[idx], self.target_ids[idx]\n", 104 | "\n", 105 | "\n", 106 | "def create_dataloader_v1(txt, batch_size=4, max_length=256, \n", 107 | " stride=128, shuffle=True, drop_last=True, num_workers=0):\n", 108 | " # Initialize the tokenizer\n", 109 | " tokenizer = tiktoken.get_encoding(\"gpt2\")\n", 110 | "\n", 111 | " # Create dataset\n", 112 | " dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)\n", 113 | "\n", 114 | " # Create dataloader\n", 115 | " dataloader = DataLoader(\n", 116 | " dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)\n", 117 | "\n", 118 | " return dataloader\n", 119 | "\n", 120 | "\n", 121 | "with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n", 122 | " raw_text = f.read()\n", 123 | "\n", 124 | "tokenizer = tiktoken.get_encoding(\"gpt2\")\n", 125 | "encoded_text = tokenizer.encode(raw_text)\n", 126 | "\n", 127 | "vocab_size = 50257\n", 128 | "output_dim = 256\n", 129 | "context_length = 1024\n", 130 | "\n", 131 | "\n", 132 | "token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)\n", 133 | 
"pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)\n", 134 | "\n", 135 | "max_length = 4\n", 136 | "dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length)" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 3, 142 | "id": "664397bc-6daa-4b88-90aa-e8fc1fbd5846", 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "for batch in dataloader:\n", 147 | " x, y = batch\n", 148 | "\n", 149 | " token_embeddings = token_embedding_layer(x)\n", 150 | " pos_embeddings = pos_embedding_layer(torch.arange(max_length))\n", 151 | "\n", 152 | " input_embeddings = token_embeddings + pos_embeddings\n", 153 | "\n", 154 | " break" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 4, 160 | "id": "d3664332-e6bb-447e-8b96-203aafde8b24", 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "torch.Size([8, 4, 256])\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "print(input_embeddings.shape)" 173 | ] 174 | } 175 | ], 176 | "metadata": { 177 | "kernelspec": { 178 | "display_name": "Python 3 (ipykernel)", 179 | "language": "python", 180 | "name": "python3" 181 | }, 182 | "language_info": { 183 | "codemirror_mode": { 184 | "name": "ipython", 185 | "version": 3 186 | }, 187 | "file_extension": ".py", 188 | "mimetype": "text/x-python", 189 | "name": "python", 190 | "nbconvert_exporter": "python", 191 | "pygments_lexer": "ipython3", 192 | "version": "3.10.6" 193 | } 194 | }, 195 | "nbformat": 4, 196 | "nbformat_minor": 5 197 | } 198 | -------------------------------------------------------------------------------- /ch07/01_main-chapter-code/load-finetuned-model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "1545a16b-bc8d-4e49-b9a6-db6631e7483d", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "\n", 10 | "\n", 16 | "\n", 19 | "\n", 20 | "
\n", 11 | "\n", 12 | "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka
\n", 13 | "
Code repository: https://github.com/rasbt/LLMs-from-scratch\n", 14 | "
\n", 15 | "
\n", 17 | "\n", 18 | "
" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "f3f83194-82b9-4478-9550-5ad793467bd0", 26 | "metadata": {}, 27 | "source": [ 28 | "# Load And Use Finetuned Model" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "466b564e-4fd5-4d76-a3a1-63f9f0993b7e", 34 | "metadata": {}, 35 | "source": [ 36 | "This notebook contains minimal code to load the finetuned model that was instruction finetuned and saved in chapter 7 via [ch07.ipynb](ch07.ipynb)." 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 1, 42 | "id": "fd80e5f5-0f79-4a6c-bf31-2026e7d30e52", 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "name": "stdout", 47 | "output_type": "stream", 48 | "text": [ 49 | "tiktoken version: 0.7.0\n", 50 | "torch version: 2.4.0\n" 51 | ] 52 | } 53 | ], 54 | "source": [ 55 | "from importlib.metadata import version\n", 56 | "\n", 57 | "pkgs = [\n", 58 | " \"tiktoken\", # Tokenizer\n", 59 | " \"torch\", # Deep learning library\n", 60 | "]\n", 61 | "for p in pkgs:\n", 62 | " print(f\"{p} version: {version(p)}\")" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 2, 68 | "id": "ed86d6b7-f32d-4601-b585-a2ea3dbf7201", 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "from pathlib import Path\n", 73 | "\n", 74 | "finetuned_model_path = Path(\"gpt2-medium355M-sft.pth\")\n", 75 | "if not finetuned_model_path.exists():\n", 76 | " print(\n", 77 | " f\"Could not find '{finetuned_model_path}'.\\n\"\n", 78 | " \"Run the `ch07.ipynb` notebook to finetune and save the finetuned model.\"\n", 79 | " )" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 3, 85 | "id": "fb02584a-5e31-45d5-8377-794876907bc6", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "from previous_chapters import GPTModel\n", 90 | "\n", 91 | "\n", 92 | "BASE_CONFIG = {\n", 93 | " \"vocab_size\": 50257, # Vocabulary size\n", 94 | " \"context_length\": 1024, # Context length\n", 95 | " \"drop_rate\": 0.0, # Dropout rate\n", 96 | " \"qkv_bias\": True # Query-key-value bias\n", 97 | "}\n", 98 | "\n", 99 | "model_configs = {\n", 100 | " \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n", 101 | " \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n", 102 | " \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n", 103 | " \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n", 104 | "}\n", 105 | "\n", 106 | "CHOOSE_MODEL = \"gpt2-medium (355M)\"\n", 107 | "\n", 108 | "BASE_CONFIG.update(model_configs[CHOOSE_MODEL])\n", 109 | "\n", 110 | "model_size = CHOOSE_MODEL.split(\" \")[-1].lstrip(\"(\").rstrip(\")\")\n", 111 | "model = GPTModel(BASE_CONFIG)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 4, 117 | "id": "f1ccf2b7-176e-4cfd-af7a-53fb76010b94", 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "import torch\n", 122 | "\n", 123 | "model.load_state_dict(torch.load(\n", 124 | " \"gpt2-medium355M-sft.pth\",\n", 125 | " map_location=torch.device(\"cpu\"),\n", 126 | " weights_only=True\n", 127 | "))\n", 128 | "model.eval();" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 5, 134 | "id": "a1fd174e-9555-46c5-8780-19b0aa4f26e5", 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "import tiktoken\n", 139 | "\n", 140 | "tokenizer = tiktoken.get_encoding(\"gpt2\")" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 6, 
146 | "id": "2a4c0129-efe5-46e9-bb90-ba08d407c1a2", 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "prompt = \"\"\"Below is an instruction that describes a task. Write a response \n", 151 | "that appropriately completes the request.\n", 152 | "\n", 153 | "### Instruction:\n", 154 | "Convert the active sentence to passive: 'The chef cooks the meal every day.'\n", 155 | "\"\"\"" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 7, 161 | "id": "1e26862c-10b5-4a0f-9dd6-b6ddbad2fc3f", 162 | "metadata": {}, 163 | "outputs": [ 164 | { 165 | "name": "stdout", 166 | "output_type": "stream", 167 | "text": [ 168 | "The meal is cooked every day by the chef.\n" 169 | ] 170 | } 171 | ], 172 | "source": [ 173 | "from previous_chapters import (\n", 174 | " generate,\n", 175 | " text_to_token_ids,\n", 176 | " token_ids_to_text\n", 177 | ")\n", 178 | "\n", 179 | "def extract_response(response_text, input_text):\n", 180 | " return response_text[len(input_text):].replace(\"### Response:\", \"\").strip()\n", 181 | "\n", 182 | "torch.manual_seed(123)\n", 183 | "\n", 184 | "token_ids = generate(\n", 185 | " model=model,\n", 186 | " idx=text_to_token_ids(prompt, tokenizer),\n", 187 | " max_new_tokens=35,\n", 188 | " context_size=BASE_CONFIG[\"context_length\"],\n", 189 | " eos_id=50256\n", 190 | ")\n", 191 | "\n", 192 | "response = token_ids_to_text(token_ids, tokenizer)\n", 193 | "response = extract_response(response, prompt)\n", 194 | "print(response)" 195 | ] 196 | } 197 | ], 198 | "metadata": { 199 | "kernelspec": { 200 | "display_name": "Python 3 (ipykernel)", 201 | "language": "python", 202 | "name": "python3" 203 | }, 204 | "language_info": { 205 | "codemirror_mode": { 206 | "name": "ipython", 207 | "version": 3 208 | }, 209 | "file_extension": ".py", 210 | "mimetype": "text/x-python", 211 | "name": "python", 212 | "nbconvert_exporter": "python", 213 | "pygments_lexer": "ipython3", 214 | "version": "3.11.4" 215 | } 216 | }, 217 | "nbformat": 4, 218 | "nbformat_minor": 5 219 | } 220 | -------------------------------------------------------------------------------- /ch02/02_bonus_bytepair-encoder/bpe_openai_gpt2.py: -------------------------------------------------------------------------------- 1 | # Source: https://github.com/openai/gpt-2/blob/master/src/encoder.py 2 | # License: 3 | # Modified MIT License 4 | 5 | # Software Copyright (c) 2019 OpenAI 6 | 7 | # We don’t claim ownership of the content you create with GPT-2, so it is yours to do with as you please. 8 | # We only ask that you use GPT-2 responsibly and clearly indicate your content was created using GPT-2. 9 | 10 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this software and 11 | # associated documentation files (the "Software"), to deal in the Software without restriction, 12 | # including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, 13 | # and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, 14 | # subject to the following conditions: 15 | 16 | # The above copyright notice and this permission notice shall be included 17 | # in all copies or substantial portions of the Software. 18 | # The above copyright notice and this permission notice need not be included 19 | # with content created by the Software. 
20 | 21 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 22 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 23 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS 24 | # BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 25 | # TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE 26 | # OR OTHER DEALINGS IN THE SOFTWARE. 27 | 28 | import os 29 | import json 30 | import regex as re 31 | import requests 32 | from tqdm import tqdm 33 | from functools import lru_cache 34 | 35 | 36 | @lru_cache() 37 | def bytes_to_unicode(): 38 | """ 39 | Returns list of utf-8 byte and a corresponding list of unicode strings. 40 | The reversible bpe codes work on unicode strings. 41 | This means you need a large # of unicode characters in your vocab if you want to avoid UNKs. 42 | When you're at something like a 10B token dataset you end up needing around 5K for decent coverage. 43 | This is a significant percentage of your normal, say, 32K bpe vocab. 44 | To avoid that, we want lookup tables between utf-8 bytes and unicode strings. 45 | And avoids mapping to whitespace/control characters the bpe code barfs on. 46 | """ 47 | bs = list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1)) 48 | cs = bs[:] 49 | n = 0 50 | for b in range(2**8): 51 | if b not in bs: 52 | bs.append(b) 53 | cs.append(2**8 + n) 54 | n += 1 55 | cs = [chr(n) for n in cs] 56 | return dict(zip(bs, cs)) 57 | 58 | 59 | def get_pairs(word): 60 | """ 61 | Return set of symbol pairs in a word. 62 | Word is represented as tuple of symbols (symbols being variable-length strings). 
63 | """ 64 | pairs = set() 65 | prev_char = word[0] 66 | for char in word[1:]: 67 | pairs.add((prev_char, char)) 68 | prev_char = char 69 | return pairs 70 | 71 | 72 | class Encoder: 73 | def __init__(self, encoder, bpe_merges, errors='replace'): 74 | self.encoder = encoder 75 | self.decoder = {v: k for k, v in self.encoder.items()} 76 | self.errors = errors # how to handle errors in decoding 77 | self.byte_encoder = bytes_to_unicode() 78 | self.byte_decoder = {v: k for k, v in self.byte_encoder.items()} 79 | self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges)))) 80 | self.cache = {} 81 | 82 | # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions 83 | self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""") 84 | 85 | def bpe(self, token): 86 | if token in self.cache: 87 | return self.cache[token] 88 | word = tuple(token) 89 | pairs = get_pairs(word) 90 | 91 | if not pairs: 92 | return token 93 | 94 | while True: 95 | bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float('inf'))) 96 | if bigram not in self.bpe_ranks: 97 | break 98 | first, second = bigram 99 | new_word = [] 100 | i = 0 101 | while i < len(word): 102 | try: 103 | j = word.index(first, i) 104 | new_word.extend(word[i:j]) 105 | i = j 106 | except ValueError: 107 | new_word.extend(word[i:]) 108 | break 109 | 110 | if word[i] == first and i < len(word) - 1 and word[i + 1] == second: 111 | new_word.append(first + second) 112 | i += 2 113 | else: 114 | new_word.append(word[i]) 115 | i += 1 116 | new_word = tuple(new_word) 117 | word = new_word 118 | if len(word) == 1: 119 | break 120 | else: 121 | pairs = get_pairs(word) 122 | word = ' '.join(word) 123 | self.cache[token] = word 124 | return word 125 | 126 | def encode(self, text): 127 | bpe_tokens = [] 128 | for token in re.findall(self.pat, text): 129 | token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8')) 130 | bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' ')) 131 | return bpe_tokens 132 | 133 | def decode(self, tokens): 134 | text = ''.join([self.decoder[token] for token in tokens]) 135 | text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors) 136 | return text 137 | 138 | 139 | def get_encoder(model_name, models_dir): 140 | with open(os.path.join(models_dir, model_name, 'encoder.json'), 'r') as f: 141 | encoder = json.load(f) 142 | with open(os.path.join(models_dir, model_name, 'vocab.bpe'), 'r', encoding="utf-8") as f: 143 | bpe_data = f.read() 144 | bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]] 145 | return Encoder(encoder=encoder, bpe_merges=bpe_merges) 146 | 147 | 148 | def download_vocab(): 149 | # Modified code from 150 | subdir = 'gpt2_model' 151 | if not os.path.exists(subdir): 152 | os.makedirs(subdir) 153 | subdir = subdir.replace('\\', '/') # needed for Windows 154 | 155 | for filename in ['encoder.json', 'vocab.bpe']: 156 | r = requests.get("https://openaipublic.blob.core.windows.net/gpt-2/models/117M/" + filename, stream=True) 157 | 158 | with open(os.path.join(subdir, filename), 'wb') as f: 159 | file_size = int(r.headers["content-length"]) 160 | chunk_size = 1000 161 | with tqdm(ncols=100, desc="Fetching " + filename, total=file_size, unit_scale=True) as pbar: 162 | # 1k for chunk_size, since Ethernet packet size is around 1500 bytes 163 | for chunk in r.iter_content(chunk_size=chunk_size): 164 | f.write(chunk) 165 | 
pbar.update(chunk_size) 166 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Configs and keys 2 | ch07/02_dataset-utilities/config.json 3 | ch07/03_model-evaluation/config.json 4 | 5 | # Graphics 6 | appendix-D/01_main-chapter-code/1.pdf 7 | appendix-D/01_main-chapter-code/2.pdf 8 | appendix-D/01_main-chapter-code/3.pdf 9 | 10 | appendix-E/01_main-chapter-code/loss-plot.pdf 11 | 12 | ch05/01_main-chapter-code/loss-plot.pdf 13 | ch05/01_main-chapter-code/temperature-plot.pdf 14 | ch05/01_main-chapter-code/the-verdict.txt 15 | 16 | ch06/01_main-chapter-code/loss-plot.pdf 17 | ch06/01_main-chapter-code/accuracy-plot.pdf 18 | 19 | ch07/01_main-chapter-code/loss-plot.pdf 20 | ch07/01_main-chapter-code/loss-plot-standalone.pdf 21 | ch07/01_main-chapter-code/loss-plot-baseline.pdf 22 | ch07/01_main-chapter-code/loss-plot-mask-instructions.pdf 23 | ch07/01_main-chapter-code/loss-plot-phi3-prompt.pdf 24 | ch07/01_main-chapter-code/loss-plot-alpaca52k.pdf 25 | 26 | # Checkpoint files 27 | appendix-A/01_main-chapter-code/model.pth 28 | 29 | appendix-E/01_main-chapter-code/gpt2 30 | 31 | ch05/01_main-chapter-code/gpt2/ 32 | ch05/02_alternative_weight_loading/checkpoints 33 | ch05/01_main-chapter-code/model.pth 34 | ch05/01_main-chapter-code/model_and_optimizer.pth 35 | ch05/03_bonus_pretraining_on_gutenberg/model_checkpoints 36 | 37 | ch06/01_main-chapter-code/gpt2 38 | ch06/02_bonus_additional-experiments/gpt2 39 | ch06/03_bonus_imdb-classification/gpt2 40 | 41 | ch07/01_main-chapter-code/gpt2-medium355M-sft-baseline.pth 42 | ch07/01_main-chapter-code/gpt2-medium355M-sft-mask-instructions.pth 43 | ch07/01_main-chapter-code/gpt2-medium355M-sft-phi3-prompt.pth 44 | ch07/01_main-chapter-code/gpt2-medium355M-sft-alpaca52k.pth 45 | ch07/01_main-chapter-code/gpt2-medium355M-sft-lora.pth 46 | ch07/01_main-chapter-code/gpt2-medium355M-sft.pth 47 | ch07/01_main-chapter-code/gpt2-medium355M-sft-standalone.pth 48 | ch07/01_main-chapter-code/Smalltestmodel-sft-standalone.pth 49 | ch07/01_main-chapter-code/gpt2/ 50 | 51 | # Datasets 52 | appendix-E/01_main-chapter-code/sms_spam_collection.zip 53 | appendix-E/01_main-chapter-code/sms_spam_collection 54 | appendix-E/01_main-chapter-code/train.csv 55 | appendix-E/01_main-chapter-code/test.csv 56 | appendix-E/01_main-chapter-code/validation.csv 57 | 58 | ch02/01_main-chapter-code/number-data.txt 59 | 60 | ch05/03_bonus_pretraining_on_gutenberg/gutenberg 61 | ch05/03_bonus_pretraining_on_gutenberg/gutenberg_preprocessed 62 | 63 | ch06/01_main-chapter-code/sms_spam_collection.zip 64 | ch06/01_main-chapter-code/sms_spam_collection 65 | ch06/01_main-chapter-code/test.csv 66 | ch06/01_main-chapter-code/train.csv 67 | ch06/01_main-chapter-code/validation.csv 68 | ch06/01_main-chapter-code/review_classifier.pth 69 | ch06/02_bonus_additional-experiments/test.csv 70 | ch06/02_bonus_additional-experiments/train.csv 71 | ch06/02_bonus_additional-experiments/validation.csv 72 | ch06/02_bonus_additional-experiments/sms_spam_collection.zip 73 | ch06/02_bonus_additional-experiments/sms_spam_collection 74 | ch06/03_bonus_imdb-classification/aclImdb/ 75 | ch06/03_bonus_imdb-classification/aclImdb_v1.tar.gz 76 | ch06/03_bonus_imdb-classification/test.csv 77 | ch06/03_bonus_imdb-classification/train.csv 78 | ch06/03_bonus_imdb-classification/validation.csv 79 | 80 | ch07/01_main-chapter-code/instruction-data-with-response-standalone.json 81 | 
ch07/01_main-chapter-code/instruction-data-with-response-baseline.json 82 | ch07/01_main-chapter-code/instruction-data-with-response-mask-instructions.json 83 | ch07/01_main-chapter-code/loss-plot-lora.pdf 84 | ch07/01_main-chapter-code/instruction-data-with-response-alpaca52k.json 85 | ch07/01_main-chapter-code/instruction-data-with-response-lora.json 86 | ch07/01_main-chapter-code/instruction-data-with-response-phi3-prompt.json 87 | ch07/02_dataset-utilities/instruction-examples-modified.json 88 | ch07/04_preference-tuning-with-dpo/gpt2-medium355M-sft.pth 89 | ch07/04_preference-tuning-with-dpo/loss-plot.pdf 90 | 91 | # Temporary OS-related files 92 | .DS_Store 93 | 94 | # Byte-compiled / optimized / DLL files 95 | __pycache__/ 96 | *.py[cod] 97 | *$py.class 98 | *.key 99 | solution/ 100 | 101 | # C extensions 102 | *.so 103 | 104 | # Distribution / packaging 105 | .Python 106 | build/ 107 | develop-eggs/ 108 | dist/ 109 | downloads/ 110 | eggs/ 111 | .eggs/ 112 | lib/ 113 | lib64/ 114 | parts/ 115 | sdist/ 116 | var/ 117 | wheels/ 118 | share/python-wheels/ 119 | *.egg-info/ 120 | .installed.cfg 121 | *.egg 122 | MANIFEST 123 | 124 | # PyInstaller 125 | # Usually these files are written by a python script from a template 126 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 127 | *.manifest 128 | *.spec 129 | 130 | # Installer logs 131 | pip-log.txt 132 | pip-delete-this-directory.txt 133 | 134 | # Unit test / coverage reports 135 | htmlcov/ 136 | .tox/ 137 | .nox/ 138 | .coverage 139 | .coverage.* 140 | .cache 141 | nosetests.xml 142 | coverage.xml 143 | *.cover 144 | *.py,cover 145 | .hypothesis/ 146 | .pytest_cache/ 147 | cover/ 148 | 149 | # Translations 150 | *.mo 151 | *.pot 152 | 153 | # Django stuff: 154 | *.log 155 | local_settings.py 156 | db.sqlite3 157 | db.sqlite3-journal 158 | 159 | # Flask stuff: 160 | instance/ 161 | .webassets-cache 162 | 163 | # Scrapy stuff: 164 | .scrapy 165 | 166 | # Sphinx documentation 167 | docs/_build/ 168 | 169 | # PyBuilder 170 | .pybuilder/ 171 | target/ 172 | 173 | # Jupyter Notebook 174 | .ipynb_checkpoints 175 | 176 | # IPython 177 | profile_default/ 178 | ipython_config.py 179 | 180 | # pyenv 181 | # For a library or package, you might want to ignore these files since the code is 182 | # intended to run in multiple environments; otherwise, check them in: 183 | # .python-version 184 | 185 | # pipenv 186 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 187 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 188 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 189 | # install all needed dependencies. 190 | #Pipfile.lock 191 | 192 | # poetry 193 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 194 | # This is especially recommended for binary packages to ensure reproducibility, and is more 195 | # commonly ignored for libraries. 196 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 197 | #poetry.lock 198 | 199 | # pdm 200 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 201 | #pdm.lock 202 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 203 | # in version control. 204 | # https://pdm.fming.dev/#use-with-ide 205 | .pdm.toml 206 | 207 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 208 | __pypackages__/ 209 | 210 | # Celery stuff 211 | celerybeat-schedule 212 | celerybeat.pid 213 | 214 | # SageMath parsed files 215 | *.sage.py 216 | 217 | # Environments 218 | .env 219 | .venv 220 | env/ 221 | venv/ 222 | ENV/ 223 | env.bak/ 224 | venv.bak/ 225 | 226 | # Spyder project settings 227 | .spyderproject 228 | .spyproject 229 | 230 | # Rope project settings 231 | .ropeproject 232 | 233 | # mkdocs documentation 234 | /site 235 | 236 | # mypy 237 | .mypy_cache/ 238 | .dmypy.json 239 | dmypy.json 240 | 241 | # Pyre type checker 242 | .pyre/ 243 | 244 | # pytype static type analyzer 245 | .pytype/ 246 | 247 | # Cython debug symbols 248 | cython_debug/ 249 | 250 | # PyCharm 251 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 252 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 253 | # and can be added to the global gitignore or merged into this file. For a more nuclear 254 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 255 | #.idea/ 256 | 257 | # vscode 258 | .vscode/ 259 | -------------------------------------------------------------------------------- /ch06/01_main-chapter-code/load-finetuned-model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "1545a16b-bc8d-4e49-b9a6-db6631e7483d", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "\n", 10 | "\n", 16 | "\n", 19 | "\n", 20 | "
\n", 11 | "\n", 12 | "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka
\n", 13 | "
Code repository: https://github.com/rasbt/LLMs-from-scratch\n", 14 | "
\n", 15 | "
\n", 17 | "\n", 18 | "
" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "f3f83194-82b9-4478-9550-5ad793467bd0", 26 | "metadata": {}, 27 | "source": [ 28 | "# Load And Use Finetuned Model" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "466b564e-4fd5-4d76-a3a1-63f9f0993b7e", 34 | "metadata": {}, 35 | "source": [ 36 | "This notebook contains minimal code to load the finetuned model that was created and saved in chapter 6 via [ch06.ipynb](ch06.ipynb)." 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 1, 42 | "id": "fd80e5f5-0f79-4a6c-bf31-2026e7d30e52", 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "name": "stdout", 47 | "output_type": "stream", 48 | "text": [ 49 | "tiktoken version: 0.7.0\n", 50 | "torch version: 2.4.0\n" 51 | ] 52 | } 53 | ], 54 | "source": [ 55 | "from importlib.metadata import version\n", 56 | "\n", 57 | "pkgs = [\n", 58 | " \"tiktoken\", # Tokenizer\n", 59 | " \"torch\", # Deep learning library\n", 60 | "]\n", 61 | "for p in pkgs:\n", 62 | " print(f\"{p} version: {version(p)}\")" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 2, 68 | "id": "ed86d6b7-f32d-4601-b585-a2ea3dbf7201", 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "from pathlib import Path\n", 73 | "\n", 74 | "finetuned_model_path = Path(\"review_classifier.pth\")\n", 75 | "if not finetuned_model_path.exists():\n", 76 | " print(\n", 77 | " f\"Could not find '{finetuned_model_path}'.\\n\"\n", 78 | " \"Run the `ch06.ipynb` notebook to finetune and save the finetuned model.\"\n", 79 | " )" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 3, 85 | "id": "fb02584a-5e31-45d5-8377-794876907bc6", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "from previous_chapters import GPTModel\n", 90 | "\n", 91 | "\n", 92 | "BASE_CONFIG = {\n", 93 | " \"vocab_size\": 50257, # Vocabulary size\n", 94 | " \"context_length\": 1024, # Context length\n", 95 | " \"drop_rate\": 0.0, # Dropout rate\n", 96 | " \"qkv_bias\": True # Query-key-value bias\n", 97 | "}\n", 98 | "\n", 99 | "model_configs = {\n", 100 | " \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n", 101 | " \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n", 102 | " \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n", 103 | " \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n", 104 | "}\n", 105 | "\n", 106 | "CHOOSE_MODEL = \"gpt2-small (124M)\"\n", 107 | "\n", 108 | "BASE_CONFIG.update(model_configs[CHOOSE_MODEL])\n", 109 | "\n", 110 | "# Initialize base model\n", 111 | "model_size = CHOOSE_MODEL.split(\" \")[-1].lstrip(\"(\").rstrip(\")\")\n", 112 | "model = GPTModel(BASE_CONFIG)" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 4, 118 | "id": "f1ccf2b7-176e-4cfd-af7a-53fb76010b94", 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "import torch\n", 123 | "\n", 124 | "# Convert model to classifier as in section 6.5 in ch06.ipynb\n", 125 | "num_classes = 2\n", 126 | "model.out_head = torch.nn.Linear(in_features=BASE_CONFIG[\"emb_dim\"], out_features=num_classes)\n", 127 | "\n", 128 | "# Then load pretrained weights\n", 129 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", 130 | "model.load_state_dict(torch.load(\"review_classifier.pth\", map_location=device, weights_only=True))\n", 131 | "model.eval();" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | 
"execution_count": 5, 137 | "id": "a1fd174e-9555-46c5-8780-19b0aa4f26e5", 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "import tiktoken\n", 142 | "\n", 143 | "tokenizer = tiktoken.get_encoding(\"gpt2\")" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 6, 149 | "id": "2a4c0129-efe5-46e9-bb90-ba08d407c1a2", 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "# This function was implemented in ch06.ipynb\n", 154 | "def classify_review(text, model, tokenizer, device, max_length=None, pad_token_id=50256):\n", 155 | " model.eval()\n", 156 | "\n", 157 | " # Prepare inputs to the model\n", 158 | " input_ids = tokenizer.encode(text)\n", 159 | " supported_context_length = model.pos_emb.weight.shape[1]\n", 160 | "\n", 161 | " # Truncate sequences if they too long\n", 162 | " input_ids = input_ids[:min(max_length, supported_context_length)]\n", 163 | "\n", 164 | " # Pad sequences to the longest sequence\n", 165 | " input_ids += [pad_token_id] * (max_length - len(input_ids))\n", 166 | " input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0) # add batch dimension\n", 167 | "\n", 168 | " # Model inference\n", 169 | " with torch.no_grad():\n", 170 | " logits = model(input_tensor)[:, -1, :] # Logits of the last output token\n", 171 | " predicted_label = torch.argmax(logits, dim=-1).item()\n", 172 | "\n", 173 | " # Return the classified result\n", 174 | " return \"spam\" if predicted_label == 1 else \"not spam\"" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 7, 180 | "id": "1e26862c-10b5-4a0f-9dd6-b6ddbad2fc3f", 181 | "metadata": {}, 182 | "outputs": [ 183 | { 184 | "name": "stdout", 185 | "output_type": "stream", 186 | "text": [ 187 | "spam\n" 188 | ] 189 | } 190 | ], 191 | "source": [ 192 | "text_1 = (\n", 193 | " \"You are a winner you have been specially\"\n", 194 | " \" selected to receive $1000 cash or a $2000 award.\"\n", 195 | ")\n", 196 | "\n", 197 | "print(classify_review(\n", 198 | " text_1, model, tokenizer, device, max_length=120\n", 199 | "))" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 8, 205 | "id": "78472e05-cb4e-4ec4-82e8-23777aa90cf8", 206 | "metadata": {}, 207 | "outputs": [ 208 | { 209 | "name": "stdout", 210 | "output_type": "stream", 211 | "text": [ 212 | "not spam\n" 213 | ] 214 | } 215 | ], 216 | "source": [ 217 | "text_2 = (\n", 218 | " \"Hey, just wanted to check if we're still on\"\n", 219 | " \" for dinner tonight? Let me know!\"\n", 220 | ")\n", 221 | "\n", 222 | "print(classify_review(\n", 223 | " text_2, model, tokenizer, device, max_length=120\n", 224 | "))" 225 | ] 226 | } 227 | ], 228 | "metadata": { 229 | "kernelspec": { 230 | "display_name": "Python 3 (ipykernel)", 231 | "language": "python", 232 | "name": "python3" 233 | }, 234 | "language_info": { 235 | "codemirror_mode": { 236 | "name": "ipython", 237 | "version": 3 238 | }, 239 | "file_extension": ".py", 240 | "mimetype": "text/x-python", 241 | "name": "python", 242 | "nbconvert_exporter": "python", 243 | "pygments_lexer": "ipython3", 244 | "version": "3.11.4" 245 | } 246 | }, 247 | "nbformat": 4, 248 | "nbformat_minor": 5 249 | } 250 | -------------------------------------------------------------------------------- /ch05/05_bonus_hparam_tuning/hparam_search.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt). 
2 | # Source for "Build a Large Language Model From Scratch" 3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | # Code: https://github.com/rasbt/LLMs-from-scratch 5 | 6 | import itertools 7 | import math 8 | import os 9 | import tiktoken 10 | import torch 11 | from previous_chapters import GPTModel, create_dataloader_v1 12 | 13 | 14 | # Define a grid of hyperparameters to search over 15 | HPARAM_GRID = { 16 | "batch_size": [2, 4, 8, 16], 17 | "drop_rate": [0.0, 0.1, 0.2], 18 | "warmup_iters": [10, 20, 30], 19 | "weight_decay": [0.1, 0.01, 0.0], 20 | "peak_lr": [0.0001, 0.0005, 0.001, 0.005], 21 | "initial_lr": [0.00005, 0.0001], 22 | "min_lr": [0.00005, 0.00001, 0.0001], 23 | "n_epochs": [5, 10, 15, 20, 25], 24 | } 25 | 26 | 27 | def calc_loss_loader(data_loader, model, device, num_batches=None): 28 | total_loss = 0. 29 | if len(data_loader) == 0: 30 | return float("nan") 31 | elif num_batches is None: 32 | num_batches = len(data_loader) 33 | else: 34 | num_batches = min(num_batches, len(data_loader)) 35 | for i, (input_batch, target_batch) in enumerate(data_loader): 36 | if i < num_batches: 37 | loss = calc_loss_batch(input_batch, target_batch, model, device) 38 | total_loss += loss.item() 39 | else: 40 | break 41 | return total_loss / num_batches 42 | 43 | 44 | def calc_loss_batch(input_batch, target_batch, model, device): 45 | input_batch, target_batch = input_batch.to(device), target_batch.to(device) 46 | 47 | logits = model(input_batch) 48 | logits = logits.view(-1, logits.size(-1)) 49 | loss = torch.nn.functional.cross_entropy(logits, target_batch.view(-1)) 50 | return loss 51 | 52 | 53 | def evaluate_model(model, train_loader, val_loader, device, eval_iter): 54 | model.eval() 55 | with torch.no_grad(): 56 | train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter) 57 | val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter) 58 | model.train() 59 | return train_loss, val_loss 60 | 61 | 62 | def train_model(model, train_loader, val_loader, optimizer, device, 63 | n_epochs, eval_freq, eval_iter, 64 | encoded_start_context, tokenizer, warmup_iters=10, 65 | initial_lr=3e-05, min_lr=1e-6): 66 | global_step = 0 67 | 68 | max_lr = optimizer.param_groups[0]["lr"] 69 | 70 | # Calculate total number of iterations 71 | total_training_iters = len(train_loader) * n_epochs 72 | 73 | # Calculate the learning rate increment at each step during warmup 74 | lr_increment = (optimizer.param_groups[0]["lr"] - initial_lr) / warmup_iters 75 | 76 | for epoch in range(n_epochs): 77 | model.train() 78 | for input_batch, target_batch in train_loader: 79 | optimizer.zero_grad() 80 | 81 | # Increment the global step at the beginning of the iteration 82 | global_step += 1 83 | 84 | # Warmup: adjust learning rate linearly 85 | if global_step <= warmup_iters: 86 | lr = initial_lr + global_step * lr_increment 87 | # Cosine annealing phase 88 | else: 89 | progress = (global_step - warmup_iters) / (total_training_iters - warmup_iters) 90 | lr = min_lr + (max_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress)) 91 | 92 | # Apply the calculated learning rate 93 | for param_group in optimizer.param_groups: 94 | param_group["lr"] = lr 95 | 96 | loss = calc_loss_batch(input_batch, target_batch, model, device) 97 | loss.backward() 98 | 99 | # Apply gradient clipping 100 | if global_step >= warmup_iters: 101 | torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) 102 | 103 | optimizer.step() 104 | 105 | train_loss, val_loss = 
evaluate_model(model, train_loader, val_loader, device, eval_iter) 106 | 107 | return train_loss, val_loss 108 | 109 | 110 | if __name__ == "__main__": 111 | 112 | # Generate all combinations of hyperparameters 113 | hyperparameter_combinations = list(itertools.product(*HPARAM_GRID.values())) 114 | total_combinations = len(hyperparameter_combinations) 115 | print(f"Total hyperparameter configurations: {total_combinations}") 116 | 117 | # Placeholder for the best loss and best hyperparameters 118 | best_val_loss = float('inf') 119 | best_hparams = {} 120 | 121 | script_path = os.path.abspath(__file__) 122 | script_dir = os.path.dirname(script_path) 123 | with open(os.path.join(script_dir, "the-verdict.txt"), "r", encoding="utf-8") as file: 124 | text_data = file.read() 125 | 126 | tokenizer = tiktoken.get_encoding("gpt2") 127 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 128 | 129 | train_ratio = 0.95 130 | split_idx = int(train_ratio * len(text_data)) 131 | 132 | torch.manual_seed(123) 133 | 134 | interrupted = False 135 | current_config = 0 136 | for combination in hyperparameter_combinations: 137 | 138 | try: 139 | current_config += 1 140 | print(f"Evaluating configuration {current_config} of {total_combinations}") 141 | 142 | # Unpack the current combination of hyperparameters 143 | HPARAM_CONFIG = dict(zip(HPARAM_GRID.keys(), combination)) 144 | 145 | GPT_CONFIG_124M = { 146 | "vocab_size": 50257, # Vocabulary size 147 | "context_length": 256, # Context length -- shortened from original 1024 tokens 148 | "emb_dim": 768, # Embedding dimension 149 | "n_heads": 12, # Number of attention heads 150 | "n_layers": 12, # Number of layers 151 | "drop_rate": HPARAM_CONFIG["drop_rate"], 152 | "qkv_bias": False, # Query-Key-Value bias 153 | } 154 | 155 | torch.manual_seed(123) 156 | train_loader = create_dataloader_v1( 157 | text_data[:split_idx], 158 | batch_size=HPARAM_CONFIG["batch_size"], 159 | max_length=GPT_CONFIG_124M["context_length"], 160 | stride=GPT_CONFIG_124M["context_length"], 161 | drop_last=True, 162 | shuffle=True, 163 | num_workers=0 164 | ) 165 | 166 | val_loader = create_dataloader_v1( 167 | text_data[split_idx:], 168 | batch_size=HPARAM_CONFIG["batch_size"], 169 | max_length=GPT_CONFIG_124M["context_length"], 170 | stride=GPT_CONFIG_124M["context_length"], 171 | drop_last=False, 172 | shuffle=False, 173 | num_workers=0 174 | ) 175 | 176 | model = GPTModel(GPT_CONFIG_124M) 177 | model.to(device) 178 | 179 | optimizer = torch.optim.AdamW( 180 | model.parameters(), 181 | lr=HPARAM_CONFIG["peak_lr"], 182 | weight_decay=HPARAM_CONFIG["weight_decay"] 183 | ) 184 | 185 | encoded_start_context = tokenizer.encode("Nevertheless") 186 | encoded_tensor = torch.tensor(encoded_start_context).unsqueeze(0) 187 | 188 | train_loss, val_loss = train_model( 189 | model, train_loader, val_loader, optimizer, device, 190 | n_epochs=HPARAM_CONFIG["n_epochs"], 191 | eval_freq=5, eval_iter=1, 192 | encoded_start_context=encoded_tensor, 193 | tokenizer=tokenizer, 194 | warmup_iters=HPARAM_CONFIG["warmup_iters"], 195 | initial_lr=HPARAM_CONFIG["initial_lr"], 196 | min_lr=HPARAM_CONFIG["min_lr"] 197 | ) 198 | 199 | # Log the best hyperparameters based on validation loss 200 | if val_loss < best_val_loss: 201 | best_val_loss = val_loss 202 | best_train_loss = train_loss 203 | best_hparams = HPARAM_CONFIG 204 | 205 | except KeyboardInterrupt: 206 | print("Hyperparameter search completed.") 207 | print(f"Best hyperparameters: {best_hparams}") 208 | print(f"Best Val loss: 
{best_val_loss} | Training loss {best_train_loss}") 209 | interrupted = True 210 | break 211 | 212 | if not interrupted: 213 | print("Hyperparameter search completed.") 214 | print(f"Best hyperparameters: {best_hparams}") 215 | print(f"Best Val loss: {best_val_loss} | Training loss {best_train_loss}") 216 | --------------------------------------------------------------------------------