├── ch04
│   ├── 02_performance-analysis
│   │   ├── requirements-extra.txt
│   │   ├── README.md
│   │   └── flops-analysis.ipynb
│   ├── README.md
│   └── 01_main-chapter-code
│       ├── README.md
│       ├── tests.py
│       └── previous_chapters.py
├── ch07
│   ├── 03_model-evaluation
│   │   ├── requirements-extra.txt
│   │   ├── config.json
│   │   ├── scores
│   │   │   ├── llama3-8b-model-2-response.json
│   │   │   ├── gpt4-model-2-response.json
│   │   │   ├── llama3-8b-model-1-response.json
│   │   │   └── gpt4-model-1-response.json
│   │   └── README.md
│   ├── 02_dataset-utilities
│   │   ├── requirements-extra.txt
│   │   ├── config.json
│   │   ├── README.md
│   │   └── find-near-duplicates.py
│   ├── 05_dataset-generation
│   │   └── README.md
│   ├── 04_preference-tuning-with-dpo
│   │   └── README.md
│   ├── 01_main-chapter-code
│   │   ├── tests.py
│   │   ├── README.md
│   │   ├── ollama_evaluate.py
│   │   ├── gpt_download.py
│   │   └── load-finetuned-model.ipynb
│   └── README.md
├── ch02
│   ├── 02_bonus_bytepair-encoder
│   │   ├── requirements-extra.txt
│   │   ├── README.md
│   │   └── bpe_openai_gpt2.py
│   ├── 04_bonus_dataloader-intuition
│   │   └── README.md
│   ├── 03_bonus_embedding-vs-matmul
│   │   └── README.md
│   ├── 01_main-chapter-code
│   │   ├── README.md
│   │   └── dataloader.ipynb
│   └── README.md
├── ch06
│   ├── 03_bonus_imdb-classification
│   │   ├── requirements-extra.txt
│   │   ├── train_sklearn_logreg.py
│   │   ├── download_prepare_dataset.py
│   │   ├── README.md
│   │   └── gpt_download.py
│   ├── README.md
│   ├── 01_main-chapter-code
│   │   ├── tests.py
│   │   ├── README.md
│   │   ├── exercise-solutions.ipynb
│   │   ├── gpt_download.py
│   │   └── load-finetuned-model.ipynb
│   └── 02_bonus_additional-experiments
│       └── gpt_download.py
├── appendix-E
│   ├── README.md
│   └── 01_main-chapter-code
│       └── gpt_download.py
├── appendix-D
│   └── README.md
├── appendix-A
│   ├── 02_setup-recommendations
│   │   └── README.md
│   └── 01_main-chapter-code
│       ├── exercise-solutions.ipynb
│       └── DDP-script.py
├── setup
│   ├── 03_optional-docker-environment
│   │   ├── .devcontainer
│   │   │   ├── README.md
│   │   │   ├── Dockerfile
│   │   │   └── devcontainer.json
│   │   └── README.md
│   ├── .vscode
│   │   └── extensions.json
│   ├── 02_installing-python-libraries
│   │   ├── tests.py
│   │   ├── python_environment_check.ipynb
│   │   ├── README.md
│   │   └── python_environment_check.py
│   ├── 01_optional-python-setup-preferences
│   │   └── README.md
│   └── README.md
├── .github
│   ├── ISSUE_TEMPLATE
│   │   ├── ask-a-question.md
│   │   └── bug-report.yaml
│   └── workflows
│       ├── pep8-linter.yml
│       ├── check-spelling-errors.yml
│       ├── check-links.yml
│       ├── basic-tests-linux.yml
│       ├── basic-tests-macos.yml
│       ├── basic-tests-old-pytorch.yml
│       └── basic-tests-windows.yml
├── ch03
│   ├── 01_main-chapter-code
│   │   ├── README.md
│   │   └── small-text-sample.txt
│   ├── 02_bonus_efficient-multihead-attention
│   │   ├── README.md
│   │   └── ch03.py
│   ├── 03_understanding-buffers
│   │   └── README.md
│   └── README.md
├── ch05
│   ├── 02_alternative_weight_loading
│   │   └── README.md
│   ├── 05_bonus_hparam_tuning
│   │   ├── README.md
│   │   └── hparam_search.py
│   ├── 04_learning_rate_schedulers
│   │   └── README.md
│   ├── README.md
│   ├── 03_bonus_pretraining_on_gutenberg
│   │   ├── tests.py
│   │   └── prepare_dataset.py
│   └── 01_main-chapter-code
│       ├── README.md
│       ├── tests.py
│       └── gpt_download.py
├── ch01
│   └── README.md
├── requirements.txt
└── .gitignore
/ch04/02_performance-analysis/requirements-extra.txt:
--------------------------------------------------------------------------------
1 | thop
--------------------------------------------------------------------------------
/ch07/03_model-evaluation/requirements-extra.txt:
--------------------------------------------------------------------------------
1 | openai>=1.30.3
2 | tqdm>=4.65.0
3 |
--------------------------------------------------------------------------------
/ch02/02_bonus_bytepair-encoder/requirements-extra.txt:
--------------------------------------------------------------------------------
1 | requests
2 | tqdm
3 | transformers>=4.33.2
4 |
--------------------------------------------------------------------------------
/ch06/03_bonus_imdb-classification/requirements-extra.txt:
--------------------------------------------------------------------------------
1 | transformers>=4.33.2
2 | scikit-learn>=1.3.0
--------------------------------------------------------------------------------
/ch07/02_dataset-utilities/requirements-extra.txt:
--------------------------------------------------------------------------------
1 | openai>=1.30.3
2 | scikit-learn>=1.3.1
3 | tqdm>=4.65.0
--------------------------------------------------------------------------------
/appendix-E/README.md:
--------------------------------------------------------------------------------
1 | # Appendix E: Parameter-efficient Finetuning with LoRA
2 |
3 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code.
--------------------------------------------------------------------------------
/appendix-D/README.md:
--------------------------------------------------------------------------------
1 | # Appendix D: Adding Bells and Whistles to the Training Loop
2 |
3 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code.
--------------------------------------------------------------------------------
/ch07/02_dataset-utilities/config.json:
--------------------------------------------------------------------------------
1 | {
2 | "OPENAI_API_KEY": "sk-...",
3 | "_comment": "Enter your API key from https://platform.openai.com/api-keys"
4 | }
5 |
--------------------------------------------------------------------------------
/ch07/03_model-evaluation/config.json:
--------------------------------------------------------------------------------
1 | {
2 | "OPENAI_API_KEY": "sk-...",
3 | "_comment": "Enter your API key from https://platform.openai.com/api-keys"
4 | }
5 |
--------------------------------------------------------------------------------
/ch02/04_bonus_dataloader-intuition/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 2: Working with Text Data
2 |
3 | - [dataloader-intuition.ipynb](dataloader-intuition.ipynb) contains optional (bonus) code to explain the data loader more intuitively with simple numbers rather than text.
4 |
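To give a flavor of the idea (a minimal sketch under assumed parameter names, not the notebook's exact code), the sliding-window input–target scheme from the main chapter can be applied to plain integers instead of token IDs:

```python
# Minimal sketch: sliding-window inputs/targets over plain integers instead of token IDs.
import torch
from torch.utils.data import Dataset, DataLoader


class NumberDataset(Dataset):
    def __init__(self, numbers, max_length=4, stride=1):
        self.inputs, self.targets = [], []
        # Each target window is the input window shifted by one position
        for i in range(0, len(numbers) - max_length, stride):
            self.inputs.append(torch.tensor(numbers[i:i + max_length]))
            self.targets.append(torch.tensor(numbers[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]


loader = DataLoader(NumberDataset(list(range(20))), batch_size=2, shuffle=False)
x, y = next(iter(loader))
print(x)  # tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
print(y)  # tensor([[1, 2, 3, 4], [2, 3, 4, 5]])
```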
--------------------------------------------------------------------------------
/appendix-A/02_setup-recommendations/README.md:
--------------------------------------------------------------------------------
1 | ## Python and Environment Setup Recommendations
2 |
3 |
4 |
5 | Please see the [README.md](../../setup/README.md) in the [setup](../../setup) directory for Python installation and setup recommendations.
6 |
7 |
8 |
9 |
--------------------------------------------------------------------------------
/ch02/03_bonus_embedding-vs-matmul/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 2: Working with Text Data
2 |
3 | - [embeddings-and-linear-layers.ipynb](embeddings-and-linear-layers.ipynb) contains optional (bonus) code to explain that embedding layers and fully connected layers applied to one-hot encoded vectors are equivalent.
4 |
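As a minimal sketch of the equivalence demonstrated in the notebook (not its exact code), an `nn.Embedding` lookup produces the same result as an `nn.Linear` layer applied to one-hot encoded inputs when the two share the same weight matrix:

```python
# Sketch: embedding lookup == matrix multiplication with one-hot vectors (tied weights).
import torch

torch.manual_seed(123)
num_tokens, emb_dim = 5, 3
idx = torch.tensor([2, 4, 1])

embedding = torch.nn.Embedding(num_tokens, emb_dim)
linear = torch.nn.Linear(num_tokens, emb_dim, bias=False)
linear.weight = torch.nn.Parameter(embedding.weight.T)  # tie the weights

onehot = torch.nn.functional.one_hot(idx, num_classes=num_tokens).float()
print(torch.allclose(embedding(idx), linear(onehot)))  # True
```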
--------------------------------------------------------------------------------
/ch02/02_bonus_bytepair-encoder/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 2: Working with Text Data
2 |
3 |
4 |
5 | - [compare-bpe-tiktoken.ipynb](compare-bpe-tiktoken.ipynb) benchmarks various byte pair encoding implementations
6 | - [bpe_openai_gpt2.py](bpe_openai_gpt2.py) is the original bytepair encoder code used by OpenAI
7 |
8 |
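As a rough, hedged illustration of the kind of comparison made in the notebook (not its exact code), the snippet below encodes the same text with OpenAI's `tiktoken` GPT-2 encoding and the Hugging Face `transformers` GPT-2 tokenizer; both should produce the same token IDs:

```python
# Sketch: comparing two BPE implementations on the same input text.
# Assumes tiktoken (main requirements) and transformers (requirements-extra) are installed.
import tiktoken
from transformers import GPT2Tokenizer

text = "Hello, world. Is this-- a test?"

tik_tokenizer = tiktoken.get_encoding("gpt2")
hf_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print(tik_tokenizer.encode(text, allowed_special={"<|endoftext|>"}))
print(hf_tokenizer(text)["input_ids"])
```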
--------------------------------------------------------------------------------
/setup/03_optional-docker-environment/.devcontainer/README.md:
--------------------------------------------------------------------------------
1 | # Optional Docker Environment
2 |
3 | This is an optional Docker environment for those users who prefer Docker. In case you are interested in using this Docker DevContainer, please see the *Using Docker DevContainers* section in the [../../README.md](../../README.md) for more information.
--------------------------------------------------------------------------------
/ch07/05_dataset-generation/README.md:
--------------------------------------------------------------------------------
1 | # Generating a Dataset for Instruction Finetuning
2 |
3 | This folder contains utility code that can be used for generating a dataset for instruction finetuning.
4 |
5 | - [llama3-ollama.ipynb](llama3-ollama.ipynb): A notebook that creates a synthetic instruction finetuning dataset using Llama 3 and Ollama
6 |
7 |
--------------------------------------------------------------------------------
/ch02/01_main-chapter-code/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 2: Working with Text Data
2 |
3 | ### Main Chapter Code
4 |
5 | - [ch02.ipynb](ch02.ipynb) contains all the code as it appears in the chapter
6 |
7 | ### Optional Code
8 |
9 | - [dataloader.ipynb](dataloader.ipynb) is a minimal notebook with the main data loading pipeline implemented in this chapter
10 |
--------------------------------------------------------------------------------
/setup/.vscode/extensions.json:
--------------------------------------------------------------------------------
1 | {
2 | "recommendations": [
3 | "ms-python.python",
4 | "ms-toolsai.jupyter",
5 | "ms-azuretools.vscode-docker",
6 | "ms-vscode-remote.vscode-remote-extensionpack",
7 | "yahyabatulu.vscode-markdown-alert",
8 | "tomoki1207.pdf",
9 | "mechatroner.rainbow-csv"
10 | ]
11 | }
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/ask-a-question.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Ask a Question
3 | about: Ask questions related to the book
4 | title: ''
5 | labels: [question]
6 | assignees: rasbt
7 |
8 | ---
9 |
10 | If you have a question that is not a bug, please consider asking it in this GitHub repository's [discussion forum](https://github.com/rasbt/LLMs-from-scratch/discussions).
11 |
--------------------------------------------------------------------------------
/ch03/01_main-chapter-code/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 3: Coding Attention Mechanisms
2 |
3 | ### Main Chapter Code
4 |
5 | - [ch03.ipynb](ch03.ipynb) contains all the code as it appears in the chapter
6 |
7 | ### Optional Code
8 |
9 | - [multihead-attention.ipynb](multihead-attention.ipynb) is a minimal notebook with the main data loading pipeline and the multi-head attention implementation from this chapter
10 |
11 |
--------------------------------------------------------------------------------
/ch04/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 4: Implementing a GPT Model from Scratch to Generate Text
2 |
3 | ## Main Chapter Code
4 |
5 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code.
6 |
7 | ## Optional Code
8 |
9 | - [02_performance-analysis](02_performance-analysis) contains optional code analyzing the performance of the GPT model(s) implemented in the main chapter.
10 |
11 |
--------------------------------------------------------------------------------
/ch05/02_alternative_weight_loading/README.md:
--------------------------------------------------------------------------------
1 | # Alternative Approaches to Loading Pretrained Weights
2 |
3 | This folder contains alternative weight loading strategies in case the weights become unavailable from OpenAI.
4 |
5 | - [weight-loading-hf-transformers.ipynb](weight-loading-hf-transformers.ipynb): contains code to load the weights from the Hugging Face Model Hub via the `transformers` library
6 |
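For a rough idea of the approach (a sketch, not the notebook's exact code), the pretrained GPT-2 weights can be pulled from the Hugging Face Model Hub as shown below; the notebook then copies these weights into the book's `GPTModel`:

```python
# Sketch: downloading pretrained GPT-2 weights via the transformers library.
from transformers import GPT2Model

gpt_hf = GPT2Model.from_pretrained("gpt2", cache_dir="checkpoints")
gpt_hf.eval()
print(gpt_hf.config)  # model hyperparameters (embedding size, number of layers, etc.)
```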
--------------------------------------------------------------------------------
/ch01/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 1: Understanding Large Language Models
2 |
3 | There is no code in this chapter.
4 |
5 |
6 | As optional bonus material, below is a video tutorial where I explain the LLM development lifecycle covered in this book:
7 |
8 |
9 |
10 |
11 | [Video tutorial](https://www.youtube.com/watch?v=kPGTx4wcm_w)
12 |
13 |
--------------------------------------------------------------------------------
/ch03/02_bonus_efficient-multihead-attention/README.md:
--------------------------------------------------------------------------------
1 | # More Efficient Multi-Head Attention Implementations
2 |
3 | - [mha-implementations.ipynb](mha-implementations.ipynb) contains and compares different implementations of multi-head attention
4 |
5 |
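One variant that commonly appears in such comparisons is PyTorch's built-in fused scaled-dot-product attention; as a rough sketch under assumed tensor shapes (not necessarily one of the notebook's exact implementations):

```python
# Sketch: causal scaled-dot-product attention via PyTorch's fused kernel (PyTorch >= 2.0).
import torch
import torch.nn.functional as F

batch, num_heads, seq_len, head_dim = 2, 12, 256, 64
q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_heads, seq_len, head_dim)
v = torch.randn(batch, num_heads, seq_len, head_dim)

# is_causal=True applies the causal mask internally, replacing an explicit mask tensor
context = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(context.shape)  # torch.Size([2, 12, 256, 64])
```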
--------------------------------------------------------------------------------
/ch07/04_preference-tuning-with-dpo/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 7: Finetuning to Follow Instructions
2 |
3 | - [create-preference-data-ollama.ipynb](create-preference-data-ollama.ipynb): A notebook that creates a synthetic preference finetuning dataset using Llama 3.1 and Ollama
4 |
5 | - [dpo-from-scratch.ipynb](dpo-from-scratch.ipynb): This notebook implements Direct Preference Optimization (DPO) for LLM alignment
6 |
7 |
8 |
--------------------------------------------------------------------------------
/setup/03_optional-docker-environment/.devcontainer/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
2 |
3 | RUN apt-get update && \
4 | apt-get upgrade -y && \
5 | apt-get install -y rsync && \
6 | apt-get install -y git && \
7 | apt-get install -y curl && \
8 | rm -rf /var/lib/apt/lists/*
9 |
10 | COPY requirements.txt requirements.txt
11 |
12 | RUN pip install --no-cache-dir -r requirements.txt
13 |
--------------------------------------------------------------------------------
/ch07/03_model-evaluation/scores/llama3-8b-model-2-response.json:
--------------------------------------------------------------------------------
1 | [76, 85, 67, 90, 20, 98, 22, 96, 40, 80, 40, 20, 90, 98, 80, 92, 98, 98, 95, 99, 55, 99, 80, 90, 20, 4, 98, 4, 40, 95, 14, 44, 95, 44, 80, 4, 4, 40, 95, 80, 98, 95, 92, 98, 68, 20, 20, 60, 95, 90, 98, 0, 20, 80, 20, 80, 92, 98, 98, 20, 95, 100, 95, 85, 98, 4, 40, 98, 98, 65, 20, 76, 100, 67, 44, 92, 75, 97, 27, 98, 20, 60, 90, 96, 67, 98, 80, 10, 80, 98, 100, 40, 92, 98, 20, 98, 98, 20, 20]
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | torch >= 2.0.1 # all
2 | jupyterlab >= 4.0 # all
3 | tiktoken >= 0.5.1 # ch02; ch04; ch05
4 | matplotlib >= 3.7.1 # ch04; ch05
5 | tensorflow >= 2.15.0 # ch05
6 | tqdm >= 4.66.1 # ch05; ch07
7 | numpy >= 1.25, < 2.0 # dependency of several other libraries like torch and pandas
8 | pandas >= 2.2.1 # ch06
9 | psutil >= 5.9.5 # ch07; already installed automatically as dependency of torch
10 |
--------------------------------------------------------------------------------
/ch07/03_model-evaluation/scores/gpt4-model-2-response.json:
--------------------------------------------------------------------------------
1 | [0, 100, 0, 100, 0, 100, 0, 100, 0, 0, 50, 0, 100, 100, 100, 100, 100, 100, 100, 95, 0, 50, 100, 100, 0, 0, 100, 0, 0, 100, 0, 0, 100, 0, 67, 0, 0, 0, 100, 100, 95, 100, 100, 100, 0, 0, 0, 0, 100, 100, 100, 0, 55, 100, 0, 100, 65, 100, 100, 0, 100, 100, 100, 0, 100, 0, 85, 100, 100, 85, 0, 75, 100, 0, 0, 100, 100, 100, 0, 100, 0, 50, 100, 100, 0, 100, 0, 0, 100, 85, 100, 0, 100, 100, 0, 100, 100, 0, 0, 0]
--------------------------------------------------------------------------------
/ch07/03_model-evaluation/scores/llama3-8b-model-1-response.json:
--------------------------------------------------------------------------------
1 | [20, 92, 85, 90, 20, 90, 22, 97, 60, 96, 20, 20, 98, 95, 90, 98, 95, 20, 98, 98, 92, 20, 96, 96, 100, 98, 98, 95, 20, 95, 98, 20, 85, 95, 80, 97, 40, 21, 100, 85, 95, 98, 92, 98, 69, 98, 80, 60, 60, 20, 80, 68, 80, 96, 96, 68, 80, 95, 80, 20, 95, 98, 80, 98, 94, 20, 40, 98, 100, 85, 98, 90, 95, 85, 95, 80, 98, 98, 25, 98, 40, 92, 95, 82, 87, 98, 80, 90, 95, 4, 90, 90, 80, 98, 20, 98, 98, 40, 92, 98]
--------------------------------------------------------------------------------
/ch07/03_model-evaluation/scores/gpt4-model-1-response.json:
--------------------------------------------------------------------------------
1 | [0, 50, 20, 100, 0, 100, 0, 100, 100, 100, 55, 0, 100, 100, 100, 100, 100, 0, 98, 100, 100, 0, 100, 100, 100, 100, 100, 100, 0, 100, 100, 0, 100, 100, 85, 100, 0, 0, 100, 100, 100, 100, 100, 100, 0, 100, 100, 95, 20, 50, 85, 100, 100, 100, 100, 55, 100, 100, 100, 0, 100, 98, 100, 100, 100, 0, 85, 100, 100, 98, 100, 100, 100, 0, 100, 100, 100, 100, 0, 100, 0, 100, 100, 0, 0, 100, 50, 100, 100, 10, 100, 100, 100, 100, 0, 100, 100, 25, 100, 30]
--------------------------------------------------------------------------------
/ch03/03_understanding-buffers/README.md:
--------------------------------------------------------------------------------
1 | # Understanding PyTorch Buffers
2 |
3 | - [understanding-buffers.ipynb](understanding-buffers.ipynb) explains the idea behind PyTorch buffers, which are used to implement the causal attention mechanism in chapter 3
4 |
5 |
6 |
7 | Below is a hands-on video tutorial I recorded to explain the code:
8 |
9 |
10 |
11 |
12 | [Video tutorial](https://www.youtube.com/watch?v=PetlIokI9Ao)
13 |
14 |
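In a nutshell (a minimal sketch, not the notebook's code), a buffer is a tensor registered on a module so that it moves with the model across devices and is saved in its `state_dict`, yet is not a trainable parameter; the causal mask in chapter 3 is registered this way:

```python
# Sketch: registering a causal attention mask as a (non-trainable) buffer.
import torch
import torch.nn as nn


class CausalMaskDemo(nn.Module):
    def __init__(self, context_length):
        super().__init__()
        # Buffers are moved by .to(device) and stored in state_dict,
        # but excluded from model.parameters() and receive no gradients.
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, attn_scores):
        num_tokens = attn_scores.shape[-1]
        return attn_scores.masked_fill(
            self.mask.bool()[:num_tokens, :num_tokens], float("-inf")
        )


demo = CausalMaskDemo(context_length=6)
print(demo(torch.zeros(6, 6)))       # upper triangle masked with -inf
print(list(demo.parameters()))       # [] -- the mask is not a parameter
```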
--------------------------------------------------------------------------------
/ch05/05_bonus_hparam_tuning/README.md:
--------------------------------------------------------------------------------
1 | # Optimizing Hyperparameters for Pretraining
2 |
3 | The [hparam_search.py](hparam_search.py) script, based on the extended training function in [Appendix D: Adding Bells and Whistles to the Training Loop](../../appendix-D/01_main-chapter-code/appendix-D.ipynb), is designed to find optimal hyperparameters via grid search.
4 |
5 | > [!NOTE]
6 | > This script will take a long time to run. You may want to reduce the number of hyperparameter configurations explored in the `HPARAM_GRID` dictionary at the top.
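For intuition, the general pattern is a plain grid search: every combination of the values in `HPARAM_GRID` is tried, and the configuration with the lowest validation loss is kept. A minimal sketch of that loop (assumed names and a dummy objective, not the script's actual code):

```python
# Sketch: exhaustive grid search over a hyperparameter dictionary.
import itertools

HPARAM_GRID = {
    "batch_size": [2, 4],
    "learning_rate": [1e-4, 5e-4],
    "weight_decay": [0.0, 0.1],
}


def evaluate(config):
    # Placeholder objective; the real script trains a model and returns its validation loss.
    return config["learning_rate"] * 1000 + config["weight_decay"]


best_loss, best_config = float("inf"), None
keys = list(HPARAM_GRID.keys())
for values in itertools.product(*HPARAM_GRID.values()):
    config = dict(zip(keys, values))
    loss = evaluate(config)
    if loss < best_loss:
        best_loss, best_config = loss, config

print(best_config, best_loss)
```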
--------------------------------------------------------------------------------
/ch03/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 3: Coding Attention Mechanisms
2 |
3 | ## Main Chapter Code
4 |
5 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code.
6 |
7 | ## Bonus Materials
8 |
9 | - [02_bonus_efficient-multihead-attention](02_bonus_efficient-multihead-attention) implements and compares different implementation variants of multihead-attention
10 | - [03_understanding-buffers](03_understanding-buffers) explains the idea behind PyTorch buffers, which are used to implement the causal attention mechanism in chapter 3
--------------------------------------------------------------------------------
/ch05/04_learning_rate_schedulers/README.md:
--------------------------------------------------------------------------------
1 | # Adding Bells and Whistles to the Training Loop
2 |
3 | The main chapter used a relatively simple training function to keep the code readable and fit Chapter 5 within the page limits. Optionally, we can add a linear warm-up, a cosine decay schedule, and gradient clipping to improve the training stability and convergence.
4 |
5 | You can find the code for this more sophisticated training function in [Appendix D: Adding Bells and Whistles to the Training Loop](../../appendix-D/01_main-chapter-code/appendix-D.ipynb).
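For orientation, a minimal sketch of those three ingredients (assumed variable names and a dummy model, not the appendix's exact code) might look like this:

```python
# Sketch: linear warm-up, cosine decay, and gradient clipping around a training step.
import math
import torch

model = torch.nn.Linear(10, 1)                      # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0)

peak_lr, min_lr = 5e-4, 1e-5
warmup_steps, total_steps = 20, 200

for step in range(total_steps):
    # 1) Linear warm-up, then 2) cosine decay down to min_lr
    if step < warmup_steps:
        lr = peak_lr * (step + 1) / warmup_steps
    else:
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        lr = min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr

    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()  # dummy loss
    loss.backward()

    # 3) Gradient clipping to stabilize training
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```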
--------------------------------------------------------------------------------
/setup/02_installing-python-libraries/tests.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 |
6 | # File for internal use (unit tests)
7 |
8 | from python_environment_check import main
9 |
10 |
11 | def test_main(capsys):
12 | main()
13 | captured = capsys.readouterr()
14 | assert "FAIL" not in captured.out
15 |
--------------------------------------------------------------------------------
/ch06/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 6: Finetuning for Classification
2 |
3 |
4 | ## Main Chapter Code
5 |
6 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code
7 |
8 | ## Bonus Materials
9 |
10 | - [02_bonus_additional-experiments](02_bonus_additional-experiments) includes additional experiments (e.g., training on the last vs. the first token, extending the input length, etc.)
11 | - [03_bonus_imdb-classification](03_bonus_imdb-classification) compares the LLM from chapter 6 with other models on a 50k IMDB movie review sentiment classification dataset
--------------------------------------------------------------------------------
/setup/03_optional-docker-environment/.devcontainer/devcontainer.json:
--------------------------------------------------------------------------------
1 | {
2 | "name": "LLMs From Scratch",
3 | "build": {
4 | "context": "..",
5 | "dockerfile": "Dockerfile"
6 | },
7 | "runArgs": ["--runtime=nvidia", "--gpus=all"],
8 | "customizations": {
9 | "vscode": {
10 | "extensions": [
11 | "ms-python.python",
12 | "ms-azuretools.vscode-docker",
13 | "ms-toolsai.jupyter",
14 | "yahyabatulu.vscode-markdown-alert",
15 | "tomoki1207.pdf",
16 | "mechatroner.rainbow-csv"
17 | ]
18 | }
19 | }
20 | }
--------------------------------------------------------------------------------
/.github/workflows/pep8-linter.yml:
--------------------------------------------------------------------------------
1 | name: PEP8 Style checks
2 |
3 | on:
4 | push:
5 | branches: [ main ]
6 | pull_request:
7 | branches: [ main ]
8 |
9 | jobs:
10 | flake8:
11 | runs-on: ubuntu-latest
12 | steps:
13 | - uses: actions/checkout@v4
14 | - name: Set up Python
15 | uses: actions/setup-python@v5
16 | with:
17 | python-version: '3.10'
18 | - name: Install flake8
19 | run: pip install flake8
20 | - name: Run flake8 with exceptions
21 | run: flake8 . --max-line-length=140 --ignore=W504,E402,E731,C406,E741,E722,E226
22 |
--------------------------------------------------------------------------------
/ch04/02_performance-analysis/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 4: Implementing a GPT Model from Scratch To Generate Text
2 |
3 | - [flops-analysis.ipynb](flops-analysis.ipynb) analyzes the floating point operations (FLOPs) of the GPT model(s) implemented in the main chapter (a usage sketch follows below).
4 | - [previous_chapters.py](previous_chapters.py) is a Python module containing the `GPTModel` code we implemented in chapter 4 and other code implemented in previous chapters, which we import in the analysis notebook.
5 | - `requirements-extra.txt` lists the additional Python libraries that need to be installed (via `pip install -r requirements-extra.txt`).
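As a rough sketch of how `thop` can be used for such an analysis (config values mirror the GPT-2 124M settings from the main chapter and are assumptions here, not the notebook's exact code):

```python
# Sketch: counting MACs/FLOPs of the GPT model with thop.
import torch
from thop import profile
from previous_chapters import GPTModel

BASE_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.0,
    "qkv_bias": False,
}

model = GPTModel(BASE_CONFIG)
input_ids = torch.randint(0, BASE_CONFIG["vocab_size"], (2, BASE_CONFIG["context_length"]))

# thop reports multiply-accumulate operations (MACs); FLOPs is roughly 2 x MACs
macs, params = profile(model, inputs=(input_ids,), verbose=False)
print(f"MACs: {macs:.2e}, approx. FLOPs: {2 * macs:.2e}, parameters: {params:.2e}")
```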
--------------------------------------------------------------------------------
/ch04/01_main-chapter-code/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 4: Implementing a GPT Model from Scratch To Generate Text
2 |
3 | ### Main Chapter Code
4 |
5 | - [ch04.ipynb](ch04.ipynb) contains all the code as it appears in the chapter
6 | - [previous_chapters.py](previous_chapters.py) is a Python module that contains the `MultiHeadAttention` module from the previous chapter, which we import in [ch04.ipynb](ch04.ipynb) to create the GPT model
7 |
8 | ### Optional Code
9 |
10 | - [gpt.py](gpt.py) is a standalone Python script file with the code that we implemented thus far, including the GPT model we coded in this chapter
11 |
12 |
--------------------------------------------------------------------------------
/ch06/01_main-chapter-code/tests.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 |
6 | # File for internal use (unit tests)
7 |
8 |
9 | import subprocess
10 |
11 |
12 | def test_gpt_class_finetune():
13 | command = ["python", "ch06/01_main-chapter-code/gpt_class_finetune.py", "--test_mode"]
14 |
15 | result = subprocess.run(command, capture_output=True, text=True)
16 | assert result.returncode == 0, f"Script exited with errors: {result.stderr}"
17 |
--------------------------------------------------------------------------------
/ch07/01_main-chapter-code/tests.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 |
6 | # File for internal use (unit tests)
7 |
8 |
9 | import subprocess
10 |
11 |
12 | def test_gpt_class_finetune():
13 | command = ["python", "ch06/01_main-chapter-code/gpt_class_finetune.py", "--test_mode"]
14 |
15 | result = subprocess.run(command, capture_output=True, text=True)
16 | assert result.returncode == 0, f"Script exited with errors: {result.stderr}"
17 |
--------------------------------------------------------------------------------
/.github/workflows/check-spelling-errors.yml:
--------------------------------------------------------------------------------
1 | name: Spell Check
2 |
3 | on:
4 | push:
5 | branches:
6 | - main
7 | pull_request:
8 | branches:
9 | - main
10 |
11 | jobs:
12 | spellcheck:
13 | runs-on: ubuntu-latest
14 |
15 | steps:
16 | - uses: actions/checkout@v4
17 |
18 | - name: Set up Python
19 | uses: actions/setup-python@v5
20 | with:
21 | python-version: '3.10'
22 |
23 | - name: Install codespell
24 | run: |
25 | python -m pip install --upgrade pip
26 | pip install codespell
27 |
28 | - name: Run codespell
29 | run: |
30 | codespell -L "ocassion,occassion,ot,te,tje" **/*.{txt,md,py,ipynb}
31 |
--------------------------------------------------------------------------------
/ch02/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 2: Working with Text Data
2 |
3 |
4 | ## Main Chapter Code
5 |
6 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code and exercise solutions
7 |
8 | ## Bonus Materials
9 |
10 | - [02_bonus_bytepair-encoder](02_bonus_bytepair-encoder) contains optional code to benchmark different byte pair encoder implementations
11 |
12 | - [03_bonus_embedding-vs-matmul](03_bonus_embedding-vs-matmul) contains optional (bonus) code to explain that embedding layers and fully connected layers applied to one-hot encoded vectors are equivalent.
13 |
14 | - [04_bonus_dataloader-intuition](04_bonus_dataloader-intuition) contains optional (bonus) code to explain the data loader more intuitively with simple numbers rather than text.
15 |
--------------------------------------------------------------------------------
/ch07/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 7: Finetuning to Follow Instructions
2 |
3 | ## Main Chapter Code
4 |
5 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code and exercise solutions
6 |
7 | ## Bonus Materials
8 |
9 | - [02_dataset-utilities](02_dataset-utilities) contains utility code that can be used for preparing an instruction dataset
10 |
11 | - [03_model-evaluation](03_model-evaluation) contains utility code for evaluating instruction responses using a local Llama 3 model and the GPT-4 API
12 |
13 | - [04_preference-tuning-with-dpo](04_preference-tuning-with-dpo) implements code for preference finetuning with Direct Preference Optimization (DPO)
14 |
15 | - [05_dataset-generation](05_dataset-generation) contains code to generate synthetic datasets for instruction finetuning
16 |
--------------------------------------------------------------------------------
/ch05/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 5: Pretraining on Unlabeled Data
2 |
3 | ## Main Chapter Code
4 |
5 | - [01_main-chapter-code](01_main-chapter-code) contains the main chapter code
6 |
7 | ## Bonus Materials
8 |
9 | - [02_alternative_weight_loading](02_alternative_weight_loading) contains code to load the GPT model weights from alternative places in case the model weights become unavailable from OpenAI
10 | - [03_bonus_pretraining_on_gutenberg](03_bonus_pretraining_on_gutenberg) contains code to pretrain the LLM longer on the whole corpus of books from Project Gutenberg
11 | - [04_learning_rate_schedulers](04_learning_rate_schedulers) contains code implementing a more sophisticated training function including learning rate schedulers and gradient clipping
12 | - [05_bonus_hparam_tuning](05_bonus_hparam_tuning) contains an optional hyperparameter tuning script
13 |
14 |
--------------------------------------------------------------------------------
/ch06/01_main-chapter-code/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 6: Finetuning for Classification
2 |
3 | ### Main Chapter Code
4 |
5 | - [ch06.ipynb](ch06.ipynb) contains all the code as it appears in the chapter
6 | - [previous_chapters.py](previous_chapters.py) is a Python module that contains the GPT model we coded and trained in previous chapters, alongside many utility functions, which we reuse in this chapter
7 | - [gpt_download.py](gpt_download.py) contains the utility functions for downloading the pretrained GPT model weights
8 | - [exercise-solutions.ipynb](exercise-solutions.ipynb) contains the exercise solutions for this chapter
9 |
10 | ### Optional Code
11 |
12 | - [load-finetuned-model.ipynb](load-finetuned-model.ipynb) is a standalone Jupyter notebook to load the finetuned model we created in this chapter
13 |
14 |
15 | - [gpt_class_finetune.py](gpt_class_finetune.py) is a standalone Python script file with the code that we implemented in [ch06.ipynb](ch06.ipynb) to finetune the GPT model (you can think of it as a chapter summary)
16 |
17 |
--------------------------------------------------------------------------------
/ch05/03_bonus_pretraining_on_gutenberg/tests.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 |
6 | # File for internal use (unit tests)
7 |
8 | from pathlib import Path
9 | import os
10 | import subprocess
11 |
12 |
13 | def test_pretraining():
14 |
15 | sequence = "a b c d"
16 | repetitions = 1000
17 | content = sequence * repetitions
18 |
19 | folder_path = Path("gutenberg") / "data"
20 | file_name = "repeated_sequence.txt"
21 |
22 | os.makedirs(folder_path, exist_ok=True)
23 |
24 | with open(folder_path/file_name, "w") as file:
25 | file.write(content)
26 |
27 | result = subprocess.run(
28 | ["python", "pretraining_simple.py", "--debug", "true"],
29 | capture_output=True, text=True
30 | )
31 | print(result.stdout)
32 | assert "Maximum GPU memory allocated" in result.stdout
33 |
--------------------------------------------------------------------------------
/.github/workflows/check-links.yml:
--------------------------------------------------------------------------------
1 | name: Check hyperlinks
2 |
3 | on:
4 | push:
5 | branches:
6 | - main
7 | pull_request:
8 | branches:
9 | - main
10 |
11 | jobs:
12 | test:
13 | runs-on: ubuntu-latest
14 |
15 | steps:
16 | - uses: actions/checkout@v4
17 |
18 | - name: Set up Python
19 | uses: actions/setup-python@v5
20 | with:
21 | python-version: '3.10'
22 |
23 | - name: Install dependencies
24 | run: |
25 | python -m pip install --upgrade pip
26 | pip install pytest pytest-check-links
27 | # Current version of retry doesn't work well if there are broken non-URL links
28 | # pip install pytest pytest-check-links pytest-retry
29 |
30 | - name: Check links
31 | run: |
32 | pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://arena.lmsys.org"
33 | # pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://arena.lmsys.org" --retries 2 --retry-delay 5
34 |
35 |
--------------------------------------------------------------------------------
/ch05/01_main-chapter-code/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 5: Pretraining on Unlabeled Data
2 |
3 | ### Main Chapter Code
4 |
5 | - [ch05.ipynb](ch05.ipynb) contains all the code as it appears in the chapter
6 | - [previous_chapters.py](previous_chapters.py) is a Python module that contains the `MultiHeadAttention` module and `GPTModel` class from the previous chapters, which we import in [ch05.ipynb](ch05.ipynb) to pretrain the GPT model
7 | - [gpt_download.py](gpt_download.py) contains the utility functions for downloading the pretrained GPT model weights
8 | - [exercise-solutions.ipynb](exercise-solutions.ipynb) contains the exercise solutions for this chapter
9 |
10 | ### Optional Code
11 |
12 | - [gpt_train.py](gpt_train.py) is a standalone Python script file with the code that we implemented in [ch05.ipynb](ch05.ipynb) to train the GPT model (you can think of it as a code file summarizing this chapter)
13 | - [gpt_generate.py](gpt_generate.py) is a standalone Python script file with the code that we implemented in [ch05.ipynb](ch05.ipynb) to load and use the pretrained model weights from OpenAI
14 |
15 |
--------------------------------------------------------------------------------
/ch07/03_model-evaluation/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 7: Finetuning to Follow Instructions
2 |
3 | This folder contains utility code that can be used for model evaluation.
4 |
5 |
6 |
7 |
8 | ## Evaluating Instruction Responses Using the OpenAI API
9 |
10 |
11 | - The [llm-instruction-eval-openai.ipynb](llm-instruction-eval-openai.ipynb) notebook uses OpenAI's GPT-4 to evaluate responses generated by instruction finetuned models. It works with a JSON file in the following format:
12 |
13 | ```python
14 | {
15 | "instruction": "What is the atomic number of helium?",
16 | "input": "",
17 | "output": "The atomic number of helium is 2.", # <-- The target given in the test set
18 | "model 1 response": "\nThe atomic number of helium is 2.0.", # <-- Response by an LLM
19 | "model 2 response": "\nThe atomic number of helium is 3." # <-- Response by a 2nd LLM
20 | },
21 | ```
22 |
23 |
24 | ## Evaluating Instruction Responses Locally Using Ollama
25 |
26 | - The [llm-instruction-eval-ollama.ipynb](llm-instruction-eval-ollama.ipynb) notebook offers an alternative to the one above, utilizing a locally downloaded Llama 3 model via Ollama.
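
Both notebooks save the resulting scores as plain JSON lists of integers (see the [scores](scores) folder). A minimal sketch for summarizing them (not the notebooks' exact code), run from within this folder:

```python
# Sketch: loading the saved score lists and printing their averages.
import json
from pathlib import Path

for path in sorted(Path("scores").glob("*.json")):
    with open(path, "r") as f:
        scores = json.load(f)
    print(f"{path.name}: average score {sum(scores) / len(scores):.2f} (n={len(scores)})")
```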
--------------------------------------------------------------------------------
/ch04/01_main-chapter-code/tests.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 |
6 | # File for internal use (unit tests)
7 |
8 | from gpt import main
9 |
10 | expected = """
11 | ==================================================
12 | IN
13 | ==================================================
14 |
15 | Input text: Hello, I am
16 | Encoded input text: [15496, 11, 314, 716]
17 | encoded_tensor.shape: torch.Size([1, 4])
18 |
19 |
20 | ==================================================
21 | OUT
22 | ==================================================
23 |
24 | Output: tensor([[15496, 11, 314, 716, 27018, 24086, 47843, 30961, 42348, 7267,
25 | 49706, 43231, 47062, 34657]])
26 | Output length: 14
27 | Output text: Hello, I am Featureiman Byeswickattribute argue logger Normandy Compton analogous
28 | """
29 |
30 |
31 | def test_main(capsys):
32 | main()
33 | captured = capsys.readouterr()
34 |
35 | # Normalize line endings and strip trailing whitespace from each line
36 | normalized_expected = '\n'.join(line.rstrip() for line in expected.splitlines())
37 | normalized_output = '\n'.join(line.rstrip() for line in captured.out.splitlines())
38 |
39 | # Compare normalized strings
40 | assert normalized_output == normalized_expected
41 |
--------------------------------------------------------------------------------
/.github/workflows/basic-tests-linux.yml:
--------------------------------------------------------------------------------
1 | name: Code tests (Linux)
2 |
3 | on:
4 | push:
5 | branches: [ main ]
6 | paths:
7 | - '**/*.py' # Run workflow for changes in Python files
8 | - '**/*.ipynb'
9 | - '**/*.yaml'
10 | - '**/*.yml'
11 | - '**/*.sh'
12 | pull_request:
13 | branches: [ main ]
14 | paths:
15 | - '**/*.py'
16 | - '**/*.ipynb'
17 | - '**/*.yaml'
18 | - '**/*.yml'
19 | - '**/*.sh'
20 |
21 | jobs:
22 | test:
23 | runs-on: ubuntu-latest
24 |
25 | steps:
26 | - uses: actions/checkout@v4
27 |
28 | - name: Set up Python
29 | uses: actions/setup-python@v5
30 | with:
31 | python-version: "3.10"
32 |
33 | - name: Install dependencies
34 | run: |
35 | python -m pip install --upgrade pip
36 | pip install pytest nbval
37 | if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
38 |
39 | - name: Test Selected Python Scripts
40 | run: |
41 | pytest setup/02_installing-python-libraries/tests.py
42 | pytest ch04/01_main-chapter-code/tests.py
43 | pytest ch05/01_main-chapter-code/tests.py
44 | pytest ch06/01_main-chapter-code/tests.py
45 |
46 | - name: Validate Selected Jupyter Notebooks
47 | run: |
48 | pytest --nbval ch02/01_main-chapter-code/dataloader.ipynb
49 | pytest --nbval ch03/01_main-chapter-code/multihead-attention.ipynb
50 | pytest --nbval ch02/04_bonus_dataloader-intuition/dataloader-intuition.ipynb
51 |
--------------------------------------------------------------------------------
/.github/workflows/basic-tests-macos.yml:
--------------------------------------------------------------------------------
1 | name: Code tests (macOS)
2 |
3 | on:
4 | push:
5 | branches: [ main ]
6 | paths:
7 | - '**/*.py' # Run workflow for changes in Python files
8 | - '**/*.ipynb'
9 | - '**/*.yaml'
10 | - '**/*.yml'
11 | - '**/*.sh'
12 | pull_request:
13 | branches: [ main ]
14 | paths:
15 | - '**/*.py'
16 | - '**/*.ipynb'
17 | - '**/*.yaml'
18 | - '**/*.yml'
19 | - '**/*.sh'
20 |
21 | jobs:
22 | test:
23 | runs-on: macos-latest
24 |
25 | steps:
26 | - uses: actions/checkout@v4
27 |
28 | - name: Set up Python
29 | uses: actions/setup-python@v5
30 | with:
31 | python-version: "3.10"
32 |
33 | - name: Install dependencies
34 | run: |
35 | python -m pip install --upgrade pip
36 | pip install pytest nbval
37 | if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
38 |
39 | - name: Test Selected Python Scripts
40 | run: |
41 | pytest setup/02_installing-python-libraries/tests.py
42 | pytest ch04/01_main-chapter-code/tests.py
43 | pytest ch05/01_main-chapter-code/tests.py
44 | pytest ch06/01_main-chapter-code/tests.py
45 |
46 | - name: Validate Selected Jupyter Notebooks
47 | run: |
48 | pytest --nbval ch02/01_main-chapter-code/dataloader.ipynb
49 | pytest --nbval ch03/01_main-chapter-code/multihead-attention.ipynb
50 | pytest --nbval ch02/04_bonus_dataloader-intuition/dataloader-intuition.ipynb
51 |
--------------------------------------------------------------------------------
/.github/workflows/basic-tests-old-pytorch.yml:
--------------------------------------------------------------------------------
1 | name: Test PyTorch 2.0 and 2.4
2 |
3 | on:
4 | push:
5 | branches: [ main ]
6 | paths:
7 | - '**/*.py' # Run workflow for changes in Python files
8 | - '**/*.ipynb'
9 | - '**/*.yaml'
10 | - '**/*.yml'
11 | - '**/*.sh'
12 | pull_request:
13 | branches: [ main ]
14 | paths:
15 | - '**/*.py'
16 | - '**/*.ipynb'
17 | - '**/*.yaml'
18 | - '**/*.yml'
19 | - '**/*.sh'
20 |
21 | jobs:
22 | test:
23 | runs-on: ubuntu-latest
24 | strategy:
25 | matrix:
26 | pytorch-version: [ 2.0.1, 2.4.0 ]
27 |
28 | steps:
29 | - uses: actions/checkout@v4
30 |
31 | - name: Set up Python
32 | uses: actions/setup-python@v5
33 | with:
34 | python-version: "3.10"
35 |
36 | - name: Install dependencies
37 | run: |
38 | python -m pip install --upgrade pip
39 | pip install pytest nbval
40 | if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
41 | pip install torch==${{ matrix.pytorch-version }}
42 |
43 | - name: Test Selected Python Scripts
44 | run: |
45 | pytest setup/02_installing-python-libraries/tests.py
46 | pytest ch04/01_main-chapter-code/tests.py
47 | pytest ch05/01_main-chapter-code/tests.py
48 | pytest ch06/01_main-chapter-code/tests.py
49 |
50 | - name: Validate Selected Jupyter Notebooks
51 | run: |
52 | pytest --nbval ch02/01_main-chapter-code/dataloader.ipynb
53 | pytest --nbval ch03/01_main-chapter-code/multihead-attention.ipynb
54 | pytest --nbval ch02/04_bonus_dataloader-intuition/dataloader-intuition.ipynb
55 |
--------------------------------------------------------------------------------
/.github/workflows/basic-tests-windows.yml:
--------------------------------------------------------------------------------
1 | name: Code tests (Windows)
2 |
3 | on:
4 | push:
5 | branches: [ main ]
6 | paths:
7 | - '**/*.py' # Run workflow for changes in Python files
8 | - '**/*.ipynb'
9 | - '**/*.yaml'
10 | - '**/*.yml'
11 | - '**/*.sh'
12 | pull_request:
13 | branches: [ main ]
14 | paths:
15 | - '**/*.py'
16 | - '**/*.ipynb'
17 | - '**/*.yaml'
18 | - '**/*.yml'
19 | - '**/*.sh'
20 |
21 | jobs:
22 | test:
23 | runs-on: windows-latest
24 |
25 | steps:
26 | - name: Checkout Code
27 | uses: actions/checkout@v4
28 |
29 | - name: Set up Python
30 | uses: actions/setup-python@v5
31 | with:
32 | python-version: '3.10'
33 |
34 | - name: Install dependencies
35 | shell: bash
36 | run: |
37 | python -m pip install --upgrade pip
38 | pip install pytest nbval
39 | if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
40 | pip install matplotlib==3.9.0
41 |
42 | - name: Test Selected Python Scripts
43 | shell: bash
44 | run: |
45 | pytest setup/02_installing-python-libraries/tests.py
46 | pytest ch04/01_main-chapter-code/tests.py
47 | pytest ch05/01_main-chapter-code/tests.py
48 | pytest ch06/01_main-chapter-code/tests.py
49 |
50 | - name: Validate Selected Jupyter Notebooks
51 | shell: bash
52 | run: |
53 | pytest --nbval ch02/01_main-chapter-code/dataloader.ipynb
54 | pytest --nbval ch03/01_main-chapter-code/multihead-attention.ipynb
55 | pytest --nbval ch02/04_bonus_dataloader-intuition/dataloader-intuition.ipynb
56 |
--------------------------------------------------------------------------------
/ch03/01_main-chapter-code/small-text-sample.txt:
--------------------------------------------------------------------------------
1 | Once upon a time in a quiet village nestled among rolling hills and whispering forests, there lived a young girl named Elara. Elara was known for her boundless curiosity and her love for the stars. Every night, she would climb to the highest hill near her home to gaze at the glittering sky, dreaming of distant worlds and galaxies.
2 |
3 | In the heart of the village, there was an ancient library, tended by an old, wise librarian named Mr. Bramwell. This library was a treasure trove of books on every subject, but most importantly, it housed a collection of old star maps and celestial guides. Elara, fascinated by these books, spent countless hours with Mr. Bramwell, learning about constellations, planets, and the mysteries of the universe.
4 |
5 | One evening, while studying an old star map, Elara noticed a small, uncharted star that twinkled differently. She shared this discovery with Mr. Bramwell, who was equally intrigued. They decided to observe this star every night, noting its unique patterns and movements. This small, mysterious star, which they named "Elara's Star," became the center of their nightly adventures.
6 |
7 | As days turned into weeks, the villagers began to take notice of Elara's star. The uncharted star brought the community together, with people of all ages joining Elara and Mr. Bramwell on the hill each night to gaze at the sky. The nightly gatherings turned into a festival of stars, where stories were shared, friendships were formed, and the mysteries of the cosmos were contemplated.
8 |
9 | The story of Elara and her star spread far and wide, attracting astronomers and dreamers from distant lands. The once quiet village became a beacon of wonder, a place where the sky seemed a little closer and the stars a bit friendlier. Elara's curiosity had not only unveiled a hidden star but had also brought her community together, reminding everyone that sometimes, the most extraordinary discoveries are waiting just above us, in the starlit sky.
--------------------------------------------------------------------------------
/setup/02_installing-python-libraries/python_environment_check.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "c31e08b0-f551-4d67-b95e-41f49de3b392",
6 | "metadata": {},
7 | "source": [
8 | "\n",
9 |     "Supplementary code for \"Build a Large Language Model From Scratch\": https://www.manning.com/books/build-a-large-language-model-from-scratch by Sebastian Raschka\n",
10 | "Code repository: https://github.com/rasbt/LLMs-from-scratch\n",
11 | ""
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 1,
17 | "id": "67f6f7ed-b67d-465b-bf6f-a99b0d996930",
18 | "metadata": {},
19 | "outputs": [
20 | {
21 | "name": "stdout",
22 | "output_type": "stream",
23 | "text": [
24 | "[OK] Your Python version is 3.10.12\n",
25 | "[OK] numpy 1.26.0\n",
26 | "[OK] matplotlib 3.8.2\n",
27 | "[OK] jupyterlab 4.0.6\n",
28 | "[OK] tensorflow 2.15.0\n",
29 | "[OK] torch 2.2.1\n",
30 | "[OK] tqdm 4.66.1\n",
31 | "[OK] tiktoken 0.5.1\n"
32 | ]
33 | }
34 | ],
35 | "source": [
36 | "from python_environment_check import check_packages, get_requirements_dict\n",
37 | "\n",
38 | "d = get_requirements_dict()\n",
39 | "check_packages(d)"
40 | ]
41 | }
42 | ],
43 | "metadata": {
44 | "kernelspec": {
45 | "display_name": "Python 3 (ipykernel)",
46 | "language": "python",
47 | "name": "python3"
48 | },
49 | "language_info": {
50 | "codemirror_mode": {
51 | "name": "ipython",
52 | "version": 3
53 | },
54 | "file_extension": ".py",
55 | "mimetype": "text/x-python",
56 | "name": "python",
57 | "nbconvert_exporter": "python",
58 | "pygments_lexer": "ipython3",
59 | "version": "3.10.6"
60 | }
61 | },
62 | "nbformat": 4,
63 | "nbformat_minor": 5
64 | }
65 |
--------------------------------------------------------------------------------
/ch07/02_dataset-utilities/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 7: Finetuning to Follow Instructions
2 |
3 | This folder contains utility code that can be used for preparing an instruction dataset.
4 |
5 | Install the additional package requirements via:
6 |
7 | ```bash
8 | pip install -r requirements-extra.txt
9 | ```
10 |
11 |
12 |
13 |
14 |
15 | ### Finding Near Duplicates
16 |
17 | The `find-near-duplicates.py` script can be used to identify duplicates and near-duplicates in an instruction dataset. For example:
18 |
19 |
20 |
21 | ```bash
22 | python find-near-duplicates.py --json_file instruction-examples.json
23 | ```
24 |
25 | ```
26 | scikit-learn version: 1.3.1
27 |
28 |
29 | ==================================================
30 | Searching 'instruction' for duplicates ...
31 | ==================================================
32 | Duplicate pair found with similarity 0.94:
33 | 1. Edit the following sentence to make it more formal.
34 | 2. Edit the sentence to make it more formal.
35 |
36 | Duplicate pair found with similarity 1.00:
37 | 1. Name a dwarf planet in our solar system.
38 | 2. Name a dwarf planet in our solar system.
39 |
40 | Duplicate pair found with similarity 0.91:
41 | 1. Change the sentences from active voice to passive voice.
42 | 2. Change the sentence from passive to active voice.
43 |
44 |
45 |
46 | ==================================================
47 | Searching 'input' for duplicates ...
48 | ==================================================
49 | No duplicates found
50 |
51 |
52 | ==================================================
53 | Searching 'output' for duplicates ...
54 | ==================================================
55 | Duplicate pair found with similarity 1.00:
56 | 1. One dwarf planet in our solar system is Pluto.
57 | 2. One dwarf planet in our solar system is Pluto.
58 |
59 |
60 | ```
61 |
62 |
63 | You can use the `--threshold` setting with a value between 0 and 1 to decrease or increase the sensitivity.
64 | The default threshold is 0.9.
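
Under the hood, the idea is text-similarity search with scikit-learn. A minimal sketch of the general approach (illustrative only, not the script's actual implementation), using entries from the example output above:

```python
# Sketch: flagging near-duplicate instruction entries via TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

entries = [
    "Edit the following sentence to make it more formal.",
    "Edit the sentence to make it more formal.",
    "Name a dwarf planet in our solar system.",
]
threshold = 0.9  # corresponds to the --threshold setting described above

tfidf = TfidfVectorizer().fit_transform(entries)
similarity = cosine_similarity(tfidf)
print(similarity.round(2))

for i in range(len(entries)):
    for j in range(i + 1, len(entries)):
        if similarity[i, j] >= threshold:
            print(f"Duplicate pair found with similarity {similarity[i, j]:.2f}:")
            print(f"1. {entries[i]}\n2. {entries[j]}\n")
```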
65 |
66 |
67 |
68 |
69 | ## Creating Passive Voice Entries
70 |
71 | - The [create-passive-voice-entries.ipynb](create-passive-voice-entries.ipynb) notebook uses OpenAI's GPT-4 to create "passive voice" entries for an instruction dataset, as shown in the example below
72 |
73 | ```python
74 | {
75 | 'instruction': 'Identify the verb in the following sentence',
76 | 'input': 'The cat sleeps on the couch.',
77 | 'output': 'The verb in the sentence is "sleeps."',
78 | 'output_2': 'The sentence is "sleeps."' # <---- Newly created entry
79 | }
80 | ```
81 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug-report.yaml:
--------------------------------------------------------------------------------
1 | name: Bug Report
2 | description: Report errors related to the book content or code
3 | title: "Description"
4 | labels: [bug]
5 | assignees: rasbt
6 | body:
7 | - type: markdown
8 | attributes:
9 | value: |
10 | Thank you for taking the time to report an issue. Please fill out the details below to help resolve it.
11 |
12 | - type: textarea
13 | id: bug_description
14 | attributes:
15 | label: Bug description
16 | description: A description of the issue.
17 | placeholder: |
18 | Please provide a description of what the bug or issue is.
19 | validations:
20 | required: true
21 |
22 | - type: dropdown
23 | id: operating_system
24 | attributes:
25 | label: What operating system are you using?
26 | description: If applicable, please select the operating system where you experienced this issue.
27 | options:
28 | - "Unknown"
29 | - "macOS"
30 | - "Linux"
31 | - "Windows"
32 | validations:
33 | required: False
34 |
35 | - type: dropdown
36 | id: compute_environment
37 | attributes:
38 | label: Where do you run your code?
39 | description: Please select the computing environment where you ran this code.
40 | options:
41 | - "Local (laptop, desktop)"
42 | - "Lightning AI Studio"
43 | - "Google Colab"
44 | - "Other cloud environment (AWS, Azure, GCP)"
45 | validations:
46 | required: False
47 |
48 | - type: textarea
49 | id: environment
50 | attributes:
51 | label: Environment
52 | description: |
53 | Please provide details about your Python environment via the environment collection script or notebook located at
54 | https://github.com/rasbt/LLMs-from-scratch/tree/main/setup/02_installing-python-libraries.
55 | For your convenience, you can download and run the script from your terminal as follows:
56 |
57 | ```bash
58 | curl --ssl-no-revoke -O https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/setup/02_installing-python-libraries/python_environment_check.py \
59 | -O https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/requirements.txt
60 |
61 | python python_environment_check.py
62 | ```
63 |
64 | The script will print your Python environment information in the following format
65 | ```console
66 | [OK] Your Python version is 3.11.4
67 | [OK] torch 2.3.1
68 | [OK] jupyterlab 4.2.2
69 | [OK] tiktoken 0.7.0
70 | [OK] matplotlib 3.9.0
71 | [OK] numpy 1.26.4
72 | [OK] tensorflow 2.16.1
73 | [OK] tqdm 4.66.4
74 | [OK] pandas 2.2.2
75 | [OK] psutil 5.9.8
76 | ```
77 | You can simply copy and paste the outputs of this script below.
78 | value: |
79 | ```
80 |
81 |
82 |
83 | ```
84 | validations:
85 | required: false
86 |
--------------------------------------------------------------------------------
/ch06/03_bonus_imdb-classification/train_sklearn_logreg.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 |
6 | import pandas as pd
7 | from sklearn.feature_extraction.text import CountVectorizer
8 | from sklearn.linear_model import LogisticRegression
9 | from sklearn.metrics import accuracy_score
10 | # from sklearn.metrics import balanced_accuracy_score
11 | from sklearn.dummy import DummyClassifier
12 |
13 |
14 | def load_dataframes():
15 | df_train = pd.read_csv("train.csv")
16 | df_val = pd.read_csv("validation.csv")
17 | df_test = pd.read_csv("test.csv")
18 |
19 | return df_train, df_val, df_test
20 |
21 |
22 | def eval(model, X_train, y_train, X_val, y_val, X_test, y_test):
23 | # Making predictions
24 | y_pred_train = model.predict(X_train)
25 | y_pred_val = model.predict(X_val)
26 | y_pred_test = model.predict(X_test)
27 |
28 | # Calculating accuracy and balanced accuracy
29 | accuracy_train = accuracy_score(y_train, y_pred_train)
30 | # balanced_accuracy_train = balanced_accuracy_score(y_train, y_pred_train)
31 |
32 | accuracy_val = accuracy_score(y_val, y_pred_val)
33 | # balanced_accuracy_val = balanced_accuracy_score(y_val, y_pred_val)
34 |
35 | accuracy_test = accuracy_score(y_test, y_pred_test)
36 | # balanced_accuracy_test = balanced_accuracy_score(y_test, y_pred_test)
37 |
38 | # Printing the results
39 | print(f"Training Accuracy: {accuracy_train*100:.2f}%")
40 | print(f"Validation Accuracy: {accuracy_val*100:.2f}%")
41 | print(f"Test Accuracy: {accuracy_test*100:.2f}%")
42 |
43 | # print(f"\nTraining Balanced Accuracy: {balanced_accuracy_train*100:.2f}%")
44 | # print(f"Validation Balanced Accuracy: {balanced_accuracy_val*100:.2f}%")
45 | # print(f"Test Balanced Accuracy: {balanced_accuracy_test*100:.2f}%")
46 |
47 |
48 | if __name__ == "__main__":
49 | df_train, df_val, df_test = load_dataframes()
50 |
51 | #########################################
52 | # Convert text into bag-of-words model
53 | vectorizer = CountVectorizer()
54 | #########################################
55 |
56 | X_train = vectorizer.fit_transform(df_train["text"])
57 | X_val = vectorizer.transform(df_val["text"])
58 | X_test = vectorizer.transform(df_test["text"])
59 | y_train, y_val, y_test = df_train["label"], df_val["label"], df_test["label"]
60 |
61 | #####################################
62 | # Model training and evaluation
63 | #####################################
64 |
65 | # Create a dummy classifier with the strategy to predict the most frequent class
66 | dummy_clf = DummyClassifier(strategy="most_frequent")
67 | dummy_clf.fit(X_train, y_train)
68 |
69 | print("Dummy classifier:")
70 | eval(dummy_clf, X_train, y_train, X_val, y_val, X_test, y_test)
71 |
72 | print("\n\nLogistic regression classifier:")
73 | model = LogisticRegression(max_iter=1000)
74 | model.fit(X_train, y_train)
75 | eval(model, X_train, y_train, X_val, y_val, X_test, y_test)
76 |
--------------------------------------------------------------------------------
/setup/02_installing-python-libraries/README.md:
--------------------------------------------------------------------------------
1 | # Installing Python Packages and Libraries Used In This Book
2 |
3 | This document provides more information on double-checking your installed Python version and packages. (Please see the [../01_optional-python-setup-preferences](../01_optional-python-setup-preferences) folder for more information on installing Python and Python packages.)
4 |
5 | I used the libraries listed [here](https://github.com/rasbt/LLMs-from-scratch/blob/main/requirements.txt) for this book. Newer versions of these libraries are likely compatible as well. However, if you experience any problems with the code, you can try these library versions as a fallback.
6 |
7 | To install these requirements most conveniently, you can use the `requirements.txt` file in the root directory for this code repository and execute the following command:
8 |
9 | ```bash
10 | pip install -r requirements.txt
11 | ```
12 |
13 | Alternatively, you can install the requirements directly from the GitHub URL as follows:
14 |
15 | ```bash
16 | pip install -r https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/requirements.txt
17 | ```
18 |
19 |
20 | Then, after completing the installation, please check that all the packages are installed and up to date using
21 |
22 | ```bash
23 | python python_environment_check.py
24 | ```
25 |
26 |
27 |
28 | It's also recommended to check the versions in JupyterLab by running the `python_environment_check.ipynb` in this directory, which should ideally give you the same results as above.
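If you prefer, you can also run the same check programmatically from a notebook cell; the following minimal sketch assumes the notebook is opened in this directory, next to `python_environment_check.py`:

```python
# Runs the package version check defined in python_environment_check.py
from python_environment_check import main

main()
```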
29 |
30 |
31 |
32 | If you see the following issues, it's likely that your JupyterLab instance is connected to the wrong conda environment:
33 |
34 |
35 |
36 | In this case, you can use `watermark` with the `--conda` flag to check whether you opened the JupyterLab instance in the right conda environment:
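For example, with the `watermark` package installed in the environment, you can run the following in a notebook cell (a minimal sketch, not from the book):

```python
# Load the watermark extension and print the name of the active conda environment
%load_ext watermark
%watermark --conda
```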
37 |
38 |
39 |
40 |
41 |
42 |
43 |
44 |
45 | ## Installing PyTorch
46 |
47 | PyTorch can be installed just like any other Python library or package using pip. For example:
48 |
49 | ```bash
50 | pip install torch==2.0.1
51 | ```
52 |
53 | However, since PyTorch is a comprehensive library featuring CPU- and GPU-compatible code, the installation may require additional settings and explanation (see section *A.1.3 Installing PyTorch* in the book for more information).
54 |
55 | It's also highly recommended to consult the installation guide on the official PyTorch website at [https://pytorch.org](https://pytorch.org).
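A quick way to verify the PyTorch installation, and whether a GPU is visible to it, is the following snippet (this is only a sanity check and not part of the book's code):

```python
import torch

print(torch.__version__)          # the installed PyTorch version
print(torch.cuda.is_available())  # True if a CUDA-compatible GPU can be used
```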
56 |
57 |
58 |
59 |
60 |
61 | ---
62 |
63 |
64 |
65 |
66 | Any questions? Please feel free to reach out in the [Discussion Forum](https://github.com/rasbt/LLMs-from-scratch/discussions).
67 |
--------------------------------------------------------------------------------
/ch05/01_main-chapter-code/tests.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 |
6 | # File for internal use (unit tests)
7 |
8 | import pytest
9 | from gpt_train import main
10 | import http.client
11 | from urllib.parse import urlparse
12 |
13 |
14 | @pytest.fixture
15 | def gpt_config():
16 | return {
17 | "vocab_size": 50257,
18 | "context_length": 12, # small for testing efficiency
19 | "emb_dim": 32, # small for testing efficiency
20 | "n_heads": 4, # small for testing efficiency
21 | "n_layers": 2, # small for testing efficiency
22 | "drop_rate": 0.1,
23 | "qkv_bias": False
24 | }
25 |
26 |
27 | @pytest.fixture
28 | def other_settings():
29 | return {
30 | "learning_rate": 5e-4,
31 | "num_epochs": 1, # small for testing efficiency
32 | "batch_size": 2,
33 | "weight_decay": 0.1
34 | }
35 |
36 |
37 | def test_main(gpt_config, other_settings):
38 | train_losses, val_losses, tokens_seen, model = main(gpt_config, other_settings)
39 |
40 | assert len(train_losses) == 39, "Unexpected number of training losses"
41 | assert len(val_losses) == 39, "Unexpected number of validation losses"
42 | assert len(tokens_seen) == 39, "Unexpected number of tokens seen"
43 |
44 |
45 | def check_file_size(url, expected_size):
46 | parsed_url = urlparse(url)
47 | if parsed_url.scheme == "https":
48 | conn = http.client.HTTPSConnection(parsed_url.netloc)
49 | else:
50 | conn = http.client.HTTPConnection(parsed_url.netloc)
51 |
52 | conn.request("HEAD", parsed_url.path)
53 | response = conn.getresponse()
54 | if response.status != 200:
55 | return False, f"{url} not accessible"
56 | size = response.getheader("Content-Length")
57 | if size is None:
58 | return False, "Content-Length header is missing"
59 | size = int(size)
60 | if size != expected_size:
61 | return False, f"{url} file has expected size {expected_size}, but got {size}"
62 | return True, f"{url} file size is correct"
63 |
64 |
65 | def test_model_files():
66 | base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models"
67 |
68 | model_size = "124M"
69 | files = {
70 | "checkpoint": 77,
71 | "encoder.json": 1042301,
72 | "hparams.json": 90,
73 | "model.ckpt.data-00000-of-00001": 497759232,
74 | "model.ckpt.index": 5215,
75 | "model.ckpt.meta": 471155,
76 | "vocab.bpe": 456318
77 | }
78 |
79 | for file_name, expected_size in files.items():
80 | url = f"{base_url}/{model_size}/{file_name}"
81 | valid, message = check_file_size(url, expected_size)
82 | assert valid, message
83 |
84 | model_size = "355M"
85 | files = {
86 | "checkpoint": 77,
87 | "encoder.json": 1042301,
88 | "hparams.json": 91,
89 | "model.ckpt.data-00000-of-00001": 1419292672,
90 | "model.ckpt.index": 10399,
91 | "model.ckpt.meta": 926519,
92 | "vocab.bpe": 456318
93 | }
94 |
95 | for file_name, expected_size in files.items():
96 | url = f"{base_url}/{model_size}/{file_name}"
97 | valid, message = check_file_size(url, expected_size)
98 | assert valid, message
99 |
--------------------------------------------------------------------------------
/ch06/03_bonus_imdb-classification/download_prepare_dataset.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 |
6 | import os
7 | import sys
8 | import tarfile
9 | import time
10 | import urllib.request
11 | import pandas as pd
12 |
13 |
14 | def reporthook(count, block_size, total_size):
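    # Progress callback for urllib.request.urlretrieve: prints percent downloaded, size, speed, and elapsed time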
15 | global start_time
16 | if count == 0:
17 | start_time = time.time()
18 | else:
19 | duration = time.time() - start_time
20 | progress_size = int(count * block_size)
21 | percent = count * block_size * 100 / total_size
22 |
23 |         speed = progress_size / (1024 ** 2 * duration) if duration else 0  # MB/s
24 | sys.stdout.write(
25 | f"\r{int(percent)}% | {progress_size / (1024**2):.2f} MB "
26 | f"| {speed:.2f} MB/s | {duration:.2f} sec elapsed"
27 | )
28 | sys.stdout.flush()
29 |
30 |
31 | def download_and_extract_dataset(dataset_url, target_file, directory):
32 | if not os.path.exists(directory):
33 | if os.path.exists(target_file):
34 | os.remove(target_file)
35 | urllib.request.urlretrieve(dataset_url, target_file, reporthook)
36 | print("\nExtracting dataset ...")
37 | with tarfile.open(target_file, "r:gz") as tar:
38 | tar.extractall()
39 | else:
40 | print(f"Directory `{directory}` already exists. Skipping download.")
41 |
42 |
43 | def load_dataset_to_dataframe(basepath="aclImdb", labels={"pos": 1, "neg": 0}):
44 | data_frames = [] # List to store each chunk of DataFrame
45 | for subset in ("test", "train"):
46 | for label in ("pos", "neg"):
47 | path = os.path.join(basepath, subset, label)
48 | for file in sorted(os.listdir(path)):
49 | with open(os.path.join(path, file), "r", encoding="utf-8") as infile:
50 | # Create a DataFrame for each file and add it to the list
51 | data_frames.append(pd.DataFrame({"text": [infile.read()], "label": [labels[label]]}))
52 | # Concatenate all DataFrame chunks together
53 | df = pd.concat(data_frames, ignore_index=True)
54 | df = df.sample(frac=1, random_state=123).reset_index(drop=True) # Shuffle the DataFrame
55 | return df
56 |
57 |
58 | def partition_and_save(df, sizes=(35000, 5000, 10000)):
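    # sizes = (num_train, num_validation, num_test) examples; the defaults add up to the 50k reviews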
59 | # Shuffle the DataFrame
60 | df_shuffled = df.sample(frac=1, random_state=123).reset_index(drop=True)
61 |
62 | # Get indices for where to split the data
63 | train_end = sizes[0]
64 | val_end = sizes[0] + sizes[1]
65 |
66 | # Split the DataFrame
67 | train = df_shuffled.iloc[:train_end]
68 | val = df_shuffled.iloc[train_end:val_end]
69 | test = df_shuffled.iloc[val_end:]
70 |
71 | # Save to CSV files
72 | train.to_csv("train.csv", index=False)
73 | val.to_csv("validation.csv", index=False)
74 | test.to_csv("test.csv", index=False)
75 |
76 |
77 | if __name__ == "__main__":
78 | dataset_url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
79 | print("Downloading dataset ...")
80 | download_and_extract_dataset(dataset_url, "aclImdb_v1.tar.gz", "aclImdb")
81 | print("Creating data frames ...")
82 | df = load_dataset_to_dataframe()
83 | print("Partitioning and saving data frames ...")
84 | partition_and_save(df)
85 |
--------------------------------------------------------------------------------
/ch07/01_main-chapter-code/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 7: Finetuning to Follow Instructions
2 |
3 | ### Main Chapter Code
4 |
5 | - [ch07.ipynb](ch07.ipynb) contains all the code as it appears in the chapter
6 | - [previous_chapters.py](previous_chapters.py) is a Python module that contains the GPT model we coded and trained in previous chapters, alongside many utility functions, which we reuse in this chapter
7 | - [gpt_download.py](gpt_download.py) contains the utility functions for downloading the pretrained GPT model weights
8 | - [exercise-solutions.ipynb](exercise-solutions.ipynb) contains the exercise solutions for this chapter
9 |
10 |
11 | ### Optional Code
12 |
13 | - [load-finetuned-model.ipynb](load-finetuned-model.ipynb) is a standalone Jupyter notebook to load the instruction finetuned model we created in this chapter
14 |
15 | - [gpt_instruction_finetuning.py](gpt_instruction_finetuning.py) is a standalone Python script to instruction finetune the model as described in the main chapter (think of it as a chapter summary focused on the finetuning parts)
16 |
17 | Usage:
18 |
19 | ```bash
20 | python gpt_instruction_finetuning.py
21 | ```
22 |
23 | ```
24 | matplotlib version: 3.9.0
25 | tiktoken version: 0.7.0
26 | torch version: 2.3.1
27 | tqdm version: 4.66.4
28 | tensorflow version: 2.16.1
29 | --------------------------------------------------
30 | Training set length: 935
31 | Validation set length: 55
32 | Test set length: 110
33 | --------------------------------------------------
34 | Device: cpu
35 | --------------------------------------------------
36 | File already exists and is up-to-date: gpt2/355M/checkpoint
37 | File already exists and is up-to-date: gpt2/355M/encoder.json
38 | File already exists and is up-to-date: gpt2/355M/hparams.json
39 | File already exists and is up-to-date: gpt2/355M/model.ckpt.data-00000-of-00001
40 | File already exists and is up-to-date: gpt2/355M/model.ckpt.index
41 | File already exists and is up-to-date: gpt2/355M/model.ckpt.meta
42 | File already exists and is up-to-date: gpt2/355M/vocab.bpe
43 | Loaded model: gpt2-medium (355M)
44 | --------------------------------------------------
45 | Initial losses
46 | Training loss: 3.839039182662964
47 | Validation loss: 3.7619192123413088
48 | Ep 1 (Step 000000): Train loss 2.611, Val loss 2.668
49 | Ep 1 (Step 000005): Train loss 1.161, Val loss 1.131
50 | Ep 1 (Step 000010): Train loss 0.939, Val loss 0.973
51 | ...
52 | Training completed in 15.66 minutes.
53 | Plot saved as loss-plot-standalone.pdf
54 | --------------------------------------------------
55 | Generating responses
56 | 100%|█████████████████████████████████████████████████████████| 110/110 [06:57<00:00, 3.80s/it]
57 | Responses saved as instruction-data-with-response-standalone.json
58 | Model saved as gpt2-medium355M-sft-standalone.pth
59 | ```
60 |
61 | - [ollama_evaluate.py](ollama_evaluate.py) is a standalone Python script to evaluate the responses of the finetuned model as described in the main chapter (think of it as a chapter summary focused on the evaluation parts)
62 |
63 | Usage:
64 |
65 | ```bash
66 | python ollama_evaluate.py --file_path instruction-data-with-response-standalone.json
67 | ```
68 |
69 | ```
70 | Ollama running: True
71 | Scoring entries: 100%|███████████████████████████████████████| 110/110 [01:08<00:00, 1.62it/s]
72 | Number of scores: 110 of 110
73 | Average score: 51.75
74 | ```
75 |
76 | - [exercise_experiments.py](exercise_experiments.py) is an optional script that implements the exercise solutions; for more details see [exercise-solutions.ipynb](exercise-solutions.ipynb)
77 |
--------------------------------------------------------------------------------
/ch05/03_bonus_pretraining_on_gutenberg/prepare_dataset.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 |
6 | """
7 | Script that processes the Project Gutenberg files into fewer larger files.
8 | """
9 |
10 | import argparse
11 | import os
12 | import re
13 | from tqdm import tqdm
14 | from gutenberg.src.cleanup import strip_headers
15 |
16 |
17 | def is_english(text, threshold=0.9):
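    # Heuristic: treat the text as primarily English if at least a threshold fraction of its characters are ASCII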
18 | ascii_chars = sum(1 for c in text if ord(c) < 128)
19 | return ascii_chars / len(text) > threshold
20 |
21 |
22 | def combine_files(file_paths, target_dir, max_size_mb=500, separator="<|endoftext|>", fallback_encoding="latin1"):
23 | if not os.path.exists(target_dir):
24 | os.makedirs(target_dir)
25 |
26 | current_content = []
27 | current_size = 0
28 | file_counter = 1
29 |
30 | for file_path in tqdm(file_paths):
31 | try:
32 | with open(file_path, "r", encoding="utf-8") as file:
33 | content = file.read()
34 | except UnicodeDecodeError:
35 | # Attempt to read the file with a fallback encoding
36 | tqdm.write(f"Warning: UnicodeDecodeError encountered. Trying fallback encoding for {file_path}")
37 | with open(file_path, "r", encoding=fallback_encoding) as file:
38 | content = file.read()
39 |
40 | if not is_english(content):
41 | tqdm.write(f"Skipping {file_path} as it does not contain primarily English text.")
42 | continue
43 | content = strip_headers(content)
44 |
45 | # Regular expression to replace multiple blank lines with a single blank line
46 | content = re.sub(r'\n\s*\n', '\n\n', content)
47 | estimated_size = len(content.encode("utf-8"))
48 |
49 | if current_size + estimated_size > max_size_mb * 1024 * 1024:
50 | target_file_path = os.path.join(target_dir, f"combined_{file_counter}.txt")
51 | with open(target_file_path, "w", encoding="utf-8") as target_file:
52 | target_file.write(separator.join(current_content))
53 | file_counter += 1
54 | current_content = [content]
55 | current_size = estimated_size
56 | else:
57 | current_content.append(content)
58 | current_size += estimated_size
59 |
60 | if current_content:
61 | target_file_path = os.path.join(target_dir, f"combined_{file_counter}.txt")
62 | with open(target_file_path, "w", encoding="utf-8") as target_file:
63 | target_file.write(separator.join(current_content))
64 | return file_counter
65 |
66 |
67 | if __name__ == "__main__":
68 |
69 | parser = argparse.ArgumentParser(description="Preprocess and combine text files for pretraining")
70 |
71 | parser.add_argument("--data_dir", type=str, default="gutenberg/data/raw",
72 | help="Directory containing the downloaded raw training data")
73 | parser.add_argument("--max_size_mb", type=int, default=500,
74 | help="The maximum file size for each concatenated file in megabytes")
75 | parser.add_argument("--output_dir", type=str, default="gutenberg_preprocessed",
76 | help="Directory where the preprocessed data will be saved")
77 |
78 | args = parser.parse_args()
79 |
80 | all_files = [os.path.join(path, name) for path, subdirs, files in os.walk(args.data_dir)
81 | for name in files if name.endswith((".txt", ".txt.utf8"))]
82 |
83 | print(f"{len(all_files)} file(s) to process.")
84 | file_counter = combine_files(all_files, args.output_dir, max_size_mb=args.max_size_mb)
85 | print(f"{file_counter} file(s) saved in {os.path.abspath(args.output_dir)}")
86 |
--------------------------------------------------------------------------------
/setup/02_installing-python-libraries/python_environment_check.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 |
6 | from importlib.metadata import PackageNotFoundError
7 | import importlib.metadata
8 | from os.path import dirname, exists, join, realpath
9 | from packaging.version import parse as version_parse
10 | import platform
11 | import sys
12 |
13 | if version_parse(platform.python_version()) < version_parse("3.9"):
14 | print("[FAIL] We recommend Python 3.9 or newer but"
15 | " found version %s" % (sys.version))
16 | else:
17 | print("[OK] Your Python version is %s" % (platform.python_version()))
18 |
19 |
20 | def get_packages(pkgs):
21 | versions = []
22 | for p in pkgs:
23 | try:
24 |             imported = importlib.import_module(p)
25 | try:
26 | version = (getattr(imported, "__version__", None) or
27 | getattr(imported, "version", None) or
28 | getattr(imported, "version_info", None))
29 | if version is None:
30 |                     # If common attributes don't exist, use importlib.metadata
31 | version = importlib.metadata.version(p)
32 | versions.append(version)
33 | except PackageNotFoundError:
34 | # Handle case where package is not installed
35 | versions.append("0.0")
36 | except ImportError:
37 | # Fallback if importlib.import_module fails for unexpected reasons
38 | versions.append("0.0")
39 | return versions
40 |
41 |
42 | def get_requirements_dict():
43 | PROJECT_ROOT = dirname(realpath(__file__))
44 | PROJECT_ROOT_UP_TWO = dirname(dirname(PROJECT_ROOT))
45 | REQUIREMENTS_FILE = join(PROJECT_ROOT_UP_TWO, "requirements.txt")
46 | if not exists(REQUIREMENTS_FILE):
47 | REQUIREMENTS_FILE = join(PROJECT_ROOT, "requirements.txt")
48 |
49 | d = {}
50 | with open(REQUIREMENTS_FILE) as f:
51 | for line in f:
52 | if not line.strip():
53 | continue
54 | if "," in line:
55 | left, right = line.split(",")
56 | lower = right.split("#")[0].strip()
57 | package, _, upper = left.split(" ")
58 | package = package.strip()
59 | _, lower = lower.split(" ")
60 | lower = lower.strip()
61 | upper = upper.strip()
62 | d[package] = (upper, lower)
63 | else:
64 | line = line.split("#")[0].strip()
65 | line = line.split(" ")
66 | line = [ln.strip() for ln in line]
67 | d[line[0]] = line[-1]
68 | return d
69 |
70 |
71 | def check_packages(d):
72 | versions = get_packages(d.keys())
73 |
74 | for (pkg_name, suggested_ver), actual_ver in zip(d.items(), versions):
75 | if isinstance(suggested_ver, tuple):
76 | lower, upper = suggested_ver[0], suggested_ver[1]
77 | else:
78 | lower = suggested_ver
79 | upper = None
80 | if actual_ver == "N/A":
81 | continue
82 | actual_ver = version_parse(actual_ver)
83 | lower = version_parse(lower)
84 | if upper is not None:
85 | upper = version_parse(upper)
86 | if actual_ver < lower and upper is None:
87 | print(f"[FAIL] {pkg_name} {actual_ver}, please upgrade to >= {lower}")
88 | elif actual_ver < lower:
89 | print(f"[FAIL] {pkg_name} {actual_ver}, please upgrade to >= {lower} and < {upper}")
90 | elif upper is not None and actual_ver >= upper:
91 | print(f"[FAIL] {pkg_name} {actual_ver}, please downgrade to >= {lower} and < {upper}")
92 | else:
93 | print(f"[OK] {pkg_name} {actual_ver}")
94 |
95 |
96 | def main():
97 | d = get_requirements_dict()
98 | check_packages(d)
99 |
100 |
101 | if __name__ == "__main__":
102 | main()
103 |
--------------------------------------------------------------------------------
/ch06/03_bonus_imdb-classification/README.md:
--------------------------------------------------------------------------------
1 | # Additional Experiments Classifying the Sentiment of 50k IMDB Movie Reviews
2 |
3 |
4 | ## Step 1: Install Dependencies
5 |
6 | Install the extra dependencies via
7 |
8 | ```bash
9 | pip install -r requirements-extra.txt
10 | ```
11 |
12 |
13 | ## Step 2: Download Dataset
14 |
15 | The scripts in this folder use the 50k movie reviews from IMDb ([dataset source](https://ai.stanford.edu/~amaas/data/sentiment/)) to predict whether a movie review is positive or negative.
16 |
17 | Run the following command to create the `train.csv`, `validation.csv`, and `test.csv` datasets:
18 |
19 | ```bash
20 | python download_prepare_dataset.py
21 | ```
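If you want to sanity-check the resulting splits, the following sketch (not part of the provided scripts) loads them with pandas; it assumes the command above created the CSV files in the current directory, each with a `text` column and a `label` column (1 = positive, 0 = negative):

```python
import pandas as pd

# Load the splits created by download_prepare_dataset.py
df_train = pd.read_csv("train.csv")
df_val = pd.read_csv("validation.csv")
df_test = pd.read_csv("test.csv")

print(df_train.shape, df_val.shape, df_test.shape)
print(df_train["label"].value_counts())
```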
22 |
23 |
24 |
25 | ## Step 3: Run Models
26 |
27 | The 124M GPT-2 model used in the main chapter, starting with pretrained weights, and finetuning all weights:
28 |
29 | ```bash
30 | python train_gpt.py --trainable_layers "all" --num_epochs 1
31 | ```
32 |
33 | ```
34 | Ep 1 (Step 000000): Train loss 3.706, Val loss 3.853
35 | Ep 1 (Step 000050): Train loss 0.682, Val loss 0.706
36 | ...
37 | Ep 1 (Step 004300): Train loss 0.199, Val loss 0.285
38 | Ep 1 (Step 004350): Train loss 0.188, Val loss 0.208
39 | Training accuracy: 95.62% | Validation accuracy: 95.00%
40 | Training completed in 9.48 minutes.
41 |
42 | Evaluating on the full datasets ...
43 |
44 | Training accuracy: 95.64%
45 | Validation accuracy: 92.32%
46 | Test accuracy: 91.88%
47 | ```
48 |
49 |
50 |
51 |
52 | ---
53 |
54 |
55 |
56 | A 340M parameter encoder-style [BERT](https://arxiv.org/abs/1810.04805) model:
57 |
58 | ```bash
59 | python train_bert_hf.py --trainable_layers "all" --num_epochs 1 --model "bert"
60 | ```
61 |
62 | ```
63 | Ep 1 (Step 000000): Train loss 0.848, Val loss 0.775
64 | Ep 1 (Step 000050): Train loss 0.655, Val loss 0.682
65 | ...
66 | Ep 1 (Step 004300): Train loss 0.146, Val loss 0.318
67 | Ep 1 (Step 004350): Train loss 0.204, Val loss 0.217
68 | Training accuracy: 92.50% | Validation accuracy: 88.75%
69 | Training completed in 7.65 minutes.
70 |
71 | Evaluating on the full datasets ...
72 |
73 | Training accuracy: 94.35%
74 | Validation accuracy: 90.74%
75 | Test accuracy: 90.89%
76 | ```
77 |
78 |
79 |
80 | ---
81 |
82 |
83 |
84 | A 66M parameter encoder-style [DistilBERT](https://arxiv.org/abs/1910.01108) model (distilled down from a 340M parameter BERT model), starting from the pretrained weights and finetuning all layers:
85 |
86 |
87 |
88 | ```bash
89 | python train_bert_hf.py --trainable_layers "all" --num_epochs 1 --model "distilbert"
90 | ```
91 |
92 | ```
93 | Ep 1 (Step 000000): Train loss 0.693, Val loss 0.688
94 | Ep 1 (Step 000050): Train loss 0.452, Val loss 0.460
95 | ...
96 | Ep 1 (Step 004300): Train loss 0.179, Val loss 0.272
97 | Ep 1 (Step 004350): Train loss 0.199, Val loss 0.182
98 | Training accuracy: 95.62% | Validation accuracy: 91.25%
99 | Training completed in 4.26 minutes.
100 |
101 | Evaluating on the full datasets ...
102 |
103 | Training accuracy: 95.30%
104 | Validation accuracy: 91.12%
105 | Test accuracy: 91.40%
106 | ```
107 |
108 |
109 | ---
110 |
111 |
112 |
113 | A 355M parameter encoder-style [RoBERTa](https://arxiv.org/abs/1907.11692) model, starting from the pretrained weights and only training the last transformer block plus output layers:
114 |
115 |
116 | ```bash
117 | python train_bert_hf.py --trainable_layers "last_block" --num_epochs 1 --model "roberta"
118 | ```
119 |
120 | ```
121 | Ep 1 (Step 000000): Train loss 0.695, Val loss 0.698
122 | Ep 1 (Step 000050): Train loss 0.670, Val loss 0.690
123 | ...
124 | Ep 1 (Step 004300): Train loss 0.126, Val loss 0.149
125 | Ep 1 (Step 004350): Train loss 0.211, Val loss 0.138
126 | Training accuracy: 92.50% | Validation accuracy: 94.38%
127 | Training completed in 7.20 minutes.
128 |
129 | Evaluating on the full datasets ...
130 |
131 | Training accuracy: 93.44%
132 | Validation accuracy: 93.02%
133 | Test accuracy: 92.95%
134 | ```
135 |
136 |
137 |
138 |
139 | ---
140 |
141 |
142 |
143 | A scikit-learn logistic regression classifier as a baseline:
144 |
145 |
146 | ```bash
147 | python train_sklearn_logreg.py
148 | ```
149 |
150 | ```
151 | Dummy classifier:
152 | Training Accuracy: 50.01%
153 | Validation Accuracy: 50.14%
154 | Test Accuracy: 49.91%
155 |
156 |
157 | Logistic regression classifier:
158 | Training Accuracy: 99.80%
159 | Validation Accuracy: 88.62%
160 | Test Accuracy: 88.85%
161 | ```
162 |
--------------------------------------------------------------------------------
/ch07/01_main-chapter-code/ollama_evaluate.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 | #
6 | # A minimal script for evaluating instruction-finetuned model responses with Ollama, based on the code in chapter 7
7 |
8 | import json
9 | import psutil
10 | from tqdm import tqdm
11 | import urllib.request
12 |
13 |
14 | def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
15 | # Create the data payload as a dictionary
16 | data = {
17 | "model": model,
18 | "messages": [
19 | {"role": "user", "content": prompt}
20 | ],
21 | "options": { # Settings below are required for deterministic responses
22 | "seed": 123,
23 | "temperature": 0,
24 | "num_ctx": 2048
25 | }
26 | }
27 |
28 | # Convert the dictionary to a JSON formatted string and encode it to bytes
29 | payload = json.dumps(data).encode("utf-8")
30 |
31 | # Create a request object, setting the method to POST and adding necessary headers
32 | request = urllib.request.Request(url, data=payload, method="POST")
33 | request.add_header("Content-Type", "application/json")
34 |
35 | # Send the request and capture the response
36 | response_data = ""
37 | with urllib.request.urlopen(request) as response:
38 | # Read and decode the response
39 | while True:
40 | line = response.readline().decode("utf-8")
41 | if not line:
42 | break
43 | response_json = json.loads(line)
44 | response_data += response_json["message"]["content"]
45 |
46 | return response_data
47 |
48 |
49 | def check_if_running(process_name):
50 | running = False
51 | for proc in psutil.process_iter(["name"]):
52 | if process_name in proc.info["name"]:
53 | running = True
54 | break
55 | return running
56 |
57 |
58 | def format_input(entry):
59 | instruction_text = (
60 | f"Below is an instruction that describes a task. "
61 | f"Write a response that appropriately completes the request."
62 | f"\n\n### Instruction:\n{entry['instruction']}"
63 | )
64 |
65 | input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
66 |
67 | return instruction_text + input_text
68 |
69 |
70 | def main(file_path):
71 | ollama_running = check_if_running("ollama")
72 |
73 | if not ollama_running:
74 | raise RuntimeError("Ollama not running. Launch ollama before proceeding.")
75 | print("Ollama running:", check_if_running("ollama"))
76 |
77 | with open(file_path, "r") as file:
78 | test_data = json.load(file)
79 |
80 | model = "llama3"
81 | scores = generate_model_scores(test_data, "model_response", model)
82 | print(f"Number of scores: {len(scores)} of {len(test_data)}")
83 | print(f"Average score: {sum(scores)/len(scores):.2f}\n")
84 |
85 |
86 | def generate_model_scores(json_data, json_key, model="llama3"):
87 | scores = []
88 | for entry in tqdm(json_data, desc="Scoring entries"):
89 | if entry[json_key] == "":
90 | scores.append(0)
91 | else:
92 | prompt = (
93 | f"Given the input `{format_input(entry)}` "
94 | f"and correct output `{entry['output']}`, "
95 | f"score the model response `{entry[json_key]}`"
96 | f" on a scale from 0 to 100, where 100 is the best score. "
97 | f"Respond with the integer number only."
98 | )
99 | score = query_model(prompt, model)
100 | try:
101 | scores.append(int(score))
102 | except ValueError:
103 | print(f"Could not convert score: {score}")
104 | continue
105 |
106 | return scores
107 |
108 |
109 | if __name__ == "__main__":
110 |
111 | import argparse
112 |
113 | parser = argparse.ArgumentParser(
114 | description="Evaluate model responses with ollama"
115 | )
116 | parser.add_argument(
117 | "--file_path",
118 | required=True,
119 | help=(
120 | "The path to the test dataset `.json` file with the"
121 | " `'output'` and `'model_response'` keys"
122 | )
123 | )
124 | args = parser.parse_args()
125 |
126 | main(file_path=args.file_path)
127 |
--------------------------------------------------------------------------------
/ch04/01_main-chapter-code/previous_chapters.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 |
6 | import tiktoken
7 | import torch
8 | import torch.nn as nn
9 | from torch.utils.data import Dataset, DataLoader
10 |
11 |
12 | class GPTDatasetV1(Dataset):
13 | def __init__(self, txt, tokenizer, max_length, stride):
14 | self.input_ids = []
15 | self.target_ids = []
16 |
17 | # Tokenize the entire text
18 | token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
19 |
20 | # Use a sliding window to chunk the book into overlapping sequences of max_length
21 | for i in range(0, len(token_ids) - max_length, stride):
22 | input_chunk = token_ids[i:i + max_length]
23 | target_chunk = token_ids[i + 1: i + max_length + 1]
24 | self.input_ids.append(torch.tensor(input_chunk))
25 | self.target_ids.append(torch.tensor(target_chunk))
26 |
27 | def __len__(self):
28 | return len(self.input_ids)
29 |
30 | def __getitem__(self, idx):
31 | return self.input_ids[idx], self.target_ids[idx]
32 |
33 |
34 | def create_dataloader_v1(txt, batch_size=4, max_length=256,
35 | stride=128, shuffle=True, drop_last=True, num_workers=0):
36 | # Initialize the tokenizer
37 | tokenizer = tiktoken.get_encoding("gpt2")
38 |
39 | # Create dataset
40 | dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
41 |
42 | # Create dataloader
43 | dataloader = DataLoader(
44 | dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
45 |
46 | return dataloader
47 |
48 |
49 | class MultiHeadAttention(nn.Module):
50 | def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
51 | super().__init__()
52 | assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
53 |
54 | self.d_out = d_out
55 | self.num_heads = num_heads
56 | self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
57 |
58 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
59 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
60 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
61 | self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs
62 | self.dropout = nn.Dropout(dropout)
63 | self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))
64 |
65 | def forward(self, x):
66 | b, num_tokens, d_in = x.shape
67 |
68 | keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
69 | queries = self.W_query(x)
70 | values = self.W_value(x)
71 |
72 | # We implicitly split the matrix by adding a `num_heads` dimension
73 | # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
74 | keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
75 | values = values.view(b, num_tokens, self.num_heads, self.head_dim)
76 | queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
77 |
78 | # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
79 | keys = keys.transpose(1, 2)
80 | queries = queries.transpose(1, 2)
81 | values = values.transpose(1, 2)
82 |
83 | # Compute scaled dot-product attention (aka self-attention) with a causal mask
84 | attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head
85 |
86 | # Original mask truncated to the number of tokens and converted to boolean
87 | mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
88 |
89 | # Use the mask to fill attention scores
90 | attn_scores.masked_fill_(mask_bool, -torch.inf)
91 |
92 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
93 | attn_weights = self.dropout(attn_weights)
94 |
95 | # Shape: (b, num_tokens, num_heads, head_dim)
96 | context_vec = (attn_weights @ values).transpose(1, 2)
97 |
98 | # Combine heads, where self.d_out = self.num_heads * self.head_dim
99 | context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
100 | context_vec = self.out_proj(context_vec) # optional projection
101 |
102 | return context_vec
103 |
--------------------------------------------------------------------------------
/setup/01_optional-python-setup-preferences/README.md:
--------------------------------------------------------------------------------
1 | # Python Setup Tips
2 |
3 |
4 |
5 | There are several different ways you can install Python and set up your computing environment. Here, I am illustrating my personal preference.
6 |
7 | (I am using computers running macOS, but this workflow is similar for Linux machines and may work for other operating systems as well.)
8 |
9 |
10 |
11 |
12 |
13 |
14 | ## 1. Download and install Miniforge
15 |
16 | Download miniforge from the GitHub repository [here](https://github.com/conda-forge/miniforge).
17 |
18 |
19 |
20 | Depending on your operating system, this should download either an `.sh` (macOS, Linux) or `.exe` file (Windows).
21 |
22 | For the `.sh` file, open your command line terminal and execute the following command
23 |
24 | ```bash
25 | sh ~/Desktop/Miniforge3-MacOSX-arm64.sh
26 | ```
27 |
28 | where `Desktop/` is the folder where the Miniforge installer was downloaded to. On your computer, you may have to replace it with `Downloads/`.
29 |
30 |
31 |
32 | Next, step through the installer prompts, confirming with "Enter".
33 |
34 |
35 |
36 | If you work with many packages, Conda can be slow because of its thorough but complex dependency resolution process and its handling of large package indexes and metadata. To speed up Conda, you can use the following setting, which switches to `libmamba`, a faster reimplementation of the dependency solver:
37 |
38 | ```
39 | conda config --set solver libmamba
40 | ```
41 |
42 |
43 |
44 |
45 |
46 | ## 2. Create a new virtual environment
47 |
48 | After the installation has completed successfully, I recommend creating a new virtual environment called `LLMs`, which you can do by executing
49 |
50 | ```bash
51 | conda create -n LLMs python=3.10
52 | ```
53 |
54 |
55 |
56 | > Many scientific computing libraries do not immediately support the newest version of Python. Therefore, when installing PyTorch, it's advisable to use a version of Python that is one or two releases older. For instance, if the latest version of Python is 3.13, using Python 3.10 or 3.11 is recommended.
57 |
58 | Next, activate your new virtual environment (you have to do it every time you open a new terminal window or tab):
59 |
60 | ```bash
61 | conda activate LLMs
62 | ```
63 |
64 |
65 |
66 |
67 |
68 |
69 | ## Optional: styling your terminal
70 |
71 | If you want to style your terminal similar to mine so that you can see which virtual environment is active, check out the [Oh My Zsh](https://github.com/ohmyzsh/ohmyzsh) project.
72 |
73 |
74 |
75 |
76 | ## 3. Install new Python libraries
77 |
78 |
79 |
80 | To install new Python libraries, you can now use the `conda` package installer. For example, you can install [JupyterLab](https://jupyter.org/install) and [watermark](https://github.com/rasbt/watermark) as follows:
81 |
82 | ```bash
83 | conda install jupyterlab watermark
84 | ```
85 |
86 |
87 |
88 |
89 |
90 | You can also still use `pip` to install libraries. By default, `pip` should be linked to your new `LLMs` conda environment:
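As a quick sanity check (a minimal sketch, not from the book), you can confirm from Python which interpreter, and therefore which environment, is currently active:

```python
import sys

# This path should point into the LLMs conda environment,
# for example .../miniforge3/envs/LLMs/bin/python
print(sys.executable)
```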
91 |
92 |
93 |
94 |
95 |
96 |
97 | ## 4. Install PyTorch
98 |
99 | PyTorch can be installed just like any other Python library or package using pip. For example:
100 |
101 | ```bash
102 | pip install torch==2.0.1
103 | ```
104 |
105 | However, since PyTorch is a comprehensive library featuring CPU- and GPU-compatible code, the installation may require additional settings and explanation (see section *A.1.3 Installing PyTorch* in the book for more information).
106 |
107 | It's also highly recommended to consult the installation guide on the official PyTorch website at [https://pytorch.org](https://pytorch.org).
108 |
109 |
110 |
111 |
112 |
113 | ---
114 |
115 |
116 |
117 |
118 | Any questions? Please feel free to reach out in the [Discussion Forum](https://github.com/rasbt/LLMs-from-scratch/discussions).
--------------------------------------------------------------------------------
/ch03/02_bonus_efficient-multihead-attention/ch03.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
2 | # Source for "Build a Large Language Model From Scratch"
3 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
4 | # Code: https://github.com/rasbt/LLMs-from-scratch
5 | #
6 | # This file contains the relevant code from chapter 3 that is going to be used
7 | # in forthcoming chapters.
8 |
9 | import torch
10 | import torch.nn as nn
11 |
12 |
13 | class CausalAttention(nn.Module):
14 |
15 | def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
16 | super().__init__()
17 | self.d_out = d_out
18 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
19 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
20 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
21 | self.dropout = nn.Dropout(dropout) # New
22 | self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New
23 |
24 | def forward(self, x):
25 | b, num_tokens, d_in = x.shape # New batch dimension b
26 | keys = self.W_key(x)
27 | queries = self.W_query(x)
28 | values = self.W_value(x)
29 |
30 | attn_scores = queries @ keys.transpose(1, 2) # Changed transpose
31 | attn_scores.masked_fill_( # New, _ ops are in-place
32 | self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
33 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
34 | attn_weights = self.dropout(attn_weights) # New
35 |
36 | context_vec = attn_weights @ values
37 | return context_vec
38 |
39 |
40 | class MultiHeadAttentionWrapper(nn.Module):
41 |
42 | def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
43 | super().__init__()
44 | self.heads = nn.ModuleList(
45 | [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
46 | for _ in range(num_heads)]
47 | )
48 | self.out_proj = nn.Linear(d_out*num_heads, d_out*num_heads)
49 |
50 | def forward(self, x):
51 | context_vec = torch.cat([head(x) for head in self.heads], dim=-1)
52 | return self.out_proj(context_vec)
53 |
54 |
55 | class MultiHeadAttention(nn.Module):
56 | def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
57 | super().__init__()
58 | assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
59 |
60 | self.d_out = d_out
61 | self.num_heads = num_heads
62 | self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
63 |
64 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
65 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
66 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
67 | self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs
68 | self.dropout = nn.Dropout(dropout)
69 | self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))
70 |
71 | def forward(self, x):
72 | b, num_tokens, d_in = x.shape
73 |
74 | keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
75 | queries = self.W_query(x)
76 | values = self.W_value(x)
77 |
78 | # We implicitly split the matrix by adding a `num_heads` dimension
79 | # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
80 | keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
81 | values = values.view(b, num_tokens, self.num_heads, self.head_dim)
82 | queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
83 |
84 | # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
85 | keys = keys.transpose(1, 2)
86 | queries = queries.transpose(1, 2)
87 | values = values.transpose(1, 2)
88 |
89 | # Compute scaled dot-product attention (aka self-attention) with a causal mask
90 | attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head
91 |
92 | # Original mask truncated to the number of tokens and converted to boolean
93 | mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
94 |
95 | # Use the mask to fill attention scores
96 | attn_scores.masked_fill_(mask_bool, -torch.inf)
97 |
98 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
99 | attn_weights = self.dropout(attn_weights)
100 |
101 | # Shape: (b, num_tokens, num_heads, head_dim)
102 | context_vec = (attn_weights @ values).transpose(1, 2)
103 |
104 | # Combine heads, where self.d_out = self.num_heads * self.head_dim
105 | context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
106 | context_vec = self.out_proj(context_vec) # optional projection
107 |
108 | return context_vec
109 |
--------------------------------------------------------------------------------
/ch04/02_performance-analysis/flops-analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 |     "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka\n",
8 |     "Code repository: https://github.com/rasbt/LLMs-from-scratch\n"
79 |
80 |
81 |
82 | ## Using Google Colab
83 |
84 | To use a Google Colab environment in the cloud, head over to [https://colab.research.google.com/](https://colab.research.google.com/) and open the respective chapter notebook from the GitHub menu or by dragging the notebook into the *Upload* field as shown in the figure below.
85 |
86 |
87 |
88 |
89 | Also make sure you upload the relevant files (the dataset files and the .py files the notebook imports from) to the Colab environment, as shown below.
90 |
91 |
92 |
93 |
94 | You can optionally run the code on a GPU by changing the *Runtime* as illustrated in the figure below.
95 |
96 |
97 |
98 |
99 |
100 |
101 | # Questions?
102 |
103 | If you have any questions, please don't hesitate to reach out via the [Discussions](https://github.com/rasbt/LLMs-from-scratch/discussions) forum in this GitHub repository.
104 |
--------------------------------------------------------------------------------
/ch07/02_dataset-utilities/find-near-duplicates.py:
--------------------------------------------------------------------------------
1 |
2 | # Copyright (c) Sebastian Raschka under Apache License 2.0 (see LICENSE.txt).
3 | # Source for "Build a Large Language Model From Scratch"
4 | # - https://www.manning.com/books/build-a-large-language-model-from-scratch
5 | # Code: https://github.com/rasbt/LLMs-from-scratch
6 |
7 | import argparse
8 | import json
9 | import re
10 | from sklearn import __version__ as sklearn_version
11 | from sklearn.feature_extraction.text import TfidfVectorizer
12 | from sklearn.metrics.pairwise import cosine_similarity
13 |
14 |
15 | # Sample JSON dataset
16 | example_data = [
17 | {"instruction": "What is the capital of Italy?",
18 | "input": "", "output": "The capital of Italy is Rome."
19 | },
20 | {"instruction": "What's the capital city of Italy?",
21 | "input": "", "output": "The capital city is Rome."
22 | },
23 | {"instruction": "Identify the main verb in the sentence: 'The cat sleeps on the couch.'",
24 | "input": "", "output": "The verb is 'sleeps'."
25 | },
26 | {"instruction": "Identify the verb in the following sentence: The cat sleeps on the couch.",
27 | "input": "", "output": "The verb in the sentence is \"sleeps.\""
28 | },
29 | # ...
30 | ]
31 |
32 |
33 | def preprocess_text(text):
34 | # Lowercase the text
35 | text = text.lower()
36 | # Remove punctuation
37 | text = re.sub(r'[^\w\s]', '', text)
38 | return text
39 |
40 |
41 | def find_near_duplicates(json_data, threshold=0.75, key="instruction"):
42 | """The higher the threshold, the more similar the texts have to be to match"""
43 |
44 | # Extract instructions
45 | text = [preprocess_text(item[key]) for item in json_data if item[key]]
46 | near_duplicates = []
47 | indices_to_remove = set()
48 |
49 | if not text:
50 | return {}, near_duplicates
51 |
52 | # Vectorize the text data
53 | vectorizer = TfidfVectorizer(stop_words=None, analyzer='char', ngram_range=(1, 3))
54 | tfidf_matrix = vectorizer.fit_transform(text)
55 |
56 | # Compute cosine similarity between each pair of entries
57 | cos_sim_matrix = cosine_similarity(tfidf_matrix)
58 |
59 | # Find pairs of near-duplicate instructions based on the threshold
60 |
61 | for i in range(len(cos_sim_matrix)):
62 | for j in range(i+1, len(cos_sim_matrix)):
63 | if cos_sim_matrix[i, j] > threshold:
64 | if len(json_data[i][key]) <= 1 or len(json_data[j][key]) <= 1:
65 | continue
66 | near_duplicates.append((json_data[i], json_data[j], cos_sim_matrix[i, j]))
67 | if key in ("input", "output"): # Don't remove duplicates based on the instruction
68 | indices_to_remove.add(j) # Mark the second entry for removal
69 |
70 | # Remove the near-duplicate entries
71 | filtered_json_data = [item for index, item in enumerate(json_data) if index not in indices_to_remove]
72 |
73 | return filtered_json_data, near_duplicates
74 |
75 |
76 | def find_print_and_remove_near_duplicates(json_data, remove_duplicates=False, threshold=0.75):
77 | """
78 | Searches each key in the first JSON object for duplicates across a list of JSON objects.
79 | Prints the duplicates if found.
80 | """
81 | for key in json_data[0].keys():
82 |
83 | if remove_duplicates:
84 | json_data, near_duplicates = find_near_duplicates(json_data, key=key, threshold=threshold)
85 | else:
86 | _, near_duplicates = find_near_duplicates(json_data, key=key, threshold=threshold)
87 | separator = 50 * '='
88 | print(f"\n\n{separator}\nSearching '{key}' for duplicates ...\n{separator}")
89 | if not near_duplicates:
90 | print("No duplicates found")
91 | else:
92 | for dup in near_duplicates:
93 | print(
94 | f"Duplicate pair found with similarity {dup[2]:.2f}:\n"
95 | f"1. {dup[0][key]}\n2. {dup[1][key]}\n"
96 | )
97 | return json_data
98 |
99 |
100 | if __name__ == "__main__":
101 | print("scikit-learn version:", sklearn_version)
102 |
103 | parser = argparse.ArgumentParser()
104 | parser.add_argument(
105 | "--json_file",
106 | type=str,
107 | help=("Path to the dataset JSON file")
108 | )
109 | parser.add_argument(
110 | "--threshold",
111 | type=float,
112 | default=0.9,
113 | help=("A sensitivity threshold between 0 and 1 where 1 is strictest")
114 | )
115 | parser.add_argument(
116 | "--remove_duplicates",
117 | action='store_true',
118 | default=False,
119 | help=(
120 | "Removes duplicates based on the 'input' or 'output' keys "
121 | " (but not the 'instruction') and saves the cleaned JSON file as --json_output_file"
122 | )
123 | )
124 | parser.add_argument(
125 | "--json_output_file",
126 | type=str,
127 | help=("Path to the dataset JSON file")
128 | )
129 |
130 | args = parser.parse_args()
131 |
132 | if args.remove_duplicates and not args.json_output_file:
133 | raise ValueError(
134 | "Provide an output file via --json_output_file "
135 | "to save the cleaned JSON data."
136 | )
137 |
138 | if not args.json_file:
139 | json_data = example_data
140 |
141 | else:
142 | with open(args.json_file, "r") as file:
143 | json_data = json.load(file)
144 |
145 | json_data = find_print_and_remove_near_duplicates(
146 | json_data=json_data,
147 | remove_duplicates=args.remove_duplicates,
148 | threshold=args.threshold
149 | )
150 |
151 | if args.remove_duplicates:
152 | with open(args.json_output_file, "w") as file:
153 | json.dump(json_data, file, indent=4)
154 |
--------------------------------------------------------------------------------
/ch06/01_main-chapter-code/exercise-solutions.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "ba450fb1-8a26-4894-ab7a-5d7bfefe90ce",
6 | "metadata": {},
7 | "source": [
8 |     "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka\n",
9 |     "Code repository: https://github.com/rasbt/LLMs-from-scratch\n"