├── .gitignore ├── 01_Intro_to_PyTorch ├── A.1.3_Installing_PyTorch │ └── main.py ├── A.2_Understanding_Tensors │ ├── A.2.1_Scalars_Vectors_Matrices_Tensors │ │ └── main.py │ ├── A.2.2_Tensor_data_types │ │ └── main.py │ ├── A.2.3_Common_PyTorch_tensor_operations │ │ └── main.py │ └── images │ │ └── tensors.png ├── A.3_Seeing_models_as_computation_graphs │ ├── images │ │ └── logistic-regression-forward-pass-as-computation-graph.png │ └── main.py ├── A.4_Automatic_differentiation_made_easy │ ├── images │ │ └── partial-derivatives-gradients.png │ └── main.py ├── A.5_Implementing_multilayer_neural_networks │ ├── images │ │ ├── multilayer-neural-network.png │ │ └── multilayer-perceptron-two-hidden-layers.png │ └── main.py ├── A.6_Setting_up_efficient_data_loaders │ ├── images │ │ ├── data-loaders-workers.png │ │ └── data-loaders.png │ └── main.py ├── A.7_A_typical_training_loop │ └── main.py ├── A.8_Saving_and_loading_models │ ├── main.py │ └── model.pth └── A.9_Optimizing_training_performance_with_GPUs │ ├── A.9.1_PyTorch_computations_on_GPU_devices │ ├── main.py │ └── model.pth │ ├── A.9.2_Single_GPU_Training │ └── main.py │ └── A.9.3_Training_with_multiple_GPUs │ ├── images │ ├── code-summary-1.png │ ├── code-summary-2.png │ ├── multi-gpu-1.png │ └── multi-gpu-2.png │ └── main.py ├── 02_Working_with_text_data ├── 2.2_Tokenizing_text │ ├── main.py │ └── the-verdict.txt ├── 2.3_Converting_tokens_into_token_IDs │ ├── main.py │ ├── the-verdict.txt │ └── token_ids.py ├── 2.4_Adding_special_context_tokens │ ├── main.py │ ├── the-verdict.txt │ └── token_ids.py ├── 2.5_Byte_pair_encoding │ └── main.py ├── 2.6_Data_sampling_with_a_sliding_window │ ├── images │ │ └── sliding window.png │ ├── input_token_pairs.py │ ├── main.py │ └── the-verdict.txt ├── 2.7_Creating_token_embeddings │ └── embedding_example.py └── 2.8_Encoding_word_positions │ ├── main.py │ └── the-verdict.txt ├── 03_Coding_attention_mechanisms ├── 3.3_Attending_to_different_parts_of_the_input_with_self_attention │ ├── 3.3.1_A_simple_self_attention_mechanism_without_trainable_weights │ │ ├── images │ │ │ ├── attention mechanism.png │ │ │ └── vector similarity example.png │ │ └── main.py │ └── 3.3.2_Computing_attention_weights_for_all_input_tokens │ │ ├── images │ │ └── attention weights heatmap.png │ │ └── main.py ├── 3.4_Implementing_self_attention_with_training_weights │ ├── 3.4.1_Computing_the_attention_weights_step_by_step │ │ ├── images │ │ │ ├── input and output dimensions.png │ │ │ ├── q, k distinction, pt1.png │ │ │ ├── q, k distinction, pt2.png │ │ │ ├── q, k distinction, pt3.png │ │ │ ├── q, k distinction, pt4.png │ │ │ ├── query, key, value, pt1.png │ │ │ ├── query, key, value, pt2.png │ │ │ ├── query, key, value, pt3.png │ │ │ └── query, key, value, pt4.png │ │ └── main.py │ └── 3.4.2_Implementing_a_compact_self_attention_Python_class │ │ ├── images │ │ ├── attention weights heatmap.png │ │ ├── key vector token attributes, pt1.png │ │ ├── key vector token attributes, pt2.png │ │ ├── key vector token attributes, pt3.png │ │ ├── key vector token attributes, pt4.png │ │ ├── key vector token attributes, pt5.png │ │ ├── key vector token attributes, pt6.png │ │ ├── key vector token attributes, pt7.png │ │ ├── key vector token attributes, pt8.png │ │ └── q,k,v,z.png │ │ ├── self-attention-class-v1.py │ │ └── self-attention-class-v2.py ├── 3.5_Hiding_future_words_with_causal_attention │ ├── 3.5.1_Applying_a_causal_attention_mask │ │ ├── images │ │ │ └── attn weights normalized with masked future tokens .png │ │ └── main.py │ ├── 
3.5.2_Masking_additional_attention_weights_with_dropout │ │ ├── images │ │ │ ├── attn weights normalized with dropout and masked future tokens .png │ │ │ └── dropout.png │ │ └── main.py │ └── 3.5.3_Implementing_a_compact_causal_attention_class │ │ └── main.py └── 3.6_Extending_single_head_attention_to_multi_head_attention │ ├── 3.6.1_Stacking_multiple_single_head_attention_layers │ ├── images │ │ ├── multi head attention output.png │ │ └── multi head attention.png │ └── main.py │ └── 3.6.2_Implementing_multi_head_attention_with_weight_splits │ ├── batched_matrix_multiplication.py │ ├── images │ ├── diagram.png │ ├── gpt-2 param explanation, pt1.png │ ├── gpt-2 param explanation, pt2.png │ └── gpt-2 param explanation, pt3.png │ └── main.py ├── 04_Implementing_a_GPT_model_to_generate_text ├── 4.1_Coding_an_LLM_Architecture │ ├── gpt_config.py │ ├── images │ │ └── logits explanation.png │ └── main.py ├── 4.2_Normalizing_activations_with_layer_normalization │ ├── images │ │ ├── biased variance broken down.png │ │ ├── dim parameter, pt1.png │ │ ├── dim parameter, pt2.png │ │ ├── forward pass & after check, pt1.png │ │ ├── forward pass & after check, pt2.png │ │ ├── forward pass & after check, pt3.png │ │ ├── layer normalization explained, pt1.png │ │ ├── layer normalization explained, pt2.png │ │ ├── layer normalization explained, pt3.png │ │ ├── layer normalization.png │ │ ├── variance calculation, pt1.png │ │ └── variance calculation, pt2.png │ ├── main.py │ └── normalization.py ├── 4.3_Implementing_a_feed_forward_network_with_GELU_activations │ ├── gelu.py │ ├── images │ │ ├── fnn diagram.png │ │ ├── gelu and relu plot.png │ │ └── input into feedforward neural net (fnn).png │ └── main.py ├── 4.4_Adding_shortcut_connections │ ├── images │ │ └── shortcut connections.png │ └── main.py ├── 4.5_Connecting_attention_and_linear_layers_in_a_transformer_block │ ├── images │ │ └── transformer block.png │ └── main.py ├── 4.6_Coding_the_GPT_Model │ ├── images │ │ └── gpt2 architecture.png │ └── main.py └── 4.7_Generating_text │ ├── exercise.py │ ├── images │ ├── iterations of a token prediction cycle.png │ ├── mechanics of text generation.png │ └── step by step text generation.png │ └── main.py ├── 05_Pretraining_on_unlabeled_data ├── 5.1_Evaluating_generative_text_models │ ├── 5.1.1_Using_GPT_to_generate_text │ │ ├── images │ │ │ ├── chapter topics.png │ │ │ ├── gpt build stages.png │ │ │ └── tokenizer placement in flow.png │ │ └── main.py │ ├── 5.1.2_Calculating_the_text_generation_loss │ │ ├── images │ │ │ ├── loss calculation steps.png │ │ │ ├── next tokens.png │ │ │ ├── perplexity score explanation.png │ │ │ └── text generation process.png │ │ └── main.py │ └── 5.1.3_Calculating_the_training_and_validation_set_losses │ │ ├── images │ │ └── dataloaders.png │ │ ├── loading_dataset.py │ │ ├── main.py │ │ └── the-verdict.txt ├── 5.2_Training_an_LLM │ ├── images │ │ ├── loss-plot.pdf │ │ ├── plot explanation.png │ │ ├── training loop.png │ │ └── training process.png │ ├── main.py │ └── the-verdict.txt ├── 5.3_Decoding_strategies_to_control_randomness │ ├── 5.3.0_Same_Outputs │ │ ├── main.py │ │ └── the-verdict.txt │ ├── 5.3.1_Temperature_Scaling │ │ ├── images │ │ │ ├── temperature explanation.png │ │ │ └── temperature-plot.pdf │ │ └── token_gen_process.py │ ├── 5.3.2_Top_k_sampling │ │ ├── images │ │ │ └── top k steps.png │ │ └── top_k.py │ └── 5.3.3_Modifying_the_text_generation_function │ │ ├── main.py │ │ └── the-verdict.txt ├── 5.4_Loading_and_saving_model_weights_in_Pytorch │ ├── exercise.py │ ├── 
images │ │ ├── torch no_grad explained, pt1.png │ │ ├── torch no_grad explained, pt2.png │ │ ├── torch no_grad explained, pt3.png │ │ └── torch no_grad explained, pt4.png │ ├── main.py │ └── the-verdict.txt └── 5.5_Loading_pretrained_weights_from_OpenAI │ ├── exercises │ ├── exercise_5.5_5.6.py │ ├── exercise_main.py │ └── the-verdict.txt │ ├── gpt_setup │ ├── __init__.py │ ├── gpt_download.py │ └── load_weights.py │ ├── images │ └── gpt architecture.png │ └── main.py ├── 06_Fine_tuning_for_classification ├── 6.2_Preparing_the_dataset │ ├── data_preprocessing.py │ ├── download.py │ ├── images │ │ └── classification fine tuning stages.png │ ├── sms_spam_collection.zip │ ├── sms_spam_collection │ │ ├── SMSSpamCollection.tsv │ │ └── readme │ ├── test.csv │ ├── train.csv │ └── validation.csv ├── 6.3_Creating_data_loaders │ ├── images │ │ ├── input text prep process.png │ │ └── single training batch.png │ ├── padding_token.py │ ├── spam_dataset.py │ ├── test.csv │ ├── train.csv │ └── validation.csv ├── 6.4_Initializing_a_model_with_pretrained_weights │ ├── data_setup │ │ ├── __init__.py │ │ ├── spam_dataset.py │ │ ├── test.csv │ │ ├── train.csv │ │ └── validation.csv │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ └── stages.png │ └── main.py ├── 6.5_Adding_a_classification_head │ ├── data_setup │ │ ├── __init__.py │ │ ├── spam_dataset.py │ │ ├── test.csv │ │ ├── train.csv │ │ └── validation.csv │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ ├── architecture adapation for binary classification.png │ │ ├── final layernorm and trf block set to trainable.png │ │ ├── fine-tuning selected layers vs all layers.png │ │ ├── last row of output tensor.png │ │ ├── last token contains attention score to all other tokens .png │ │ ├── layer training summary, pt1.png │ │ ├── layer training summary, pt2.png │ │ ├── layer training summary, pt3.png │ │ └── modifying output layer.png │ └── main.py ├── 6.6_Calculating_the_classification_loss_and_accuracy │ ├── data_setup │ │ ├── __init__.py │ │ ├── spam_dataset.py │ │ ├── test.csv │ │ ├── train.csv │ │ └── validation.csv │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ ├── model outputs.png │ │ └── stages.png │ └── main.py ├── 6.7_Fine_tuning_the_model_on_supervised_data │ ├── data_setup │ │ ├── __init__.py │ │ ├── spam_dataset.py │ │ ├── test.csv │ │ ├── train.csv │ │ └── validation.csv │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ ├── accuracy-plot.pdf │ │ ├── choosing the number of epochs.png │ │ ├── loss-plot.pdf │ │ ├── train, validation, test dataset explanation, pt1.png │ │ ├── train, validation, test dataset explanation, pt2.png │ │ ├── train, validation, test dataset explanation, pt3.png │ │ ├── train, validation, test dataset explanation, pt4.png │ │ ├── train, validation, test dataset explanation, pt5.png │ │ ├── train, validation, test dataset explanation, pt6.png │ │ ├── train, validation, test dataset explanation, pt7.png │ │ ├── training & validation accuracy plot explanation.png │ │ ├── training & validation loss plot explanation.png │ │ └── training loop.png │ └── main.py ├── 6.8_Using_the_LLM_as_a_spam_classifier │ ├── data_setup │ │ ├── __init__.py │ │ ├── spam_dataset.py │ │ ├── test.csv │ │ ├── train.csv │ │ └── validation.csv │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ ├── 
overview, pt1.png │ │ ├── overview, pt10.png │ │ ├── overview, pt11.png │ │ ├── overview, pt12.png │ │ ├── overview, pt13.png │ │ ├── overview, pt2.png │ │ ├── overview, pt3.png │ │ ├── overview, pt4.png │ │ ├── overview, pt5.png │ │ ├── overview, pt6.png │ │ ├── overview, pt7.png │ │ ├── overview, pt8.png │ │ └── overview, pt9.png │ ├── inference_playground.py │ └── main.py └── images │ ├── classification fine tuning.png │ ├── fine tuning approach.png │ ├── instruction fine tuning.png │ └── stages.png ├── 07_Fine_tuning_for_instructions ├── 7.1_Introduction_to_instruction_fine_tuning │ └── images │ │ ├── desired goal.png │ │ └── stages.png ├── 7.2_Preparing_a_dataset_for_supervised_instruction_fine_tuning │ ├── data_preprocessing.py │ ├── download.py │ ├── images │ │ └── instruction fine tuning prompt styles.png │ └── instruction-data.json ├── 7.3_Organizing_data_into_training_batches │ ├── images │ │ ├── -100 purpose in target IDs.png │ │ ├── Instruction Fine-Tuning Training Process, pt1.png │ │ ├── Instruction Fine-Tuning Training Process, pt2.png │ │ ├── Instruction Fine-Tuning Training Process, pt3.png │ │ ├── Instruction Fine-Tuning Training Process, pt4.png │ │ ├── Instruction Fine-Tuning Training Process, pt5.png │ │ ├── Instruction Fine-Tuning Training Process, pt6.png │ │ ├── batching process.png │ │ ├── cross entropy loss for logits_1 & targets_1.png │ │ ├── custom collate (assemble) function.png │ │ ├── first two steps.png │ │ ├── ignore_index book explanation.png │ │ ├── ignore_index in cross-cross-entropy loss, pt1.png │ │ ├── ignore_index in cross-cross-entropy loss, pt2.png │ │ ├── ignore_index in cross-cross-entropy loss, pt3.png │ │ ├── ignore_index in cross-cross-entropy loss, pt4.png │ │ ├── ignore_index in cross-cross-entropy loss, pt5.png │ │ ├── ignore_index in cross-cross-entropy loss, pt6.png │ │ ├── ignore_index in cross-cross-entropy loss, pt7.png │ │ ├── input and target token alignment.png │ │ ├── masked instruction tokens.png │ │ ├── masking the instruction tokens explained.png │ │ ├── padding token replacement in target batch.png │ │ ├── stages.png │ │ └── target IDs explained.png │ └── instruction_dataset.py ├── 7.4_Creating_data_loaders_for_an_instruction_dataset │ ├── data_setup │ │ ├── __init__.py │ │ ├── data_preprocessing.py │ │ ├── instruction-data.json │ │ └── instruction_dataset.py │ ├── images │ │ └── stages.png │ └── testing.py ├── 7.5_Loading_a_pretrained_LLM │ ├── data_setup │ │ ├── __init__.py │ │ ├── data_preprocessing.py │ │ ├── instruction-data.json │ │ └── instruction_dataset.py │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ └── stages.png │ └── main.py ├── 7.6_Fine_tuning_the_LLM_on_instruction_data │ ├── data_setup │ │ ├── __init__.py │ │ ├── data_preprocessing.py │ │ ├── instruction-data.json │ │ └── instruction_dataset.py │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ ├── dealing with hardware limitations.png │ │ ├── device runtimes.png │ │ ├── loss plot.png │ │ ├── loss-plot.pdf │ │ └── stages.png │ └── main.py ├── 7.7_Extracting_and_saving_responses │ ├── data_setup │ │ ├── __init__.py │ │ ├── data_preprocessing.py │ │ ├── instruction-data.json │ │ └── instruction_dataset.py │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ ├── model response explained, pt1.png │ │ ├── model response explained, pt2.png │ │ ├── model response, pt1.png │ │ ├── model response, pt2.png │ │ └── 
stages.png │ ├── instruction-data-with-response.json │ └── main.py ├── 7.8_Evaluating_the_fine_tuned_LLM │ ├── check_ollama_status.py │ ├── data_setup │ │ ├── __init__.py │ │ ├── data_preprocessing.py │ │ ├── instruction-data.json │ │ └── instruction_dataset.py │ ├── images │ │ ├── alternative ollama models.png │ │ ├── llama 3 score for gpt2 instruct, pt1.png │ │ ├── llama 3 score for gpt2 instruct, pt2.png │ │ ├── llama 3 score for gpt2 instruct, pt3.png │ │ ├── ollama.png │ │ ├── stages.png │ │ └── using larger llms via web apis.png │ ├── instruction-data-with-response.json │ └── main.py └── images │ └── stages.png ├── README.md └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | .DS_Store 3 | gpt2 4 | gpt2-medium355M-sft.pth -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.1.3_Installing_PyTorch/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | print(torch.cuda.is_available()) # False if no GPU 4 | 5 | print(torch.__version__) # Version # 6 | 7 | print(torch.backends.mps.is_available()) # Check whether your Mac supports PyTorch acceleration with its Apple Silicon chip -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.2_Understanding_Tensors/A.2.1_Scalars_Vectors_Matrices_Tensors/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | tensor0d = torch.tensor(1) # Creates a zero-dimensional tensor (scalar) from a Python integer 5 | 6 | tensor1d = torch.tensor([1, 2, 3]) # Creates a one-dimensional tensor (vector) from a Python list 7 | 8 | tensor2d = torch.tensor([[1, 2], # Creates a two-dimensional tensor from a nested Python list 9 | [3, 4]]) 10 | 11 | tensor3d = torch.tensor([[[1, 2], [3, 4]], # Creates a three-dimensional tensor from a nested Python list 12 | [[5, 6], [7, 8]]]) 13 | 14 | 15 | print("0d tensor: \n", tensor0d, "\n") 16 | print("1d tensor: \n", tensor1d, "\n") 17 | print("2d tensor: \n", tensor2d, "\n") 18 | print("3d tensor: \n", tensor3d, "\n") -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.2_Understanding_Tensors/A.2.2_Tensor_data_types/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | tensor1d = torch.tensor([1, 2, 3]) 5 | print(tensor1d.dtype) # torch.int64 6 | 7 | floatvec = torch.tensor([1.0, 2.0, 3.0]) 8 | print(floatvec.dtype) # torch.float32 9 | 10 | floatvec = tensor1d.to(torch.float32) 11 | print(floatvec.dtype) # torch.float32 -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.2_Understanding_Tensors/A.2.3_Common_PyTorch_tensor_operations/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | tensor2d = torch.tensor([[1, 2, 3], 5 | [4, 5, 6]]) 6 | 7 | print(tensor2d) 8 | 9 | print(tensor2d.shape) # [2, 3], 2 rows by 3 columns 10 | 11 | # print(tensor2d.reshape(3, 2)) # [3, 2], 3 rows by 2 columns, reshaping tensor 12 | 13 | print(tensor2d.view(3, 2)) # [3, 2], 3 rows by 2 columns, more common command for reshaping tensors in PyTorch 14 | 15 | print("Transpose: \n ", tensor2d.T, "\n") # Transpose, flip across its diagonal 16 | 17 | print(tensor2d.matmul(tensor2d.T)) # Matrix multiplication 18 | 19 | 
print(tensor2d @ tensor2d.T) # Matrix multiplication (compact syntax) -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.2_Understanding_Tensors/images/tensors.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.2_Understanding_Tensors/images/tensors.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.3_Seeing_models_as_computation_graphs/images/logistic-regression-forward-pass-as-computation-graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.3_Seeing_models_as_computation_graphs/images/logistic-regression-forward-pass-as-computation-graph.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.3_Seeing_models_as_computation_graphs/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | 4 | 5 | y = torch.tensor([1.0]) # true label 6 | x1 = torch.tensor([1.1]) # input feature 7 | w1 = torch.tensor([2.2]) # weight parameter 8 | b = torch.tensor([0.0]) # bias unit 9 | z = x1 * w1 + b # net input 10 | a = torch.sigmoid(z) # activation and output 11 | 12 | print(z) 13 | print(a) 14 | 15 | loss = F.binary_cross_entropy(a, y) 16 | print(loss) 17 | -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.4_Automatic_differentiation_made_easy/images/partial-derivatives-gradients.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.4_Automatic_differentiation_made_easy/images/partial-derivatives-gradients.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.4_Automatic_differentiation_made_easy/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | from torch.autograd import grad 4 | 5 | 6 | y = torch.tensor([1.0]) 7 | x1 = torch.tensor([1.1]) 8 | w1 = torch.tensor([2.2], requires_grad=True) 9 | b = torch.tensor([0.0], requires_grad=True) 10 | 11 | z = x1 * w1 + b 12 | a = torch.sigmoid(z) 13 | 14 | loss = F.binary_cross_entropy(a, y) 15 | 16 | loss.backward() 17 | print(w1.grad) 18 | print(b.grad) 19 | 20 | 21 | # Manual Method 22 | # grad_L_w1 = grad(loss, w1, retain_graph=True) 23 | # grad_L_b = grad(loss, b, retain_graph=True) 24 | # print(grad_L_w1) 25 | # print(grad_L_b) 26 | -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.5_Implementing_multilayer_neural_networks/images/multilayer-neural-network.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.5_Implementing_multilayer_neural_networks/images/multilayer-neural-network.png 
-------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.5_Implementing_multilayer_neural_networks/images/multilayer-perceptron-two-hidden-layers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.5_Implementing_multilayer_neural_networks/images/multilayer-perceptron-two-hidden-layers.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.5_Implementing_multilayer_neural_networks/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | class NeuralNetwork(torch.nn.Module): 5 | def __init__(self, num_inputs, num_outputs): 6 | super().__init__() 7 | 8 | self.layers = torch.nn.Sequential( 9 | 10 | # 1st hidden layer 11 | torch.nn.Linear(num_inputs, 30), 12 | torch.nn.ReLU(), 13 | 14 | # 2nd hidden layer 15 | torch.nn.Linear(30, 20), 16 | torch.nn.ReLU(), 17 | 18 | # Output layer 19 | torch.nn.Linear(20, num_outputs) 20 | ) 21 | 22 | def forward(self, x): 23 | logits = self.layers(x) 24 | return logits 25 | 26 | torch.manual_seed(123) 27 | model = NeuralNetwork(50, 3) 28 | print(model) 29 | 30 | num_params = sum(p.numel() for p in model.parameters() if p.requires_grad) 31 | print("Total number of trainable model parameters: ", num_params) 32 | 33 | print(model.layers[0].weight) 34 | # print(len(model.layers[0].weight)) 35 | # print(model.layers[0].weight.shape) 36 | # print(model.layers[0].bias) 37 | 38 | X = torch.rand((1, 50)) 39 | # out = model(X) 40 | # print(out) 41 | 42 | # When we use a model for inference (for instance, making predictions) rather than training, 43 | # the best practice is to use the torch.no_grad() context manager. This tells PyTorch that it doesn't 44 | # need to keep track of the gradients, which can result in significant savings in memory and computation. 45 | # with torch.no_grad(): 46 | # out = model(X) 47 | 48 | # If we want to compute class-membership probabilities for our predictions, we have to call the softmax function explicitly.
49 | with torch.no_grad(): 50 | out = torch.softmax(model(X), dim=1) 51 | 52 | print(out) -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.6_Setting_up_efficient_data_loaders/images/data-loaders-workers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.6_Setting_up_efficient_data_loaders/images/data-loaders-workers.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.6_Setting_up_efficient_data_loaders/images/data-loaders.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.6_Setting_up_efficient_data_loaders/images/data-loaders.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.6_Setting_up_efficient_data_loaders/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | 4 | 5 | X_train = torch.tensor([ 6 | [-1.2, 3.1], 7 | [-0.9, 2.9], 8 | [-0.5, 2.6], 9 | [2.3, -1.1], 10 | [2.7, -1.5] 11 | ]) 12 | 13 | y_train = torch.tensor([0, 0, 0, 1, 1]) # class labels 14 | 15 | X_test = torch.tensor([ 16 | [-0.8, 2.8], 17 | [2.6, -1.6] 18 | ]) 19 | 20 | y_test = torch.tensor([0, 1]) 21 | 22 | class ToyDataset(Dataset): 23 | def __init__(self, X, y): 24 | self.features = X 25 | self.labels = y 26 | 27 | def __getitem__(self, index): 28 | one_x = self.features[index] 29 | one_y = self.labels[index] 30 | return one_x, one_y 31 | 32 | def __len__(self): 33 | return self.labels.shape[0] 34 | 35 | train_ds = ToyDataset(X_train, y_train) 36 | test_ds = ToyDataset(X_test, y_test) 37 | 38 | print(len(train_ds)) 39 | 40 | torch.manual_seed(123) 41 | 42 | # train_loader = DataLoader( 43 | # dataset=train_ds, 44 | # batch_size=2, 45 | # shuffle=True, 46 | # num_workers=0 47 | # ) 48 | 49 | train_loader = DataLoader( 50 | dataset=train_ds, 51 | batch_size=2, 52 | shuffle=True, 53 | num_workers=0, 54 | drop_last=True # will drop 5th sample, since it's not even 55 | ) 56 | 57 | test_loader = DataLoader( 58 | dataset=test_ds, 59 | batch_size=2, 60 | shuffle=False, 61 | num_workers=0 62 | ) 63 | 64 | for idx, (x, y) in enumerate(train_loader): 65 | print(f"Batch {idx + 1}: ", x, y) 66 | print() -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.8_Saving_and_loading_models/model.pth: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.8_Saving_and_loading_models/model.pth -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.1_PyTorch_computations_on_GPU_devices/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | from torch.utils.data import Dataset, DataLoader 4 | 5 | 6 | torch.manual_seed(123) 7 | 8 | class NeuralNetwork(torch.nn.Module): 9 | def __init__(self, 
num_inputs, num_outputs): 10 | super().__init__() 11 | 12 | self.layers = torch.nn.Sequential( 13 | 14 | # 1st hidden layer 15 | torch.nn.Linear(num_inputs, 30), 16 | torch.nn.ReLU(), 17 | 18 | # 2nd hidden layer 19 | torch.nn.Linear(30, 20), 20 | torch.nn.ReLU(), 21 | 22 | # Output layer 23 | torch.nn.Linear(20, num_outputs) 24 | ) 25 | 26 | def forward(self, x): 27 | logits = self.layers(x) 28 | return logits 29 | 30 | model = NeuralNetwork(2, 2) 31 | model.load_state_dict(torch.load("model.pth")) 32 | 33 | # If GPU present 34 | # device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 35 | 36 | # If Apple Silicon Chip, MPS = Metal Performance Shader 37 | device = torch.device("mps" if torch.backends.mps.is_available() else "cpu") 38 | 39 | print(device) 40 | 41 | # CPU addition 42 | tensor_1 = torch.tensor([1., 2., 3.]) 43 | tensor_2 = torch.tensor([4., 5., 6.]) 44 | print(tensor_1 + tensor_2) 45 | 46 | 47 | tensor_1 = tensor_1.to("mps") 48 | tensor_2 = tensor_2.to("mps") 49 | print(tensor_1 + tensor_2) 50 | 51 | # tensor_2 = tensor_2.to("cpu") # Will crash, tensors need to be on the same device 52 | # print(tensor_1 + tensor_2) 53 | 54 | -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.1_PyTorch_computations_on_GPU_devices/model.pth: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.1_PyTorch_computations_on_GPU_devices/model.pth -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.2_Single_GPU_Training/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | from torch.utils.data import Dataset, DataLoader 4 | 5 | 6 | torch.manual_seed(123) 7 | 8 | class NeuralNetwork(torch.nn.Module): 9 | def __init__(self, num_inputs, num_outputs): 10 | super().__init__() 11 | 12 | self.layers = torch.nn.Sequential( 13 | 14 | # 1st hidden layer 15 | torch.nn.Linear(num_inputs, 30), 16 | torch.nn.ReLU(), 17 | 18 | # 2nd hidden layer 19 | torch.nn.Linear(30, 20), 20 | torch.nn.ReLU(), 21 | 22 | # Output layer 23 | torch.nn.Linear(20, num_outputs) 24 | ) 25 | 26 | def forward(self, x): 27 | logits = self.layers(x) 28 | return logits 29 | 30 | X_train = torch.tensor([ 31 | [-1.2, 3.1], 32 | [-0.9, 2.9], 33 | [-0.5, 2.6], 34 | [2.3, -1.1], 35 | [2.7, -1.5] 36 | ]) 37 | 38 | y_train = torch.tensor([0, 0, 0, 1, 1]) # class labels 39 | 40 | X_test = torch.tensor([ 41 | [-0.8, 2.8], 42 | [2.6, -1.6] 43 | ]) 44 | 45 | y_test = torch.tensor([0, 1]) 46 | 47 | class ToyDataset(Dataset): 48 | def __init__(self, X, y): 49 | self.features = X 50 | self.labels = y 51 | 52 | def __getitem__(self, index): 53 | one_x = self.features[index] 54 | one_y = self.labels[index] 55 | return one_x, one_y 56 | 57 | def __len__(self): 58 | return self.labels.shape[0] 59 | 60 | train_ds = ToyDataset(X_train, y_train) 61 | test_ds = ToyDataset(X_test, y_test) 62 | 63 | train_loader = DataLoader( 64 | dataset=train_ds, 65 | batch_size=2, 66 | shuffle=True, 67 | num_workers=0, 68 | drop_last=True # will drop 5th sample, since it's not even 69 | ) 70 | 71 | test_loader = DataLoader( 72 | dataset=test_ds, 73 
| batch_size=2, 74 | shuffle=False, 75 | num_workers=0 76 | ) 77 | 78 | model = NeuralNetwork(num_inputs=2, num_outputs=2) 79 | 80 | device = torch.device("mps" if torch.backends.mps.is_available() else "cpu") 81 | model = model.to(device) 82 | 83 | optimizer = torch.optim.SGD(model.parameters(), lr=0.5) 84 | 85 | num_epochs = 3 86 | 87 | for epoch in range(num_epochs): 88 | 89 | model.train() 90 | 91 | for batch_idx, (features, labels) in enumerate(train_loader): 92 | features, labels = features.to(device), labels.to(device) 93 | logits = model(features) 94 | loss = F.cross_entropy(logits, labels) # Loss function 95 | 96 | optimizer.zero_grad() 97 | loss.backward() 98 | optimizer.step() 99 | 100 | ### LOGGING 101 | print(f"Epoch: {epoch + 1:03d}/{num_epochs:03d}" 102 | f" | Batch {batch_idx:03d}/{len(train_loader):03d}" 103 | f" | Train Loss {loss:.2f}") 104 | 105 | model.eval() 106 | -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/code-summary-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/code-summary-1.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/code-summary-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/code-summary-2.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/multi-gpu-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/multi-gpu-1.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/multi-gpu-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/multi-gpu-2.png -------------------------------------------------------------------------------- /02_Working_with_text_data/2.2_Tokenizing_text/main.py: -------------------------------------------------------------------------------- 1 | import urllib.request 2 | import re 3 | 4 | url = ("https://raw.githubusercontent.com/rasbt/" 5 | "LLMs-from-scratch/main/ch02/01_main-chapter-code/" 6 | "the-verdict.txt") 7 | 8 | # Download .txt file into directory 9 | # file_path = "the-verdict.txt" 10 | 
# urllib.request.urlretrieve(url, file_path) 11 | 12 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 13 | raw_text = f.read() 14 | 15 | print("Total number of characters:", len(raw_text)) 16 | print(raw_text[:99]) 17 | 18 | # Tokenization 19 | # Split text on whitespace characters 20 | text = "Hello, world. This, is a test." 21 | result = re.split(r"(\s)", text) 22 | print("\n", result) 23 | 24 | # Split text on whitespace, commas, and periods 25 | result = re.split(r"([,.]|\s)", text) 26 | print("\n", result) 27 | 28 | # Optional, remove redundant whitespace characters 29 | result = [item for item in result if item.strip()] 30 | print("\n", result) 31 | 32 | # Split text to handle more punctuation, such as question marks, quotation marks, and double-dashes. 33 | text = "Hello, world. Is this-- a test?" 34 | result = re.split(r'([,.:;?_!"()\']|--|\s)', text) 35 | result = [item.strip() for item in result if item.strip()] # fully removes whitespaces 36 | print("\n", result, " \n Token Count:", len(result)) 37 | 38 | # Basic Tokenizer applied to full short story, "the-verdict.txt" 39 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) 40 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 41 | print("\n Sample of Tokenized output: \n", preprocessed[:30], "\n Full Token Count:", len(preprocessed)) -------------------------------------------------------------------------------- /02_Working_with_text_data/2.3_Converting_tokens_into_token_IDs/main.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | class SimpleTokenizerV1: 5 | def __init__(self, vocab): 6 | self.str_to_int = vocab 7 | self.int_to_str = {i:s for s, i in vocab.items()} 8 | 9 | def encode(self, text): 10 | """ Processes input text into token IDs """ 11 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text) 12 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 13 | ids = [self.str_to_int[s] for s in preprocessed] 14 | return ids 15 | 16 | def decode(self, ids): 17 | """ Converts token IDs back into text """ 18 | text = " ".join([self.int_to_str[i] for i in ids]) 19 | # Replace spaces before the specified punctuations 20 | text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) 21 | return text 22 | 23 | 24 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 25 | raw_text = f.read() 26 | 27 | # Tokenization 28 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) 29 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 30 | 31 | # Converting Tokens into Token IDs 32 | all_words = sorted(set(preprocessed)) 33 | vocab_size = len(all_words) 34 | 35 | # Creating Vocabulary dictionary 36 | vocab = {token:integer for integer, token in enumerate(all_words)} 37 | 38 | tokenizer = SimpleTokenizerV1(vocab) 39 | 40 | text = """"It's the last he painted, you know," 41 | Mrs. Gisburn said with pardonable pride.""" 42 | 43 | ids = tokenizer.encode(text) 44 | print("\n Token IDs:", ids) 45 | 46 | decoded_ids = tokenizer.decode(ids) 47 | print("\n Decoded IDs:", decoded_ids) 48 | 49 | text = "Hello, do you like tea?" 50 | print(tokenizer.encode(text)) # KeyError: 'Hello' 51 | # "Hello" was not part of the training data and thus not part of the existing vocabulary dictionary. 
-------------------------------------------------------------------------------- /02_Working_with_text_data/2.3_Converting_tokens_into_token_IDs/token_ids.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 5 | raw_text = f.read() 6 | 7 | # Tokenization 8 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) 9 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 10 | print("\n Sample of Tokenized output:\n", preprocessed[:30], "\n Full Token Count:", len(preprocessed)) 11 | 12 | # Converting Tokens into Token IDs 13 | all_words = sorted(set(preprocessed)) 14 | vocab_size = len(all_words) 15 | print("\n Vocabulary Size:", vocab_size) 16 | 17 | # Creating Vocabulary dictionary 18 | vocab = {token:integer for integer, token in enumerate(all_words)} 19 | 20 | # Printing first 51 entries of vocabulary 21 | for i, item in enumerate(vocab.items()): 22 | print(item) 23 | if i >= 50: 24 | break -------------------------------------------------------------------------------- /02_Working_with_text_data/2.4_Adding_special_context_tokens/main.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | class SimpleTokenizerV2: 5 | def __init__(self, vocab): 6 | self.str_to_int = vocab 7 | self.int_to_str = {i:s for s, i in vocab.items()} 8 | 9 | def encode(self, text): 10 | """ Processes input text into token IDs """ 11 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text) 12 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 13 | preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed] 14 | ids = [self.str_to_int[s] for s in preprocessed] 15 | return ids 16 | 17 | def decode(self, ids): 18 | """ Converts token IDs back into text """ 19 | text = " ".join([self.int_to_str[i] for i in ids]) 20 | # Replace spaces before the specified punctuations 21 | text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) 22 | return text 23 | 24 | 25 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 26 | raw_text = f.read() 27 | 28 | # Tokenization 29 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) 30 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 31 | 32 | # Converting Tokens into Token IDs - Adding 2 new special tokens 33 | all_tokens = sorted(list(set(preprocessed))) 34 | all_tokens.extend(["<|endoftext|>", "<|unk|>"]) 35 | vocab_size = len(all_tokens) 36 | print("\n Vocabulary Size: ", vocab_size, "\n") 37 | 38 | # Creating Vocabulary dictionary 39 | vocab = {token:integer for integer, token in enumerate(all_tokens)} 40 | 41 | tokenizer = SimpleTokenizerV2(vocab) 42 | 43 | text1 = "Hello, do you like tea?" 44 | text2 = "In the sunlit terraces of the palace."
45 | 46 | text = " <|endoftext|> ".join((text1, text2)) 47 | print(text) 48 | 49 | ids = tokenizer.encode(text) 50 | print("\n Token IDs:", ids) 51 | 52 | decoded_ids = tokenizer.decode(ids) 53 | print("\n Decoded IDs:", decoded_ids) -------------------------------------------------------------------------------- /02_Working_with_text_data/2.4_Adding_special_context_tokens/token_ids.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 5 | raw_text = f.read() 6 | 7 | # Tokenization 8 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) 9 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 10 | 11 | # Converting Tokens into Token IDs - Adding 2 new special tokens 12 | all_tokens = sorted(list(set(preprocessed))) 13 | all_tokens.extend(["<|endoftext|>", "<|unk|>"]) 14 | vocab_size = len(all_tokens) 15 | print("\n Vocabulary Size:", vocab_size, "\n") 16 | 17 | # Creating Vocabulary dictionary 18 | vocab = {token:integer for integer, token in enumerate(all_tokens)} 19 | 20 | # Printing last 5 entries of the updated vocabulary 21 | for i, item in enumerate(list(vocab.items())[-5:]): 22 | print(item) -------------------------------------------------------------------------------- /02_Working_with_text_data/2.5_Byte_pair_encoding/main.py: -------------------------------------------------------------------------------- 1 | from importlib.metadata import version 2 | import tiktoken 3 | 4 | 5 | print("tiktoken version:", version("tiktoken")) 6 | 7 | tokenizer = tiktoken.get_encoding("gpt2") 8 | 9 | text = ( 10 | "Hello, do you like tea? <|endoftext|> In the sunlit terraces" 11 | "of someunknownPlace." 12 | ) 13 | 14 | integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"}) 15 | print(integers) 16 | 17 | strings = tokenizer.decode(integers) 18 | print(strings) 19 | 20 | 21 | # Exercise 2.1 22 | text = "Akwirw ier" 23 | token_ids = tokenizer.encode(text) 24 | print("\n", token_ids) 25 | decoded_token_ids = tokenizer.decode(token_ids) 26 | print(decoded_token_ids) -------------------------------------------------------------------------------- /02_Working_with_text_data/2.6_Data_sampling_with_a_sliding_window/images/sliding window.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/02_Working_with_text_data/2.6_Data_sampling_with_a_sliding_window/images/sliding window.png -------------------------------------------------------------------------------- /02_Working_with_text_data/2.6_Data_sampling_with_a_sliding_window/input_token_pairs.py: -------------------------------------------------------------------------------- 1 | import tiktoken 2 | 3 | 4 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 5 | raw_text = f.read() 6 | 7 | tokenizer = tiktoken.get_encoding("gpt2") 8 | 9 | enc_text = tokenizer.encode(raw_text) 10 | print(len(enc_text)) # 5145 11 | 12 | enc_sample = enc_text[50:] # remove first 50 tokens 13 | 14 | # Creating input - target pairs 15 | context_size = 4 16 | x = enc_sample[:context_size] 17 | y = enc_sample[1:context_size + 1] 18 | print(f"x: {x}") 19 | print(f"y: {y}") 20 | 21 | print("\nOriginal Text:", tokenizer.decode(enc_sample[:context_size + 1])) 22 | 23 | print() 24 | # Token IDs 25 | # Input - Target Pairs 26 | for i in range(1, context_size + 1): 27 
| context = enc_sample[:i] 28 | desired = enc_sample[i] 29 | print(context, "---->", desired) # left side of arrow is what LLM receives, right side of arrow is what LLM needs to predict 30 | 31 | print() 32 | # Text 33 | # Input - Target Pairs 34 | for i in range(1, context_size + 1): 35 | context = enc_sample[:i] 36 | desired = enc_sample[i] 37 | print(tokenizer.decode(context), "---->", tokenizer.decode([desired])) 38 | -------------------------------------------------------------------------------- /02_Working_with_text_data/2.6_Data_sampling_with_a_sliding_window/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import tiktoken 4 | 5 | 6 | class GPTDataSetV1(Dataset): 7 | def __init__(self, txt, tokenizer, max_length, stride): 8 | self.input_ids = [] 9 | self.target_ids = [] 10 | 11 | # Tokenize the entire text 12 | token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"}) 13 | 14 | # Use a sliding window to chunk the book into overlapping sequences of max_length 15 | for i in range(0, len(token_ids) - max_length, stride): 16 | input_chunk = token_ids[i:i + max_length] 17 | target_chunk = token_ids[i + 1: i + max_length + 1] 18 | self.input_ids.append(torch.tensor(input_chunk)) 19 | self.target_ids.append(torch.tensor(target_chunk)) 20 | 21 | def __len__(self): 22 | return len(self.input_ids) 23 | 24 | def __getitem__(self, idx): 25 | return self.input_ids[idx], self.target_ids[idx] 26 | 27 | 28 | def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0): 29 | # Initialize the tokenizer 30 | tokenizer = tiktoken.get_encoding("gpt2") 31 | 32 | # Create dataset 33 | dataset = GPTDataSetV1(txt, tokenizer, max_length, stride) 34 | 35 | # Create dataloader 36 | dataloader = DataLoader( 37 | dataset, 38 | batch_size=batch_size, 39 | shuffle=shuffle, 40 | drop_last=drop_last, 41 | num_workers=num_workers 42 | ) 43 | 44 | return dataloader 45 | 46 | 47 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 48 | raw_text = f.read() 49 | 50 | dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False) 51 | data_iter = iter(dataloader) 52 | first_batch = next(data_iter) 53 | print("\nFirst Batch:", first_batch) 54 | 55 | second_batch = next(data_iter) 56 | print("Second Batch:", second_batch) 57 | 58 | # Exercise 2.2 59 | dataloader2 = create_dataloader_v1(raw_text, batch_size=1, max_length=2, stride=2, shuffle=False) 60 | data_iter2 = iter(dataloader2) 61 | first_batch2 = next(data_iter2) 62 | print("\nFirst Batch 2:", first_batch2) 63 | 64 | second_batch2 = next(data_iter2) 65 | print("Second Batch 2:", second_batch2) 66 | 67 | dataloader3 = create_dataloader_v1(raw_text, batch_size=1, max_length=8, stride=2, shuffle=False) 68 | data_iter3 = iter(dataloader3) 69 | first_batch3 = next(data_iter3) 70 | print("\nFirst Batch 3:", first_batch3) 71 | 72 | second_batch3 = next(data_iter3) 73 | print("Second Batch 3:", second_batch3) 74 | 75 | # -------------------------------------------------------------------------------- 76 | 77 | dataloader_final = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False) 78 | data_iter_final = iter(dataloader_final) 79 | inputs, targets = next(data_iter_final) 80 | print("\nInputs:\n", inputs) 81 | print("\nTargets:\n", targets) -------------------------------------------------------------------------------- 
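Side note on the sliding-window logic in 2.6: the snippet below is a minimal illustrative sketch (not a file from the repository) that replays the same chunking loop used in GPTDataSetV1.__init__ on a small made-up list of token IDs, so the effect of max_length and stride on chunk overlap is easy to see.

import torch

token_ids = list(range(10))  # toy token IDs 0..9, standing in for a tokenized text
max_length = 4               # number of tokens per input chunk
stride = 2                   # how far the window slides between chunks

# Same windowing as GPTDataSetV1.__init__: the target chunk is the input chunk shifted by one token
for i in range(0, len(token_ids) - max_length, stride):
    input_chunk = torch.tensor(token_ids[i:i + max_length])
    target_chunk = torch.tensor(token_ids[i + 1:i + max_length + 1])
    print(input_chunk.tolist(), "---->", target_chunk.tolist())

With stride=2 each new input chunk overlaps the previous one by two tokens; with stride equal to max_length (as in dataloader_final above) consecutive chunks do not overlap at all.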
/02_Working_with_text_data/2.7_Creating_token_embeddings/embedding_example.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | input_ids = torch.tensor([2, 3, 5, 1]) 5 | vocab_size = 6 6 | output_dim = 3 7 | 8 | torch.manual_seed(123) 9 | embedding_layer = torch.nn.Embedding(vocab_size, output_dim) 10 | 11 | print(embedding_layer.weight) 12 | 13 | print("\n", embedding_layer(torch.tensor([3]))) 14 | 15 | print("\n", embedding_layer(input_ids)) 16 | 17 | # Understanding the Difference Between Embedding Layers and Linear Layers (Bonus) 18 | # https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/03_bonus_embedding-vs-matmul/embeddings-and-linear-layers.ipynb 19 | -------------------------------------------------------------------------------- /02_Working_with_text_data/2.8_Encoding_word_positions/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import tiktoken 4 | 5 | 6 | class GPTDataSetV1(Dataset): 7 | def __init__(self, txt, tokenizer, max_length, stride): 8 | self.input_ids = [] 9 | self.target_ids = [] 10 | 11 | # Tokenize the entire text 12 | token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"}) 13 | 14 | # Use a sliding window to chunk the book into overlapping sequences of max_length 15 | for i in range(0, len(token_ids) - max_length, stride): 16 | input_chunk = token_ids[i:i + max_length] 17 | target_chunk = token_ids[i + 1: i + max_length + 1] 18 | self.input_ids.append(torch.tensor(input_chunk)) 19 | self.target_ids.append(torch.tensor(target_chunk)) 20 | 21 | def __len__(self): 22 | return len(self.input_ids) 23 | 24 | def __getitem__(self, idx): 25 | return self.input_ids[idx], self.target_ids[idx] 26 | 27 | 28 | def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0): 29 | # Initialize the tokenizer 30 | tokenizer = tiktoken.get_encoding("gpt2") 31 | 32 | # Create dataset 33 | dataset = GPTDataSetV1(txt, tokenizer, max_length, stride) 34 | 35 | # Create dataloader 36 | dataloader = DataLoader( 37 | dataset, 38 | batch_size=batch_size, 39 | shuffle=shuffle, 40 | drop_last=drop_last, 41 | num_workers=num_workers 42 | ) 43 | 44 | return dataloader 45 | 46 | 47 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 48 | raw_text = f.read() 49 | 50 | 51 | vocab_size = 50257 52 | output_dim = 256 53 | token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim) 54 | 55 | max_length = 4 56 | dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False) 57 | data_iter = iter(dataloader) 58 | inputs, targets = next(data_iter) 59 | print("\nToken IDs:\n", inputs) 60 | print("\nInputs shape:\n", inputs.shape) 61 | 62 | token_embeddings = token_embedding_layer(inputs) 63 | 64 | print("\nToken Embeddings:", token_embeddings) 65 | print("\nToken Embeddings shape:\n", token_embeddings.shape) 66 | 67 | context_length = max_length 68 | pos_embedding_layer = torch.nn.Embedding(context_length, output_dim) 69 | pos_embeddings = pos_embedding_layer(torch.arange(context_length)) 70 | 71 | input_embeddings = token_embeddings + pos_embeddings 72 | 73 | print("\nPos embeddings:", pos_embeddings) 74 | print(pos_embeddings.shape) 75 | 76 | print("\nInput embeddings:", input_embeddings) 77 | print(input_embeddings.shape) 
-------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.1_A_simple_self_attention_mechanism_without_trainable_weights/images/attention mechanism.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.1_A_simple_self_attention_mechanism_without_trainable_weights/images/attention mechanism.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.1_A_simple_self_attention_mechanism_without_trainable_weights/images/vector similarity example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.1_A_simple_self_attention_mechanism_without_trainable_weights/images/vector similarity example.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.1_A_simple_self_attention_mechanism_without_trainable_weights/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | inputs = torch.tensor([ 5 | [0.43, 0.15, 0.89], # Your (x^1) 6 | [0.55, 0.87, 0.66], # journey (x^2) 7 | [0.57, 0.85, 0.64], # starts (x^3) 8 | [0.22, 0.58, 0.33], # with (x^4) 9 | [0.77, 0.25, 0.10], # one (x^5) 10 | [0.05, 0.80, 0.55] # step (x^6) 11 | ]) 12 | 13 | query = inputs[1] # journey (x^2) 14 | attn_scores_2 = torch.empty(inputs.shape[0]) # shape[0] = 6 15 | 16 | for i, x_i, in enumerate(inputs): 17 | attn_scores_2[i] = torch.dot(x_i, query) 18 | 19 | print("\nAttention Scores 2:\n", attn_scores_2, "\n") 20 | 21 | # Understanding dot product 22 | print("** Elements being multiplied in for loop **") 23 | res = 0 24 | for idx, element in enumerate(inputs[0]): 25 | print(inputs[0][idx], "*", query[idx]) 26 | res += inputs[0][idx] * query[idx] 27 | print("\nDot Product Manual Example:", res) 28 | print("Dot Product Torch Example:", torch.dot(inputs[0], query)) 29 | 30 | # Normalization 31 | attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum() 32 | print("\nAttention Weights:", attn_weights_2_tmp) 33 | print("Sum:", attn_weights_2_tmp.sum()) 34 | 35 | # Softmax Normalization 36 | def softmax_naive(x): 37 | return torch.exp(x) / torch.exp(x).sum(dim=0) 38 | 39 | attn_weights_2_naive = softmax_naive(attn_scores_2) 40 | print("\nAttention Weights Naive:", attn_weights_2_naive) 41 | print("Sum Naive:", attn_weights_2_naive.sum()) 42 | 43 | # PyTorch Softmax Normalization 44 | attn_weights_2_torch = torch.softmax(attn_scores_2, dim=0) 45 | print("\nAttention Weights PyTorch:", attn_weights_2_torch) 46 | print("Sum PyTorch:", attn_weights_2_torch.sum()) 47 | 48 | # Calculating the context vector for the 2nd input (journey (x^2)) 49 | query = inputs[1] 50 | context_vec_2 = torch.zeros(query.shape) # shape = 3 51 | for i, x_i in enumerate(inputs): 52 | context_vec_2 += 
attn_weights_2_torch[i] * x_i 53 | 54 | print("\nContext Vector 2:", context_vec_2) -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.2_Computing_attention_weights_for_all_input_tokens/images/attention weights heatmap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.2_Computing_attention_weights_for_all_input_tokens/images/attention weights heatmap.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.2_Computing_attention_weights_for_all_input_tokens/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | inputs = torch.tensor([ 5 | [0.43, 0.15, 0.89], # Your (x^1) 6 | [0.55, 0.87, 0.66], # journey (x^2) 7 | [0.57, 0.85, 0.64], # starts (x^3) 8 | [0.22, 0.58, 0.33], # with (x^4) 9 | [0.77, 0.25, 0.10], # one (x^5) 10 | [0.05, 0.80, 0.55] # step (x^6) 11 | ]) 12 | 13 | print(inputs.shape) 14 | 15 | attn_scores = torch.empty(6, 6) 16 | 17 | # For Loop Method 18 | for i, x_i in enumerate(inputs): 19 | for j, x_j in enumerate(inputs): 20 | attn_scores[i, j] = torch.dot(x_i, x_j) 21 | 22 | # Matrix Multiplication 23 | attn_scores = inputs @ inputs.T 24 | print("\nAttention Scores:\n", attn_scores) 25 | 26 | # Normalized 27 | attn_weights = torch.softmax(attn_scores, dim=-1) 28 | print("Attention Weights:\n", attn_weights, "\n") 29 | 30 | # Verification of Sum to 1 31 | row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581]) 32 | print("Row 2 Sum:", row_2_sum) 33 | print("All row sums:", attn_weights.sum(dim=-1)) 34 | 35 | # All Context Vectors 36 | all_context_vecs = attn_weights @ inputs 37 | print("\nAll Context Vectors:\n", all_context_vecs) 38 | 39 | 40 | # Verification of 2nd context vector 41 | query = inputs[1] # journey (x^2) 42 | attn_scores_2 = torch.empty(inputs.shape[0]) # shape[0] = 6 43 | 44 | for i, x_i, in enumerate(inputs): 45 | attn_scores_2[i] = torch.dot(x_i, query) 46 | 47 | attn_weights_2_torch = torch.softmax(attn_scores_2, dim=0) 48 | 49 | # Calculating the context vector for the 2nd input (journey (x^2)) 50 | query = inputs[1] 51 | context_vec_2 = torch.zeros(query.shape) # shape = 3 52 | for i, x_i in enumerate(inputs): 53 | context_vec_2 += attn_weights_2_torch[i] * x_i 54 | 55 | print("\nPrevious 2nd Context Vector:", context_vec_2) 56 | -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/input and output dimensions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/input and output dimensions.png -------------------------------------------------------------------------------- 
/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt1.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt2.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt3.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt4.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt1.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt2.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt3.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt4.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | inputs = torch.tensor([ 5 | [0.43, 0.15, 0.89], # Your (x^1) 6 | [0.55, 0.87, 0.66], # journey (x^2) 7 | [0.57, 0.85, 0.64], # starts (x^3) 8 | [0.22, 0.58, 0.33], # with (x^4) 9 | [0.77, 0.25, 0.10], # one (x^5) 10 | [0.05, 0.80, 0.55] # step (x^6) 11 | ]) 12 | 13 | x_2 = inputs[1] # journey 14 | d_in = inputs.shape[1] # last embedding size, 3 15 | d_out = 2 # the output embedding size 16 | 17 | torch.manual_seed(123) 18 | W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False) 19 | W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False) 20 | W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False) 21 | 22 | query_2 = x_2 @ W_query 23 | key_2 = x_2 @ W_key 24 | value_2 = x_2 @ W_value 25 | 26 | print(query_2) # tensor([0.4306, 1.4551]) 27 | 28 | # Obtaining all keys and values via matrix multiplication 29 | keys = inputs @ W_key 30 | values = inputs @ W_value 31 | print("keys.shape:", keys.shape) 32 | print("values.shape:", values.shape) 33 | 34 | # attention score w22 35 | keys_2 = keys[1] 36 | attn_score_22 = query_2.dot(keys_2) 37 | print(attn_score_22) 38 | 39 | # all attention scores for given query 40 | attn_scores_2 = query_2 @ keys.T 41 | print(attn_scores_2) 42 | 43 | d_k = keys.shape[-1] 44 | attn_weights_2 = torch.softmax(attn_scores_2 / d_k ** 0.5, dim=-1) 45 | print(attn_weights_2) 46 | 47 | context_vec_2 = attn_weights_2 @ values 48 | print(context_vec_2) # context vector for "journey" 49 | -------------------------------------------------------------------------------- 
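The step-by-step code above computes the context vector for a single query ("journey") only. Batching the same matrix products yields the context vectors for all six tokens at once; a minimal sketch (not a file from this repo) that reuses the same seed and weight shapes, so its second row should match context_vec_2 up to floating-point rounding:

import torch

inputs = torch.tensor([
    [0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64],
    [0.22, 0.58, 0.33], [0.77, 0.25, 0.10], [0.05, 0.80, 0.55]
])

d_in, d_out = inputs.shape[1], 2
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

queries = inputs @ W_query    # (6, 2): one query vector per token
keys = inputs @ W_key         # (6, 2)
values = inputs @ W_value     # (6, 2)

attn_scores = queries @ keys.T                                           # (6, 6)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)  # scaled by sqrt(d_k)
context_vecs = attn_weights @ values                                     # (6, 2)

print(context_vecs[1])   # row for "journey"; compare with context_vec_2 printed above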
/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/attention weights heatmap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/attention weights heatmap.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt1.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt2.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt3.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt4.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt5.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt5.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt6.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt7.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt8.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/q,k,v,z.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/q,k,v,z.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/self-attention-class-v1.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | import torch 3 | 4 | 5 | class SelfAttention_v1(nn.Module): 6 | def __init__(self, d_in, d_out): 7 | super().__init__() 8 | self.W_query = nn.Parameter(torch.rand(d_in, 
d_out)) 9 | self.W_key = nn.Parameter(torch.rand(d_in, d_out)) 10 | self.W_value = nn.Parameter(torch.rand(d_in, d_out)) 11 | 12 | def forward(self, x): 13 | queries = x @ self.W_query 14 | keys = x @ self.W_key 15 | values = x @ self.W_value 16 | attn_scores = queries @ keys.T # omega 17 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 18 | context_vec = attn_weights @ values 19 | return context_vec 20 | 21 | 22 | inputs = torch.tensor([ 23 | [0.43, 0.15, 0.89], # Your (x^1) 24 | [0.55, 0.87, 0.66], # journey (x^2) 25 | [0.57, 0.85, 0.64], # starts (x^3) 26 | [0.22, 0.58, 0.33], # with (x^4) 27 | [0.77, 0.25, 0.10], # one (x^5) 28 | [0.05, 0.80, 0.55] # step (x^6) 29 | ]) 30 | 31 | torch.manual_seed(123) 32 | d_in = inputs.shape[1] # 3 33 | d_out = 2 34 | sa_v1 = SelfAttention_v1(d_in, d_out) 35 | 36 | # Since inputs contains six embedding vectors, this results 37 | # in a matrix storing the six context vectors. 38 | print(sa_v1(inputs)) -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/self-attention-class-v2.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | import torch 3 | 4 | 5 | class SelfAttention_v2(nn.Module): 6 | def __init__(self, d_in, d_out, qkv_bias=False): 7 | super().__init__() 8 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 9 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) 10 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 11 | 12 | def forward(self, x): 13 | queries = self.W_query(x) 14 | keys = self.W_key(x) 15 | values = self.W_value(x) 16 | attn_scores = queries @ keys.T # omega 17 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 18 | context_vec = attn_weights @ values 19 | return context_vec 20 | 21 | 22 | inputs = torch.tensor([ 23 | [0.43, 0.15, 0.89], # Your (x^1) 24 | [0.55, 0.87, 0.66], # journey (x^2) 25 | [0.57, 0.85, 0.64], # starts (x^3) 26 | [0.22, 0.58, 0.33], # with (x^4) 27 | [0.77, 0.25, 0.10], # one (x^5) 28 | [0.05, 0.80, 0.55] # step (x^6) 29 | ]) 30 | 31 | torch.manual_seed(789) 32 | d_in = inputs.shape[1] # 3 33 | d_out = 2 34 | sa_v2 = SelfAttention_v2(d_in, d_out) 35 | 36 | # Since inputs contains six embedding vectors, this results 37 | # in a matrix storing the six context vectors. 
38 | print("SelfAttention_v2 Context Vectors:\n",sa_v2(inputs)) 39 | 40 | 41 | # print("W_query weights:") 42 | # print(sa_v2.W_query.weight) 43 | # print(sa_v2.W_query.weight.T) 44 | 45 | # Exercise 3.1 - Comparing SelfAttention_v1 and SelfAttention_v2 46 | 47 | class SelfAttention_v1(nn.Module): 48 | def __init__(self, d_in, d_out): 49 | super().__init__() 50 | self.W_query = nn.Parameter(torch.rand(d_in, d_out)) 51 | self.W_key = nn.Parameter(torch.rand(d_in, d_out)) 52 | self.W_value = nn.Parameter(torch.rand(d_in, d_out)) 53 | 54 | def forward(self, x): 55 | queries = x @ self.W_query 56 | keys = x @ self.W_key 57 | values = x @ self.W_value 58 | attn_scores = queries @ keys.T # omega 59 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 60 | context_vec = attn_weights @ values 61 | return context_vec 62 | 63 | sa_v1 = SelfAttention_v1(d_in, d_out) 64 | sa_v1.W_query = nn.Parameter(sa_v2.W_query.weight.T) 65 | sa_v1.W_key = nn.Parameter(sa_v2.W_key.weight.T) 66 | sa_v1.W_value = nn.Parameter(sa_v2.W_value.weight.T) 67 | 68 | print("\nSelfAttention_v1 Context Vectors:\n", sa_v1(inputs)) 69 | -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.1_Applying_a_causal_attention_mask/images/attn weights normalized with masked future tokens .png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.1_Applying_a_causal_attention_mask/images/attn weights normalized with masked future tokens .png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.1_Applying_a_causal_attention_mask/main.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | import torch 3 | 4 | 5 | class SelfAttention_v2(nn.Module): 6 | def __init__(self, d_in, d_out, qkv_bias=False): 7 | super().__init__() 8 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 9 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) 10 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 11 | 12 | def forward(self, x): 13 | queries = self.W_query(x) 14 | keys = self.W_key(x) 15 | values = self.W_value(x) 16 | attn_scores = queries @ keys.T # omega 17 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 18 | context_vec = attn_weights @ values 19 | return context_vec 20 | 21 | 22 | inputs = torch.tensor([ 23 | [0.43, 0.15, 0.89], # Your (x^1) 24 | [0.55, 0.87, 0.66], # journey (x^2) 25 | [0.57, 0.85, 0.64], # starts (x^3) 26 | [0.22, 0.58, 0.33], # with (x^4) 27 | [0.77, 0.25, 0.10], # one (x^5) 28 | [0.05, 0.80, 0.55] # step (x^6) 29 | ]) 30 | 31 | torch.manual_seed(789) 32 | d_in = inputs.shape[1] # 3 33 | d_out = 2 34 | sa_v2 = SelfAttention_v2(d_in, d_out) 35 | 36 | # sa_v2(inputs) 37 | 38 | # Manually Getting weights 39 | queries = sa_v2.W_query(inputs) 40 | keys = sa_v2.W_key(inputs) 41 | attn_scores = queries @ keys.T 42 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 43 | print("\nAttention Weights:\n", attn_weights) 44 | 45 | # Creating Mask 46 | context_length = attn_scores.shape[0] # 6 47 | mask_simple = torch.tril(torch.ones(context_length, 
context_length)) 48 | print("\nMask Simple:\n", mask_simple) 49 | 50 | # Multiply mask with attention weights to zero-out the values above the diagonal 51 | masked_simple = attn_weights * mask_simple 52 | print("\nMasked Simple (zero-out values above diag):\n", masked_simple) 53 | 54 | # Normalize the attention weights to sum up to 1 again in each row 55 | row_sums = masked_simple.sum(dim=-1, keepdim=True) 56 | print("\nRow Sums:\n", row_sums) 57 | masked_simple_norm = masked_simple / row_sums 58 | print("\nNormalized Masked Attention Weights:\n", masked_simple_norm) 59 | 60 | # Masking with 1's above the diagonal and replacing the 1s with negativt infinity (-inf) values, more efficient 61 | mask = torch.triu(torch.ones(context_length, context_length), diagonal=1) 62 | masked = attn_scores.masked_fill(mask.bool(), -torch.inf) 63 | print("\nAlternate Masking Method with -inf:\n", masked) 64 | 65 | # Normalizing alternate masked attention scores 66 | attn_weights_different = torch.softmax(masked / keys.shape[-1]**0.5, dim=1) 67 | print("\nAlternate Normalized Masked Attention Weights:\n", attn_weights_different) -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.2_Masking_additional_attention_weights_with_dropout/images/attn weights normalized with dropout and masked future tokens .png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.2_Masking_additional_attention_weights_with_dropout/images/attn weights normalized with dropout and masked future tokens .png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.2_Masking_additional_attention_weights_with_dropout/images/dropout.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.2_Masking_additional_attention_weights_with_dropout/images/dropout.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.2_Masking_additional_attention_weights_with_dropout/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | torch.manual_seed(123) 6 | 7 | dropout = torch.nn.Dropout(0.5) 8 | example = torch.ones(6, 6) 9 | 10 | print("\nPre-dropout:\n", example) 11 | print("\nPost-dropout:\n", dropout(example)) 12 | 13 | 14 | 15 | class SelfAttention_v2(nn.Module): 16 | def __init__(self, d_in, d_out, qkv_bias=False): 17 | super().__init__() 18 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 19 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) 20 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 21 | 22 | def forward(self, x): 23 | queries = self.W_query(x) 24 | keys = self.W_key(x) 25 | values = self.W_value(x) 26 | attn_scores = queries @ keys.T # omega 27 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 28 | context_vec = 
attn_weights @ values 29 | return context_vec 30 | 31 | 32 | inputs = torch.tensor([ 33 | [0.43, 0.15, 0.89], # Your (x^1) 34 | [0.55, 0.87, 0.66], # journey (x^2) 35 | [0.57, 0.85, 0.64], # starts (x^3) 36 | [0.22, 0.58, 0.33], # with (x^4) 37 | [0.77, 0.25, 0.10], # one (x^5) 38 | [0.05, 0.80, 0.55] # step (x^6) 39 | ]) 40 | 41 | d_in = inputs.shape[1] # 3 42 | d_out = 2 43 | sa_v2 = SelfAttention_v2(d_in, d_out) 44 | 45 | 46 | # Manually Getting weights 47 | queries = sa_v2.W_query(inputs) 48 | keys = sa_v2.W_key(inputs) 49 | attn_scores = queries @ keys.T 50 | 51 | context_length = attn_scores.shape[0] # 6 52 | 53 | # Masking with 1's above the diagonal and replacing the 1s with negativt infinity (-inf) values, more efficient 54 | mask = torch.triu(torch.ones(context_length, context_length), diagonal=1) 55 | masked = attn_scores.masked_fill(mask.bool(), -torch.inf) 56 | print("\nAlternate Masking Method with -inf:\n", masked) 57 | 58 | # Normalizing alternate masked attention scores 59 | attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=1) 60 | print("\nAlternate Normalized Masked Attention Weights:\n", attn_weights) 61 | 62 | print("\nAttention Weights with Dropout:\n", dropout(attn_weights)) -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.3_Implementing_a_compact_causal_attention_class/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class CausalAttention(nn.Module): 6 | def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False): 7 | super().__init__() 8 | self.d_out = d_out 9 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 10 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) 11 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 12 | self.dropout = nn.Dropout(dropout) 13 | self.register_buffer( 14 | "mask", 15 | torch.triu(torch.ones(context_length, context_length), diagonal=1) 16 | ) 17 | 18 | def forward(self, x): 19 | b, num_tokens, d_in = x.shape 20 | queries = self.W_query(x) 21 | keys = self.W_key(x) 22 | values = self.W_value(x) 23 | 24 | attn_scores = queries @ keys.transpose(1, 2) 25 | attn_scores.masked_fill_( 26 | self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) 27 | attn_weights = torch.softmax( 28 | attn_scores / keys.shape[-1]**0.5, dim=-1 29 | ) 30 | attn_weights = self.dropout(attn_weights) 31 | 32 | context_vec = attn_weights @ values 33 | return context_vec 34 | 35 | 36 | inputs = torch.tensor([ 37 | [0.43, 0.15, 0.89], # Your (x^1) 38 | [0.55, 0.87, 0.66], # journey (x^2) 39 | [0.57, 0.85, 0.64], # starts (x^3) 40 | [0.22, 0.58, 0.33], # with (x^4) 41 | [0.77, 0.25, 0.10], # one (x^5) 42 | [0.05, 0.80, 0.55] # step (x^6) 43 | ]) 44 | 45 | 46 | batch = torch.stack((inputs, inputs), dim=0) 47 | print("\nBatch of inputs:\n", batch) 48 | print("\nBatch shape:\n", batch.shape) 49 | 50 | torch.manual_seed(123) 51 | d_in = inputs.shape[1] # 3 52 | d_out = 2 53 | context_length = batch.shape[1] # 6 54 | ca = CausalAttention(d_in, d_out, context_length, 0.0) 55 | context_vecs = ca(batch) 56 | 57 | print("\nContext Vectors:\n", context_vecs) 58 | print("\nContext Vectors Shape:", context_vecs.shape) -------------------------------------------------------------------------------- 
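One point worth making explicit about the two masking strategies from 3.5.1: zeroing the softmaxed weights above the diagonal and renormalizing each row gives exactly the same result as filling the raw scores above the diagonal with -inf before the softmax, because exp(-inf) = 0 and the softmax denominator then only sums over the kept positions. A minimal standalone check (not a file from this repo; random scores stand in for queries @ keys.T):

import torch

torch.manual_seed(123)
context_length = 6
d_k = 2
attn_scores = torch.randn(context_length, context_length)   # stand-in for queries @ keys.T

# Method 1: softmax, zero out future positions, renormalize each row
weights = torch.softmax(attn_scores / d_k**0.5, dim=-1)
mask_simple = torch.tril(torch.ones(context_length, context_length))
masked_simple = weights * mask_simple
masked_simple_norm = masked_simple / masked_simple.sum(dim=-1, keepdim=True)

# Method 2: set future scores to -inf, then apply softmax once
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
weights_inf = torch.softmax(masked / d_k**0.5, dim=-1)

print(torch.allclose(masked_simple_norm, weights_inf))   # expected: True (agreement up to rounding)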
/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.1_Stacking_multiple_single_head_attention_layers/images/multi head attention output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.1_Stacking_multiple_single_head_attention_layers/images/multi head attention output.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.1_Stacking_multiple_single_head_attention_layers/images/multi head attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.1_Stacking_multiple_single_head_attention_layers/images/multi head attention.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.1_Stacking_multiple_single_head_attention_layers/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class CausalAttention(nn.Module): 6 | def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False): 7 | super().__init__() 8 | self.d_out = d_out 9 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 10 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) 11 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 12 | self.dropout = nn.Dropout(dropout) 13 | self.register_buffer( 14 | "mask", 15 | torch.triu(torch.ones(context_length, context_length), diagonal=1) 16 | ) 17 | 18 | def forward(self, x): 19 | b, num_tokens, d_in = x.shape 20 | queries = self.W_query(x) 21 | keys = self.W_key(x) 22 | values = self.W_value(x) 23 | 24 | attn_scores = queries @ keys.transpose(1, 2) 25 | attn_scores.masked_fill_( 26 | self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) 27 | attn_weights = torch.softmax( 28 | attn_scores / keys.shape[-1]**0.5, dim=-1 29 | ) 30 | attn_weights = self.dropout(attn_weights) 31 | 32 | context_vec = attn_weights @ values 33 | return context_vec 34 | 35 | 36 | class MultiHeadAttentionWrapper(nn.Module): 37 | def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False): 38 | super().__init__() 39 | self.heads = nn.ModuleList( 40 | [CausalAttention( 41 | d_in, d_out, context_length, dropout, qkv_bias 42 | ) 43 | for _ in range(num_heads)] 44 | ) 45 | 46 | def forward(self, x): 47 | for head in self.heads: 48 | print(head(x)) 49 | return torch.cat([head(x) for head in self.heads], dim=-1) 50 | 51 | 52 | inputs = torch.tensor([ 53 | [0.43, 0.15, 0.89], # Your (x^1) 54 | [0.55, 0.87, 0.66], # journey (x^2) 55 | [0.57, 0.85, 0.64], # starts (x^3) 56 | [0.22, 0.58, 0.33], # with (x^4) 57 | [0.77, 0.25, 0.10], # one (x^5) 58 | [0.05, 0.80, 0.55] # step (x^6) 59 | ]) 60 | 61 | 62 | batch = torch.stack((inputs, inputs), dim=0) 63 | print("\nBatch of inputs:\n", batch) 64 | print("\nBatch shape:\n", batch.shape) 65 | 66 | torch.manual_seed(123) 67 | d_in = 3 68 | d_out = 2 69 
| context_length = batch.shape[1] # 6, number of tokens 70 | 71 | mha = MultiHeadAttentionWrapper(d_in, d_out, context_length, 0.0, num_heads=2) 72 | context_vecs = mha(batch) 73 | 74 | print("\nMulti Head Attn - Context Vectors:\n", context_vecs) 75 | print("\nMulti Head Attn - Context Vectors Shape:", context_vecs.shape) 76 | 77 | # Exercise 3.2 Returning two-dimensional embedding vectors 78 | d_out = 1 79 | mha_two_dim = MultiHeadAttentionWrapper(d_in, d_out, context_length, 0.0, num_heads=2) 80 | context_vecs_two_dim = mha_two_dim(batch) 81 | 82 | print("\nMulti Head Attn 2 Dimensional - Context Vectors:\n", context_vecs_two_dim) 83 | print("\nMulti Head Attn 2 Dimensional - Context Vectors Shape:", context_vecs_two_dim.shape) 84 | -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/batched_matrix_multiplication.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | # (b, num_heads, num_tokens, head_dim) = (1, 2, 3, 4) 5 | a = torch.tensor([[[[0.2745, 0.6584, 0.2775, 0.8573], 6 | [0.8993, 0.0390, 0.9268, 0.7388], 7 | [0.7179, 0.7058, 0.9156, 0.4340]], 8 | 9 | [[0.0772, 0.3565, 0.1479, 0.5331], 10 | [0.4066, 0.2318, 0.4545, 0.9737], 11 | [0.4606, 0.5159, 0.4220, 0.5786]]]]) 12 | 13 | 14 | print("\nTransposed Matrix:\n", a.transpose(2, 3)) 15 | print("\nTransposed Matrix Shape:\n", a.transpose(2, 3).shape) # (1, 2, 4, 3) 16 | 17 | print("\nMultiplication Result:\n", a @ a.transpose(2, 3)) 18 | 19 | 20 | first_head = a[0, 0, :, :] 21 | print("\nFirst Head:\n", first_head) 22 | first_res = first_head @ first_head.T 23 | print("\nFirst Head Result:\n", first_res) 24 | 25 | second_head = a[0, 1, :, :] 26 | print("\nSecond Head:\n", second_head) 27 | second_res = second_head @ second_head.T 28 | print("\nSecond Head Result:\n", second_res) -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/diagram.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/gpt-2 param explanation, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/gpt-2 param explanation, pt1.png -------------------------------------------------------------------------------- 
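The batched matrix multiplication demonstrated above is the core trick behind 3.6.2: instead of stacking separate CausalAttention heads, a single large projection is reshaped into (batch, num_heads, num_tokens, head_dim) so that every head's attention scores come out of one batched matmul. A condensed illustrative sketch of that weight-split idea (this is not the repo's main.py; the class name and variable names are my own):

import torch
import torch.nn as nn


class MultiHeadAttentionSketch(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)   # mixes the concatenated head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project once, then split the output dimension into (num_heads, head_dim)
        queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        # Batched matmul over (b, num_heads, num_tokens, head_dim), as in the helper script above
        attn_scores = queries @ keys.transpose(2, 3)
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Recombine the heads: (b, num_heads, num_tokens, head_dim) -> (b, num_tokens, d_out)
        context_vec = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, self.num_heads * self.head_dim)
        return self.out_proj(context_vec)


torch.manual_seed(123)
inputs = torch.tensor([
    [0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64],
    [0.22, 0.58, 0.33], [0.77, 0.25, 0.10], [0.05, 0.80, 0.55]
])
batch = torch.stack((inputs, inputs), dim=0)   # shape (2, 6, 3)
mha = MultiHeadAttentionSketch(d_in=3, d_out=2, context_length=6, dropout=0.0, num_heads=2)
print(mha(batch).shape)   # torch.Size([2, 6, 2])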
/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/gpt-2 param explanation, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/gpt-2 param explanation, pt2.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/gpt-2 param explanation, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/gpt-2 param explanation, pt3.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.1_Coding_an_LLM_Architecture/gpt_config.py: -------------------------------------------------------------------------------- 1 | 2 | GPT_CONFIG_124M = { 3 | "vocab_size": 50257, # Vocabulary size 4 | "context_length": 1024, # Context length 5 | "emb_dim": 768, # Embedding dimension 6 | "n_heads": 12, # Number of attention heads 7 | "n_layers": 12, # Number of layers 8 | "drop_rate": 0.1, # Dropout rate 9 | "qkv_bias": False # Query-Key-Value bias 10 | } 11 | 12 | 13 | # vocab_size: refers to a vocabulary of 50,257 words, as used by the BPE tokenizer (see chapter 2). 14 | 15 | # context_length: denotes the maximum number of input tokens the model can handle via the positional embeddings (see chapter 2). 16 | 17 | # emb_dim: represents the embedding size, transforming each token into a 768- dimensional vector. 18 | 19 | # n_heads: indicates the count of attention heads in the multi-head attention mechanism (see chapter 3). 20 | 21 | # n_layers: specifies the number of transformer blocks in the model, which we will cover in the upcoming discussion. 22 | 23 | # drop_rate: indicates the intensity of the dropout mechanism (0.1 implies a 10% random drop out of hidden units) to prevent overfitting (see chapter 3). 24 | 25 | # qkv_bias: determines whether to include a bias vector in the Linear layers of the multi-head attention for query, key, and value computations. 26 | # We will initially disable this, following the norms of modern LLMs, but we will revisit it in chapter 6 when we load pretrained GPT-2 weights 27 | # from OpenAI into our model (see chapter 6). 
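# Added note (not part of the original file): a quick sanity check of what these numbers imply.
# Each attention head operates on emb_dim / n_heads = 768 / 12 = 64 dimensions, and the two
# embedding tables alone account for roughly 39M of the model's parameters.
head_dim = GPT_CONFIG_124M["emb_dim"] // GPT_CONFIG_124M["n_heads"]               # 64
tok_emb_params = GPT_CONFIG_124M["vocab_size"] * GPT_CONFIG_124M["emb_dim"]       # 50257 * 768 = 38,597,376
pos_emb_params = GPT_CONFIG_124M["context_length"] * GPT_CONFIG_124M["emb_dim"]   # 1024 * 768 = 786,432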
28 | -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.1_Coding_an_LLM_Architecture/images/logits explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.1_Coding_an_LLM_Architecture/images/logits explanation.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.1_Coding_an_LLM_Architecture/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import tiktoken 4 | 5 | 6 | class DummyGPTModel(nn.Module): 7 | def __init__(self, cfg): 8 | super().__init__() 9 | self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"]) 10 | self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"]) 11 | self.drop_emb = nn.Dropout(cfg["drop_rate"]) 12 | 13 | # Use a placeholder for TransformerBlock 14 | self.trf_blocks = nn.Sequential( 15 | *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])] 16 | ) 17 | 18 | # Use a placeholder for LayerNorm 19 | self.final_norm = DummyLayerNorm(cfg["emb_dim"]) 20 | self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False) 21 | 22 | def forward(self, in_idx): 23 | batch_size, seq_len = in_idx.shape 24 | tok_embeds = self.tok_emb(in_idx) 25 | pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device)) 26 | x = tok_embeds + pos_embeds 27 | x = self.drop_emb(x) 28 | x = self.trf_blocks(x) 29 | x = self.final_norm(x) 30 | logits = self.out_head(x) 31 | return logits 32 | 33 | 34 | class DummyTransformerBlock(nn.Module): 35 | def __init__(self, cfg): 36 | super().__init__() 37 | # A simple placeholder 38 | 39 | def forward(self, x): 40 | # This block does nothing and just returns its input. 41 | return x 42 | 43 | 44 | class DummyLayerNorm(nn.Module): 45 | def __init__(self, normalized_shape, eps=1e-5): 46 | super().__init__() 47 | # The parameters here are just to mimic the LayerNorm interface. 48 | 49 | def forward(self, x): 50 | # This layer does nothing and just returns its input. 
51 | return x 52 | 53 | 54 | tokenizer = tiktoken.get_encoding("gpt2") 55 | batch = [] 56 | txt1 = "Every effort moves you" 57 | txt2 = "Every day holds a" 58 | 59 | batch.append(torch.tensor(tokenizer.encode(txt1))) 60 | batch.append(torch.tensor(tokenizer.encode(txt2))) 61 | 62 | batch = torch.stack(batch, dim=0) 63 | print(batch) 64 | 65 | 66 | GPT_CONFIG_124M = { 67 | "vocab_size": 50257, # Vocabulary size 68 | "context_length": 1024, # Context length 69 | "emb_dim": 768, # Embedding dimension 70 | "n_heads": 12, # Number of attention heads 71 | "n_layers": 12, # Number of layers 72 | "drop_rate": 0.1, # Dropout rate 73 | "qkv_bias": False # Query-Key-Value bias 74 | } 75 | 76 | torch.manual_seed(123) 77 | model = DummyGPTModel(GPT_CONFIG_124M) 78 | # print(model) 79 | logits = model(batch) 80 | 81 | print("\nOutput shape:\n", logits.shape) 82 | print("Logits:\n", logits) -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/biased variance broken down.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/biased variance broken down.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/dim parameter, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/dim parameter, pt1.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/dim parameter, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/dim parameter, pt2.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/forward pass & after check, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/forward pass & after check, pt1.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/forward pass & after check, pt2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/forward pass & after check, pt2.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/forward pass & after check, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/forward pass & after check, pt3.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization explained, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization explained, pt1.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization explained, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization explained, pt2.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization explained, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization explained, pt3.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/variance calculation, pt1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/variance calculation, pt1.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/variance calculation, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/variance calculation, pt2.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class LayerNorm(nn.Module): 6 | def __init__(self, emb_dim): 7 | super().__init__() 8 | self.eps = 1e-5 # small constant "epsilon" added to the variance, prevents division by zero during normalization 9 | self.scale = nn.Parameter(torch.ones(emb_dim)) # trainable param, LLM will adjust this during training 10 | self.shift = nn.Parameter(torch.zeros(emb_dim)) # trainable param, LLM will adjust this during training 11 | 12 | def forward(self, x): 13 | mean = x.mean(dim=-1, keepdim=True) 14 | var = x.var(dim=-1, keepdim=True, unbiased=False) 15 | norm_x = (x - mean) / torch.sqrt(var + self.eps) 16 | return self.scale * norm_x + self.shift 17 | 18 | 19 | torch.manual_seed(123) 20 | torch.set_printoptions(sci_mode=False) 21 | 22 | batch_example = torch.randn(2, 5) 23 | print(batch_example) 24 | 25 | ln = LayerNorm(emb_dim=5) 26 | out_ln = ln(batch_example) 27 | 28 | # Verification that the mean = 0 and variance = 1 29 | mean = out_ln.mean(dim=-1, keepdim=True) 30 | var = out_ln.var(dim=-1, unbiased=False, keepdim=True) 31 | 32 | print("Mean:\n", mean) 33 | print("Variance:\n", var) -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/normalization.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | torch.manual_seed(123) 6 | 7 | batch_example = torch.randn(2, 5) 8 | layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU()) 9 | out = layer(batch_example) 10 | 11 | print("\nLayer outputs:\n", out) 12 | 13 | # Mean values for both row 1 and row 2 14 | mean = out.mean(dim=-1, keepdim=True) 15 | 16 | var = out.var(dim=-1, keepdim=True) 17 | print("\nMean:\n", mean) 18 | print("Variance:\n", var) 19 | 20 | print("\n------------------------") 21 | # Layer Normalization 22 | out_norm = (out - mean) / torch.sqrt(var) 23 | mean = out_norm.mean(dim=-1, keepdim=True) 24 | var = out_norm.var(dim=-1, keepdim=True) 25 | print("\nNormalized layer outputs:\n", out_norm) 26 | print("Mean:\n", mean) 27 | print("Variance:\n", var) 28 | 29 | print("\n------------------------") 30 | # Removing scientific notation 31 | torch.set_printoptions(sci_mode=False) 32 | print("Mean:\n", mean) 33 | print("Variance:\n", var) 34 | 35 | 
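# Added note (not part of the original file): out.var(...) above uses PyTorch's default
# unbiased=True, i.e. it divides the sum of squared deviations by n - 1 (Bessel's correction).
# The LayerNorm class in main.py instead passes unbiased=False, dividing by n. With only
# six activations per row the two results differ slightly:
var_unbiased = out.var(dim=-1, keepdim=True)                    # divides by n - 1
var_biased = out.var(dim=-1, keepdim=True, unbiased=False)      # divides by n
print("Unbiased vs biased variance:\n", torch.cat([var_unbiased, var_biased], dim=-1))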
-------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/gelu.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import matplotlib.pyplot as plt 4 | 5 | 6 | class GELU(nn.Module): 7 | def __init__(self): 8 | super().__init__() 9 | 10 | def forward(self, x): 11 | return 0.5 * x * (1 + torch.tanh( 12 | torch.sqrt(torch.tensor(2.0 / torch.pi)) * 13 | (x + 0.044715 * torch.pow(x, 3)) 14 | )) 15 | 16 | 17 | gelu = GELU() 18 | relu = nn.ReLU() 19 | 20 | x = torch.linspace(-3, 3, 100) # creates 100 sample data points in the range -3 to 3 21 | y_gelu = gelu(x) 22 | y_relu = relu(x) 23 | 24 | plt.figure(figsize=(8, 3)) 25 | 26 | for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1): 27 | plt.subplot(1, 2, i) 28 | plt.plot(x, y) 29 | # plt.plot(x, y, marker='o', linestyle='-', markersize=3) # Add marker='o' and markersize=3 30 | plt.title(f"{label} activation function") 31 | plt.xlabel("x") 32 | plt.ylabel(f"{label}(x)") 33 | plt.grid(True) 34 | 35 | plt.tight_layout() 36 | plt.show() -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/images/fnn diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/images/fnn diagram.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/images/gelu and relu plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/images/gelu and relu plot.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/images/input into feedforward neural net (fnn).png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/images/input into feedforward neural net (fnn).png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class FeedForward(nn.Module): 6 | def __init__(self, cfg): 7 | super().__init__() 8 | self.layers = nn.Sequential( 9 | nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]), 10 | GELU(), 11 | nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]) 12 | ) 13 | 14 | def 
forward(self, x): 15 | return self.layers(x) 16 | 17 | 18 | class GELU(nn.Module): 19 | def __init__(self): 20 | super().__init__() 21 | 22 | def forward(self, x): 23 | return 0.5 * x * (1 + torch.tanh( 24 | torch.sqrt(torch.tensor(2.0 / torch.pi)) * 25 | (x + 0.044715 * torch.pow(x, 3)) 26 | )) 27 | 28 | 29 | GPT_CONFIG_124M = { 30 | "vocab_size": 50257, # Vocabulary size 31 | "context_length": 1024, # Context length 32 | "emb_dim": 768, # Embedding dimension 33 | "n_heads": 12, # Number of attention heads 34 | "n_layers": 12, # Number of layers 35 | "drop_rate": 0.1, # Dropout rate 36 | "qkv_bias": False # Query-Key-Value bias 37 | } 38 | 39 | ffn = FeedForward(GPT_CONFIG_124M) 40 | print("\nFNN Architecture:\n", ffn) 41 | 42 | x = torch.rand(2, 3, 768) # sample input with batch dimension 2 43 | out = ffn(x) 44 | 45 | print("\nFNN 1st Sample:\n", out[0]) 46 | print("\n1st Sample Shape:\n", out[0].shape) -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.4_Adding_shortcut_connections/images/shortcut connections.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.4_Adding_shortcut_connections/images/shortcut connections.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.4_Adding_shortcut_connections/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class ExampleDeepNeuralNetwork(nn.Module): 6 | def __init__(self, layer_sizes, use_shortcut): 7 | super().__init__() 8 | self.use_shortcut = use_shortcut 9 | self.layers = nn.ModuleList([ 10 | nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()), 11 | nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()), 12 | nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()), 13 | nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()), 14 | nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU()), 15 | ]) 16 | 17 | def forward(self, x): 18 | for layer in self.layers: 19 | layer_output = layer(x) # compute the output of the current layer 20 | if self.use_shortcut and x.shape == layer_output.shape: # check if shortcut can be applied 21 | x = x + layer_output 22 | else: 23 | x = layer_output 24 | return x 25 | 26 | 27 | class GELU(nn.Module): 28 | def __init__(self): 29 | super().__init__() 30 | 31 | def forward(self, x): 32 | return 0.5 * x * (1 + torch.tanh( 33 | torch.sqrt(torch.tensor(2.0 / torch.pi)) * 34 | (x + 0.044715 * torch.pow(x, 3)) 35 | )) 36 | 37 | torch.manual_seed(123) 38 | layer_sizes = [3, 3, 3, 3, 3, 1] 39 | sample_input = torch.tensor([[1., 0., -1.]]) 40 | 41 | # model without shortcut 42 | model_without_shortcut = ExampleDeepNeuralNetwork( 43 | layer_sizes, use_shortcut=False 44 | ) 45 | 46 | def print_gradients(model, x): 47 | output = model(x) # forward pass 48 | target = torch.tensor([[0.]]) 49 | 50 | loss = nn.MSELoss() # calculate loss based on how close the target and output are 51 | loss = loss(output, target) 52 | 53 | loss.backward() # backward pass to calculate gradients 54 | 55 | for name, param in model.named_parameters(): 56 | if "weight" in name: 57 | print(f"{name} has gradient mean of 
{param.grad.abs().mean().item()}") 58 | 59 | print("Model without shortcut:") 60 | print_gradients(model_without_shortcut, sample_input) # vanishing gradient problem occurs here 61 | 62 | # model with skip / shortcut connections 63 | model_with_shortcut = ExampleDeepNeuralNetwork( 64 | layer_sizes, use_shortcut=True 65 | ) 66 | 67 | print("\nModel with shortcut:") 68 | print_gradients(model_with_shortcut, sample_input) -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.5_Connecting_attention_and_linear_layers_in_a_transformer_block/images/transformer block.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.5_Connecting_attention_and_linear_layers_in_a_transformer_block/images/transformer block.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.6_Coding_the_GPT_Model/images/gpt2 architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.6_Coding_the_GPT_Model/images/gpt2 architecture.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.7_Generating_text/images/iterations of a token prediction cycle.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.7_Generating_text/images/iterations of a token prediction cycle.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.7_Generating_text/images/mechanics of text generation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.7_Generating_text/images/mechanics of text generation.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.7_Generating_text/images/step by step text generation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.7_Generating_text/images/step by step text generation.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.1_Using_GPT_to_generate_text/images/chapter topics.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.1_Using_GPT_to_generate_text/images/chapter topics.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.1_Using_GPT_to_generate_text/images/gpt build stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.1_Using_GPT_to_generate_text/images/gpt build stages.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.1_Using_GPT_to_generate_text/images/tokenizer placement in flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.1_Using_GPT_to_generate_text/images/tokenizer placement in flow.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/loss calculation steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/loss calculation steps.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/next tokens.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/next tokens.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/perplexity score explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/perplexity score explanation.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/text generation process.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/text generation process.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.3_Calculating_the_training_and_validation_set_losses/images/dataloaders.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.3_Calculating_the_training_and_validation_set_losses/images/dataloaders.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.3_Calculating_the_training_and_validation_set_losses/loading_dataset.py: -------------------------------------------------------------------------------- 1 | from torch.utils.data import Dataset, DataLoader 2 | import tiktoken 3 | import torch 4 | 5 | 6 | class GPTDataSetV1(Dataset): 7 | def __init__(self, txt, tokenizer, max_length, stride): 8 | self.input_ids = [] 9 | self.target_ids = [] 10 | 11 | # Tokenize the entire text 12 | token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"}) 13 | 14 | # Use a sliding window to chunk the book into overlapping sequences of max_length 15 | for i in range(0, len(token_ids) - max_length, stride): 16 | input_chunk = token_ids[i:i + max_length] 17 | target_chunk = token_ids[i + 1: i + max_length + 1] 18 | self.input_ids.append(torch.tensor(input_chunk)) 19 | self.target_ids.append(torch.tensor(target_chunk)) 20 | 21 | def __len__(self): 22 | return len(self.input_ids) 23 | 24 | def __getitem__(self, idx): 25 | return self.input_ids[idx], self.target_ids[idx] 26 | 27 | 28 | def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0): 29 | # Initialize the tokenizer 30 | tokenizer = tiktoken.get_encoding("gpt2") 31 | 32 | # Create dataset 33 | dataset = GPTDataSetV1(txt, tokenizer, max_length, stride) 34 | 35 | # Create dataloader 36 | dataloader = DataLoader( 37 | dataset, 38 | batch_size=batch_size, 39 | shuffle=shuffle, 40 | drop_last=drop_last, 41 | num_workers=num_workers 42 | ) 43 | 44 | return dataloader 45 | 46 | 47 | torch.manual_seed(123) 48 | 49 | file_path = "the-verdict.txt" 50 | with open(file_path, "r", encoding="utf-8") as file: 51 | text_data = file.read() 52 | 53 | 54 | tokenizer = tiktoken.get_encoding("gpt2") 55 | 56 | total_characters = len(text_data) 57 | total_tokens = len(tokenizer.encode(text_data)) 58 | 59 | print("Characters:", total_characters) 60 | print("Tokens:", total_tokens) 61 | 62 | # Data splitting into training and validation datasets 63 | train_ratio = 0.90 64 | split_idx = int(train_ratio * len(text_data)) 65 | 66 | train_data = text_data[:split_idx] 67 | val_data = text_data[split_idx:] 68 | 69 | 70 | # known as "GPT-2 small" 71 | GPT_CONFIG_124M = { 72 | "vocab_size": 50257, # Vocabulary size 73 | "context_length": 256, # Context length 74 | "emb_dim": 768, # Embedding dimension 75 | "n_heads": 12, # Number of attention heads 76 | "n_layers": 12, # Number of layers 77 | "drop_rate": 0.1, # Dropout rate 78 | "qkv_bias": False # 
Query-Key-Value bias 79 | } 80 | 81 | 82 | train_loader = create_dataloader_v1( 83 | train_data, 84 | batch_size=2, 85 | max_length=GPT_CONFIG_124M["context_length"], 86 | stride=GPT_CONFIG_124M["context_length"], 87 | drop_last=True, 88 | shuffle=True, 89 | num_workers=0 90 | ) 91 | 92 | val_loader = create_dataloader_v1( 93 | val_data, 94 | batch_size=2, 95 | max_length=GPT_CONFIG_124M["context_length"], 96 | stride=GPT_CONFIG_124M["context_length"], 97 | drop_last=False, 98 | shuffle=False, 99 | num_workers=0 100 | ) 101 | 102 | print("\nTrain loader:") 103 | for x, y in train_loader: 104 | print(x.shape, y.shape) 105 | 106 | print("\nValidation loader:") 107 | for x, y in val_loader: 108 | print(x.shape, y.shape) -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/loss-plot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/loss-plot.pdf -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/plot explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/plot explanation.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/training loop.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/training loop.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/training process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/training process.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.1_Temperature_Scaling/images/temperature explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.1_Temperature_Scaling/images/temperature explanation.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.1_Temperature_Scaling/images/temperature-plot.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.1_Temperature_Scaling/images/temperature-plot.pdf -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.2_Top_k_sampling/images/top k steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.2_Top_k_sampling/images/top k steps.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.2_Top_k_sampling/top_k.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | # Assume the LLM is given the start context "every effort moves you" and 5 | # generates the following next-token logits: 6 | next_token_logits = torch.tensor( 7 | [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79] 8 | ) 9 | 10 | top_k = 3 11 | top_logits, top_pos = torch.topk(next_token_logits, top_k) 12 | print("Top logits:", top_logits) 13 | print("Top positions:", top_pos) 14 | 15 | new_logits = torch.where( 16 | condition=next_token_logits < top_logits[-1], # identifies logits less than the minimum in the top 3 17 | input=torch.tensor(float("-inf")), # assigns -inf to these lower logits 18 | other=next_token_logits # retains the original logits for all other tokens 19 | ) 20 | 21 | print(new_logits) 22 | 23 | # An alternative, slightly more efficient implementation of the previous code 24 | new_logits_alt = torch.full_like( # create tensor containing -inf values 25 | next_token_logits, -torch.inf 26 | ) 27 | new_logits_alt[top_pos] = next_token_logits[top_pos] # copy top k values into the -inf tensor 28 | 29 | print(new_logits_alt) 30 | 31 | # ----- 32 | topk_probas = torch.softmax(new_logits, dim=0) 33 | print(topk_probas) 34 | 35 | -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt1.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt2.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt3.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt3.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt4.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.5_Loading_pretrained_weights_from_OpenAI/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.5_Loading_pretrained_weights_from_OpenAI/gpt_setup/__init__.py -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.5_Loading_pretrained_weights_from_OpenAI/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | 
gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | gpt.trf_blocks[b].norm1.shift = assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.5_Loading_pretrained_weights_from_OpenAI/images/gpt architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.5_Loading_pretrained_weights_from_OpenAI/images/gpt architecture.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from download import data_file_path 3 | 4 | 5 | df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"]) 6 | 7 | # print(df) 8 | # print(df["Label"].value_counts()) 9 | 10 | def create_balanced_dataset(df): 11 | 12 | # Count the instances of "spam" 13 | num_spam = df[df["Label"] == "spam"].shape[0] 14 | 15 | # Randomly sample "ham" instances to match the number of "spam" instances 16 | ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123) 17 | 18 | # Combine ham "subset" with "spam" 19 | balanced_df = pd.concat([ham_subset, df[df["Label"] == "spam"]]) 20 | 21 | return balanced_df 22 | 23 | 24 | # Split dataset into 3 parts. These ratios are common in machine learning to train, adjust, and evaluate models. 
25 | # Training = 70% 26 | # Validation = 10% 27 | # Testing = 20% 28 | def random_split(df, train_frac, validation_frac): 29 | 30 | # Shuffle the entire DataFrame 31 | df = df.sample(frac=1, random_state=123).reset_index(drop=True) 32 | 33 | # Calculate the split indices 34 | train_end = int(len(df) * train_frac) 35 | validation_end = train_end + int(len(df) * validation_frac) 36 | 37 | # Split the DataFrame 38 | train_df = df[:train_end] 39 | validation_df = df[train_end:validation_end] 40 | test_df = df[validation_end:] 41 | 42 | return train_df, validation_df, test_df 43 | 44 | 45 | balanced_df = create_balanced_dataset(df) 46 | # print(balanced_df["Label"].value_counts()) 47 | # print(balanced_df.shape[0]) 48 | 49 | # Change the string class labels "ham" and "spam" into integer class labels 0 and 1 50 | balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1}) 51 | # print(balanced_df) 52 | 53 | train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1) 54 | # Test size is implied to be 0.2 as the remainder 55 | 56 | print(train_df.shape[0]) 57 | print(validation_df.shape[0]) 58 | print(test_df.shape[0]) 59 | 60 | # Save the dataset as CSV (comma-separated values) files so we can reuse it later 61 | train_df.to_csv("train.csv", index=None) 62 | validation_df.to_csv("validation.csv", index=None) 63 | test_df.to_csv("test.csv", index=None) -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/download.py: -------------------------------------------------------------------------------- 1 | import urllib.request 2 | import zipfile 3 | import os 4 | from pathlib import Path 5 | 6 | 7 | url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip" 8 | zip_path = "sms_spam_collection.zip" 9 | extracted_path = "sms_spam_collection" 10 | data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv" 11 | 12 | 13 | def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path): 14 | 15 | if data_file_path.exists(): 16 | print(f"{data_file_path} already exists. 
Skipping download and extraction.") 17 | return 18 | 19 | # Downloading the file 20 | with urllib.request.urlopen(url) as response: 21 | with open(zip_path, "wb") as out_file: 22 | out_file.write(response.read()) 23 | 24 | # Unzipping the file 25 | with zipfile.ZipFile(zip_path, "r") as zip_ref: 26 | zip_ref.extractall(extracted_path) 27 | 28 | # Add .tsv file extension 29 | original_file_path = Path(extracted_path) / "SMSSpamCollection" 30 | os.rename(original_file_path, data_file_path) 31 | print(f"File downloaded and saved as {data_file_path}") 32 | 33 | download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path) -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/images/classification fine tuning stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/images/classification fine tuning stages.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/sms_spam_collection.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/sms_spam_collection.zip -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/sms_spam_collection/readme: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/sms_spam_collection/readme -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.3_Creating_data_loaders/images/input text prep process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.3_Creating_data_loaders/images/input text prep process.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.3_Creating_data_loaders/images/single training batch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.3_Creating_data_loaders/images/single training batch.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.3_Creating_data_loaders/padding_token.py: -------------------------------------------------------------------------------- 1 | import tiktoken 2 | 3 | 4 | tokenizer = tiktoken.get_encoding("gpt2") 5 | 6 | print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})) 7 | # 50256 -------------------------------------------------------------------------------- 
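Aside (illustrative sketch, not a file in the repository): padding_token.py above shows that the GPT-2 tokenizer maps "<|endoftext|>" to token ID 50256, which the SpamDataset classes in the following sections reuse as their pad_token_id. A minimal sketch of the same truncate-then-pad step, assuming tiktoken is installed and using a hypothetical max_length of 8:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
pad_token_id = 50256   # token ID of "<|endoftext|>", reused here as the padding token
max_length = 8         # hypothetical fixed length, for illustration only

ids = tokenizer.encode("You have won a free prize")    # variable-length token IDs
ids = ids[:max_length]                                 # truncate if longer than max_length
ids = ids + [pad_token_id] * (max_length - len(ids))   # pad shorter sequences up to max_length
print(ids)   # every sequence now has exactly max_length token IDs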
/06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/data_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/data_setup/spam_dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import pandas as pd 4 | import tiktoken 5 | 6 | 7 | class SpamDataset(Dataset): 8 | def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256): 9 | self.data = pd.read_csv(csv_file) 10 | 11 | # Pre-tokenize texts 12 | self.encoded_texts = [ 13 | tokenizer.encode(text) for text in self.data["Text"] 14 | ] 15 | 16 | if max_length is None: 17 | self.max_length = self._longest_encoded_length() 18 | else: 19 | self.max_length = max_length 20 | # Truncate sequences if they are longer than max_length 21 | self.encoded_texts = [ 22 | encoded_text[:self.max_length] 23 | for encoded_text in self.encoded_texts 24 | ] 25 | 26 | # Pad sequences to the longest sequence 27 | self.encoded_texts = [ 28 | encoded_text + [pad_token_id] * (self.max_length - len(encoded_text)) 29 | for encoded_text in self.encoded_texts 30 | ] 31 | 32 | def __getitem__(self, index): 33 | encoded = self.encoded_texts[index] 34 | label = self.data.iloc[index]["Label"] 35 | return ( 36 | torch.tensor(encoded, dtype=torch.long), 37 | torch.tensor(label, dtype=torch.long) 38 | ) 39 | 40 | def __len__(self): 41 | return len(self.data) 42 | 43 | def _longest_encoded_length(self): 44 | max_length = 0 45 | for encoded_text in self.encoded_texts: 46 | encoded_length = len(encoded_text) 47 | if encoded_length > max_length: 48 | max_length = encoded_length 49 | return max_length 50 | 51 | 52 | tokenizer = tiktoken.get_encoding("gpt2") 53 | 54 | train_dataset = SpamDataset( 55 | csv_file="data_setup/train.csv", 56 | max_length=None, 57 | tokenizer=tokenizer 58 | ) 59 | 60 | val_dataset = SpamDataset( 61 | csv_file="data_setup/validation.csv", 62 | max_length=train_dataset.max_length, 63 | tokenizer=tokenizer 64 | ) 65 | 66 | test_dataset = SpamDataset( 67 | csv_file="data_setup/test.csv", 68 | max_length=train_dataset.max_length, 69 | tokenizer=tokenizer 70 | ) 71 | 72 | # Setting up data loaders 73 | num_workers = 0 74 | batch_size = 8 75 | 76 | torch.manual_seed(123) 77 | 78 | train_loader = DataLoader( 79 | dataset=train_dataset, 80 | batch_size=batch_size, 81 | shuffle=True, 82 | num_workers=num_workers, 83 | drop_last=True 84 | ) 85 | 86 | val_loader = DataLoader( 87 | dataset=val_dataset, 88 | batch_size=batch_size, 89 | num_workers=num_workers, 90 | drop_last=False 91 | ) 92 | 93 | test_loader = DataLoader( 94 | dataset=test_dataset, 95 | batch_size=batch_size, 96 | num_workers=num_workers, 97 | drop_last=False 98 | ) 99 | -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/gpt_setup/__init__.py: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/gpt_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/images/stages.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/data_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/data_setup/spam_dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import pandas as pd 4 | import tiktoken 5 | 6 | 7 | class SpamDataset(Dataset): 8 | def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256): 9 | self.data = pd.read_csv(csv_file) 10 | 11 | # Pre-tokenize texts 12 | self.encoded_texts = [ 13 | tokenizer.encode(text) for text in self.data["Text"] 14 | ] 15 | 16 | if max_length is None: 17 | self.max_length = self._longest_encoded_length() 18 | else: 19 | self.max_length = max_length 20 | # Truncate sequences if they are longer than max_length 21 | self.encoded_texts = [ 22 | encoded_text[:self.max_length] 23 | for encoded_text in self.encoded_texts 24 | ] 25 | 26 | # Pad sequences to the longest sequence 27 | self.encoded_texts = [ 28 | encoded_text + [pad_token_id] * (self.max_length - len(encoded_text)) 29 | for encoded_text in self.encoded_texts 30 | ] 31 | 32 | def __getitem__(self, index): 33 | encoded = self.encoded_texts[index] 34 | label = self.data.iloc[index]["Label"] 35 | return ( 36 | torch.tensor(encoded, dtype=torch.long), 37 | torch.tensor(label, dtype=torch.long) 38 | ) 39 | 40 | def __len__(self): 41 | return len(self.data) 42 | 43 | def _longest_encoded_length(self): 44 | max_length = 0 45 | for encoded_text in self.encoded_texts: 46 | encoded_length = len(encoded_text) 47 | if encoded_length > max_length: 48 | max_length = encoded_length 49 | return max_length 50 | 51 | 52 | tokenizer = tiktoken.get_encoding("gpt2") 53 | 54 | train_dataset = SpamDataset( 55 | csv_file="data_setup/train.csv", 56 | max_length=None, 57 | tokenizer=tokenizer 58 | ) 59 | 60 | val_dataset = SpamDataset( 61 | csv_file="data_setup/validation.csv", 62 | max_length=train_dataset.max_length, 63 | tokenizer=tokenizer 64 | ) 65 | 66 | test_dataset = SpamDataset( 67 | csv_file="data_setup/test.csv", 68 | max_length=train_dataset.max_length, 69 | tokenizer=tokenizer 70 | ) 71 | 72 | # Setting up data loaders 73 | num_workers = 0 74 | batch_size = 8 75 | 76 | torch.manual_seed(123) 77 | 78 | train_loader = DataLoader( 79 | 
dataset=train_dataset, 80 | batch_size=batch_size, 81 | shuffle=True, 82 | num_workers=num_workers, 83 | drop_last=True 84 | ) 85 | 86 | val_loader = DataLoader( 87 | dataset=val_dataset, 88 | batch_size=batch_size, 89 | num_workers=num_workers, 90 | drop_last=False 91 | ) 92 | 93 | test_loader = DataLoader( 94 | dataset=test_dataset, 95 | batch_size=batch_size, 96 | num_workers=num_workers, 97 | drop_last=False 98 | ) 99 | -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/gpt_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | gpt.trf_blocks[b].norm1.shift = assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | 
params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/architecture adapation for binary classification.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/architecture adapation for binary classification.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/final layernorm and trf block set to trainable.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/final layernorm and trf block set to trainable.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/fine-tuning selected layers vs all layers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/fine-tuning selected layers vs all layers.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/last row of output tensor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/last row of output tensor.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/last token contains attention score to all other tokens .png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/last token contains attention score to all other tokens .png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/layer training summary, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/layer training summary, pt1.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/layer training summary, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/layer training summary, pt2.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/layer training summary, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/layer training summary, pt3.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/modifying output layer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/modifying output layer.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/data_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/data_setup/spam_dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import pandas as pd 4 | import tiktoken 5 | 6 | 7 | class SpamDataset(Dataset): 8 | def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256): 9 | self.data = pd.read_csv(csv_file) 10 | 11 | # Pre-tokenize texts 12 | self.encoded_texts = [ 13 | tokenizer.encode(text) for text in self.data["Text"] 14 | ] 15 | 16 | if max_length is None: 17 | self.max_length = 
self._longest_encoded_length() 18 | else: 19 | self.max_length = max_length 20 | # Truncate sequences if they are longer than max_length 21 | self.encoded_texts = [ 22 | encoded_text[:self.max_length] 23 | for encoded_text in self.encoded_texts 24 | ] 25 | 26 | # Pad sequences to the longest sequence 27 | self.encoded_texts = [ 28 | encoded_text + [pad_token_id] * (self.max_length - len(encoded_text)) 29 | for encoded_text in self.encoded_texts 30 | ] 31 | 32 | def __getitem__(self, index): 33 | encoded = self.encoded_texts[index] 34 | label = self.data.iloc[index]["Label"] 35 | return ( 36 | torch.tensor(encoded, dtype=torch.long), 37 | torch.tensor(label, dtype=torch.long) 38 | ) 39 | 40 | def __len__(self): 41 | return len(self.data) 42 | 43 | def _longest_encoded_length(self): 44 | max_length = 0 45 | for encoded_text in self.encoded_texts: 46 | encoded_length = len(encoded_text) 47 | if encoded_length > max_length: 48 | max_length = encoded_length 49 | return max_length 50 | 51 | 52 | tokenizer = tiktoken.get_encoding("gpt2") 53 | 54 | train_dataset = SpamDataset( 55 | csv_file="data_setup/train.csv", 56 | max_length=None, 57 | tokenizer=tokenizer 58 | ) 59 | 60 | val_dataset = SpamDataset( 61 | csv_file="data_setup/validation.csv", 62 | max_length=train_dataset.max_length, 63 | tokenizer=tokenizer 64 | ) 65 | 66 | test_dataset = SpamDataset( 67 | csv_file="data_setup/test.csv", 68 | max_length=train_dataset.max_length, 69 | tokenizer=tokenizer 70 | ) 71 | 72 | # Setting up data loaders 73 | num_workers = 0 74 | batch_size = 8 75 | 76 | torch.manual_seed(123) 77 | 78 | train_loader = DataLoader( 79 | dataset=train_dataset, 80 | batch_size=batch_size, 81 | shuffle=True, 82 | num_workers=num_workers, 83 | drop_last=True 84 | ) 85 | 86 | val_loader = DataLoader( 87 | dataset=val_dataset, 88 | batch_size=batch_size, 89 | num_workers=num_workers, 90 | drop_last=False 91 | ) 92 | 93 | test_loader = DataLoader( 94 | dataset=test_dataset, 95 | batch_size=batch_size, 96 | num_workers=num_workers, 97 | drop_last=False 98 | ) 99 | -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/gpt_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/images/model outputs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/images/model outputs.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/images/stages.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/images/stages.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/data_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/data_setup/spam_dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import pandas as pd 4 | import tiktoken 5 | 6 | 7 | class SpamDataset(Dataset): 8 | def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256): 9 | self.data = pd.read_csv(csv_file) 10 | 11 | # Pre-tokenize texts 12 | self.encoded_texts = [ 13 | tokenizer.encode(text) for text in self.data["Text"] 14 | ] 15 | 16 | if max_length is None: 17 | self.max_length = self._longest_encoded_length() 18 | else: 19 | self.max_length = max_length 20 | # Truncate sequences if they are longer than max_length 21 | self.encoded_texts = [ 22 | encoded_text[:self.max_length] 23 | for encoded_text in self.encoded_texts 24 | ] 25 | 26 | # Pad sequences to the longest sequence 27 | self.encoded_texts = [ 28 | encoded_text + [pad_token_id] * (self.max_length - len(encoded_text)) 29 | for encoded_text in self.encoded_texts 30 | ] 31 | 32 | def __getitem__(self, index): 33 | encoded = self.encoded_texts[index] 34 | label = self.data.iloc[index]["Label"] 35 | return ( 36 | torch.tensor(encoded, dtype=torch.long), 37 | torch.tensor(label, dtype=torch.long) 38 | ) 39 | 40 | def __len__(self): 41 | return len(self.data) 42 | 43 | def _longest_encoded_length(self): 44 | max_length = 0 45 | for encoded_text in self.encoded_texts: 46 | encoded_length = len(encoded_text) 47 | if encoded_length > max_length: 48 | max_length = encoded_length 49 | return max_length 50 | 51 | 52 | tokenizer = tiktoken.get_encoding("gpt2") 53 | 54 | train_dataset = SpamDataset( 55 | csv_file="data_setup/train.csv", 56 | max_length=None, 57 | tokenizer=tokenizer 58 | ) 59 | 60 | val_dataset = SpamDataset( 61 | csv_file="data_setup/validation.csv", 62 | max_length=train_dataset.max_length, 63 | tokenizer=tokenizer 64 | ) 65 | 66 | test_dataset = SpamDataset( 67 | csv_file="data_setup/test.csv", 68 | max_length=train_dataset.max_length, 69 | tokenizer=tokenizer 70 | ) 71 | 72 | # Setting up data loaders 73 | num_workers = 0 74 | batch_size = 8 75 | 76 | torch.manual_seed(123) 77 | 78 | train_loader = DataLoader( 79 | dataset=train_dataset, 80 | batch_size=batch_size, 81 | shuffle=True, 82 | num_workers=num_workers, 83 | drop_last=True 84 | ) 85 | 86 | val_loader = DataLoader( 87 | dataset=val_dataset, 88 | batch_size=batch_size, 89 | num_workers=num_workers, 90 | drop_last=False 91 | ) 92 | 93 | test_loader = DataLoader( 94 | dataset=test_dataset, 95 | batch_size=batch_size, 96 | num_workers=num_workers, 97 | drop_last=False 98 | ) 99 | 
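# --- Illustrative aside, not part of the original spam_dataset.py ---
# A quick sanity check of the loaders defined above (assumes the train/validation/test
# CSV files produced in section 6.2 are available under data_setup/): fetch a single
# batch and inspect its dimensions.
for input_batch, target_batch in train_loader:
    print(input_batch.shape, target_batch.shape)   # e.g. torch.Size([8, max_length]) and torch.Size([8])
    break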
-------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/gpt_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | gpt.trf_blocks[b].norm1.shift = assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = 
assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/accuracy-plot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/accuracy-plot.pdf -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/choosing the number of epochs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/choosing the number of epochs.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/loss-plot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/loss-plot.pdf -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt1.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt2.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset 
explanation, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt3.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt4.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt5.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt6.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt7.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/training & validation accuracy plot explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/training & validation accuracy plot explanation.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/training & validation loss plot explanation.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/training & validation loss plot explanation.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/training loop.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/training loop.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/data_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/data_setup/spam_dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import pandas as pd 4 | import tiktoken 5 | 6 | 7 | class SpamDataset(Dataset): 8 | def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256): 9 | self.data = pd.read_csv(csv_file) 10 | 11 | # Pre-tokenize texts 12 | self.encoded_texts = [ 13 | tokenizer.encode(text) for text in self.data["Text"] 14 | ] 15 | 16 | if max_length is None: 17 | self.max_length = self._longest_encoded_length() 18 | else: 19 | self.max_length = max_length 20 | # Truncate sequences if they are longer than max_length 21 | self.encoded_texts = [ 22 | encoded_text[:self.max_length] 23 | for encoded_text in self.encoded_texts 24 | ] 25 | 26 | # Pad sequences to the longest sequence 27 | self.encoded_texts = [ 28 | encoded_text + [pad_token_id] * (self.max_length - len(encoded_text)) 29 | for encoded_text in self.encoded_texts 30 | ] 31 | 32 | def __getitem__(self, index): 33 | encoded = self.encoded_texts[index] 34 | label = self.data.iloc[index]["Label"] 35 | return ( 36 | torch.tensor(encoded, dtype=torch.long), 37 | torch.tensor(label, dtype=torch.long) 38 | ) 39 | 40 | def __len__(self): 41 | return len(self.data) 42 | 43 | def _longest_encoded_length(self): 44 | max_length = 0 45 | for encoded_text in self.encoded_texts: 46 | encoded_length = len(encoded_text) 47 | if encoded_length > max_length: 48 | max_length = encoded_length 49 | return max_length 50 | 51 | 52 | tokenizer = tiktoken.get_encoding("gpt2") 53 | 54 | train_dataset = SpamDataset( 55 | csv_file="data_setup/train.csv", 56 | max_length=None, 57 | tokenizer=tokenizer 58 | ) 59 | 60 | val_dataset = SpamDataset( 61 | csv_file="data_setup/validation.csv", 62 | max_length=train_dataset.max_length, 63 | tokenizer=tokenizer 64 | ) 65 | 66 | test_dataset = SpamDataset( 67 | csv_file="data_setup/test.csv", 68 | max_length=train_dataset.max_length, 69 | tokenizer=tokenizer 70 | ) 71 | 72 | # Setting up data loaders 73 | num_workers = 0 74 | batch_size = 8 75 | 76 | 
torch.manual_seed(123) 77 | 78 | train_loader = DataLoader( 79 | dataset=train_dataset, 80 | batch_size=batch_size, 81 | shuffle=True, 82 | num_workers=num_workers, 83 | drop_last=True 84 | ) 85 | 86 | val_loader = DataLoader( 87 | dataset=val_dataset, 88 | batch_size=batch_size, 89 | num_workers=num_workers, 90 | drop_last=False 91 | ) 92 | 93 | test_loader = DataLoader( 94 | dataset=test_dataset, 95 | batch_size=batch_size, 96 | num_workers=num_workers, 97 | drop_last=False 98 | ) 99 | -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/gpt_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | 
gpt.trf_blocks[b].norm1.shift = assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt1.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt10.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt11.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt12.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt12.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt13.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt13.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt2.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt3.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt4.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt5.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt6.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt7.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt8.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt8.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt9.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/images/classification fine tuning.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/images/classification fine tuning.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/images/fine tuning approach.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/images/fine tuning approach.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/images/instruction fine tuning.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/images/instruction fine tuning.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.1_Introduction_to_instruction_fine_tuning/images/desired goal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.1_Introduction_to_instruction_fine_tuning/images/desired goal.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.1_Introduction_to_instruction_fine_tuning/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.1_Introduction_to_instruction_fine_tuning/images/stages.png -------------------------------------------------------------------------------- 
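A minimal sketch of the truncate-then-pad behavior implemented in 6.8_Using_the_LLM_as_a_spam_classifier/data_setup/spam_dataset.py above, using invented token IDs; 50256 is the GPT-2 <|endoftext|> ID that the dataset class passes in as pad_token_id:

# Toy token-ID lists standing in for tokenizer.encode(text) results (values invented).
pad_token_id = 50256
max_length = 5

encoded_texts = [
    [101, 102, 103],                 # shorter than max_length -> gets padded
    [201, 202, 203, 204, 205, 206],  # longer than max_length  -> gets truncated
]

# Truncate sequences that exceed max_length, as SpamDataset does when max_length is given.
encoded_texts = [t[:max_length] for t in encoded_texts]

# Pad every sequence up to max_length with the padding token.
encoded_texts = [t + [pad_token_id] * (max_length - len(t)) for t in encoded_texts]

print(encoded_texts)
# [[101, 102, 103, 50256, 50256], [201, 202, 203, 204, 205]]

Bringing every example to one shared length is what lets the DataLoaders defined in the same file stack each batch into a single tensor.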
/07_Fine_tuning_for_instructions/7.2_Preparing_a_dataset_for_supervised_instruction_fine_tuning/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | 4 | 5 | def load_file(file_path): 6 | with open(file_path, "r", encoding="utf-8") as file: 7 | data = json.load(file) 8 | 9 | return data 10 | 11 | 12 | def format_input(entry): 13 | # Alpaca-style prompt formatting 14 | instruction_text = ( 15 | f"Below is an instruction that describes a task. " 16 | f"Write a response that appropriately completes the request." 17 | f"\n\n### Instruction:\n{entry['instruction']}" 18 | ) 19 | 20 | input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else "" 21 | 22 | return instruction_text + input_text 23 | 24 | 25 | data = load_file("instruction-data.json") 26 | 27 | assert len(data) == 1100, "Instruction dataset is not of the correct length, please reload the data." 28 | 29 | # print("Number of entries:", len(data)) 30 | # print("Example entry:\n", data[50]) 31 | # print("Example entry:\n", data[999]) 32 | 33 | # Formatting input 34 | model_input = format_input(data[50]) 35 | 36 | desired_response = f"\n\n### Response:\n{data[50]['output']}" 37 | 38 | print(model_input + desired_response) 39 | 40 | # Divide the dataset into a training, validation, and test set 41 | train_portion = int(len(data) * 0.85) # 85% for training 42 | test_portion = int(len(data) * 0.1) # 10% for testing 43 | val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation 44 | 45 | print(train_portion) 46 | print(test_portion) 47 | print(val_portion) 48 | 49 | train_data = data[:train_portion] 50 | test_data = data[train_portion:train_portion + test_portion] 51 | val_data = data[train_portion + test_portion:] 52 | 53 | print("\nTraining set length:", len(train_data)) 54 | print("Validation set length:", len(val_data)) 55 | print("Test set length:", len(test_data)) 56 | 57 | # TODO: Exercise 7.1: Changing prompt styles 58 | # After fine-tuning the model with the Alpaca prompt style, try the Phi-3 prompt style 59 | # shown in figure 7.4 and observe whether it affects the response quality of the model. 
60 | 61 | # def format_input(entry): 62 | # instruction_text = ( 63 | # f"<|user|>\n{entry['instruction']}" 64 | # ) 65 | 66 | # input_text = f"\n{entry['input']}" if entry["input"] else "" 67 | 68 | # return instruction_text + input_text -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.2_Preparing_a_dataset_for_supervised_instruction_fine_tuning/download.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import urllib.request 4 | 5 | 6 | def download_and_load_file(file_path, url): 7 | 8 | if not os.path.exists(file_path): 9 | with urllib.request.urlopen(url) as response: 10 | text_data = response.read().decode("utf-8") 11 | with open(file_path, "w", encoding="utf-8") as file: 12 | file.write(text_data) 13 | else: 14 | with open(file_path, "r", encoding="utf-8") as file: 15 | text_data = file.read() 16 | 17 | with open(file_path, "r", encoding="utf-8") as file: 18 | data = json.load(file) 19 | 20 | return data 21 | 22 | 23 | file_path = "instruction-data.json" 24 | url = ( 25 | "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch" 26 | "/main/ch07/01_main-chapter-code/instruction-data.json" 27 | ) 28 | 29 | data = download_and_load_file(file_path, url) 30 | print("Number of entries:", len(data)) 31 | 32 | print("Example entry:\n", data[50]) 33 | 34 | print("Example entry:\n", data[999]) -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.2_Preparing_a_dataset_for_supervised_instruction_fine_tuning/images/instruction fine tuning prompt styles.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.2_Preparing_a_dataset_for_supervised_instruction_fine_tuning/images/instruction fine tuning prompt styles.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/-100 purpose in target IDs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/-100 purpose in target IDs.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt1.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt2.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt3.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt4.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt5.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt6.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/batching process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/batching process.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/cross entropy loss for logits_1 & targets_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/cross entropy loss for logits_1 & targets_1.png 
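For reference, the Alpaca-style prompt assembled by format_input() in 7.2_Preparing_a_dataset_for_supervised_instruction_fine_tuning/data_preprocessing.py above produces text like the following; the entry is a hypothetical example, while the formatting strings match that file:

# Hypothetical entry shaped like the items in instruction-data.json (values invented).
entry = {
    "instruction": "Rewrite the sentence in passive voice.",
    "input": "The chef cooked the meal.",
    "output": "The meal was cooked by the chef.",
}

# Same assembly as format_input() plus the "### Response" suffix used during training.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
    f"\n\n### Instruction:\n{entry['instruction']}"
    + (f"\n\n### Input:\n{entry['input']}" if entry["input"] else "")
)
response = f"\n\n### Response:\n{entry['output']}"
print(prompt + response)

# Printed result:
# Below is an instruction that describes a task. Write a response that appropriately completes the request.
#
# ### Instruction:
# Rewrite the sentence in passive voice.
#
# ### Input:
# The chef cooked the meal.
#
# ### Response:
# The meal was cooked by the chef.

With len(data) == 1100, the train/test/validation split in that same file works out to int(1100 * 0.85) = 935, int(1100 * 0.1) = 110, and 1100 - 935 - 110 = 55 entries, respectively.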
-------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/custom collate (assemble) function.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/custom collate (assemble) function.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/first two steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/first two steps.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index book explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index book explanation.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt1.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt2.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt3.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt4.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt4.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt5.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt6.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt7.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/input and target token alignment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/input and target token alignment.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/masked instruction tokens.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/masked instruction tokens.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/masking the instruction tokens explained.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/masking the instruction tokens explained.png 
-------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/padding token replacement in target batch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/padding token replacement in target batch.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/target IDs explained.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/target IDs explained.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.4_Creating_data_loaders_for_an_instruction_dataset/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.4_Creating_data_loaders_for_an_instruction_dataset/data_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.4_Creating_data_loaders_for_an_instruction_dataset/data_setup/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | def load_file(file_path): 5 | with open(file_path, "r", encoding="utf-8") as file: 6 | data = json.load(file) 7 | 8 | return data 9 | 10 | data = load_file("data_setup/instruction-data.json") 11 | 12 | assert len(data) == 1100, "Instruction dataset is not of the correct length, please reload the data." 
13 | 14 | # Divide the dataset into a training, validation, and test set 15 | train_portion = int(len(data) * 0.85) # 85% for training 16 | test_portion = int(len(data) * 0.1) # 10% for testing 17 | val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation 18 | 19 | train_data = data[:train_portion] 20 | test_data = data[train_portion:train_portion + test_portion] 21 | val_data = data[train_portion + test_portion:] 22 | -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.4_Creating_data_loaders_for_an_instruction_dataset/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.4_Creating_data_loaders_for_an_instruction_dataset/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.4_Creating_data_loaders_for_an_instruction_dataset/testing.py: -------------------------------------------------------------------------------- 1 | from data_setup.instruction_dataset import train_loader, val_loader, test_loader, train_dataset 2 | import tiktoken 3 | 4 | 5 | tokenizer = tiktoken.get_encoding("gpt2") 6 | 7 | 8 | # Prerequisite for validating: set shuffle=False in train_loader in instruction_dataset.py 9 | 10 | # The very first sequence in the first batch is the sequence of longest length, thus no padding tokens are present. 11 | # Because of this, the second sequence is tested instead, so the padding tokens are visible when decoded. 12 | IDX = 1 13 | 14 | print("\n----------------------------------------------------------") 15 | print("\n******* RAW TOKEN IDs + CORRESPONDING DECODED TEXT OF SECOND TRAIN LOADER SAMPLE *******\n") 16 | 17 | for inputs, targets in train_loader: 18 | # Get the second example from the batch (token IDs) 19 | second_example_tokens = inputs[IDX].tolist() 20 | 21 | print(second_example_tokens, "\n") 22 | 23 | print("Length of second example (train loader) tokens:", len(second_example_tokens), "\n") # 15 padding tokens (50256) get added 24 | 25 | # Decode token IDs back into text 26 | second_example_text = tokenizer.decode(second_example_tokens) 27 | 28 | print("Decoded Text for Second Example in Batch:") 29 | print(second_example_text) 30 | break # Stop after the first batch 31 | 32 | 33 | # { 34 | # "instruction": "Edit the following sentence for grammar.", 35 | # "input": "He go to the park every day.", 36 | # "output": "He goes to the park every day."
37 | # }, 38 | 39 | 40 | print("\n----------------------------------------------------------") 41 | print("\n******* RAW TOKEN IDs + CORRESPONDING DECODED TEXT OF SECOND TRAIN DATASET SAMPLE *******\n") 42 | 43 | raw_data = train_dataset[IDX] 44 | 45 | print("Length of second example (train dataset) tokens:", len(raw_data), "\n") 46 | 47 | print(raw_data, "\n") 48 | 49 | print(tokenizer.decode(raw_data)) 50 | 51 | print("\n----------------------------------------------------------\n") 52 | 53 | 54 | print("******* CHECKING FIRST BATCH FOR LONGEST SEQUENCE *******\n") 55 | max_length = 0 56 | max_length_index = None 57 | 58 | # Get the first batch 59 | for inputs, targets in train_loader: 60 | for i, seq in enumerate(inputs): 61 | seq_length = (seq != 50256).sum().item() # Count non-padding tokens 62 | if seq_length > max_length: 63 | max_length = seq_length 64 | max_length_index = i 65 | 66 | break # Stop after processing the first batch 67 | 68 | 69 | 70 | # Print results 71 | print(f"Index of longest sequence in the first batch: {max_length_index}") 72 | print(f"Length of the longest sequence (excluding padding): {max_length}") 73 | 74 | 75 | print("\n----------------------------------------------------------\n") 76 | -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/data_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/data_setup/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | def load_file(file_path): 5 | with open(file_path, "r", encoding="utf-8") as file: 6 | data = json.load(file) 7 | 8 | return data 9 | 10 | data = load_file("data_setup/instruction-data.json") 11 | 12 | assert len(data) == 1100, "Instruction dataset is not of the correct length, please reload the data." 
13 | 14 | # Divide the dataset into a training, validation, and test set 15 | train_portion = int(len(data) * 0.85) # 85% for training 16 | test_portion = int(len(data) * 0.1) # 10% for testing 17 | val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation 18 | 19 | train_data = data[:train_portion] 20 | test_data = data[train_portion:train_portion + test_portion] 21 | val_data = data[train_portion + test_portion:] 22 | -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/gpt_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | gpt.trf_blocks[b].norm1.shift = assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | 
params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/data_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/data_setup/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | def load_file(file_path): 5 | with open(file_path, "r", encoding="utf-8") as file: 6 | data = json.load(file) 7 | 8 | return data 9 | 10 | data = load_file("data_setup/instruction-data.json") 11 | 12 | assert len(data) == 1100, "Instruction dataset is not of the correct length, please reload the data." 
13 | 14 | # Divide the dataset into a training, validation, and test set 15 | train_portion = int(len(data) * 0.85) # 85% for training 16 | test_portion = int(len(data) * 0.1) # 10% for testing 17 | val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation 18 | 19 | train_data = data[:train_portion] 20 | test_data = data[train_portion:train_portion + test_portion] 21 | val_data = data[train_portion + test_portion:] 22 | -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/gpt_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | gpt.trf_blocks[b].norm1.shift = 
assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/dealing with hardware limitations.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/dealing with hardware limitations.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/device runtimes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/device runtimes.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/loss plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/loss plot.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/loss-plot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/loss-plot.pdf -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/stages.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/data_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/data_setup/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | def load_file(file_path): 5 | with open(file_path, "r", encoding="utf-8") as file: 6 | data = json.load(file) 7 | 8 | return data 9 | 10 | data = load_file("data_setup/instruction-data.json") 11 | 12 | assert len(data) == 1100, "Instruction dataset is not of the correct length, please reload the data." 13 | 14 | # Divide the dataset into a training, validation, and test set 15 | train_portion = int(len(data) * 0.85) # 85% for training 16 | test_portion = int(len(data) * 0.1) # 10% for testing 17 | val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation 18 | 19 | train_data = data[:train_portion] 20 | test_data = data[train_portion:train_portion + test_portion] 21 | val_data = data[train_portion + test_portion:] 22 | -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/gpt_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. 
Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | gpt.trf_blocks[b].norm1.shift = assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. 
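# ------ Illustration (editor's sketch, not part of load_weights.py) ------
# A minimal, self-contained example of the two ideas covered in the notes above:
# splitting the fused attention matrix with np.split (line #15) and weight tying
# (line #68). The sizes below (emb_dim = 4, vocab_size = 10) are made up for
# illustration and are not the GPT-2 dimensions. Note also that load_weights_into_gpt
# copies the same "wte" array into both the token embedding and the output head,
# whereas this sketch shows weight tying in the stricter PyTorch sense of sharing
# a single Parameter object between the two layers.

import numpy as np
import torch

# A fused QKV weight matrix shaped (emb_dim, 3 * emb_dim), the layout that the
# np.split call at line #15 assumes.
emb_dim = 4
c_attn_w = np.arange(emb_dim * 3 * emb_dim, dtype=np.float32).reshape(emb_dim, 3 * emb_dim)

# Splitting along the last axis yields three equal (emb_dim, emb_dim) blocks:
# the query, key, and value weights.
q_w, k_w, v_w = np.split(c_attn_w, 3, axis=-1)
assert q_w.shape == k_w.shape == v_w.shape == (emb_dim, emb_dim)

# Weight tying: the output head reuses the token embedding's Parameter, so both
# layers share (and would jointly update) the same underlying tensor. Sharing it
# saves vocab_size * emb_dim parameters, which is why the note above describes
# weight tying as a way to reduce the total parameter count.
vocab_size = 10
tok_emb = torch.nn.Embedding(vocab_size, emb_dim)
out_head = torch.nn.Linear(emb_dim, vocab_size, bias=False)
out_head.weight = tok_emb.weight
assert out_head.weight.data_ptr() == tok_emb.weight.data_ptr()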
-------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response explained, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response explained, pt1.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response explained, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response explained, pt2.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response, pt1.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response, pt2.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/check_ollama_status.py: -------------------------------------------------------------------------------- 1 | import psutil 2 | 3 | 4 | def check_if_running(process_name): 5 | running = False 6 | for proc in psutil.process_iter(["name"]): 7 | if process_name in proc.info["name"]: 8 | running = True 9 | break 10 | return running 11 | 12 | ollama_running = check_if_running("ollama") 13 | 14 | if not ollama_running: 15 | raise RuntimeError("Ollama not running. 
Launch ollama before proceeding.") 16 | print("Ollama running:", check_if_running("ollama")) -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/data_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/data_setup/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | def load_file(file_path): 5 | with open(file_path, "r", encoding="utf-8") as file: 6 | data = json.load(file) 7 | 8 | return data 9 | 10 | data = load_file("data_setup/instruction-data.json") 11 | 12 | assert len(data) == 1100, "Instruction dataset is not of the correct length, please reload the data." 13 | 14 | # Divide the dataset into a training, validation, and test set 15 | train_portion = int(len(data) * 0.85) # 85% for training 16 | test_portion = int(len(data) * 0.1) # 10% for testing 17 | val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation 18 | 19 | train_data = data[:train_portion] 20 | test_data = data[train_portion:train_portion + test_portion] 21 | val_data = data[train_portion + test_portion:] 22 | -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/alternative ollama models.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/alternative ollama models.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/llama 3 score for gpt2 instruct, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/llama 3 score for gpt2 instruct, pt1.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/llama 3 score for gpt2 instruct, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/llama 3 score for gpt2 instruct, pt2.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/llama 3 score for gpt2 instruct, pt3.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/llama 3 score for gpt2 instruct, pt3.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/ollama.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/ollama.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/using larger llms via web apis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/using larger llms via web apis.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/images/stages.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Build-a-Large-Language-Model-from-Scratch 2 | 3 | https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | 5 | "In Build a Large Language Model (from Scratch), you’ll discover how LLMs work from the inside out. In this book, I’ll guide you step by step through creating your own LLM, explaining each stage with clear text, diagrams, and examples. 6 | 7 | The method described in this book for training and developing your own small-but-functional model for educational purposes mirrors the approach used in creating large-scale foundational models such as those behind ChatGPT. The book uses Python and PyTorch for all its coding examples." 8 | 9 | By Sebastian Raschka 10 | 11 | Book Repo: 12 | https://github.com/rasbt/LLMs-from-scratch/ 13 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | torch==2.4.0 2 | tiktoken==0.8.0 3 | matplotlib==3.9.2 4 | tensorflow>=2.15.0 5 | tqdm>=4.66 6 | pandas==2.2.3 7 | psutil==6.1.1 --------------------------------------------------------------------------------
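The dependency pins above can be installed with pip install -r requirements.txt. As an optional quick check (a small sketch, not part of the repository), the following snippet prints the installed version of each package next to the spec from requirements.txt, which makes it easy to spot an environment that has drifted from the pinned versions:

from importlib.metadata import PackageNotFoundError, version

# Specs copied from the requirements.txt above.
requirements = {
    "torch": "==2.4.0",
    "tiktoken": "==0.8.0",
    "matplotlib": "==3.9.2",
    "tensorflow": ">=2.15.0",
    "tqdm": ">=4.66",
    "pandas": "==2.2.3",
    "psutil": "==6.1.1",
}

for package, spec in requirements.items():
    try:
        print(f"{package}: installed {version(package)} (requirements.txt: {spec})")
    except PackageNotFoundError:
        print(f"{package}: not installed (requirements.txt: {spec})")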