├── .gitignore ├── 01_Intro_to_PyTorch ├── A.1.3_Installing_PyTorch │ └── main.py ├── A.2_Understanding_Tensors │ ├── A.2.1_Scalars_Vectors_Matrices_Tensors │ │ └── main.py │ ├── A.2.2_Tensor_data_types │ │ └── main.py │ ├── A.2.3_Common_PyTorch_tensor_operations │ │ └── main.py │ └── images │ │ └── tensors.png ├── A.3_Seeing_models_as_computation_graphs │ ├── images │ │ └── logistic-regression-forward-pass-as-computation-graph.png │ └── main.py ├── A.4_Automatic_differentiation_made_easy │ ├── images │ │ └── partial-derivatives-gradients.png │ └── main.py ├── A.5_Implementing_multilayer_neural_networks │ ├── images │ │ ├── multilayer-neural-network.png │ │ └── multilayer-perceptron-two-hidden-layers.png │ └── main.py ├── A.6_Setting_up_efficient_data_loaders │ ├── images │ │ ├── data-loaders-workers.png │ │ └── data-loaders.png │ └── main.py ├── A.7_A_typical_training_loop │ └── main.py ├── A.8_Saving_and_loading_models │ ├── main.py │ └── model.pth └── A.9_Optimizing_training_performance_with_GPUs │ ├── A.9.1_PyTorch_computations_on_GPU_devices │ ├── main.py │ └── model.pth │ ├── A.9.2_Single_GPU_Training │ └── main.py │ └── A.9.3_Training_with_multiple_GPUs │ ├── images │ ├── code-summary-1.png │ ├── code-summary-2.png │ ├── multi-gpu-1.png │ └── multi-gpu-2.png │ └── main.py ├── 02_Working_with_text_data ├── 2.2_Tokenizing_text │ ├── main.py │ └── the-verdict.txt ├── 2.3_Converting_tokens_into_token_IDs │ ├── main.py │ ├── the-verdict.txt │ └── token_ids.py ├── 2.4_Adding_special_context_tokens │ ├── main.py │ ├── the-verdict.txt │ └── token_ids.py ├── 2.5_Byte_pair_encoding │ └── main.py ├── 2.6_Data_sampling_with_a_sliding_window │ ├── images │ │ └── sliding window.png │ ├── input_token_pairs.py │ ├── main.py │ └── the-verdict.txt ├── 2.7_Creating_token_embeddings │ └── embedding_example.py └── 2.8_Encoding_word_positions │ ├── main.py │ └── the-verdict.txt ├── 03_Coding_attention_mechanisms ├── 3.3_Attending_to_different_parts_of_the_input_with_self_attention │ ├── 3.3.1_A_simple_self_attention_mechanism_without_trainable_weights │ │ ├── images │ │ │ ├── attention mechanism.png │ │ │ └── vector similarity example.png │ │ └── main.py │ └── 3.3.2_Computing_attention_weights_for_all_input_tokens │ │ ├── images │ │ └── attention weights heatmap.png │ │ └── main.py ├── 3.4_Implementing_self_attention_with_training_weights │ ├── 3.4.1_Computing_the_attention_weights_step_by_step │ │ ├── images │ │ │ ├── input and output dimensions.png │ │ │ ├── q, k distinction, pt1.png │ │ │ ├── q, k distinction, pt2.png │ │ │ ├── q, k distinction, pt3.png │ │ │ ├── q, k distinction, pt4.png │ │ │ ├── query, key, value, pt1.png │ │ │ ├── query, key, value, pt2.png │ │ │ ├── query, key, value, pt3.png │ │ │ └── query, key, value, pt4.png │ │ └── main.py │ └── 3.4.2_Implementing_a_compact_self_attention_Python_class │ │ ├── images │ │ ├── attention weights heatmap.png │ │ ├── key vector token attributes, pt1.png │ │ ├── key vector token attributes, pt2.png │ │ ├── key vector token attributes, pt3.png │ │ ├── key vector token attributes, pt4.png │ │ ├── key vector token attributes, pt5.png │ │ ├── key vector token attributes, pt6.png │ │ ├── key vector token attributes, pt7.png │ │ ├── key vector token attributes, pt8.png │ │ └── q,k,v,z.png │ │ ├── self-attention-class-v1.py │ │ └── self-attention-class-v2.py ├── 3.5_Hiding_future_words_with_causal_attention │ ├── 3.5.1_Applying_a_causal_attention_mask │ │ ├── images │ │ │ └── attn weights normalized with masked future tokens .png │ │ └── main.py │ ├── 
3.5.2_Masking_additional_attention_weights_with_dropout │ │ ├── images │ │ │ ├── attn weights normalized with dropout and masked future tokens .png │ │ │ └── dropout.png │ │ └── main.py │ └── 3.5.3_Implementing_a_compact_causal_attention_class │ │ └── main.py └── 3.6_Extending_single_head_attention_to_multi_head_attention │ ├── 3.6.1_Stacking_multiple_single_head_attention_layers │ ├── images │ │ ├── multi head attention output.png │ │ └── multi head attention.png │ └── main.py │ └── 3.6.2_Implementing_multi_head_attention_with_weight_splits │ ├── batched_matrix_multiplication.py │ ├── images │ ├── diagram.png │ ├── gpt-2 param explanation, pt1.png │ ├── gpt-2 param explanation, pt2.png │ └── gpt-2 param explanation, pt3.png │ └── main.py ├── 04_Implementing_a_GPT_model_to_generate_text ├── 4.1_Coding_an_LLM_Architecture │ ├── gpt_config.py │ ├── images │ │ └── logits explanation.png │ └── main.py ├── 4.2_Normalizing_activations_with_layer_normalization │ ├── images │ │ ├── biased variance broken down.png │ │ ├── dim parameter, pt1.png │ │ ├── dim parameter, pt2.png │ │ ├── forward pass & after check, pt1.png │ │ ├── forward pass & after check, pt2.png │ │ ├── forward pass & after check, pt3.png │ │ ├── layer normalization explained, pt1.png │ │ ├── layer normalization explained, pt2.png │ │ ├── layer normalization explained, pt3.png │ │ ├── layer normalization.png │ │ ├── variance calculation, pt1.png │ │ └── variance calculation, pt2.png │ ├── main.py │ └── normalization.py ├── 4.3_Implementing_a_feed_forward_network_with_GELU_activations │ ├── gelu.py │ ├── images │ │ ├── fnn diagram.png │ │ ├── gelu and relu plot.png │ │ └── input into feedforward neural net (fnn).png │ └── main.py ├── 4.4_Adding_shortcut_connections │ ├── images │ │ └── shortcut connections.png │ └── main.py ├── 4.5_Connecting_attention_and_linear_layers_in_a_transformer_block │ ├── images │ │ └── transformer block.png │ └── main.py ├── 4.6_Coding_the_GPT_Model │ ├── images │ │ └── gpt2 architecture.png │ └── main.py └── 4.7_Generating_text │ ├── exercise.py │ ├── images │ ├── iterations of a token prediction cycle.png │ ├── mechanics of text generation.png │ └── step by step text generation.png │ └── main.py ├── 05_Pretraining_on_unlabeled_data ├── 5.1_Evaluating_generative_text_models │ ├── 5.1.1_Using_GPT_to_generate_text │ │ ├── images │ │ │ ├── chapter topics.png │ │ │ ├── gpt build stages.png │ │ │ └── tokenizer placement in flow.png │ │ └── main.py │ ├── 5.1.2_Calculating_the_text_generation_loss │ │ ├── images │ │ │ ├── loss calculation steps.png │ │ │ ├── next tokens.png │ │ │ ├── perplexity score explanation.png │ │ │ └── text generation process.png │ │ └── main.py │ └── 5.1.3_Calculating_the_training_and_validation_set_losses │ │ ├── images │ │ └── dataloaders.png │ │ ├── loading_dataset.py │ │ ├── main.py │ │ └── the-verdict.txt ├── 5.2_Training_an_LLM │ ├── images │ │ ├── loss-plot.pdf │ │ ├── plot explanation.png │ │ ├── training loop.png │ │ └── training process.png │ ├── main.py │ └── the-verdict.txt ├── 5.3_Decoding_strategies_to_control_randomness │ ├── 5.3.0_Same_Outputs │ │ ├── main.py │ │ └── the-verdict.txt │ ├── 5.3.1_Temperature_Scaling │ │ ├── images │ │ │ ├── temperature explanation.png │ │ │ └── temperature-plot.pdf │ │ └── token_gen_process.py │ ├── 5.3.2_Top_k_sampling │ │ ├── images │ │ │ └── top k steps.png │ │ └── top_k.py │ └── 5.3.3_Modifying_the_text_generation_function │ │ ├── main.py │ │ └── the-verdict.txt ├── 5.4_Loading_and_saving_model_weights_in_Pytorch │ ├── exercise.py │ ├── 
images │ │ ├── torch no_grad explained, pt1.png │ │ ├── torch no_grad explained, pt2.png │ │ ├── torch no_grad explained, pt3.png │ │ └── torch no_grad explained, pt4.png │ ├── main.py │ └── the-verdict.txt └── 5.5_Loading_pretrained_weights_from_OpenAI │ ├── exercises │ ├── exercise_5.5_5.6.py │ ├── exercise_main.py │ └── the-verdict.txt │ ├── gpt_setup │ ├── __init__.py │ ├── gpt_download.py │ └── load_weights.py │ ├── images │ └── gpt architecture.png │ └── main.py ├── 06_Fine_tuning_for_classification ├── 6.2_Preparing_the_dataset │ ├── data_preprocessing.py │ ├── download.py │ ├── images │ │ └── classification fine tuning stages.png │ ├── sms_spam_collection.zip │ ├── sms_spam_collection │ │ ├── SMSSpamCollection.tsv │ │ └── readme │ ├── test.csv │ ├── train.csv │ └── validation.csv ├── 6.3_Creating_data_loaders │ ├── images │ │ ├── input text prep process.png │ │ └── single training batch.png │ ├── padding_token.py │ ├── spam_dataset.py │ ├── test.csv │ ├── train.csv │ └── validation.csv ├── 6.4_Initializing_a_model_with_pretrained_weights │ ├── data_setup │ │ ├── __init__.py │ │ ├── spam_dataset.py │ │ ├── test.csv │ │ ├── train.csv │ │ └── validation.csv │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ └── stages.png │ └── main.py ├── 6.5_Adding_a_classification_head │ ├── data_setup │ │ ├── __init__.py │ │ ├── spam_dataset.py │ │ ├── test.csv │ │ ├── train.csv │ │ └── validation.csv │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ ├── architecture adapation for binary classification.png │ │ ├── final layernorm and trf block set to trainable.png │ │ ├── fine-tuning selected layers vs all layers.png │ │ ├── last row of output tensor.png │ │ ├── last token contains attention score to all other tokens .png │ │ ├── layer training summary, pt1.png │ │ ├── layer training summary, pt2.png │ │ ├── layer training summary, pt3.png │ │ └── modifying output layer.png │ └── main.py ├── 6.6_Calculating_the_classification_loss_and_accuracy │ ├── data_setup │ │ ├── __init__.py │ │ ├── spam_dataset.py │ │ ├── test.csv │ │ ├── train.csv │ │ └── validation.csv │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ ├── model outputs.png │ │ └── stages.png │ └── main.py ├── 6.7_Fine_tuning_the_model_on_supervised_data │ ├── data_setup │ │ ├── __init__.py │ │ ├── spam_dataset.py │ │ ├── test.csv │ │ ├── train.csv │ │ └── validation.csv │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ ├── accuracy-plot.pdf │ │ ├── choosing the number of epochs.png │ │ ├── loss-plot.pdf │ │ ├── train, validation, test dataset explanation, pt1.png │ │ ├── train, validation, test dataset explanation, pt2.png │ │ ├── train, validation, test dataset explanation, pt3.png │ │ ├── train, validation, test dataset explanation, pt4.png │ │ ├── train, validation, test dataset explanation, pt5.png │ │ ├── train, validation, test dataset explanation, pt6.png │ │ ├── train, validation, test dataset explanation, pt7.png │ │ ├── training & validation accuracy plot explanation.png │ │ ├── training & validation loss plot explanation.png │ │ └── training loop.png │ └── main.py ├── 6.8_Using_the_LLM_as_a_spam_classifier │ ├── data_setup │ │ ├── __init__.py │ │ ├── spam_dataset.py │ │ ├── test.csv │ │ ├── train.csv │ │ └── validation.csv │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ ├── 
overview, pt1.png │ │ ├── overview, pt10.png │ │ ├── overview, pt11.png │ │ ├── overview, pt12.png │ │ ├── overview, pt13.png │ │ ├── overview, pt2.png │ │ ├── overview, pt3.png │ │ ├── overview, pt4.png │ │ ├── overview, pt5.png │ │ ├── overview, pt6.png │ │ ├── overview, pt7.png │ │ ├── overview, pt8.png │ │ └── overview, pt9.png │ ├── inference_playground.py │ └── main.py └── images │ ├── classification fine tuning.png │ ├── fine tuning approach.png │ ├── instruction fine tuning.png │ └── stages.png ├── 07_Fine_tuning_for_instructions ├── 7.1_Introduction_to_instruction_fine_tuning │ └── images │ │ ├── desired goal.png │ │ └── stages.png ├── 7.2_Preparing_a_dataset_for_supervised_instruction_fine_tuning │ ├── data_preprocessing.py │ ├── download.py │ ├── images │ │ └── instruction fine tuning prompt styles.png │ └── instruction-data.json ├── 7.3_Organizing_data_into_training_batches │ ├── images │ │ ├── -100 purpose in target IDs.png │ │ ├── Instruction Fine-Tuning Training Process, pt1.png │ │ ├── Instruction Fine-Tuning Training Process, pt2.png │ │ ├── Instruction Fine-Tuning Training Process, pt3.png │ │ ├── Instruction Fine-Tuning Training Process, pt4.png │ │ ├── Instruction Fine-Tuning Training Process, pt5.png │ │ ├── Instruction Fine-Tuning Training Process, pt6.png │ │ ├── batching process.png │ │ ├── cross entropy loss for logits_1 & targets_1.png │ │ ├── custom collate (assemble) function.png │ │ ├── first two steps.png │ │ ├── ignore_index book explanation.png │ │ ├── ignore_index in cross-cross-entropy loss, pt1.png │ │ ├── ignore_index in cross-cross-entropy loss, pt2.png │ │ ├── ignore_index in cross-cross-entropy loss, pt3.png │ │ ├── ignore_index in cross-cross-entropy loss, pt4.png │ │ ├── ignore_index in cross-cross-entropy loss, pt5.png │ │ ├── ignore_index in cross-cross-entropy loss, pt6.png │ │ ├── ignore_index in cross-cross-entropy loss, pt7.png │ │ ├── input and target token alignment.png │ │ ├── masked instruction tokens.png │ │ ├── masking the instruction tokens explained.png │ │ ├── padding token replacement in target batch.png │ │ ├── stages.png │ │ └── target IDs explained.png │ └── instruction_dataset.py ├── 7.4_Creating_data_loaders_for_an_instruction_dataset │ ├── data_setup │ │ ├── __init__.py │ │ ├── data_preprocessing.py │ │ ├── instruction-data.json │ │ └── instruction_dataset.py │ ├── images │ │ └── stages.png │ └── testing.py ├── 7.5_Loading_a_pretrained_LLM │ ├── data_setup │ │ ├── __init__.py │ │ ├── data_preprocessing.py │ │ ├── instruction-data.json │ │ └── instruction_dataset.py │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ └── stages.png │ └── main.py ├── 7.6_Fine_tuning_the_LLM_on_instruction_data │ ├── data_setup │ │ ├── __init__.py │ │ ├── data_preprocessing.py │ │ ├── instruction-data.json │ │ └── instruction_dataset.py │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ ├── dealing with hardware limitations.png │ │ ├── device runtimes.png │ │ ├── loss plot.png │ │ ├── loss-plot.pdf │ │ └── stages.png │ └── main.py ├── 7.7_Extracting_and_saving_responses │ ├── data_setup │ │ ├── __init__.py │ │ ├── data_preprocessing.py │ │ ├── instruction-data.json │ │ └── instruction_dataset.py │ ├── gpt_setup │ │ ├── __init__.py │ │ ├── gpt_download.py │ │ └── load_weights.py │ ├── images │ │ ├── model response explained, pt1.png │ │ ├── model response explained, pt2.png │ │ ├── model response, pt1.png │ │ ├── model response, pt2.png │ │ └── 
stages.png │ ├── instruction-data-with-response.json │ └── main.py ├── 7.8_Evaluating_the_fine_tuned_LLM │ ├── check_ollama_status.py │ ├── data_setup │ │ ├── __init__.py │ │ ├── data_preprocessing.py │ │ ├── instruction-data.json │ │ └── instruction_dataset.py │ ├── images │ │ ├── alternative ollama models.png │ │ ├── llama 3 score for gpt2 instruct, pt1.png │ │ ├── llama 3 score for gpt2 instruct, pt2.png │ │ ├── llama 3 score for gpt2 instruct, pt3.png │ │ ├── ollama.png │ │ ├── stages.png │ │ └── using larger llms via web apis.png │ ├── instruction-data-with-response.json │ └── main.py └── images │ └── stages.png ├── README.md └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | .DS_Store 3 | gpt2 4 | gpt2-medium355M-sft.pth -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.1.3_Installing_PyTorch/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | print(torch.cuda.is_available()) # False if no GPU 4 | 5 | print(torch.__version__) # Version # 6 | 7 | print(torch.backends.mps.is_available()) # Check whether your Mac supports PyTorch acceleration with its Apple Silicon chip -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.2_Understanding_Tensors/A.2.1_Scalars_Vectors_Matrices_Tensors/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | tensor0d = torch.tensor(1) # Creates a zero-dimensional tensor (scalar) from a Python integer 5 | 6 | tensor1d = torch.tensor([1, 2, 3]) # Creates a one-dimensional tensor (vector) from a Python list 7 | 8 | tensor2d = torch.tensor([[1, 2], # Creates a two-dimensional tensor from a nested Python list 9 | [3, 4]]) 10 | 11 | tensor3d = torch.tensor([[[1, 2], [3, 4]], # Creates a three-dimensional tensor from a nested Python list 12 | [[5, 6], [7, 8]]]) 13 | 14 | 15 | print("0d tensor: \n", tensor0d, "\n") 16 | print("1d tensor: \n", tensor1d, "\n") 17 | print("2d tensor: \n", tensor2d, "\n") 18 | print("3d tensor: \n", tensor3d, "\n") -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.2_Understanding_Tensors/A.2.2_Tensor_data_types/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | tensor1d = torch.tensor([1, 2, 3]) 5 | print(tensor1d.dtype) # torch.int64 6 | 7 | floatvec = torch.tensor([1.0, 2.0, 3.0]) 8 | print(floatvec.dtype) # torch.float32 9 | 10 | floatvec = tensor1d.to(torch.float32) 11 | print(floatvec.dtype) # torch.float32 -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.2_Understanding_Tensors/A.2.3_Common_PyTorch_tensor_operations/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | tensor2d = torch.tensor([[1, 2, 3], 5 | [4, 5, 6]]) 6 | 7 | print(tensor2d) 8 | 9 | print(tensor2d.shape) # [2, 3], 2 rows by 3 columns 10 | 11 | # print(tensor2d.reshape(3, 2)) # [3, 2], 3 rows by 2 columns, reshaping tensor 12 | 13 | print(tensor2d.view(3, 2)) # [3, 2], 3 rows by 2 columns, more common command for reshaping tensors in PyTorch 14 | 15 | print("Transpose: \n ", tensor2d.T, "\n") # Transpose, flip across its diagonal 16 | 17 | print(tensor2d.matmul(tensor2d.T)) # Matrix multiplication 18 | 19 | 
print(tensor2d @ tensor2d.T) # Matrix multiplication (compact syntax) -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.2_Understanding_Tensors/images/tensors.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.2_Understanding_Tensors/images/tensors.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.3_Seeing_models_as_computation_graphs/images/logistic-regression-forward-pass-as-computation-graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.3_Seeing_models_as_computation_graphs/images/logistic-regression-forward-pass-as-computation-graph.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.3_Seeing_models_as_computation_graphs/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | 4 | 5 | y = torch.tensor([1.0]) # true label 6 | x1 = torch.tensor([1.1]) # input feature 7 | w1 = torch.tensor([2.2]) # weight parameter 8 | b = torch.tensor([0.0]) # bias unit 9 | z = x1 * w1 + b # net input 10 | a = torch.sigmoid(z) # activation and output 11 | 12 | print(z) 13 | print(a) 14 | 15 | loss = F.binary_cross_entropy(a, y) 16 | print(loss) 17 | -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.4_Automatic_differentiation_made_easy/images/partial-derivatives-gradients.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.4_Automatic_differentiation_made_easy/images/partial-derivatives-gradients.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.4_Automatic_differentiation_made_easy/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | from torch.autograd import grad 4 | 5 | 6 | y = torch.tensor([1.0]) 7 | x1 = torch.tensor([1.1]) 8 | w1 = torch.tensor([2.2], requires_grad=True) 9 | b = torch.tensor([0.0], requires_grad=True) 10 | 11 | z = x1 * w1 + b 12 | a = torch.sigmoid(z) 13 | 14 | loss = F.binary_cross_entropy(a, y) 15 | 16 | loss.backward() 17 | print(w1.grad) 18 | print(b.grad) 19 | 20 | 21 | # Manual Method 22 | # grad_L_w1 = grad(loss, w1, retain_graph=True) 23 | # grad_L_b = grad(loss, b, retain_graph=True) 24 | # print(grad_L_w1) 25 | # print(grad_L_b) 26 | -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.5_Implementing_multilayer_neural_networks/images/multilayer-neural-network.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.5_Implementing_multilayer_neural_networks/images/multilayer-neural-network.png 
-------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.5_Implementing_multilayer_neural_networks/images/multilayer-perceptron-two-hidden-layers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.5_Implementing_multilayer_neural_networks/images/multilayer-perceptron-two-hidden-layers.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.5_Implementing_multilayer_neural_networks/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | class NeuralNetwork(torch.nn.Module): 5 | def __init__(self, num_inputs, num_outputs): 6 | super().__init__() 7 | 8 | self.layers = torch.nn.Sequential( 9 | 10 | # 1st hidden layer 11 | torch.nn.Linear(num_inputs, 30), 12 | torch.nn.ReLU(), 13 | 14 | # 2nd hidden layer 15 | torch.nn.Linear(30, 20), 16 | torch.nn.ReLU(), 17 | 18 | # Output layer 19 | torch.nn.Linear(20, num_outputs) 20 | ) 21 | 22 | def forward(self, x): 23 | logits = self.layers(x) 24 | return logits 25 | 26 | torch.manual_seed(123) 27 | model = NeuralNetwork(50, 3) 28 | print(model) 29 | 30 | num_params = sum(p.numel() for p in model.parameters() if p.requires_grad) 31 | print("Total number of trainable model parameters: ", num_params) 32 | 33 | print(model.layers[0].weight) 34 | # print(len(model.layers[0].weight)) 35 | # print(model.layers[0].weight.shape) 36 | # print(model.layers[0].bias) 37 | 38 | X = torch.rand((1, 50)) 39 | # out = model(X) 40 | # print(out) 41 | 42 | # When we use a model for inference (for instance, making predictions) rather than training, 43 | # the best practice is to use the torch.no_grad() context manager. This tells PyTorch that it doesn't 44 | # need to keep track of the gradients, which can result in significant savings in memory and computation. 45 | # with torch.no_grad(): 46 | # out = model(X) 47 | 48 | # If we want to compute class-membership probabilities for our predictions, we have to call the softmax function explicitly.
49 | with torch.no_grad(): 50 | out = torch.softmax(model(X), dim=1) 51 | 52 | print(out) -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.6_Setting_up_efficient_data_loaders/images/data-loaders-workers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.6_Setting_up_efficient_data_loaders/images/data-loaders-workers.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.6_Setting_up_efficient_data_loaders/images/data-loaders.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.6_Setting_up_efficient_data_loaders/images/data-loaders.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.6_Setting_up_efficient_data_loaders/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | 4 | 5 | X_train = torch.tensor([ 6 | [-1.2, 3.1], 7 | [-0.9, 2.9], 8 | [-0.5, 2.6], 9 | [2.3, -1.1], 10 | [2.7, -1.5] 11 | ]) 12 | 13 | y_train = torch.tensor([0, 0, 0, 1, 1]) # class labels 14 | 15 | X_test = torch.tensor([ 16 | [-0.8, 2.8], 17 | [2.6, -1.6] 18 | ]) 19 | 20 | y_test = torch.tensor([0, 1]) 21 | 22 | class ToyDataset(Dataset): 23 | def __init__(self, X, y): 24 | self.features = X 25 | self.labels = y 26 | 27 | def __getitem__(self, index): 28 | one_x = self.features[index] 29 | one_y = self.labels[index] 30 | return one_x, one_y 31 | 32 | def __len__(self): 33 | return self.labels.shape[0] 34 | 35 | train_ds = ToyDataset(X_train, y_train) 36 | test_ds = ToyDataset(X_test, y_test) 37 | 38 | print(len(train_ds)) 39 | 40 | torch.manual_seed(123) 41 | 42 | # train_loader = DataLoader( 43 | # dataset=train_ds, 44 | # batch_size=2, 45 | # shuffle=True, 46 | # num_workers=0 47 | # ) 48 | 49 | train_loader = DataLoader( 50 | dataset=train_ds, 51 | batch_size=2, 52 | shuffle=True, 53 | num_workers=0, 54 | drop_last=True # will drop 5th sample, since it's not even 55 | ) 56 | 57 | test_loader = DataLoader( 58 | dataset=test_ds, 59 | batch_size=2, 60 | shuffle=False, 61 | num_workers=0 62 | ) 63 | 64 | for idx, (x, y) in enumerate(train_loader): 65 | print(f"Batch {idx + 1}: ", x, y) 66 | print() -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.8_Saving_and_loading_models/model.pth: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.8_Saving_and_loading_models/model.pth -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.1_PyTorch_computations_on_GPU_devices/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | from torch.utils.data import Dataset, DataLoader 4 | 5 | 6 | torch.manual_seed(123) 7 | 8 | class NeuralNetwork(torch.nn.Module): 9 | def __init__(self, 
num_inputs, num_outputs): 10 | super().__init__() 11 | 12 | self.layers = torch.nn.Sequential( 13 | 14 | # 1st hidden layer 15 | torch.nn.Linear(num_inputs, 30), 16 | torch.nn.ReLU(), 17 | 18 | # 2nd hidden layer 19 | torch.nn.Linear(30, 20), 20 | torch.nn.ReLU(), 21 | 22 | # Output layer 23 | torch.nn.Linear(20, num_outputs) 24 | ) 25 | 26 | def forward(self, x): 27 | logits = self.layers(x) 28 | return logits 29 | 30 | model = NeuralNetwork(2, 2) 31 | model.load_state_dict(torch.load("model.pth")) 32 | 33 | # If GPU present 34 | # device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 35 | 36 | # If Apple Silicon Chip, MPS = Metal Performance Shader 37 | device = torch.device("mps" if torch.backends.mps.is_available() else "cpu") 38 | 39 | print(device) 40 | 41 | # CPU addition 42 | tensor_1 = torch.tensor([1., 2., 3.]) 43 | tensor_2 = torch.tensor([4., 5., 6.]) 44 | print(tensor_1 + tensor_2) 45 | 46 | 47 | tensor_1 = tensor_1.to("mps") 48 | tensor_2 = tensor_2.to("mps") 49 | print(tensor_1 + tensor_2) 50 | 51 | # tensor_2 = tensor_2.to("cpu") # Will crash, tensors need to be on the same device 52 | # print(tensor_1 + tensor_2) 53 | 54 | -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.1_PyTorch_computations_on_GPU_devices/model.pth: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.1_PyTorch_computations_on_GPU_devices/model.pth -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.2_Single_GPU_Training/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | from torch.utils.data import Dataset, DataLoader 4 | 5 | 6 | torch.manual_seed(123) 7 | 8 | class NeuralNetwork(torch.nn.Module): 9 | def __init__(self, num_inputs, num_outputs): 10 | super().__init__() 11 | 12 | self.layers = torch.nn.Sequential( 13 | 14 | # 1st hidden layer 15 | torch.nn.Linear(num_inputs, 30), 16 | torch.nn.ReLU(), 17 | 18 | # 2nd hidden layer 19 | torch.nn.Linear(30, 20), 20 | torch.nn.ReLU(), 21 | 22 | # Output layer 23 | torch.nn.Linear(20, num_outputs) 24 | ) 25 | 26 | def forward(self, x): 27 | logits = self.layers(x) 28 | return logits 29 | 30 | X_train = torch.tensor([ 31 | [-1.2, 3.1], 32 | [-0.9, 2.9], 33 | [-0.5, 2.6], 34 | [2.3, -1.1], 35 | [2.7, -1.5] 36 | ]) 37 | 38 | y_train = torch.tensor([0, 0, 0, 1, 1]) # class labels 39 | 40 | X_test = torch.tensor([ 41 | [-0.8, 2.8], 42 | [2.6, -1.6] 43 | ]) 44 | 45 | y_test = torch.tensor([0, 1]) 46 | 47 | class ToyDataset(Dataset): 48 | def __init__(self, X, y): 49 | self.features = X 50 | self.labels = y 51 | 52 | def __getitem__(self, index): 53 | one_x = self.features[index] 54 | one_y = self.labels[index] 55 | return one_x, one_y 56 | 57 | def __len__(self): 58 | return self.labels.shape[0] 59 | 60 | train_ds = ToyDataset(X_train, y_train) 61 | test_ds = ToyDataset(X_test, y_test) 62 | 63 | train_loader = DataLoader( 64 | dataset=train_ds, 65 | batch_size=2, 66 | shuffle=True, 67 | num_workers=0, 68 | drop_last=True # will drop 5th sample, since it's not even 69 | ) 70 | 71 | test_loader = DataLoader( 72 | dataset=test_ds, 73 
| batch_size=2, 74 | shuffle=False, 75 | num_workers=0 76 | ) 77 | 78 | model = NeuralNetwork(num_inputs=2, num_outputs=2) 79 | 80 | device = torch.device("mps" if torch.backends.mps.is_available() else "cpu") 81 | model = model.to(device) 82 | 83 | optimizer = torch.optim.SGD(model.parameters(), lr=0.5) 84 | 85 | num_epochs = 3 86 | 87 | for epoch in range(num_epochs): 88 | 89 | model.train() 90 | 91 | for batch_idx, (features, labels) in enumerate(train_loader): 92 | features, labels = features.to(device), labels.to(device) 93 | logits = model(features) 94 | loss = F.cross_entropy(logits, labels) # Loss function 95 | 96 | optimizer.zero_grad() 97 | loss.backward() 98 | optimizer.step() 99 | 100 | ### LOGGING 101 | print(f"Epoch: {epoch + 1:03d}/{num_epochs:03d}" 102 | f" | Batch {batch_idx:03d}/{len(train_loader):03d}" 103 | f" | Train Loss {loss:.2f}") 104 | 105 | model.eval() 106 | -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/code-summary-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/code-summary-1.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/code-summary-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/code-summary-2.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/multi-gpu-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/multi-gpu-1.png -------------------------------------------------------------------------------- /01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/multi-gpu-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/01_Intro_to_PyTorch/A.9_Optimizing_training_performance_with_GPUs/A.9.3_Training_with_multiple_GPUs/images/multi-gpu-2.png -------------------------------------------------------------------------------- /02_Working_with_text_data/2.2_Tokenizing_text/main.py: -------------------------------------------------------------------------------- 1 | import urllib.request 2 | import re 3 | 4 | url = ("https://raw.githubusercontent.com/rasbt/" 5 | "LLMs-from-scratch/main/ch02/01_main-chapter-code/" 6 | "the-verdict.txt") 7 | 8 | # Download .txt file into directory 9 | # file_path = "the-verdict.txt" 10 | 
# urllib.request.urlretrieve(url, file_path) 11 | 12 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 13 | raw_text = f.read() 14 | 15 | print("Total number of characters:", len(raw_text)) 16 | print(raw_text[:99]) 17 | 18 | # Tokenization 19 | # Split text on whitespace characters 20 | text = "Hello, world. This, is a test." 21 | result = re.split(r"(\s)", text) 22 | print("\n", result) 23 | 24 | # Split text on whitespace, commas, and periods 25 | result = re.split(r"([,.]|\s)", text) 26 | print("\n", result) 27 | 28 | # Optional, remove redundant whitespace characters 29 | result = [item for item in result if item.strip()] 30 | print("\n", result) 31 | 32 | # Split text to handle more punctuation, such as question marks, quotation marks, and double-dashes. 33 | text = "Hello, world. Is this-- a test?" 34 | result = re.split(r'([,.:;?_!"()\']|--|\s)', text) 35 | result = [item.strip() for item in result if item.strip()] # fully removes whitespaces 36 | print("\n", result, " \n Token Count:", len(result)) 37 | 38 | # Basic Tokenizer applied to full short story, "the-verdict.txt" 39 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) 40 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 41 | print("\n Sample of Tokenized output: \n", preprocessed[:30], "\n Full Token Count:", len(preprocessed)) -------------------------------------------------------------------------------- /02_Working_with_text_data/2.3_Converting_tokens_into_token_IDs/main.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | class SimpleTokenizerV1: 5 | def __init__(self, vocab): 6 | self.str_to_int = vocab 7 | self.int_to_str = {i:s for s, i in vocab.items()} 8 | 9 | def encode(self, text): 10 | """ Processes input text into token IDs """ 11 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text) 12 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 13 | ids = [self.str_to_int[s] for s in preprocessed] 14 | return ids 15 | 16 | def decode(self, ids): 17 | """ Converts token IDs back into text """ 18 | text = " ".join([self.int_to_str[i] for i in ids]) 19 | # Replace spaces before the specified punctuations 20 | text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) 21 | return text 22 | 23 | 24 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 25 | raw_text = f.read() 26 | 27 | # Tokenization 28 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) 29 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 30 | 31 | # Converting Tokens into Token IDs 32 | all_words = sorted(set(preprocessed)) 33 | vocab_size = len(all_words) 34 | 35 | # Creating Vocabulary dictionary 36 | vocab = {token:integer for integer, token in enumerate(all_words)} 37 | 38 | tokenizer = SimpleTokenizerV1(vocab) 39 | 40 | text = """"It's the last he painted, you know," 41 | Mrs. Gisburn said with pardonable pride.""" 42 | 43 | ids = tokenizer.encode(text) 44 | print("\n Token IDs:", ids) 45 | 46 | decoded_ids = tokenizer.decode(ids) 47 | print("\n Decoded IDs:", decoded_ids) 48 | 49 | text = "Hello, do you like tea?" 50 | print(tokenizer.encode(text)) # KeyError: 'Hello' 51 | # "Hello" was not part of the training data and thus not part of the existing vocabulary dictionary. 
-------------------------------------------------------------------------------- /02_Working_with_text_data/2.3_Converting_tokens_into_token_IDs/token_ids.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 5 | raw_text = f.read() 6 | 7 | # Tokenization 8 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) 9 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 10 | print("\n Sample of Tokenized output:\n", preprocessed[:30], "\n Full Token Count:", len(preprocessed)) 11 | 12 | # Converting Tokens into Token IDs 13 | all_words = sorted(set(preprocessed)) 14 | vocab_size = len(all_words) 15 | print("\n Vocabulary Size:", vocab_size) 16 | 17 | # Creating Vocabulary dictionary 18 | vocab = {token:integer for integer, token in enumerate(all_words)} 19 | 20 | # Printing first 51 entries of vocabulary 21 | for i, item in enumerate(vocab.items()): 22 | print(item) 23 | if i >= 50: 24 | break -------------------------------------------------------------------------------- /02_Working_with_text_data/2.4_Adding_special_context_tokens/main.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | class SimpleTokenizerV2: 5 | def __init__(self, vocab): 6 | self.str_to_int = vocab 7 | self.int_to_str = {i:s for s, i in vocab.items()} 8 | 9 | def encode(self, text): 10 | """ Processes input text into token IDs """ 11 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text) 12 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 13 | preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed] 14 | ids = [self.str_to_int[s] for s in preprocessed] 15 | return ids 16 | 17 | def decode(self, ids): 18 | """ Converts token IDs back into text """ 19 | text = " ".join([self.int_to_str[i] for i in ids]) 20 | # Replace spaces before the specified punctuations 21 | text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) 22 | return text 23 | 24 | 25 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 26 | raw_text = f.read() 27 | 28 | # Tokenization 29 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) 30 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 31 | 32 | # Converting Tokens into Token IDs - Adding 2 new special tokens 33 | all_tokens = sorted(list(set(preprocessed))) 34 | all_tokens.extend(["<|endoftext|>", "<|unk|>"]) 35 | vocab_size = len(all_tokens) 36 | print("\n Vocabulary Size: ", vocab_size, "\n") 37 | 38 | # Creating Vocabulary dictionary 39 | vocab = {token:integer for integer, token in enumerate(all_tokens)} 40 | 41 | tokenizer = SimpleTokenizerV2(vocab) 42 | 43 | text1 = "Hello, do you like tea?" 44 | text2 = "In the sunlit terraces of the palace."
45 | 46 | text = " <|endoftext|> ".join((text1, text2)) 47 | print(text) 48 | 49 | ids = tokenizer.encode(text) 50 | print("\n Token IDs:", ids) 51 | 52 | decoded_ids = tokenizer.decode(ids) 53 | print("\n Decoded IDs:", decoded_ids) -------------------------------------------------------------------------------- /02_Working_with_text_data/2.4_Adding_special_context_tokens/token_ids.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 5 | raw_text = f.read() 6 | 7 | # Tokenization 8 | preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) 9 | preprocessed = [item.strip() for item in preprocessed if item.strip()] 10 | 11 | # Converting Tokens into Token IDs - Adding 2 new special tokens 12 | all_tokens = sorted(list(set(preprocessed))) 13 | all_tokens.extend(["<|endoftext|>", "<|unk|>"]) 14 | vocab_size = len(all_tokens) 15 | print("\n Vocabulary Size:", vocab_size, "\n") 16 | 17 | # Creating Vocabulary dictionary 18 | vocab = {token:integer for integer, token in enumerate(all_tokens)} 19 | 20 | # Printing last 5 entries of the updated vocabulary 21 | for i, item in enumerate(list(vocab.items())[-5:]): 22 | print(item) -------------------------------------------------------------------------------- /02_Working_with_text_data/2.5_Byte_pair_encoding/main.py: -------------------------------------------------------------------------------- 1 | from importlib.metadata import version 2 | import tiktoken 3 | 4 | 5 | print("tiktoken version:", version("tiktoken")) 6 | 7 | tokenizer = tiktoken.get_encoding("gpt2") 8 | 9 | text = ( 10 | "Hello, do you like tea? <|endoftext|> In the sunlit terraces" 11 | "of someunknownPlace." 12 | ) 13 | 14 | integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"}) 15 | print(integers) 16 | 17 | strings = tokenizer.decode(integers) 18 | print(strings) 19 | 20 | 21 | # Exercise 2.1 22 | text = "Akwirw ier" 23 | token_ids = tokenizer.encode(text) 24 | print("\n", token_ids) 25 | decoded_token_ids = tokenizer.decode(token_ids) 26 | print(decoded_token_ids) -------------------------------------------------------------------------------- /02_Working_with_text_data/2.6_Data_sampling_with_a_sliding_window/images/sliding window.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/02_Working_with_text_data/2.6_Data_sampling_with_a_sliding_window/images/sliding window.png -------------------------------------------------------------------------------- /02_Working_with_text_data/2.6_Data_sampling_with_a_sliding_window/input_token_pairs.py: -------------------------------------------------------------------------------- 1 | import tiktoken 2 | 3 | 4 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 5 | raw_text = f.read() 6 | 7 | tokenizer = tiktoken.get_encoding("gpt2") 8 | 9 | enc_text = tokenizer.encode(raw_text) 10 | print(len(enc_text)) # 5145 11 | 12 | enc_sample = enc_text[50:] # remove first 50 tokens 13 | 14 | # Creating input - target pairs 15 | context_size = 4 16 | x = enc_sample[:context_size] 17 | y = enc_sample[1:context_size + 1] 18 | print(f"x: {x}") 19 | print(f"y: {y}") 20 | 21 | print("\nOriginal Text:", tokenizer.decode(enc_sample[:context_size + 1])) 22 | 23 | print() 24 | # Token IDs 25 | # Input - Target Pairs 26 | for i in range(1, context_size + 1): 27 
| context = enc_sample[:i] 28 | desired = enc_sample[i] 29 | print(context, "---->", desired) # left side of arrow is what LLM receives, right side of arrow is what LLM needs to predict 30 | 31 | print() 32 | # Text 33 | # Input - Target Pairs 34 | for i in range(1, context_size + 1): 35 | context = enc_sample[:i] 36 | desired = enc_sample[i] 37 | print(tokenizer.decode(context), "---->", tokenizer.decode([desired])) 38 | -------------------------------------------------------------------------------- /02_Working_with_text_data/2.6_Data_sampling_with_a_sliding_window/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import tiktoken 4 | 5 | 6 | class GPTDataSetV1(Dataset): 7 | def __init__(self, txt, tokenizer, max_length, stride): 8 | self.input_ids = [] 9 | self.target_ids = [] 10 | 11 | # Tokenize the entire text 12 | token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"}) 13 | 14 | # Use a sliding window to chunk the book into overlapping sequences of max_length 15 | for i in range(0, len(token_ids) - max_length, stride): 16 | input_chunk = token_ids[i:i + max_length] 17 | target_chunk = token_ids[i + 1: i + max_length + 1] 18 | self.input_ids.append(torch.tensor(input_chunk)) 19 | self.target_ids.append(torch.tensor(target_chunk)) 20 | 21 | def __len__(self): 22 | return len(self.input_ids) 23 | 24 | def __getitem__(self, idx): 25 | return self.input_ids[idx], self.target_ids[idx] 26 | 27 | 28 | def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0): 29 | # Initialize the tokenizer 30 | tokenizer = tiktoken.get_encoding("gpt2") 31 | 32 | # Create dataset 33 | dataset = GPTDataSetV1(txt, tokenizer, max_length, stride) 34 | 35 | # Create dataloader 36 | dataloader = DataLoader( 37 | dataset, 38 | batch_size=batch_size, 39 | shuffle=shuffle, 40 | drop_last=drop_last, 41 | num_workers=num_workers 42 | ) 43 | 44 | return dataloader 45 | 46 | 47 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 48 | raw_text = f.read() 49 | 50 | dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False) 51 | data_iter = iter(dataloader) 52 | first_batch = next(data_iter) 53 | print("\nFirst Batch:", first_batch) 54 | 55 | second_batch = next(data_iter) 56 | print("Second Batch:", second_batch) 57 | 58 | # Exercise 2.2 59 | dataloader2 = create_dataloader_v1(raw_text, batch_size=1, max_length=2, stride=2, shuffle=False) 60 | data_iter2 = iter(dataloader2) 61 | first_batch2 = next(data_iter2) 62 | print("\nFirst Batch 2:", first_batch2) 63 | 64 | second_batch2 = next(data_iter2) 65 | print("Second Batch 2:", second_batch2) 66 | 67 | dataloader3 = create_dataloader_v1(raw_text, batch_size=1, max_length=8, stride=2, shuffle=False) 68 | data_iter3 = iter(dataloader3) 69 | first_batch3 = next(data_iter3) 70 | print("\nFirst Batch 3:", first_batch3) 71 | 72 | second_batch3 = next(data_iter3) 73 | print("Second Batch 3:", second_batch3) 74 | 75 | # -------------------------------------------------------------------------------- 76 | 77 | dataloader_final = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False) 78 | data_iter_final = iter(dataloader_final) 79 | inputs, targets = next(data_iter_final) 80 | print("\nInputs:\n", inputs) 81 | print("\nTargets:\n", targets) -------------------------------------------------------------------------------- 
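Side note on the sliding-window logic in 2.6: the snippet below is a minimal illustrative sketch (not a file from the repository) that replays the same chunking loop used in GPTDataSetV1.__init__ on a small made-up list of token IDs, so the effect of max_length and stride on chunk overlap is easy to see.

import torch

token_ids = list(range(10))  # toy token IDs 0..9, standing in for a tokenized text
max_length = 4               # number of tokens per input chunk
stride = 2                   # how far the window slides between chunks

# Same windowing as GPTDataSetV1.__init__: the target chunk is the input chunk shifted by one token
for i in range(0, len(token_ids) - max_length, stride):
    input_chunk = torch.tensor(token_ids[i:i + max_length])
    target_chunk = torch.tensor(token_ids[i + 1:i + max_length + 1])
    print(input_chunk.tolist(), "---->", target_chunk.tolist())

With stride=2 each new input chunk overlaps the previous one by two tokens; with stride equal to max_length (as in dataloader_final above) consecutive chunks do not overlap at all.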
/02_Working_with_text_data/2.7_Creating_token_embeddings/embedding_example.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | input_ids = torch.tensor([2, 3, 5, 1]) 5 | vocab_size = 6 6 | output_dim = 3 7 | 8 | torch.manual_seed(123) 9 | embedding_layer = torch.nn.Embedding(vocab_size, output_dim) 10 | 11 | print(embedding_layer.weight) 12 | 13 | print("\n", embedding_layer(torch.tensor([3]))) 14 | 15 | print("\n", embedding_layer(input_ids)) 16 | 17 | # Understanding the Difference Between Embedding Layers and Linear Layers (Bonus) 18 | # https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/03_bonus_embedding-vs-matmul/embeddings-and-linear-layers.ipynb 19 | -------------------------------------------------------------------------------- /02_Working_with_text_data/2.8_Encoding_word_positions/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import tiktoken 4 | 5 | 6 | class GPTDataSetV1(Dataset): 7 | def __init__(self, txt, tokenizer, max_length, stride): 8 | self.input_ids = [] 9 | self.target_ids = [] 10 | 11 | # Tokenize the entire text 12 | token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"}) 13 | 14 | # Use a sliding window to chunk the book into overlapping sequences of max_length 15 | for i in range(0, len(token_ids) - max_length, stride): 16 | input_chunk = token_ids[i:i + max_length] 17 | target_chunk = token_ids[i + 1: i + max_length + 1] 18 | self.input_ids.append(torch.tensor(input_chunk)) 19 | self.target_ids.append(torch.tensor(target_chunk)) 20 | 21 | def __len__(self): 22 | return len(self.input_ids) 23 | 24 | def __getitem__(self, idx): 25 | return self.input_ids[idx], self.target_ids[idx] 26 | 27 | 28 | def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0): 29 | # Initialize the tokenizer 30 | tokenizer = tiktoken.get_encoding("gpt2") 31 | 32 | # Create dataset 33 | dataset = GPTDataSetV1(txt, tokenizer, max_length, stride) 34 | 35 | # Create dataloader 36 | dataloader = DataLoader( 37 | dataset, 38 | batch_size=batch_size, 39 | shuffle=shuffle, 40 | drop_last=drop_last, 41 | num_workers=num_workers 42 | ) 43 | 44 | return dataloader 45 | 46 | 47 | with open("the-verdict.txt", "r", encoding="utf-8") as f: 48 | raw_text = f.read() 49 | 50 | 51 | vocab_size = 50257 52 | output_dim = 256 53 | token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim) 54 | 55 | max_length = 4 56 | dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False) 57 | data_iter = iter(dataloader) 58 | inputs, targets = next(data_iter) 59 | print("\nToken IDs:\n", inputs) 60 | print("\nInputs shape:\n", inputs.shape) 61 | 62 | token_embeddings = token_embedding_layer(inputs) 63 | 64 | print("\nToken Embeddings:", token_embeddings) 65 | print("\nToken Embeddings shape:\n", token_embeddings.shape) 66 | 67 | context_length = max_length 68 | pos_embedding_layer = torch.nn.Embedding(context_length, output_dim) 69 | pos_embeddings = pos_embedding_layer(torch.arange(context_length)) 70 | 71 | input_embeddings = token_embeddings + pos_embeddings 72 | 73 | print("\nPos embeddings:", pos_embeddings) 74 | print(pos_embeddings.shape) 75 | 76 | print("\nInput embeddings:", input_embeddings) 77 | print(input_embeddings.shape) 
-------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.1_A_simple_self_attention_mechanism_without_trainable_weights/images/attention mechanism.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.1_A_simple_self_attention_mechanism_without_trainable_weights/images/attention mechanism.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.1_A_simple_self_attention_mechanism_without_trainable_weights/images/vector similarity example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.1_A_simple_self_attention_mechanism_without_trainable_weights/images/vector similarity example.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.1_A_simple_self_attention_mechanism_without_trainable_weights/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | inputs = torch.tensor([ 5 | [0.43, 0.15, 0.89], # Your (x^1) 6 | [0.55, 0.87, 0.66], # journey (x^2) 7 | [0.57, 0.85, 0.64], # starts (x^3) 8 | [0.22, 0.58, 0.33], # with (x^4) 9 | [0.77, 0.25, 0.10], # one (x^5) 10 | [0.05, 0.80, 0.55] # step (x^6) 11 | ]) 12 | 13 | query = inputs[1] # journey (x^2) 14 | attn_scores_2 = torch.empty(inputs.shape[0]) # shape[0] = 6 15 | 16 | for i, x_i, in enumerate(inputs): 17 | attn_scores_2[i] = torch.dot(x_i, query) 18 | 19 | print("\nAttention Scores 2:\n", attn_scores_2, "\n") 20 | 21 | # Understanding dot product 22 | print("** Elements being multiplied in for loop **") 23 | res = 0 24 | for idx, element in enumerate(inputs[0]): 25 | print(inputs[0][idx], "*", query[idx]) 26 | res += inputs[0][idx] * query[idx] 27 | print("\nDot Product Manual Example:", res) 28 | print("Dot Product Torch Example:", torch.dot(inputs[0], query)) 29 | 30 | # Normalization 31 | attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum() 32 | print("\nAttention Weights:", attn_weights_2_tmp) 33 | print("Sum:", attn_weights_2_tmp.sum()) 34 | 35 | # Softmax Normalization 36 | def softmax_naive(x): 37 | return torch.exp(x) / torch.exp(x).sum(dim=0) 38 | 39 | attn_weights_2_naive = softmax_naive(attn_scores_2) 40 | print("\nAttention Weights Naive:", attn_weights_2_naive) 41 | print("Sum Naive:", attn_weights_2_naive.sum()) 42 | 43 | # PyTorch Softmax Normalization 44 | attn_weights_2_torch = torch.softmax(attn_scores_2, dim=0) 45 | print("\nAttention Weights PyTorch:", attn_weights_2_torch) 46 | print("Sum PyTorch:", attn_weights_2_torch.sum()) 47 | 48 | # Calculating the context vector for the 2nd input (journey (x^2)) 49 | query = inputs[1] 50 | context_vec_2 = torch.zeros(query.shape) # shape = 3 51 | for i, x_i in enumerate(inputs): 52 | context_vec_2 += 
attn_weights_2_torch[i] * x_i 53 | 54 | print("\nContext Vector 2:", context_vec_2) -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.2_Computing_attention_weights_for_all_input_tokens/images/attention weights heatmap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.2_Computing_attention_weights_for_all_input_tokens/images/attention weights heatmap.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.3_Attending_to_different_parts_of_the_input_with_self_attention/3.3.2_Computing_attention_weights_for_all_input_tokens/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | inputs = torch.tensor([ 5 | [0.43, 0.15, 0.89], # Your (x^1) 6 | [0.55, 0.87, 0.66], # journey (x^2) 7 | [0.57, 0.85, 0.64], # starts (x^3) 8 | [0.22, 0.58, 0.33], # with (x^4) 9 | [0.77, 0.25, 0.10], # one (x^5) 10 | [0.05, 0.80, 0.55] # step (x^6) 11 | ]) 12 | 13 | print(inputs.shape) 14 | 15 | attn_scores = torch.empty(6, 6) 16 | 17 | # For Loop Method 18 | for i, x_i in enumerate(inputs): 19 | for j, x_j in enumerate(inputs): 20 | attn_scores[i, j] = torch.dot(x_i, x_j) 21 | 22 | # Matrix Multiplication 23 | attn_scores = inputs @ inputs.T 24 | print("\nAttention Scores:\n", attn_scores) 25 | 26 | # Normalized 27 | attn_weights = torch.softmax(attn_scores, dim=-1) 28 | print("Attention Weights:\n", attn_weights, "\n") 29 | 30 | # Verification of Sum to 1 31 | row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581]) 32 | print("Row 2 Sum:", row_2_sum) 33 | print("All row sums:", attn_weights.sum(dim=-1)) 34 | 35 | # All Context Vectors 36 | all_context_vecs = attn_weights @ inputs 37 | print("\nAll Context Vectors:\n", all_context_vecs) 38 | 39 | 40 | # Verification of 2nd context vector 41 | query = inputs[1] # journey (x^2) 42 | attn_scores_2 = torch.empty(inputs.shape[0]) # shape[0] = 6 43 | 44 | for i, x_i, in enumerate(inputs): 45 | attn_scores_2[i] = torch.dot(x_i, query) 46 | 47 | attn_weights_2_torch = torch.softmax(attn_scores_2, dim=0) 48 | 49 | # Calculating the context vector for the 2nd input (journey (x^2)) 50 | query = inputs[1] 51 | context_vec_2 = torch.zeros(query.shape) # shape = 3 52 | for i, x_i in enumerate(inputs): 53 | context_vec_2 += attn_weights_2_torch[i] * x_i 54 | 55 | print("\nPrevious 2nd Context Vector:", context_vec_2) 56 | -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/input and output dimensions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/input and output dimensions.png -------------------------------------------------------------------------------- 
/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt1.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt2.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt3.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/q, k distinction, pt4.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt1.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt2.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt3.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/images/query, key, value, pt4.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.1_Computing_the_attention_weights_step_by_step/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | inputs = torch.tensor([ 5 | [0.43, 0.15, 0.89], # Your (x^1) 6 | [0.55, 0.87, 0.66], # journey (x^2) 7 | [0.57, 0.85, 0.64], # starts (x^3) 8 | [0.22, 0.58, 0.33], # with (x^4) 9 | [0.77, 0.25, 0.10], # one (x^5) 10 | [0.05, 0.80, 0.55] # step (x^6) 11 | ]) 12 | 13 | x_2 = inputs[1] # journey 14 | d_in = inputs.shape[1] # last embedding size, 3 15 | d_out = 2 # the output embedding size 16 | 17 | torch.manual_seed(123) 18 | W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False) 19 | W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False) 20 | W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False) 21 | 22 | query_2 = x_2 @ W_query 23 | key_2 = x_2 @ W_key 24 | value_2 = x_2 @ W_value 25 | 26 | print(query_2) # tensor([0.4306, 1.4551]) 27 | 28 | # Obtaining all keys and values via matrix multiplication 29 | keys = inputs @ W_key 30 | values = inputs @ W_value 31 | print("keys.shape:", keys.shape) 32 | print("values.shape:", values.shape) 33 | 34 | # attention score w22 35 | keys_2 = keys[1] 36 | attn_score_22 = query_2.dot(keys_2) 37 | print(attn_score_22) 38 | 39 | # all attention scores for given query 40 | attn_scores_2 = query_2 @ keys.T 41 | print(attn_scores_2) 42 | 43 | d_k = keys.shape[-1] 44 | attn_weights_2 = torch.softmax(attn_scores_2 / d_k ** 0.5, dim=-1) 45 | print(attn_weights_2) 46 | 47 | context_vec_2 = attn_weights_2 @ values 48 | print(context_vec_2) # context vector for "journey" 49 | -------------------------------------------------------------------------------- 
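The step-by-step code above computes the context vector for a single query ("journey") only. Batching the same matrix products yields the context vectors for all six tokens at once; a minimal sketch (not a file from this repo) that reuses the same seed and weight shapes, so its second row should match context_vec_2 up to floating-point rounding:

import torch

inputs = torch.tensor([
    [0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64],
    [0.22, 0.58, 0.33], [0.77, 0.25, 0.10], [0.05, 0.80, 0.55]
])

d_in, d_out = inputs.shape[1], 2
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

queries = inputs @ W_query    # (6, 2): one query vector per token
keys = inputs @ W_key         # (6, 2)
values = inputs @ W_value     # (6, 2)

attn_scores = queries @ keys.T                                           # (6, 6)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)  # scaled by sqrt(d_k)
context_vecs = attn_weights @ values                                     # (6, 2)

print(context_vecs[1])   # row for "journey"; compare with context_vec_2 printed above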
/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/attention weights heatmap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/attention weights heatmap.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt1.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt2.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt3.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt4.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt5.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt5.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt6.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt7.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/key vector token attributes, pt8.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/q,k,v,z.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/images/q,k,v,z.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/self-attention-class-v1.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | import torch 3 | 4 | 5 | class SelfAttention_v1(nn.Module): 6 | def __init__(self, d_in, d_out): 7 | super().__init__() 8 | self.W_query = nn.Parameter(torch.rand(d_in, 
d_out)) 9 | self.W_key = nn.Parameter(torch.rand(d_in, d_out)) 10 | self.W_value = nn.Parameter(torch.rand(d_in, d_out)) 11 | 12 | def forward(self, x): 13 | queries = x @ self.W_query 14 | keys = x @ self.W_key 15 | values = x @ self.W_value 16 | attn_scores = queries @ keys.T # omega 17 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 18 | context_vec = attn_weights @ values 19 | return context_vec 20 | 21 | 22 | inputs = torch.tensor([ 23 | [0.43, 0.15, 0.89], # Your (x^1) 24 | [0.55, 0.87, 0.66], # journey (x^2) 25 | [0.57, 0.85, 0.64], # starts (x^3) 26 | [0.22, 0.58, 0.33], # with (x^4) 27 | [0.77, 0.25, 0.10], # one (x^5) 28 | [0.05, 0.80, 0.55] # step (x^6) 29 | ]) 30 | 31 | torch.manual_seed(123) 32 | d_in = inputs.shape[1] # 3 33 | d_out = 2 34 | sa_v1 = SelfAttention_v1(d_in, d_out) 35 | 36 | # Since inputs contains six embedding vectors, this results 37 | # in a matrix storing the six context vectors. 38 | print(sa_v1(inputs)) -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.4_Implementing_self_attention_with_training_weights/3.4.2_Implementing_a_compact_self_attention_Python_class/self-attention-class-v2.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | import torch 3 | 4 | 5 | class SelfAttention_v2(nn.Module): 6 | def __init__(self, d_in, d_out, qkv_bias=False): 7 | super().__init__() 8 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 9 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) 10 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 11 | 12 | def forward(self, x): 13 | queries = self.W_query(x) 14 | keys = self.W_key(x) 15 | values = self.W_value(x) 16 | attn_scores = queries @ keys.T # omega 17 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 18 | context_vec = attn_weights @ values 19 | return context_vec 20 | 21 | 22 | inputs = torch.tensor([ 23 | [0.43, 0.15, 0.89], # Your (x^1) 24 | [0.55, 0.87, 0.66], # journey (x^2) 25 | [0.57, 0.85, 0.64], # starts (x^3) 26 | [0.22, 0.58, 0.33], # with (x^4) 27 | [0.77, 0.25, 0.10], # one (x^5) 28 | [0.05, 0.80, 0.55] # step (x^6) 29 | ]) 30 | 31 | torch.manual_seed(789) 32 | d_in = inputs.shape[1] # 3 33 | d_out = 2 34 | sa_v2 = SelfAttention_v2(d_in, d_out) 35 | 36 | # Since inputs contains six embedding vectors, this results 37 | # in a matrix storing the six context vectors. 
38 | print("SelfAttention_v2 Context Vectors:\n",sa_v2(inputs)) 39 | 40 | 41 | # print("W_query weights:") 42 | # print(sa_v2.W_query.weight) 43 | # print(sa_v2.W_query.weight.T) 44 | 45 | # Exercise 3.1 - Comparing SelfAttention_v1 and SelfAttention_v2 46 | 47 | class SelfAttention_v1(nn.Module): 48 | def __init__(self, d_in, d_out): 49 | super().__init__() 50 | self.W_query = nn.Parameter(torch.rand(d_in, d_out)) 51 | self.W_key = nn.Parameter(torch.rand(d_in, d_out)) 52 | self.W_value = nn.Parameter(torch.rand(d_in, d_out)) 53 | 54 | def forward(self, x): 55 | queries = x @ self.W_query 56 | keys = x @ self.W_key 57 | values = x @ self.W_value 58 | attn_scores = queries @ keys.T # omega 59 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 60 | context_vec = attn_weights @ values 61 | return context_vec 62 | 63 | sa_v1 = SelfAttention_v1(d_in, d_out) 64 | sa_v1.W_query = nn.Parameter(sa_v2.W_query.weight.T) 65 | sa_v1.W_key = nn.Parameter(sa_v2.W_key.weight.T) 66 | sa_v1.W_value = nn.Parameter(sa_v2.W_value.weight.T) 67 | 68 | print("\nSelfAttention_v1 Context Vectors:\n", sa_v1(inputs)) 69 | -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.1_Applying_a_causal_attention_mask/images/attn weights normalized with masked future tokens .png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.1_Applying_a_causal_attention_mask/images/attn weights normalized with masked future tokens .png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.1_Applying_a_causal_attention_mask/main.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | import torch 3 | 4 | 5 | class SelfAttention_v2(nn.Module): 6 | def __init__(self, d_in, d_out, qkv_bias=False): 7 | super().__init__() 8 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 9 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) 10 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 11 | 12 | def forward(self, x): 13 | queries = self.W_query(x) 14 | keys = self.W_key(x) 15 | values = self.W_value(x) 16 | attn_scores = queries @ keys.T # omega 17 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 18 | context_vec = attn_weights @ values 19 | return context_vec 20 | 21 | 22 | inputs = torch.tensor([ 23 | [0.43, 0.15, 0.89], # Your (x^1) 24 | [0.55, 0.87, 0.66], # journey (x^2) 25 | [0.57, 0.85, 0.64], # starts (x^3) 26 | [0.22, 0.58, 0.33], # with (x^4) 27 | [0.77, 0.25, 0.10], # one (x^5) 28 | [0.05, 0.80, 0.55] # step (x^6) 29 | ]) 30 | 31 | torch.manual_seed(789) 32 | d_in = inputs.shape[1] # 3 33 | d_out = 2 34 | sa_v2 = SelfAttention_v2(d_in, d_out) 35 | 36 | # sa_v2(inputs) 37 | 38 | # Manually Getting weights 39 | queries = sa_v2.W_query(inputs) 40 | keys = sa_v2.W_key(inputs) 41 | attn_scores = queries @ keys.T 42 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 43 | print("\nAttention Weights:\n", attn_weights) 44 | 45 | # Creating Mask 46 | context_length = attn_scores.shape[0] # 6 47 | mask_simple = torch.tril(torch.ones(context_length, 
context_length)) 48 | print("\nMask Simple:\n", mask_simple) 49 | 50 | # Multiply mask with attention weights to zero-out the values above the diagonal 51 | masked_simple = attn_weights * mask_simple 52 | print("\nMasked Simple (zero-out values above diag):\n", masked_simple) 53 | 54 | # Normalize the attention weights to sum up to 1 again in each row 55 | row_sums = masked_simple.sum(dim=-1, keepdim=True) 56 | print("\nRow Sums:\n", row_sums) 57 | masked_simple_norm = masked_simple / row_sums 58 | print("\nNormalized Masked Attention Weights:\n", masked_simple_norm) 59 | 60 | # Masking with 1's above the diagonal and replacing the 1s with negativt infinity (-inf) values, more efficient 61 | mask = torch.triu(torch.ones(context_length, context_length), diagonal=1) 62 | masked = attn_scores.masked_fill(mask.bool(), -torch.inf) 63 | print("\nAlternate Masking Method with -inf:\n", masked) 64 | 65 | # Normalizing alternate masked attention scores 66 | attn_weights_different = torch.softmax(masked / keys.shape[-1]**0.5, dim=1) 67 | print("\nAlternate Normalized Masked Attention Weights:\n", attn_weights_different) -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.2_Masking_additional_attention_weights_with_dropout/images/attn weights normalized with dropout and masked future tokens .png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.2_Masking_additional_attention_weights_with_dropout/images/attn weights normalized with dropout and masked future tokens .png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.2_Masking_additional_attention_weights_with_dropout/images/dropout.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.2_Masking_additional_attention_weights_with_dropout/images/dropout.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.2_Masking_additional_attention_weights_with_dropout/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | torch.manual_seed(123) 6 | 7 | dropout = torch.nn.Dropout(0.5) 8 | example = torch.ones(6, 6) 9 | 10 | print("\nPre-dropout:\n", example) 11 | print("\nPost-dropout:\n", dropout(example)) 12 | 13 | 14 | 15 | class SelfAttention_v2(nn.Module): 16 | def __init__(self, d_in, d_out, qkv_bias=False): 17 | super().__init__() 18 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 19 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) 20 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 21 | 22 | def forward(self, x): 23 | queries = self.W_query(x) 24 | keys = self.W_key(x) 25 | values = self.W_value(x) 26 | attn_scores = queries @ keys.T # omega 27 | attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) 28 | context_vec = 
attn_weights @ values 29 | return context_vec 30 | 31 | 32 | inputs = torch.tensor([ 33 | [0.43, 0.15, 0.89], # Your (x^1) 34 | [0.55, 0.87, 0.66], # journey (x^2) 35 | [0.57, 0.85, 0.64], # starts (x^3) 36 | [0.22, 0.58, 0.33], # with (x^4) 37 | [0.77, 0.25, 0.10], # one (x^5) 38 | [0.05, 0.80, 0.55] # step (x^6) 39 | ]) 40 | 41 | d_in = inputs.shape[1] # 3 42 | d_out = 2 43 | sa_v2 = SelfAttention_v2(d_in, d_out) 44 | 45 | 46 | # Manually Getting weights 47 | queries = sa_v2.W_query(inputs) 48 | keys = sa_v2.W_key(inputs) 49 | attn_scores = queries @ keys.T 50 | 51 | context_length = attn_scores.shape[0] # 6 52 | 53 | # Masking with 1's above the diagonal and replacing the 1s with negativt infinity (-inf) values, more efficient 54 | mask = torch.triu(torch.ones(context_length, context_length), diagonal=1) 55 | masked = attn_scores.masked_fill(mask.bool(), -torch.inf) 56 | print("\nAlternate Masking Method with -inf:\n", masked) 57 | 58 | # Normalizing alternate masked attention scores 59 | attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=1) 60 | print("\nAlternate Normalized Masked Attention Weights:\n", attn_weights) 61 | 62 | print("\nAttention Weights with Dropout:\n", dropout(attn_weights)) -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.5_Hiding_future_words_with_causal_attention/3.5.3_Implementing_a_compact_causal_attention_class/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class CausalAttention(nn.Module): 6 | def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False): 7 | super().__init__() 8 | self.d_out = d_out 9 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 10 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) 11 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 12 | self.dropout = nn.Dropout(dropout) 13 | self.register_buffer( 14 | "mask", 15 | torch.triu(torch.ones(context_length, context_length), diagonal=1) 16 | ) 17 | 18 | def forward(self, x): 19 | b, num_tokens, d_in = x.shape 20 | queries = self.W_query(x) 21 | keys = self.W_key(x) 22 | values = self.W_value(x) 23 | 24 | attn_scores = queries @ keys.transpose(1, 2) 25 | attn_scores.masked_fill_( 26 | self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) 27 | attn_weights = torch.softmax( 28 | attn_scores / keys.shape[-1]**0.5, dim=-1 29 | ) 30 | attn_weights = self.dropout(attn_weights) 31 | 32 | context_vec = attn_weights @ values 33 | return context_vec 34 | 35 | 36 | inputs = torch.tensor([ 37 | [0.43, 0.15, 0.89], # Your (x^1) 38 | [0.55, 0.87, 0.66], # journey (x^2) 39 | [0.57, 0.85, 0.64], # starts (x^3) 40 | [0.22, 0.58, 0.33], # with (x^4) 41 | [0.77, 0.25, 0.10], # one (x^5) 42 | [0.05, 0.80, 0.55] # step (x^6) 43 | ]) 44 | 45 | 46 | batch = torch.stack((inputs, inputs), dim=0) 47 | print("\nBatch of inputs:\n", batch) 48 | print("\nBatch shape:\n", batch.shape) 49 | 50 | torch.manual_seed(123) 51 | d_in = inputs.shape[1] # 3 52 | d_out = 2 53 | context_length = batch.shape[1] # 6 54 | ca = CausalAttention(d_in, d_out, context_length, 0.0) 55 | context_vecs = ca(batch) 56 | 57 | print("\nContext Vectors:\n", context_vecs) 58 | print("\nContext Vectors Shape:", context_vecs.shape) -------------------------------------------------------------------------------- 
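One point worth making explicit about the two masking strategies from 3.5.1: zeroing the softmaxed weights above the diagonal and renormalizing each row gives exactly the same result as filling the raw scores above the diagonal with -inf before the softmax, because exp(-inf) = 0 and the softmax denominator then only sums over the kept positions. A minimal standalone check (not a file from this repo; random scores stand in for queries @ keys.T):

import torch

torch.manual_seed(123)
context_length = 6
d_k = 2
attn_scores = torch.randn(context_length, context_length)   # stand-in for queries @ keys.T

# Method 1: softmax, zero out future positions, renormalize each row
weights = torch.softmax(attn_scores / d_k**0.5, dim=-1)
mask_simple = torch.tril(torch.ones(context_length, context_length))
masked_simple = weights * mask_simple
masked_simple_norm = masked_simple / masked_simple.sum(dim=-1, keepdim=True)

# Method 2: set future scores to -inf, then apply softmax once
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
weights_inf = torch.softmax(masked / d_k**0.5, dim=-1)

print(torch.allclose(masked_simple_norm, weights_inf))   # expected: True (agreement up to rounding)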
/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.1_Stacking_multiple_single_head_attention_layers/images/multi head attention output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.1_Stacking_multiple_single_head_attention_layers/images/multi head attention output.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.1_Stacking_multiple_single_head_attention_layers/images/multi head attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.1_Stacking_multiple_single_head_attention_layers/images/multi head attention.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.1_Stacking_multiple_single_head_attention_layers/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class CausalAttention(nn.Module): 6 | def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False): 7 | super().__init__() 8 | self.d_out = d_out 9 | self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) 10 | self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) 11 | self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) 12 | self.dropout = nn.Dropout(dropout) 13 | self.register_buffer( 14 | "mask", 15 | torch.triu(torch.ones(context_length, context_length), diagonal=1) 16 | ) 17 | 18 | def forward(self, x): 19 | b, num_tokens, d_in = x.shape 20 | queries = self.W_query(x) 21 | keys = self.W_key(x) 22 | values = self.W_value(x) 23 | 24 | attn_scores = queries @ keys.transpose(1, 2) 25 | attn_scores.masked_fill_( 26 | self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) 27 | attn_weights = torch.softmax( 28 | attn_scores / keys.shape[-1]**0.5, dim=-1 29 | ) 30 | attn_weights = self.dropout(attn_weights) 31 | 32 | context_vec = attn_weights @ values 33 | return context_vec 34 | 35 | 36 | class MultiHeadAttentionWrapper(nn.Module): 37 | def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False): 38 | super().__init__() 39 | self.heads = nn.ModuleList( 40 | [CausalAttention( 41 | d_in, d_out, context_length, dropout, qkv_bias 42 | ) 43 | for _ in range(num_heads)] 44 | ) 45 | 46 | def forward(self, x): 47 | for head in self.heads: 48 | print(head(x)) 49 | return torch.cat([head(x) for head in self.heads], dim=-1) 50 | 51 | 52 | inputs = torch.tensor([ 53 | [0.43, 0.15, 0.89], # Your (x^1) 54 | [0.55, 0.87, 0.66], # journey (x^2) 55 | [0.57, 0.85, 0.64], # starts (x^3) 56 | [0.22, 0.58, 0.33], # with (x^4) 57 | [0.77, 0.25, 0.10], # one (x^5) 58 | [0.05, 0.80, 0.55] # step (x^6) 59 | ]) 60 | 61 | 62 | batch = torch.stack((inputs, inputs), dim=0) 63 | print("\nBatch of inputs:\n", batch) 64 | print("\nBatch shape:\n", batch.shape) 65 | 66 | torch.manual_seed(123) 67 | d_in = 3 68 | d_out = 2 69 
| context_length = batch.shape[1] # 6, number of tokens 70 | 71 | mha = MultiHeadAttentionWrapper(d_in, d_out, context_length, 0.0, num_heads=2) 72 | context_vecs = mha(batch) 73 | 74 | print("\nMulti Head Attn - Context Vectors:\n", context_vecs) 75 | print("\nMulti Head Attn - Context Vectors Shape:", context_vecs.shape) 76 | 77 | # Exercise 3.2 Returning two-dimensional embedding vectors 78 | d_out = 1 79 | mha_two_dim = MultiHeadAttentionWrapper(d_in, d_out, context_length, 0.0, num_heads=2) 80 | context_vecs_two_dim = mha_two_dim(batch) 81 | 82 | print("\nMulti Head Attn 2 Dimensional - Context Vectors:\n", context_vecs_two_dim) 83 | print("\nMulti Head Attn 2 Dimensional - Context Vectors Shape:", context_vecs_two_dim.shape) 84 | -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/batched_matrix_multiplication.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | # (b, num_heads, num_tokens, head_dim) = (1, 2, 3, 4) 5 | a = torch.tensor([[[[0.2745, 0.6584, 0.2775, 0.8573], 6 | [0.8993, 0.0390, 0.9268, 0.7388], 7 | [0.7179, 0.7058, 0.9156, 0.4340]], 8 | 9 | [[0.0772, 0.3565, 0.1479, 0.5331], 10 | [0.4066, 0.2318, 0.4545, 0.9737], 11 | [0.4606, 0.5159, 0.4220, 0.5786]]]]) 12 | 13 | 14 | print("\nTransposed Matrix:\n", a.transpose(2, 3)) 15 | print("\nTransposed Matrix Shape:\n", a.transpose(2, 3).shape) # (1, 2, 4, 3) 16 | 17 | print("\nMultiplication Result:\n", a @ a.transpose(2, 3)) 18 | 19 | 20 | first_head = a[0, 0, :, :] 21 | print("\nFirst Head:\n", first_head) 22 | first_res = first_head @ first_head.T 23 | print("\nFirst Head Result:\n", first_res) 24 | 25 | second_head = a[0, 1, :, :] 26 | print("\nSecond Head:\n", second_head) 27 | second_res = second_head @ second_head.T 28 | print("\nSecond Head Result:\n", second_res) -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/diagram.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/gpt-2 param explanation, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/gpt-2 param explanation, pt1.png -------------------------------------------------------------------------------- 
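The batched matrix multiplication demonstrated above is the core trick behind 3.6.2: instead of stacking separate CausalAttention heads, a single large projection is reshaped into (batch, num_heads, num_tokens, head_dim) so that every head's attention scores come out of one batched matmul. A condensed illustrative sketch of that weight-split idea (this is not the repo's main.py; the class name and variable names are my own):

import torch
import torch.nn as nn


class MultiHeadAttentionSketch(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)   # mixes the concatenated head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project once, then split the output dimension into (num_heads, head_dim)
        queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        # Batched matmul over (b, num_heads, num_tokens, head_dim), as in the helper script above
        attn_scores = queries @ keys.transpose(2, 3)
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Recombine the heads: (b, num_heads, num_tokens, head_dim) -> (b, num_tokens, d_out)
        context_vec = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, self.num_heads * self.head_dim)
        return self.out_proj(context_vec)


torch.manual_seed(123)
inputs = torch.tensor([
    [0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64],
    [0.22, 0.58, 0.33], [0.77, 0.25, 0.10], [0.05, 0.80, 0.55]
])
batch = torch.stack((inputs, inputs), dim=0)   # shape (2, 6, 3)
mha = MultiHeadAttentionSketch(d_in=3, d_out=2, context_length=6, dropout=0.0, num_heads=2)
print(mha(batch).shape)   # torch.Size([2, 6, 2])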
/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/gpt-2 param explanation, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/gpt-2 param explanation, pt2.png -------------------------------------------------------------------------------- /03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/gpt-2 param explanation, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/03_Coding_attention_mechanisms/3.6_Extending_single_head_attention_to_multi_head_attention/3.6.2_Implementing_multi_head_attention_with_weight_splits/images/gpt-2 param explanation, pt3.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.1_Coding_an_LLM_Architecture/gpt_config.py: -------------------------------------------------------------------------------- 1 | 2 | GPT_CONFIG_124M = { 3 | "vocab_size": 50257, # Vocabulary size 4 | "context_length": 1024, # Context length 5 | "emb_dim": 768, # Embedding dimension 6 | "n_heads": 12, # Number of attention heads 7 | "n_layers": 12, # Number of layers 8 | "drop_rate": 0.1, # Dropout rate 9 | "qkv_bias": False # Query-Key-Value bias 10 | } 11 | 12 | 13 | # vocab_size: refers to a vocabulary of 50,257 words, as used by the BPE tokenizer (see chapter 2). 14 | 15 | # context_length: denotes the maximum number of input tokens the model can handle via the positional embeddings (see chapter 2). 16 | 17 | # emb_dim: represents the embedding size, transforming each token into a 768- dimensional vector. 18 | 19 | # n_heads: indicates the count of attention heads in the multi-head attention mechanism (see chapter 3). 20 | 21 | # n_layers: specifies the number of transformer blocks in the model, which we will cover in the upcoming discussion. 22 | 23 | # drop_rate: indicates the intensity of the dropout mechanism (0.1 implies a 10% random drop out of hidden units) to prevent overfitting (see chapter 3). 24 | 25 | # qkv_bias: determines whether to include a bias vector in the Linear layers of the multi-head attention for query, key, and value computations. 26 | # We will initially disable this, following the norms of modern LLMs, but we will revisit it in chapter 6 when we load pretrained GPT-2 weights 27 | # from OpenAI into our model (see chapter 6). 
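# Added note (not part of the original file): a quick sanity check of what these numbers imply.
# Each attention head operates on emb_dim / n_heads = 768 / 12 = 64 dimensions, and the two
# embedding tables alone account for roughly 39M of the model's parameters.
head_dim = GPT_CONFIG_124M["emb_dim"] // GPT_CONFIG_124M["n_heads"]               # 64
tok_emb_params = GPT_CONFIG_124M["vocab_size"] * GPT_CONFIG_124M["emb_dim"]       # 50257 * 768 = 38,597,376
pos_emb_params = GPT_CONFIG_124M["context_length"] * GPT_CONFIG_124M["emb_dim"]   # 1024 * 768 = 786,432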
28 | -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.1_Coding_an_LLM_Architecture/images/logits explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.1_Coding_an_LLM_Architecture/images/logits explanation.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.1_Coding_an_LLM_Architecture/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import tiktoken 4 | 5 | 6 | class DummyGPTModel(nn.Module): 7 | def __init__(self, cfg): 8 | super().__init__() 9 | self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"]) 10 | self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"]) 11 | self.drop_emb = nn.Dropout(cfg["drop_rate"]) 12 | 13 | # Use a placeholder for TransformerBlock 14 | self.trf_blocks = nn.Sequential( 15 | *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])] 16 | ) 17 | 18 | # Use a placeholder for LayerNorm 19 | self.final_norm = DummyLayerNorm(cfg["emb_dim"]) 20 | self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False) 21 | 22 | def forward(self, in_idx): 23 | batch_size, seq_len = in_idx.shape 24 | tok_embeds = self.tok_emb(in_idx) 25 | pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device)) 26 | x = tok_embeds + pos_embeds 27 | x = self.drop_emb(x) 28 | x = self.trf_blocks(x) 29 | x = self.final_norm(x) 30 | logits = self.out_head(x) 31 | return logits 32 | 33 | 34 | class DummyTransformerBlock(nn.Module): 35 | def __init__(self, cfg): 36 | super().__init__() 37 | # A simple placeholder 38 | 39 | def forward(self, x): 40 | # This block does nothing and just returns its input. 41 | return x 42 | 43 | 44 | class DummyLayerNorm(nn.Module): 45 | def __init__(self, normalized_shape, eps=1e-5): 46 | super().__init__() 47 | # The parameters here are just to mimic the LayerNorm interface. 48 | 49 | def forward(self, x): 50 | # This layer does nothing and just returns its input. 
51 | return x 52 | 53 | 54 | tokenizer = tiktoken.get_encoding("gpt2") 55 | batch = [] 56 | txt1 = "Every effort moves you" 57 | txt2 = "Every day holds a" 58 | 59 | batch.append(torch.tensor(tokenizer.encode(txt1))) 60 | batch.append(torch.tensor(tokenizer.encode(txt2))) 61 | 62 | batch = torch.stack(batch, dim=0) 63 | print(batch) 64 | 65 | 66 | GPT_CONFIG_124M = { 67 | "vocab_size": 50257, # Vocabulary size 68 | "context_length": 1024, # Context length 69 | "emb_dim": 768, # Embedding dimension 70 | "n_heads": 12, # Number of attention heads 71 | "n_layers": 12, # Number of layers 72 | "drop_rate": 0.1, # Dropout rate 73 | "qkv_bias": False # Query-Key-Value bias 74 | } 75 | 76 | torch.manual_seed(123) 77 | model = DummyGPTModel(GPT_CONFIG_124M) 78 | # print(model) 79 | logits = model(batch) 80 | 81 | print("\nOutput shape:\n", logits.shape) 82 | print("Logits:\n", logits) -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/biased variance broken down.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/biased variance broken down.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/dim parameter, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/dim parameter, pt1.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/dim parameter, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/dim parameter, pt2.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/forward pass & after check, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/forward pass & after check, pt1.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/forward pass & after check, pt2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/forward pass & after check, pt2.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/forward pass & after check, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/forward pass & after check, pt3.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization explained, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization explained, pt1.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization explained, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization explained, pt2.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization explained, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization explained, pt3.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/layer normalization.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/variance calculation, pt1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/variance calculation, pt1.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/variance calculation, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/images/variance calculation, pt2.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class LayerNorm(nn.Module): 6 | def __init__(self, emb_dim): 7 | super().__init__() 8 | self.eps = 1e-5 # small constant "epsilon" added to the variance, prevents division by zero during normalization 9 | self.scale = nn.Parameter(torch.ones(emb_dim)) # trainable param, LLM will adjust this during training 10 | self.shift = nn.Parameter(torch.zeros(emb_dim)) # trainable param, LLM will adjust this during training 11 | 12 | def forward(self, x): 13 | mean = x.mean(dim=-1, keepdim=True) 14 | var = x.var(dim=-1, keepdim=True, unbiased=False) 15 | norm_x = (x - mean) / torch.sqrt(var + self.eps) 16 | return self.scale * norm_x + self.shift 17 | 18 | 19 | torch.manual_seed(123) 20 | torch.set_printoptions(sci_mode=False) 21 | 22 | batch_example = torch.randn(2, 5) 23 | print(batch_example) 24 | 25 | ln = LayerNorm(emb_dim=5) 26 | out_ln = ln(batch_example) 27 | 28 | # Verification that the mean = 0 and variance = 1 29 | mean = out_ln.mean(dim=-1, keepdim=True) 30 | var = out_ln.var(dim=-1, unbiased=False, keepdim=True) 31 | 32 | print("Mean:\n", mean) 33 | print("Variance:\n", var) -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.2_Normalizing_activations_with_layer_normalization/normalization.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | torch.manual_seed(123) 6 | 7 | batch_example = torch.randn(2, 5) 8 | layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU()) 9 | out = layer(batch_example) 10 | 11 | print("\nLayer outputs:\n", out) 12 | 13 | # Mean values for both row 1 and row 2 14 | mean = out.mean(dim=-1, keepdim=True) 15 | 16 | var = out.var(dim=-1, keepdim=True) 17 | print("\nMean:\n", mean) 18 | print("Variance:\n", var) 19 | 20 | print("\n------------------------") 21 | # Layer Normalization 22 | out_norm = (out - mean) / torch.sqrt(var) 23 | mean = out_norm.mean(dim=-1, keepdim=True) 24 | var = out_norm.var(dim=-1, keepdim=True) 25 | print("\nNormalized layer outputs:\n", out_norm) 26 | print("Mean:\n", mean) 27 | print("Variance:\n", var) 28 | 29 | print("\n------------------------") 30 | # Removing scientific notation 31 | torch.set_printoptions(sci_mode=False) 32 | print("Mean:\n", mean) 33 | print("Variance:\n", var) 34 | 35 | 
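# Added note (not part of the original file): out.var(...) above uses PyTorch's default
# unbiased=True, i.e. it divides the sum of squared deviations by n - 1 (Bessel's correction).
# The LayerNorm class in main.py instead passes unbiased=False, dividing by n. With only
# six activations per row the two results differ slightly:
var_unbiased = out.var(dim=-1, keepdim=True)                    # divides by n - 1
var_biased = out.var(dim=-1, keepdim=True, unbiased=False)      # divides by n
print("Unbiased vs biased variance:\n", torch.cat([var_unbiased, var_biased], dim=-1))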
-------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/gelu.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import matplotlib.pyplot as plt 4 | 5 | 6 | class GELU(nn.Module): 7 | def __init__(self): 8 | super().__init__() 9 | 10 | def forward(self, x): 11 | return 0.5 * x * (1 + torch.tanh( 12 | torch.sqrt(torch.tensor(2.0 / torch.pi)) * 13 | (x + 0.044715 * torch.pow(x, 3)) 14 | )) 15 | 16 | 17 | gelu = GELU() 18 | relu = nn.ReLU() 19 | 20 | x = torch.linspace(-3, 3, 100) # creates 100 sample data points in the range -3 to 3 21 | y_gelu = gelu(x) 22 | y_relu = relu(x) 23 | 24 | plt.figure(figsize=(8, 3)) 25 | 26 | for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1): 27 | plt.subplot(1, 2, i) 28 | plt.plot(x, y) 29 | # plt.plot(x, y, marker='o', linestyle='-', markersize=3) # Add marker='o' and markersize=3 30 | plt.title(f"{label} activation function") 31 | plt.xlabel("x") 32 | plt.ylabel(f"{label}(x)") 33 | plt.grid(True) 34 | 35 | plt.tight_layout() 36 | plt.show() -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/images/fnn diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/images/fnn diagram.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/images/gelu and relu plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/images/gelu and relu plot.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/images/input into feedforward neural net (fnn).png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/images/input into feedforward neural net (fnn).png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.3_Implementing_a_feed_forward_network_with_GELU_activations/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class FeedForward(nn.Module): 6 | def __init__(self, cfg): 7 | super().__init__() 8 | self.layers = nn.Sequential( 9 | nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]), 10 | GELU(), 11 | nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]) 12 | ) 13 | 14 | def 
forward(self, x): 15 | return self.layers(x) 16 | 17 | 18 | class GELU(nn.Module): 19 | def __init__(self): 20 | super().__init__() 21 | 22 | def forward(self, x): 23 | return 0.5 * x * (1 + torch.tanh( 24 | torch.sqrt(torch.tensor(2.0 / torch.pi)) * 25 | (x + 0.044715 * torch.pow(x, 3)) 26 | )) 27 | 28 | 29 | GPT_CONFIG_124M = { 30 | "vocab_size": 50257, # Vocabulary size 31 | "context_length": 1024, # Context length 32 | "emb_dim": 768, # Embedding dimension 33 | "n_heads": 12, # Number of attention heads 34 | "n_layers": 12, # Number of layers 35 | "drop_rate": 0.1, # Dropout rate 36 | "qkv_bias": False # Query-Key-Value bias 37 | } 38 | 39 | ffn = FeedForward(GPT_CONFIG_124M) 40 | print("\nFNN Architecture:\n", ffn) 41 | 42 | x = torch.rand(2, 3, 768) # sample input with batch dimension 2 43 | out = ffn(x) 44 | 45 | print("\nFNN 1st Sample:\n", out[0]) 46 | print("\n1st Sample Shape:\n", out[0].shape) -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.4_Adding_shortcut_connections/images/shortcut connections.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.4_Adding_shortcut_connections/images/shortcut connections.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.4_Adding_shortcut_connections/main.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class ExampleDeepNeuralNetwork(nn.Module): 6 | def __init__(self, layer_sizes, use_shortcut): 7 | super().__init__() 8 | self.use_shortcut = use_shortcut 9 | self.layers = nn.ModuleList([ 10 | nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()), 11 | nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()), 12 | nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()), 13 | nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()), 14 | nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU()), 15 | ]) 16 | 17 | def forward(self, x): 18 | for layer in self.layers: 19 | layer_output = layer(x) # compute the output of the current layer 20 | if self.use_shortcut and x.shape == layer_output.shape: # check if shortcut can be applied 21 | x = x + layer_output 22 | else: 23 | x = layer_output 24 | return x 25 | 26 | 27 | class GELU(nn.Module): 28 | def __init__(self): 29 | super().__init__() 30 | 31 | def forward(self, x): 32 | return 0.5 * x * (1 + torch.tanh( 33 | torch.sqrt(torch.tensor(2.0 / torch.pi)) * 34 | (x + 0.044715 * torch.pow(x, 3)) 35 | )) 36 | 37 | torch.manual_seed(123) 38 | layer_sizes = [3, 3, 3, 3, 3, 1] 39 | sample_input = torch.tensor([[1., 0., -1.]]) 40 | 41 | # model without shortcut 42 | model_without_shortcut = ExampleDeepNeuralNetwork( 43 | layer_sizes, use_shortcut=False 44 | ) 45 | 46 | def print_gradients(model, x): 47 | output = model(x) # forward pass 48 | target = torch.tensor([[0.]]) 49 | 50 | loss = nn.MSELoss() # calculate loss based on how close the target and output are 51 | loss = loss(output, target) 52 | 53 | loss.backward() # backward pass to calculate gradients 54 | 55 | for name, param in model.named_parameters(): 56 | if "weight" in name: 57 | print(f"{name} has gradient mean of 
{param.grad.abs().mean().item()}") 58 | 59 | print("Model without shortcut:") 60 | print_gradients(model_without_shortcut, sample_input) # vanishing gradient problem occurs here 61 | 62 | # model with skip / shortcut connections 63 | model_with_shortcut = ExampleDeepNeuralNetwork( 64 | layer_sizes, use_shortcut=True 65 | ) 66 | 67 | print("\nModel with shortcut:") 68 | print_gradients(model_with_shortcut, sample_input) -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.5_Connecting_attention_and_linear_layers_in_a_transformer_block/images/transformer block.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.5_Connecting_attention_and_linear_layers_in_a_transformer_block/images/transformer block.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.6_Coding_the_GPT_Model/images/gpt2 architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.6_Coding_the_GPT_Model/images/gpt2 architecture.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.7_Generating_text/images/iterations of a token prediction cycle.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.7_Generating_text/images/iterations of a token prediction cycle.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.7_Generating_text/images/mechanics of text generation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.7_Generating_text/images/mechanics of text generation.png -------------------------------------------------------------------------------- /04_Implementing_a_GPT_model_to_generate_text/4.7_Generating_text/images/step by step text generation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/04_Implementing_a_GPT_model_to_generate_text/4.7_Generating_text/images/step by step text generation.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.1_Using_GPT_to_generate_text/images/chapter topics.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.1_Using_GPT_to_generate_text/images/chapter topics.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.1_Using_GPT_to_generate_text/images/gpt build stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.1_Using_GPT_to_generate_text/images/gpt build stages.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.1_Using_GPT_to_generate_text/images/tokenizer placement in flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.1_Using_GPT_to_generate_text/images/tokenizer placement in flow.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/loss calculation steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/loss calculation steps.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/next tokens.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/next tokens.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/perplexity score explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/perplexity score explanation.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/text generation process.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.2_Calculating_the_text_generation_loss/images/text generation process.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.3_Calculating_the_training_and_validation_set_losses/images/dataloaders.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.3_Calculating_the_training_and_validation_set_losses/images/dataloaders.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.1_Evaluating_generative_text_models/5.1.3_Calculating_the_training_and_validation_set_losses/loading_dataset.py: -------------------------------------------------------------------------------- 1 | from torch.utils.data import Dataset, DataLoader 2 | import tiktoken 3 | import torch 4 | 5 | 6 | class GPTDataSetV1(Dataset): 7 | def __init__(self, txt, tokenizer, max_length, stride): 8 | self.input_ids = [] 9 | self.target_ids = [] 10 | 11 | # Tokenize the entire text 12 | token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"}) 13 | 14 | # Use a sliding window to chunk the book into overlapping sequences of max_length 15 | for i in range(0, len(token_ids) - max_length, stride): 16 | input_chunk = token_ids[i:i + max_length] 17 | target_chunk = token_ids[i + 1: i + max_length + 1] 18 | self.input_ids.append(torch.tensor(input_chunk)) 19 | self.target_ids.append(torch.tensor(target_chunk)) 20 | 21 | def __len__(self): 22 | return len(self.input_ids) 23 | 24 | def __getitem__(self, idx): 25 | return self.input_ids[idx], self.target_ids[idx] 26 | 27 | 28 | def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0): 29 | # Initialize the tokenizer 30 | tokenizer = tiktoken.get_encoding("gpt2") 31 | 32 | # Create dataset 33 | dataset = GPTDataSetV1(txt, tokenizer, max_length, stride) 34 | 35 | # Create dataloader 36 | dataloader = DataLoader( 37 | dataset, 38 | batch_size=batch_size, 39 | shuffle=shuffle, 40 | drop_last=drop_last, 41 | num_workers=num_workers 42 | ) 43 | 44 | return dataloader 45 | 46 | 47 | torch.manual_seed(123) 48 | 49 | file_path = "the-verdict.txt" 50 | with open(file_path, "r", encoding="utf-8") as file: 51 | text_data = file.read() 52 | 53 | 54 | tokenizer = tiktoken.get_encoding("gpt2") 55 | 56 | total_characters = len(text_data) 57 | total_tokens = len(tokenizer.encode(text_data)) 58 | 59 | print("Characters:", total_characters) 60 | print("Tokens:", total_tokens) 61 | 62 | # Data splitting into training and validation datasets 63 | train_ratio = 0.90 64 | split_idx = int(train_ratio * len(text_data)) 65 | 66 | train_data = text_data[:split_idx] 67 | val_data = text_data[split_idx:] 68 | 69 | 70 | # known as "GPT-2 small" 71 | GPT_CONFIG_124M = { 72 | "vocab_size": 50257, # Vocabulary size 73 | "context_length": 256, # Context length 74 | "emb_dim": 768, # Embedding dimension 75 | "n_heads": 12, # Number of attention heads 76 | "n_layers": 12, # Number of layers 77 | "drop_rate": 0.1, # Dropout rate 78 | "qkv_bias": False # 
Query-Key-Value bias 79 | } 80 | 81 | 82 | train_loader = create_dataloader_v1( 83 | train_data, 84 | batch_size=2, 85 | max_length=GPT_CONFIG_124M["context_length"], 86 | stride=GPT_CONFIG_124M["context_length"], 87 | drop_last=True, 88 | shuffle=True, 89 | num_workers=0 90 | ) 91 | 92 | val_loader = create_dataloader_v1( 93 | val_data, 94 | batch_size=2, 95 | max_length=GPT_CONFIG_124M["context_length"], 96 | stride=GPT_CONFIG_124M["context_length"], 97 | drop_last=False, 98 | shuffle=False, 99 | num_workers=0 100 | ) 101 | 102 | print("\nTrain loader:") 103 | for x, y in train_loader: 104 | print(x.shape, y.shape) 105 | 106 | print("\nValidation loader:") 107 | for x, y in val_loader: 108 | print(x.shape, y.shape) -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/loss-plot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/loss-plot.pdf -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/plot explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/plot explanation.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/training loop.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/training loop.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/training process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.2_Training_an_LLM/images/training process.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.1_Temperature_Scaling/images/temperature explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.1_Temperature_Scaling/images/temperature explanation.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.1_Temperature_Scaling/images/temperature-plot.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.1_Temperature_Scaling/images/temperature-plot.pdf -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.2_Top_k_sampling/images/top k steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.2_Top_k_sampling/images/top k steps.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.3_Decoding_strategies_to_control_randomness/5.3.2_Top_k_sampling/top_k.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | # Assume the LLM is given the start context "every effort moves you" and 5 | # generates the following next-token logits: 6 | next_token_logits = torch.tensor( 7 | [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79] 8 | ) 9 | 10 | top_k = 3 11 | top_logits, top_pos = torch.topk(next_token_logits, top_k) 12 | print("Top logits:", top_logits) 13 | print("Top positions:", top_pos) 14 | 15 | new_logits = torch.where( 16 | condition=next_token_logits < top_logits[-1], # identifies logits less than the minimum in the top 3 17 | input=torch.tensor(float("-inf")), # assigns -inf to these lower logits 18 | other=next_token_logits # retains the original logits for all other tokens 19 | ) 20 | 21 | print(new_logits) 22 | 23 | # An alternative, slightly more efficient implementation of the previous code 24 | new_logits_alt = torch.full_like( # create tensor containing -inf values 25 | next_token_logits, -torch.inf 26 | ) 27 | new_logits_alt[top_pos] = next_token_logits[top_pos] # copy top k values into the -inf tensor 28 | 29 | print(new_logits_alt) 30 | 31 | # ----- 32 | topk_probas = torch.softmax(new_logits, dim=0) 33 | print(topk_probas) 34 | 35 | -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt1.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt2.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt3.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt3.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.4_Loading_and_saving_model_weights_in_Pytorch/images/torch no_grad explained, pt4.png -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.5_Loading_pretrained_weights_from_OpenAI/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.5_Loading_pretrained_weights_from_OpenAI/gpt_setup/__init__.py -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.5_Loading_pretrained_weights_from_OpenAI/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | 
gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | gpt.trf_blocks[b].norm1.shift = assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. -------------------------------------------------------------------------------- /05_Pretraining_on_unlabeled_data/5.5_Loading_pretrained_weights_from_OpenAI/images/gpt architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/05_Pretraining_on_unlabeled_data/5.5_Loading_pretrained_weights_from_OpenAI/images/gpt architecture.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from download import data_file_path 3 | 4 | 5 | df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"]) 6 | 7 | # print(df) 8 | # print(df["Label"].value_counts()) 9 | 10 | def create_balanced_dataset(df): 11 | 12 | # Count the instances of "spam" 13 | num_spam = df[df["Label"] == "spam"].shape[0] 14 | 15 | # Randomly sample "ham" instances to match the number of "spam" instances 16 | ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123) 17 | 18 | # Combine ham "subset" with "spam" 19 | balanced_df = pd.concat([ham_subset, df[df["Label"] == "spam"]]) 20 | 21 | return balanced_df 22 | 23 | 24 | # Split dataset into 3 parts. These ratios are common in machine learning to train, adjust, and evaluate models. 
25 | # Training = 70% 26 | # Validation = 10% 27 | # Testing = 20% 28 | def random_split(df, train_frac, validation_frac): 29 | 30 | # Shuffle the entire DataFrame 31 | df = df.sample(frac=1, random_state=123).reset_index(drop=True) 32 | 33 | # Calculate the split indices 34 | train_end = int(len(df) * train_frac) 35 | validation_end = train_end + int(len(df) * validation_frac) 36 | 37 | # Split the DataFrame 38 | train_df = df[:train_end] 39 | validation_df = df[train_end:validation_end] 40 | test_df = df[validation_end:] 41 | 42 | return train_df, validation_df, test_df 43 | 44 | 45 | balanced_df = create_balanced_dataset(df) 46 | # print(balanced_df["Label"].value_counts()) 47 | # print(balanced_df.shape[0]) 48 | 49 | # Change the string class labels "ham" and "spam" into integer class labels 0 and 1 50 | balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1}) 51 | # print(balanced_df) 52 | 53 | train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1) 54 | # Test size is implied to be 0.2 as the remainder 55 | 56 | print(train_df.shape[0]) 57 | print(validation_df.shape[0]) 58 | print(test_df.shape[0]) 59 | 60 | # Save the dataset as CSV (comma-separated values) files so we can reuse it later 61 | train_df.to_csv("train.csv", index=None) 62 | validation_df.to_csv("validation.csv", index=None) 63 | test_df.to_csv("test.csv", index=None) -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/download.py: -------------------------------------------------------------------------------- 1 | import urllib.request 2 | import zipfile 3 | import os 4 | from pathlib import Path 5 | 6 | 7 | url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip" 8 | zip_path = "sms_spam_collection.zip" 9 | extracted_path = "sms_spam_collection" 10 | data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv" 11 | 12 | 13 | def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path): 14 | 15 | if data_file_path.exists(): 16 | print(f"{data_file_path} already exists. 
Skipping download and extraction.") 17 | return 18 | 19 | # Downloading the file 20 | with urllib.request.urlopen(url) as response: 21 | with open(zip_path, "wb") as out_file: 22 | out_file.write(response.read()) 23 | 24 | # Unzipping the file 25 | with zipfile.ZipFile(zip_path, "r") as zip_ref: 26 | zip_ref.extractall(extracted_path) 27 | 28 | # Add .tsv file extension 29 | original_file_path = Path(extracted_path) / "SMSSpamCollection" 30 | os.rename(original_file_path, data_file_path) 31 | print(f"File downloaded and saved as {data_file_path}") 32 | 33 | download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path) -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/images/classification fine tuning stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/images/classification fine tuning stages.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/sms_spam_collection.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/sms_spam_collection.zip -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/sms_spam_collection/readme: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.2_Preparing_the_dataset/sms_spam_collection/readme -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.3_Creating_data_loaders/images/input text prep process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.3_Creating_data_loaders/images/input text prep process.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.3_Creating_data_loaders/images/single training batch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.3_Creating_data_loaders/images/single training batch.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.3_Creating_data_loaders/padding_token.py: -------------------------------------------------------------------------------- 1 | import tiktoken 2 | 3 | 4 | tokenizer = tiktoken.get_encoding("gpt2") 5 | 6 | print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})) 7 | # 50256 -------------------------------------------------------------------------------- 
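Aside (illustrative sketch, not a file in the repository): padding_token.py above shows that the GPT-2 tokenizer maps "<|endoftext|>" to token ID 50256, which the SpamDataset classes in the following sections reuse as their pad_token_id. A minimal sketch of the same truncate-then-pad step, assuming tiktoken is installed and using a hypothetical max_length of 8:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
pad_token_id = 50256   # token ID of "<|endoftext|>", reused here as the padding token
max_length = 8         # hypothetical fixed length, for illustration only

ids = tokenizer.encode("You have won a free prize")    # variable-length token IDs
ids = ids[:max_length]                                 # truncate if longer than max_length
ids = ids + [pad_token_id] * (max_length - len(ids))   # pad shorter sequences up to max_length
print(ids)   # every sequence now has exactly max_length token IDs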
/06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/data_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/data_setup/spam_dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import pandas as pd 4 | import tiktoken 5 | 6 | 7 | class SpamDataset(Dataset): 8 | def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256): 9 | self.data = pd.read_csv(csv_file) 10 | 11 | # Pre-tokenize texts 12 | self.encoded_texts = [ 13 | tokenizer.encode(text) for text in self.data["Text"] 14 | ] 15 | 16 | if max_length is None: 17 | self.max_length = self._longest_encoded_length() 18 | else: 19 | self.max_length = max_length 20 | # Truncate sequences if they are longer than max_length 21 | self.encoded_texts = [ 22 | encoded_text[:self.max_length] 23 | for encoded_text in self.encoded_texts 24 | ] 25 | 26 | # Pad sequences to the longest sequence 27 | self.encoded_texts = [ 28 | encoded_text + [pad_token_id] * (self.max_length - len(encoded_text)) 29 | for encoded_text in self.encoded_texts 30 | ] 31 | 32 | def __getitem__(self, index): 33 | encoded = self.encoded_texts[index] 34 | label = self.data.iloc[index]["Label"] 35 | return ( 36 | torch.tensor(encoded, dtype=torch.long), 37 | torch.tensor(label, dtype=torch.long) 38 | ) 39 | 40 | def __len__(self): 41 | return len(self.data) 42 | 43 | def _longest_encoded_length(self): 44 | max_length = 0 45 | for encoded_text in self.encoded_texts: 46 | encoded_length = len(encoded_text) 47 | if encoded_length > max_length: 48 | max_length = encoded_length 49 | return max_length 50 | 51 | 52 | tokenizer = tiktoken.get_encoding("gpt2") 53 | 54 | train_dataset = SpamDataset( 55 | csv_file="data_setup/train.csv", 56 | max_length=None, 57 | tokenizer=tokenizer 58 | ) 59 | 60 | val_dataset = SpamDataset( 61 | csv_file="data_setup/validation.csv", 62 | max_length=train_dataset.max_length, 63 | tokenizer=tokenizer 64 | ) 65 | 66 | test_dataset = SpamDataset( 67 | csv_file="data_setup/test.csv", 68 | max_length=train_dataset.max_length, 69 | tokenizer=tokenizer 70 | ) 71 | 72 | # Setting up data loaders 73 | num_workers = 0 74 | batch_size = 8 75 | 76 | torch.manual_seed(123) 77 | 78 | train_loader = DataLoader( 79 | dataset=train_dataset, 80 | batch_size=batch_size, 81 | shuffle=True, 82 | num_workers=num_workers, 83 | drop_last=True 84 | ) 85 | 86 | val_loader = DataLoader( 87 | dataset=val_dataset, 88 | batch_size=batch_size, 89 | num_workers=num_workers, 90 | drop_last=False 91 | ) 92 | 93 | test_loader = DataLoader( 94 | dataset=test_dataset, 95 | batch_size=batch_size, 96 | num_workers=num_workers, 97 | drop_last=False 98 | ) 99 | -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/gpt_setup/__init__.py: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/gpt_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.4_Initializing_a_model_with_pretrained_weights/images/stages.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/data_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/data_setup/spam_dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import pandas as pd 4 | import tiktoken 5 | 6 | 7 | class SpamDataset(Dataset): 8 | def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256): 9 | self.data = pd.read_csv(csv_file) 10 | 11 | # Pre-tokenize texts 12 | self.encoded_texts = [ 13 | tokenizer.encode(text) for text in self.data["Text"] 14 | ] 15 | 16 | if max_length is None: 17 | self.max_length = self._longest_encoded_length() 18 | else: 19 | self.max_length = max_length 20 | # Truncate sequences if they are longer than max_length 21 | self.encoded_texts = [ 22 | encoded_text[:self.max_length] 23 | for encoded_text in self.encoded_texts 24 | ] 25 | 26 | # Pad sequences to the longest sequence 27 | self.encoded_texts = [ 28 | encoded_text + [pad_token_id] * (self.max_length - len(encoded_text)) 29 | for encoded_text in self.encoded_texts 30 | ] 31 | 32 | def __getitem__(self, index): 33 | encoded = self.encoded_texts[index] 34 | label = self.data.iloc[index]["Label"] 35 | return ( 36 | torch.tensor(encoded, dtype=torch.long), 37 | torch.tensor(label, dtype=torch.long) 38 | ) 39 | 40 | def __len__(self): 41 | return len(self.data) 42 | 43 | def _longest_encoded_length(self): 44 | max_length = 0 45 | for encoded_text in self.encoded_texts: 46 | encoded_length = len(encoded_text) 47 | if encoded_length > max_length: 48 | max_length = encoded_length 49 | return max_length 50 | 51 | 52 | tokenizer = tiktoken.get_encoding("gpt2") 53 | 54 | train_dataset = SpamDataset( 55 | csv_file="data_setup/train.csv", 56 | max_length=None, 57 | tokenizer=tokenizer 58 | ) 59 | 60 | val_dataset = SpamDataset( 61 | csv_file="data_setup/validation.csv", 62 | max_length=train_dataset.max_length, 63 | tokenizer=tokenizer 64 | ) 65 | 66 | test_dataset = SpamDataset( 67 | csv_file="data_setup/test.csv", 68 | max_length=train_dataset.max_length, 69 | tokenizer=tokenizer 70 | ) 71 | 72 | # Setting up data loaders 73 | num_workers = 0 74 | batch_size = 8 75 | 76 | torch.manual_seed(123) 77 | 78 | train_loader = DataLoader( 79 | 
dataset=train_dataset, 80 | batch_size=batch_size, 81 | shuffle=True, 82 | num_workers=num_workers, 83 | drop_last=True 84 | ) 85 | 86 | val_loader = DataLoader( 87 | dataset=val_dataset, 88 | batch_size=batch_size, 89 | num_workers=num_workers, 90 | drop_last=False 91 | ) 92 | 93 | test_loader = DataLoader( 94 | dataset=test_dataset, 95 | batch_size=batch_size, 96 | num_workers=num_workers, 97 | drop_last=False 98 | ) 99 | -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/gpt_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | gpt.trf_blocks[b].norm1.shift = assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | 
params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/architecture adapation for binary classification.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/architecture adapation for binary classification.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/final layernorm and trf block set to trainable.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/final layernorm and trf block set to trainable.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/fine-tuning selected layers vs all layers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/fine-tuning selected layers vs all layers.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/last row of output tensor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/last row of output tensor.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/last token contains attention score to all other tokens .png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/last token contains attention score to all other tokens .png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/layer training summary, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/layer training summary, pt1.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/layer training summary, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/layer training summary, pt2.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/layer training summary, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/layer training summary, pt3.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/modifying output layer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.5_Adding_a_classification_head/images/modifying output layer.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/data_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/data_setup/spam_dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import pandas as pd 4 | import tiktoken 5 | 6 | 7 | class SpamDataset(Dataset): 8 | def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256): 9 | self.data = pd.read_csv(csv_file) 10 | 11 | # Pre-tokenize texts 12 | self.encoded_texts = [ 13 | tokenizer.encode(text) for text in self.data["Text"] 14 | ] 15 | 16 | if max_length is None: 17 | self.max_length = 
self._longest_encoded_length() 18 | else: 19 | self.max_length = max_length 20 | # Truncate sequences if they are longer than max_length 21 | self.encoded_texts = [ 22 | encoded_text[:self.max_length] 23 | for encoded_text in self.encoded_texts 24 | ] 25 | 26 | # Pad sequences to the longest sequence 27 | self.encoded_texts = [ 28 | encoded_text + [pad_token_id] * (self.max_length - len(encoded_text)) 29 | for encoded_text in self.encoded_texts 30 | ] 31 | 32 | def __getitem__(self, index): 33 | encoded = self.encoded_texts[index] 34 | label = self.data.iloc[index]["Label"] 35 | return ( 36 | torch.tensor(encoded, dtype=torch.long), 37 | torch.tensor(label, dtype=torch.long) 38 | ) 39 | 40 | def __len__(self): 41 | return len(self.data) 42 | 43 | def _longest_encoded_length(self): 44 | max_length = 0 45 | for encoded_text in self.encoded_texts: 46 | encoded_length = len(encoded_text) 47 | if encoded_length > max_length: 48 | max_length = encoded_length 49 | return max_length 50 | 51 | 52 | tokenizer = tiktoken.get_encoding("gpt2") 53 | 54 | train_dataset = SpamDataset( 55 | csv_file="data_setup/train.csv", 56 | max_length=None, 57 | tokenizer=tokenizer 58 | ) 59 | 60 | val_dataset = SpamDataset( 61 | csv_file="data_setup/validation.csv", 62 | max_length=train_dataset.max_length, 63 | tokenizer=tokenizer 64 | ) 65 | 66 | test_dataset = SpamDataset( 67 | csv_file="data_setup/test.csv", 68 | max_length=train_dataset.max_length, 69 | tokenizer=tokenizer 70 | ) 71 | 72 | # Setting up data loaders 73 | num_workers = 0 74 | batch_size = 8 75 | 76 | torch.manual_seed(123) 77 | 78 | train_loader = DataLoader( 79 | dataset=train_dataset, 80 | batch_size=batch_size, 81 | shuffle=True, 82 | num_workers=num_workers, 83 | drop_last=True 84 | ) 85 | 86 | val_loader = DataLoader( 87 | dataset=val_dataset, 88 | batch_size=batch_size, 89 | num_workers=num_workers, 90 | drop_last=False 91 | ) 92 | 93 | test_loader = DataLoader( 94 | dataset=test_dataset, 95 | batch_size=batch_size, 96 | num_workers=num_workers, 97 | drop_last=False 98 | ) 99 | -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/gpt_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/images/model outputs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/images/model outputs.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/images/stages.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.6_Calculating_the_classification_loss_and_accuracy/images/stages.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/data_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/data_setup/spam_dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import pandas as pd 4 | import tiktoken 5 | 6 | 7 | class SpamDataset(Dataset): 8 | def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256): 9 | self.data = pd.read_csv(csv_file) 10 | 11 | # Pre-tokenize texts 12 | self.encoded_texts = [ 13 | tokenizer.encode(text) for text in self.data["Text"] 14 | ] 15 | 16 | if max_length is None: 17 | self.max_length = self._longest_encoded_length() 18 | else: 19 | self.max_length = max_length 20 | # Truncate sequences if they are longer than max_length 21 | self.encoded_texts = [ 22 | encoded_text[:self.max_length] 23 | for encoded_text in self.encoded_texts 24 | ] 25 | 26 | # Pad sequences to the longest sequence 27 | self.encoded_texts = [ 28 | encoded_text + [pad_token_id] * (self.max_length - len(encoded_text)) 29 | for encoded_text in self.encoded_texts 30 | ] 31 | 32 | def __getitem__(self, index): 33 | encoded = self.encoded_texts[index] 34 | label = self.data.iloc[index]["Label"] 35 | return ( 36 | torch.tensor(encoded, dtype=torch.long), 37 | torch.tensor(label, dtype=torch.long) 38 | ) 39 | 40 | def __len__(self): 41 | return len(self.data) 42 | 43 | def _longest_encoded_length(self): 44 | max_length = 0 45 | for encoded_text in self.encoded_texts: 46 | encoded_length = len(encoded_text) 47 | if encoded_length > max_length: 48 | max_length = encoded_length 49 | return max_length 50 | 51 | 52 | tokenizer = tiktoken.get_encoding("gpt2") 53 | 54 | train_dataset = SpamDataset( 55 | csv_file="data_setup/train.csv", 56 | max_length=None, 57 | tokenizer=tokenizer 58 | ) 59 | 60 | val_dataset = SpamDataset( 61 | csv_file="data_setup/validation.csv", 62 | max_length=train_dataset.max_length, 63 | tokenizer=tokenizer 64 | ) 65 | 66 | test_dataset = SpamDataset( 67 | csv_file="data_setup/test.csv", 68 | max_length=train_dataset.max_length, 69 | tokenizer=tokenizer 70 | ) 71 | 72 | # Setting up data loaders 73 | num_workers = 0 74 | batch_size = 8 75 | 76 | torch.manual_seed(123) 77 | 78 | train_loader = DataLoader( 79 | dataset=train_dataset, 80 | batch_size=batch_size, 81 | shuffle=True, 82 | num_workers=num_workers, 83 | drop_last=True 84 | ) 85 | 86 | val_loader = DataLoader( 87 | dataset=val_dataset, 88 | batch_size=batch_size, 89 | num_workers=num_workers, 90 | drop_last=False 91 | ) 92 | 93 | test_loader = DataLoader( 94 | dataset=test_dataset, 95 | batch_size=batch_size, 96 | num_workers=num_workers, 97 | drop_last=False 98 | ) 99 | 
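# --- Illustrative aside, not part of the original spam_dataset.py ---
# A quick sanity check of the loaders defined above (assumes the train/validation/test
# CSV files produced in section 6.2 are available under data_setup/): fetch a single
# batch and inspect its dimensions.
for input_batch, target_batch in train_loader:
    print(input_batch.shape, target_batch.shape)   # e.g. torch.Size([8, max_length]) and torch.Size([8])
    break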
-------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/gpt_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | gpt.trf_blocks[b].norm1.shift = assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = 
assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/accuracy-plot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/accuracy-plot.pdf -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/choosing the number of epochs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/choosing the number of epochs.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/loss-plot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/loss-plot.pdf -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt1.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt2.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset 
explanation, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt3.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt4.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt5.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt6.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/train, validation, test dataset explanation, pt7.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/training & validation accuracy plot explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/training & validation accuracy plot explanation.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/training & validation loss plot explanation.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/training & validation loss plot explanation.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/training loop.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.7_Fine_tuning_the_model_on_supervised_data/images/training loop.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/data_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/data_setup/spam_dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader 3 | import pandas as pd 4 | import tiktoken 5 | 6 | 7 | class SpamDataset(Dataset): 8 | def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256): 9 | self.data = pd.read_csv(csv_file) 10 | 11 | # Pre-tokenize texts 12 | self.encoded_texts = [ 13 | tokenizer.encode(text) for text in self.data["Text"] 14 | ] 15 | 16 | if max_length is None: 17 | self.max_length = self._longest_encoded_length() 18 | else: 19 | self.max_length = max_length 20 | # Truncate sequences if they are longer than max_length 21 | self.encoded_texts = [ 22 | encoded_text[:self.max_length] 23 | for encoded_text in self.encoded_texts 24 | ] 25 | 26 | # Pad sequences to the longest sequence 27 | self.encoded_texts = [ 28 | encoded_text + [pad_token_id] * (self.max_length - len(encoded_text)) 29 | for encoded_text in self.encoded_texts 30 | ] 31 | 32 | def __getitem__(self, index): 33 | encoded = self.encoded_texts[index] 34 | label = self.data.iloc[index]["Label"] 35 | return ( 36 | torch.tensor(encoded, dtype=torch.long), 37 | torch.tensor(label, dtype=torch.long) 38 | ) 39 | 40 | def __len__(self): 41 | return len(self.data) 42 | 43 | def _longest_encoded_length(self): 44 | max_length = 0 45 | for encoded_text in self.encoded_texts: 46 | encoded_length = len(encoded_text) 47 | if encoded_length > max_length: 48 | max_length = encoded_length 49 | return max_length 50 | 51 | 52 | tokenizer = tiktoken.get_encoding("gpt2") 53 | 54 | train_dataset = SpamDataset( 55 | csv_file="data_setup/train.csv", 56 | max_length=None, 57 | tokenizer=tokenizer 58 | ) 59 | 60 | val_dataset = SpamDataset( 61 | csv_file="data_setup/validation.csv", 62 | max_length=train_dataset.max_length, 63 | tokenizer=tokenizer 64 | ) 65 | 66 | test_dataset = SpamDataset( 67 | csv_file="data_setup/test.csv", 68 | max_length=train_dataset.max_length, 69 | tokenizer=tokenizer 70 | ) 71 | 72 | # Setting up data loaders 73 | num_workers = 0 74 | batch_size = 8 75 | 76 | 
torch.manual_seed(123) 77 | 78 | train_loader = DataLoader( 79 | dataset=train_dataset, 80 | batch_size=batch_size, 81 | shuffle=True, 82 | num_workers=num_workers, 83 | drop_last=True 84 | ) 85 | 86 | val_loader = DataLoader( 87 | dataset=val_dataset, 88 | batch_size=batch_size, 89 | num_workers=num_workers, 90 | drop_last=False 91 | ) 92 | 93 | test_loader = DataLoader( 94 | dataset=test_dataset, 95 | batch_size=batch_size, 96 | num_workers=num_workers, 97 | drop_last=False 98 | ) 99 | -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/gpt_setup/__init__.py -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | 
gpt.trf_blocks[b].norm1.shift = assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt1.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt10.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt11.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt12.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt12.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt13.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt13.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt2.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt3.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt4.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt5.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt6.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt7.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt8.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt8.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/6.8_Using_the_LLM_as_a_spam_classifier/images/overview, pt9.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/images/classification fine tuning.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/images/classification fine tuning.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/images/fine tuning approach.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/images/fine tuning approach.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/images/instruction fine tuning.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/images/instruction fine tuning.png -------------------------------------------------------------------------------- /06_Fine_tuning_for_classification/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/06_Fine_tuning_for_classification/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.1_Introduction_to_instruction_fine_tuning/images/desired goal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.1_Introduction_to_instruction_fine_tuning/images/desired goal.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.1_Introduction_to_instruction_fine_tuning/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.1_Introduction_to_instruction_fine_tuning/images/stages.png -------------------------------------------------------------------------------- 
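A minimal sketch of the truncate-then-pad behavior implemented in 6.8_Using_the_LLM_as_a_spam_classifier/data_setup/spam_dataset.py above, using invented token IDs; 50256 is the GPT-2 <|endoftext|> ID that the dataset class passes in as pad_token_id:

# Toy token-ID lists standing in for tokenizer.encode(text) results (values invented).
pad_token_id = 50256
max_length = 5

encoded_texts = [
    [101, 102, 103],                 # shorter than max_length -> gets padded
    [201, 202, 203, 204, 205, 206],  # longer than max_length  -> gets truncated
]

# Truncate sequences that exceed max_length, as SpamDataset does when max_length is given.
encoded_texts = [t[:max_length] for t in encoded_texts]

# Pad every sequence up to max_length with the padding token.
encoded_texts = [t + [pad_token_id] * (max_length - len(t)) for t in encoded_texts]

print(encoded_texts)
# [[101, 102, 103, 50256, 50256], [201, 202, 203, 204, 205]]

Bringing every example to one shared length is what lets the DataLoaders defined in the same file stack each batch into a single tensor.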
/07_Fine_tuning_for_instructions/7.2_Preparing_a_dataset_for_supervised_instruction_fine_tuning/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | 4 | 5 | def load_file(file_path): 6 | with open(file_path, "r", encoding="utf-8") as file: 7 | data = json.load(file) 8 | 9 | return data 10 | 11 | 12 | def format_input(entry): 13 | # Alpaca-style prompt formatting 14 | instruction_text = ( 15 | f"Below is an instruction that describes a task. " 16 | f"Write a response that appropriately completes the request." 17 | f"\n\n### Instruction:\n{entry['instruction']}" 18 | ) 19 | 20 | input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else "" 21 | 22 | return instruction_text + input_text 23 | 24 | 25 | data = load_file("instruction-data.json") 26 | 27 | assert len(data) == 1100, "Instruction dataset is not of the correct length, please reload the data." 28 | 29 | # print("Number of entries:", len(data)) 30 | # print("Example entry:\n", data[50]) 31 | # print("Example entry:\n", data[999]) 32 | 33 | # Formatting input 34 | model_input = format_input(data[50]) 35 | 36 | desired_response = f"\n\n### Response:\n{data[50]['output']}" 37 | 38 | print(model_input + desired_response) 39 | 40 | # Divide the dataset into a training, validation, and test set 41 | train_portion = int(len(data) * 0.85) # 85% for training 42 | test_portion = int(len(data) * 0.1) # 10% for testing 43 | val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation 44 | 45 | print(train_portion) 46 | print(test_portion) 47 | print(val_portion) 48 | 49 | train_data = data[:train_portion] 50 | test_data = data[train_portion:train_portion + test_portion] 51 | val_data = data[train_portion + test_portion:] 52 | 53 | print("\nTraining set length:", len(train_data)) 54 | print("Validation set length:", len(val_data)) 55 | print("Test set length:", len(test_data)) 56 | 57 | # TODO: Exercise 7.1: Changing prompt styles 58 | # After fine-tuning the model with the Alpaca prompt style, try the Phi-3 prompt style 59 | # shown in figure 7.4 and observe whether it affects the response quality of the model. 
60 | 61 | # def format_input(entry): 62 | # instruction_text = ( 63 | # f"<|user|>\n{entry['instruction']}" 64 | # ) 65 | 66 | # input_text = f"\n{entry['input']}" if entry["input"] else "" 67 | 68 | # return instruction_text + input_text -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.2_Preparing_a_dataset_for_supervised_instruction_fine_tuning/download.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import urllib.request 4 | 5 | 6 | def download_and_load_file(file_path, url): 7 | 8 | if not os.path.exists(file_path): 9 | with urllib.request.urlopen(url) as response: 10 | text_data = response.read().decode("utf-8") 11 | with open(file_path, "w", encoding="utf-8") as file: 12 | file.write(text_data) 13 | else: 14 | with open(file_path, "r", encoding="utf-8") as file: 15 | text_data = file.read() 16 | 17 | with open(file_path, "r", encoding="utf-8") as file: 18 | data = json.load(file) 19 | 20 | return data 21 | 22 | 23 | file_path = "instruction-data.json" 24 | url = ( 25 | "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch" 26 | "/main/ch07/01_main-chapter-code/instruction-data.json" 27 | ) 28 | 29 | data = download_and_load_file(file_path, url) 30 | print("Number of entries:", len(data)) 31 | 32 | print("Example entry:\n", data[50]) 33 | 34 | print("Example entry:\n", data[999]) -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.2_Preparing_a_dataset_for_supervised_instruction_fine_tuning/images/instruction fine tuning prompt styles.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.2_Preparing_a_dataset_for_supervised_instruction_fine_tuning/images/instruction fine tuning prompt styles.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/-100 purpose in target IDs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/-100 purpose in target IDs.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt1.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt2.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt3.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt4.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt5.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/Instruction Fine-Tuning Training Process, pt6.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/batching process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/batching process.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/cross entropy loss for logits_1 & targets_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/cross entropy loss for logits_1 & targets_1.png 
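For reference, the Alpaca-style prompt assembled by format_input() in 7.2_Preparing_a_dataset_for_supervised_instruction_fine_tuning/data_preprocessing.py above produces text like the following; the entry is a hypothetical example, while the formatting strings match that file:

# Hypothetical entry shaped like the items in instruction-data.json (values invented).
entry = {
    "instruction": "Rewrite the sentence in passive voice.",
    "input": "The chef cooked the meal.",
    "output": "The meal was cooked by the chef.",
}

# Same assembly as format_input() plus the "### Response" suffix used during training.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
    f"\n\n### Instruction:\n{entry['instruction']}"
    + (f"\n\n### Input:\n{entry['input']}" if entry["input"] else "")
)
response = f"\n\n### Response:\n{entry['output']}"
print(prompt + response)

# Printed result:
# Below is an instruction that describes a task. Write a response that appropriately completes the request.
#
# ### Instruction:
# Rewrite the sentence in passive voice.
#
# ### Input:
# The chef cooked the meal.
#
# ### Response:
# The meal was cooked by the chef.

With len(data) == 1100, the train/test/validation split in that same file works out to int(1100 * 0.85) = 935, int(1100 * 0.1) = 110, and 1100 - 935 - 110 = 55 entries, respectively.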
-------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/custom collate (assemble) function.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/custom collate (assemble) function.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/first two steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/first two steps.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index book explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index book explanation.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt1.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt2.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt3.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt4.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt4.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt5.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt6.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/ignore_index in cross-cross-entropy loss, pt7.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/input and target token alignment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/input and target token alignment.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/masked instruction tokens.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/masked instruction tokens.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/masking the instruction tokens explained.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/masking the instruction tokens explained.png 
-------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/padding token replacement in target batch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/padding token replacement in target batch.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/target IDs explained.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.3_Organizing_data_into_training_batches/images/target IDs explained.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.4_Creating_data_loaders_for_an_instruction_dataset/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.4_Creating_data_loaders_for_an_instruction_dataset/data_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.4_Creating_data_loaders_for_an_instruction_dataset/data_setup/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | def load_file(file_path): 5 | with open(file_path, "r", encoding="utf-8") as file: 6 | data = json.load(file) 7 | 8 | return data 9 | 10 | data = load_file("data_setup/instruction-data.json") 11 | 12 | assert len(data) == 1100, "Instruction dataset is not of the correct length, please reload the data." 
13 | 14 | # Divide the dataset into a training, validation, and test set 15 | train_portion = int(len(data) * 0.85) # 85% for training 16 | test_portion = int(len(data) * 0.1) # 10% for testing 17 | val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation 18 | 19 | train_data = data[:train_portion] 20 | test_data = data[train_portion:train_portion + test_portion] 21 | val_data = data[train_portion + test_portion:] 22 | -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.4_Creating_data_loaders_for_an_instruction_dataset/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.4_Creating_data_loaders_for_an_instruction_dataset/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.4_Creating_data_loaders_for_an_instruction_dataset/testing.py: -------------------------------------------------------------------------------- 1 | from data_setup.instruction_dataset import train_loader, val_loader, test_loader, train_dataset 2 | import tiktoken 3 | 4 | 5 | tokenizer = tiktoken.get_encoding("gpt2") 6 | 7 | 8 | # Prerequisite for validating: set shuffle=False in train_loader in instruction_dataset.py 9 | 10 | # The very first sequence in the first batch is the sequence of longest length, thus no padding tokens are present. 11 | # Because of this, the second sequence is tested instead, so the padding tokens are visible when decoded. 12 | IDX = 1 13 | 14 | print("\n----------------------------------------------------------") 15 | print("\n******* RAW TOKEN IDs + CORRESPONDING DECODED TEXT OF SECOND TRAIN LOADER SAMPLE *******\n") 16 | 17 | for inputs, targets in train_loader: 18 | # Get the second example from the batch (token IDs) 19 | second_example_tokens = inputs[IDX].tolist() 20 | 21 | print(second_example_tokens, "\n") 22 | 23 | print("Length of second example (train loader) tokens:", len(second_example_tokens), "\n") # 15 padding tokens (50256) get added 24 | 25 | # Decode token IDs back into text 26 | second_example_text = tokenizer.decode(second_example_tokens) 27 | 28 | print("Decoded Text for Second Example in Batch:") 29 | print(second_example_text) 30 | break # Stop after the first batch 31 | 32 | 33 | # { 34 | # "instruction": "Edit the following sentence for grammar.", 35 | # "input": "He go to the park every day.", 36 | # "output": "He goes to the park every day."
37 | # }, 38 | 39 | 40 | print("\n----------------------------------------------------------") 41 | print("\n******* RAW TOKEN IDs + CORRESPONDING DECODED TEXT OF SECOND TRAIN DATASET SAMPLE *******\n") 42 | 43 | raw_data = train_dataset[IDX] 44 | 45 | print("Length of second example (train dataset) tokens:", len(raw_data), "\n") 46 | 47 | print(raw_data, "\n") 48 | 49 | print(tokenizer.decode(raw_data)) 50 | 51 | print("\n----------------------------------------------------------\n") 52 | 53 | 54 | print("******* CHECKING FIRST BATCH FOR LONGEST SEQUENCE *******\n") 55 | max_length = 0 56 | max_length_index = None 57 | 58 | # Get the first batch 59 | for inputs, targets in train_loader: 60 | for i, seq in enumerate(inputs): 61 | seq_length = (seq != 50256).sum().item() # Count non-padding tokens 62 | if seq_length > max_length: 63 | max_length = seq_length 64 | max_length_index = i 65 | 66 | break # Stop after processing the first batch 67 | 68 | 69 | 70 | # Print results 71 | print(f"Index of longest sequence in the first batch: {max_length_index}") 72 | print(f"Length of the longest sequence (excluding padding): {max_length}") 73 | 74 | 75 | print("\n----------------------------------------------------------\n") 76 | -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/data_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/data_setup/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | def load_file(file_path): 5 | with open(file_path, "r", encoding="utf-8") as file: 6 | data = json.load(file) 7 | 8 | return data 9 | 10 | data = load_file("data_setup/instruction-data.json") 11 | 12 | assert len(data) == 1100, "Instruction dataset is not of the correct length, please reload the data." 
13 | 14 | # Divide the dataset into a training, validation, and test set 15 | train_portion = int(len(data) * 0.85) # 85% for training 16 | test_portion = int(len(data) * 0.1) # 10% for testing 17 | val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation 18 | 19 | train_data = data[:train_portion] 20 | test_data = data[train_portion:train_portion + test_portion] 21 | val_data = data[train_portion + test_portion:] 22 | -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/gpt_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | gpt.trf_blocks[b].norm1.shift = assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | 
params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.5_Loading_a_pretrained_LLM/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/data_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/data_setup/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | def load_file(file_path): 5 | with open(file_path, "r", encoding="utf-8") as file: 6 | data = json.load(file) 7 | 8 | return data 9 | 10 | data = load_file("data_setup/instruction-data.json") 11 | 12 | assert len(data) == 1100, "Instruction dataset is not of the correct length, please reload the data." 
13 | 14 | # Divide the dataset into a training, validation, and test set 15 | train_portion = int(len(data) * 0.85) # 85% for training 16 | test_portion = int(len(data) * 0.1) # 10% for testing 17 | val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation 18 | 19 | train_data = data[:train_portion] 20 | test_data = data[train_portion:train_portion + test_portion] 21 | val_data = data[train_portion + test_portion:] 22 | -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/gpt_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | gpt.trf_blocks[b].norm1.shift = 
assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/dealing with hardware limitations.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/dealing with hardware limitations.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/device runtimes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/device runtimes.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/loss plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/loss plot.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/loss-plot.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/loss-plot.pdf -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/stages.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.6_Fine_tuning_the_LLM_on_instruction_data/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/data_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/data_setup/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | def load_file(file_path): 5 | with open(file_path, "r", encoding="utf-8") as file: 6 | data = json.load(file) 7 | 8 | return data 9 | 10 | data = load_file("data_setup/instruction-data.json") 11 | 12 | assert len(data) == 1100, "Instruction dataset is not of the correct length, please reload the data." 13 | 14 | # Divide the dataset into a training, validation, and test set 15 | train_portion = int(len(data) * 0.85) # 85% for training 16 | test_portion = int(len(data) * 0.1) # 10% for testing 17 | val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation 18 | 19 | train_data = data[:train_portion] 20 | test_data = data[train_portion:train_portion + test_portion] 21 | val_data = data[train_portion + test_portion:] 22 | -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/gpt_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/gpt_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/gpt_setup/load_weights.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | 5 | def assign(left, right): 6 | if left.shape != right.shape: 7 | raise ValueError(f"Shape mismatch. 
Left: {left.shape}, Right: {right.shape}") 8 | return torch.nn.Parameter(torch.tensor(right)) 9 | 10 | def load_weights_into_gpt(gpt, params): 11 | gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe']) 12 | gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte']) 13 | 14 | for b in range(len(params["blocks"])): # iterate over each transformer block in the model 15 | q_w, k_w, v_w = np.split( 16 | (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1) 17 | gpt.trf_blocks[b].att.W_query.weight = assign( 18 | gpt.trf_blocks[b].att.W_query.weight, q_w.T) 19 | gpt.trf_blocks[b].att.W_key.weight = assign( 20 | gpt.trf_blocks[b].att.W_key.weight, k_w.T) 21 | gpt.trf_blocks[b].att.W_value.weight = assign( 22 | gpt.trf_blocks[b].att.W_value.weight, v_w.T) 23 | 24 | q_b, k_b, v_b = np.split( 25 | (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1) 26 | gpt.trf_blocks[b].att.W_query.bias = assign( 27 | gpt.trf_blocks[b].att.W_query.bias, q_b) 28 | gpt.trf_blocks[b].att.W_key.bias = assign( 29 | gpt.trf_blocks[b].att.W_key.bias, k_b) 30 | gpt.trf_blocks[b].att.W_value.bias = assign( 31 | gpt.trf_blocks[b].att.W_value.bias, v_b) 32 | 33 | gpt.trf_blocks[b].att.out_proj.weight = assign( 34 | gpt.trf_blocks[b].att.out_proj.weight, 35 | params["blocks"][b]["attn"]["c_proj"]["w"].T) 36 | gpt.trf_blocks[b].att.out_proj.bias = assign( 37 | gpt.trf_blocks[b].att.out_proj.bias, 38 | params["blocks"][b]["attn"]["c_proj"]["b"]) 39 | 40 | gpt.trf_blocks[b].ff.layers[0].weight = assign( 41 | gpt.trf_blocks[b].ff.layers[0].weight, 42 | params["blocks"][b]["mlp"]["c_fc"]["w"].T) 43 | gpt.trf_blocks[b].ff.layers[0].bias = assign( 44 | gpt.trf_blocks[b].ff.layers[0].bias, 45 | params["blocks"][b]["mlp"]["c_fc"]["b"]) 46 | gpt.trf_blocks[b].ff.layers[2].weight = assign( 47 | gpt.trf_blocks[b].ff.layers[2].weight, 48 | params["blocks"][b]["mlp"]["c_proj"]["w"].T) 49 | gpt.trf_blocks[b].ff.layers[2].bias = assign( 50 | gpt.trf_blocks[b].ff.layers[2].bias, 51 | params["blocks"][b]["mlp"]["c_proj"]["b"]) 52 | 53 | gpt.trf_blocks[b].norm1.scale = assign( 54 | gpt.trf_blocks[b].norm1.scale, 55 | params["blocks"][b]["ln_1"]["g"]) 56 | gpt.trf_blocks[b].norm1.shift = assign( 57 | gpt.trf_blocks[b].norm1.shift, 58 | params["blocks"][b]["ln_1"]["b"]) 59 | gpt.trf_blocks[b].norm2.scale = assign( 60 | gpt.trf_blocks[b].norm2.scale, 61 | params["blocks"][b]["ln_2"]["g"]) 62 | gpt.trf_blocks[b].norm2.shift = assign( 63 | gpt.trf_blocks[b].norm2.shift, 64 | params["blocks"][b]["ln_2"]["b"]) 65 | 66 | gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"]) 67 | gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"]) 68 | gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"]) 69 | 70 | return gpt 71 | 72 | # ------ Notes ------ 73 | # load_weights_into_gpt() will set the model's positional and token embedding weights 74 | # to those specified in params 75 | 76 | # np.split function (line #15) is used to divide the attention and bias weights into three equal 77 | # parts for the query, key, and value components 78 | 79 | # Referring to line #68, the original GPT-2 model by OpenAI reused the token embedding weights in the 80 | # output layer to reduce the total number of parameters, which is a concept known as weight tying. 
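# ------ Illustration (editor's sketch, not part of load_weights.py) ------
# A minimal, self-contained example of the two ideas covered in the notes above:
# splitting the fused attention matrix with np.split (line #15) and weight tying
# (line #68). The sizes below (emb_dim = 4, vocab_size = 10) are made up for
# illustration and are not the GPT-2 dimensions. Note also that load_weights_into_gpt
# copies the same "wte" array into both the token embedding and the output head,
# whereas this sketch shows weight tying in the stricter PyTorch sense of sharing
# a single Parameter object between the two layers.

import numpy as np
import torch

# A fused QKV weight matrix shaped (emb_dim, 3 * emb_dim), the layout that the
# np.split call at line #15 assumes.
emb_dim = 4
c_attn_w = np.arange(emb_dim * 3 * emb_dim, dtype=np.float32).reshape(emb_dim, 3 * emb_dim)

# Splitting along the last axis yields three equal (emb_dim, emb_dim) blocks:
# the query, key, and value weights.
q_w, k_w, v_w = np.split(c_attn_w, 3, axis=-1)
assert q_w.shape == k_w.shape == v_w.shape == (emb_dim, emb_dim)

# Weight tying: the output head reuses the token embedding's Parameter, so both
# layers share (and would jointly update) the same underlying tensor. Sharing it
# saves vocab_size * emb_dim parameters, which is why the note above describes
# weight tying as a way to reduce the total parameter count.
vocab_size = 10
tok_emb = torch.nn.Embedding(vocab_size, emb_dim)
out_head = torch.nn.Linear(emb_dim, vocab_size, bias=False)
out_head.weight = tok_emb.weight
assert out_head.weight.data_ptr() == tok_emb.weight.data_ptr()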
-------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response explained, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response explained, pt1.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response explained, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response explained, pt2.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response, pt1.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/model response, pt2.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.7_Extracting_and_saving_responses/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/check_ollama_status.py: -------------------------------------------------------------------------------- 1 | import psutil 2 | 3 | 4 | def check_if_running(process_name): 5 | running = False 6 | for proc in psutil.process_iter(["name"]): 7 | if process_name in proc.info["name"]: 8 | running = True 9 | break 10 | return running 11 | 12 | ollama_running = check_if_running("ollama") 13 | 14 | if not ollama_running: 15 | raise RuntimeError("Ollama not running. 
Launch ollama before proceeding.") 16 | print("Ollama running:", check_if_running("ollama")) -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/data_setup/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/data_setup/__init__.py -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/data_setup/data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | def load_file(file_path): 5 | with open(file_path, "r", encoding="utf-8") as file: 6 | data = json.load(file) 7 | 8 | return data 9 | 10 | data = load_file("data_setup/instruction-data.json") 11 | 12 | assert len(data) == 1100, "Instruction dataset is not of the correct length, please reload the data." 13 | 14 | # Divide the dataset into a training, validation, and test set 15 | train_portion = int(len(data) * 0.85) # 85% for training 16 | test_portion = int(len(data) * 0.1) # 10% for testing 17 | val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation 18 | 19 | train_data = data[:train_portion] 20 | test_data = data[train_portion:train_portion + test_portion] 21 | val_data = data[train_portion + test_portion:] 22 | -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/alternative ollama models.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/alternative ollama models.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/llama 3 score for gpt2 instruct, pt1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/llama 3 score for gpt2 instruct, pt1.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/llama 3 score for gpt2 instruct, pt2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/llama 3 score for gpt2 instruct, pt2.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/llama 3 score for gpt2 instruct, pt3.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/llama 3 score for gpt2 instruct, pt3.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/ollama.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/ollama.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/stages.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/using larger llms via web apis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/7.8_Evaluating_the_fine_tuned_LLM/images/using larger llms via web apis.png -------------------------------------------------------------------------------- /07_Fine_tuning_for_instructions/images/stages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JohnMachado11/Build-a-Large-Language-Model-from-Scratch/2c14e7960c7051e3d563e17fb092e9c11d36b46b/07_Fine_tuning_for_instructions/images/stages.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Build-a-Large-Language-Model-from-Scratch 2 | 3 | https://www.manning.com/books/build-a-large-language-model-from-scratch 4 | 5 | "In Build a Large Language Model (from Scratch), you’ll discover how LLMs work from the inside out. In this book, I’ll guide you step by step through creating your own LLM, explaining each stage with clear text, diagrams, and examples. 6 | 7 | The method described in this book for training and developing your own small-but-functional model for educational purposes mirrors the approach used in creating large-scale foundational models such as those behind ChatGPT. The book uses Python and PyTorch for all its coding examples." 8 | 9 | By Sebastian Raschka 10 | 11 | Book Repo: 12 | https://github.com/rasbt/LLMs-from-scratch/ 13 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | torch==2.4.0 2 | tiktoken==0.8.0 3 | matplotlib==3.9.2 4 | tensorflow>=2.15.0 5 | tqdm>=4.66 6 | pandas==2.2.3 7 | psutil==6.1.1 --------------------------------------------------------------------------------
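The dependency pins above can be installed with pip install -r requirements.txt. As an optional quick check (a small sketch, not part of the repository), the following snippet prints the installed version of each package next to the spec from requirements.txt, which makes it easy to spot an environment that has drifted from the pinned versions:

from importlib.metadata import PackageNotFoundError, version

# Specs copied from the requirements.txt above.
requirements = {
    "torch": "==2.4.0",
    "tiktoken": "==0.8.0",
    "matplotlib": "==3.9.2",
    "tensorflow": ">=2.15.0",
    "tqdm": ">=4.66",
    "pandas": "==2.2.3",
    "psutil": "==6.1.1",
}

for package, spec in requirements.items():
    try:
        print(f"{package}: installed {version(package)} (requirements.txt: {spec})")
    except PackageNotFoundError:
        print(f"{package}: not installed (requirements.txt: {spec})")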