├── .gitignore ├── README.md ├── bash └── train-all-cfgs.sh ├── cfgs ├── train-0-notebook.yaml └── train-1.yaml ├── doc ├── imgs │ └── fitgraph.jpg └── tips_tricks │ ├── README.md │ └── vscode_setup.md ├── notebooks ├── eda.ipynb ├── references │ └── optimization-approaches-for-transformers.ipynb └── training.ipynb ├── requirements.txt ├── scripts └── train_model.py └── src ├── __init__.py ├── dataloading ├── __init__.py ├── load_data.py ├── load_datasets.py ├── preprocess.py └── stratify.py ├── models ├── __init__.py └── llm_multiclass.py ├── training ├── __init__.py ├── losses.py ├── metrics.py ├── optimizers.py └── single_fold.py ├── utils.py └── visualization.py /.gitignore: -------------------------------------------------------------------------------- 1 | prod.env 2 | .venv 3 | .vscode 4 | __pycache__/ 5 | .ipynb_checkpoints/ 6 | data/ 7 | hf_download/ 8 | *.pt 9 | *.pickle -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

2 | 5 | PyTorch Workflow for Large Language Models (LLM) 6 |

7 | 8 |

Use this repository as a basic framework for tailoring Large Language Models (LLMs) with PyTorch. 9 |

10 | 11 |

12 | Introduction  •  13 | Getting Started  •  14 | Generic Workflow  •  15 | Use Case  •  16 | Deep Learning Techniques  •  17 | Issues  •  18 | TODOs 19 |

20 | 21 |

22 | 23 | 24 | 25 | 26 | 27 | 28 |

29 | 30 | # Introduction 31 | This workflow helps you get accustomed to LLM project structure and PyTorch for custom model creation, showcasing multi-class classification with a public dataset and an LLM from the Hugging Face Hub. 32 | 33 | ### Workflow Advantages 34 | Key advantages of this workflow not commonly found elsewhere include: 35 | - **PyTorch Models**: It employs a custom PyTorch class for LLM fine-tuning, allowing custom layers, activation functions, layer freezing, model heads, loss functions, etc. through a [PyTorch Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), unlike typical [HuggingFace Tasks](https://huggingface.co/tasks). 36 | - **Python Modules and Directory Structure**: The organized directory structure supports [Python modules](https://docs.python.org/3/tutorial/modules.html) and config files for versatility, inspired by [Joel Grus' presentation](https://www.youtube.com/watch?v=7jiPeIFXb6U) on Jupyter Notebooks. 37 | - **Configuration Files for Input Parameters**: For script execution via CLI or Cron scheduling, configuration files enable flexible pipeline variations and automated execution. 38 | - **Updated PyTorch and LLM Packages**: This workflow incorporates recent NLP advancements and open-source software from the period following ChatGPT's release in late 2022. 39 | - **Integrated Feature Set**: The repository provides a comprehensive feature set for quick pipeline development and modification. 40 | 41 | **NOTE**: This workflow can be adapted for many PyTorch deep learning applications, not just LLMs. 42 | 43 | 44 | # Getting Started 45 | 46 | To understand this workflow, proceed with the use case in the following order: 47 | 48 | ### [1.) EDA - Jupyter Notebook](./notebooks/eda.ipynb) 49 | Review this [EDA - Jupyter Notebook](./notebooks/eda.ipynb) for a brief exploration of the CFPB data, covering model features, target distributions, text token counts, and data reduction. 50 | 51 | ### [2.) 
Model Training Walkthrough - Jupyter Notebook](https://nbviewer.org/github/mddunlap924/PyTorch-LLM/blob/main/notebooks/training.ipynb) 52 | Use this notebook to train a model via a [single configuration file](./cfgs/train-0-notebook.yaml), with supplementary pre-training tasks and further analysis techniques for model selection. 53 | 54 | ### [3.) Model Training Script - Python Script](./scripts/train_model.py) 55 | This script offers robust long-term training routines across various [configuration files](./cfgs/train-1.yaml) and can be paired with this [bash shell script](./bash/train-all-cfgs.sh) for full automation of model development and experiments, ideal for prolonged runs and allowing your computer to work autonomously. 56 | 57 | 58 | # Generic Workflow 59 | The pseudocode below outlines the cross-validation training process that this repository follows, using PyTorch. 60 | 61 | ``` 62 | INPUT: YAML config. file 63 | OUTPUT: Model checkpoints, training log 64 | 65 | 1. Load YAML config. 66 | 2. C.V. data Folds 67 | 3. Loop over each data fold: 68 | A.) Training module 69 | * Dataloader with custom preprocessing and collator 70 | * Train a custom PyTorch model 71 | * Standard PyTorch training loop with: save checkpoints, log training metrics, etc. 72 | ``` 73 | 74 | The standard PyTorch training loop, shown below, is used here. Additional modifications to improve model performance and training/inference speed are also implemented in the training loop. 
75 | 76 | ```python 77 | # loop through batches 78 | for (inputs, labels) in data_loader: 79 | 80 | # extract inputs and labels 81 | inputs = inputs.to(device) 82 | labels = labels.to(device) 83 | 84 | # passes and weights update 85 | with torch.set_grad_enabled(True): 86 | 87 | # forward pass 88 | preds = model(inputs) 89 | loss = criterion(preds, labels) 90 | 91 | # backward pass 92 | loss.backward() 93 | 94 | # weights update 95 | optimizer.step() 96 | optimizer.zero_grad() 97 | ``` 98 | 99 | # Use Case 100 | The NLP dataset used here is obtained from [The Consumer Financial Protection Bureau (CFPB)](https://www.consumerfinance.gov/), available on [Kaggle](https://www.kaggle.com/datasets/selener/consumer-complaint-database), featuring consumer complaints about financial providers. 101 | 102 | ### Model Training Objective 103 | We're performing multi-class classification on this dataset, where the five product categories represent the *target* variable, and three *source* variables are used as input for the LLM model. 104 | - **NOTE**: The input variables used in this example include `unstructured text` and `categorical variables`, showcasing how to combine mixed data types for LLM model fine-tuning, while selection of these variables for prediction performance wasn't the primary focus. 105 | 106 | ### Metrics 107 | The classification performance was evaluated using MultiClass: [F1 Score](https://pytorch.org/torcheval/stable/generated/torcheval.metrics.MulticlassF1Score.html#torcheval.metrics.MulticlassF1Score), [Precision](https://pytorch.org/torcheval/stable/generated/torcheval.metrics.MulticlassPrecision.html#torcheval.metrics.MulticlassPrecision), and [Recall](https://pytorch.org/torcheval/stable/generated/torcheval.metrics.MulticlassRecall.html#torcheval.metrics.MulticlassRecall), but other metrics could be used as well. 
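To make the metrics above concrete, here is a minimal, dependency-free sketch of what multi-class precision, recall, and F1 compute. It is an illustration only, not the repository's implementation: the function name `macro_scores` is hypothetical, and macro averaging (an unweighted mean over classes) is an assumption. The linked library classes expose their own averaging options, and the repository may use a different one.

```python
from collections import Counter


def macro_scores(y_true, y_pred, num_classes):
    """Return macro-averaged (precision, recall, f1) for integer class labels.

    Hypothetical helper for illustration; macro averaging is an assumption.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted but is wrong
            fn[t] += 1      # class t was missed

    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        # Guard against division by zero for classes never predicted/seen
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = num_classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    # Toy labels for three classes (made-up data, for illustration only)
    precision, recall, f1 = macro_scores([0, 0, 1, 1, 2, 2],
                                         [0, 1, 1, 1, 2, 0],
                                         num_classes=3)
    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With imbalanced targets such as the product categories in this use case, macro averaging weights every class equally, whereas micro averaging is dominated by the most frequent classes; which one is appropriate depends on the goal of the model.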
108 | 109 | 110 | # Deep Learning Techniques 111 | Below is a list of deep learning techniques and tools utilized throughout this repository. 112 | - PyTorch: 113 | - [PyTorch Code structure](https://pytorch.org/tutorials/beginner/basics/intro.html) 114 | - [Datasets and Dataloaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) 115 | - [Custom Collator for Efficient RAM Dynamic Padding](https://huggingface.co/docs/transformers/main/main_classes/data_collator) 116 | - [Loss Functions](https://pytorch.org/docs/stable/nn.html#loss-functions) 117 | - [Learning Rate Schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) 118 | - [Learning Rate Finder](https://github.com/davidtvs/pytorch-lr-finder) 119 | - [Torch Metrics](https://torchmetrics.readthedocs.io/en/latest/) 120 | - [The Unofficial PyTorch Optimization Song](https://www.youtube.com/watch?v=Nutpusq_AFw) 121 | - [Gradient Checkpointing](https://medium.com/geekculture/training-larger-models-over-your-average-gpu-with-gradient-checkpointing-in-pytorch-571b4b5c2068) 122 | - Hugging Face 123 | - [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) 124 | - [Fast Tokenizers](https://huggingface.co/docs/transformers/v4.19.3/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained.use_fast) 125 | - [Padding Truncation](https://huggingface.co/docs/transformers/pad_truncation) 126 | - [HuggingFace Bert](https://huggingface.co/docs/transformers/model_doc/bert) 127 | - [HF Model Card for: bert-base-uncased](https://huggingface.co/bert-base-uncased) 128 | - [Dynamic Padding](https://www.youtube.com/watch?v=7q5NyFT8REg) 129 | - Basics 130 | - [Combining Mixed Data Types](https://mccormickml.com/2021/06/29/combining-categorical-numerical-features-with-bert/) 131 | - [Cross-Validation Training](https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right) 132 | - [Visualizing Learning Curves for Model 
Diagnosis](https://rstudio-conf-2020.github.io/dl-keras-tf/notebooks/learning-curve-diagnostics.nb.html#:~:text=Overfit%20learning%20curves,a%20greater%20number%20of%20parameters.) 133 | 134 | # Issues 135 | This repository is maintained on a best-effort basis. If you run into a problem or want to make improvements, please raise an issue or open a Pull Request. :smiley: 136 | 137 | # TODOs 138 | - [ ] [Unit tests](https://docs.python.org/3/library/unittest.html) for Python modules 139 | - [ ] Parameter-Efficient Fine-Tuning [(PEFT)](https://github.com/huggingface/peft) methods (e.g. LoRA or QLoRA) 140 | - [ ] Quantize Transformer Models using [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) 141 | - [ ] [Gradient Accumulation in PyTorch](https://kozodoi.me/blog/20210219/gradient-accumulation#:~:text=Gradient%20accumulation%20modifies%20the%20last,been%20processed%20by%20the%20model.) 142 | 143 | 144 | #### Liked the work? Please give a star! 145 | -------------------------------------------------------------------------------- /bash/train-all-cfgs.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Shell script to execute customized model training. 4 | # YAML configuration files are used to control the training pipeline such as datasets, models, hyperparameters, etc. 5 | 6 | cfgfiles=`ls ./cfgs/tune/*.yaml` 7 | for cfgfile in $cfgfiles 8 | do 9 | echo "$(basename "$cfgfile")" 10 | FILENAME="$(basename "$cfgfile")" 11 | python ./scripts/train_model.py --dir ./cfgs/tune/ --name "$FILENAME" 12 | done -------------------------------------------------------------------------------- /cfgs/train-0-notebook.yaml: -------------------------------------------------------------------------------- 1 | # YAML file listing config. 
parameters 2 | seed: 42 3 | 4 | # Paths 5 | paths: 6 | data: 7 | base_dir: ../data 8 | data: cfpb_partial.csv 9 | debug_data: cfpb_debug.csv 10 | partial: cfpb.csv 11 | save_results: 12 | apply_model: False # Save model weights [boolean: True/False] 13 | apply_metric: True # Save performance metrics [boolean: True/False] 14 | base_dir: ../logs 15 | 16 | # DEBUG [True or False]; if True it will load the debug_data 17 | # Use for pipeline development 18 | debug: False 19 | 20 | # DATA 21 | data_info: 22 | source_fields: 23 | - Consumer complaint narrative 24 | - ZIP code 25 | - Sub-issue 26 | target: Product 27 | 28 | # Stratification Technique 29 | stratify: 30 | technique: stratified_kfold 31 | 32 | # Cross-Validation Folds 33 | cv: 34 | num_folds: 2 35 | val_folds: [1] #[list of integers] (start counting at 1) 36 | 37 | # Preprocessing 38 | preprocessing: 39 | apply_techniques: 40 | - LabelEncoder 41 | LabelEncoder: 42 | fields: 43 | - Product 44 | OneHotEncoder: 45 | fields: 46 | - Product 47 | 48 | # Model and Tokenizer 49 | model_tokenizer: 50 | base_dir: ../hf_download 51 | name: bert-base-uncased 52 | 53 | # Model 54 | model: 55 | freeze: 56 | apply: False 57 | # Number of layers to freeze starting from layer 1 58 | num_layers: 0 59 | mean_pooling: 60 | apply: True 61 | gradient_checkpointing: False 62 | 63 | # Tokenizer 64 | tokenizer: 65 | abbreviations: 66 | - Null 67 | add_special_tokens: True 68 | max_length: 512 69 | padding: True 70 | truncation: True 71 | return_tensors: pt 72 | 73 | # Optimizer 74 | optimizer: 75 | name: AdamW 76 | lr: 77 | max: 1.0E-4 78 | 79 | # Learning Rate Scheduler 80 | lr_scheduler: 81 | name: CosineAnnealingLR 82 | OneCycleLR: 83 | pct_start: 0.1 84 | CosineAnnealingLR: 85 | eta_min: 1.0E-5 86 | 87 | # Tuning 88 | epochs: 30 89 | batch_size: 16 90 | num_workers: 8 91 | eval_metric: 92 | name: loss -------------------------------------------------------------------------------- /cfgs/train-1.yaml: 
-------------------------------------------------------------------------------- 1 | # YAML file listing config. parameters 2 | 3 | # Paths 4 | paths: 5 | data: 6 | base_dir: ../data 7 | data: cfpb_partial.csv 8 | debug_data: cfpb_debug.csv 9 | partial: cfpb.csv 10 | save_results: 11 | apply_model: False # Save model weights [boolean: True/False] 12 | apply_metric: True # Save performance metrics [boolean: True/False] 13 | base_dir: ../logs 14 | 15 | # DEBUG [True or False]; if True it will load the debug_data 16 | # Use for pipeline development 17 | debug: False 18 | 19 | # DATA 20 | data_info: 21 | source_fields: 22 | - Consumer complaint narrative 23 | - ZIP code 24 | - Sub-issue 25 | target: Product 26 | 27 | # Stratification Technique 28 | stratify: 29 | technique: stratified_kfold 30 | 31 | # Cross-Validation Folds 32 | cv: 33 | num_folds: 5 34 | val_folds: [1, 2] #[list of integers] (start counting at 1) 35 | 36 | # Preprocessing 37 | preprocessing: 38 | apply_techniques: 39 | - LabelEncoder 40 | LabelEncoder: 41 | fields: 42 | - Product 43 | OneHotEncoder: 44 | fields: 45 | - Product 46 | 47 | # Model and Tokenizer 48 | model_tokenizer: 49 | base_dir: ../hf_download 50 | name: bert-base-uncased 51 | 52 | # Model 53 | model: 54 | freeze: 55 | apply: True 56 | # Number of layers to freeze starting from layer 1 57 | num_layers: 10 58 | # Custom LLM Pooling 59 | mean_pooling: 60 | apply: True 61 | # Gradient checkpointing 62 | gradient_checkpointing: False 63 | 64 | # Tokenizer parameters 65 | tokenizer: 66 | abbreviations: 67 | - Null 68 | add_special_tokens: True 69 | max_length: 512 70 | padding: True 71 | truncation: True 72 | return_tensors: pt 73 | 74 | # Optimizer 75 | optimizer: 76 | name: AdamW 77 | lr: 78 | max: 1.0E-4 79 | 80 | # Learning Rate Scheduler 81 | lr_scheduler: 82 | name: CosineAnnealingLR 83 | OneCycleLR: 84 | pct_start: 0.1 85 | CosineAnnealingLR: 86 | eta_min: 1.0E-5 87 | 88 | 89 | # Model Tuning 90 | epochs: 10 91 | batch_size: 16 92 
| num_workers: 8 93 | eval_metric: 94 | name: loss 95 | -------------------------------------------------------------------------------- /doc/imgs/fitgraph.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mddunlap924/PyTorch-LLM/5c3fcfd608715063ce259a713ec05acbd95cfe31/doc/imgs/fitgraph.jpg -------------------------------------------------------------------------------- /doc/tips_tricks/README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | This directory collects tips and tricks that are helpful for data science projects. Below is a list of the write-ups contained in this directory, with a brief description of each. 4 | 5 | - [VSCode Setup](./vscode_setup.md): describes how the VSCode IDE was set up and used in this project. Environment variables were defined following instructions from the official VS Code website. For example, HuggingFace was set to run in offline mode and to cache objects at specified locations using VSCode settings. -------------------------------------------------------------------------------- /doc/tips_tricks/vscode_setup.md: -------------------------------------------------------------------------------- 1 | # VSCode Setup 2 | VSCode provides robust setup options for projects. Listed below are some links to relevant VSCode documentation. 3 | - [User and Workspace Settings](https://code.visualstudio.com/docs/getstarted/settings) 4 | - [Environment Variables](https://code.visualstudio.com/docs/python/environments#_environment-variables) 5 | - [Python settings reference](https://code.visualstudio.com/docs/python/settings-reference) 6 | 7 | # HuggingFace Cache Setup and Offline Mode 8 | HuggingFace (HF) can be set up to download pretrained models to specified locations and to run in offline mode. 
Please refer to the HF website to read about the [Cache Setup](https://huggingface.co/docs/transformers/installation#cache-setup) and [Offline mode](https://huggingface.co/docs/transformers/installation#offline-mode). 9 | 10 | To implement both the Cache setup and Offline mode, the following was added to the project's folder: 11 | ### Step 1. 12 | 13 | Create a `prod.env` file at the top level of `${workspaceFolder}`. Inside this file the following was written: 14 | 15 | ```env 16 | # prod.env - production configuration 17 | 18 | # HF cache setup 19 | TRANSFORMERS_CACHE=/PATH/WHERE/HF/WILL/CACHE/OBJECTS 20 | 21 | # HF offline mode 22 | TRANSFORMERS_OFFLINE=1 23 | 24 | # HF Parallelism 25 | TOKENIZERS_PARALLELISM=True 26 | 27 | # HF No Advisory Warnings 28 | TRANSFORMERS_NO_ADVISORY_WARNINGS=True 29 | ``` 30 | 31 | ### Step 2. 32 | 33 | Inside the `.vscode` folder, located at the same directory level as the file from Step 1, open the `settings.json` file. Set the variable `python.envFile` as shown in the snippet below. Additional settings from the [Python settings reference](https://code.visualstudio.com/docs/python/settings-reference) can also be configured. 34 | 35 | ```json 36 | "python.envFile": "${workspaceFolder}/prod.env", 37 | ``` -------------------------------------------------------------------------------- /notebooks/eda.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Exploratory Data Analysis (EDA)\n", 8 | "\n", 9 | "This notebook performs a very rudimentary EDA on the original CFPB dataset. The objectives of this notebook are to:\n", 10 | "1. **Introduce the dependent and independent variables** that will be used in the modeling approach.\n", 11 | " - The independent variables will be mixed to showcase how this can be modeled using a custom PyTorch Module. 
The independent variables will consist of: A) unstructured text field and B) some categorical fields.\n", 12 | "2. **Remove null** values from the dataset for future modeling tasks.\n", 13 | " - This preprocessed dataset will be saved back to disk for use in modeling.\n", 14 | "3. **Save a reduced dataset** which will be used for the debugging feature in pipeline development.\n", 15 | " - This is a trick that can be used to tremendously speed up pipeline/code development.\n", 16 | "4. **Save about 10% of data** to disk for experimenting/working with because the full dataset takes too long to process for demonstrations." 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "# Import Libraries\n", 26 | "import os\n", 27 | "import pandas as pd\n", 28 | "import matplotlib.pyplot as plt\n", 29 | "from transformers import AutoTokenizer\n", 30 | "from torch.utils.data import Dataset, DataLoader\n", 31 | "\n", 32 | "# Allow HF tokenizer parallelism\n", 33 | "os.environ['TOKENIZERS_PARALLELISM'] = 'True'" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "# User Inputs" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "# Path to Data\n", 50 | "PATHS = {'data': '../data/rows.csv',\n", 51 | " 'hf_cache': os.environ['TRANSFORMERS_CACHE'],\n", 52 | " 'save_processed_data': '../data/cfpb.csv',\n", 53 | " 'save_debug_data': '../data/cfpb_debug.csv',\n", 54 | " 'save_partial_data': '../data/cfpb_partial.csv'}\n", 55 | "\n", 56 | "# Name of the model\n", 57 | "model_name = 'bert-base-uncased'" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "# Load Data and Basic EDA\n", 65 | "\n", 66 | "The basic EDA will be to view the number of unique values, remove nulls, and select a few different source fields that could be used to 
predict the target variable." 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 3, 72 | "metadata": {}, 73 | "outputs": [ 74 | { 75 | "name": "stdout", 76 | "output_type": "stream", 77 | "text": [ 78 | "CFPB Data Shape: (1,282,355, 18)\n" 79 | ] 80 | }, 81 | { 82 | "data": { 83 | "text/html": [ 84 | "
\n", 85 | "\n", 98 | "\n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | "
| | Date received | Product | Sub-product | Issue | Sub-issue | Consumer complaint narrative | Company public response | Company | State | ZIP code | Tags | Consumer consent provided? | Submitted via | Date sent to company | Company response to consumer | Timely response? | Consumer disputed? | Complaint ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 05/10/2019 | Checking or savings account | Checking account | Managing an account | Problem using a debit or ATM card | NaN | NaN | NAVY FEDERAL CREDIT UNION | FL | 328XX | Older American | NaN | Web | 05/10/2019 | In progress | Yes | NaN | 3238275 |
| 1 | 05/10/2019 | Checking or savings account | Other banking product or service | Managing an account | Deposits and withdrawals | NaN | NaN | BOEING EMPLOYEES CREDIT UNION | WA | 98204 | NaN | NaN | Referral | 05/10/2019 | Closed with explanation | Yes | NaN | 3238228 |
\n", 167 | "
" 168 | ], 169 | "text/plain": [ 170 | " Date received Product \\\n", 171 | "0 05/10/2019 Checking or savings account \n", 172 | "1 05/10/2019 Checking or savings account \n", 173 | "\n", 174 | " Sub-product Issue \\\n", 175 | "0 Checking account Managing an account \n", 176 | "1 Other banking product or service Managing an account \n", 177 | "\n", 178 | " Sub-issue Consumer complaint narrative \\\n", 179 | "0 Problem using a debit or ATM card NaN \n", 180 | "1 Deposits and withdrawals NaN \n", 181 | "\n", 182 | " Company public response Company State ZIP code \\\n", 183 | "0 NaN NAVY FEDERAL CREDIT UNION FL 328XX \n", 184 | "1 NaN BOEING EMPLOYEES CREDIT UNION WA 98204 \n", 185 | "\n", 186 | " Tags Consumer consent provided? Submitted via \\\n", 187 | "0 Older American NaN Web \n", 188 | "1 NaN NaN Referral \n", 189 | "\n", 190 | " Date sent to company Company response to consumer Timely response? \\\n", 191 | "0 05/10/2019 In progress Yes \n", 192 | "1 05/10/2019 Closed with explanation Yes \n", 193 | "\n", 194 | " Consumer disputed? 
Complaint ID \n", 195 | "0 NaN 3238275 \n", 196 | "1 NaN 3238228 " 197 | ] 198 | }, 199 | "metadata": {}, 200 | "output_type": "display_data" 201 | } 202 | ], 203 | "source": [ 204 | "# Load Data\n", 205 | "df = pd.read_csv(PATHS['data'], low_memory=False)\n", 206 | "\n", 207 | "# Display data shape and some rows\n", 208 | "print(f'CFPB Data Shape: ({df.shape[0]:,}, {df.shape[1]})')\n", 209 | "display(df.head(2))" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 4, 215 | "metadata": {}, 216 | "outputs": [ 217 | { 218 | "data": { 219 | "text/html": [ 220 | "\n", 222 | "\n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | "
 unique
| Date received | 2,717 |
| Product | 18 |
| Sub-product | 76 |
| Issue | 167 |
| Sub-issue | 218 |
| Consumer complaint narrative | 366,945 |
| Company public response | 10 |
| Company | 5,275 |
| State | 63 |
| ZIP code | 22,591 |
| Tags | 3 |
| Consumer consent provided? | 4 |
| Submitted via | 6 |
| Date sent to company | 2,666 |
| Company response to consumer | 8 |
| Timely response? | 2 |
| Consumer disputed? | 2 |
| Complaint ID | nan |
\n" 304 | ], 305 | "text/plain": [ 306 | "" 307 | ] 308 | }, 309 | "metadata": {}, 310 | "output_type": "display_data" 311 | } 312 | ], 313 | "source": [ 314 | "# Uniques for each field\n", 315 | "tmp = df.describe(include='all').loc['unique', :]\n", 316 | "display(tmp.to_frame().style.format(\"{:,.0f}\"))\n", 317 | "del tmp" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 5, 323 | "metadata": {}, 324 | "outputs": [ 325 | { 326 | "name": "stdout", 327 | "output_type": "stream", 328 | "text": [ 329 | "Number of Non-Nulls in Each Column BEFORE Removing Nulls\n", 330 | "\tConsumer complaint narrative: 383,564\n", 331 | "\tZIP code: 1,167,057\n", 332 | "\tSub-issue: 751,169\n", 333 | "CFPB Data Shape After Removing Nulls:(209,586, 18)\n", 334 | "Number of Non-Nulls in Each Column AFTER Removing Nulls\n", 335 | "\tConsumer complaint narrative: 209,586\n", 336 | "\tZIP code: 209,586\n", 337 | "\tSub-issue: 209,586\n" 338 | ] 339 | } 340 | ], 341 | "source": [ 342 | "# Number of nulls in source fields\n", 343 | "source_fields = ['Consumer complaint narrative',\n", 344 | " 'ZIP code',\n", 345 | " 'Sub-issue']\n", 346 | "target = 'Product'\n", 347 | "print(f'Number of Non-Nulls in Each Column BEFORE Removing Nulls')\n", 348 | "n_rows = len(df)\n", 349 | "for col in source_fields:\n", 350 | " print(f'\\t{col}: {n_rows - df[col].isnull().sum():,}')\n", 351 | "\n", 352 | "# Reduce the dataframe to only non-null consumer complaints\n", 353 | "data = (df.dropna(subset=source_fields)\n", 354 | " .reset_index(drop=True))\n", 355 | "print((f'CFPB Data Shape After Removing Nulls:'\n", 356 | " f'({data.shape[0]:,}, {data.shape[1]})'))\n", 357 | "\n", 358 | "# Number of nulls in source fields AFTER removing Nulls\n", 359 | "print(f'Number of Non-Nulls in Each Column AFTER Removing Nulls')\n", 360 | "n_rows = len(data)\n", 361 | "for col in source_fields:\n", 362 | " print(f'\\t{col}: {n_rows - data[col].isnull().sum():,}')" 363 | ] 364 | }, 365 | { 366 | 
"cell_type": "code", 367 | "execution_count": 6, 368 | "metadata": {}, 369 | "outputs": [ 370 | { 371 | "name": "stdout", 372 | "output_type": "stream", 373 | "text": [ 374 | "Product\n" 375 | ] 376 | }, 377 | { 378 | "data": { 379 | "text/html": [ 380 | "\n", 382 | "\n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | "
 count
Product 
| Credit reporting, credit repair services, or other personal consumer reports | 71,813 |
| Debt collection | 66,702 |
| Credit reporting | 24,524 |
| Credit card or prepaid card | 15,992 |
| Student loan | 15,846 |
| Checking or savings account | 10,130 |
| Vehicle loan or lease | 4,387 |
| Payday loan, title loan, or personal loan | 192 |
\n" 428 | ], 429 | "text/plain": [ 430 | "" 431 | ] 432 | }, 433 | "metadata": {}, 434 | "output_type": "display_data" 435 | } 436 | ], 437 | "source": [ 438 | "# Value counts for the target 'Product'\n", 439 | "for col in [target]:\n", 440 | " print(col)\n", 441 | " tmp = data[col].value_counts().to_frame().style.format(\"{:,.0f}\")\n", 442 | " display(tmp)\n", 443 | " del tmp" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 7, 449 | "metadata": {}, 450 | "outputs": [ 451 | { 452 | "name": "stdout", 453 | "output_type": "stream", 454 | "text": [ 455 | "CFPB Data Shape: (127,451, 18)\n" 456 | ] 457 | }, 458 | { 459 | "data": { 460 | "text/plain": [ 461 | "Product\n", 462 | "Debt collection 66702\n", 463 | "Credit reporting 24524\n", 464 | "Credit card or prepaid card 15992\n", 465 | "Student loan 15846\n", 466 | "Vehicle loan or lease 4387\n", 467 | "Name: count, dtype: int64" 468 | ] 469 | }, 470 | "metadata": {}, 471 | "output_type": "display_data" 472 | } 473 | ], 474 | "source": [ 475 | "# https://esource.dbs.ie/bitstream/handle/10788/4224/msc_shivaprasad_vm_2020.pdf?sequence=1&isAllowed=y\n", 476 | "# Reduce the \"Product\" categories for quicker processing\n", 477 | "keep_products = ['Debt collection',\n", 478 | " 'Credit reporting',\n", 479 | " 'Credit card or prepaid card',\n", 480 | " 'Student loan',\n", 481 | " 'Vehicle loan or lease',\n", 482 | " ]\n", 483 | "data = data[data['Product'].isin(keep_products)].reset_index(drop=True)\n", 484 | "print(f'CFPB Data Shape: ({data.shape[0]:,}, {data.shape[1]})')\n", 485 | "display(data['Product'].value_counts())" 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "# Number of Tokens Distribution\n", 493 | "\n", 494 | "The field `Consumer complaint narrative` is the unstructured text source field that will be used to predict the target variable. The `bert-base-uncased` model used in this example only allows for a maximum of 512 tokens. 
Text past this point will be truncated. Other methods (e.g., sliding windows) can handle longer text but are not implemented in this repository. We therefore check the distribution of the number of tokens for this field. " 495 | ] 496 | }, 497 | { 498 | "cell_type": "code", 499 | "execution_count": 8, 500 | "metadata": {}, 501 | "outputs": [ 502 | { 503 | "name": "stdout", 504 | "output_type": "stream", 505 | "text": [ 506 | "{\n", 507 | " \"architectures\": [\n", 508 | " \"BertForMaskedLM\"\n", 509 | " ],\n", 510 | " \"attention_probs_dropout_prob\": 0.1,\n", 511 | " \"gradient_checkpointing\": false,\n", 512 | " \"hidden_act\": \"gelu\",\n", 513 | " \"hidden_dropout_prob\": 0.1,\n", 514 | " \"hidden_size\": 768,\n", 515 | " \"initializer_range\": 0.02,\n", 516 | " \"intermediate_size\": 3072,\n", 517 | " \"layer_norm_eps\": 1e-12,\n", 518 | " \"max_position_embeddings\": 512,\n", 519 | " \"model_type\": \"bert\",\n", 520 | " \"num_attention_heads\": 12,\n", 521 | " \"num_hidden_layers\": 12,\n", 522 | " \"pad_token_id\": 0,\n", 523 | " \"position_embedding_type\": \"absolute\",\n", 524 | " \"transformers_version\": \"4.6.0.dev0\",\n", 525 | " \"type_vocab_size\": 2,\n", 526 | " \"use_cache\": true,\n", 527 | " \"vocab_size\": 30522\n", 528 | "}\n" 529 | ] 530 | }, 531 | { 532 | "data": { 533 | "text/plain": [ 534 | "0" 535 | ] 536 | }, 537 | "execution_count": 8, 538 | "metadata": {}, 539 | "output_type": "execute_result" 540 | } 541 | ], 542 | "source": [ 543 | "# Load the tokenizer\n", 544 | "tokenizer = AutoTokenizer.from_pretrained(f'{PATHS[\"hf_cache\"]}/'\n", 545 | " f'{model_name}')\n", 546 | "\n", 547 | "# View the Model configuration JSON\n", 548 | "os.system(f'cat {PATHS[\"hf_cache\"]}/{model_name}/config.json')" 549 | ] 550 | }, 551 | { 552 | "cell_type": "markdown", 553 | "metadata": {}, 554 | "source": [ 555 | "In the above cell the `max_position_embeddings=512` parameter for the BERT model means BERT can only take input 
sequences up to 512 tokens in length. There are solutions for handling longer text as discussed in this article by [Salt Data Labs](https://www.saltdatalabs.com/blog/bert-how-to-handle-long-documents#:~:text=However%2C%20BERT%20can%20only%20take,much%20longer%20than%20512%20words.).\n", 556 | "\n", 557 | "A few observations about the below cell:\n", 558 | "- To speed up calculations, the Torch Dataset and DataLoader classes are used, which allows multi-core processing via the `num_workers` parameter. This gave roughly an 8x speedup on my computer compared with a simple pandas.apply() call (i.e., single-core use).\n", 559 | "- Notice how the Torch Dataset and DataLoader classes can be customized; in the example below they are used to count the number of tokens in the field `Consumer complaint narrative`." 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": 9, 565 | "metadata": {}, 566 | "outputs": [ 567 | { 568 | "name": "stdout", 569 | "output_type": "stream", 570 | "text": [ 571 | "Number of token distribution:\n" 572 | ] 573 | }, 574 | { 575 | "data": { 576 | "text/plain": [ 577 | "count 127451.000000\n", 578 | "mean 223.945312\n", 579 | "std 249.286584\n", 580 | "min 4.000000\n", 581 | "25% 81.000000\n", 582 | "50% 152.000000\n", 583 | "75% 281.000000\n", 584 | "max 8756.000000\n", 585 | "Name: num_tokens, dtype: float64" 586 | ] 587 | }, 588 | "metadata": {}, 589 | "output_type": "display_data" 590 | }, 591 | { 592 | "data": { 593 | "image/png": 
"[base64 PNG data omitted: histogram of num_tokens]", 594 | "text/plain": [ 595 | "
" 596 | ] 597 | }, 598 | "metadata": {}, 599 | "output_type": "display_data" 600 | } 601 | ], 602 | "source": [ 603 | "class CountTokens(Dataset):\n", 604 | " def __init__(self, texts, tok):\n", 605 | " self.texts = texts\n", 606 | " self.tok = tok\n", 607 | " \n", 608 | " def __len__(self):\n", 609 | " return len(self.texts)\n", 610 | " \n", 611 | " def __getitem__(self, item):\n", 612 | " # Number of tokens in string\n", 613 | " values = self.tok.encode_plus(text=self.texts[item],\n", 614 | " padding=False,\n", 615 | " truncation=False)\n", 616 | " return len(values['input_ids'])\n", 617 | "\n", 618 | "# Temporary Dataframe\n", 619 | "tmp = data[source_fields[0]].copy().to_frame()\n", 620 | "\n", 621 | "# Datasets and dataloaders\n", 622 | "tokens_dataset = CountTokens(texts=tmp[source_fields[0]].values,\n", 623 | " tok=tokenizer)\n", 624 | "tokens_dataloader = DataLoader(tokens_dataset,\n", 625 | " batch_size=1_024,\n", 626 | " shuffle=False,\n", 627 | " num_workers=8)\n", 628 | "\n", 629 | "# Count number of tokens in dataframe\n", 630 | "num_tokens = []\n", 631 | "for count, token_lengths in enumerate(tokens_dataloader):\n", 632 | " num_tokens.extend(token_lengths.numpy().tolist())\n", 633 | "tmp['num_tokens'] = num_tokens\n", 634 | "print('Number of token distribution:')\n", 635 | "display(tmp.num_tokens.describe())\n", 636 | "tmp.num_tokens.plot(kind='hist', bins=25)\n", 637 | "plt.show()\n", 638 | "del tmp" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": {}, 644 | "source": [ 645 | "# Save Data to Disk\n", 646 | "\n", 647 | "The preprocessed data will be saved to disk for modeling." 648 | ] 649 | }, 650 | { 651 | "cell_type": "markdown", 652 | "metadata": {}, 653 | "source": [ 654 | "### Dataset 1 of 3: All Data\n", 655 | "\n", 656 | "This is all the training data and is recommended when final model training is ready/needed. This will result in longer run time." 
657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": 10, 662 | "metadata": {}, 663 | "outputs": [ 664 | { 665 | "name": "stdout", 666 | "output_type": "stream", 667 | "text": [ 668 | "CFPB Data Shape: (127,451, 4)\n", 669 | "Index(['Consumer complaint narrative', 'ZIP code', 'Sub-issue', 'Product'], dtype='object')\n" 670 | ] 671 | } 672 | ], 673 | "source": [ 674 | "# Reduce the data to only the necessary columns\n", 675 | "data = data[source_fields + [target]]\n", 676 | "\n", 677 | "# Save preprocessed data\n", 678 | "data.to_csv(PATHS['save_processed_data'], index=False)\n", 679 | "print(f'CFPB Data Shape: ({data.shape[0]:,}, {data.shape[1]})')\n", 680 | "print(data.columns)" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": {}, 686 | "source": [ 687 | "### Dataset 2 of 3: Debug Data\n", 688 | "\n", 689 | "This is a very small sample of the data, ideal for code development. Because the data is so small, it loads and processes quickly, so minimal time is wasted waiting for preprocessing / data loading." 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": 11, 695 | "metadata": {}, 696 | "outputs": [ 697 | { 698 | "name": "stdout", 699 | "output_type": "stream", 700 | "text": [ 701 | "Num. 
of Instances in Debug Data: 25\n" 702 | ] 703 | } 704 | ], 705 | "source": [ 706 | "# Downsample data to a much smaller size\n", 707 | "# This data will be quicker for all operations when debugging code\n", 708 | "prod_uniques = data.Product.unique()\n", 709 | "\n", 710 | "# Sample 5 instances for each value in Product\n", 711 | "# Product has 5 unique values; therefore 5 * 5 = 25 instances\n", 712 | "data_debug = None\n", 713 | "for prod_value in prod_uniques:\n", 714 | " tmp = data[data.Product == prod_value].iloc[0:5, :]\n", 715 | " if data_debug is None:\n", 716 | " data_debug = tmp\n", 717 | " else:\n", 718 | " data_debug = pd.concat([data_debug, tmp],\n", 719 | " ignore_index=True)\n", 720 | "# Shuffle the rows\n", 721 | "data_debug = data_debug.sample(frac=1)\n", 722 | "print(f'Num. of Instances in Debug Data: {len(data_debug):,}')\n", 723 | "\n", 724 | "data_debug.to_csv(PATHS['save_debug_data'], index=False)" 725 | ] 726 | }, 727 | { 728 | "cell_type": "markdown", 729 | "metadata": {}, 730 | "source": [ 731 | "### Dataset 3 of 3: Partial Data\n", 732 | "\n", 733 | "This dataset will be used for experimenting/demos because it takes a long time to process the full Dataset #1." 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": 12, 739 | "metadata": {}, 740 | "outputs": [ 741 | { 742 | "name": "stdout", 743 | "output_type": "stream", 744 | "text": [ 745 | "Num. 
of Instances in Partial Data: 5,000\n" 746 | ] 747 | } 748 | ], 749 | "source": [ 750 | "# Downsample data to a much smaller size\n", 751 | "# This data will be quicker for all operations when experimenting\n", 752 | "prod_uniques = data.Product.unique()\n", 753 | "N = 1_000\n", 754 | "# Sample N instances for each value in Product\n", 755 | "# Product has 5 unique values; therefore 1,000 * 5 = 5,000 instances\n", 756 | "data_partial = None\n", 757 | "for prod_value in prod_uniques:\n", 758 | " tmp = data[data.Product == prod_value].iloc[0:N, :]\n", 759 | " if data_partial is None:\n", 760 | " data_partial = tmp\n", 761 | " else:\n", 762 | " data_partial = pd.concat([data_partial, tmp],\n", 763 | " ignore_index=True)\n", 764 | "# Shuffle the rows\n", 765 | "data_partial = data_partial.sample(frac=1)\n", 766 | "print(f'Num. of Instances in Partial Data: {len(data_partial):,}')\n", 767 | "\n", 768 | "data_partial.to_csv(PATHS['save_partial_data'], index=False)" 769 | ] 770 | }, 771 | { 772 | "cell_type": "markdown", 773 | "metadata": {}, 774 | "source": [ 775 | "# Summary\n", 776 | "\n", 777 | "- **DATA**:\n", 778 | " - Number of Independent variables = 3\n", 779 | " - `Consumer complaint narrative`: string of unstructured text and all instances are unique \n", 780 | " - `ZIP code`: a categorical variable with high cardinality (i.e., > 20K unique values)\n", 781 | " - `Sub-issue`: a categorical variable with 63 unique values\n", 782 | " - Dependent Variable = `Product`\n", 783 | " - `Product`: categorical with 5 unique values\n", 784 | " - Three datasets were created for modeling:\n", 785 | " - `Primary` dataset with ~127K instances\n", 786 | " - `Debug` dataset with 25 instances, which is helpful when developing code\n", 787 | " - `Partial` dataset with 5K instances for faster processing\n", 788 | "- **NOTES**: \n", 789 | " - No emphasis was placed on selecting optimal independent variables for modeling because the focus of this repository is to highlight a workflow. 
Mixed data types (i.e., text and categorical variables) were selected to demonstrate their usage during modeling.\n", 790 | "\n", 791 | "### Next Steps:\n", 792 | "\n", 793 | "Refer to the [training.ipynb](https://nbviewer.org/github/mddunlap924/PyTorch-LLM/blob/main/notebooks/training.ipynb) Jupyter Notebook that demonstrates how to fine-tune a multi-classification LLM model, using a generic PyTorch workflow, on this data.\n" 794 | ] 795 | }, 796 | { 797 | "cell_type": "markdown", 798 | "metadata": {}, 799 | "source": [] 800 | } 801 | ], 802 | "metadata": { 803 | "kernelspec": { 804 | "display_name": ".venv", 805 | "language": "python", 806 | "name": "python3" 807 | }, 808 | "language_info": { 809 | "codemirror_mode": { 810 | "name": "ipython", 811 | "version": 3 812 | }, 813 | "file_extension": ".py", 814 | "mimetype": "text/x-python", 815 | "name": "python", 816 | "nbconvert_exporter": "python", 817 | "pygments_lexer": "ipython3", 818 | "version": "3.10.6" 819 | }, 820 | "orig_nbformat": 4 821 | }, 822 | "nbformat": 4, 823 | "nbformat_minor": 2 824 | } 825 | -------------------------------------------------------------------------------- /notebooks/references/optimization-approaches-for-transformers.ipynb: -------------------------------------------------------------------------------- 1 | {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"

\n Translations:\n English, Chinese\n

\n\n\n\n

Introduction

\n

\n Despite the stunning success of Transformers in Natural Language Processing (NLP) tasks, it is still challenging to train them even on modern Graphics Processing Units (GPUs) or to deploy them in production, due to the massive number of parameters. When training or running inference with such large models, we can easily run out of memory (OOM), or the process can take a very long time.

\nNevertheless, there are many approaches for avoiding such problems. The main contribution of this article is to describe these methods and show how they can be applied in training and inference scripts. First, the article goes through basic approaches such as Gradient Accumulation, Freezing, Automatic Mixed Precision, 8-bit Optimizers, and Gradient Checkpointing, and then describes NLP-specific optimization approaches such as Dynamic Padding, Uniform Dynamic Padding, and Fast Tokenizers.\n\n

","metadata":{}},{"cell_type":"markdown","source":"\n

Table of contents

\n\n","metadata":{}},{"cell_type":"code","source":"!pip uninstall -q -y transformers","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","_kg_hide-input":true,"execution":{"iopub.status.busy":"2022-06-26T12:07:07.844216Z","iopub.execute_input":"2022-06-26T12:07:07.844807Z","iopub.status.idle":"2022-06-26T12:07:12.560936Z","shell.execute_reply.started":"2022-06-26T12:07:07.844682Z","shell.execute_reply":"2022-06-26T12:07:12.559967Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"import sys\nsys.path.append(\"../input/torch-components-library/torch-components-main\")\nsys.path.append(\"../input/transformers/src\")\nimport transformers\nimport warnings\nimport os\n\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"true\"\n\nwarnings.simplefilter(\"ignore\")\ntransformers.logging.set_verbosity_error()","metadata":{"_kg_hide-input":true,"execution":{"iopub.status.busy":"2022-06-26T12:07:12.562525Z","iopub.execute_input":"2022-06-26T12:07:12.563104Z","iopub.status.idle":"2022-06-26T12:07:18.862859Z","shell.execute_reply.started":"2022-06-26T12:07:12.563056Z","shell.execute_reply":"2022-06-26T12:07:18.862071Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

Gradient Accumulation

\n\n

\nThe idea behind Gradient Accumulation is very simple: simulate a larger batch size. Sometimes a large batch size is necessary for better convergence or better performance, but it often requires a lot of memory. One possible solution is to use a smaller batch size. However, a small batch size increases training or inference time, and gradient descent algorithms are very sensitive to the choice of batch size, so a small batch may lead to unstable convergence and reduced performance. Instead, we can run multiple steps (accumulation steps), accumulating (averaging) the gradients over a certain number of steps, and perform the optimization step only once enough gradients have been accumulated.\n
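
The scaling step can be checked with a small, framework-free sketch. It assumes a one-parameter linear model with a mean squared error loss and equal-sized micro-batches; the helper names (mse_grad, accumulated_grad) and the toy data are made up for illustration and do not come from this notebook. Under those assumptions, summing the micro-batch gradients, each divided by the number of accumulation steps, should reproduce the gradient of the full batch:

```python
import math


def mse_grad(w, xs, ys):
    # Gradient of mean((w * x - y) ** 2) with respect to w
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    # Hypothetical helper: split the batch into equal micro-batches
    # and sum their scaled gradients, as an accumulation loop would
    assert len(xs) % micro_batch_size == 0, 'sketch assumes equal-sized micro-batches'
    accumulation_steps = len(xs) // micro_batch_size
    grad = 0.0
    for start in range(0, len(xs), micro_batch_size):
        stop = start + micro_batch_size
        # Dividing by accumulation_steps plays the role of
        # loss = loss / gradient_accumulation_steps in a training loop
        grad += mse_grad(w, xs[start:stop], ys[start:stop]) / accumulation_steps
    return grad


# Toy data (made up): roughly y = 2 * x
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = mse_grad(w, xs, ys)
accumulated = accumulated_grad(w, xs, ys, micro_batch_size=2)
print(math.isclose(full, accumulated, rel_tol=1e-9))
```

If the last group of an epoch contains fewer micro-batches than the configured number of accumulation steps, the same scaling yields a smaller effective gradient, so the equivalence is only exact for full groups.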

","metadata":{}},{"cell_type":"markdown","source":"
\"Gradient
\n

Visualization of how Gradient Accumulation works

","metadata":{}},{"cell_type":"markdown","source":"

Vanilla training loop

\n","metadata":{}},{"cell_type":"code","source":"for step, batch in enumerate(loader, 1):\n \n # prepare inputs and targets for the model and loss function respectively.\n \n # forward pass\n outputs = model(inputs)\n \n # computing loss\n loss = loss_fn(outputs, targets)\n \n # backward pass\n loss.backward()\n \n # perform optimization step\n torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)\n optimizer.step()\n model.zero_grad()\n \n # perform validation loop\n if step % validation_steps == 0:\n validation_loop()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

Training loop with Gradient Accumulation

","metadata":{}},{"cell_type":"code","source":"steps = len(loader)\n\n# perform validation loop each `validation_steps` training steps!\nvalidation_steps = int(validation_steps * gradient_accumulation_steps)\n\nfor step, batch in enumerate(loader, 1):\n \n # prepare inputs and targets for the model and loss function respectively.\n \n # forward pass\n outputs = model(inputs)\n \n # computing loss\n loss = loss_fn(outputs, targets)\n \n # accumulating gradients over steps\n if gradient_accumulation_steps > 1:\n loss = loss / gradient_accumulation_steps\n \n # backward pass\n loss.backward()\n \n # perform optimization step after certain number of accumulating steps and at the end of epoch\n if step % gradient_accumulation_steps == 0 or step == steps:\n torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)\n optimizer.step()\n model.zero_grad()\n \n # perform validation loop\n if step % validation_steps == 0:\n validation_loop()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

Freezing

\n\n

\nFreezing is an effective way to speed up training and decrease memory utilization, with almost no loss in final quality, by disabling gradient computation in certain layers of the model.

\nA well-known fact in Deep Learning is that lower layers learn input data patterns, while top layers learn high-level features that are specific to the target task. When performing the optimization step with an optimization algorithm (e.g. SGD, AdamW, or RMSprop), the lower layers receive small gradients, so their parameters barely change, which is known as Gradient Vanishing. So instead of computing \"useless\" gradients and optimizing such low-gradient parameters, which sometimes requires a lot of time and computational power, we can just freeze them.

\nPyTorch provides a convenient API for toggling gradient computation: the requires_grad attribute of torch.Tensor. Parameters with requires_grad set to False receive no gradients during the backward pass.\n\n

","metadata":{}},{"cell_type":"markdown","source":"

Implementation

","metadata":{}},{"cell_type":"code","source":"def freeze(module):\n \"\"\"\n Freezes module's parameters.\n \"\"\"\n \n for parameter in module.parameters():\n parameter.requires_grad = False\n \ndef get_freezed_parameters(module):\n \"\"\"\n Returns names of freezed parameters of the given module.\n \"\"\"\n \n freezed_parameters = []\n for name, parameter in module.named_parameters():\n if not parameter.requires_grad:\n freezed_parameters.append(name)\n \n return freezed_parameters","metadata":{"execution":{"iopub.status.busy":"2022-06-26T12:07:39.151609Z","iopub.execute_input":"2022-06-26T12:07:39.152611Z","iopub.status.idle":"2022-06-26T12:07:39.158249Z","shell.execute_reply.started":"2022-06-26T12:07:39.152576Z","shell.execute_reply":"2022-06-26T12:07:39.157487Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"import torch\nfrom transformers import AutoConfig, AutoModel\n\n\n# initializing model\nmodel_path = \"microsoft/deberta-v3-base\"\nconfig = AutoConfig.from_pretrained(model_path)\nmodel = AutoModel.from_pretrained(model_path, config=config)\n\n\n# freezing embeddings and first 2 layers of encoder\nfreeze(model.embeddings)\nfreeze(model.encoder.layer[:2])\n\nfreezed_parameters = get_freezed_parameters(model)\nprint(f\"Freezed parameters: {freezed_parameters}\")\n\n# selecting parameters, which requires gradients and initializing optimizer\nmodel_parameters = filter(lambda parameter: parameter.requires_grad, model.parameters())\noptimizer = torch.optim.AdamW(params=model_parameters, lr=2e-5, weight_decay=0.0)","metadata":{"execution":{"iopub.status.busy":"2022-06-26T12:07:40.963979Z","iopub.execute_input":"2022-06-26T12:07:40.964347Z","iopub.status.idle":"2022-06-26T12:07:59.202224Z","shell.execute_reply.started":"2022-06-26T12:07:40.964315Z","shell.execute_reply":"2022-06-26T12:07:59.201236Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

Automatic Mixed Precision

\n\n

\nAutomatic Mixed Precision (AMP) is another simple way to reduce memory consumption and training time without losing final quality. It was introduced in the \"Mixed Precision Training\" paper by NVIDIA and Baidu researchers in 2017. The key idea is to perform most computations, and to store activations and gradients, in half precision (e.g. float16) instead of full precision (e.g. float32). However, when gradients are computed in lower precision, some values can be so small that they are rounded to zero; this phenomenon is called \"underflow\". To prevent underflow, the authors of the original paper proposed loss scaling: the loss is multiplied by a scale factor before the backward pass, and the gradients are unscaled again before the optimizer step.

\nPyTorch provides a package with the necessary functionality (from lowering precision to gradient scaling) for using Automatic Mixed Precision, called torch.cuda.amp. Automatic Mixed Precision is implemented as a context manager, so it can easily be inserted into training and inference scripts.\n\n\n
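To make the rounding-to-zero problem (underflow) and the scaling fix concrete, here is a minimal sketch that uses only the Python standard library. It emulates float16 rounding with the half-precision format code of the struct module. The helper name to_fp16, the example gradient value, and the scale factor are illustrative assumptions; they are not taken from the paper or from torch.cuda.amp.

```python
import struct


def to_fp16(value: float) -> float:
    # Round a Python float to the nearest IEEE 754 half-precision value.
    return struct.unpack('e', struct.pack('e', value))[0]


# A gradient this small cannot be represented in float16 and becomes zero.
tiny_grad = 1e-8
unscaled = to_fp16(tiny_grad)

# Scaling the loss multiplies every gradient by the same factor,
# which moves small values into the range float16 can represent.
scale = 65536.0  # hypothetical scale factor (2 ** 16)
scaled = to_fp16(tiny_grad * scale)

# Unscaling happens in full precision, before the optimizer step.
recovered = scaled / scale

print('float16 without scaling:', unscaled)
print('float16 with scaling, then unscaled:', recovered)
```

In a real training loop this bookkeeping is handled by the gradient scaler, and the gradients should still be reset (for example with optimizer.zero_grad()) after every optimizer step.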

","metadata":{}},{"cell_type":"markdown","source":"
\"Automatic
\n

Visualization of how Automatic Mixed Precision works

","metadata":{}},{"cell_type":"markdown","source":"

Vanilla training loop

","metadata":{}},{"cell_type":"code","source":"for step, batch in enumerate(loader, 1):\n \n # prepare inputs and targets for the model and loss function respectively.\n \n # forward pass\n outputs = model(inputs)\n \n # computing loss\n loss = loss_fn(outputs, targets)\n \n # backward pass\n loss.backward()\n \n # perform optimization step\n torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)\n optimizer.step()\n model.zero_grad()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

Training loop with Automatic Mixed Precision

","metadata":{}},{"cell_type":"code","source":"from torch.cuda.amp import autocast, GradScaler\n\n\nscaler = GradScaler()\n\nfor step, batch in enumerate(loader, 1):\n \n # prepare inputs and targets for the model and loss function respectively.\n\n # forward pass with `autocast` context manager\n with autocast(enabled=True):\n outputs = model(inputs)\n \n # computing loss\n loss = loss_fn(outputs, targets)\n \n # scale gradint and perform backward pass\n scaler.scale(loss).backward()\n \n # before gradient clipping the optimizer parameters must be unscaled.\n scaler.unscale_(optimizer)\n \n # perform optimization step\n torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)\n \n scaler.step(optimizer)\n scaler.update()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

8-bit Optimizers

\n\n

\n The idea of 8-bit Optimizers is similar to Automatic Mixed Precision, where the model's parameters and gradients are kept in lower precision, but 8-bit Optimizers additionally keep the optimizer's state in lower precision. The authors (Meta Research) described 8-bit Optimizers in detail in the original paper \"8-bit Optimizers via Block-wise Quantization\" and showed that they significantly reduce memory usage and slightly speed up training. The authors also studied the impact of different hyperparameter settings and showed that 8-bit Optimizers are robust to different choices of learning rate, betas, and weight decay, without losing performance or hurting convergence. Finally, the authors provide a convenient high-level library for 8-bit Optimizers, called bitsandbytes. \n \n
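The block-wise idea named in the paper title can be sketched in a few lines of plain Python. This is a deliberately simplified illustration and not the bitsandbytes implementation: it uses linear absmax quantization, whereas the paper uses a non-linear quantization map, and the function names, block size, and example values are assumptions made for this sketch.

```python
def quantize_blockwise(values, block_size=4):
    # Split the state into blocks and quantize each block independently,
    # storing one float scale (the block's absolute maximum) per block.
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid division by zero
        codes = [round(v / absmax * 127) for v in block]  # integers in [-127, 127]
        blocks.append((absmax, codes))
    return blocks


def dequantize_blockwise(blocks):
    # Map the 8-bit codes back to floats using each block's scale.
    return [code / 127 * absmax for absmax, codes in blocks for code in codes]


# Hypothetical optimizer state with one large outlier in the second block.
state = [0.010, -0.020, 0.015, 0.005, 0.012, 100.0, -0.018, 0.007]

restored = dequantize_blockwise(quantize_blockwise(state, block_size=4))
errors = [abs(a - b) for a, b in zip(state, restored)]

# The outlier only degrades precision inside its own block.
print('max error, block without outlier:', max(errors[:4]))
print('max error, block with outlier:', max(errors[4:]))
```

Because every block has its own scale, a single outlier cannot wipe out the precision of the whole tensor, which is the motivation for quantizing block by block.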

","metadata":{}},{"cell_type":"markdown","source":"
\"8-bit
\n

Comparison table of different optimizers

","metadata":{}},{"cell_type":"markdown","source":"

Initializing optimizer via PyTorch API

","metadata":{}},{"cell_type":"code","source":"import torch\nfrom transformers import AutoConfig, AutoModel\n\n# initializing model\nmodel_path = \"microsoft/deberta-v3-base\"\nconfig = AutoConfig.from_pretrained(model_path)\nmodel = AutoModel.from_pretrained(model_path, config=config)\n\n\n# selecting parameters, which requires gradients\nmodel_parameters = filter(lambda parameter: parameter.requires_grad, model.parameters())\n\n# initializing optimizer\noptimizer = torch.optim.AdamW(params=model_parameters, lr=2e-5, weight_decay=0.0)\nprint(f\"32-bit Optimizer:\\n\\n{optimizer}\")","metadata":{"execution":{"iopub.status.busy":"2022-06-26T12:08:06.147695Z","iopub.execute_input":"2022-06-26T12:08:06.148055Z","iopub.status.idle":"2022-06-26T12:08:09.051683Z","shell.execute_reply.started":"2022-06-26T12:08:06.148024Z","shell.execute_reply":"2022-06-26T12:08:09.050692Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

Initializing optimizer via bitsandbytes API

","metadata":{}},{"cell_type":"code","source":"!pip install -q bitsandbytes-cuda110","metadata":{"execution":{"iopub.status.busy":"2022-06-26T12:08:09.053424Z","iopub.execute_input":"2022-06-26T12:08:09.05378Z","iopub.status.idle":"2022-06-26T12:08:21.847071Z","shell.execute_reply.started":"2022-06-26T12:08:09.053744Z","shell.execute_reply":"2022-06-26T12:08:21.845866Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"def set_embedding_parameters_bits(embeddings_path, optim_bits=32):\n \"\"\"\n https://github.com/huggingface/transformers/issues/14819#issuecomment-1003427930\n \"\"\"\n \n embedding_types = (\"word\", \"position\", \"token_type\")\n for embedding_type in embedding_types:\n attr_name = f\"{embedding_type}_embeddings\"\n \n if hasattr(embeddings_path, attr_name): \n bnb.optim.GlobalOptimManager.get_instance().register_module_override(\n getattr(embeddings_path, attr_name), 'weight', {'optim_bits': optim_bits}\n )","metadata":{"execution":{"iopub.status.busy":"2022-06-26T12:08:21.850177Z","iopub.execute_input":"2022-06-26T12:08:21.850762Z","iopub.status.idle":"2022-06-26T12:08:21.857163Z","shell.execute_reply.started":"2022-06-26T12:08:21.850717Z","shell.execute_reply":"2022-06-26T12:08:21.856358Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"import bitsandbytes as bnb\n\n\n# selecting parameters, which requires gradients\nmodel_parameters = filter(lambda parameter: parameter.requires_grad, model.parameters())\n\n# initializing optimizer \nbnb_optimizer = bnb.optim.AdamW(params=model_parameters, lr=2e-5, weight_decay=0.0, optim_bits=8)\n# bnb_optimizer = bnb.optim.AdamW8bit(params=model_parameters, lr=2e-5, weight_decay=0.0) # equivalent to the above line\n\n# setting embeddings parameters\nset_embedding_parameters_bits(embeddings_path=model.embeddings)\n\nprint(f\"8-bit 
Optimizer:\\n\\n{bnb_optimizer}\")","metadata":{"execution":{"iopub.status.busy":"2022-06-26T12:08:21.858534Z","iopub.execute_input":"2022-06-26T12:08:21.859056Z","iopub.status.idle":"2022-06-26T12:08:22.113776Z","shell.execute_reply.started":"2022-06-26T12:08:21.859021Z","shell.execute_reply":"2022-06-26T12:08:22.112963Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

Gradient Checkpointing

\n\n

\nSometimes, even with a small batch size and other optimization techniques (e.g. Gradient Accumulation, Freezing, or Automatic Mixed Precision), we can still run out of memory, especially when the model is large. One powerful solution to this problem is Gradient Checkpointing, which was first introduced in the \"Training Deep Nets With Sublinear Memory Cost\" paper in 2016. The authors demonstrated that Gradient Checkpointing can reduce the memory needed for intermediate activations from $ O(n) $ to $ O(\\sqrt{n}) $, where $ n $ is the number of layers in the model, at the cost of extra computation. This approach makes it possible to train large models on a single GPU, or frees memory to increase the batch size for better and faster convergence.\n\n

","metadata":{}},{"cell_type":"markdown","source":"
\n

Number of blocks (layers) versus memory utilization in megabytes

","metadata":{}},{"cell_type":"markdown","source":"

\nThe idea behind Gradient Checkpointing is to keep only a subset of the intermediate activations (the checkpoints) in memory during the forward pass, and to recompute the discarded activations segment by segment during the backward pass. This reduces memory usage, although it requires extra compute steps to rebuild the parts of the graph needed for backpropagation.\n\n
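The square-root memory figure can be checked with simple arithmetic. The sketch below uses a simplified cost model that is an assumption of this example, not a measurement: every stored activation costs one unit, the n layers are split into k equal segments, one checkpoint is kept per segment, and only one segment is recomputed at a time during the backward pass. The function name peak_activations is hypothetical.

```python
import math


def peak_activations(n_layers: int, n_segments: int) -> int:
    # Simplified cost model: one checkpoint per segment, plus the
    # activations of the single segment currently being recomputed.
    segment_length = math.ceil(n_layers / n_segments)
    return n_segments + segment_length


n_layers = 100

# Without checkpointing every activation is kept.
baseline = n_layers

# Try every possible number of segments and keep the cheapest one.
best_segments = min(range(1, n_layers + 1),
                    key=lambda k: peak_activations(n_layers, k))
best_peak = peak_activations(n_layers, best_segments)

print('stored activations without checkpointing:', baseline)
print('best number of segments:', best_segments)
print('peak stored activations with checkpointing:', best_peak)
```

Under this model the cheapest choice is a number of segments equal to the square root of the number of layers, which gives a peak of roughly twice that square root instead of n.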

","metadata":{}},{"cell_type":"markdown","source":"
\n\n

Demonstration of how Gradient Checkpointing works in forward and backpropagation passes

","metadata":{}},{"cell_type":"markdown","source":"

\nPyTorch provides gradient checkpointing out of the box via the torch.utils.checkpoint.checkpoint and torch.utils.checkpoint.checkpoint_sequential functions.

\n \"Specifically, in the forward pass, function will run in torch.no_grad() manner, i.e., not storing the intermediate activations. Instead, the forward pass saves the inputs tuple and the function parameter. In the backwards pass, the saved inputs and function is retrieved, and the forward pass is computed on function again, now tracking the intermediate activations, and then the gradients are calculated using these activation values.\"

\nAdditionally, HuggingFace Transformers supports Gradient Checkpointing too. It can be enabled by calling the gradient_checkpointing_enable method of a PreTrainedModel instance.\n

\n","metadata":{}},{"cell_type":"markdown","source":"

Implementation

","metadata":{}},{"cell_type":"code","source":"from transformers import AutoConfig, AutoModel\n\n# https://github.com/huggingface/transformers/issues/9919\nfrom torch.utils.checkpoint import checkpoint\n\n\n# initializing model\nmodel_path = \"microsoft/deberta-v3-base\"\nconfig = AutoConfig.from_pretrained(model_path)\nmodel = AutoModel.from_pretrained(model_path, config=config)\n\n\n# gradient checkpointing\nmodel.gradient_checkpointing_enable()\nprint(f\"Gradient Checkpointing: {model.is_gradient_checkpointing}\")","metadata":{"execution":{"iopub.status.busy":"2022-06-26T12:08:22.115323Z","iopub.execute_input":"2022-06-26T12:08:22.115616Z","iopub.status.idle":"2022-06-26T12:08:24.938431Z","shell.execute_reply.started":"2022-06-26T12:08:22.11559Z","shell.execute_reply":"2022-06-26T12:08:24.937533Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

Fast Tokenizers

\n\n

\nHuggingFace Transformers provides two types of Tokenizers: Base and Fast. The main difference between them is that Fast Tokenizers are written in Rust, because Python is slow at the loops that tokenization requires. This simple switch gives an additional speed-up during tokenization. The tokenizer type can be selected through the HuggingFace Transformers API by setting the use_fast argument to True in the from_pretrained method of transformers.AutoTokenizer.\n

","metadata":{}},{"cell_type":"markdown","source":"
\"Tokenization
\n\n

Visualization of how Tokenization works

","metadata":{}},{"cell_type":"markdown","source":"

Implementation

","metadata":{}},{"cell_type":"code","source":"from transformers import AutoTokenizer\n\n# initializing Base version of Tokenizer\nmodel_path = \"microsoft/deberta-v3-base\"\ntokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)\nprint(f\"Base version Tokenizer:\\n\\n{tokenizer}\", end=\"\\n\"*3)\n\n# initializing Fast version of Tokenizer\nfast_tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)\nprint(f\"Fast version Tokenizer:\\n\\n{fast_tokenizer}\")","metadata":{"execution":{"iopub.status.busy":"2022-06-26T12:08:28.582801Z","iopub.execute_input":"2022-06-26T12:08:28.58314Z","iopub.status.idle":"2022-06-26T12:08:39.351669Z","shell.execute_reply.started":"2022-06-26T12:08:28.583111Z","shell.execute_reply":"2022-06-26T12:08:39.350678Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

Dynamic Padding

\n\n

\nGenerally, models are trained on batches of inputs, and every input in a batch must have the same fixed size, i.e. the batch must form a matrix. The fixed size is often chosen based on the length distribution in the dataset, the number of features, or other factors. In NLP tasks, the input size is the length of the text in tokens and is called the max length. Unfortunately, different texts have different lengths, so to handle such cases researchers proposed padding tokens and truncation. Truncation is applied when the input text is longer than the max length, so some tokens (usually the last ones) are removed. Padding tokens are special tokens that are added to the end of the input text when it is shorter than the max length. It is also worth noting that padding tokens should not be included when calculating the loss in some tasks (e.g. Masked Language Modeling or Named Entity Recognition).\n

","metadata":{}},{"cell_type":"markdown","source":"\"fixed-padding-length-1\"\n

Visualization of Fixed Padding in the batch

","metadata":{}},{"cell_type":"markdown","source":"

\n However, padding tokens have a significant drawback: they are inefficient and waste memory and compute when the input text is much shorter than the chosen max length. To avoid these extra operations, developers proposed a very effective approach: pad the inputs only to the length of the longest input in the batch. Although the change sounds minor, this approach can speed up training by 35% or even 50%! The actual speed-up and memory savings depend on the batch size and the distribution of lengths.\n\n
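The saving can be estimated by counting how many token positions the model has to process under each scheme. The following sketch is only an illustration in plain Python; the function names and the list of text lengths are hypothetical, and real savings depend on the data.

```python
def fixed_padding_tokens(lengths, max_length):
    # Every sequence is truncated or padded to the same global max_length.
    return len(lengths) * max_length


def dynamic_padding_tokens(lengths, batch_size, max_length):
    # Each batch is padded only to the longest (truncated) sequence it contains.
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = [min(length, max_length)
                 for length in lengths[start:start + batch_size]]
        total += max(batch) * len(batch)
    return total


# Hypothetical token counts for eight texts, in loading order.
lengths = [12, 480, 35, 20, 64, 18, 510, 27]

fixed = fixed_padding_tokens(lengths, max_length=512)
dynamic = dynamic_padding_tokens(lengths, batch_size=4, max_length=512)

print('token positions with fixed padding:', fixed)
print('token positions with dynamic padding:', dynamic)
```

With these particular lengths the saving is small, because each batch happens to contain one very long text; batches made of short texts benefit far more.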

","metadata":{}},{"cell_type":"markdown","source":"
\"dynamic-padding\"
\n\n

Visualization of how Dynamic Padding works in the batch

","metadata":{}},{"cell_type":"markdown","source":"

Uniform Dynamic Padding

\n\n

\nThere is one additional approach based on Dynamic Padding, called Uniform Dynamic Padding. The idea is to sort texts by length beforehand and not shuffle samples during training or inference, so that each batch contains texts of similar length. This approach is very effective and needs fewer computations than Dynamic Padding. However, Uniform Dynamic Padding is not recommended during training, since training usually relies on shuffling the inputs.\n\n
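The effect of sorting can be sketched in plain Python as well. The helper name padded_tokens and the example lengths are assumptions of this illustration. The sketch also keeps the sorting order, since at inference time the predictions have to be mapped back to the original order of the texts.

```python
def padded_tokens(lengths, batch_size):
    # Pad each batch to its own longest sequence and count all token positions.
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Hypothetical token counts for eight texts, in loading order.
lengths = [12, 480, 35, 20, 64, 18, 510, 27]

# Dynamic Padding: batches are built in the original order.
dynamic = padded_tokens(lengths, batch_size=4)

# Uniform Dynamic Padding: sort by length first, then build batches.
order = sorted(range(len(lengths)), key=lambda i: lengths[i])
sorted_lengths = [lengths[i] for i in order]
uniform = padded_tokens(sorted_lengths, batch_size=4)

print('token positions with dynamic padding:', dynamic)
print('token positions with uniform dynamic padding:', uniform)
print('original index of each sorted text:', order)
```

Sorting groups the short texts together, so only the batch that holds the long texts pays for long padding.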

","metadata":{}},{"cell_type":"markdown","source":"\"uniform-length-batching\"\n

Visualization of how Uniform Dynamic Padding works in the batch

","metadata":{}},{"cell_type":"markdown","source":"

Conclusion

\n\n

\nOptimization is a necessary step in developing models, even on modern GPUs. For this reason, this article went through the most powerful and popular approaches for speeding up training and reducing the memory consumption of large models such as Transformers. \n

","metadata":{}},{"cell_type":"markdown","source":"\n

References

\n\n

The following links were used while writing this article:

\n","metadata":{}},{"cell_type":"markdown","source":"\n

Releases

\n
    \n
  • 26.06.2022 - updated \"Training loop with Gradient Accumulation\", added a validation loop, and changed some symbols.
  • \n
  • 18.08.2022 - added Chinese translation provided by @zachary666
  • \n
","metadata":{}}]} -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==1.4.0 2 | asttokens==2.2.1 3 | backcall==0.2.0 4 | bleach==6.0.0 5 | cachetools==5.3.1 6 | certifi==2022.12.7 7 | charset-normalizer==2.1.1 8 | cmake==3.25.0 9 | comm==0.1.3 10 | contourpy==1.1.0 11 | cycler==0.11.0 12 | debugpy==1.6.7 13 | decorator==5.1.1 14 | executing==1.2.0 15 | filelock==3.9.0 16 | fonttools==4.41.0 17 | fsspec==2023.6.0 18 | google-auth==2.22.0 19 | google-auth-oauthlib==1.0.0 20 | grpcio==1.56.0 21 | huggingface-hub==0.16.4 22 | idna==3.4 23 | ipykernel==6.24.0 24 | ipython==8.14.0 25 | ipywidgets==8.0.7 26 | jedi==0.18.2 27 | Jinja2==3.1.2 28 | joblib==1.3.1 29 | jupyter_client==8.3.0 30 | jupyter_core==5.3.1 31 | jupyterlab-widgets==3.0.8 32 | kaggle==1.5.15 33 | kiwisolver==1.4.4 34 | lightning-utilities==0.9.0 35 | lit==15.0.7 36 | Markdown==3.4.3 37 | MarkupSafe==2.1.2 38 | matplotlib==3.7.2 39 | matplotlib-inline==0.1.6 40 | mpmath==1.2.1 41 | mypy-extensions==1.0.0 42 | nest-asyncio==1.5.6 43 | networkx==3.0 44 | numpy==1.24.1 45 | oauthlib==3.2.2 46 | packaging==23.1 47 | pandas==2.0.3 48 | parso==0.8.3 49 | pexpect==4.8.0 50 | pickleshare==0.7.5 51 | Pillow==9.3.0 52 | platformdirs==3.9.1 53 | prompt-toolkit==3.0.39 54 | protobuf==4.23.4 55 | psutil==5.9.5 56 | ptyprocess==0.7.0 57 | pure-eval==0.2.2 58 | pyasn1==0.5.0 59 | pyasn1-modules==0.3.0 60 | Pygments==2.15.1 61 | pyparsing==3.0.9 62 | pyre-extensions==0.0.30 63 | python-dateutil==2.8.2 64 | python-slugify==8.0.1 65 | pytz==2023.3 66 | PyYAML==6.0 67 | pyzmq==25.1.0 68 | regex==2023.6.3 69 | requests==2.28.1 70 | requests-oauthlib==1.3.1 71 | rsa==4.9 72 | safetensors==0.3.1 73 | scikit-learn==1.3.0 74 | scipy==1.11.1 75 | six==1.16.0 76 | stack-data==0.6.2 77 | sympy==1.11.1 78 | tensorboard==2.13.0 79 | tensorboard-data-server==0.7.1 80 | text-unidecode==1.3 81 | 
threadpoolctl==3.2.0 82 | tokenizers==0.13.3 83 | torch==2.0.1+cu118 84 | torch-lr-finder==0.2.1 85 | torchaudio==2.0.2+cu118 86 | torcheval==0.0.6 87 | torchmetrics==1.0.1 88 | torchtnt==0.1.0 89 | torchvision==0.15.2+cu118 90 | tornado==6.3.2 91 | tqdm==4.65.0 92 | traitlets==5.9.0 93 | transformers==4.30.2 94 | triton==2.0.0 95 | typing-inspect==0.9.0 96 | typing_extensions==4.4.0 97 | tzdata==2023.3 98 | urllib3==1.26.13 99 | wcwidth==0.2.6 100 | webencodings==0.5.1 101 | Werkzeug==2.3.6 102 | widgetsnbextension==4.0.8 103 | -------------------------------------------------------------------------------- /scripts/train_model.py: -------------------------------------------------------------------------------- 1 | # Libraries 2 | from pathlib import Path 3 | import gc 4 | import argparse 5 | from transformers import AutoTokenizer 6 | import torch 7 | 8 | # Append Path to Custom Modules if Needed 9 | # sys.path.append('./') 10 | 11 | # Custom Modules 12 | from src.utils import (seed_everything, 13 | load_cfg, 14 | RunIDs, 15 | debugger_is_active, 16 | plot_perf_metric_to_disk) 17 | from src.dataloading.load_data import LoadData 18 | from src.dataloading.stratify import StratifyData 19 | from src.dataloading.preprocess import PreprocessData 20 | from src.dataloading.load_datasets import (CustomTextCollator, 21 | get_ds_dl, 22 | ) 23 | from src.training.single_fold import train_fold 24 | 25 | # Seed Everything 26 | SEED = 42 27 | seed_everything(seed=SEED) 28 | 29 | # Get Device type for processing 30 | DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 31 | 32 | 33 | def workflow(): 34 | """ 35 | The workflow for training a PyTorch model 36 | """ 37 | 38 | # Load Data from Disk 39 | load_data_file = LoadData(base_dir=CFG.paths.data.base_dir) 40 | if CFG.debug: 41 | data = load_data_file.load(filename=CFG.paths.data.debug_data) 42 | else: 43 | data = load_data_file.load(filename=CFG.paths.data.data) 44 | 45 | # Stratify the Data 46 | data = 
(StratifyData(technique=CFG.stratify.technique, 47 | n_folds=CFG.cv.num_folds, 48 | target=CFG.data_info.target) 49 | .stratify(df=data)) 50 | 51 | # Train a model for each validation fold 52 | for fold_num in CFG.cv.val_folds: 53 | # fold_num = CFG.cv.val_folds[0] 54 | 55 | print((f''' 56 | # ==================================================== 57 | # Starting Training for FOLD: {fold_num} 58 | # ==================================================== 59 | ''')) 60 | 61 | # Split Data into Training and Validation 62 | df_train = data.copy()[data.fold != fold_num].reset_index(drop=True) 63 | df_val = data.copy()[data.fold == fold_num].reset_index(drop=True) 64 | print(f'Train Number of Instances: {len(df_train):,}') 65 | print(f'Validation Number of Instances: {len(df_val):,}') 66 | 67 | # Preprocessing Encoders 68 | encoders = {} 69 | for technique in CFG.preprocessing.apply_techniques: 70 | fields = getattr(CFG.preprocessing, technique).fields 71 | for col in fields: 72 | enc = PreprocessData(y=df_train[col].values, 73 | technique=technique) 74 | encoders[col] = {'encoder': enc.encoder, 75 | 'technique': technique} 76 | 77 | # Path to the model and tokenizer model card saved on disk 78 | model_path = Path(CFG.model_tokenizer.base_dir) / CFG.model_tokenizer.name 79 | 80 | # Load the tokenizer 81 | tokenizer = AutoTokenizer.from_pretrained(model_path, do_lower=True) 82 | 83 | # Collator 84 | collator = CustomTextCollator(tokenizer=tokenizer, 85 | tokenizer_cfg=CFG.tokenizer) 86 | 87 | # Train Dataset and Dataloader 88 | (_, 89 | train_dataloader) = get_ds_dl(df=df_train, 90 | cfg=CFG, 91 | tokenizer=tokenizer, 92 | encoder=encoders[CFG.data_info.target]['encoder'], 93 | collator=collator) 94 | # Validation Dataset and Dataloader 95 | (_, 96 | val_dataloader) = get_ds_dl(df=df_val, 97 | cfg=CFG, 98 | tokenizer=tokenizer, 99 | encoder=encoders[CFG.data_info.target]['encoder'], 100 | collator=collator) 101 | 102 | print(f'# of Training Samples: {len(df_train):,}') 103 | 
print(f'# of Validation Samples: {len(df_val):,}') 104 | print(f'Batch Size: {CFG.batch_size}') 105 | print(f'{len(df_train):,} / {CFG.batch_size:,} = {len(train_dataloader):,}') 106 | print(f'Train DataLoader # of Iters: {len(train_dataloader):,}') 107 | print(f'Val. DataLoader # of Iters: {len(val_dataloader):,}') 108 | 109 | # Path to save model results 110 | model_save_path = getattr(run_ids.folds_id, f'fold{fold_num}').path 111 | 112 | # Training for a single fold 113 | perf_metrics = train_fold(train_dl=train_dataloader, 114 | val_dl=val_dataloader, 115 | cfg=CFG, 116 | device=DEVICE, 117 | n_classes=df_train[CFG.data_info.target].nunique(), 118 | model_save_path=model_save_path) 119 | 120 | # Save plots of performance metrics to disk for visual assessment 121 | if CFG.paths.save_results.apply_metric: 122 | epochs = perf_metrics['epoch'] 123 | for metric_name in ['loss', 'f1', 'precision', 'recall']: 124 | save_path = model_save_path / f'{metric_name}.png' 125 | plot_perf_metric_to_disk(save_path=save_path, 126 | x=epochs, 127 | y_train=perf_metrics[f'train_{metric_name}'], 128 | y_val=perf_metrics[f'val_{metric_name}'], 129 | metric_name=metric_name) 130 | print(f'Completed Training Fold {fold_num}\n') 131 | 132 | # Clean up 133 | del (tokenizer, collator, train_dataloader, val_dataloader, 134 | model_save_path, perf_metrics, encoders, df_train, df_val) 135 | _ = gc.collect() 136 | return 137 | 138 | 139 | if __name__ == '__main__': 140 | 141 | # Determine if running in debug mode 142 | # If in debug manually point to CFG file 143 | is_debugger = debugger_is_active() 144 | 145 | # Construct the argument parser and parse the arguments 146 | if is_debugger: 147 | args = argparse.Namespace() 148 | args.dir = './cfgs' 149 | args.name = 'train-1-debug.yaml' 150 | else: 151 | arg_desc = '''This program points to input parameters for model training''' 152 | parser = argparse.ArgumentParser(formatter_class = argparse.RawDescriptionHelpFormatter, 153 | description= 
arg_desc) 154 | parser.add_argument("-cfg_dir", 155 | "--dir", 156 | required=True, 157 | help = "Base Dir. for the YAML config. file") 158 | parser.add_argument("-cfg_filename", 159 | "--name", 160 | required=True, 161 | help="File name of YAML config. file") 162 | args = parser.parse_args() 163 | print(args) 164 | 165 | # Load the configuration file 166 | CFG = load_cfg(base_dir=Path(args.dir), 167 | filename=args.name) 168 | 169 | # Create directories for saving results and use unique Group ID 170 | run_ids = RunIDs(test_folds=CFG.cv.val_folds, 171 | num_folds=CFG.cv.num_folds, 172 | save_dir=CFG.paths.save_results.base_dir, 173 | save_results=CFG.paths.save_results.apply_metric) 174 | run_ids.generate_run_ids() 175 | 176 | # Start the training workflow 177 | workflow() 178 | 179 | print('PYTHON SCRIPT COMPLETED - END') 180 | -------------------------------------------------------------------------------- /src/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mddunlap924/PyTorch-LLM/5c3fcfd608715063ce259a713ec05acbd95cfe31/src/__init__.py -------------------------------------------------------------------------------- /src/dataloading/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mddunlap924/PyTorch-LLM/5c3fcfd608715063ce259a713ec05acbd95cfe31/src/dataloading/__init__.py -------------------------------------------------------------------------------- /src/dataloading/load_data.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | import pandas as pd 3 | 4 | 5 | # Load Data 6 | class LoadData: 7 | """ 8 | Load CSV Data Files 9 | (Expand this class to other datasets suitable for your needs) 10 | """ 11 | 12 | def __init__(self, base_dir: str): 13 | """ 14 | :param base_dir: Directory data files are stored 15 | """ 16 | self.base_dir = 
Path(base_dir) 17 | 18 | 19 | def load(self, filename: str) -> pd.DataFrame: 20 | """ 21 | Pandas Read CSV filename 22 | :param filename: Name of File to Load 23 | :return: Data returned as a Pandas DataFrame 24 | """ 25 | return pd.read_csv(self.base_dir / filename, 26 | low_memory=False) 27 | -------------------------------------------------------------------------------- /src/dataloading/load_datasets.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | import pandas as pd 3 | from torch.utils.data import Dataset 4 | import torch 5 | import numpy as np 6 | from torch.utils.data import DataLoader 7 | 8 | 9 | class CustomTextCollator: 10 | """ 11 | Data Collator used for a classification task. 12 | 13 | It uses a given tokenizer and label encoder to convert any text and labels to numbers that 14 | can go straight into a GPT2 model. 15 | 16 | This class is built with reusability in mind: it can be used as is as long 17 | as the `dataloader` outputs a batch in dictionary format that can be passed 18 | straight into the model - `model(**batch)`. 19 | 20 | Arguments: 21 | 22 | use_tokenizer (:obj:`transformers.tokenization_?`): 23 | Transformer type tokenizer used to process raw text into numbers. 24 | 25 | labels_ids (:obj:`dict`): 26 | Dictionary to encode any labels names into numbers. Keys map to 27 | labels names and Values map to number associated to those labels. 28 | 29 | max_sequence_len (:obj:`int`, `optional`) 30 | Value to indicate the maximum desired sequence to truncate or pad text 31 | sequences. If no value is passed it will used maximum sequence size 32 | supported by the tokenizer and model. 33 | 34 | """ 35 | 36 | def __init__(self, tokenizer, tokenizer_cfg): 37 | 38 | # Tokenizer to be used inside the class. 39 | self.tokenizer = tokenizer 40 | 41 | # Tokenizer configuration 42 | self.tok_cfg = tokenizer_cfg 43 | 44 | # Check max sequence length. 
45 | self.max_sequence_len = tokenizer_cfg.max_length 46 | return 47 | 48 | 49 | def __call__(self, sequences): 50 | """ 51 | This function allows the class objects to be used as a function call. 52 | Since the PyTorch DataLoader needs a collator function, this 53 | class can be used as a function. 54 | 55 | Arguments: 56 | 57 | item (:obj:`list`): 58 | List of texts and labels. 59 | 60 | Returns: 61 | :obj:`Dict[str, object]`: Dictionary of inputs that feed into the model. 62 | It holds the statement `model(**Returned Dictionary)`. 63 | """ 64 | 65 | # Get all texts from sequences list. 66 | texts = [sequence['text'] for sequence in sequences] 67 | # Get all labels from sequences list. 68 | labels = [sequence['label'] for sequence in sequences] 69 | 70 | # Call tokenizer on all texts to convert into tensors of numbers with 71 | # appropriate padding. 72 | # https://huggingface.co/docs/transformers/pad_truncation 73 | inputs = self.tokenizer(text=texts, 74 | return_tensors=self.tok_cfg.return_tensors, 75 | padding=self.tok_cfg.padding, 76 | truncation=self.tok_cfg.truncation, 77 | max_length=self.max_sequence_len, 78 | add_special_tokens=self.tok_cfg.add_special_tokens, 79 | ) 80 | # Update the inputs with the associated encoded labels as tensor. 
81 | inputs.update({'labels': torch.tensor(labels, dtype=torch.long)}) 82 | return inputs 83 | 84 | 85 | class TrainDataset(Dataset): 86 | def __init__(self, 87 | df: pd.DataFrame, 88 | tok, 89 | tok_cfg, 90 | X_cols: list[str], 91 | label: str, 92 | encoder): 93 | self.df = df 94 | self.tokenizer = tok 95 | self.tokenizer_cfg = tok_cfg 96 | self.X_cols = X_cols 97 | self.label = label 98 | self.encoder = encoder 99 | 100 | 101 | def __len__(self): 102 | return len(self.df) 103 | 104 | 105 | def __getitem__(self, idx): 106 | # Extract all source fields into a list 107 | text = [] 108 | for col in self.X_cols: 109 | if col == 'ZIP code': 110 | feature = f'Zip code {self.df[col].iloc[idx]}' 111 | elif col == 'Sub-issue': 112 | feature = f'{self.df[col].iloc[idx]}' 113 | elif col == 'Consumer complaint narrative': 114 | feature = self.df[col].iloc[idx] 115 | text.append(feature) 116 | 117 | # Combine the fields using special SEP token 118 | text = '[SEP]'.join(text) 119 | # Extract all source fields into a list 120 | # text = self.df['Consumer complaint narrative'].iloc[idx] 121 | 122 | # Convert text labels into labels (e.g., if 18 classes then labels are 0-17) 123 | label_text = self.df[self.label].iloc[idx] 124 | label = self.encoder.transform([label_text])[0] 125 | return {'text': text, 'label': label} 126 | 127 | 128 | class TestDataset(Dataset): 129 | def __init__(self, df, tokenizer, tokenizer_cfg): 130 | self.tokenizer = tokenizer 131 | self.tokenizer_cfg = tokenizer_cfg 132 | self.texts = df['full_text'].values 133 | 134 | def __len__(self): 135 | return len(self.texts) 136 | 137 | def __getitem__(self, item): 138 | inputs = prepare_input(tokenizer=self.tokenizer, 139 | cfg=self.tokenizer_cfg, 140 | text=self.texts[item]) 141 | input_ids = torch.tensor(inputs['input_ids'], dtype=torch.float) 142 | return {'input_ids': input_ids} 143 | 144 | 145 | def get_ds_dl(df, 146 | cfg, 147 | tokenizer, 148 | encoder, 149 | collator): 150 | "Get the PyTorch Dataset (ds) 
and Dataloader (dl)" 151 | # Dataset 152 | ds = TrainDataset(df=df, 153 | tok=tokenizer, 154 | tok_cfg=cfg.tokenizer, 155 | X_cols=cfg.data_info.source_fields, 156 | label=cfg.data_info.target, 157 | encoder=encoder) 158 | 159 | # Dataloader 160 | dl = DataLoader(ds, 161 | batch_size=cfg.batch_size, 162 | collate_fn=collator, 163 | shuffle=True, 164 | num_workers=cfg.num_workers, 165 | pin_memory=True, 166 | ) 167 | return ds, dl 168 | -------------------------------------------------------------------------------- /src/dataloading/preprocess.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.preprocessing import (LabelEncoder, 3 | OneHotEncoder) 4 | 5 | 6 | class PreprocessData(): 7 | def __init__(self, y: np.array, 8 | technique: str): 9 | # Encoding technique 10 | if technique == 'LabelEncoder': 11 | enc = LabelEncoder() 12 | # Fit the encoder 13 | enc.fit(y) 14 | elif technique == 'OneHotEncoder': 15 | enc = OneHotEncoder(sparse_output=False) 16 | # Fit the encoder 17 | y = np.array([[i] for i in np.unique(y).tolist()]) 18 | enc.fit(y) 19 | else: 20 | raise ValueError((f'Encoder needs to be added ' 21 | f'to script: {technique}')) 22 | # Encoder 23 | self.encoder = enc 24 | -------------------------------------------------------------------------------- /src/dataloading/stratify.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from typing import List 3 | import socket 4 | from sklearn.model_selection import StratifiedKFold 5 | 6 | 7 | class StratifyData(): 8 | def __init__(self, 9 | technique: str, 10 | n_folds: int, 11 | target: str, 12 | *, 13 | shuffle: bool=False, 14 | seed: int=42): 15 | # Set parameters 16 | self.technique = technique 17 | self.n_folds = n_folds 18 | self.target = target 19 | self.shuffle = shuffle 20 | self.seed = seed 21 | 22 | 23 | def __stratified_kfold(self, df: pd.DataFrame) -> pd.DataFrame: 24 | 
if self.shuffle: 25 | skf = StratifiedKFold(n_splits=self.n_folds, 26 | shuffle=self.shuffle, 27 | random_state=self.seed) 28 | else: 29 | skf = StratifiedKFold(n_splits=self.n_folds) 30 | for n, (_, val_idx) in enumerate(skf.split(df, df[self.target])): 31 | df.loc[val_idx, 'fold'] = int(n + 1) 32 | df['fold'] = df['fold'].astype(int) 33 | return df 34 | 35 | 36 | def stratify(self, df: pd.DataFrame) -> pd.DataFrame: 37 | """ 38 | Stratify the dataframe 39 | """ 40 | # Select the stratification technique 41 | stratify_tech = getattr(self, f'_{self.__class__.__name__}__{self.technique}') 42 | return stratify_tech(df=df) 43 | -------------------------------------------------------------------------------- /src/models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mddunlap924/PyTorch-LLM/5c3fcfd608715063ce259a713ec05acbd95cfe31/src/models/__init__.py -------------------------------------------------------------------------------- /src/models/llm_multiclass.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from transformers import AutoModel, AutoConfig, BertModel 4 | 5 | 6 | class MeanPooling(nn.Module): 7 | def __init__(self): 8 | super(MeanPooling, self).__init__() 9 | 10 | def forward(self, last_hidden_state, attention_mask): 11 | input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float() 12 | sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1) 13 | sum_mask = input_mask_expanded.sum(1) 14 | sum_mask = torch.clamp(sum_mask, min=1e-9) 15 | mean_embeddings = sum_embeddings / sum_mask 16 | return mean_embeddings 17 | 18 | 19 | class CustomModel(nn.Module): 20 | def __init__(self, 21 | llm_model_path, 22 | cfg, 23 | num_classes): 24 | super(CustomModel, self).__init__() 25 | # Path to model flat files 26 | self.llm_model_path = llm_model_path 27 | 28 | # 
Custom config information for the model specified in YAML file 29 | self.cfg = cfg 30 | 31 | # Number of classes / labels 32 | self.num_classes = num_classes 33 | 34 | # HF AutoConfig 35 | self.llm_model_config = AutoConfig.from_pretrained(llm_model_path) 36 | 37 | # HF AutoModel 38 | self.llm_model = AutoModel.from_pretrained(llm_model_path, 39 | config=self.llm_model_config) 40 | 41 | # Freeze Layers if Specified 42 | # https://discuss.huggingface.co/t/how-to-freeze-some-layers-of-bertmodel/917/4 43 | if cfg.freeze.apply: 44 | # Use modules to specify order 45 | modules = [self.llm_model.embeddings, 46 | self.llm_model.encoder.layer[:cfg.freeze.num_layers]] 47 | for module in modules: 48 | for param in module.parameters(): 49 | param.requires_grad = False 50 | 51 | # Gradient checkpointing [TODO check this] 52 | if self.cfg.gradient_checkpointing: 53 | self.llm_model.gradient_checkpointing_enable() 54 | 55 | # Mean Pooling 56 | self.pool = MeanPooling() 57 | 58 | # Dense layer for classification and weight initialization 59 | self.fc = nn.Linear(self.llm_model_config.hidden_size, num_classes) 60 | self._init_weights(self.fc) 61 | 62 | 63 | def _init_weights(self, module): 64 | "Initialize weights for classification weights for dense layer" 65 | if isinstance(module, nn.Linear): 66 | module.weight.data.normal_(mean=0.0, 67 | std=self.llm_model_config.initializer_range) 68 | if module.bias is not None: 69 | module.bias.data.zero_() 70 | elif isinstance(module, nn.Embedding): 71 | module.weight.data.normal_(mean=0.0, 72 | std=self.llm_model_config.initializer_range) 73 | if module.padding_idx is not None: 74 | module.weight.data[module.padding_idx].zero_() 75 | elif isinstance(module, nn.LayerNorm): 76 | module.bias.data.zero_() 77 | module.weight.data.fill_(1.0) 78 | 79 | 80 | def forward(self, inputs): 81 | # Outputs from model 82 | llm_outputs = self.llm_model(**inputs) 83 | # Apply custom pooling 84 | if self.cfg.mean_pooling.apply: 85 | feature = 
self.pool(last_hidden_state=llm_outputs.last_hidden_state, 86 | attention_mask=inputs['attention_mask']) 87 | else: 88 | # CLS default pooling (requires a backbone that returns pooler_output) 89 | feature = llm_outputs.pooler_output 90 | # Dense layer 91 | logits = self.fc(feature) 92 | return logits 93 | -------------------------------------------------------------------------------- /src/training/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mddunlap924/PyTorch-LLM/5c3fcfd608715063ce259a713ec05acbd95cfe31/src/training/__init__.py -------------------------------------------------------------------------------- /src/training/losses.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class RMSELoss(nn.Module): 6 | def __init__(self, reduction='mean', eps=1e-9): 7 | super().__init__() 8 | self.mse = nn.MSELoss(reduction='none') 9 | self.reduction = reduction 10 | self.eps = eps 11 | 12 | def forward(self, y_pred, y_true): 13 | loss = torch.sqrt(self.mse(y_pred, y_true) + self.eps) 14 | if self.reduction == 'none': 15 | pass 16 | elif self.reduction == 'sum': 17 | loss = loss.sum() 18 | elif self.reduction == 'mean': 19 | loss = loss.mean() 20 | return loss 21 | -------------------------------------------------------------------------------- /src/training/metrics.py: -------------------------------------------------------------------------------- 1 | import math 2 | import time 3 | from sklearn.metrics import mean_squared_error 4 | import numpy as np 5 | 6 | def MCRMSE(y_trues, y_preds): 7 | scores = [] 8 | idxes = y_trues.shape[1] 9 | for i in range(idxes): 10 | y_true = y_trues[:, i] 11 | y_pred = y_preds[:, i] 12 | score = np.sqrt(mean_squared_error(y_true, y_pred)) # RMSE 13 | scores.append(score) 14 | mcrmse_score =
np.mean(scores) 15 | return mcrmse_score, scores 16 | 17 | 18 | def get_score(y_trues, y_preds): 19 | mcrmse_score, scores = MCRMSE(y_trues, y_preds) 20 | return mcrmse_score, scores 21 | 22 | 23 | class AverageMeter(object): 24 | """Computes and stores the average and current value""" 25 | def __init__(self): 26 | self.reset() 27 | 28 | def reset(self): 29 | self.val = 0 30 | self.avg = 0 31 | self.sum = 0 32 | self.count = 0 33 | 34 | def update(self, val, n=1): 35 | self.val = val 36 | self.sum += val * n 37 | self.count += n 38 | self.avg = self.sum / self.count 39 | 40 | 41 | def asMinutes(s): 42 | m = math.floor(s / 60) 43 | s -= m * 60 44 | return '%dm %ds' % (m, s) 45 | 46 | 47 | def timeSince(since, percent): 48 | now = time.time() 49 | s = now - since 50 | es = s / (percent) 51 | rs = es - s 52 | return '%s (remain %s)' % (asMinutes(s), asMinutes(rs)) 53 | -------------------------------------------------------------------------------- /src/training/optimizers.py: -------------------------------------------------------------------------------- 1 | from torch.optim import AdamW 2 | 3 | 4 | 5 | def get_optimizer_params(model, encoder_lr, decoder_lr, weight_decay=0.0): 6 | """ 7 | Optimizer parameters by encoder and decoder 8 | """ 9 | param_optimizer = list(model.named_parameters()) 10 | no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"] 11 | optimizer_parameters = [ 12 | {'params': [p for n, p in model.llm_model.named_parameters() if not any(nd in n for nd in no_decay)], 13 | 'lr': encoder_lr, 'weight_decay': weight_decay}, 14 | {'params': [p for n, p in model.llm_model.named_parameters() if any(nd in n for nd in no_decay)], 15 | 'lr': encoder_lr, 'weight_decay': 0.0}, 16 | {'params': [p for n, p in model.named_parameters() if "model" not in n], 17 | 'lr': decoder_lr, 'weight_decay': 0.0} 18 | ] 19 | return optimizer_parameters 20 | 21 | 22 | def get_optimizer(cfg, model): 23 | """ Select the optimizer """ 24 | if cfg.name == 'AdamW': 25 | opt = 
AdamW(filter(lambda p: p.requires_grad, model.parameters()), 26 | lr=cfg.lr.max, 27 | ) 28 | else: 29 | raise ValueError(f'Optimizer needs to be included in code: {cfg.name}') 30 | return opt 31 | -------------------------------------------------------------------------------- /src/training/single_fold.py: -------------------------------------------------------------------------------- 1 | 2 | import time 3 | from pathlib import Path 4 | from tqdm import tqdm 5 | import pickle 6 | import numpy as np 7 | from sklearn.utils.class_weight import compute_class_weight 8 | import torch 9 | from torch import nn 10 | from torch.nn.functional import one_hot 11 | from torch.optim.lr_scheduler import (CosineAnnealingLR, 12 | OneCycleLR) 13 | from torcheval.metrics.functional import (multiclass_f1_score, 14 | multiclass_precision, 15 | multiclass_recall) 16 | from torchmetrics.classification import (MulticlassF1Score, 17 | MulticlassPrecision, 18 | MulticlassRecall) 19 | from src.training.metrics import AverageMeter 20 | from src.models.llm_multiclass import CustomModel 21 | from src.training.optimizers import get_optimizer 22 | 23 | 24 | def train_fold(train_dl, 25 | val_dl, 26 | cfg, 27 | device, 28 | n_classes, 29 | model_save_path): 30 | """ 31 | Train a model on a single fold of data 32 | """ 33 | # Model path 34 | model_path = Path(cfg.model_tokenizer.base_dir) / cfg.model_tokenizer.name 35 | 36 | # Load custom model 37 | model = CustomModel(llm_model_path=model_path, 38 | cfg=cfg.model, 39 | num_classes=n_classes) 40 | # Set model on device 41 | model.to(device) 42 | 43 | # Optimizer 44 | optimizer = get_optimizer(cfg=cfg.optimizer, 45 | model=model) 46 | 47 | # Total number of steps/iterations 48 | total_steps = cfg.epochs * len(train_dl) 49 | 50 | # Learning Rate Scheduler 51 | scheduler = CosineAnnealingLR(optimizer, 52 | T_max=total_steps, 53 | eta_min=(cfg 54 | .lr_scheduler 55 | .CosineAnnealingLR.eta_min)) 56 | 57 | # Batch size 58 | batch_size = cfg.batch_size 59 | 60 | # #Compute the
class weights 61 | # train_labels = encoders['Product']['encoder'].transform(df_train['Product']) 62 | # class_wts = compute_class_weight('balanced', 63 | # classes=np.unique(train_labels), 64 | # y=train_labels) 65 | 66 | # # convert class weights to tensor 67 | # weights= torch.tensor(class_wts, dtype=torch.float) 68 | # weights = weights.to(DEVICE) 69 | # #the weights can be provided to CrossEntropyLoss to help with imbalance 70 | 71 | # Loss Function 72 | loss_fn = nn.CrossEntropyLoss() 73 | 74 | # Performance metrics 75 | f1 = MulticlassF1Score(num_classes=n_classes).to(device) 76 | precision = MulticlassPrecision(num_classes=n_classes).to(device) 77 | recall = MulticlassRecall(num_classes=n_classes).to(device) 78 | 79 | # ==================================================== 80 | # Model Training 81 | # ==================================================== 82 | start_training_time = time.time() 83 | step_count = 0 84 | 85 | # Set a poor best score when starting 86 | if cfg.eval_metric.name == 'loss': 87 | best_score = 1.0E6 88 | else: 89 | best_score = 0.0 90 | metrics = {'epoch': [], 91 | 'train_loss': [], 92 | 'train_f1': [], 93 | 'train_precision': [], 94 | 'train_recall': [], 95 | 'val_loss': [], 96 | 'val_f1': [], 97 | 'val_precision': [], 98 | 'val_recall': []} 99 | for epoch in range(cfg.epochs): 100 | epoch_start_time = time.time() 101 | print(f'\nStart Epoch {epoch + 1}') 102 | train_meters = {'loss': AverageMeter(), 103 | 'f1': AverageMeter(), 104 | 'precision': AverageMeter(), 105 | 'recall': AverageMeter()} 106 | model.train() 107 | 108 | # TRAINING 109 | # Iterate over each batch in an epoch 110 | # for idx, batch in enumerate(train_dataloader): [if you don't want a progress bar] 111 | with tqdm(train_dl, unit='batch') as tepoch: 112 | for idx, batch in enumerate(tepoch): 113 | tepoch.set_description(f"Epoch {epoch + 1}") 114 | X = {'input_ids': batch['input_ids'].to(device), 115 | 'attention_mask': batch['attention_mask'].to(device)} 116 | y = 
batch['labels'].to(device) 117 | 118 | # Model prediction 119 | optimizer.zero_grad() 120 | y_pred_logits = model(X) 121 | y_pred = nn.Softmax(dim=1)(y_pred_logits).argmax(1) 122 | 123 | # Calculate loss 124 | loss = loss_fn(input=y_pred_logits, target=y) 125 | 126 | # Backward pass, optimizer & scheduler steps 127 | loss.backward() 128 | optimizer.step() 129 | scheduler.step() 130 | 131 | # Performance metrics for the batch of data 132 | f1_score = f1(y_pred, y) 133 | precision_score = precision(y_pred, y) 134 | recall_score = recall(y_pred, y) 135 | 136 | # Store loss and performance metrics 137 | train_meters['loss'].update(loss.detach().cpu().numpy(), 138 | n=batch_size) 139 | train_meters['f1'].update(f1_score.detach().cpu().numpy(), 140 | n=batch_size) 141 | train_meters['precision'].update(precision_score.detach().cpu().numpy(), 142 | n=batch_size) 143 | train_meters['recall'].update(recall_score.detach().cpu().numpy(), 144 | n=batch_size) 145 | 146 | # Print at every N steps 147 | if step_count % 10 == 0: 148 | # Extract training metrics by step count 149 | train_loss = train_meters["loss"].avg 150 | train_f1 = train_meters["f1"].avg 151 | train_precision = train_meters["precision"].avg 152 | train_recall = train_meters["recall"].avg 153 | 154 | # Print Metrics to progress bar 155 | tepoch.set_postfix(loss=f'{train_loss:.4f}', 156 | f1=f'{train_f1:.3f}', 157 | precision=f'{train_precision:.3f}', 158 | recall=f'{train_recall:.3f}') 159 | step_count += 1 160 | 161 | # Print training time and performance metrics 162 | print(f'Epoch {epoch + 1} Training Time: ' 163 | f'{(time.time() - epoch_start_time) / 60:.2f} minutes') 164 | 165 | # Extract training metrics by step count 166 | train_loss = train_meters["loss"].avg 167 | train_f1 = train_meters["f1"].avg 168 | train_precision = train_meters["precision"].avg 169 | train_recall = train_meters["recall"].avg 170 | 171 | print((f'\tTraining: loss={train_loss:.4f}; ' 172 | f'f1={train_f1:.3f}; ' 173 | 
f'precision={train_precision:.3f}; ' 174 | f'recall={train_recall:.3f}')) 175 | 176 | # Reset metrics after each epoch 177 | f1.reset() 178 | precision.reset() 179 | recall.reset() 180 | 181 | # ==================================================== 182 | # Evaluate Val. Data After Epoch 183 | # ==================================================== 184 | val_meters = {'loss': AverageMeter(), 185 | 'f1': AverageMeter(), 186 | 'precision': AverageMeter(), 187 | 'recall': AverageMeter()} 188 | model.eval() 189 | with torch.no_grad(): 190 | with tqdm(val_dl, unit='batch') as tepoch: 191 | for idx, batch in enumerate(tepoch): 192 | tepoch.set_description(f"Val. at Epoch: {epoch + 1}") 193 | X = {'input_ids': batch['input_ids'].to(device), 194 | 'attention_mask': batch['attention_mask'].to(device)} 195 | y = batch['labels'].to(device) 196 | y_pred_logits = model(X) 197 | y_pred = nn.Softmax(dim=1)(y_pred_logits).argmax(1) 198 | 199 | # Calculate loss 200 | loss = loss_fn(input=y_pred_logits, target=y) 201 | 202 | # Performance metrics for the batch of data 203 | f1_score = f1(y_pred, y) 204 | precision_score = precision(y_pred, y) 205 | recall_score = recall(y_pred, y) 206 | 207 | # Store loss and performance metrics 208 | val_meters['loss'].update(loss.detach().cpu().numpy(), 209 | n=batch_size) 210 | val_meters['f1'].update(f1_score.detach().cpu().numpy(), 211 | n=batch_size) 212 | val_meters['precision'].update(precision_score.detach().cpu().numpy(), 213 | n=batch_size) 214 | val_meters['recall'].update(recall_score.detach().cpu().numpy(), 215 | n=batch_size) 216 | 217 | # Extract val. metrics by step count 218 | val_loss = val_meters["loss"].avg 219 | val_f1 = val_meters["f1"].avg 220 | val_precision = val_meters["precision"].avg 221 | val_recall = val_meters["recall"].avg 222 | 223 | # Print Val. 
Metric Performance 224 | print((f'\tVal.: loss={val_loss:.4f}; ' 225 | f'f1={val_f1:.3f}; ' 226 | f'precision={val_precision:.3f}; ' 227 | f'recall={val_recall:.3f}')) 228 | 229 | # Save best model to disk 230 | if cfg.eval_metric.name == 'loss': 231 | if val_meters[cfg.eval_metric.name].avg < best_score: 232 | best_score = val_meters[cfg.eval_metric.name].avg 233 | save_path = model_save_path / f'model_epoch{epoch + 1}.pt' 234 | if cfg.paths.save_results.apply_model: 235 | torch.save(model.state_dict(), save_path) 236 | print(f"Epoch {epoch + 1}; Saved best model at: {save_path}") 237 | else: 238 | if val_meters[cfg.eval_metric.name].avg > best_score: 239 | best_score = val_meters[cfg.eval_metric.name].avg 240 | save_path = model_save_path / f'model_epoch{epoch + 1}.pt' 241 | if cfg.paths.save_results.apply_model: 242 | torch.save(model.state_dict(), save_path) 243 | print(f"Epoch {epoch + 1}; Saved best model at: {save_path}") 244 | 245 | # Reset metrics after each epoch 246 | f1.reset() 247 | precision.reset() 248 | recall.reset() 249 | 250 | # Store epoch training metrics 251 | metrics['epoch'] += [epoch] 252 | metrics['train_loss'] += [train_loss] 253 | metrics['train_f1'] += [train_f1] 254 | metrics['train_precision'] += [train_precision] 255 | metrics['train_recall'] += [train_recall] 256 | metrics['val_loss'] += [val_loss] 257 | metrics['val_f1'] += [val_f1] 258 | metrics['val_precision'] += [val_precision] 259 | metrics['val_recall'] += [val_recall] 260 | 261 | # Total training time 262 | total_training_time = (time.time() - start_training_time) / 60 263 | print(f'Total Training Time: {total_training_time:.1f} minutes') 264 | 265 | # Save performance metrics to disk 266 | if cfg.paths.save_results.apply_metric: 267 | with open(model_save_path / 'performance_metrics.pickle', 'wb') as handle: 268 | pickle.dump(metrics, handle, protocol=pickle.HIGHEST_PROTOCOL) 269 | 270 | return metrics 271 | -------------------------------------------------------------------------------- /src/utils.py: 
-------------------------------------------------------------------------------- 1 | """ 2 | Miscellaneous helper code used for various tasks across the project. 3 | """ 4 | import os 5 | import sys 6 | from pathlib import Path 7 | from types import SimpleNamespace 8 | import uuid 9 | import random 10 | import yaml 11 | import numpy as np 12 | import torch 13 | import matplotlib.pyplot as plt 14 | 15 | 16 | class RecursiveNamespace(SimpleNamespace): 17 | """ 18 | Extending SimpleNamespace for Nested Dictionaries 19 | # https://dev.to/taqkarim/extending-simplenamespace-for-nested-dictionaries-58e8 20 | 21 | Args: 22 | SimpleNamespace (_type_): Base class is SimpleNamespace 23 | 24 | Returns: 25 | _type_: A simple class for nested dictionaries 26 | """ 27 | @staticmethod 28 | def map_entry(entry): 29 | if isinstance(entry, dict): 30 | return RecursiveNamespace(**entry) 31 | 32 | return entry 33 | 34 | def __init__(self, **kwargs): 35 | super().__init__(**kwargs) 36 | for key, val in kwargs.items(): 37 | if isinstance(val, dict): 38 | setattr(self, key, RecursiveNamespace(**val)) 39 | elif isinstance(val, list): 40 | setattr(self, key, list(map(self.map_entry, val))) 41 | 42 | 43 | def load_cfg(base_dir: Path, 44 | filename: str, 45 | *, 46 | as_namespace: bool=True) -> SimpleNamespace: 47 | """ 48 | Load a YAML configuration file saved in the "cfgs" directory 49 | Args: 50 | base_dir (Path): Directory to YAML config. 
file 51 | filename (str): Name of YAML configuration file to load 52 | Returns: 53 | SimpleNamespace: A simple class for calling configuration parameters 54 | """ 55 | cfg_path = Path(base_dir) / filename 56 | with open(cfg_path, 'r') as file: 57 | cfg_dict = yaml.safe_load(file) 58 | file.close() 59 | if as_namespace: 60 | cfg = RecursiveNamespace(**cfg_dict) 61 | else: 62 | cfg = cfg_dict 63 | return cfg 64 | 65 | 66 | def debugger_is_active() -> bool: 67 | """Return if the debugger is currently active""" 68 | return hasattr(sys, 'gettrace') and sys.gettrace() is not None 69 | 70 | 71 | def seed_everything(*, seed: int=42): 72 | """ 73 | Seed everything 74 | 75 | Args: 76 | seed (_type_): Seed 77 | """ 78 | os.environ['PYTHONHASHSEED'] = str(seed) 79 | random.seed(seed) 80 | np.random.seed(seed) 81 | torch.manual_seed(seed) 82 | if torch.cuda.is_available(): 83 | torch.cuda.manual_seed_all(seed) 84 | torch.backends.cudnn.deterministic = True 85 | 86 | class RunIDs(): 87 | def __init__(self, 88 | test_folds: list, 89 | num_folds: int, 90 | save_dir: str, 91 | save_results: bool): 92 | self.test_folds = test_folds 93 | self.num_folds = num_folds 94 | self.save_dir = Path(save_dir) 95 | self.save_results = save_results 96 | self.group_id = str 97 | self.folds_id = RecursiveNamespace 98 | 99 | # Are all folds being tested 100 | if test_folds == list(range(num_folds)): 101 | self.test_all_folds = True 102 | else: 103 | self.test_all_folds = False 104 | 105 | # Check if base directory for saving results exists 106 | if (not Path(save_dir).exists()) and save_results: 107 | raise Exception('Base Dir. for Saving Results Does NOT Exist') 108 | 109 | 110 | def generate_id(self): 111 | # Generate a random ID 112 | return str(uuid.uuid4()).split('-')[0] 113 | 114 | 115 | def generate_run_ids(self): 116 | # Get a group id (i.e. ID that will organize all folds) 117 | self.group_id = self.generate_id() 118 | 119 | # Info. 
for each data fold 120 | fold_info = {} 121 | for fold in self.test_folds: 122 | # Create directory for each fold 123 | fold_dir = self.save_dir / self.group_id / f'fold{fold}of{self.num_folds}' 124 | if self.save_results: 125 | fold_dir.mkdir(parents=True) 126 | save_path = fold_dir 127 | else: 128 | save_path = None 129 | 130 | # Store info 131 | fold_info[f'fold{fold}'] = {'run_id': self.generate_id(), 132 | 'save': self.save_results, 133 | 'path': save_path, 134 | 'fold_num': fold} 135 | self.folds_id = RecursiveNamespace(**fold_info) 136 | return 137 | 138 | 139 | # def plot_perf_metric_to_disk(save_path: str, 140 | # x: list, 141 | # y_train: list, 142 | # y_val: list, 143 | # metric_name: str) -> None: 144 | # """ 145 | # Save a Epoch vs. Performance Metric Plot to Disk 146 | 147 | # Args: 148 | # save_path (str): Full path for the saved image 149 | # x (list): Epochs 150 | # y_train (list): Performance metric values for training data 151 | # y_val (list): Performance metric values for val data 152 | # metric_name: Name of the performance metric 153 | # """ 154 | 155 | # # Create the plot 156 | # fig, ax = plt.subplots(nrows=1, ncols=1) 157 | # ax.plot(x, y_train, 'tab:blue', x, y_val, 'tab:orange') 158 | # ax.legend(['train', 'val.']) 159 | # ax.set_xlabel('Epochs') 160 | # ax.set_ylabel(f'{metric_name}') 161 | # fig.savefig(save_path) 162 | # plt.close(fig) 163 | # print((f'\tSaved Image at: {str(save_path)}')) 164 | # return 165 | -------------------------------------------------------------------------------- /src/visualization.py: -------------------------------------------------------------------------------- 1 | import seaborn as sns 2 | import matplotlib.pyplot as plt 3 | import pandas as pd 4 | 5 | 6 | def score_bar_plot(df: pd.DataFrame) -> None: 7 | """ 8 | Bar plot of scores by grading metric 9 | :param df: Input a copy of the original training data as a dataframe 10 | """ 11 | # Count scores per grading criteria 12 | 
df.drop(columns=['text_id', 'full_text'], inplace=True) 13 | df = df.apply(pd.Series.value_counts) 14 | df = df.reset_index().rename(columns={'index': 'score'}) 15 | df = pd.melt(df, id_vars=['score'], var_name='grading_metric', value_name='count') 16 | plt.figure() 17 | sns.barplot(x='score', y='count', hue='grading_metric', data=df) 18 | del df 19 | return 20 | 21 | 22 | def avg_score_hist(df: pd.DataFrame, *, 23 | bin_width: float = 0.25) -> None: 24 | """ 25 | Histogram of a student's avg grading metric scores 26 | 27 | :param df: Input a copy of the original training data as a dataframe 28 | :param bin_width: Width of bins in the histogram 29 | """ 30 | # Histogram of average grading metrics 31 | df.drop(columns=['text_id', 'full_text'], inplace=True) 32 | df['avg_score'] = df.mean(axis=1) 33 | fig = plt.figure() 34 | ax = sns.histplot(x='avg_score', data=df, binwidth=bin_width) 35 | del df, ax, fig 36 | return 37 | 38 | 39 | def plot_perf_metric_to_disk(save_path: str, 40 | x: list, 41 | y_train: list, 42 | y_val: list, 43 | metric_name: str) -> None: 44 | """ 45 | Save a Epoch vs. Performance Metric Plot to Disk 46 | 47 | Args: 48 | save_path (str): Full path for the saved image 49 | x (list): Epochs 50 | y_train (list): Performance metric values for training data 51 | y_val (list): Performance metric values for val data 52 | metric_name: Name of the performance metric 53 | """ 54 | 55 | # Create the plot 56 | fig, ax = plt.subplots(nrows=1, ncols=1) 57 | ax.plot(x, y_train, 'tab:blue', x, y_val, 'tab:orange') 58 | ax.legend(['train', 'val.']) 59 | ax.set_xlabel('Epochs') 60 | ax.set_ylabel(f'{metric_name}') 61 | fig.savefig(save_path) 62 | plt.close(fig) 63 | print((f'\tSaved Image at: {str(save_path)}')) 64 | return 65 | --------------------------------------------------------------------------------
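
`train_fold` in `src/training/single_fold.py` returns a `metrics` dictionary holding an `epoch` list plus `train_*` and `val_*` lists, while `plot_perf_metric_to_disk` expects separate `x`, `y_train`, and `y_val` lists. A minimal sketch of the glue between the two could look like the following. The helper name `split_metric_series` and the shift to one-based epoch numbers are assumptions made for illustration; neither is part of the repository.

```python
def split_metric_series(metrics: dict, metric_name: str) -> tuple[list, list, list]:
    """
    Hypothetical helper (not part of the repository): pull one metric out of
    a dictionary shaped like the one returned by train_fold.
    """
    train_key = f'train_{metric_name}'
    val_key = f'val_{metric_name}'
    for key in ('epoch', train_key, val_key):
        if key not in metrics:
            raise KeyError(f'Missing key in metrics: {key}')
    # train_fold stores zero-based epoch indices; shifting to one-based
    # numbers for the x-axis is an assumption of this sketch
    epochs = [epoch + 1 for epoch in metrics['epoch']]
    return epochs, list(metrics[train_key]), list(metrics[val_key])


# Made-up values shaped like the train_fold output
example_metrics = {'epoch': [0, 1, 2],
                   'train_loss': [1.20, 0.80, 0.55],
                   'val_loss': [1.10, 0.90, 0.70]}
x, y_train, y_val = split_metric_series(example_metrics, 'loss')
print(x, y_train, y_val)
```

The three lists can then be passed to `plot_perf_metric_to_disk` as `x`, `y_train`, and `y_val`, together with a `save_path` of your choice and `metric_name='loss'`.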