├── .gitignore ├── LICENSE.md ├── README.md ├── create_data.ipynb ├── data └── README.md ├── dimgpt ├── __init__.py ├── data │ ├── __init__.py │ ├── clean.py │ ├── datasets │ │ ├── __init__.py │ │ ├── dataset.py │ │ └── pretraining │ │ │ ├── __init__.py │ │ │ ├── books.py │ │ │ ├── common_crawl.py │ │ │ ├── institutions.py │ │ │ ├── news.py │ │ │ ├── others.py │ │ │ └── wikipedia.py │ ├── finetuning.py │ ├── pretokenizer.py │ ├── pretraining.py │ └── tokenizer.py ├── settings.py ├── testing │ ├── __init__.py │ └── sampling.py ├── training │ ├── __init__.py │ ├── datasets │ │ ├── __init__.py │ │ ├── dataset.py │ │ ├── finetuning.py │ │ └── pretraining.py │ ├── layers.py │ ├── model.py │ ├── optimizer.py │ ├── rope.py │ └── trainer.py └── utils.py ├── models └── README.md ├── requirements.txt ├── resources └── misc │ ├── accuracy.png │ ├── loss.png │ ├── test_1.png │ ├── test_10.png │ ├── test_11.png │ ├── test_2.png │ ├── test_3.png │ ├── test_4.png │ ├── test_5.png │ ├── test_6.png │ ├── test_7.png │ ├── test_8.png │ ├── test_9.png │ └── thumbnail.png ├── testing.ipynb └── training.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | .vscode 2 | /bacup* 3 | /venv 4 | /data/* 5 | !/data/README.md 6 | /models/* 7 | !/models/README.md 8 | /output 9 | notes.txt 10 | final_SPF.xml 11 | __pycache__ 12 | .DS_Store 13 | env.py 14 | /test*.ipynb 15 | /show*.ipynb 16 | /test.txt 17 | /*.whl 18 | /validate.ipynb 19 | /dimgpt/testing/tester.py -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Angel Uriot 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 💬 Language model 2 | 3 | ![Release](https://img.shields.io/badge/Release-v1.0-blueviolet) 4 | ![Language](https://img.shields.io/badge/Language-Python-f2cb1b) 5 | ![Libraries](https://img.shields.io/badge/Libraries-PyTorch-00cf2c) 6 | ![Size](https://img.shields.io/badge/Size-4.2Mo-f12222) 7 | ![Open Source](https://badges.frapsoft.com/os/v2/open-source.svg?v=103) 8 | 9 |
This repository contains the code to train and test autoregressive language models like [**ChatGPT**](https://openai.com/chatgpt) from scratch. I also used it to train the French open-source [**DimensionGPT**](#-dimensiongpt) models.
<br/>
# 📋 Summary

* **[📋 Summary](#-summary)**
* **[🤖 DimensionGPT](#-dimensiongpt)**
	* [🏗️ Architecture](#%EF%B8%8F-architecture)
	* [💾 Data](#-data)
	* [🦾 Training](#-training)
	* [🪛 Fine-tuning](#-fine-tuning)
	* [🧪 Tests](#-tests)
	* [🎛️ Weights](#%EF%B8%8F-weights)
* **[📦 Dependencies](#-dependencies)**
* **[🦾 Training](#-training-1)**
* **[⚗️ Testing](#%EF%B8%8F-testing)**
* **[🙏 Credits](#-credits)**

<br/>
# 🤖 DimensionGPT

Using this repository, I trained [**DimensionGPT-0.2B**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m), a small 0.2B-parameter language model, on 50B tokens with my personal RTX 3090 GPU in ≈570 hours.
## 🏗️ Architecture

The model is based on the transformer architecture (only the decoder part) from the paper [**Attention is All You Need**](https://doi.org/10.48550/arXiv.1706.03762) by **Google Brain** (2017), with a few improvements (a minimal sketch of some of them follows this list):

* I replaced the default normalization layer with Root Mean Square Layer Normalization (RMSNorm) from the paper [**Root Mean Square Layer Normalization**](https://doi.org/10.48550/arXiv.1910.07467) by **Edinburgh University** (2019)

* I moved the normalization layers before the transformer blocks (instead of after) as in the paper [**On Layer Normalization in the Transformer Architecture**](https://doi.org/10.48550/arXiv.2002.04745) by **Microsoft Research** (2020)

* I replaced the ReLU activation with the SwiGLU activation from the paper [**GLU Variants Improve Transformer**](https://doi.org/10.48550/arXiv.2002.05202) by **Google** (2020)

* I implemented Grouped-Query Attention (GQA) from the paper [**GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints**](https://doi.org/10.48550/arXiv.2305.13245) by **Google Research** (2023)

* I replaced the absolute positional embedding with Rotary Position Embedding (RoPE) from the paper [**RoFormer: Enhanced Transformer with Rotary Position Embedding**](https://doi.org/10.48550/arXiv.2104.09864) by **Zhuiyi Technology** (2021)

* I implemented Sliding Window Attention (SWA) from the paper [**Longformer: The Long-Document Transformer**](https://doi.org/10.48550/arXiv.2004.05150) by **Allen Institute** (2020)
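The corresponding modules live in `dimgpt/training/layers.py` (not included in this dump). A minimal PyTorch sketch of the normalization and activation changes — RMSNorm used in pre-norm position and a SwiGLU feed-forward — could look like the following; the class and attribute names are illustrative, not the repository's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
	"""Root Mean Square Layer Normalization (Zhang & Sennrich, 2019)."""
	def __init__(self, dim: int, eps: float = 1e-5):
		super().__init__()
		self.eps = eps
		self.weight = nn.Parameter(torch.ones(dim))

	def forward(self, x: torch.Tensor) -> torch.Tensor:
		# Scale by the inverse RMS of the features, then apply a learned gain (no mean centering, no bias)
		return x * torch.rsqrt(x.pow(2).mean(-1, keepdim = True) + self.eps) * self.weight

class SwiGLU(nn.Module):
	"""SwiGLU feed-forward (Shazeer, 2020): down(silu(gate(x)) * up(x))."""
	def __init__(self, dim: int, hidden_dim: int):
		super().__init__()
		self.gate = nn.Linear(dim, hidden_dim, bias = False)
		self.up = nn.Linear(dim, hidden_dim, bias = False)
		self.down = nn.Linear(hidden_dim, dim, bias = False)

	def forward(self, x: torch.Tensor) -> torch.Tensor:
		return self.down(F.silu(self.gate(x)) * self.up(x))

# Pre-norm residual wiring (normalization *before* each sub-block):
# x = x + attention(RMSNorm(x));  x = x + feed_forward(RMSNorm(x))
```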
Here are the main parameters of the architecture:
| Parameter                     | Value  |
|-------------------------------|--------|
| Embedding dimension           | 1,024  |
| Number of layers              | 16     |
| Heads dimension               | 64     |
| Feed forward hidden dimension | 2,730  |
| Number of heads               | 16     |
| Number of grouped heads       | 4      |
| Window size                   | 256    |
| Context length                | 512    |
| Vocab size                    | 32,000 |
<br/>
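Using the head counts and window size from the table above, grouped-query attention with a sliding-window causal mask can be sketched as below. This is a simplified stand-in for the repository's attention layer (which also applies RoPE and can rely on Flash Attention); shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, window_size: int = 256) -> torch.Tensor:
	# q: (batch, 16, seq, 64) query heads  |  k, v: (batch, 4, seq, 64) shared key / value heads
	groups = q.shape[1] // k.shape[1]
	k = k.repeat_interleave(groups, dim = 1)   # each K/V head serves a group of 4 query heads
	v = v.repeat_interleave(groups, dim = 1)

	# Causal mask restricted to the last `window_size` positions (sliding window attention)
	seq = q.shape[2]
	i = torch.arange(seq).unsqueeze(1)
	j = torch.arange(seq).unsqueeze(0)
	mask = (j <= i) & (j > i - window_size)

	return F.scaled_dot_product_attention(q, k, v, attn_mask = mask)

q = torch.randn(1, 16, 512, 64)
k = torch.randn(1, 4, 512, 64)
v = torch.randn(1, 4, 512, 64)
print(attention(q, k, v).shape)  # torch.Size([1, 16, 512, 64])
```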
The resulting model has 208,929,792 trainable parameters and fits on a single RTX 3090 GPU for training with a batch size of 16 using mixed precision. For inference only, the model should fit on any modern GPU.

<br/>
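Assuming no bias terms and a single embedding matrix shared between the input and output projections (which is what the total suggests), the parameter count can be reproduced from the table by hand:

```python
# Rough parameter count from the architecture table (assumes no biases and a tied embedding / output matrix)
vocab, dim, ffn, layers = 32_000, 1_024, 2_730, 16
head_dim, kv_heads = 64, 4

embedding = vocab * dim                                                  # token embedding, reused as output projection
attention = dim * dim + 2 * dim * (kv_heads * head_dim) + dim * dim     # Q, K, V (grouped), output projections
feed_forward = 3 * dim * ffn                                            # SwiGLU: gate, up, down matrices
norms = 2 * dim                                                         # two RMSNorm gains per block

total = embedding + layers * (attention + feed_forward + norms) + dim   # + final RMSNorm
print(f'{total:,}')  # 208,929,792
```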
## 💾 Data

The dataset used to train this model is exclusively in French and is a mix of multiple sources:
| Source                        | Documents  | Tokens         | Multiplier | Ratio    |
|-------------------------------|------------|----------------|------------|----------|
| Common Crawl (FR)             | 21,476,796 | 35,821,271,160 | 1.0        | 76.89 %  |
| Wikipedia (FR)                | 2,700,373  | 1,626,389,831  | 4.0        | 13.96 %  |
| French news articles          | 20,446,435 | 11,308,851,150 | 0.3        | 7.28 %   |
| French books                  | 29,322     | 2,796,450,308  | 0.2        | 1.20 %   |
| French institutions documents | 87,103     | 147,034,958    | 2.0        | 0.63 %   |
| Others                        | 2,761      | 7,287,322      | 2.0        | 0.03 %   |
| **Total**                     | 44,742,790 | 51,707,284,729 | -          | 100.00 % |
<br/>
For the tokenization, I created my own tokenizer that first cleans the text to keep only a predefined set of characters, then uses the [**Byte Pair Encoding (BPE)**](https://en.wikipedia.org/wiki/Byte_pair_encoding) algorithm to create the vocabulary. I trained the tokenizer on a 300-million-character subset of the dataset to get my 32,000-token vocabulary.

<br/>
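The repository's tokenizer (`dimgpt/data/tokenizer.py`, shown later in this dump) builds its vocabulary with the Hugging Face `tokenizers` library. Stripped of the custom pre-tokenizer and the vocabulary filtering it applies, the core BPE training step looks roughly like this — the file path and the reduced special-token list are illustrative:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Learn a ~32,000-token BPE vocabulary from the cleaned ~300M-character text subset
tokenizer = Tokenizer(BPE(unk_token = '⮜unknown⮞'))

trainer = BpeTrainer(
	vocab_size = 32_000,
	special_tokens = ['⮜unknown⮞', '⮜padding⮞', '⮜start-of-text⮞', '⮜end-of-text⮞'],
	show_progress = True
)

tokenizer.train(['data/tokenizer_data.txt'], trainer)
print(f'{tokenizer.get_vocab_size():,} tokens')
```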
## 🦾 Training

For the training, I used the AdamW optimizer (a variant of stochastic gradient descent) with a learning rate schedule of linear warmup followed by cosine decay. Here are the main hyperparameters:
| Hyperparameter      | Value      |
|---------------------|------------|
| Batch size (tokens) | 524,288    |
| Optimizer           | AdamW      |
| Learning rate       | 6.0 × 10⁻⁴ |
| Warmup steps        | 2,000      |
| Decay steps         | 100,000    |
| β₁                  | 0.9        |
| β₂                  | 0.95       |
| ε                   | 10⁻⁵       |
| Weight decay        | 0.1        |
| Gradient clipping   | 1.0        |
<br/>
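A sketch of the schedule and optimizer implied by this table (linear warmup to the maximum learning rate, then cosine decay down to the minimum of 6.0 × 10⁻⁵ defined in `dimgpt/settings.py`); the function and the stand-in model are illustrative, not the repository's exact code:

```python
import math
import torch

MAX_LR, MIN_LR = 6e-4, 6e-5
WARMUP_STEPS, DECAY_STEPS = 2_000, 100_000

def learning_rate(step: int) -> float:
	# Linear warmup, then cosine decay from MAX_LR down to MIN_LR
	if step < WARMUP_STEPS:
		return MAX_LR * step / WARMUP_STEPS
	if step >= DECAY_STEPS:
		return MIN_LR
	progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
	return MIN_LR + 0.5 * (1.0 + math.cos(math.pi * progress)) * (MAX_LR - MIN_LR)

model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr = learning_rate(0), betas = (0.9, 0.95), eps = 1e-5, weight_decay = 0.1)

# Each step: update the learning rate, then clip gradients to 1.0 before optimizer.step()
for group in optimizer.param_groups:
	group['lr'] = learning_rate(1_000)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```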
I trained the model on my personal RTX 3090 GPU for 1 epoch on the full dataset (13 times the [**Chinchilla optimal**](https://doi.org/10.48550/arXiv.2203.15556)) using mixed precision and gradient accumulation to increase speed and reduce memory usage:
| Training summary |                |
|------------------|----------------|
| Tokens           | 52,428,800,000 |
| Steps            | 100,000        |
| FLOPs            | 6.6 × 10¹⁹     |
| Duration         | 573 hours      |
| Final loss       | 2.19           |
| Final accuracy   | 54.8 %         |
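These figures are consistent with each other and with the parameter count given earlier; a quick arithmetic check:

```python
# Cross-checking the training summary (pure arithmetic)
params = 208_929_792
tokens_per_step = 524_288
steps = 100_000

tokens = tokens_per_step * steps
print(f'{tokens:,}')                  # 52,428,800,000  -> "Tokens"

print(f'{6 * params * tokens:.1e}')   # ~6.6e+19        -> "FLOPs" (usual 6·N·D estimate)

chinchilla = 20 * params              # ~20 tokens per parameter (Chinchilla heuristic)
print(round(tokens / chinchilla))     # 13              -> "13 times the Chinchilla optimal"
```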
*(Training curves: see `loss.png` and `accuracy.png` in `resources/misc`.)*

<br/>
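The full training loop lives in `dimgpt/training/trainer.py`; the mixed-precision and gradient-accumulation pattern described above boils down to something like the sketch below (it assumes a CUDA GPU with bfloat16 support, and the model and data are small stand-ins, not the real ones):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the real model and dataset (illustrative only)
model = torch.nn.Sequential(torch.nn.Embedding(32_000, 64), torch.nn.Linear(64, 32_000)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr = 6e-4, betas = (0.9, 0.95), eps = 1e-5, weight_decay = 0.1)

ACCUMULATIONS = 64   # 16 sequences × 512 tokens × 64 micro-batches ≈ 524,288 tokens per optimizer step

def micro_batch() -> tuple[torch.Tensor, torch.Tensor]:
	# Random tokens of shape (16, 512); the real trainer samples windows from the tokenized dataset
	x = torch.randint(0, 32_000, (16, 512), device = 'cuda')
	return x, x   # in the real code the targets are the inputs shifted by one token

for step in range(10):
	for _ in range(ACCUMULATIONS):
		x, y = micro_batch()
		with torch.autocast(device_type = 'cuda', dtype = torch.bfloat16):   # mixed precision
			loss = F.cross_entropy(model(x).flatten(0, 1), y.flatten())
		(loss / ACCUMULATIONS).backward()                                    # accumulate gradients

	torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
	optimizer.step()
	optimizer.zero_grad(set_to_none = True)
```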
## 🪛 Fine-tuning

I fine-tuned the model on the [**French instructions dataset**](https://github.com/angeluriot/French_instruct) I made for this project to create [**DimensionGPT-0.2B-Chat**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m), a 0.2B language model trained to follow instructions and answer questions in French.

<br/>
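Concretely, the instruction data is laid out with the project's control tokens (see `dimgpt/data/finetuning.py` and the `PREPROMPT` constant in `dimgpt/settings.py` later in this dump). Schematically, a single chat example looks roughly like this — the conversation content is made up, and the real pipeline works on token ids rather than concatenated strings:

```python
# Schematic layout of a chat example using the project's control tokens
preprompt = (
	"Une discussion entre un utilisateur et DimensionGPT, un modèle de langage conversationnel "
	"français créé par le développeur indépendant Dimension et basé sur l'architecture GPT."
)

conversation = [
	{'role': 'user', 'text': "Bonjour, qui es-tu ?"},
	{'role': 'assistant', 'text': "Bonjour ! Je suis DimensionGPT, un assistant francophone."},
]

example = '⮜start-of-text⮞⮜system⮞' + preprompt

for message in conversation:
	tag = '⮜user⮞' if message['role'] == 'user' else '⮜assistant⮞'
	example += tag + message['text']

example += '⮜end-of-text⮞'
print(example)
```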
## 🧪 Tests

Here are some examples of the model outputs *(screenshots: `test_1.png` to `test_11.png` in `resources/misc`)*:

<br/>
## 🎛️ Weights

The trained weights of the different models are available on [**Google Drive**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m); you just need to:

* Download the `.pt` file of the model you want to use and put it in the `models` folder
* Download the `vocab.txt` file and put it in the `data` folder

<br/>
# 📦 Dependencies

* [**Python**](https://www.python.org/)
* [**PyTorch**](https://pytorch.org/)
* [**Flash Attention**](https://github.com/Dao-AILab/Flash-attention)
* [**Datasets 🤗**](https://github.com/huggingface/datasets)
* [**Tokenizers 🤗**](https://github.com/huggingface/tokenizers)
* [**Unidecode**](https://pypi.org/project/Unidecode/)
* [**Regex**](https://github.com/mrabarnett/mrab-regex)
* [**Tqdm**](https://tqdm.github.io/)
* [**PSUtil**](https://github.com/giampaolo/psutil)

<br/>
Run the following command to install the dependencies:

```shell
$ pip install -r requirements.txt
```

⚠️ You may need to use a [**specific command**](https://pytorch.org/get-started/locally/) for PyTorch if you want to use CUDA

⚠️ You may need to manually install a [**Flash Attention release**](https://github.com/Dao-AILab/flash-attention/releases) for Windows

<br/>
# 🦾 Training

* Run the `create_data.ipynb` file to create the tokenizer and the dataset *(it may take an entire day and consume a few hundred gigabytes of disk space)*

* Run the `training.ipynb` file *(you can stop the training at any time and resume it later thanks to the checkpoints)*

* If you don't have an overpriced 24 GB GPU like me, the default settings (those used to train [**DimensionGPT**](#-dimensiongpt)) may not work for you. You can try the following *(example values are sketched below)*:
	* Reduce the **batch size** *(less stable and a worse minimum loss)*
	* Increase the **accumulation steps** *(fixes the previous problems but is slower)*
	* Reduce some **architecture parameters** *(a worse minimum loss)*

<br/>
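The defaults mentioned above live in `dimgpt/settings.py` (shown later in this dump). A smaller-GPU configuration could, for example, adjust constants like these — the exact values are illustrative, not recommendations:

```python
# Example adjustments to the constants in dimgpt/settings.py (values here are illustrative)

BATCH_SIZE = 8           # default: 16  -> smaller micro-batches
NUM_ACCUMULATIONS = 128  # default: 64  -> keeps the same ~524k tokens per optimizer step

# Or shrink the architecture (at the cost of a worse minimum loss):
EMBEDDING_DIM = 768      # default: 1024
NUM_BLOCKS = 12          # default: 16
```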
# ⚗️ Testing

* Run the `testing.ipynb` file to use the models you downloaded or trained

<br/>
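For reference, the sampling utilities used by the notebook are in `dimgpt/testing/sampling.py`. A rough sketch of calling them directly is given below; how the model is constructed and how the downloaded weights are loaded depends on `dimgpt/training/model.py`, which is not shown in this dump, so the lines marked as assumptions (and the checkpoint file name) are guesses — `testing.ipynb` remains the supported entry point:

```python
import torch
from dimgpt.data.tokenizer import Tokenizer
from dimgpt.testing.sampling import Sampler
from dimgpt.training.model import Model

tokenizer = Tokenizer()    # reads data/vocab.txt
model = Model()            # assumption: constructor takes no arguments
model.load_state_dict(torch.load('models/dimension_gpt_chat.pt', map_location = 'cpu'))  # assumption: checkpoint is a state dict, file name illustrative
model.eval()

sampler = Sampler(model, tokenizer)
print(sampler.generate('Bonjour !', max_length = 128, chat_bot = True, temperature = 0.7, top_p = 0.9))
```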
403 | 404 | # 🙏 Credits 405 | 406 | * [**Angel Uriot**](https://github.com/angeluriot) : Creator of the project. 407 | -------------------------------------------------------------------------------- /create_data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Create training data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Imports" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "from dimgpt.data.tokenizer import *\n", 24 | "from dimgpt.data.pretraining import *\n", 25 | "from dimgpt.data.finetuning import *\n", 26 | "from dimgpt import utils\n", 27 | "from dimgpt.settings import *\n", 28 | "\n", 29 | "utils.reset_rand()" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### Import pretraining dataset" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "pretraining = Pretraining()" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "pretraining.summary()" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "print(pretraining.get_document())" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "### Create vocab" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "sizes, chars = pretraining.create_tokenizer_data()\n", 80 | "\n", 81 | "print()\n", 82 | "\n", 83 | "for i in range(len(pretraining.datasets)):\n", 84 | "\tprint(f'{pretraining.datasets[i].name}: {sizes[i]:,} characters')\n", 85 | "\n", 86 | "print('\\nNb unique characters:', len(chars), '\\n')\n", 87 | "\n", 88 | "for char in chars:\n", 89 | "\tprint(f'[{char}]', end = ' ')" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "tokenizer = Tokenizer()\n", 99 | "\n", 100 | "print(f'\\nVocab size: {len(tokenizer.vocab):,}\\n')\n", 101 | "\n", 102 | "for v in tokenizer.vocab:\n", 103 | "\tprint(f'[{v}]', end = ' ')" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "### Encode datasets" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "pretraining.save(tokenizer)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "pretraining.summary()" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "finetuning = Finetuning()\n", 138 | "finetuning.save(tokenizer)" 139 | ] 140 | } 141 | ], 142 | "metadata": { 143 | "kernelspec": { 144 | "display_name": "venv", 145 | "language": "python", 146 | "name": "python3" 147 | }, 148 | "language_info": { 149 | "codemirror_mode": { 150 | "name": "ipython", 151 | "version": 3 152 | }, 153 | "file_extension": ".py", 154 | "mimetype": "text/x-python", 155 | 
"name": "python", 156 | "nbconvert_exporter": "python", 157 | "pygments_lexer": "ipython3", 158 | "version": "3.10.11" 159 | }, 160 | "orig_nbformat": 4 161 | }, 162 | "nbformat": 4, 163 | "nbformat_minor": 2 164 | } 165 | -------------------------------------------------------------------------------- /data/README.md: -------------------------------------------------------------------------------- 1 | # 🎛️ Trained weights 2 | 3 | The trained weights of the different models are available on [**Google Drive**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m), you just need to: 4 | 5 | * Download the `.pt` file of the model you want to use and put it in the `models` folder 6 | * Download the `vocab.txt` file and put it in this folder 7 | -------------------------------------------------------------------------------- /dimgpt/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/dimgpt/__init__.py -------------------------------------------------------------------------------- /dimgpt/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/dimgpt/data/__init__.py -------------------------------------------------------------------------------- /dimgpt/data/clean.py: -------------------------------------------------------------------------------- 1 | import regex 2 | from unidecode import unidecode 3 | 4 | from dimgpt.settings import * 5 | 6 | 7 | AUTHORIZED_UNICODE = set( 8 | 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' \ 9 | '0123456789' \ 10 | ' !"#$%&\'`()*+,-./:;<=>?@[\\]^_{|}~' \ 11 | 'ÀàÂâÄäÇçÉéÈèÊêËëÎîÏïÔôÖöÙùÛûÜüÆæŒœ' \ 12 | '€£¥•·²³≠±×÷√π' \ 13 | '😀😃😄😁😆😅😂🤣🥲🥹😊😇🙂🙃😉😌😍🥰😘😗😙😚😋😛😝😜🤪🤨🧐🤓😎🥸🤩🥳😏😒😞😔😟😕🙁😣😖😫😩🥺😢😭😤😠😡🤬🤯😳🥵🥶😱😨😰😥😓🫣🤗🫡🤔🫢🤭🤫🤥😶😐😑😬🫠🙄😯😦😧😮😲🥱😴🤤😪😵🫥🤐🥴🤢🤮🤧😷🤒🤕🤑🤠😈👿👹👺🤡💩👻💀👽👾🤖🎃😺😸😹😻😼😽🙀😿😾' \ 14 | '👋🤚🖐✋🖖👌🤌🤏🤞🫰🤟🤘🤙🫵🫱🫲🫳🫴👈👉👆🖕👇👍👎✊👊🤛🤜👏🫶🙌👐🤲🤝🙏💅🤳💪🦾🦵🦿🦶👣👂🦻👃🫀🫁🧠🦷🦴👀👁👅👄🫦💋🩸' \ 15 | '👶👧🧒👦👩🧑👨👱🧔👵🧓👴👲👳🧕👮👷💂👰🤵👸🫅🤴🥷🦸🦹🤶🎅🧙🧝🧛🧟🧞🧜🧚🧌👼🤰🤱🙇💁🙅🙆🙋🧏🤦🤷🙎🙍💇💆🧖💅🤳💃🕺👯🕴🚶🧎🏃🧍👭👬👫💑💏👪🗣👤👥🫂' \ 16 | '🧳🌂🧵🪡🪢🧶👓🕶🥽🥼🦺👔👕👖🧣🧤🧥🧦👗👘🥻🩴🩱🩲🩳👙👚👛👜👝🎒👞👟🥾🥿👠👡🩰👢👑👒🎩🎓🧢⛑🪖💄💍💼' \ 17 | '🐶🐱🐭🐹🐰🦊🐻🐼🐨🐯🦁🐮🐷🐽🐸🐵🙈🙉🙊🐒🐔🐧🐦🐤🐣🐥🦆🦅🦉🦇🐺🐗🐴🦄🐝🪱🐛🦋🐌🐞🐜🪰🪲🪳🦟🦗🕷🕸🦂🐢🐍🦎🦖🦕🐙🦑🦐🦞🦀🪸🐡🐠🐟🐬🐳🐋🦈🐊🐅🐆🦓🦍🦧🦣🐘🦛🦏🐪🐫🦒🦘🦬🐃🐂🐄🐎🐖🐏🐑🦙🐐🦌🐕🐩🦮🐈🪶🐓🦃🦤🦚🦜🦢🦩🕊🐇🦝🦨🦡🦫🦦🦥🐁🐀🐿🦔🐾🐉🐲🌵🎄🌲🌳🌴🪹🪺🪵🌱🌿🍀🎍🪴🎋🍃🍂🍁🍄🐚🪨🌾💐🌷🪷🌹🥀🌺🌸🌼🌻🌞🌝🌛🌜🌚🌕🌖🌗🌘🌑🌒🌓🌔🌙🌎🌍🌏🪐💫⭐🌟✨💥🔥🌪🌈🌤🌥🌦🌧⛈🌩🌨🌬💨💧💦🫧🌊🌫' \ 18 | '🍏🍎🍐🍊🍋🍌🍉🍇🍓🫐🍈🍒🍑🥭🍍🥥🥝🍅🍆🥑🥦🥬🥒🌶🫑🌽🥕🫒🧄🧅🥔🍠🫘🥐🥯🍞🥖🥨🧀🥚🍳🧈🥞🧇🥓🥩🍗🍖🦴🌭🍔🍟🍕🫓🥪🥙🧆🌮🌯🫔🥗🥘🫕🥫🍝🍜🍲🍛🍣🍱🥟🦪🍤🍙🍚🍘🍥🥠🥮🍢🍡🍧🍨🍦🥧🧁🍰🎂🍮🍭🍬🍫🍿🍩🍪🌰🥜🍯🥛🍼🫖☕🍵🧃🥤🧋🫙🍶🍺🍻🥂🍷🫗🥃🍸🍹🧉🍾🧊🥄🍴🍽🥣🥡🥢🧂' \ 19 | '⚽🏀🏈⚾🥎🎾🏐🏉🥏🎱🪀🏓🏸🏒🏑🥍🏏🪃🥅🪁🏹🎣🤿🥊🥋🎽🛹🛼🛷⛸🥌🎿⛷🏂🪂🤼🤸🤺🤾🏇🧘🏄🏊🤽🚣🧗🚵🚴🏆🥇🥈🥉🏅🎖🏵🎗🎫🎟🎪🤹🎭🩰🎨🎬🎤🎧🎼🎹🥁🪘🎷🎺🪗🎸🪕🎻🎲♟🎯🎳🎮🎰🧩' \ 20 | '🚗🚕🚙🚌🚎🏎🚓🚑🚒🚐🛻🚚🚛🚜🦯🦽🦼🛴🚲🛵🏍🛺🚨🚔🚍🚘🚖🛞🚡🚠🚟🚃🚋🚞🚝🚄🚅🚈🚂🚆🚇🚊🚉🛫🛬🛩💺🛰🚀🛸🚁🛶⛵🚤🛥🛳⛴🚢🛟🪝🚧🚦🚥🚏🗺🗿🗽🗼🏰🏯🏟🎡🎢🛝🎠⛱🏖🏝🏜🌋⛰🏔🗻🏕🛖🏠🏡🏘🏚🏗🏭🏢🏬🏣🏤🏥🏦🏨🏪🏫🏩💒🏛🕌🕍🛕🕋⛩🛤🛣🗾🎑🏞🌅🌄🌠🎇🎆🌇🌆🏙🌃🌌🌉🌁' \ 21 | '⌚📱📲💻🖥🖨🖱🖲🕹🗜💽💾💿📀📼📷📸📹🎥📽🎞📞📟📠📺📻🎙🎚🎛🧭⏱⏲⏰🕰⌛⏳📡🔋🪫🔌💡🔦🕯🪔🧯🛢💸💵💴💶💷🪙💰💳💎🪜🧰🪛🔧🔨⚒🛠⛏🪚🔩🪤🧱⛓🧲🔫💣🧨🪓🔪🗡🛡🚬🪦🏺🔮📿🧿🪬💈🔭🔬🕳🩹🩺🩻🩼💊💉🩸🧬🦠🧫🧪🌡🧹🪠🧺🧻🚽🚰🚿🛁🛀🧼🪥🪒🧽🪣🧴🛎🔑🗝🚪🪑🛋🛏🛌🧸🪆🖼🪞🪟🛍🛒🎁🎈🎏🎀🪄🪅🎊🎉🪩🎎🏮🎐🧧📩📨📧💌📥📤📦🏷🪧📪📫📬📭📮📯📜📃📄📑🧾📊📈📉🗒🗓📆📅🗑🪪📇🗃🗳🗄📋📁📂🗂🗞📰📓📔📒📕📗📘📙📚📖🔖🧷🔗📎🖇📐📏🧮📌📍🖊🖋🖌🖍📝🔍🔎🔏🔐🔒🔓' \ 22 | '🧡💛💚💙💜🖤🤍🤎💔💕💞💓💗💖💘💝💟🔯🕎🛐⛎🆔🉑📴📳🈶🈸🈺🆚💮🉐🈴🈵🈹🈲🆎🆑🆘❌🛑⛔📛🚫💯💢🚷🚯🚳🚱🔞📵🚭🔅🔆🚸🔱🔰✅💹❎🌐💠🌀💤🏧🚾🛗🈳🛂🛃🛄🛅🚹🚺🚼⚧🚻🚮🎦📶🈁🔣🔤🔡🔠🆖🆗🆙🆒🆕🆓🔟🔢⏸⏯⏹⏺⏭⏮⏩⏪⏫⏬🔼🔽🔀🔁🔂🔄🔃🎵🎶➕➖➗🟰♾💲💱➰➿🔚🔙🔛🔝🔜🔘🔴🟠🟡🟢🔵🟣🟤🔺🔻🔸🔹🔶🔷🔳🔲🟥🟧🟨🟩🟦🟪🟫🔈🔇🔉🔊🔔🔕📣📢💬💭🗯🃏🎴🕐🕑🕒🕓🕔🕕🕖🕗🕘🕙🕚🕛🕜🕝🕞🕟🕠🕡🕢🕣🕤🕥🕦🕧' \ 23 | '🏴🏁🚩🎌' 24 | ) 25 | 26 | AUTHORIZED_ASCII = set( 27 | 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' \ 28 | '0123456789' \ 29 | ' !"#$%&\'`()*+,-./:;<=>?@[\\]^_{|}~' 30 | ) 31 | 32 | REPLACE_UNICODE = { 33 | '« ': '"', 34 | ' »': '"', 35 | '«': '"', 36 | '»': '"', 37 | '❗️': '!', 38 | '❕': '!', 39 | '❓': '?', 40 | '❔': '?', 41 | 
'‼️': '!!', 42 | '⁉️': '!?', 43 | '✖️': '❌', 44 | '✔️': '✅', 45 | '☺': '😊', 46 | '☺️': '😊', 47 | '☹': '🙁', 48 | '☹️': '🙁' 49 | } 50 | 51 | ENCODE_STRING_EMOJIS = { 52 | '☂️': '☂', 53 | '☀️': '☀', 54 | '❄️': '❄', 55 | '✈️': '✈', 56 | '☎️': '☎', 57 | '⚙️': '⚙', 58 | '⚔️': '⚔', 59 | '✉️': '✉', 60 | '✂️': '✂', 61 | '✒️': '✒', 62 | '❤️': '❤', 63 | '☢️': '☢', 64 | '☣️': '☣', 65 | '⚠️': '⚠', 66 | '♻️': '♻', 67 | '🏳️‍🌈': '①', 68 | '🏳️‍⚧️': '②', 69 | '🏴‍☠️': '③', 70 | '🇺🇸': '④', 71 | '🇨🇳': '⑤', 72 | '🇯🇵': '⑥', 73 | '🇩🇪': '⑦', 74 | '🇮🇳': '⑧', 75 | '🇬🇧': '⑨', 76 | '🇫🇷': '⑩', 77 | '🇮🇹': '⑪', 78 | '🇨🇦': '⑫', 79 | '🇧🇷': '⑬', 80 | '🇷🇺': '⑭', 81 | '🇰🇷': '⑮', 82 | '🇦🇺': '⑯', 83 | '🇲🇽': '⑰', 84 | '🇪🇸': '⑱', 85 | '🏳️': '🏳' 86 | } 87 | 88 | DECODE_STRING_EMOJIS = {value: key for key, value in reversed(ENCODE_STRING_EMOJIS.items())} 89 | 90 | ENCODE_CHARS = list('①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱') 91 | 92 | REPLACE_ASCII_STRING = { 93 | '--': '-' 94 | } 95 | 96 | STRIP_REPLACE = { 97 | ' \n': '\n', 98 | '\t\n': '\n', 99 | '\n\n\n': '\n\n' 100 | } 101 | 102 | CONTROL_REPLACE = { 103 | '\t': '⮜tab⮞', 104 | '\n': '⮜new-line⮞' 105 | } 106 | 107 | POSSIBLE_CHARS = AUTHORIZED_UNICODE | set(DECODE_STRING_EMOJIS.keys()) 108 | 109 | 110 | def clean_ascii(char: str) -> str: 111 | 112 | if char in AUTHORIZED_ASCII or char in CONTROL_REPLACE.keys(): 113 | return char 114 | 115 | return '' 116 | 117 | 118 | def clean_unicode(char: str) -> str: 119 | 120 | if char in AUTHORIZED_UNICODE or char in DECODE_STRING_EMOJIS or char in CONTROL_REPLACE.keys(): 121 | return char 122 | 123 | text = unidecode(char) 124 | 125 | for key, value in REPLACE_ASCII_STRING.items(): 126 | text = text.replace(key, value) 127 | 128 | return ''.join([clean_ascii(char) for char in text]) 129 | 130 | 131 | def clean_string(text: str, keep_control_tokens: bool = False) -> str: 132 | 133 | if len(text) == 0: 134 | return '' 135 | 136 | text = text.replace('\r', '') 137 | 138 | if keep_control_tokens: 139 | 140 | safe_control_tokens = [regex.escape(c) for c in CONTROL_TOKENS] 141 | reg = r'(' + r'|'.join(safe_control_tokens) + r''.join([f'[{i}]' for i in safe_control_tokens]) + r']+)' 142 | parts = regex.split(reg, text, flags = regex.UNICODE, concurrent = False) 143 | parts = list(filter(None, parts)) 144 | 145 | return ''.join([part if part in CONTROL_TOKENS else clean_string(part) for part in parts]) 146 | 147 | for key, value in REPLACE_UNICODE.items(): 148 | text = text.replace(key, value) 149 | 150 | for char in ENCODE_CHARS: 151 | text = text.replace(char, unidecode(char)) 152 | 153 | for key, value in ENCODE_STRING_EMOJIS.items(): 154 | text = text.replace(key, value) 155 | 156 | text = ''.join([clean_unicode(char) for char in text]) 157 | 158 | for key, value in STRIP_REPLACE.items(): 159 | while key in text: 160 | text = text.replace(key, value) 161 | 162 | text = text.strip() 163 | 164 | for key, value in CONTROL_REPLACE.items(): 165 | text = text.replace(key, value) 166 | 167 | return text 168 | 169 | 170 | def unclean_string(text: str, keep_control_tokens: bool = False) -> str: 171 | 172 | for key, value in DECODE_STRING_EMOJIS.items(): 173 | text = text.replace(key, value) 174 | 175 | if keep_control_tokens: 176 | return text 177 | 178 | text = text.replace('⮜unknown⮞', '�') 179 | text = text.replace('⮜padding⮞', '') 180 | text = text.replace('⮜start-of-text⮞', '\n\n---------- START OF TEXT ----------\n\n') 181 | text = text.replace('⮜tab⮞', '\t') 182 | text = text.replace('⮜new-line⮞', '\n') 183 | text = text.replace('⮜human⮞', '\n\n--- Human 
---\n\n') 184 | text = text.replace('⮜system⮞', '\n\n--- System ---\n\n') 185 | text = text.replace('⮜user⮞', '\n\n--- User ---\n\n') 186 | text = text.replace('⮜assistant⮞', '\n\n--- Assistant ---\n\n') 187 | text = text.replace('⮜end-of-text⮞', '\n\n---------- END OF TEXT ----------\n\n') 188 | 189 | return text 190 | -------------------------------------------------------------------------------- /dimgpt/data/datasets/__init__.py: -------------------------------------------------------------------------------- 1 | from .dataset import Dataset -------------------------------------------------------------------------------- /dimgpt/data/datasets/dataset.py: -------------------------------------------------------------------------------- 1 | import os, random, pickle 2 | from tqdm import tqdm 3 | from abc import ABC 4 | import numpy as np 5 | import numpy.typing as npt 6 | from dimgpt.data.clean import * 7 | from dimgpt.data.tokenizer import Tokenizer 8 | 9 | 10 | class Dataset(ABC): 11 | 12 | def __init__(self) -> None: 13 | 14 | self.dataset = None 15 | self.training_part = '' 16 | self.name = '' 17 | self.size = {'train': 0, 'val': 0} 18 | self.multiplier = 1.0 19 | 20 | 21 | def get_document(self, i: int | None = None) -> str: 22 | 23 | if i is None: 24 | i = random.randint(0, len(self.dataset) - 1) 25 | 26 | return '⮜start-of-text⮞' + clean_string(self.dataset[i]['text']) + '⮜end-of-text⮞' 27 | 28 | 29 | def document_to_tokens(self, document: dict[str, str], tokenizer: Tokenizer) -> dict[str, npt.NDArray[np.uint16] | int]: 30 | 31 | tokens = [tokenizer.start_of_text_token, *tokenizer.encode(document['text']), tokenizer.end_of_text_token] 32 | 33 | return {'tokens': np.array(tokens, dtype = np.uint16), 'size': len(tokens)} 34 | 35 | 36 | def save(self, tokenizer: Tokenizer) -> None: 37 | 38 | if os.path.exists(os.path.join(DATA_DIR, self.training_part, self.name, f'train.bin')): 39 | return 40 | 41 | os.makedirs(os.path.join(DATA_DIR, self.training_part, self.name), exist_ok = True) 42 | 43 | split_dataset = self.dataset.train_test_split(test_size = PRETRAINING_VAL_RATIO, shuffle = True) 44 | split_dataset['val'] = split_dataset.pop('test') 45 | 46 | tokenized = split_dataset.map( 47 | lambda doc: self.document_to_tokens(doc, tokenizer), 48 | desc = f'Tokenizing {self.name}', 49 | num_proc = NUM_THREADS 50 | ) 51 | 52 | for split, documents in tokenized.items(): 53 | 54 | total = 0 55 | ids = [] 56 | 57 | for doc in tqdm(documents, desc = f'Saving {self.name} {split} ids'): 58 | 59 | ids.append({ 60 | 'start': total, 61 | 'size': doc['size'] 62 | }) 63 | 64 | total += doc['size'] 65 | 66 | with open(os.path.join(DATA_DIR, self.training_part, self.name, f'{split}_ids.pkl'), 'wb') as file: 67 | pickle.dump(ids, file) 68 | 69 | batch_size = 1_024 70 | 71 | while batch_size >= len(documents): 72 | batch_size //= 2 73 | 74 | self.size[split] = int(np.sum(documents['size'], dtype = np.uint64)) 75 | path = os.path.join(DATA_DIR, self.training_part, self.name, f'{split}.bin') 76 | file = np.memmap(path, dtype = np.uint16, mode = 'w+', shape = (self.size[split],)) 77 | i = 0 78 | 79 | for batch_i in tqdm(range(batch_size), desc = f'Saving {self.name} {split}'): 80 | 81 | batch = documents.shard(num_shards = batch_size, index = batch_i, contiguous = True).with_format('numpy') 82 | file_batch = np.concatenate(batch['tokens']) 83 | file[i:i + len(file_batch)] = file_batch 84 | i += len(file_batch) 85 | 86 | file.flush() 87 | 88 | with open(os.path.join(DATA_DIR, self.training_part, self.name, 
f'metadata.pkl'), 'wb') as file: 89 | pickle.dump({ 90 | 'training_part': self.training_part, 91 | 'name': self.name, 92 | 'size': self.size, 93 | 'multiplier': self.multiplier 94 | }, file) 95 | -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/__init__.py: -------------------------------------------------------------------------------- 1 | from .common_crawl import CommonCrawlDataset 2 | from .wikipedia import WikipediaDataset 3 | from .books import BooksDataset 4 | from .news import NewsDataset 5 | from .institutions import InstitutionsDataset 6 | from .others import OthersDataset -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/books.py: -------------------------------------------------------------------------------- 1 | import os, json 2 | from datasets import load_dataset, DownloadConfig, concatenate_datasets 3 | from dimgpt.data.datasets import Dataset 4 | from dimgpt.settings import * 5 | 6 | class BooksDataset(Dataset): 7 | 8 | def __init__(self) -> None: 9 | 10 | super().__init__() 11 | 12 | self.training_part = 'pretraining' 13 | self.name = 'books' 14 | self.multiplier = 0.2 15 | 16 | print('Downloading Books dataset...') 17 | 18 | if not os.path.exists(os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json')): 19 | 20 | dataset = load_dataset( 21 | path = 'PleIAs/French-PD-Books', 22 | split = 'train', 23 | download_config = DownloadConfig(max_retries = 10), 24 | streaming = True 25 | ) 26 | 27 | os.makedirs(os.path.join(DATA_DIR, self.training_part, self.name), exist_ok = True) 28 | 29 | with open(os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json'), 'w', encoding = 'utf-8') as file: 30 | 31 | file.truncate(0) 32 | i = 0 33 | self.size['train'] = 0 34 | 35 | for record in dataset: 36 | 37 | text = str(record['complete_text']).strip() 38 | 39 | if len(text) < MIN_DOCUMENT_SIZE: 40 | continue 41 | 42 | file.write(json.dumps({'text': text}, ensure_ascii = False) + '\n') 43 | 44 | self.size['train'] += len(text) 45 | i += 1 46 | 47 | if i % 1_000 == 0: 48 | print(f'{i:,} documents | {self.size["train"]:,} characters ', end = '\r') 49 | 50 | if self.size['train'] >= 10_000_000_000: 51 | break 52 | 53 | self.dataset = load_dataset( 54 | path = 'json', 55 | split = 'train', 56 | data_files = os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json'), 57 | num_proc = NUM_THREADS 58 | ) 59 | 60 | if self.size['train'] == 0: 61 | self.size['train'] = 10_000_000_000 62 | 63 | print(f'Books dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters') -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/common_crawl.py: -------------------------------------------------------------------------------- 1 | import os, json 2 | from datasets import load_dataset, DownloadConfig 3 | from dimgpt.data.datasets import Dataset 4 | from dimgpt.settings import * 5 | 6 | class CommonCrawlDataset(Dataset): 7 | 8 | def __init__(self) -> None: 9 | 10 | super().__init__() 11 | 12 | self.training_part = 'pretraining' 13 | self.name = 'common_crawl' 14 | self.multiplier = 1.0 15 | 16 | print('Downloading Common Crawl dataset...') 17 | 18 | if not os.path.exists(os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json')): 19 | 20 | dataset = load_dataset( 21 | path = 'ontocord/CulturaY', 22 | name = 'fr', 23 | split = 'train', 24 | download_config = 
DownloadConfig(max_retries = 10), 25 | streaming = True 26 | ) 27 | 28 | os.makedirs(os.path.join(DATA_DIR, self.training_part, self.name), exist_ok = True) 29 | 30 | with open(os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json'), 'w', encoding = 'utf-8') as file: 31 | 32 | file.truncate(0) 33 | i = 0 34 | self.size['train'] = 0 35 | 36 | for record in dataset: 37 | 38 | text = str(record['text']).strip() 39 | 40 | if len(text) < MIN_DOCUMENT_SIZE: 41 | continue 42 | 43 | file.write(json.dumps({'text': text}, ensure_ascii = False) + '\n') 44 | 45 | self.size['train'] += len(text) 46 | i += 1 47 | 48 | if i % 1_000 == 0: 49 | print(f'{i:,} documents | {self.size["train"]:,} characters ', end = '\r') 50 | 51 | if self.size['train'] >= 150_000_000_000: 52 | break 53 | 54 | self.dataset = load_dataset( 55 | path = 'json', 56 | split = 'train', 57 | data_files = os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json'), 58 | num_proc = NUM_THREADS 59 | ) 60 | 61 | if self.size['train'] == 0: 62 | self.size['train'] = 150_000_000_000 63 | 64 | print(f'Common Crawl dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters') 65 | -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/institutions.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, DownloadConfig, concatenate_datasets 2 | from dimgpt.data.datasets import Dataset 3 | from dimgpt.settings import * 4 | 5 | class InstitutionsDataset(Dataset): 6 | 7 | def __init__(self) -> None: 8 | 9 | super().__init__() 10 | 11 | self.training_part = 'pretraining' 12 | self.name = 'institutions' 13 | self.multiplier = 2.0 14 | 15 | print('Downloading Institutions dataset...') 16 | 17 | europarl = load_dataset( 18 | path = 'bigscience-data/roots_fr_the_pile_europarl', 19 | split = 'train', 20 | download_config = DownloadConfig(max_retries = 10) 21 | ) 22 | 23 | qr_an = load_dataset( 24 | path = 'cassandra-themis/QR-AN', 25 | name = 'qran_generation', 26 | split = 'train+validation+test', 27 | download_config = DownloadConfig(max_retries = 10) 28 | ) 29 | 30 | qr_an = qr_an.map( 31 | lambda doc: {'text': (str(doc['question']).strip() + '\n\n' + str(doc['answer']).strip()).strip()}, 32 | remove_columns = ['question', 'answer'], 33 | desc = 'Cleaning QR-AN', 34 | num_proc = NUM_THREADS 35 | ) 36 | 37 | self.dataset = concatenate_datasets([europarl, qr_an]) 38 | self.dataset = self.dataset.filter(lambda doc: len(str(doc['text']).strip()) >= MIN_DOCUMENT_SIZE) 39 | self.size['train'] = 0 40 | 41 | for doc in self.dataset: 42 | self.size['train'] += len(str(doc['text']).strip()) 43 | 44 | print(f'Institutions dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters') -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/news.py: -------------------------------------------------------------------------------- 1 | import re, json 2 | from datasets import load_dataset, DownloadConfig, concatenate_datasets 3 | from dimgpt.data.datasets import Dataset 4 | from dimgpt.settings import * 5 | 6 | class NewsDataset(Dataset): 7 | 8 | def __init__(self) -> None: 9 | 10 | super().__init__() 11 | 12 | self.training_part = 'pretraining' 13 | self.name = 'news' 14 | self.multiplier = 0.3 15 | 16 | print('Downloading News dataset...') 17 | 18 | news_fr = load_dataset( 19 | path = 'eckendoerffer/news_fr', 
20 | split = 'train+validation+test', 21 | download_config = DownloadConfig(max_retries = 10) 22 | ) 23 | 24 | news_fr = news_fr.map( 25 | lambda doc: {'text': self._clean_news_fr(doc['text'])}, 26 | desc = 'Cleaning news_fr', 27 | num_proc = NUM_THREADS 28 | ) 29 | 30 | wikinews = load_dataset( 31 | path = 'bigscience-data/roots_fr_wikinews', 32 | split = 'train', 33 | download_config = DownloadConfig(max_retries = 10) 34 | ) 35 | 36 | wikinews = wikinews.map( 37 | lambda doc: {'text': self._clean_wikinews(doc)}, 38 | remove_columns = ['meta'], 39 | desc = 'Cleaning wikinews', 40 | num_proc = NUM_THREADS 41 | ) 42 | 43 | cc_news = load_dataset( 44 | path = 'intfloat/multilingual_cc_news', 45 | name = 'fr', 46 | split = 'train', 47 | download_config = DownloadConfig(max_retries = 10) 48 | ) 49 | 50 | cc_news = cc_news.map( 51 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['maintext']).strip()).strip()}, 52 | remove_columns = ['title', 'maintext', 'url', 'date_publish'], 53 | desc = 'Cleaning cc_news', 54 | num_proc = NUM_THREADS 55 | ) 56 | 57 | xlsum = load_dataset( 58 | path = 'csebuetnlp/xlsum', 59 | name = 'french', 60 | split = 'train+validation+test', 61 | download_config = DownloadConfig(max_retries = 10) 62 | ) 63 | 64 | xlsum_summaries = xlsum.map( 65 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['summary']).strip()).strip()}, 66 | remove_columns = ['id', 'url', 'title', 'summary'], 67 | desc = 'Cleaning xlsum_summaries', 68 | num_proc = NUM_THREADS 69 | ) 70 | 71 | xlsum = xlsum.map( 72 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['text']).strip()).strip()}, 73 | remove_columns = ['id', 'url', 'title', 'summary'], 74 | desc = 'Cleaning xlsum', 75 | num_proc = NUM_THREADS 76 | ) 77 | 78 | mlsum = load_dataset( 79 | path = 'mlsum', 80 | name = 'fr', 81 | split = 'train+validation+test', 82 | download_config = DownloadConfig(max_retries = 10) 83 | ) 84 | 85 | mlsum_summaries = mlsum.map( 86 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['summary']).strip()).strip()}, 87 | remove_columns = ['summary', 'topic', 'url', 'title', 'date'], 88 | desc = 'Cleaning mlsum_summaries', 89 | num_proc = NUM_THREADS 90 | ) 91 | 92 | mlsum = mlsum.map( 93 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['text']).strip()).strip()}, 94 | remove_columns = ['summary', 'topic', 'url', 'title', 'date'], 95 | desc = 'Cleaning mlsum', 96 | num_proc = NUM_THREADS 97 | ) 98 | 99 | orange_sum = load_dataset( 100 | path = 'orange_sum', 101 | name = 'title', 102 | split = 'train+validation+test', 103 | download_config = DownloadConfig(max_retries = 10) 104 | ) 105 | 106 | orange_sum = orange_sum.map( 107 | lambda doc: {'text': (str(doc['summary']).strip() + '\n\n' + str(doc['text']).strip()).strip()}, 108 | remove_columns = ['summary'], 109 | desc = 'Cleaning orange_sum', 110 | num_proc = NUM_THREADS 111 | ) 112 | 113 | covid_news = load_dataset( 114 | path = 'gustavecortal/fr_covid_news', 115 | split = 'train', 116 | download_config = DownloadConfig(max_retries = 10) 117 | ) 118 | 119 | covid_news = covid_news.map( 120 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['text']).strip()).strip()}, 121 | remove_columns = ['title', 'description', 'domain', 'url', 'labels'], 122 | desc = 'Cleaning covid_news', 123 | num_proc = NUM_THREADS 124 | ) 125 | 126 | self.dataset = concatenate_datasets([news_fr, wikinews, cc_news, xlsum, xlsum_summaries, mlsum, mlsum_summaries, orange_sum, 
covid_news]) 127 | self.dataset = self.dataset.filter(lambda doc: len(str(doc['text']).strip()) >= MIN_DOCUMENT_SIZE) 128 | self.size['train'] = 0 129 | 130 | for doc in self.dataset: 131 | self.size['train'] += len(str(doc['text']).strip()) 132 | 133 | print(f'News dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters') 134 | 135 | 136 | def _clean_news_fr(self, text: str) -> str: 137 | 138 | text = text.replace(' ,', ',') 139 | text = text.replace(' .', '.') 140 | text = text.replace(' )', ')') 141 | text = text.replace('( ', '(') 142 | text = text.replace(' ]', ']') 143 | text = text.replace('[ ', '[') 144 | 145 | text = re.sub(r'(\d)\s*,\s*(\d)', r'\1,\2', text) 146 | 147 | array = list(text) 148 | start = True 149 | 150 | for i in range(len(array)): 151 | if array[i] == '"': 152 | array[i] = '«' if start else '»' 153 | start = not start 154 | 155 | return ''.join(array) 156 | 157 | 158 | def _clean_wikinews(self, document) -> str: 159 | 160 | meta = str(document['meta']).strip() 161 | start = meta.find(", 'title': ") + 12 162 | end = meta.find(", 'type':") - 1 163 | 164 | if start != 11 and end != -2: 165 | title = meta[start:end].strip() 166 | else: 167 | title = '' 168 | 169 | text = str(document['text']).strip() 170 | 171 | if len(text) < 32: 172 | return text 173 | 174 | index = text[:30].find('–') 175 | 176 | if index != -1: 177 | text = text[index + 1:] 178 | 179 | output = title + '\n\n' + text.strip() 180 | 181 | return output.strip() -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/others.py: -------------------------------------------------------------------------------- 1 | import re 2 | from datasets import load_dataset, DownloadConfig, concatenate_datasets 3 | from dimgpt.data.datasets import Dataset 4 | from dimgpt.settings import * 5 | 6 | class OthersDataset(Dataset): 7 | 8 | def __init__(self) -> None: 9 | 10 | super().__init__() 11 | 12 | self.training_part = 'pretraining' 13 | self.name = 'others' 14 | self.multiplier = 2.0 15 | 16 | print('Downloading Others dataset...') 17 | 18 | ted_talks = load_dataset( 19 | path = 'bigscience-data/roots_fr_ted_talks_iwslt', 20 | split = 'train', 21 | download_config = DownloadConfig(max_retries = 10) 22 | ) 23 | 24 | ted_talks = ted_talks.remove_columns('meta') 25 | 26 | bloom_lm = load_dataset( 27 | path = 'sil-ai/bloom-lm', 28 | name = 'fra', 29 | split = 'train+validation+test', 30 | download_config = DownloadConfig(max_retries = 10) 31 | ) 32 | 33 | bloom_lm = bloom_lm.remove_columns(['title', 'license', 'pageCount', 'bookInstanceId', 'bookLineage']) 34 | 35 | self.dataset = concatenate_datasets([ted_talks, bloom_lm]) 36 | self.dataset = self.dataset.filter(lambda doc: len(str(doc['text']).strip()) >= MIN_DOCUMENT_SIZE) 37 | self.size['train'] = 0 38 | 39 | for doc in self.dataset: 40 | self.size['train'] += len(str(doc['text']).strip()) 41 | 42 | print(f'Others dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters') 43 | -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/wikipedia.py: -------------------------------------------------------------------------------- 1 | import re 2 | from datasets import load_dataset, DownloadConfig, concatenate_datasets 3 | from dimgpt.data.datasets import Dataset 4 | from dimgpt.settings import * 5 | 6 | class WikipediaDataset(Dataset): 7 | 8 | def __init__(self) -> None: 9 | 10 | 
super().__init__() 11 | 12 | self.training_part = 'pretraining' 13 | self.name = 'wikipedia' 14 | self.multiplier = 4.0 15 | 16 | print('Downloading Wikipedia dataset...') 17 | 18 | wikipedia_fr = load_dataset( 19 | path = 'eckendoerffer/wikipedia_fr', 20 | split = 'train+validation+test', 21 | download_config = DownloadConfig(max_retries = 10) 22 | ) 23 | 24 | wikipedia_fr = wikipedia_fr.map( 25 | lambda doc: {'text': self._clean_wikipedia_fr(doc['text'])}, 26 | desc = 'Cleaning wikipedia_fr', 27 | num_proc = NUM_THREADS 28 | ) 29 | 30 | roots_fr_wikipedia = load_dataset( 31 | path = 'bigscience-data/roots_fr_wikipedia', 32 | split = 'train', 33 | download_config = DownloadConfig(max_retries = 10) 34 | ) 35 | 36 | roots_fr_wikipedia = roots_fr_wikipedia.remove_columns('meta') 37 | 38 | roots_fr_wikivoyage = load_dataset( 39 | path = 'bigscience-data/roots_fr_wikivoyage', 40 | split = 'train', 41 | download_config = DownloadConfig(max_retries = 10) 42 | ) 43 | 44 | roots_fr_wikivoyage = roots_fr_wikivoyage.remove_columns('meta') 45 | 46 | self.dataset = concatenate_datasets([wikipedia_fr, roots_fr_wikipedia, roots_fr_wikivoyage]) 47 | self.dataset = self.dataset.filter(lambda doc: len(str(doc['text']).strip()) >= MIN_DOCUMENT_SIZE) 48 | self.size['train'] = 0 49 | 50 | for doc in self.dataset: 51 | self.size['train'] += len(str(doc['text']).strip()) 52 | 53 | print(f'Wikipedia dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters') 54 | 55 | 56 | def _clean_wikipedia_fr(self, text: str) -> str: 57 | 58 | text = text.replace(' ,', ',') 59 | text = text.replace(' .', '.') 60 | text = text.replace(' )', ')') 61 | text = text.replace('( ', '(') 62 | text = text.replace(' ]', ']') 63 | text = text.replace('[ ', '[') 64 | 65 | text = re.sub(r'(\d)\s*,\s*(\d)', r'\1,\2', text) 66 | 67 | array = list(text) 68 | start = True 69 | 70 | for i in range(len(array)): 71 | if array[i] == '"': 72 | array[i] = '«' if start else '»' 73 | start = not start 74 | 75 | return ''.join(array) -------------------------------------------------------------------------------- /dimgpt/data/finetuning.py: -------------------------------------------------------------------------------- 1 | import os, pickle 2 | from datasets import load_dataset, DownloadConfig 3 | from tqdm import tqdm 4 | 5 | from dimgpt.settings import * 6 | from dimgpt.data.tokenizer import Tokenizer 7 | 8 | 9 | class Finetuning: 10 | 11 | def __init__(self): 12 | 13 | self.import_dataset() 14 | 15 | 16 | def import_dataset(self) -> None: 17 | 18 | if os.path.exists(os.path.join(DATA_DIR, 'finetuning', 'chatbot_conversations_train.pkl')): 19 | return 20 | 21 | self.datasets = {} 22 | 23 | for name in ['human_conversations', 'chatbot_conversations', 'dimension_gpt_conversations', 'human_preprompts', 'chatbot_preprompts', 'dimension_gpt_preprompts']: 24 | 25 | self.datasets[name] = load_dataset( 26 | path = 'angeluriot/DimensionGPT_instruct', 27 | name = name, 28 | download_config = DownloadConfig(max_retries = 10), 29 | num_proc = NUM_THREADS 30 | ) 31 | 32 | 33 | def document_to_tokens(self, document: dict[str, str], tokenizer: Tokenizer, preprompts: bool) -> dict[str, list[int] | int]: 34 | 35 | if preprompts: 36 | 37 | tokens = [tokenizer.system_token, *tokenizer.encode(document['preprompt'])] 38 | 39 | return {'tokens': tokens, 'size': len(tokens)} 40 | 41 | tokens = [] 42 | 43 | for msg in document['conversation']: 44 | 45 | if msg['role'] == 'user': 46 | tokens.append(tokenizer.user_token) 47 | elif msg['role'] 
== 'assistant': 48 | tokens.append(tokenizer.assistant_token) 49 | else: 50 | tokens.append(tokenizer.human_token) 51 | 52 | tokens.extend(tokenizer.encode(msg['text'])) 53 | 54 | return {'tokens': tokens, 'size': len(tokens)} 55 | 56 | 57 | def save(self, tokenizer: Tokenizer) -> None: 58 | 59 | if os.path.exists(os.path.join(DATA_DIR, 'finetuning', 'chatbot_conversations_train.pkl')): 60 | return 61 | 62 | if not os.path.exists(os.path.join(DATA_DIR, 'finetuning')): 63 | os.makedirs(os.path.join(DATA_DIR, 'finetuning')) 64 | 65 | for name, dataset in self.datasets.items(): 66 | 67 | if name == 'chatbot_conversations': 68 | dataset = dataset['train'].train_test_split(test_size = FINETUNING_VAL_RATIO, shuffle = True) 69 | dataset['val'] = dataset.pop('test') 70 | 71 | tokenized = dataset.map( 72 | lambda doc: self.document_to_tokens(doc, tokenizer, name.endswith('preprompts')), 73 | desc = f'Tokenizing {name}', 74 | num_proc = NUM_THREADS 75 | ) 76 | 77 | for split, documents in tokenized.items(): 78 | 79 | docs = [] 80 | 81 | for doc in tqdm(documents, desc = f'Saving finetuning dataset {name}_{split}'): 82 | docs.append(doc['tokens']) 83 | 84 | with open(os.path.join(DATA_DIR, 'finetuning', f'{name}_{split}.pkl'), 'wb') as file: 85 | pickle.dump(docs, file) 86 | 87 | -------------------------------------------------------------------------------- /dimgpt/data/pretokenizer.py: -------------------------------------------------------------------------------- 1 | import regex 2 | 3 | from tokenizers import * 4 | from dimgpt.settings import * 5 | from dimgpt.utils import * 6 | 7 | 8 | def split(text: str) -> list[str]: 9 | 10 | if text == '': 11 | return [] 12 | 13 | # Split in words 14 | 15 | safe_control_tokens = [regex.escape(c) for c in CONTROL_TOKENS] 16 | reg = r'(' + r'|'.join(safe_control_tokens) + r'|\d+|\s+|\p{L}+|[^\d\p{L}\s' + r''.join([f'[{i}]' for i in safe_control_tokens]) + r']+)' 17 | words = regex.split(reg, text, flags = regex.UNICODE, concurrent = False) 18 | words = list(filter(None, words)) 19 | 20 | # Add beginning spaces 21 | 22 | temp = [] 23 | i = 0 24 | 25 | while i < len(words) - 1: 26 | 27 | if words[i] == ' ' and words[i + 1] not in CONTROL_TOKENS: 28 | temp.append(' ' + words[i + 1]) 29 | i += 2 30 | continue 31 | 32 | if words[i].endswith(' ') and words[i + 1] not in CONTROL_TOKENS: 33 | temp.extend([words[i][:-1], ' ' + words[i + 1]]) 34 | i += 2 35 | continue 36 | 37 | temp.append(words[i]) 38 | i += 1 39 | 40 | if i == len(words) - 1: 41 | temp.append(words[-1]) 42 | 43 | words = temp 44 | words = list(filter(None, words)) 45 | 46 | return words 47 | 48 | 49 | class PreTokenizer: 50 | 51 | def split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]: 52 | 53 | print('Pretokenize...') 54 | 55 | words = split(str(normalized_string)) 56 | words = [NormalizedString(word) for word in words] 57 | 58 | print('Nb words:', '{:,.0f}'.format(len(words))) 59 | print('Merges...') 60 | 61 | return words 62 | 63 | 64 | def pre_tokenize(self, pretok: PreTokenizedString) -> None: 65 | 66 | pretok.split(self.split) 67 | -------------------------------------------------------------------------------- /dimgpt/data/pretraining.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | from tqdm import tqdm 4 | 5 | from dimgpt.settings import * 6 | from dimgpt.data.clean import * 7 | from dimgpt.data.tokenizer import Tokenizer 8 | from dimgpt.data.datasets.pretraining import * 9 | from 
dimgpt.data.datasets import Dataset 10 | 11 | 12 | class Pretraining: 13 | 14 | def __init__(self): 15 | 16 | self.datasets: list[Dataset] = [CommonCrawlDataset(), WikipediaDataset(), BooksDataset(), NewsDataset(), InstitutionsDataset(), OthersDataset()] 17 | 18 | 19 | def get_document(self) -> str: 20 | 21 | probabilities = np.array([dataset.size['train'] * dataset.multiplier for dataset in self.datasets]) 22 | probabilities /= np.sum(probabilities) 23 | 24 | dataset = np.random.choice(self.datasets, p = probabilities) 25 | 26 | return dataset.get_document() 27 | 28 | 29 | def create_tokenizer_data(self, epsilon: float = 1e-8) -> tuple[list[int], list[str]]: 30 | 31 | if os.path.exists(os.path.join(DATA_DIR, 'tokenizer_data.txt')): 32 | 33 | return [0] * len(self.datasets), [''] 34 | 35 | target_ratios = np.array([dataset.size['train'] * dataset.multiplier for dataset in self.datasets]) 36 | target_ratios = (target_ratios / np.sum(target_ratios)).tolist() 37 | 38 | with open(os.path.join(DATA_DIR, 'tokenizer_data.txt'), 'w', encoding = 'utf-8') as file: 39 | 40 | file.truncate(0) 41 | chars = {} 42 | current_sizes = [0] * len(self.datasets) 43 | pbar = tqdm(total = TOKENIZER_DATA_SIZE) 44 | 45 | while True: 46 | 47 | current_ratios = [size / (sum(current_sizes) + epsilon) for size in current_sizes] 48 | ratio_errors = [target_ratios[i] - current_ratios[i] for i in range(len(self.datasets))] 49 | dataset_index = np.argmax(ratio_errors) 50 | dataset = self.datasets[dataset_index] 51 | 52 | document = dataset.get_document() 53 | 54 | if len(document) == 0: 55 | continue 56 | 57 | file.write(document) 58 | current_sizes[dataset_index] += len(document) 59 | 60 | for char in document: 61 | chars[char] = chars.get(char, 0) + 1 62 | 63 | pbar.update(len(document)) 64 | 65 | if sum(current_sizes) >= TOKENIZER_DATA_SIZE: 66 | break 67 | 68 | document = ' ' + ' '.join(list(POSSIBLE_CHARS)) 69 | file.write(document) 70 | 71 | for char in document: 72 | chars[char] = chars.get(char, 0) + 1 73 | 74 | pbar.close() 75 | 76 | chars = sorted(chars.items(), key = lambda item: item[1], reverse = True) 77 | chars = [char for char, _ in chars] 78 | 79 | return current_sizes, chars 80 | 81 | 82 | def save(self, tokenizer: Tokenizer) -> None: 83 | 84 | for dataset in self.datasets: 85 | dataset.save(tokenizer) 86 | 87 | 88 | def summary(self) -> None: 89 | 90 | for dataset in self.datasets: 91 | print(f'{dataset.name}: {len(dataset.dataset):,} documents | {dataset.size["train"]:,} characters | {dataset.multiplier:.1f}x') -------------------------------------------------------------------------------- /dimgpt/data/tokenizer.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import numpy.typing as npt 4 | import tokenizers as tk 5 | from tokenizers.models import BPE 6 | from tokenizers.trainers import BpeTrainer 7 | from tokenizers.pre_tokenizers import PreTokenizer 8 | from tqdm import tqdm 9 | 10 | from dimgpt.data.clean import * 11 | from dimgpt.utils import * 12 | import dimgpt.data.pretokenizer as pretk 13 | from dimgpt.settings import * 14 | 15 | class Tokenizer: 16 | 17 | def __init__(self): 18 | 19 | self.vocab: list[str] = [] 20 | self.to_index: dict[str, int] = {} 21 | self.to_token: dict[int, str] = {} 22 | 23 | if os.path.exists(os.path.join(DATA_DIR, 'vocab.txt')): 24 | self.load_from_vocab(load_text_array(os.path.join(DATA_DIR, 'vocab.txt'))) 25 | else: 26 | self.create(os.path.join(DATA_DIR, 'tokenizer_data.txt')) 27 | 
save_text_array(self.vocab, os.path.join(DATA_DIR, 'vocab.txt')) 28 | 29 | 30 | def _set_control_tokens(self) -> None: 31 | 32 | self.unknown_token = self.to_index['⮜unknown⮞'] 33 | self.padding_token = self.to_index['⮜padding⮞'] 34 | self.start_of_text_token = self.to_index['⮜start-of-text⮞'] 35 | self.tab_token = self.to_index['⮜tab⮞'] 36 | self.new_line_token = self.to_index['⮜new-line⮞'] 37 | self.human_token = self.to_index['⮜human⮞'] 38 | self.system_token = self.to_index['⮜system⮞'] 39 | self.user_token = self.to_index['⮜user⮞'] 40 | self.assistant_token = self.to_index['⮜assistant⮞'] 41 | self.end_of_text_token = self.to_index['⮜end-of-text⮞'] 42 | 43 | 44 | def load_from_vocab(self, vocab: list[str]) -> None: 45 | 46 | self.vocab = vocab.copy() 47 | self.to_index = {v: i for i, v in enumerate(self.vocab)} 48 | self.to_token = {i: v for i, v in enumerate(self.vocab)} 49 | self._set_control_tokens() 50 | 51 | 52 | def create(self, data_path: str) -> None: 53 | 54 | self._create_vocab(data_path) 55 | dataset = open(data_path, 'r', encoding = 'utf-8').read() 56 | self._sort_vocab(dataset) 57 | self._set_control_tokens() 58 | 59 | 60 | def _create_vocab(self, data_path: str) -> None: 61 | 62 | print('Creating vocab...') 63 | 64 | tokenizer = tk.Tokenizer(BPE(unk_token = '⮜unknown⮞')) 65 | tokenizer.pre_tokenizer = PreTokenizer.custom(pretk.PreTokenizer()) 66 | 67 | trainer = BpeTrainer( 68 | vocab_size = int(VOCAB_SIZE * 1.1), 69 | show_progress = True, 70 | special_tokens = CONTROL_TOKENS 71 | ) 72 | 73 | tokenizer.train([data_path], trainer) 74 | 75 | self.vocab = list(tokenizer.get_vocab().keys()) 76 | vocab_size = len(self.vocab) 77 | 78 | def is_valid(word: str) -> bool: 79 | 80 | if len(word) > MAX_TOKEN_LENGTH: 81 | return False 82 | 83 | if word.endswith(' ') and len(word) > 4: 84 | return False 85 | 86 | if any(c not in POSSIBLE_CHARS for c in word): 87 | return False 88 | 89 | nb_digits = 0 90 | 91 | for char in word: 92 | if char.isdigit(): 93 | nb_digits += 1 94 | 95 | return nb_digits < 2 96 | 97 | self.vocab = list(filter(lambda v: is_valid(v), self.vocab)) 98 | 99 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({vocab_size - len(self.vocab):,} invalid tokens removed)') 100 | vocab_size = len(self.vocab) 101 | 102 | for i in range(10): 103 | if str(i) not in self.vocab: 104 | self.vocab.append(str(i)) 105 | if ' ' + str(i) not in self.vocab: 106 | self.vocab.append(' ' + str(i)) 107 | 108 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({len(self.vocab) - vocab_size:,} number tokens added)') 109 | vocab_size = len(self.vocab) 110 | 111 | for token in FORCED_TOKENS: 112 | if token not in self.vocab: 113 | self.vocab.append(token) 114 | 115 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({len(self.vocab) - vocab_size:,} forced tokens added)') 116 | vocab_size = len(self.vocab) 117 | 118 | self.vocab = CONTROL_TOKENS + self.vocab 119 | 120 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({len(self.vocab) - vocab_size:,} control tokens added)') 121 | 122 | self.to_index = {v: i for i, v in enumerate(self.vocab)} 123 | self.to_token = {i: v for i, v in enumerate(self.vocab)} 124 | 125 | 126 | def _sort_vocab(self, dataset: str) -> None: 127 | 128 | print('Pretokenize...') 129 | data = pretk.split(dataset) 130 | 131 | print('Sorting vocab...') 132 | vocab = {v: 0 for v in self.vocab} 133 | nb_tokens = 0 134 | total_tokens_length = 0 135 | 136 | for i in tqdm(range(len(data))): 137 | 138 | if data[i] in self.to_index: 139 | 
vocab[data[i]] += 1 140 | nb_tokens += 1 141 | total_tokens_length += len(data[i]) 142 | continue 143 | 144 | j = 0 145 | 146 | while j < len(data[i]): 147 | 148 | found = False 149 | 150 | for k in reversed(range(min(MAX_TOKEN_LENGTH, len(data[i]) - j))): 151 | 152 | word = data[i][j:j + k + 1] 153 | 154 | if word in self.to_index: 155 | vocab[word] += 1 156 | nb_tokens += 1 157 | total_tokens_length += len(word) 158 | j += k 159 | found = True 160 | break 161 | 162 | if not found: 163 | vocab['⮜unknown⮞'] += 1 164 | nb_tokens += 1 165 | total_tokens_length += 5 166 | 167 | j += 1 168 | 169 | self.vocab = list(sorted(vocab.items(), key = lambda x: x[1], reverse = True)) 170 | vocab_size = len(self.vocab) 171 | self.vocab = list(filter(lambda x: x[0] not in CONTROL_TOKENS, self.vocab)) 172 | 173 | while len(self.vocab) > VOCAB_SIZE - len(CONTROL_TOKENS): 174 | 175 | for i in range(len(self.vocab) - 1, -1, -1): 176 | 177 | if len(self.vocab[i][0]) > 1 and self.vocab[i][0] not in FORCED_TOKENS and not (self.vocab[i][0][-1].isdigit() and len(self.vocab[i][0]) <= 2): 178 | self.vocab.pop(i) 179 | break 180 | 181 | self.vocab = [v[0] for v in self.vocab] 182 | self.vocab = CONTROL_TOKENS + self.vocab 183 | 184 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({vocab_size - len(self.vocab):,} unused tokens removed)') 185 | 186 | self.to_index = {v: i for i, v in enumerate(self.vocab)} 187 | self.to_token = {i: v for i, v in enumerate(self.vocab)} 188 | 189 | print(f'Number of tokens: {nb_tokens:,}') 190 | print(f'Average token length: {total_tokens_length / nb_tokens:.2f}') 191 | 192 | 193 | def encode(self, text: str, clean_text: bool = True, keep_control_tokens: bool = False, verbose: bool = False) -> list[int]: 194 | 195 | if verbose: 196 | print('Pretokenize...') 197 | 198 | if clean_text: 199 | text = clean_string(text, keep_control_tokens) 200 | 201 | data = pretk.split(text) 202 | 203 | if verbose: 204 | print('Encoding dataset...') 205 | 206 | output = [] 207 | 208 | for i in tqdm(range(len(data)), disable = not verbose): 209 | 210 | if data[i] in self.to_index: 211 | output.append(self.to_index[data[i]]) 212 | continue 213 | 214 | j = 0 215 | 216 | while j < len(data[i]): 217 | 218 | found = False 219 | 220 | for k in reversed(range(min(MAX_TOKEN_LENGTH, len(data[i]) - j))): 221 | 222 | word = data[i][j:j + k + 1] 223 | 224 | if word in self.to_index: 225 | output.append(self.to_index[word]) 226 | j += k 227 | found = True 228 | break 229 | 230 | if not found: 231 | output.append(self.to_index['⮜unknown⮞']) 232 | 233 | j += 1 234 | 235 | return output 236 | 237 | 238 | def decode(self, tokens: list[int] | npt.NDArray[np.uint16] | torch.Tensor | int, keep_control_tokens: bool = False, 239 | token_array: bool = False) -> str | list[str]: 240 | 241 | if type(tokens) == int: 242 | tokens = [tokens] 243 | if type(tokens) == torch.Tensor: 244 | tokens = tokens.detach().to('cpu').tolist() 245 | elif type(tokens) != list: 246 | tokens = list(tokens) 247 | 248 | text = [] 249 | 250 | for t in tokens: 251 | 252 | if t < 0 or t >= len(self.vocab): 253 | continue 254 | 255 | text.append(unclean_string(self.to_token[t], keep_control_tokens)) 256 | 257 | if token_array: 258 | return text 259 | 260 | return ''.join(text) 261 | -------------------------------------------------------------------------------- /dimgpt/settings.py: -------------------------------------------------------------------------------- 1 | import os, torch 2 | from contextlib import nullcontext 3 | 4 | # ============== 
Dataset ============== # 5 | 6 | DATA_DIR = 'data' 7 | OUTPUT_DIR = 'output' 8 | NUM_THREADS = 16 9 | 10 | TOKENIZER_DATA_SIZE = 300_000_000 11 | MIN_DOCUMENT_SIZE = 64 12 | PRETRAINING_VAL_RATIO = 0.001 13 | MAX_TOKEN_LENGTH = 16 14 | CONTROL_TOKENS = ['⮜unknown⮞', '⮜padding⮞', '⮜start-of-text⮞', '⮜tab⮞', '⮜new-line⮞', '⮜human⮞', '⮜system⮞', '⮜user⮞', '⮜assistant⮞', '⮜end-of-text⮞'] 15 | PADDING_TOKEN = 1 16 | FORCED_TOKENS = ['Dimension', ' Dimension', 'GPT', ' GPT', 'IA', ' IA', 'Generative', ' Generative', 'Pre', ' Pre', 'trained', ' trained', 'Transformer', ' Transformer'] 17 | 18 | FINETUNING_VAL_RATIO = 0.01 19 | 20 | SPLIT_RATIOS = [ 21 | 0.099, # human 22 | 0.9, # chatbot 23 | 0.001 # DimensionGPT 24 | ] 25 | 26 | HUMAN_PREPROMPT_RATIOS = [ 27 | 0.3, # human 28 | 0.0, # chatbot 29 | 0.0, # DimensionGPT 30 | 0.7 # None 31 | ] 32 | 33 | CHATBOT_PREPROMPT_RATIOS = [ 34 | 0.0, # human 35 | 0.5, # chatbot 36 | 0.4, # DimensionGPT 37 | 0.1 # None 38 | ] 39 | 40 | DIMENSION_GPT_PREPROMPT_RATIOS = [ 41 | 0.0, # human 42 | 0.0, # chatbot 43 | 1.0, # DimensionGPT 44 | 0.0 # None 45 | ] 46 | 47 | INSTRUCTION_LOSS_STRENGTH = 0.1 48 | PREPROMPT = "Une discussion entre un utilisateur et DimensionGPT, un modèle de langage conversationnel français créé par le développeur indépendant Dimension et basé sur l'architecture GPT." 49 | 50 | # =============== Model =============== # 51 | 52 | VOCAB_SIZE = 32_000 53 | MAX_CONTEXT = 512 54 | WINDOW_SIZE = 256 55 | EMBEDDING_DIM = 1024 56 | NUM_GROUPED_HEADS = 4 57 | NUM_HEADS = 16 58 | HEAD_DIM = EMBEDDING_DIM // NUM_HEADS 59 | FFN_DIM = int((2.0 / 3.0) * 4 * EMBEDDING_DIM) 60 | NUM_BLOCKS = 16 61 | DROPOUT = 0 62 | INIT_STDDEV = 0.02 63 | ROPE_THETA = 10000.0 64 | 65 | # ============= Training ============== # 66 | 67 | BATCH_SIZE = 16 68 | NUM_ACCUMULATIONS = 64 69 | 70 | MAX_LEARNING_RATE = 6e-4 71 | MIN_LEARNING_RATE = 6e-5 72 | WARMUP_STEPS = 2_000 73 | DECAY_STEPS = 100_000 74 | 75 | BETA_1 = 0.9 76 | BETA_2 = 0.95 77 | EPSILON = 1e-5 78 | WEIGHT_DECAY = 0.1 79 | CLIP_GRADIENT = 1.0 80 | 81 | METRICS_BETA = 0.9 82 | VAL_INTERVAL = 50 83 | 84 | # ===================================== # 85 | 86 | GPU_ENABLED = torch.cuda.is_available() 87 | FLOAT16_ENABLED = GPU_ENABLED and torch.cuda.is_bf16_supported() 88 | DEVICE_NAME = 'cuda:0' if GPU_ENABLED else 'cpu' 89 | DEVICE = torch.device(DEVICE_NAME) 90 | CONTEXT = torch.autocast(device_type='cuda', dtype=torch.bfloat16) if FLOAT16_ENABLED else nullcontext() 91 | -------------------------------------------------------------------------------- /dimgpt/testing/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/dimgpt/testing/__init__.py -------------------------------------------------------------------------------- /dimgpt/testing/sampling.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | import numpy.typing as npt 4 | 5 | from dimgpt.training.model import Model 6 | from dimgpt.data.tokenizer import Tokenizer 7 | from dimgpt.settings import * 8 | 9 | 10 | class Sampler(): 11 | 12 | def __init__(self, model: Model, tokenizer: Tokenizer): 13 | 14 | self.model = model 15 | self.tokenizer = tokenizer 16 | self.preprompt = [self.tokenizer.system_token, *self.tokenizer.encode(PREPROMPT)] 17 | 18 | 19 | def get_probabilities(self, input: list[int]) -> npt.NDArray[np.float32]: 20 | 21 | with 
CONTEXT: 22 | model_input = torch.tensor([input], dtype = torch.long, device = DEVICE) 23 | model_output = self.model(model_input, only_last = True) 24 | 25 | probabilities = model_output[0].float().detach().to('cpu').numpy() 26 | probabilities = np.exp(probabilities) / np.sum(np.exp(probabilities)) 27 | 28 | return probabilities 29 | 30 | 31 | def sample(self, input: list[int], chatbot: bool, temperature: float = 1.0, top_p: float = 1.0, no_repeat_strength: float = 0.0) -> int: 32 | 33 | probabilities = np.log(self.get_probabilities(input)) 34 | proximity = MAX_CONTEXT 35 | 36 | for i in reversed(range(max(len(input) - MAX_CONTEXT, 0), len(input))): 37 | strength = no_repeat_strength * (proximity / MAX_CONTEXT) 38 | probabilities[input[i]] *= (1 + strength) 39 | proximity -= 1 40 | 41 | if temperature == 0.0: 42 | return np.argmax(probabilities) 43 | 44 | probabilities /= temperature 45 | probabilities = np.exp(probabilities) / np.sum(np.exp(probabilities)) 46 | 47 | if chatbot: 48 | probabilities[self.tokenizer.end_of_text_token] += probabilities[self.tokenizer.user_token] 49 | 50 | probabilities[self.tokenizer.unknown_token] = 0.0 51 | probabilities[self.tokenizer.padding_token] = 0.0 52 | probabilities[self.tokenizer.start_of_text_token] = 0.0 53 | probabilities[self.tokenizer.human_token] = 0.0 54 | probabilities[self.tokenizer.system_token] = 0.0 55 | probabilities[self.tokenizer.user_token] = 0.0 56 | probabilities[self.tokenizer.assistant_token] = 0.0 57 | 58 | probabilities /= np.sum(probabilities) 59 | 60 | sorted_indices = np.argsort(-probabilities) 61 | cumsum_probabilities = np.cumsum(probabilities[sorted_indices]) 62 | cutoff_index = np.searchsorted(cumsum_probabilities, max(top_p, cumsum_probabilities[0] + 1e-6)) 63 | temp = np.zeros_like(probabilities) 64 | temp[sorted_indices[:cutoff_index]] = probabilities[sorted_indices[:cutoff_index]] 65 | probabilities = temp / np.sum(temp) 66 | 67 | return np.random.choice(range(len(probabilities)), p = probabilities) 68 | 69 | 70 | def generate(self, input: str, max_length: int, chat_bot: bool = False, temperature: float = 1.0, 71 | top_p: float = 1.0, no_repeat: float = 0.0, verbose: bool = False, max_print_line_length = 0) -> str: 72 | 73 | self.model.eval() 74 | 75 | with torch.no_grad(): 76 | 77 | input = self.tokenizer.encode(input) 78 | 79 | if chat_bot: 80 | input = [self.tokenizer.start_of_text_token, *self.preprompt, self.tokenizer.user_token, *input, self.tokenizer.assistant_token] 81 | else: 82 | input = [self.tokenizer.start_of_text_token, *input] 83 | 84 | output = [] 85 | to_print = [] 86 | last_line_length = 0 87 | 88 | if not chat_bot: 89 | output = input[1:].copy() 90 | to_print = input[1:].copy() 91 | text = self.tokenizer.decode(to_print) 92 | last_line_length = len(text) - 1 - text.rfind('\n') 93 | 94 | for _ in range(max_length): 95 | 96 | index = self.sample(input, chat_bot, temperature, top_p, no_repeat) 97 | 98 | if index == self.tokenizer.end_of_text_token: 99 | break 100 | 101 | input.append(index) 102 | output.append(index) 103 | to_print.append(index) 104 | 105 | if verbose: 106 | 107 | text = self.tokenizer.decode(to_print) 108 | 109 | if '\n' in text: 110 | last_line_length = len(text) - 1 - text.rfind('\n') 111 | else: 112 | last_line_length += len(text) 113 | 114 | if max_print_line_length > 0 and last_line_length >= max_print_line_length and text.startswith(' '): 115 | print() 116 | text = text[1:] 117 | last_line_length = 0 118 | 119 | print(text, end = '') 120 | to_print = [] 121 | 122 | return 
self.tokenizer.decode(output) 123 | -------------------------------------------------------------------------------- /dimgpt/training/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/dimgpt/training/__init__.py -------------------------------------------------------------------------------- /dimgpt/training/datasets/__init__.py: -------------------------------------------------------------------------------- 1 | from .dataset import Dataset 2 | from .pretraining import PretrainingDataset 3 | from .finetuning import FinetuningDataset -------------------------------------------------------------------------------- /dimgpt/training/datasets/dataset.py: -------------------------------------------------------------------------------- 1 | from abc import ABC 2 | import torch 3 | 4 | from dimgpt.data.tokenizer import Tokenizer 5 | from dimgpt.settings import * 6 | 7 | 8 | class Dataset(ABC): 9 | 10 | def __init__(self, tokenizer: Tokenizer): 11 | 12 | self.tokenizer = tokenizer 13 | 14 | 15 | def train_size(self) -> int: 16 | 17 | pass 18 | 19 | 20 | def val_size(self) -> int: 21 | 22 | pass 23 | 24 | 25 | def _random_document(self, val: bool) -> tuple[list[int], list[int]]: 26 | 27 | pass 28 | 29 | 30 | def _get_tokens(self, val: bool) -> tuple[list[int], list[int]]: 31 | 32 | pass 33 | 34 | 35 | def _next(self, val: bool) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]: 36 | 37 | x = [] 38 | y = [] 39 | strengths = [] 40 | 41 | for _ in range(BATCH_SIZE): 42 | 43 | xy, strength = self._get_tokens(val) 44 | 45 | x.append(xy[0:MAX_CONTEXT]) 46 | y.append(xy[1:MAX_CONTEXT + 1]) 47 | strengths.append(strength[1:MAX_CONTEXT + 1]) 48 | 49 | x = torch.tensor(x, dtype = torch.long).pin_memory().to(DEVICE, non_blocking = True) 50 | y = torch.tensor(y, dtype = torch.long).pin_memory().to(DEVICE, non_blocking = True) 51 | strengths = torch.tensor(strengths, dtype = torch.float32).pin_memory().to(DEVICE, non_blocking = True) 52 | 53 | return x, y, strengths 54 | 55 | 56 | def next_train(self) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]: 57 | 58 | return self._next(False) 59 | 60 | 61 | def next_val(self) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]: 62 | 63 | return self._next(True) -------------------------------------------------------------------------------- /dimgpt/training/datasets/finetuning.py: -------------------------------------------------------------------------------- 1 | import os, random, pickle 2 | import torch 3 | import numpy as np 4 | 5 | from dimgpt.data.tokenizer import Tokenizer 6 | from dimgpt.settings import * 7 | from dimgpt.training.datasets import Dataset 8 | 9 | 10 | class FinetuningDataset(Dataset): 11 | 12 | def __init__(self, tokenizer: Tokenizer): 13 | 14 | self.tokenizer = tokenizer 15 | 16 | self.train_data = { 17 | 'human': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'human_conversations_train.pkl'), 'rb')), 18 | 'chatbot': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'chatbot_conversations_train.pkl'), 'rb')), 19 | 'dimension_gpt': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'dimension_gpt_conversations_train.pkl'), 'rb')) 20 | } 21 | 22 | self.val_data = pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'chatbot_conversations_val.pkl'), 'rb')) 23 | 24 | self.train_preprompts = { 25 | 'human': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'human_preprompts_train.pkl'), 
'rb')), 26 | 'chatbot': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'chatbot_preprompts_train.pkl'), 'rb')), 27 | 'dimension_gpt': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'dimension_gpt_preprompts_train.pkl'), 'rb')) 28 | } 29 | 30 | self.final_preprompt = [self.tokenizer.system_token, *self.tokenizer.encode(PREPROMPT)] 31 | 32 | self.preprompt_ratios = { 33 | 'human': HUMAN_PREPROMPT_RATIOS, 34 | 'chatbot': CHATBOT_PREPROMPT_RATIOS, 35 | 'dimension_gpt': DIMENSION_GPT_PREPROMPT_RATIOS 36 | } 37 | 38 | h = [len(i) for i in self.train_data['human']] 39 | c = [len(i) for i in self.train_data['chatbot']] 40 | d = [len(i) for i in self.train_data['dimension_gpt']] 41 | 42 | self.train_data_p = { 43 | 'human': (np.array(h) / np.sum(h)).tolist(), 44 | 'chatbot': (np.array(c) / np.sum(c)).tolist(), 45 | 'dimension_gpt': (np.array(d) / np.sum(d)).tolist() 46 | } 47 | 48 | print(sum(self.train_data_p['human'])) 49 | print(sum(self.train_data_p['chatbot'])) 50 | print(sum(self.train_data_p['dimension_gpt'])) 51 | 52 | v = [len(i) for i in self.val_data] 53 | 54 | self.val_data_p = (np.array(v) / np.sum(v)).tolist() 55 | 56 | self.train_ids = { 57 | 'human': list(range(len(self.train_data['human']))), 58 | 'chatbot': list(range(len(self.train_data['chatbot']))), 59 | 'dimension_gpt': list(range(len(self.train_data['dimension_gpt']))) 60 | } 61 | 62 | self.val_ids = list(range(len(self.val_data))) 63 | 64 | 65 | def train_size(self) -> int: 66 | 67 | return sum([sum([len(i) for i in self.train_data[key]]) for key in self.train_data]) 68 | 69 | 70 | def val_size(self) -> int: 71 | 72 | return sum([len(i) for i in self.val_data]) 73 | 74 | 75 | def __get_strength(self, doc: list[int], val: bool) -> list[int]: 76 | 77 | assistant = False 78 | instruction_loss_strength = 0.0 if val else INSTRUCTION_LOSS_STRENGTH 79 | strength = [] 80 | 81 | for token in doc: 82 | 83 | strength.append(1.0 if assistant else instruction_loss_strength) 84 | 85 | if token == self.tokenizer.user_token or token == self.tokenizer.end_of_text_token: 86 | assistant = False 87 | 88 | if token == self.tokenizer.assistant_token or token == self.tokenizer.human_token: 89 | assistant = True 90 | 91 | return strength 92 | 93 | 94 | def __get_document(self, val: bool, first: bool) -> tuple[list[int], list[int]]: 95 | 96 | if val: 97 | data_ids = self.val_ids 98 | data = self.val_data 99 | data_p = self.val_data_p 100 | 101 | else: 102 | data_split = np.random.choice(['human', 'chatbot', 'dimension_gpt'], p = SPLIT_RATIOS) 103 | data_ids = self.train_ids[data_split] 104 | data = self.train_data[data_split] 105 | data_p = self.train_data_p[data_split] 106 | 107 | if first: 108 | id = np.random.choice(data_ids, p = data_p) 109 | conversation = data[id] 110 | else: 111 | conversation = data[random.randint(0, len(data) - 1)] 112 | 113 | if val: 114 | xy = [self.tokenizer.start_of_text_token, *self.final_preprompt, *conversation, self.tokenizer.end_of_text_token] 115 | strength = self.__get_strength(xy, val) 116 | return xy, strength 117 | 118 | preprompt_ratio = self.preprompt_ratios[data_split] 119 | preprompt_split = np.random.choice(['human', 'chatbot', 'dimension_gpt', 'none'], p = preprompt_ratio) 120 | 121 | if preprompt_split != 'none': 122 | preprompt = self.train_preprompts[preprompt_split][random.randint(0, len(self.train_preprompts[preprompt_split]) - 1)] 123 | conversation = [*preprompt, *conversation] 124 | 125 | xy = [self.tokenizer.start_of_text_token, *conversation, self.tokenizer.end_of_text_token] 126 
| strength = self.__get_strength(xy, val) 127 | 128 | return xy, strength 129 | 130 | 131 | def _get_random_document(self, val: bool) -> tuple[list[int], list[int]]: 132 | 133 | xy, strength = self.__get_document(val, False) 134 | 135 | return xy, strength 136 | 137 | 138 | def _get_tokens(self, val: bool) -> tuple[torch.Tensor, torch.Tensor]: 139 | 140 | xy, strength = self.__get_document(val, True) 141 | 142 | i = random.randint(0, len(xy) - 1) 143 | xy = xy[i:] 144 | strength = strength[i:] 145 | 146 | while len(xy) < MAX_CONTEXT + 1: 147 | 148 | _xy, _strength = self._get_random_document(val) 149 | 150 | xy.extend(_xy) 151 | strength.extend(_strength) 152 | 153 | xy = xy[0:MAX_CONTEXT + 1] 154 | strength = strength[0:MAX_CONTEXT + 1] 155 | 156 | return xy, strength -------------------------------------------------------------------------------- /dimgpt/training/datasets/pretraining.py: -------------------------------------------------------------------------------- 1 | import os, random, pickle 2 | import torch 3 | import numpy as np 4 | 5 | from dimgpt.data.tokenizer import Tokenizer 6 | from dimgpt.settings import * 7 | from dimgpt.training.datasets import Dataset 8 | 9 | 10 | class PretrainingDataset(Dataset): 11 | 12 | def __init__(self, tokenizer: Tokenizer): 13 | 14 | super().__init__(tokenizer) 15 | 16 | datasets = os.listdir(os.path.join(DATA_DIR, 'pretraining')) 17 | self.datasets = [] 18 | 19 | for dataset in datasets: 20 | 21 | if not os.path.isdir(os.path.join(DATA_DIR, 'pretraining', dataset)): 22 | continue 23 | 24 | meta = pickle.load(open(os.path.join(DATA_DIR, 'pretraining', dataset, f'metadata.pkl'), 'rb')) 25 | 26 | self.datasets.append({ 27 | 'train': { 28 | 'data': np.memmap(os.path.join(DATA_DIR, 'pretraining', dataset, f'train.bin'), dtype = np.uint16, mode = 'r'), 29 | 'ids': pickle.load(open(os.path.join(DATA_DIR, 'pretraining', dataset, f'train_ids.pkl'), 'rb')), 30 | 'size': meta['size']['train'] 31 | }, 32 | 'val': { 33 | 'data': np.memmap(os.path.join(DATA_DIR, 'pretraining', dataset, f'val.bin'), dtype = np.uint16, mode = 'r'), 34 | 'ids': pickle.load(open(os.path.join(DATA_DIR, 'pretraining', dataset, f'val_ids.pkl'), 'rb')), 35 | 'size': meta['size']['val'] 36 | }, 37 | 'training_part': meta['training_part'], 38 | 'name': meta['name'], 39 | 'multiplier': meta['multiplier'] 40 | }) 41 | 42 | self.probas = [dataset['train']['size'] * dataset['multiplier'] for dataset in self.datasets] 43 | self.probas = (np.array(self.probas) / np.sum(self.probas)).tolist() 44 | 45 | 46 | def train_size(self) -> int: 47 | 48 | return sum([dataset['train']['size'] for dataset in self.datasets]) 49 | 50 | 51 | def val_size(self) -> int: 52 | 53 | return sum([dataset['val']['size'] for dataset in self.datasets]) 54 | 55 | 56 | def _get_random_document(self, val: bool) -> tuple[list[int], list[int]]: 57 | 58 | dataset = np.random.choice(self.datasets, p = self.probas) 59 | ids = dataset['val']['ids'] if val else dataset['train']['ids'] 60 | data = dataset['val']['data'] if val else dataset['train']['data'] 61 | 62 | i = random.randint(0, len(ids) - 1) 63 | xy = data[ids[i]['start']:ids[i]['start'] + ids[i]['size']] 64 | strength = [1.0] * ids[i]['size'] 65 | 66 | return xy, strength 67 | 68 | 69 | def _get_tokens(self, val: bool) -> tuple[torch.Tensor, torch.Tensor]: 70 | 71 | dataset = np.random.choice(self.datasets, p = self.probas) 72 | data = dataset['val']['data'] if val else dataset['train']['data'] 73 | 74 | start = random.randint(0, len(data) - 1 - (MAX_CONTEXT + 1)) 
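# Copy up to MAX_CONTEXT + 1 tokens starting at the random offset, stopping early at an end-of-text token;
# if the window ends early, it is topped up with extra random documents and truncated to MAX_CONTEXT + 1.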
75 | xy = [] 76 | 77 | for i in range(MAX_CONTEXT + 1): 78 | 79 | token = data[start + i] 80 | xy.append(token) 81 | 82 | if token == self.tokenizer.end_of_text_token: 83 | break 84 | 85 | strength = [1.0] * len(xy) 86 | 87 | while len(xy) < MAX_CONTEXT + 1: 88 | 89 | _xy, _strength = self._get_random_document(val) 90 | 91 | xy.extend(_xy) 92 | strength.extend(_strength) 93 | 94 | xy = xy[0:MAX_CONTEXT + 1] 95 | strength = strength[0:MAX_CONTEXT + 1] 96 | 97 | return xy, strength -------------------------------------------------------------------------------- /dimgpt/training/layers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from torch import nn 4 | 5 | from dimgpt.settings import * 6 | 7 | 8 | # Base class for all layers 9 | class Module(nn.Module): 10 | 11 | # Give the number of parameters of the module 12 | def nb_parameters(self) -> int: 13 | 14 | return sum([np.prod(p.size(), dtype = np.int32) for p in self.parameters()]) 15 | 16 | 17 | # Give the number of trainable parameters of the module 18 | def nb_trainable_parameters(self) -> int: 19 | 20 | return sum([np.prod(p.size(), dtype = np.int32) for p in self.parameters() if p.requires_grad]) 21 | 22 | 23 | # Give the number of non-trainable parameters of the module 24 | def nb_non_trainable_parameters(self) -> int: 25 | 26 | return sum([np.prod(p.size(), dtype = np.int32) for p in self.parameters() if not p.requires_grad]) 27 | 28 | 29 | # Summarize the module 30 | def summary(self) -> None: 31 | 32 | print(f'Number of parameters: {self.nb_parameters():,}') 33 | print(f'Number of trainable parameters: {self.nb_trainable_parameters():,}') 34 | print(f'Number of non-trainable parameters: {self.nb_non_trainable_parameters():,}') 35 | 36 | 37 | # Remove NaNs from the module gradients 38 | def clean_nan(self) -> None: 39 | 40 | for p in self.parameters(): 41 | if p.grad is not None: 42 | torch.nan_to_num(p.grad, nan = 0, posinf = 1e5, neginf = -1e5, out = p.grad) 43 | 44 | 45 | # Clip the module gradients 46 | def clip_gradient(self, max_norm: float) -> None: 47 | 48 | nn.utils.clip_grad_norm_(self.parameters(), max_norm) 49 | 50 | 51 | class Linear(nn.Linear): 52 | 53 | def __init__(self, in_features: int, out_features: int, **kwargs): 54 | 55 | super().__init__(in_features, out_features, False, **kwargs) 56 | nn.init.normal_(self.weight, mean = 0.0, std = INIT_STDDEV) 57 | 58 | 59 | class LayerNorm(Module): 60 | 61 | def __init__(self, shape: int, epsilon: float = 1e-5, **kwargs): 62 | 63 | super().__init__(**kwargs) 64 | 65 | self.shape = (shape,) 66 | self.weight = nn.Parameter(torch.ones(shape)) 67 | self.epsilon = epsilon 68 | 69 | 70 | def _normalize(self, x: torch.Tensor): 71 | 72 | return x * torch.rsqrt(x.pow(2).mean(-1, keepdim = True) + self.epsilon) 73 | 74 | 75 | def forward(self, x: torch.Tensor): 76 | 77 | return self._normalize(x.float()).type_as(x) * self.weight 78 | 79 | 80 | class Embedding(nn.Embedding): 81 | 82 | def __init__(self, num_embeddings: int, embedding_dim: int, **kwargs): 83 | 84 | super().__init__(num_embeddings, embedding_dim, padding_idx = PADDING_TOKEN, **kwargs) 85 | nn.init.normal_(self.weight, mean = 0.0, std = INIT_STDDEV) 86 | 87 | -------------------------------------------------------------------------------- /dimgpt/training/model.py: -------------------------------------------------------------------------------- 1 | import torch, math 2 | from torch import nn 3 | from flash_attn import flash_attn_func 4 | 5 | 
from dimgpt.training.layers import * 6 | from dimgpt.settings import * 7 | from dimgpt.training.rope import * 8 | 9 | 10 | class AttentionBlock(Module): 11 | 12 | def __init__(self, **kwargs): 13 | 14 | super().__init__(**kwargs) 15 | 16 | self.query = Linear(EMBEDDING_DIM, NUM_HEADS * HEAD_DIM) 17 | self.key = Linear(EMBEDDING_DIM, NUM_GROUPED_HEADS * HEAD_DIM) 18 | self.value = Linear(EMBEDDING_DIM, NUM_GROUPED_HEADS * HEAD_DIM) 19 | 20 | self.projection = Linear(NUM_HEADS * HEAD_DIM, EMBEDDING_DIM) 21 | nn.init.normal_(self.projection.weight, mean = 0.0, std = INIT_STDDEV / math.sqrt(2 * NUM_BLOCKS)) 22 | 23 | self.residual_dropout = nn.Dropout(DROPOUT) 24 | 25 | 26 | def forward(self, x: torch.Tensor, rope_frequencies: torch.Tensor) -> torch.Tensor: 27 | 28 | batch_size, context_size, _ = x.shape 29 | 30 | q = self.query(x) 31 | k = self.key(x) 32 | v = self.value(x) 33 | 34 | q = q.view(batch_size, context_size, NUM_HEADS, HEAD_DIM) 35 | k = k.view(batch_size, context_size, NUM_GROUPED_HEADS, HEAD_DIM) 36 | v = v.view(batch_size, context_size, NUM_GROUPED_HEADS, HEAD_DIM) 37 | 38 | q, k = rotary_position_embedding(q, k, rope_frequencies) 39 | 40 | k = torch.repeat_interleave(k, repeats = NUM_HEADS // NUM_GROUPED_HEADS, dim = 2) 41 | v = torch.repeat_interleave(v, repeats = NUM_HEADS // NUM_GROUPED_HEADS, dim = 2) 42 | 43 | x = flash_attn_func(q, k, v, dropout_p = DROPOUT if self.training else 0, causal = True, window_size = (WINDOW_SIZE, 0)) 44 | 45 | x = x.view(batch_size, context_size, NUM_HEADS * HEAD_DIM) 46 | 47 | return self.residual_dropout(self.projection(x)) 48 | 49 | 50 | class FeedForward(Module): 51 | 52 | def __init__(self, **kwargs): 53 | 54 | super().__init__(**kwargs) 55 | 56 | self.linear_1 = Linear(EMBEDDING_DIM, FFN_DIM) 57 | self.linear_2 = Linear(EMBEDDING_DIM, FFN_DIM) 58 | self.linear_3 = Linear(FFN_DIM, EMBEDDING_DIM) 59 | self.activation = nn.SiLU() 60 | self.dropout = nn.Dropout(DROPOUT) 61 | 62 | 63 | def forward(self, x: torch.Tensor) -> torch.Tensor: 64 | 65 | x = self.activation(self.linear_1(x)) * self.linear_2(x) 66 | x = self.dropout(self.linear_3(x)) 67 | 68 | return x 69 | 70 | 71 | # Model block 72 | class TransformerBlock(Module): 73 | 74 | def __init__(self, **kwargs): 75 | 76 | super().__init__(**kwargs) 77 | 78 | self.norm_1 = LayerNorm(EMBEDDING_DIM) 79 | self.attention = AttentionBlock() 80 | self.norm_2 = LayerNorm(EMBEDDING_DIM) 81 | self.feed_forward = FeedForward() 82 | 83 | 84 | def forward(self, x: torch.Tensor, rope_frequencies: torch.Tensor) -> torch.Tensor: 85 | 86 | x = x + self.attention(self.norm_1(x), rope_frequencies) 87 | x = x + self.feed_forward(self.norm_2(x)) 88 | 89 | return x 90 | 91 | 92 | # Model 93 | class Model(Module): 94 | 95 | def __init__(self, **kwargs): 96 | 97 | super().__init__(**kwargs) 98 | 99 | self.token_embedding = Embedding(VOCAB_SIZE, EMBEDDING_DIM) 100 | self.rope_frequencies = create_rope_frequencies(HEAD_DIM, MAX_CONTEXT) 101 | self.init_dropout = nn.Dropout(DROPOUT) 102 | self.blocks = nn.ModuleList([TransformerBlock() for _ in range(NUM_BLOCKS)]) 103 | self.final_norm = LayerNorm(EMBEDDING_DIM) 104 | self.final_linear = Linear(EMBEDDING_DIM, VOCAB_SIZE) 105 | self.token_embedding.weight = self.final_linear.weight 106 | 107 | 108 | def forward(self, input: torch.Tensor, only_last: bool = False) -> torch.Tensor: 109 | 110 | if input.shape[1] > MAX_CONTEXT: 111 | input = input[:, -MAX_CONTEXT:] 112 | 113 | rope_frequencies = self.rope_frequencies[:input.shape[1]] 114 | rope_frequencies = 
rope_frequencies[None, :, None, :] 115 | 116 | x = self.token_embedding(input) 117 | x = self.init_dropout(x) 118 | 119 | for block in self.blocks: 120 | x = block(x, rope_frequencies) 121 | 122 | x = self.final_norm(x) 123 | 124 | if only_last: 125 | return self.final_linear(x[:, -1]) 126 | 127 | return self.final_linear(x) 128 | -------------------------------------------------------------------------------- /dimgpt/training/optimizer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | 4 | from dimgpt.settings import * 5 | 6 | 7 | class AdamW(torch.optim.AdamW): 8 | 9 | def __init__(self, params: list[nn.Parameter], learning_rate: float, **kwargs): 10 | 11 | decay_params = [p for p in params if p.requires_grad and p.dim() >= 2] 12 | other_params = [p for p in params if p.requires_grad and p.dim() < 2] 13 | 14 | groups = [ 15 | {'params': decay_params, 'weight_decay': WEIGHT_DECAY}, 16 | {'params': other_params, 'weight_decay': 0.0} 17 | ] 18 | 19 | super().__init__( 20 | groups, 21 | lr = learning_rate, 22 | betas = (BETA_1, BETA_2), 23 | eps = EPSILON, 24 | fused = GPU_ENABLED, 25 | **kwargs 26 | ) 27 | -------------------------------------------------------------------------------- /dimgpt/training/rope.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | from dimgpt.settings import * 4 | 5 | 6 | def create_rope_frequencies(dim: int, max_length: int, theta: float = ROPE_THETA) -> torch.Tensor: 7 | 8 | frequencies = 1.0 / (theta ** (torch.arange(0, dim, 2, device = DEVICE)[:(dim // 2)].float() / dim)) 9 | t = torch.arange(max_length, device = DEVICE) 10 | frequencies = torch.outer(t, frequencies).float() 11 | 12 | return torch.polar(torch.ones_like(frequencies, device = DEVICE), frequencies) 13 | 14 | 15 | def rotary_position_embedding(q: torch.Tensor, k: torch.Tensor, frequencies: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]: 16 | 17 | q_complex = torch.view_as_complex(q.float().reshape(*q.shape[:-1], -1, 2)) 18 | k_complex = torch.view_as_complex(k.float().reshape(*k.shape[:-1], -1, 2)) 19 | 20 | q_out = torch.view_as_real(q_complex * frequencies).flatten(3) 21 | k_out = torch.view_as_real(k_complex * frequencies).flatten(3) 22 | 23 | return q_out.type_as(q), k_out.type_as(k) -------------------------------------------------------------------------------- /dimgpt/training/trainer.py: -------------------------------------------------------------------------------- 1 | import os, pickle, math, time 2 | import torch 3 | from torch import nn 4 | 5 | from dimgpt.settings import * 6 | from dimgpt.training.datasets import Dataset 7 | from dimgpt.training.model import Model 8 | from dimgpt.training.optimizer import AdamW 9 | 10 | 11 | class Trainer(): 12 | 13 | def __init__(self, model: Model, dataset: Dataset): 14 | 15 | self.model = model 16 | model.train() 17 | 18 | self.dataset = dataset 19 | 20 | self.time = None 21 | self.step = 0 22 | self.tokens = 0 23 | self.epochs = 0.0 24 | self.learning_rate = 0.0 25 | self.loss = 0.0 26 | self.accuracy = 0.0 27 | self.val_loss = 0.0 28 | self.val_accuracy = 0.0 29 | self.loss_ema = None 30 | self.accuracy_ema = None 31 | self.best_val_loss = float('inf') 32 | 33 | self.optimizer = AdamW(self.model.parameters(), self.learning_rate) 34 | 35 | self.metrics_history = { 36 | 'time': [], 37 | 'step': [], 38 | 'tokens': [], 39 | 'epochs': [], 40 | 'loss': [], 41 | 'accuracy': [], 42 | 'val_loss': [], 43 | 
'val_accuracy': [] 44 | } 45 | 46 | 47 | # Save the models 48 | def save_model(self, path: str) -> None: 49 | 50 | if not os.path.exists(path): 51 | os.makedirs(path) 52 | 53 | torch.save(self.model.state_dict(), os.path.join(path, 'model.pt')) 54 | torch.save(self.optimizer.state_dict(), os.path.join(path, 'optimizer.pt')) 55 | 56 | 57 | # Load the models 58 | def load_model(self, path) -> None: 59 | 60 | if not os.path.exists(path): 61 | return 62 | 63 | self.model.load_state_dict(torch.load(os.path.join(path, 'model.pt'), map_location = DEVICE)) 64 | self.optimizer.load_state_dict(torch.load(os.path.join(path, 'optimizer.pt'), map_location = DEVICE)) 65 | 66 | 67 | # Find previous session 68 | def find_previous_session(self) -> None: 69 | 70 | if os.path.exists(os.path.join(OUTPUT_DIR, 'last')): 71 | self.load_model(os.path.join(OUTPUT_DIR, 'last')) 72 | 73 | if os.path.exists(os.path.join(OUTPUT_DIR, 'metrics.pkl')): 74 | self.load_metrics() 75 | 76 | 77 | # Print 78 | def print(self) -> None: 79 | 80 | print(f'Epochs: {self.epochs:.4f} | Steps: {self.step:,} | Tokens: {self.tokens:,} | LR: {self.learning_rate:.5f} || ' \ 81 | f'Loss: {self.loss_ema:.5f} | Accuracy: {self.accuracy_ema * 100.0:.4f} % | ' \ 82 | f'Val loss: {self.val_loss:.5f} | Val accuracy: {self.val_accuracy * 100.0:.4f} % ', end = '\r') 83 | 84 | 85 | # Save metrics 86 | def save_metrics(self) -> None: 87 | 88 | if self.time is None: 89 | self.metrics_history["time"].append(0.0) 90 | else: 91 | self.metrics_history["time"].append(self.metrics_history["time"][-1] + (time.time() - self.time)) 92 | 93 | self.time = time.time() 94 | 95 | self.metrics_history["step"].append(self.step) 96 | self.metrics_history["tokens"].append(self.tokens) 97 | self.metrics_history["epochs"].append(self.epochs) 98 | self.metrics_history["loss"].append(self.loss_ema) 99 | self.metrics_history["accuracy"].append(self.accuracy_ema) 100 | self.metrics_history["val_loss"].append(self.val_loss) 101 | self.metrics_history["val_accuracy"].append(self.val_accuracy) 102 | 103 | if not os.path.exists(OUTPUT_DIR): 104 | os.makedirs(OUTPUT_DIR) 105 | 106 | pickle.dump(self.metrics_history, open(os.path.join(OUTPUT_DIR, 'metrics.pkl'), 'wb')) 107 | 108 | 109 | # Load metrics 110 | def load_metrics(self) -> None: 111 | 112 | self.metrics_history = pickle.load(open(os.path.join(OUTPUT_DIR, 'metrics.pkl'), 'rb')) 113 | 114 | self.step = self.metrics_history["step"][-1] 115 | self.tokens = self.metrics_history["tokens"][-1] 116 | self.epochs = self.metrics_history["epochs"][-1] 117 | self.loss_ema = self.metrics_history["loss"][-1] 118 | self.accuracy_ema = self.metrics_history["accuracy"][-1] 119 | self.val_loss = self.metrics_history["val_loss"][-1] 120 | self.val_accuracy = self.metrics_history["val_accuracy"][-1] 121 | self.best_val_loss = min(self.metrics_history["val_loss"]) 122 | self.time = time.time() 123 | 124 | 125 | # Update learning rate 126 | def update_learning_rate(self) -> None: 127 | 128 | if self.step < WARMUP_STEPS: 129 | ratio = self.step / WARMUP_STEPS 130 | self.learning_rate = MAX_LEARNING_RATE * ratio 131 | elif self.step < WARMUP_STEPS + DECAY_STEPS: 132 | ratio = (self.step - WARMUP_STEPS) / DECAY_STEPS 133 | ratio = 0.5 * (1.0 + math.cos(math.pi * ratio)) 134 | self.learning_rate = ratio * (MAX_LEARNING_RATE - MIN_LEARNING_RATE) + MIN_LEARNING_RATE 135 | else: 136 | self.learning_rate = MIN_LEARNING_RATE 137 | 138 | for g in self.optimizer.param_groups: 139 | g['lr'] = self.learning_rate 140 | 141 | 142 | def apply_ema(self, 
value_1: float, value_2: float) -> float: 143 | 144 | if value_1 is None: 145 | return value_2 146 | 147 | return value_1 * METRICS_BETA + value_2 * (1.0 - METRICS_BETA) 148 | 149 | 150 | # Train the model 151 | def train(self) -> None: 152 | 153 | # Training loop 154 | while True: 155 | 156 | # Update step 157 | self.step += 1 158 | self.tokens += (MAX_CONTEXT + 1) * BATCH_SIZE * NUM_ACCUMULATIONS 159 | self.epochs += ((MAX_CONTEXT + 1) * BATCH_SIZE * NUM_ACCUMULATIONS) / self.dataset.train_size() 160 | 161 | # Update learning rate 162 | self.update_learning_rate() 163 | 164 | # ----- Training ----- # 165 | 166 | self.model.train() 167 | self.loss = 0.0 168 | self.accuracy = 0.0 169 | 170 | # First load data (asyncronous) 171 | x, y, strength = self.dataset.next_train() 172 | 173 | for i in range(NUM_ACCUMULATIONS): 174 | 175 | with CONTEXT: 176 | 177 | # Forward pass 178 | prediction = self.model(x) 179 | 180 | # Loss 181 | loss = nn.functional.cross_entropy( 182 | input = prediction.reshape(-1, prediction.shape[-1]), 183 | target = y.reshape(-1), 184 | ignore_index = PADDING_TOKEN, 185 | reduction = 'none' 186 | ) 187 | loss = ((loss * strength.reshape(-1)).sum() / (strength.sum() + 1e-8)) / NUM_ACCUMULATIONS 188 | self.loss += loss.item() 189 | 190 | # Accuracy 191 | accuracy = (prediction.argmax(dim = 2) == y).to(dtype = torch.float32) 192 | self.accuracy += (((accuracy * strength).sum() / (strength.sum() + 1e-8)) / NUM_ACCUMULATIONS).item() 193 | 194 | # Next load data (asyncronous) 195 | if i < NUM_ACCUMULATIONS - 1: 196 | x, y, strength = self.dataset.next_train() 197 | 198 | # Backward pass 199 | loss.backward() 200 | 201 | # Update weights 202 | self.model.clean_nan() 203 | self.model.clip_gradient(CLIP_GRADIENT) 204 | self.optimizer.step() 205 | self.optimizer.zero_grad(set_to_none = True) 206 | 207 | # Update ema values 208 | self.loss_ema = self.apply_ema(self.loss_ema, self.loss) 209 | self.accuracy_ema = self.apply_ema(self.accuracy_ema, self.accuracy) 210 | 211 | # ----- Validations ----- # 212 | 213 | if self.step % VAL_INTERVAL == 0: 214 | 215 | self.model.eval() 216 | 217 | with torch.no_grad(): 218 | 219 | self.val_loss = 0.0 220 | self.val_accuracy = 0.0 221 | 222 | for _ in range(NUM_ACCUMULATIONS): 223 | 224 | # Load data 225 | x, y, strength = self.dataset.next_val() 226 | 227 | with CONTEXT: 228 | 229 | # Forward pass 230 | prediction = self.model(x) 231 | 232 | # Loss 233 | loss = nn.functional.cross_entropy( 234 | input = prediction.reshape(-1, prediction.shape[-1]), 235 | target = y.reshape(-1), 236 | ignore_index = PADDING_TOKEN, 237 | reduction = 'none' 238 | ) 239 | self.val_loss += (((loss * strength.reshape(-1)).sum() / (strength.sum() + 1e-8)) / NUM_ACCUMULATIONS).item() 240 | 241 | # Accuracy 242 | accuracy = (prediction.argmax(dim = 2) == y).to(dtype = torch.float32) 243 | self.val_accuracy += (((accuracy * strength).sum() / (strength.sum() + 1e-8)) / NUM_ACCUMULATIONS).item() 244 | 245 | # Save 246 | self.save_metrics() 247 | self.save_model(os.path.join(OUTPUT_DIR, 'last')) 248 | 249 | # Save best 250 | if self.val_loss <= self.best_val_loss: 251 | self.best_val_loss = self.val_loss 252 | self.save_model(os.path.join(OUTPUT_DIR, 'best')) 253 | 254 | # -------------------- # 255 | 256 | # Print 257 | self.print() 258 | -------------------------------------------------------------------------------- /dimgpt/utils.py: -------------------------------------------------------------------------------- 1 | import random, platform, psutil, time 2 | import 
datetime as dt 3 | import numpy as np 4 | import torch 5 | from sys import exit 6 | 7 | from dimgpt.settings import * 8 | 9 | 10 | # Reset the random seed 11 | def reset_rand() -> None: 12 | 13 | now = dt.datetime.now() 14 | milliseconds_since_midnight = (now.hour * 3600 + now.minute * 60 + now.second) * 1000 + now.microsecond // 1000 15 | random.seed(milliseconds_since_midnight) 16 | np.random.seed(milliseconds_since_midnight) 17 | torch.manual_seed(milliseconds_since_midnight) 18 | 19 | 20 | # Check if there is a GPU available 21 | def check_gpu() -> None: 22 | 23 | if GPU_ENABLED: 24 | torch.cuda.empty_cache() 25 | nb_gpu = torch.cuda.device_count() 26 | memory = torch.cuda.mem_get_info()[0] / 1024 ** 3 27 | print(f'{nb_gpu} GPU {"are" if nb_gpu > 1 else "is"} available! Using GPU: "{torch.cuda.get_device_name()}" ({memory:.2f} GB available)') 28 | 29 | else: 30 | memory = psutil.virtual_memory().available / 1024 ** 3 31 | print(f'No GPU available... Using CPU: "{platform.processor()}" ({memory:.2f} GB available)') 32 | 33 | 34 | def save_text_array(array: list[str], path: str) -> None: 35 | 36 | with open(path, 'w', encoding = 'utf-8') as f: 37 | 38 | f.truncate(0) 39 | 40 | for i in range(len(array)): 41 | 42 | f.write(array[i]) 43 | 44 | if i != len(array) - 1: 45 | f.write('\n') 46 | 47 | 48 | def load_text_array(path: str) -> list[str]: 49 | 50 | with open(path, 'r', encoding = 'utf-8') as f: 51 | 52 | return f.read().split('\n') 53 | 54 | 55 | def split_keep(text: str, delimiter: str) -> list[str]: 56 | 57 | words = text.split(delimiter) 58 | 59 | temp = [] 60 | 61 | for i in range(len(words) - 1): 62 | temp.extend([words[i], delimiter]) 63 | 64 | temp.append(words[-1]) 65 | 66 | return temp 67 | 68 | 69 | class Timer: 70 | 71 | def __init__(self, wait_steps: int = 0, num_steps: int = 1, exit_on_end: bool = False): 72 | 73 | self.wait_steps = wait_steps 74 | self.num_steps = num_steps 75 | self.exit_on_end = exit_on_end 76 | self.times = [0.0] * num_steps 77 | self.wait_step = 0 78 | self.step = 0 79 | 80 | 81 | def __enter__(self): 82 | 83 | if self.wait_step < self.wait_steps: 84 | return 85 | 86 | self.times[self.step] = time.time() 87 | 88 | 89 | def __exit__(self, exc_type, exc_value, traceback): 90 | 91 | if self.wait_step < self.wait_steps: 92 | self.wait_step += 1 93 | return 94 | 95 | self.times[self.step] = time.time() - self.times[self.step] 96 | self.step += 1 97 | 98 | if self.step >= self.num_steps: 99 | 100 | print(f'\nDuration: {sum(self.times) / self.num_steps:.2f}s') 101 | 102 | if self.exit_on_end: 103 | exit(0) -------------------------------------------------------------------------------- /models/README.md: -------------------------------------------------------------------------------- 1 | # 🎛️ Trained weights 2 | 3 | The trained weights of the different models are available on [**Google Drive**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m), you just need to: 4 | 5 | * Download the `.pt` file of the model you want to use and put it in this folder 6 | * Download the `vocab.txt` file and put it in the `data` folder 7 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | torch 2 | flash-attn 3 | datasets 4 | tokenizers 5 | unidecode 6 | regex 7 | tqdm 8 | psutil -------------------------------------------------------------------------------- /resources/misc/accuracy.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/accuracy.png -------------------------------------------------------------------------------- /resources/misc/loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/loss.png -------------------------------------------------------------------------------- /resources/misc/test_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_1.png -------------------------------------------------------------------------------- /resources/misc/test_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_10.png -------------------------------------------------------------------------------- /resources/misc/test_11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_11.png -------------------------------------------------------------------------------- /resources/misc/test_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_2.png -------------------------------------------------------------------------------- /resources/misc/test_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_3.png -------------------------------------------------------------------------------- /resources/misc/test_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_4.png -------------------------------------------------------------------------------- /resources/misc/test_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_5.png -------------------------------------------------------------------------------- /resources/misc/test_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_6.png -------------------------------------------------------------------------------- /resources/misc/test_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_7.png -------------------------------------------------------------------------------- /resources/misc/test_8.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_8.png -------------------------------------------------------------------------------- /resources/misc/test_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_9.png -------------------------------------------------------------------------------- /resources/misc/thumbnail.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/thumbnail.png -------------------------------------------------------------------------------- /testing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Testing" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Imports" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import os\n", 24 | "\n", 25 | "from dimgpt import utils\n", 26 | "from dimgpt.testing.sampling import *\n", 27 | "from dimgpt.data.tokenizer import *\n", 28 | "from dimgpt.settings import *\n", 29 | "\n", 30 | "utils.reset_rand()" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "### Check GPU" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "utils.check_gpu()" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "### Tokenizer" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "tokenizer = Tokenizer()\n", 63 | "\n", 64 | "print(f'Vocab size: {len(tokenizer.vocab):,}\\n')\n", 65 | "\n", 66 | "for v in tokenizer.vocab:\n", 67 | "\tprint(f'[{v}]', end = ' ')" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### Model" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "MODEL_PATH = './models/DimensionGPT-0.2B-Chat.pt'\n", 84 | "\n", 85 | "model = Model().to(DEVICE)\n", 86 | "model.load_state_dict(torch.load(MODEL_PATH, map_location = DEVICE))\n", 87 | "model.summary()" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "### Testing" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "sampler = Sampler(model, tokenizer)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "_ = sampler.generate(\n", 113 | "\tinput = \"Bonjour\",\n", 114 | "\tmax_length = 512,\n", 115 | "\tchat_bot = True,\n", 116 | "\ttemperature = 0.5,\n", 117 | "\ttop_p = 0.9,\n", 118 | "\tno_repeat = 1.0,\n", 119 | "\tverbose = True,\n", 120 | "\tmax_print_line_length = 150\n", 121 | ")" 122 | ] 123 | } 124 | ], 125 | "metadata": { 126 
| "kernelspec": { 127 | "display_name": "venv", 128 | "language": "python", 129 | "name": "python3" 130 | }, 131 | "language_info": { 132 | "codemirror_mode": { 133 | "name": "ipython", 134 | "version": 3 135 | }, 136 | "file_extension": ".py", 137 | "mimetype": "text/x-python", 138 | "name": "python", 139 | "nbconvert_exporter": "python", 140 | "pygments_lexer": "ipython3", 141 | "version": "3.11.8" 142 | }, 143 | "orig_nbformat": 4 144 | }, 145 | "nbformat": 4, 146 | "nbformat_minor": 2 147 | } 148 | -------------------------------------------------------------------------------- /training.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": {}, 7 | "source": [ 8 | "# Training" 9 | ] 10 | }, 11 | { 12 | "attachments": {}, 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "### Imports" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "from dimgpt import utils\n", 26 | "from dimgpt.training.datasets import *\n", 27 | "from dimgpt.training.model import Model\n", 28 | "from dimgpt.training.trainer import Trainer\n", 29 | "from dimgpt.data.tokenizer import *\n", 30 | "from dimgpt.settings import *\n", 31 | "\n", 32 | "utils.reset_rand()" 33 | ] 34 | }, 35 | { 36 | "attachments": {}, 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "### Check GPU" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "utils.check_gpu()" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "### Tokenizer" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "tokenizer = Tokenizer()\n", 66 | "\n", 67 | "print(f'Vocab size: {len(tokenizer.vocab):,}\\n')\n", 68 | "\n", 69 | "for v in tokenizer.vocab:\n", 70 | "\tprint(f'[{v}]', end = ' ')" 71 | ] 72 | }, 73 | { 74 | "attachments": {}, 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### Dataset" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "dataset = PretrainingDataset(tokenizer)\n", 88 | "#dataset = FinetuningDataset(tokenizer)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "x, y, strength = dataset.next_val()\n", 98 | "\n", 99 | "print(f'Batch shape: {tuple(x.shape)}\\n')\n", 100 | "\n", 101 | "print(tokenizer.decode(x[0]))\n", 102 | "\n", 103 | "del x, y, strength" 104 | ] 105 | }, 106 | { 107 | "attachments": {}, 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### Model" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "model = Model().to(DEVICE)\n", 121 | "model.summary()" 122 | ] 123 | }, 124 | { 125 | "attachments": {}, 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "### Training" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "trainer = Trainer(model, dataset)\n", 139 | "trainer.find_previous_session()\n", 
140 | "\n", 141 | "trainer.train()" 142 | ] 143 | } 144 | ], 145 | "metadata": { 146 | "kernelspec": { 147 | "display_name": "venv", 148 | "language": "python", 149 | "name": "python3" 150 | }, 151 | "language_info": { 152 | "codemirror_mode": { 153 | "name": "ipython", 154 | "version": 3 155 | }, 156 | "file_extension": ".py", 157 | "mimetype": "text/x-python", 158 | "name": "python", 159 | "nbconvert_exporter": "python", 160 | "pygments_lexer": "ipython3", 161 | "version": "3.11.8" 162 | }, 163 | "orig_nbformat": 4 164 | }, 165 | "nbformat": 4, 166 | "nbformat_minor": 2 167 | } 168 | --------------------------------------------------------------------------------