├── .gitignore ├── LICENSE.md ├── README.md ├── create_data.ipynb ├── data └── README.md ├── dimgpt ├── __init__.py ├── data │ ├── __init__.py │ ├── clean.py │ ├── datasets │ │ ├── __init__.py │ │ ├── dataset.py │ │ └── pretraining │ │ │ ├── __init__.py │ │ │ ├── books.py │ │ │ ├── common_crawl.py │ │ │ ├── institutions.py │ │ │ ├── news.py │ │ │ ├── others.py │ │ │ └── wikipedia.py │ ├── finetuning.py │ ├── pretokenizer.py │ ├── pretraining.py │ └── tokenizer.py ├── settings.py ├── testing │ ├── __init__.py │ └── sampling.py ├── training │ ├── __init__.py │ ├── datasets │ │ ├── __init__.py │ │ ├── dataset.py │ │ ├── finetuning.py │ │ └── pretraining.py │ ├── layers.py │ ├── model.py │ ├── optimizer.py │ ├── rope.py │ └── trainer.py └── utils.py ├── models └── README.md ├── requirements.txt ├── resources └── misc │ ├── accuracy.png │ ├── loss.png │ ├── test_1.png │ ├── test_10.png │ ├── test_11.png │ ├── test_2.png │ ├── test_3.png │ ├── test_4.png │ ├── test_5.png │ ├── test_6.png │ ├── test_7.png │ ├── test_8.png │ ├── test_9.png │ └── thumbnail.png ├── testing.ipynb └── training.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | .vscode 2 | /bacup* 3 | /venv 4 | /data/* 5 | !/data/README.md 6 | /models/* 7 | !/models/README.md 8 | /output 9 | notes.txt 10 | final_SPF.xml 11 | __pycache__ 12 | .DS_Store 13 | env.py 14 | /test*.ipynb 15 | /show*.ipynb 16 | /test.txt 17 | /*.whl 18 | /validate.ipynb 19 | /dimgpt/testing/tester.py -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Angel Uriot 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 💬 Language model 2 | 3 | ![Release](https://img.shields.io/badge/Release-v1.0-blueviolet) 4 | ![Language](https://img.shields.io/badge/Language-Python-f2cb1b) 5 | ![Libraries](https://img.shields.io/badge/Libraries-PyTorch-00cf2c) 6 | ![Size](https://img.shields.io/badge/Size-4.2Mo-f12222) 7 | ![Open Source](https://badges.frapsoft.com/os/v2/open-source.svg?v=103) 8 | 9 |
This repository contains the code to train and test autoregressive language models like [**ChatGPT**](https://openai.com/chatgpt) from scratch. I also used it to train the French open-source [**DimensionGPT**](#-dimensiongpt) models.
<br/>
# 📋 Summary

* **[📋 Summary](#-summary)**
* **[🤖 DimensionGPT](#-dimensiongpt)**
	* [🏗️ Architecture](#%EF%B8%8F-architecture)
	* [💾 Data](#-data)
	* [🦾 Training](#-training)
	* [🪛 Fine-tuning](#-fine-tuning)
	* [🧪 Tests](#-tests)
	* [🎛️ Weights](#%EF%B8%8F-weights)
* **[📦 Dependencies](#-dependencies)**
* **[🦾 Training](#-training-1)**
* **[⚗️ Testing](#%EF%B8%8F-testing)**
* **[🙏 Credits](#-credits)**

<br/>
# 🤖 DimensionGPT

Using this repository, I trained [**DimensionGPT-0.2B**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m), a small 0.2B-parameter language model, on 50B tokens with my personal RTX 3090 GPU in ≈570 hours.
## 🏗️ Architecture

The model is based on the transformer architecture (only the decoder part) from the paper [**Attention is All You Need**](https://doi.org/10.48550/arXiv.1706.03762) by **Google Brain** (2017), with a few improvements (a minimal sketch of some of them follows this list):

* I replaced the default normalization layer with Root Mean Square Layer Normalization (RMSNorm) from the paper [**Root Mean Square Layer Normalization**](https://doi.org/10.48550/arXiv.1910.07467) by **Edinburgh University** (2019)

* I moved the normalization layers before the transformer blocks (instead of after) as in the paper [**On Layer Normalization in the Transformer Architecture**](https://doi.org/10.48550/arXiv.2002.04745) by **Microsoft Research** (2020)

* I replaced the ReLU activation with the SwiGLU activation from the paper [**GLU Variants Improve Transformer**](https://doi.org/10.48550/arXiv.2002.05202) by **Google** (2020)

* I implemented Grouped-Query Attention (GQA) from the paper [**GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints**](https://doi.org/10.48550/arXiv.2305.13245) by **Google Research** (2023)

* I replaced the absolute positional embedding with Rotary Position Embedding (RoPE) from the paper [**RoFormer: Enhanced Transformer with Rotary Position Embedding**](https://doi.org/10.48550/arXiv.2104.09864) by **Zhuiyi Technology** (2021)

* I implemented Sliding Window Attention (SWA) from the paper [**Longformer: The Long-Document Transformer**](https://doi.org/10.48550/arXiv.2004.05150) by **Allen Institute** (2020)
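The corresponding modules live in `dimgpt/training/layers.py` (not included in this dump). A minimal PyTorch sketch of the normalization and activation changes — RMSNorm used in pre-norm position and a SwiGLU feed-forward — could look like the following; the class and attribute names are illustrative, not the repository's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
	"""Root Mean Square Layer Normalization (Zhang & Sennrich, 2019)."""
	def __init__(self, dim: int, eps: float = 1e-5):
		super().__init__()
		self.eps = eps
		self.weight = nn.Parameter(torch.ones(dim))

	def forward(self, x: torch.Tensor) -> torch.Tensor:
		# Scale by the inverse RMS of the features, then apply a learned gain (no mean centering, no bias)
		return x * torch.rsqrt(x.pow(2).mean(-1, keepdim = True) + self.eps) * self.weight

class SwiGLU(nn.Module):
	"""SwiGLU feed-forward (Shazeer, 2020): down(silu(gate(x)) * up(x))."""
	def __init__(self, dim: int, hidden_dim: int):
		super().__init__()
		self.gate = nn.Linear(dim, hidden_dim, bias = False)
		self.up = nn.Linear(dim, hidden_dim, bias = False)
		self.down = nn.Linear(hidden_dim, dim, bias = False)

	def forward(self, x: torch.Tensor) -> torch.Tensor:
		return self.down(F.silu(self.gate(x)) * self.up(x))

# Pre-norm residual wiring (normalization *before* each sub-block):
# x = x + attention(RMSNorm(x));  x = x + feed_forward(RMSNorm(x))
```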
Here are the main parameters of the architecture:
| Parameter                     | Value  |
|-------------------------------|--------|
| Embedding dimension           | 1,024  |
| Number of layers              | 16     |
| Heads dimension               | 64     |
| Feed forward hidden dimension | 2,730  |
| Number of heads               | 16     |
| Number of grouped heads       | 4      |
| Window size                   | 256    |
| Context length                | 512    |
| Vocab size                    | 32,000 |
<br/>
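Using the head counts and window size from the table above, grouped-query attention with a sliding-window causal mask can be sketched as below. This is a simplified stand-in for the repository's attention layer (which also applies RoPE and can rely on Flash Attention); shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, window_size: int = 256) -> torch.Tensor:
	# q: (batch, 16, seq, 64) query heads  |  k, v: (batch, 4, seq, 64) shared key / value heads
	groups = q.shape[1] // k.shape[1]
	k = k.repeat_interleave(groups, dim = 1)   # each K/V head serves a group of 4 query heads
	v = v.repeat_interleave(groups, dim = 1)

	# Causal mask restricted to the last `window_size` positions (sliding window attention)
	seq = q.shape[2]
	i = torch.arange(seq).unsqueeze(1)
	j = torch.arange(seq).unsqueeze(0)
	mask = (j <= i) & (j > i - window_size)

	return F.scaled_dot_product_attention(q, k, v, attn_mask = mask)

q = torch.randn(1, 16, 512, 64)
k = torch.randn(1, 4, 512, 64)
v = torch.randn(1, 4, 512, 64)
print(attention(q, k, v).shape)  # torch.Size([1, 16, 512, 64])
```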
The resulting model has 208,929,792 trainable parameters and fits on a single RTX 3090 GPU for training with a batch size of 16 using mixed precision. For inference only, the model should fit on any modern GPU.

<br/>
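Assuming no bias terms and a single embedding matrix shared between the input and output projections (which is what the total suggests), the parameter count can be reproduced from the table by hand:

```python
# Rough parameter count from the architecture table (assumes no biases and a tied embedding / output matrix)
vocab, dim, ffn, layers = 32_000, 1_024, 2_730, 16
head_dim, kv_heads = 64, 4

embedding = vocab * dim                                                  # token embedding, reused as output projection
attention = dim * dim + 2 * dim * (kv_heads * head_dim) + dim * dim     # Q, K, V (grouped), output projections
feed_forward = 3 * dim * ffn                                            # SwiGLU: gate, up, down matrices
norms = 2 * dim                                                         # two RMSNorm gains per block

total = embedding + layers * (attention + feed_forward + norms) + dim   # + final RMSNorm
print(f'{total:,}')  # 208,929,792
```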
## 💾 Data

The dataset used to train this model is exclusively in French and is a mix of multiple sources:
| Source                        | Documents  | Tokens         | Multiplier | Ratio    |
|-------------------------------|------------|----------------|------------|----------|
| Common Crawl (FR)             | 21,476,796 | 35,821,271,160 | 1.0        | 76.89 %  |
| Wikipedia (FR)                | 2,700,373  | 1,626,389,831  | 4.0        | 13.96 %  |
| French news articles          | 20,446,435 | 11,308,851,150 | 0.3        | 7.28 %   |
| French books                  | 29,322     | 2,796,450,308  | 0.2        | 1.20 %   |
| French institutions documents | 87,103     | 147,034,958    | 2.0        | 0.63 %   |
| Others                        | 2,761      | 7,287,322      | 2.0        | 0.03 %   |
| **Total**                     | 44,742,790 | 51,707,284,729 | -          | 100.00 % |
<br/>
For the tokenization, I created my own tokenizer that first cleans the text to keep only a predefined set of characters, then uses the [**Byte Pair Encoding (BPE)**](https://en.wikipedia.org/wiki/Byte_pair_encoding) algorithm to create the vocabulary. I trained the tokenizer on a 300-million-character subset of the dataset to get my 32,000-token vocabulary.

<br/>
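The repository's tokenizer (`dimgpt/data/tokenizer.py`, shown later in this dump) builds its vocabulary with the Hugging Face `tokenizers` library. Stripped of the custom pre-tokenizer and the vocabulary filtering it applies, the core BPE training step looks roughly like this — the file path and the reduced special-token list are illustrative:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Learn a ~32,000-token BPE vocabulary from the cleaned ~300M-character text subset
tokenizer = Tokenizer(BPE(unk_token = '⮜unknown⮞'))

trainer = BpeTrainer(
	vocab_size = 32_000,
	special_tokens = ['⮜unknown⮞', '⮜padding⮞', '⮜start-of-text⮞', '⮜end-of-text⮞'],
	show_progress = True
)

tokenizer.train(['data/tokenizer_data.txt'], trainer)
print(f'{tokenizer.get_vocab_size():,} tokens')
```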
## 🦾 Training

For the training, I used the AdamW optimizer (a variant of stochastic gradient descent) with a learning rate schedule of linear warmup followed by cosine decay. Here are the main hyperparameters:
| Hyperparameter      | Value      |
|---------------------|------------|
| Batch size (tokens) | 524,288    |
| Optimizer           | AdamW      |
| Learning rate       | 6.0 × 10⁻⁴ |
| Warmup steps        | 2,000      |
| Decay steps         | 100,000    |
| β₁                  | 0.9        |
| β₂                  | 0.95       |
| ε                   | 10⁻⁵       |
| Weight decay        | 0.1        |
| Gradient clipping   | 1.0        |
<br/>
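A sketch of the schedule and optimizer implied by this table (linear warmup to the maximum learning rate, then cosine decay down to the minimum of 6.0 × 10⁻⁵ defined in `dimgpt/settings.py`); the function and the stand-in model are illustrative, not the repository's exact code:

```python
import math
import torch

MAX_LR, MIN_LR = 6e-4, 6e-5
WARMUP_STEPS, DECAY_STEPS = 2_000, 100_000

def learning_rate(step: int) -> float:
	# Linear warmup, then cosine decay from MAX_LR down to MIN_LR
	if step < WARMUP_STEPS:
		return MAX_LR * step / WARMUP_STEPS
	if step >= DECAY_STEPS:
		return MIN_LR
	progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
	return MIN_LR + 0.5 * (1.0 + math.cos(math.pi * progress)) * (MAX_LR - MIN_LR)

model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr = learning_rate(0), betas = (0.9, 0.95), eps = 1e-5, weight_decay = 0.1)

# Each step: update the learning rate, then clip gradients to 1.0 before optimizer.step()
for group in optimizer.param_groups:
	group['lr'] = learning_rate(1_000)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```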
I trained the model on my personal RTX 3090 GPU for 1 epoch on the full dataset (13 times the [**Chinchilla optimal**](https://doi.org/10.48550/arXiv.2203.15556)) using mixed precision and gradient accumulation to increase speed and reduce memory usage:
| Training summary |                |
|------------------|----------------|
| Tokens           | 52,428,800,000 |
| Steps            | 100,000        |
| FLOPs            | 6.6 × 10¹⁹     |
| Duration         | 573 hours      |
| Final loss       | 2.19           |
| Final accuracy   | 54.8 %         |
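These figures are consistent with each other and with the parameter count given earlier; a quick arithmetic check:

```python
# Cross-checking the training summary (pure arithmetic)
params = 208_929_792
tokens_per_step = 524_288
steps = 100_000

tokens = tokens_per_step * steps
print(f'{tokens:,}')                  # 52,428,800,000  -> "Tokens"

print(f'{6 * params * tokens:.1e}')   # ~6.6e+19        -> "FLOPs" (usual 6·N·D estimate)

chinchilla = 20 * params              # ~20 tokens per parameter (Chinchilla heuristic)
print(round(tokens / chinchilla))     # 13              -> "13 times the Chinchilla optimal"
```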
*(Training curves: see `loss.png` and `accuracy.png` in `resources/misc`.)*

<br/>
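The full training loop lives in `dimgpt/training/trainer.py`; the mixed-precision and gradient-accumulation pattern described above boils down to something like the sketch below (it assumes a CUDA GPU with bfloat16 support, and the model and data are small stand-ins, not the real ones):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the real model and dataset (illustrative only)
model = torch.nn.Sequential(torch.nn.Embedding(32_000, 64), torch.nn.Linear(64, 32_000)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr = 6e-4, betas = (0.9, 0.95), eps = 1e-5, weight_decay = 0.1)

ACCUMULATIONS = 64   # 16 sequences × 512 tokens × 64 micro-batches ≈ 524,288 tokens per optimizer step

def micro_batch() -> tuple[torch.Tensor, torch.Tensor]:
	# Random tokens of shape (16, 512); the real trainer samples windows from the tokenized dataset
	x = torch.randint(0, 32_000, (16, 512), device = 'cuda')
	return x, x   # in the real code the targets are the inputs shifted by one token

for step in range(10):
	for _ in range(ACCUMULATIONS):
		x, y = micro_batch()
		with torch.autocast(device_type = 'cuda', dtype = torch.bfloat16):   # mixed precision
			loss = F.cross_entropy(model(x).flatten(0, 1), y.flatten())
		(loss / ACCUMULATIONS).backward()                                    # accumulate gradients

	torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
	optimizer.step()
	optimizer.zero_grad(set_to_none = True)
```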
## 🪛 Fine-tuning

I fine-tuned the model on the [**French instructions dataset**](https://github.com/angeluriot/French_instruct) I made for this project to create [**DimensionGPT-0.2B-Chat**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m), a 0.2B language model trained to follow instructions and answer questions in French.

<br/>
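Concretely, the instruction data is laid out with the project's control tokens (see `dimgpt/data/finetuning.py` and the `PREPROMPT` constant in `dimgpt/settings.py` later in this dump). Schematically, a single chat example looks roughly like this — the conversation content is made up, and the real pipeline works on token ids rather than concatenated strings:

```python
# Schematic layout of a chat example using the project's control tokens
preprompt = (
	"Une discussion entre un utilisateur et DimensionGPT, un modèle de langage conversationnel "
	"français créé par le développeur indépendant Dimension et basé sur l'architecture GPT."
)

conversation = [
	{'role': 'user', 'text': "Bonjour, qui es-tu ?"},
	{'role': 'assistant', 'text': "Bonjour ! Je suis DimensionGPT, un assistant francophone."},
]

example = '⮜start-of-text⮞⮜system⮞' + preprompt

for message in conversation:
	tag = '⮜user⮞' if message['role'] == 'user' else '⮜assistant⮞'
	example += tag + message['text']

example += '⮜end-of-text⮞'
print(example)
```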
## 🧪 Tests

Here are some examples of the model outputs *(screenshots: `test_1.png` to `test_11.png` in `resources/misc`)*:

<br/>
## 🎛️ Weights

The trained weights of the different models are available on [**Google Drive**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m); you just need to:

* Download the `.pt` file of the model you want to use and put it in the `models` folder
* Download the `vocab.txt` file and put it in the `data` folder

<br/>
# 📦 Dependencies

* [**Python**](https://www.python.org/)
* [**PyTorch**](https://pytorch.org/)
* [**Flash Attention**](https://github.com/Dao-AILab/Flash-attention)
* [**Datasets 🤗**](https://github.com/huggingface/datasets)
* [**Tokenizers 🤗**](https://github.com/huggingface/tokenizers)
* [**Unidecode**](https://pypi.org/project/Unidecode/)
* [**Regex**](https://github.com/mrabarnett/mrab-regex)
* [**Tqdm**](https://tqdm.github.io/)
* [**PSUtil**](https://github.com/giampaolo/psutil)

<br/>
Run the following command to install the dependencies:

```shell
$ pip install -r requirements.txt
```

⚠️ You may need to use a [**specific command**](https://pytorch.org/get-started/locally/) for PyTorch if you want to use CUDA

⚠️ You may need to manually install a [**Flash Attention release**](https://github.com/Dao-AILab/flash-attention/releases) for Windows

<br/>
# 🦾 Training

* Run the `create_data.ipynb` file to create the tokenizer and the dataset *(it may take an entire day and consume a few hundred gigabytes of disk space)*

* Run the `training.ipynb` file *(you can stop the training at any time and resume it later thanks to the checkpoints)*

* If you don't have an overpriced 24 GB GPU like me, the default settings (those used to train [**DimensionGPT**](#-dimensiongpt)) may not work for you. You can try the following *(example values are sketched below)*:
	* Reduce the **batch size** *(less stable and a worse minimum loss)*
	* Increase the **accumulation steps** *(fixes the previous problems but is slower)*
	* Reduce some **architecture parameters** *(a worse minimum loss)*

<br/>
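The defaults mentioned above live in `dimgpt/settings.py` (shown later in this dump). A smaller-GPU configuration could, for example, adjust constants like these — the exact values are illustrative, not recommendations:

```python
# Example adjustments to the constants in dimgpt/settings.py (values here are illustrative)

BATCH_SIZE = 8           # default: 16  -> smaller micro-batches
NUM_ACCUMULATIONS = 128  # default: 64  -> keeps the same ~524k tokens per optimizer step

# Or shrink the architecture (at the cost of a worse minimum loss):
EMBEDDING_DIM = 768      # default: 1024
NUM_BLOCKS = 12          # default: 16
```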
# ⚗️ Testing

* Run the `testing.ipynb` file to use the models you downloaded or trained

<br/>
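For reference, the sampling utilities used by the notebook are in `dimgpt/testing/sampling.py`. A rough sketch of calling them directly is given below; how the model is constructed and how the downloaded weights are loaded depends on `dimgpt/training/model.py`, which is not shown in this dump, so the lines marked as assumptions (and the checkpoint file name) are guesses — `testing.ipynb` remains the supported entry point:

```python
import torch
from dimgpt.data.tokenizer import Tokenizer
from dimgpt.testing.sampling import Sampler
from dimgpt.training.model import Model

tokenizer = Tokenizer()    # reads data/vocab.txt
model = Model()            # assumption: constructor takes no arguments
model.load_state_dict(torch.load('models/dimension_gpt_chat.pt', map_location = 'cpu'))  # assumption: checkpoint is a state dict, file name illustrative
model.eval()

sampler = Sampler(model, tokenizer)
print(sampler.generate('Bonjour !', max_length = 128, chat_bot = True, temperature = 0.7, top_p = 0.9))
```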
403 | 404 | # 🙏 Credits 405 | 406 | * [**Angel Uriot**](https://github.com/angeluriot) : Creator of the project. 407 | -------------------------------------------------------------------------------- /create_data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Create training data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Imports" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "from dimgpt.data.tokenizer import *\n", 24 | "from dimgpt.data.pretraining import *\n", 25 | "from dimgpt.data.finetuning import *\n", 26 | "from dimgpt import utils\n", 27 | "from dimgpt.settings import *\n", 28 | "\n", 29 | "utils.reset_rand()" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### Import pretraining dataset" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "pretraining = Pretraining()" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "pretraining.summary()" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "print(pretraining.get_document())" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "### Create vocab" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "sizes, chars = pretraining.create_tokenizer_data()\n", 80 | "\n", 81 | "print()\n", 82 | "\n", 83 | "for i in range(len(pretraining.datasets)):\n", 84 | "\tprint(f'{pretraining.datasets[i].name}: {sizes[i]:,} characters')\n", 85 | "\n", 86 | "print('\\nNb unique characters:', len(chars), '\\n')\n", 87 | "\n", 88 | "for char in chars:\n", 89 | "\tprint(f'[{char}]', end = ' ')" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "tokenizer = Tokenizer()\n", 99 | "\n", 100 | "print(f'\\nVocab size: {len(tokenizer.vocab):,}\\n')\n", 101 | "\n", 102 | "for v in tokenizer.vocab:\n", 103 | "\tprint(f'[{v}]', end = ' ')" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "### Encode datasets" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "pretraining.save(tokenizer)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "pretraining.summary()" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "finetuning = Finetuning()\n", 138 | "finetuning.save(tokenizer)" 139 | ] 140 | } 141 | ], 142 | "metadata": { 143 | "kernelspec": { 144 | "display_name": "venv", 145 | "language": "python", 146 | "name": "python3" 147 | }, 148 | "language_info": { 149 | "codemirror_mode": { 150 | "name": "ipython", 151 | "version": 3 152 | }, 153 | "file_extension": ".py", 154 | "mimetype": "text/x-python", 155 | 
"name": "python", 156 | "nbconvert_exporter": "python", 157 | "pygments_lexer": "ipython3", 158 | "version": "3.10.11" 159 | }, 160 | "orig_nbformat": 4 161 | }, 162 | "nbformat": 4, 163 | "nbformat_minor": 2 164 | } 165 | -------------------------------------------------------------------------------- /data/README.md: -------------------------------------------------------------------------------- 1 | # 🎛️ Trained weights 2 | 3 | The trained weights of the different models are available on [**Google Drive**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m), you just need to: 4 | 5 | * Download the `.pt` file of the model you want to use and put it in the `models` folder 6 | * Download the `vocab.txt` file and put it in this folder 7 | -------------------------------------------------------------------------------- /dimgpt/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/dimgpt/__init__.py -------------------------------------------------------------------------------- /dimgpt/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/dimgpt/data/__init__.py -------------------------------------------------------------------------------- /dimgpt/data/clean.py: -------------------------------------------------------------------------------- 1 | import regex 2 | from unidecode import unidecode 3 | 4 | from dimgpt.settings import * 5 | 6 | 7 | AUTHORIZED_UNICODE = set( 8 | 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' \ 9 | '0123456789' \ 10 | ' !"#$%&\'`()*+,-./:;<=>?@[\\]^_{|}~' \ 11 | 'ÀàÂâÄäÇçÉéÈèÊêËëÎîÏïÔôÖöÙùÛûÜüÆæŒœ' \ 12 | '€£¥•·²³≠±×÷√π' \ 13 | '😀😃😄😁😆😅😂🤣🥲🥹😊😇🙂🙃😉😌😍🥰😘😗😙😚😋😛😝😜🤪🤨🧐🤓😎🥸🤩🥳😏😒😞😔😟😕🙁😣😖😫😩🥺😢😭😤😠😡🤬🤯😳🥵🥶😱😨😰😥😓🫣🤗🫡🤔🫢🤭🤫🤥😶😐😑😬🫠🙄😯😦😧😮😲🥱😴🤤😪😵🫥🤐🥴🤢🤮🤧😷🤒🤕🤑🤠😈👿👹👺🤡💩👻💀👽👾🤖🎃😺😸😹😻😼😽🙀😿😾' \ 14 | '👋🤚🖐✋🖖👌🤌🤏🤞🫰🤟🤘🤙🫵🫱🫲🫳🫴👈👉👆🖕👇👍👎✊👊🤛🤜👏🫶🙌👐🤲🤝🙏💅🤳💪🦾🦵🦿🦶👣👂🦻👃🫀🫁🧠🦷🦴👀👁👅👄🫦💋🩸' \ 15 | '👶👧🧒👦👩🧑👨👱🧔👵🧓👴👲👳🧕👮👷💂👰🤵👸🫅🤴🥷🦸🦹🤶🎅🧙🧝🧛🧟🧞🧜🧚🧌👼🤰🤱🙇💁🙅🙆🙋🧏🤦🤷🙎🙍💇💆🧖💅🤳💃🕺👯🕴🚶🧎🏃🧍👭👬👫💑💏👪🗣👤👥🫂' \ 16 | '🧳🌂🧵🪡🪢🧶👓🕶🥽🥼🦺👔👕👖🧣🧤🧥🧦👗👘🥻🩴🩱🩲🩳👙👚👛👜👝🎒👞👟🥾🥿👠👡🩰👢👑👒🎩🎓🧢⛑🪖💄💍💼' \ 17 | '🐶🐱🐭🐹🐰🦊🐻🐼🐨🐯🦁🐮🐷🐽🐸🐵🙈🙉🙊🐒🐔🐧🐦🐤🐣🐥🦆🦅🦉🦇🐺🐗🐴🦄🐝🪱🐛🦋🐌🐞🐜🪰🪲🪳🦟🦗🕷🕸🦂🐢🐍🦎🦖🦕🐙🦑🦐🦞🦀🪸🐡🐠🐟🐬🐳🐋🦈🐊🐅🐆🦓🦍🦧🦣🐘🦛🦏🐪🐫🦒🦘🦬🐃🐂🐄🐎🐖🐏🐑🦙🐐🦌🐕🐩🦮🐈🪶🐓🦃🦤🦚🦜🦢🦩🕊🐇🦝🦨🦡🦫🦦🦥🐁🐀🐿🦔🐾🐉🐲🌵🎄🌲🌳🌴🪹🪺🪵🌱🌿🍀🎍🪴🎋🍃🍂🍁🍄🐚🪨🌾💐🌷🪷🌹🥀🌺🌸🌼🌻🌞🌝🌛🌜🌚🌕🌖🌗🌘🌑🌒🌓🌔🌙🌎🌍🌏🪐💫⭐🌟✨💥🔥🌪🌈🌤🌥🌦🌧⛈🌩🌨🌬💨💧💦🫧🌊🌫' \ 18 | '🍏🍎🍐🍊🍋🍌🍉🍇🍓🫐🍈🍒🍑🥭🍍🥥🥝🍅🍆🥑🥦🥬🥒🌶🫑🌽🥕🫒🧄🧅🥔🍠🫘🥐🥯🍞🥖🥨🧀🥚🍳🧈🥞🧇🥓🥩🍗🍖🦴🌭🍔🍟🍕🫓🥪🥙🧆🌮🌯🫔🥗🥘🫕🥫🍝🍜🍲🍛🍣🍱🥟🦪🍤🍙🍚🍘🍥🥠🥮🍢🍡🍧🍨🍦🥧🧁🍰🎂🍮🍭🍬🍫🍿🍩🍪🌰🥜🍯🥛🍼🫖☕🍵🧃🥤🧋🫙🍶🍺🍻🥂🍷🫗🥃🍸🍹🧉🍾🧊🥄🍴🍽🥣🥡🥢🧂' \ 19 | '⚽🏀🏈⚾🥎🎾🏐🏉🥏🎱🪀🏓🏸🏒🏑🥍🏏🪃🥅🪁🏹🎣🤿🥊🥋🎽🛹🛼🛷⛸🥌🎿⛷🏂🪂🤼🤸🤺🤾🏇🧘🏄🏊🤽🚣🧗🚵🚴🏆🥇🥈🥉🏅🎖🏵🎗🎫🎟🎪🤹🎭🩰🎨🎬🎤🎧🎼🎹🥁🪘🎷🎺🪗🎸🪕🎻🎲♟🎯🎳🎮🎰🧩' \ 20 | '🚗🚕🚙🚌🚎🏎🚓🚑🚒🚐🛻🚚🚛🚜🦯🦽🦼🛴🚲🛵🏍🛺🚨🚔🚍🚘🚖🛞🚡🚠🚟🚃🚋🚞🚝🚄🚅🚈🚂🚆🚇🚊🚉🛫🛬🛩💺🛰🚀🛸🚁🛶⛵🚤🛥🛳⛴🚢🛟🪝🚧🚦🚥🚏🗺🗿🗽🗼🏰🏯🏟🎡🎢🛝🎠⛱🏖🏝🏜🌋⛰🏔🗻🏕🛖🏠🏡🏘🏚🏗🏭🏢🏬🏣🏤🏥🏦🏨🏪🏫🏩💒🏛🕌🕍🛕🕋⛩🛤🛣🗾🎑🏞🌅🌄🌠🎇🎆🌇🌆🏙🌃🌌🌉🌁' \ 21 | '⌚📱📲💻🖥🖨🖱🖲🕹🗜💽💾💿📀📼📷📸📹🎥📽🎞📞📟📠📺📻🎙🎚🎛🧭⏱⏲⏰🕰⌛⏳📡🔋🪫🔌💡🔦🕯🪔🧯🛢💸💵💴💶💷🪙💰💳💎🪜🧰🪛🔧🔨⚒🛠⛏🪚🔩🪤🧱⛓🧲🔫💣🧨🪓🔪🗡🛡🚬🪦🏺🔮📿🧿🪬💈🔭🔬🕳🩹🩺🩻🩼💊💉🩸🧬🦠🧫🧪🌡🧹🪠🧺🧻🚽🚰🚿🛁🛀🧼🪥🪒🧽🪣🧴🛎🔑🗝🚪🪑🛋🛏🛌🧸🪆🖼🪞🪟🛍🛒🎁🎈🎏🎀🪄🪅🎊🎉🪩🎎🏮🎐🧧📩📨📧💌📥📤📦🏷🪧📪📫📬📭📮📯📜📃📄📑🧾📊📈📉🗒🗓📆📅🗑🪪📇🗃🗳🗄📋📁📂🗂🗞📰📓📔📒📕📗📘📙📚📖🔖🧷🔗📎🖇📐📏🧮📌📍🖊🖋🖌🖍📝🔍🔎🔏🔐🔒🔓' \ 22 | '🧡💛💚💙💜🖤🤍🤎💔💕💞💓💗💖💘💝💟🔯🕎🛐⛎🆔🉑📴📳🈶🈸🈺🆚💮🉐🈴🈵🈹🈲🆎🆑🆘❌🛑⛔📛🚫💯💢🚷🚯🚳🚱🔞📵🚭🔅🔆🚸🔱🔰✅💹❎🌐💠🌀💤🏧🚾🛗🈳🛂🛃🛄🛅🚹🚺🚼⚧🚻🚮🎦📶🈁🔣🔤🔡🔠🆖🆗🆙🆒🆕🆓🔟🔢⏸⏯⏹⏺⏭⏮⏩⏪⏫⏬🔼🔽🔀🔁🔂🔄🔃🎵🎶➕➖➗🟰♾💲💱➰➿🔚🔙🔛🔝🔜🔘🔴🟠🟡🟢🔵🟣🟤🔺🔻🔸🔹🔶🔷🔳🔲🟥🟧🟨🟩🟦🟪🟫🔈🔇🔉🔊🔔🔕📣📢💬💭🗯🃏🎴🕐🕑🕒🕓🕔🕕🕖🕗🕘🕙🕚🕛🕜🕝🕞🕟🕠🕡🕢🕣🕤🕥🕦🕧' \ 23 | '🏴🏁🚩🎌' 24 | ) 25 | 26 | AUTHORIZED_ASCII = set( 27 | 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' \ 28 | '0123456789' \ 29 | ' !"#$%&\'`()*+,-./:;<=>?@[\\]^_{|}~' 30 | ) 31 | 32 | REPLACE_UNICODE = { 33 | '« ': '"', 34 | ' »': '"', 35 | '«': '"', 36 | '»': '"', 37 | '❗️': '!', 38 | '❕': '!', 39 | '❓': '?', 40 | '❔': '?', 41 | 
'‼️': '!!', 42 | '⁉️': '!?', 43 | '✖️': '❌', 44 | '✔️': '✅', 45 | '☺': '😊', 46 | '☺️': '😊', 47 | '☹': '🙁', 48 | '☹️': '🙁' 49 | } 50 | 51 | ENCODE_STRING_EMOJIS = { 52 | '☂️': '☂', 53 | '☀️': '☀', 54 | '❄️': '❄', 55 | '✈️': '✈', 56 | '☎️': '☎', 57 | '⚙️': '⚙', 58 | '⚔️': '⚔', 59 | '✉️': '✉', 60 | '✂️': '✂', 61 | '✒️': '✒', 62 | '❤️': '❤', 63 | '☢️': '☢', 64 | '☣️': '☣', 65 | '⚠️': '⚠', 66 | '♻️': '♻', 67 | '🏳️‍🌈': '①', 68 | '🏳️‍⚧️': '②', 69 | '🏴‍☠️': '③', 70 | '🇺🇸': '④', 71 | '🇨🇳': '⑤', 72 | '🇯🇵': '⑥', 73 | '🇩🇪': '⑦', 74 | '🇮🇳': '⑧', 75 | '🇬🇧': '⑨', 76 | '🇫🇷': '⑩', 77 | '🇮🇹': '⑪', 78 | '🇨🇦': '⑫', 79 | '🇧🇷': '⑬', 80 | '🇷🇺': '⑭', 81 | '🇰🇷': '⑮', 82 | '🇦🇺': '⑯', 83 | '🇲🇽': '⑰', 84 | '🇪🇸': '⑱', 85 | '🏳️': '🏳' 86 | } 87 | 88 | DECODE_STRING_EMOJIS = {value: key for key, value in reversed(ENCODE_STRING_EMOJIS.items())} 89 | 90 | ENCODE_CHARS = list('①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱') 91 | 92 | REPLACE_ASCII_STRING = { 93 | '--': '-' 94 | } 95 | 96 | STRIP_REPLACE = { 97 | ' \n': '\n', 98 | '\t\n': '\n', 99 | '\n\n\n': '\n\n' 100 | } 101 | 102 | CONTROL_REPLACE = { 103 | '\t': '⮜tab⮞', 104 | '\n': '⮜new-line⮞' 105 | } 106 | 107 | POSSIBLE_CHARS = AUTHORIZED_UNICODE | set(DECODE_STRING_EMOJIS.keys()) 108 | 109 | 110 | def clean_ascii(char: str) -> str: 111 | 112 | if char in AUTHORIZED_ASCII or char in CONTROL_REPLACE.keys(): 113 | return char 114 | 115 | return '' 116 | 117 | 118 | def clean_unicode(char: str) -> str: 119 | 120 | if char in AUTHORIZED_UNICODE or char in DECODE_STRING_EMOJIS or char in CONTROL_REPLACE.keys(): 121 | return char 122 | 123 | text = unidecode(char) 124 | 125 | for key, value in REPLACE_ASCII_STRING.items(): 126 | text = text.replace(key, value) 127 | 128 | return ''.join([clean_ascii(char) for char in text]) 129 | 130 | 131 | def clean_string(text: str, keep_control_tokens: bool = False) -> str: 132 | 133 | if len(text) == 0: 134 | return '' 135 | 136 | text = text.replace('\r', '') 137 | 138 | if keep_control_tokens: 139 | 140 | safe_control_tokens = [regex.escape(c) for c in CONTROL_TOKENS] 141 | reg = r'(' + r'|'.join(safe_control_tokens) + r''.join([f'[{i}]' for i in safe_control_tokens]) + r']+)' 142 | parts = regex.split(reg, text, flags = regex.UNICODE, concurrent = False) 143 | parts = list(filter(None, parts)) 144 | 145 | return ''.join([part if part in CONTROL_TOKENS else clean_string(part) for part in parts]) 146 | 147 | for key, value in REPLACE_UNICODE.items(): 148 | text = text.replace(key, value) 149 | 150 | for char in ENCODE_CHARS: 151 | text = text.replace(char, unidecode(char)) 152 | 153 | for key, value in ENCODE_STRING_EMOJIS.items(): 154 | text = text.replace(key, value) 155 | 156 | text = ''.join([clean_unicode(char) for char in text]) 157 | 158 | for key, value in STRIP_REPLACE.items(): 159 | while key in text: 160 | text = text.replace(key, value) 161 | 162 | text = text.strip() 163 | 164 | for key, value in CONTROL_REPLACE.items(): 165 | text = text.replace(key, value) 166 | 167 | return text 168 | 169 | 170 | def unclean_string(text: str, keep_control_tokens: bool = False) -> str: 171 | 172 | for key, value in DECODE_STRING_EMOJIS.items(): 173 | text = text.replace(key, value) 174 | 175 | if keep_control_tokens: 176 | return text 177 | 178 | text = text.replace('⮜unknown⮞', '�') 179 | text = text.replace('⮜padding⮞', '') 180 | text = text.replace('⮜start-of-text⮞', '\n\n---------- START OF TEXT ----------\n\n') 181 | text = text.replace('⮜tab⮞', '\t') 182 | text = text.replace('⮜new-line⮞', '\n') 183 | text = text.replace('⮜human⮞', '\n\n--- Human 
---\n\n') 184 | text = text.replace('⮜system⮞', '\n\n--- System ---\n\n') 185 | text = text.replace('⮜user⮞', '\n\n--- User ---\n\n') 186 | text = text.replace('⮜assistant⮞', '\n\n--- Assistant ---\n\n') 187 | text = text.replace('⮜end-of-text⮞', '\n\n---------- END OF TEXT ----------\n\n') 188 | 189 | return text 190 | -------------------------------------------------------------------------------- /dimgpt/data/datasets/__init__.py: -------------------------------------------------------------------------------- 1 | from .dataset import Dataset -------------------------------------------------------------------------------- /dimgpt/data/datasets/dataset.py: -------------------------------------------------------------------------------- 1 | import os, random, pickle 2 | from tqdm import tqdm 3 | from abc import ABC 4 | import numpy as np 5 | import numpy.typing as npt 6 | from dimgpt.data.clean import * 7 | from dimgpt.data.tokenizer import Tokenizer 8 | 9 | 10 | class Dataset(ABC): 11 | 12 | def __init__(self) -> None: 13 | 14 | self.dataset = None 15 | self.training_part = '' 16 | self.name = '' 17 | self.size = {'train': 0, 'val': 0} 18 | self.multiplier = 1.0 19 | 20 | 21 | def get_document(self, i: int | None = None) -> str: 22 | 23 | if i is None: 24 | i = random.randint(0, len(self.dataset) - 1) 25 | 26 | return '⮜start-of-text⮞' + clean_string(self.dataset[i]['text']) + '⮜end-of-text⮞' 27 | 28 | 29 | def document_to_tokens(self, document: dict[str, str], tokenizer: Tokenizer) -> dict[str, npt.NDArray[np.uint16] | int]: 30 | 31 | tokens = [tokenizer.start_of_text_token, *tokenizer.encode(document['text']), tokenizer.end_of_text_token] 32 | 33 | return {'tokens': np.array(tokens, dtype = np.uint16), 'size': len(tokens)} 34 | 35 | 36 | def save(self, tokenizer: Tokenizer) -> None: 37 | 38 | if os.path.exists(os.path.join(DATA_DIR, self.training_part, self.name, f'train.bin')): 39 | return 40 | 41 | os.makedirs(os.path.join(DATA_DIR, self.training_part, self.name), exist_ok = True) 42 | 43 | split_dataset = self.dataset.train_test_split(test_size = PRETRAINING_VAL_RATIO, shuffle = True) 44 | split_dataset['val'] = split_dataset.pop('test') 45 | 46 | tokenized = split_dataset.map( 47 | lambda doc: self.document_to_tokens(doc, tokenizer), 48 | desc = f'Tokenizing {self.name}', 49 | num_proc = NUM_THREADS 50 | ) 51 | 52 | for split, documents in tokenized.items(): 53 | 54 | total = 0 55 | ids = [] 56 | 57 | for doc in tqdm(documents, desc = f'Saving {self.name} {split} ids'): 58 | 59 | ids.append({ 60 | 'start': total, 61 | 'size': doc['size'] 62 | }) 63 | 64 | total += doc['size'] 65 | 66 | with open(os.path.join(DATA_DIR, self.training_part, self.name, f'{split}_ids.pkl'), 'wb') as file: 67 | pickle.dump(ids, file) 68 | 69 | batch_size = 1_024 70 | 71 | while batch_size >= len(documents): 72 | batch_size //= 2 73 | 74 | self.size[split] = int(np.sum(documents['size'], dtype = np.uint64)) 75 | path = os.path.join(DATA_DIR, self.training_part, self.name, f'{split}.bin') 76 | file = np.memmap(path, dtype = np.uint16, mode = 'w+', shape = (self.size[split],)) 77 | i = 0 78 | 79 | for batch_i in tqdm(range(batch_size), desc = f'Saving {self.name} {split}'): 80 | 81 | batch = documents.shard(num_shards = batch_size, index = batch_i, contiguous = True).with_format('numpy') 82 | file_batch = np.concatenate(batch['tokens']) 83 | file[i:i + len(file_batch)] = file_batch 84 | i += len(file_batch) 85 | 86 | file.flush() 87 | 88 | with open(os.path.join(DATA_DIR, self.training_part, self.name, 
f'metadata.pkl'), 'wb') as file: 89 | pickle.dump({ 90 | 'training_part': self.training_part, 91 | 'name': self.name, 92 | 'size': self.size, 93 | 'multiplier': self.multiplier 94 | }, file) 95 | -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/__init__.py: -------------------------------------------------------------------------------- 1 | from .common_crawl import CommonCrawlDataset 2 | from .wikipedia import WikipediaDataset 3 | from .books import BooksDataset 4 | from .news import NewsDataset 5 | from .institutions import InstitutionsDataset 6 | from .others import OthersDataset -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/books.py: -------------------------------------------------------------------------------- 1 | import os, json 2 | from datasets import load_dataset, DownloadConfig, concatenate_datasets 3 | from dimgpt.data.datasets import Dataset 4 | from dimgpt.settings import * 5 | 6 | class BooksDataset(Dataset): 7 | 8 | def __init__(self) -> None: 9 | 10 | super().__init__() 11 | 12 | self.training_part = 'pretraining' 13 | self.name = 'books' 14 | self.multiplier = 0.2 15 | 16 | print('Downloading Books dataset...') 17 | 18 | if not os.path.exists(os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json')): 19 | 20 | dataset = load_dataset( 21 | path = 'PleIAs/French-PD-Books', 22 | split = 'train', 23 | download_config = DownloadConfig(max_retries = 10), 24 | streaming = True 25 | ) 26 | 27 | os.makedirs(os.path.join(DATA_DIR, self.training_part, self.name), exist_ok = True) 28 | 29 | with open(os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json'), 'w', encoding = 'utf-8') as file: 30 | 31 | file.truncate(0) 32 | i = 0 33 | self.size['train'] = 0 34 | 35 | for record in dataset: 36 | 37 | text = str(record['complete_text']).strip() 38 | 39 | if len(text) < MIN_DOCUMENT_SIZE: 40 | continue 41 | 42 | file.write(json.dumps({'text': text}, ensure_ascii = False) + '\n') 43 | 44 | self.size['train'] += len(text) 45 | i += 1 46 | 47 | if i % 1_000 == 0: 48 | print(f'{i:,} documents | {self.size["train"]:,} characters ', end = '\r') 49 | 50 | if self.size['train'] >= 10_000_000_000: 51 | break 52 | 53 | self.dataset = load_dataset( 54 | path = 'json', 55 | split = 'train', 56 | data_files = os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json'), 57 | num_proc = NUM_THREADS 58 | ) 59 | 60 | if self.size['train'] == 0: 61 | self.size['train'] = 10_000_000_000 62 | 63 | print(f'Books dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters') -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/common_crawl.py: -------------------------------------------------------------------------------- 1 | import os, json 2 | from datasets import load_dataset, DownloadConfig 3 | from dimgpt.data.datasets import Dataset 4 | from dimgpt.settings import * 5 | 6 | class CommonCrawlDataset(Dataset): 7 | 8 | def __init__(self) -> None: 9 | 10 | super().__init__() 11 | 12 | self.training_part = 'pretraining' 13 | self.name = 'common_crawl' 14 | self.multiplier = 1.0 15 | 16 | print('Downloading Common Crawl dataset...') 17 | 18 | if not os.path.exists(os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json')): 19 | 20 | dataset = load_dataset( 21 | path = 'ontocord/CulturaY', 22 | name = 'fr', 23 | split = 'train', 24 | download_config = 
DownloadConfig(max_retries = 10), 25 | streaming = True 26 | ) 27 | 28 | os.makedirs(os.path.join(DATA_DIR, self.training_part, self.name), exist_ok = True) 29 | 30 | with open(os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json'), 'w', encoding = 'utf-8') as file: 31 | 32 | file.truncate(0) 33 | i = 0 34 | self.size['train'] = 0 35 | 36 | for record in dataset: 37 | 38 | text = str(record['text']).strip() 39 | 40 | if len(text) < MIN_DOCUMENT_SIZE: 41 | continue 42 | 43 | file.write(json.dumps({'text': text}, ensure_ascii = False) + '\n') 44 | 45 | self.size['train'] += len(text) 46 | i += 1 47 | 48 | if i % 1_000 == 0: 49 | print(f'{i:,} documents | {self.size["train"]:,} characters ', end = '\r') 50 | 51 | if self.size['train'] >= 150_000_000_000: 52 | break 53 | 54 | self.dataset = load_dataset( 55 | path = 'json', 56 | split = 'train', 57 | data_files = os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json'), 58 | num_proc = NUM_THREADS 59 | ) 60 | 61 | if self.size['train'] == 0: 62 | self.size['train'] = 150_000_000_000 63 | 64 | print(f'Common Crawl dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters') 65 | -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/institutions.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, DownloadConfig, concatenate_datasets 2 | from dimgpt.data.datasets import Dataset 3 | from dimgpt.settings import * 4 | 5 | class InstitutionsDataset(Dataset): 6 | 7 | def __init__(self) -> None: 8 | 9 | super().__init__() 10 | 11 | self.training_part = 'pretraining' 12 | self.name = 'institutions' 13 | self.multiplier = 2.0 14 | 15 | print('Downloading Institutions dataset...') 16 | 17 | europarl = load_dataset( 18 | path = 'bigscience-data/roots_fr_the_pile_europarl', 19 | split = 'train', 20 | download_config = DownloadConfig(max_retries = 10) 21 | ) 22 | 23 | qr_an = load_dataset( 24 | path = 'cassandra-themis/QR-AN', 25 | name = 'qran_generation', 26 | split = 'train+validation+test', 27 | download_config = DownloadConfig(max_retries = 10) 28 | ) 29 | 30 | qr_an = qr_an.map( 31 | lambda doc: {'text': (str(doc['question']).strip() + '\n\n' + str(doc['answer']).strip()).strip()}, 32 | remove_columns = ['question', 'answer'], 33 | desc = 'Cleaning QR-AN', 34 | num_proc = NUM_THREADS 35 | ) 36 | 37 | self.dataset = concatenate_datasets([europarl, qr_an]) 38 | self.dataset = self.dataset.filter(lambda doc: len(str(doc['text']).strip()) >= MIN_DOCUMENT_SIZE) 39 | self.size['train'] = 0 40 | 41 | for doc in self.dataset: 42 | self.size['train'] += len(str(doc['text']).strip()) 43 | 44 | print(f'Institutions dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters') -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/news.py: -------------------------------------------------------------------------------- 1 | import re, json 2 | from datasets import load_dataset, DownloadConfig, concatenate_datasets 3 | from dimgpt.data.datasets import Dataset 4 | from dimgpt.settings import * 5 | 6 | class NewsDataset(Dataset): 7 | 8 | def __init__(self) -> None: 9 | 10 | super().__init__() 11 | 12 | self.training_part = 'pretraining' 13 | self.name = 'news' 14 | self.multiplier = 0.3 15 | 16 | print('Downloading News dataset...') 17 | 18 | news_fr = load_dataset( 19 | path = 'eckendoerffer/news_fr', 
20 | split = 'train+validation+test', 21 | download_config = DownloadConfig(max_retries = 10) 22 | ) 23 | 24 | news_fr = news_fr.map( 25 | lambda doc: {'text': self._clean_news_fr(doc['text'])}, 26 | desc = 'Cleaning news_fr', 27 | num_proc = NUM_THREADS 28 | ) 29 | 30 | wikinews = load_dataset( 31 | path = 'bigscience-data/roots_fr_wikinews', 32 | split = 'train', 33 | download_config = DownloadConfig(max_retries = 10) 34 | ) 35 | 36 | wikinews = wikinews.map( 37 | lambda doc: {'text': self._clean_wikinews(doc)}, 38 | remove_columns = ['meta'], 39 | desc = 'Cleaning wikinews', 40 | num_proc = NUM_THREADS 41 | ) 42 | 43 | cc_news = load_dataset( 44 | path = 'intfloat/multilingual_cc_news', 45 | name = 'fr', 46 | split = 'train', 47 | download_config = DownloadConfig(max_retries = 10) 48 | ) 49 | 50 | cc_news = cc_news.map( 51 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['maintext']).strip()).strip()}, 52 | remove_columns = ['title', 'maintext', 'url', 'date_publish'], 53 | desc = 'Cleaning cc_news', 54 | num_proc = NUM_THREADS 55 | ) 56 | 57 | xlsum = load_dataset( 58 | path = 'csebuetnlp/xlsum', 59 | name = 'french', 60 | split = 'train+validation+test', 61 | download_config = DownloadConfig(max_retries = 10) 62 | ) 63 | 64 | xlsum_summaries = xlsum.map( 65 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['summary']).strip()).strip()}, 66 | remove_columns = ['id', 'url', 'title', 'summary'], 67 | desc = 'Cleaning xlsum_summaries', 68 | num_proc = NUM_THREADS 69 | ) 70 | 71 | xlsum = xlsum.map( 72 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['text']).strip()).strip()}, 73 | remove_columns = ['id', 'url', 'title', 'summary'], 74 | desc = 'Cleaning xlsum', 75 | num_proc = NUM_THREADS 76 | ) 77 | 78 | mlsum = load_dataset( 79 | path = 'mlsum', 80 | name = 'fr', 81 | split = 'train+validation+test', 82 | download_config = DownloadConfig(max_retries = 10) 83 | ) 84 | 85 | mlsum_summaries = mlsum.map( 86 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['summary']).strip()).strip()}, 87 | remove_columns = ['summary', 'topic', 'url', 'title', 'date'], 88 | desc = 'Cleaning mlsum_summaries', 89 | num_proc = NUM_THREADS 90 | ) 91 | 92 | mlsum = mlsum.map( 93 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['text']).strip()).strip()}, 94 | remove_columns = ['summary', 'topic', 'url', 'title', 'date'], 95 | desc = 'Cleaning mlsum', 96 | num_proc = NUM_THREADS 97 | ) 98 | 99 | orange_sum = load_dataset( 100 | path = 'orange_sum', 101 | name = 'title', 102 | split = 'train+validation+test', 103 | download_config = DownloadConfig(max_retries = 10) 104 | ) 105 | 106 | orange_sum = orange_sum.map( 107 | lambda doc: {'text': (str(doc['summary']).strip() + '\n\n' + str(doc['text']).strip()).strip()}, 108 | remove_columns = ['summary'], 109 | desc = 'Cleaning orange_sum', 110 | num_proc = NUM_THREADS 111 | ) 112 | 113 | covid_news = load_dataset( 114 | path = 'gustavecortal/fr_covid_news', 115 | split = 'train', 116 | download_config = DownloadConfig(max_retries = 10) 117 | ) 118 | 119 | covid_news = covid_news.map( 120 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['text']).strip()).strip()}, 121 | remove_columns = ['title', 'description', 'domain', 'url', 'labels'], 122 | desc = 'Cleaning covid_news', 123 | num_proc = NUM_THREADS 124 | ) 125 | 126 | self.dataset = concatenate_datasets([news_fr, wikinews, cc_news, xlsum, xlsum_summaries, mlsum, mlsum_summaries, orange_sum, 
covid_news]) 127 | self.dataset = self.dataset.filter(lambda doc: len(str(doc['text']).strip()) >= MIN_DOCUMENT_SIZE) 128 | self.size['train'] = 0 129 | 130 | for doc in self.dataset: 131 | self.size['train'] += len(str(doc['text']).strip()) 132 | 133 | print(f'News dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters') 134 | 135 | 136 | def _clean_news_fr(self, text: str) -> str: 137 | 138 | text = text.replace(' ,', ',') 139 | text = text.replace(' .', '.') 140 | text = text.replace(' )', ')') 141 | text = text.replace('( ', '(') 142 | text = text.replace(' ]', ']') 143 | text = text.replace('[ ', '[') 144 | 145 | text = re.sub(r'(\d)\s*,\s*(\d)', r'\1,\2', text) 146 | 147 | array = list(text) 148 | start = True 149 | 150 | for i in range(len(array)): 151 | if array[i] == '"': 152 | array[i] = '«' if start else '»' 153 | start = not start 154 | 155 | return ''.join(array) 156 | 157 | 158 | def _clean_wikinews(self, document) -> str: 159 | 160 | meta = str(document['meta']).strip() 161 | start = meta.find(", 'title': ") + 12 162 | end = meta.find(", 'type':") - 1 163 | 164 | if start != 11 and end != -2: 165 | title = meta[start:end].strip() 166 | else: 167 | title = '' 168 | 169 | text = str(document['text']).strip() 170 | 171 | if len(text) < 32: 172 | return text 173 | 174 | index = text[:30].find('–') 175 | 176 | if index != -1: 177 | text = text[index + 1:] 178 | 179 | output = title + '\n\n' + text.strip() 180 | 181 | return output.strip() -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/others.py: -------------------------------------------------------------------------------- 1 | import re 2 | from datasets import load_dataset, DownloadConfig, concatenate_datasets 3 | from dimgpt.data.datasets import Dataset 4 | from dimgpt.settings import * 5 | 6 | class OthersDataset(Dataset): 7 | 8 | def __init__(self) -> None: 9 | 10 | super().__init__() 11 | 12 | self.training_part = 'pretraining' 13 | self.name = 'others' 14 | self.multiplier = 2.0 15 | 16 | print('Downloading Others dataset...') 17 | 18 | ted_talks = load_dataset( 19 | path = 'bigscience-data/roots_fr_ted_talks_iwslt', 20 | split = 'train', 21 | download_config = DownloadConfig(max_retries = 10) 22 | ) 23 | 24 | ted_talks = ted_talks.remove_columns('meta') 25 | 26 | bloom_lm = load_dataset( 27 | path = 'sil-ai/bloom-lm', 28 | name = 'fra', 29 | split = 'train+validation+test', 30 | download_config = DownloadConfig(max_retries = 10) 31 | ) 32 | 33 | bloom_lm = bloom_lm.remove_columns(['title', 'license', 'pageCount', 'bookInstanceId', 'bookLineage']) 34 | 35 | self.dataset = concatenate_datasets([ted_talks, bloom_lm]) 36 | self.dataset = self.dataset.filter(lambda doc: len(str(doc['text']).strip()) >= MIN_DOCUMENT_SIZE) 37 | self.size['train'] = 0 38 | 39 | for doc in self.dataset: 40 | self.size['train'] += len(str(doc['text']).strip()) 41 | 42 | print(f'Others dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters') 43 | -------------------------------------------------------------------------------- /dimgpt/data/datasets/pretraining/wikipedia.py: -------------------------------------------------------------------------------- 1 | import re 2 | from datasets import load_dataset, DownloadConfig, concatenate_datasets 3 | from dimgpt.data.datasets import Dataset 4 | from dimgpt.settings import * 5 | 6 | class WikipediaDataset(Dataset): 7 | 8 | def __init__(self) -> None: 9 | 10 | 
super().__init__() 11 | 12 | self.training_part = 'pretraining' 13 | self.name = 'wikipedia' 14 | self.multiplier = 4.0 15 | 16 | print('Downloading Wikipedia dataset...') 17 | 18 | wikipedia_fr = load_dataset( 19 | path = 'eckendoerffer/wikipedia_fr', 20 | split = 'train+validation+test', 21 | download_config = DownloadConfig(max_retries = 10) 22 | ) 23 | 24 | wikipedia_fr = wikipedia_fr.map( 25 | lambda doc: {'text': self._clean_wikipedia_fr(doc['text'])}, 26 | desc = 'Cleaning wikipedia_fr', 27 | num_proc = NUM_THREADS 28 | ) 29 | 30 | roots_fr_wikipedia = load_dataset( 31 | path = 'bigscience-data/roots_fr_wikipedia', 32 | split = 'train', 33 | download_config = DownloadConfig(max_retries = 10) 34 | ) 35 | 36 | roots_fr_wikipedia = roots_fr_wikipedia.remove_columns('meta') 37 | 38 | roots_fr_wikivoyage = load_dataset( 39 | path = 'bigscience-data/roots_fr_wikivoyage', 40 | split = 'train', 41 | download_config = DownloadConfig(max_retries = 10) 42 | ) 43 | 44 | roots_fr_wikivoyage = roots_fr_wikivoyage.remove_columns('meta') 45 | 46 | self.dataset = concatenate_datasets([wikipedia_fr, roots_fr_wikipedia, roots_fr_wikivoyage]) 47 | self.dataset = self.dataset.filter(lambda doc: len(str(doc['text']).strip()) >= MIN_DOCUMENT_SIZE) 48 | self.size['train'] = 0 49 | 50 | for doc in self.dataset: 51 | self.size['train'] += len(str(doc['text']).strip()) 52 | 53 | print(f'Wikipedia dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters') 54 | 55 | 56 | def _clean_wikipedia_fr(self, text: str) -> str: 57 | 58 | text = text.replace(' ,', ',') 59 | text = text.replace(' .', '.') 60 | text = text.replace(' )', ')') 61 | text = text.replace('( ', '(') 62 | text = text.replace(' ]', ']') 63 | text = text.replace('[ ', '[') 64 | 65 | text = re.sub(r'(\d)\s*,\s*(\d)', r'\1,\2', text) 66 | 67 | array = list(text) 68 | start = True 69 | 70 | for i in range(len(array)): 71 | if array[i] == '"': 72 | array[i] = '«' if start else '»' 73 | start = not start 74 | 75 | return ''.join(array) -------------------------------------------------------------------------------- /dimgpt/data/finetuning.py: -------------------------------------------------------------------------------- 1 | import os, pickle 2 | from datasets import load_dataset, DownloadConfig 3 | from tqdm import tqdm 4 | 5 | from dimgpt.settings import * 6 | from dimgpt.data.tokenizer import Tokenizer 7 | 8 | 9 | class Finetuning: 10 | 11 | def __init__(self): 12 | 13 | self.import_dataset() 14 | 15 | 16 | def import_dataset(self) -> None: 17 | 18 | if os.path.exists(os.path.join(DATA_DIR, 'finetuning', 'chatbot_conversations_train.pkl')): 19 | return 20 | 21 | self.datasets = {} 22 | 23 | for name in ['human_conversations', 'chatbot_conversations', 'dimension_gpt_conversations', 'human_preprompts', 'chatbot_preprompts', 'dimension_gpt_preprompts']: 24 | 25 | self.datasets[name] = load_dataset( 26 | path = 'angeluriot/DimensionGPT_instruct', 27 | name = name, 28 | download_config = DownloadConfig(max_retries = 10), 29 | num_proc = NUM_THREADS 30 | ) 31 | 32 | 33 | def document_to_tokens(self, document: dict[str, str], tokenizer: Tokenizer, preprompts: bool) -> dict[str, list[int] | int]: 34 | 35 | if preprompts: 36 | 37 | tokens = [tokenizer.system_token, *tokenizer.encode(document['preprompt'])] 38 | 39 | return {'tokens': tokens, 'size': len(tokens)} 40 | 41 | tokens = [] 42 | 43 | for msg in document['conversation']: 44 | 45 | if msg['role'] == 'user': 46 | tokens.append(tokenizer.user_token) 47 | elif msg['role'] 
== 'assistant': 48 | tokens.append(tokenizer.assistant_token) 49 | else: 50 | tokens.append(tokenizer.human_token) 51 | 52 | tokens.extend(tokenizer.encode(msg['text'])) 53 | 54 | return {'tokens': tokens, 'size': len(tokens)} 55 | 56 | 57 | def save(self, tokenizer: Tokenizer) -> None: 58 | 59 | if os.path.exists(os.path.join(DATA_DIR, 'finetuning', 'chatbot_conversations_train.pkl')): 60 | return 61 | 62 | if not os.path.exists(os.path.join(DATA_DIR, 'finetuning')): 63 | os.makedirs(os.path.join(DATA_DIR, 'finetuning')) 64 | 65 | for name, dataset in self.datasets.items(): 66 | 67 | if name == 'chatbot_conversations': 68 | dataset = dataset['train'].train_test_split(test_size = FINETUNING_VAL_RATIO, shuffle = True) 69 | dataset['val'] = dataset.pop('test') 70 | 71 | tokenized = dataset.map( 72 | lambda doc: self.document_to_tokens(doc, tokenizer, name.endswith('preprompts')), 73 | desc = f'Tokenizing {name}', 74 | num_proc = NUM_THREADS 75 | ) 76 | 77 | for split, documents in tokenized.items(): 78 | 79 | docs = [] 80 | 81 | for doc in tqdm(documents, desc = f'Saving finetuning dataset {name}_{split}'): 82 | docs.append(doc['tokens']) 83 | 84 | with open(os.path.join(DATA_DIR, 'finetuning', f'{name}_{split}.pkl'), 'wb') as file: 85 | pickle.dump(docs, file) 86 | 87 | -------------------------------------------------------------------------------- /dimgpt/data/pretokenizer.py: -------------------------------------------------------------------------------- 1 | import regex 2 | 3 | from tokenizers import * 4 | from dimgpt.settings import * 5 | from dimgpt.utils import * 6 | 7 | 8 | def split(text: str) -> list[str]: 9 | 10 | if text == '': 11 | return [] 12 | 13 | # Split in words 14 | 15 | safe_control_tokens = [regex.escape(c) for c in CONTROL_TOKENS] 16 | reg = r'(' + r'|'.join(safe_control_tokens) + r'|\d+|\s+|\p{L}+|[^\d\p{L}\s' + r''.join([f'[{i}]' for i in safe_control_tokens]) + r']+)' 17 | words = regex.split(reg, text, flags = regex.UNICODE, concurrent = False) 18 | words = list(filter(None, words)) 19 | 20 | # Add beginning spaces 21 | 22 | temp = [] 23 | i = 0 24 | 25 | while i < len(words) - 1: 26 | 27 | if words[i] == ' ' and words[i + 1] not in CONTROL_TOKENS: 28 | temp.append(' ' + words[i + 1]) 29 | i += 2 30 | continue 31 | 32 | if words[i].endswith(' ') and words[i + 1] not in CONTROL_TOKENS: 33 | temp.extend([words[i][:-1], ' ' + words[i + 1]]) 34 | i += 2 35 | continue 36 | 37 | temp.append(words[i]) 38 | i += 1 39 | 40 | if i == len(words) - 1: 41 | temp.append(words[-1]) 42 | 43 | words = temp 44 | words = list(filter(None, words)) 45 | 46 | return words 47 | 48 | 49 | class PreTokenizer: 50 | 51 | def split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]: 52 | 53 | print('Pretokenize...') 54 | 55 | words = split(str(normalized_string)) 56 | words = [NormalizedString(word) for word in words] 57 | 58 | print('Nb words:', '{:,.0f}'.format(len(words))) 59 | print('Merges...') 60 | 61 | return words 62 | 63 | 64 | def pre_tokenize(self, pretok: PreTokenizedString) -> None: 65 | 66 | pretok.split(self.split) 67 | -------------------------------------------------------------------------------- /dimgpt/data/pretraining.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | from tqdm import tqdm 4 | 5 | from dimgpt.settings import * 6 | from dimgpt.data.clean import * 7 | from dimgpt.data.tokenizer import Tokenizer 8 | from dimgpt.data.datasets.pretraining import * 9 | from 
dimgpt.data.datasets import Dataset 10 | 11 | 12 | class Pretraining: 13 | 14 | def __init__(self): 15 | 16 | self.datasets: list[Dataset] = [CommonCrawlDataset(), WikipediaDataset(), BooksDataset(), NewsDataset(), InstitutionsDataset(), OthersDataset()] 17 | 18 | 19 | def get_document(self) -> str: 20 | 21 | probabilities = np.array([dataset.size['train'] * dataset.multiplier for dataset in self.datasets]) 22 | probabilities /= np.sum(probabilities) 23 | 24 | dataset = np.random.choice(self.datasets, p = probabilities) 25 | 26 | return dataset.get_document() 27 | 28 | 29 | def create_tokenizer_data(self, epsilon: float = 1e-8) -> tuple[list[int], list[str]]: 30 | 31 | if os.path.exists(os.path.join(DATA_DIR, 'tokenizer_data.txt')): 32 | 33 | return [0] * len(self.datasets), [''] 34 | 35 | target_ratios = np.array([dataset.size['train'] * dataset.multiplier for dataset in self.datasets]) 36 | target_ratios = (target_ratios / np.sum(target_ratios)).tolist() 37 | 38 | with open(os.path.join(DATA_DIR, 'tokenizer_data.txt'), 'w', encoding = 'utf-8') as file: 39 | 40 | file.truncate(0) 41 | chars = {} 42 | current_sizes = [0] * len(self.datasets) 43 | pbar = tqdm(total = TOKENIZER_DATA_SIZE) 44 | 45 | while True: 46 | 47 | current_ratios = [size / (sum(current_sizes) + epsilon) for size in current_sizes] 48 | ratio_errors = [target_ratios[i] - current_ratios[i] for i in range(len(self.datasets))] 49 | dataset_index = np.argmax(ratio_errors) 50 | dataset = self.datasets[dataset_index] 51 | 52 | document = dataset.get_document() 53 | 54 | if len(document) == 0: 55 | continue 56 | 57 | file.write(document) 58 | current_sizes[dataset_index] += len(document) 59 | 60 | for char in document: 61 | chars[char] = chars.get(char, 0) + 1 62 | 63 | pbar.update(len(document)) 64 | 65 | if sum(current_sizes) >= TOKENIZER_DATA_SIZE: 66 | break 67 | 68 | document = ' ' + ' '.join(list(POSSIBLE_CHARS)) 69 | file.write(document) 70 | 71 | for char in document: 72 | chars[char] = chars.get(char, 0) + 1 73 | 74 | pbar.close() 75 | 76 | chars = sorted(chars.items(), key = lambda item: item[1], reverse = True) 77 | chars = [char for char, _ in chars] 78 | 79 | return current_sizes, chars 80 | 81 | 82 | def save(self, tokenizer: Tokenizer) -> None: 83 | 84 | for dataset in self.datasets: 85 | dataset.save(tokenizer) 86 | 87 | 88 | def summary(self) -> None: 89 | 90 | for dataset in self.datasets: 91 | print(f'{dataset.name}: {len(dataset.dataset):,} documents | {dataset.size["train"]:,} characters | {dataset.multiplier:.1f}x') -------------------------------------------------------------------------------- /dimgpt/data/tokenizer.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import numpy.typing as npt 4 | import tokenizers as tk 5 | from tokenizers.models import BPE 6 | from tokenizers.trainers import BpeTrainer 7 | from tokenizers.pre_tokenizers import PreTokenizer 8 | from tqdm import tqdm 9 | 10 | from dimgpt.data.clean import * 11 | from dimgpt.utils import * 12 | import dimgpt.data.pretokenizer as pretk 13 | from dimgpt.settings import * 14 | 15 | class Tokenizer: 16 | 17 | def __init__(self): 18 | 19 | self.vocab: list[str] = [] 20 | self.to_index: dict[str, int] = {} 21 | self.to_token: dict[int, str] = {} 22 | 23 | if os.path.exists(os.path.join(DATA_DIR, 'vocab.txt')): 24 | self.load_from_vocab(load_text_array(os.path.join(DATA_DIR, 'vocab.txt'))) 25 | else: 26 | self.create(os.path.join(DATA_DIR, 'tokenizer_data.txt')) 27 | 
save_text_array(self.vocab, os.path.join(DATA_DIR, 'vocab.txt')) 28 | 29 | 30 | def _set_control_tokens(self) -> None: 31 | 32 | self.unknown_token = self.to_index['⮜unknown⮞'] 33 | self.padding_token = self.to_index['⮜padding⮞'] 34 | self.start_of_text_token = self.to_index['⮜start-of-text⮞'] 35 | self.tab_token = self.to_index['⮜tab⮞'] 36 | self.new_line_token = self.to_index['⮜new-line⮞'] 37 | self.human_token = self.to_index['⮜human⮞'] 38 | self.system_token = self.to_index['⮜system⮞'] 39 | self.user_token = self.to_index['⮜user⮞'] 40 | self.assistant_token = self.to_index['⮜assistant⮞'] 41 | self.end_of_text_token = self.to_index['⮜end-of-text⮞'] 42 | 43 | 44 | def load_from_vocab(self, vocab: list[str]) -> None: 45 | 46 | self.vocab = vocab.copy() 47 | self.to_index = {v: i for i, v in enumerate(self.vocab)} 48 | self.to_token = {i: v for i, v in enumerate(self.vocab)} 49 | self._set_control_tokens() 50 | 51 | 52 | def create(self, data_path: str) -> None: 53 | 54 | self._create_vocab(data_path) 55 | dataset = open(data_path, 'r', encoding = 'utf-8').read() 56 | self._sort_vocab(dataset) 57 | self._set_control_tokens() 58 | 59 | 60 | def _create_vocab(self, data_path: str) -> None: 61 | 62 | print('Creating vocab...') 63 | 64 | tokenizer = tk.Tokenizer(BPE(unk_token = '⮜unknown⮞')) 65 | tokenizer.pre_tokenizer = PreTokenizer.custom(pretk.PreTokenizer()) 66 | 67 | trainer = BpeTrainer( 68 | vocab_size = int(VOCAB_SIZE * 1.1), 69 | show_progress = True, 70 | special_tokens = CONTROL_TOKENS 71 | ) 72 | 73 | tokenizer.train([data_path], trainer) 74 | 75 | self.vocab = list(tokenizer.get_vocab().keys()) 76 | vocab_size = len(self.vocab) 77 | 78 | def is_valid(word: str) -> bool: 79 | 80 | if len(word) > MAX_TOKEN_LENGTH: 81 | return False 82 | 83 | if word.endswith(' ') and len(word) > 4: 84 | return False 85 | 86 | if any(c not in POSSIBLE_CHARS for c in word): 87 | return False 88 | 89 | nb_digits = 0 90 | 91 | for char in word: 92 | if char.isdigit(): 93 | nb_digits += 1 94 | 95 | return nb_digits < 2 96 | 97 | self.vocab = list(filter(lambda v: is_valid(v), self.vocab)) 98 | 99 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({vocab_size - len(self.vocab):,} invalid tokens removed)') 100 | vocab_size = len(self.vocab) 101 | 102 | for i in range(10): 103 | if str(i) not in self.vocab: 104 | self.vocab.append(str(i)) 105 | if ' ' + str(i) not in self.vocab: 106 | self.vocab.append(' ' + str(i)) 107 | 108 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({len(self.vocab) - vocab_size:,} number tokens added)') 109 | vocab_size = len(self.vocab) 110 | 111 | for token in FORCED_TOKENS: 112 | if token not in self.vocab: 113 | self.vocab.append(token) 114 | 115 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({len(self.vocab) - vocab_size:,} forced tokens added)') 116 | vocab_size = len(self.vocab) 117 | 118 | self.vocab = CONTROL_TOKENS + self.vocab 119 | 120 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({len(self.vocab) - vocab_size:,} control tokens added)') 121 | 122 | self.to_index = {v: i for i, v in enumerate(self.vocab)} 123 | self.to_token = {i: v for i, v in enumerate(self.vocab)} 124 | 125 | 126 | def _sort_vocab(self, dataset: str) -> None: 127 | 128 | print('Pretokenize...') 129 | data = pretk.split(dataset) 130 | 131 | print('Sorting vocab...') 132 | vocab = {v: 0 for v in self.vocab} 133 | nb_tokens = 0 134 | total_tokens_length = 0 135 | 136 | for i in tqdm(range(len(data))): 137 | 138 | if data[i] in self.to_index: 139 | 
vocab[data[i]] += 1 140 | nb_tokens += 1 141 | total_tokens_length += len(data[i]) 142 | continue 143 | 144 | j = 0 145 | 146 | while j < len(data[i]): 147 | 148 | found = False 149 | 150 | for k in reversed(range(min(MAX_TOKEN_LENGTH, len(data[i]) - j))): 151 | 152 | word = data[i][j:j + k + 1] 153 | 154 | if word in self.to_index: 155 | vocab[word] += 1 156 | nb_tokens += 1 157 | total_tokens_length += len(word) 158 | j += k 159 | found = True 160 | break 161 | 162 | if not found: 163 | vocab['⮜unknown⮞'] += 1 164 | nb_tokens += 1 165 | total_tokens_length += 5 166 | 167 | j += 1 168 | 169 | self.vocab = list(sorted(vocab.items(), key = lambda x: x[1], reverse = True)) 170 | vocab_size = len(self.vocab) 171 | self.vocab = list(filter(lambda x: x[0] not in CONTROL_TOKENS, self.vocab)) 172 | 173 | while len(self.vocab) > VOCAB_SIZE - len(CONTROL_TOKENS): 174 | 175 | for i in range(len(self.vocab) - 1, -1, -1): 176 | 177 | if len(self.vocab[i][0]) > 1 and self.vocab[i][0] not in FORCED_TOKENS and not (self.vocab[i][0][-1].isdigit() and len(self.vocab[i][0]) <= 2): 178 | self.vocab.pop(i) 179 | break 180 | 181 | self.vocab = [v[0] for v in self.vocab] 182 | self.vocab = CONTROL_TOKENS + self.vocab 183 | 184 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({vocab_size - len(self.vocab):,} unused tokens removed)') 185 | 186 | self.to_index = {v: i for i, v in enumerate(self.vocab)} 187 | self.to_token = {i: v for i, v in enumerate(self.vocab)} 188 | 189 | print(f'Number of tokens: {nb_tokens:,}') 190 | print(f'Average token length: {total_tokens_length / nb_tokens:.2f}') 191 | 192 | 193 | def encode(self, text: str, clean_text: bool = True, keep_control_tokens: bool = False, verbose: bool = False) -> list[int]: 194 | 195 | if verbose: 196 | print('Pretokenize...') 197 | 198 | if clean_text: 199 | text = clean_string(text, keep_control_tokens) 200 | 201 | data = pretk.split(text) 202 | 203 | if verbose: 204 | print('Encoding dataset...') 205 | 206 | output = [] 207 | 208 | for i in tqdm(range(len(data)), disable = not verbose): 209 | 210 | if data[i] in self.to_index: 211 | output.append(self.to_index[data[i]]) 212 | continue 213 | 214 | j = 0 215 | 216 | while j < len(data[i]): 217 | 218 | found = False 219 | 220 | for k in reversed(range(min(MAX_TOKEN_LENGTH, len(data[i]) - j))): 221 | 222 | word = data[i][j:j + k + 1] 223 | 224 | if word in self.to_index: 225 | output.append(self.to_index[word]) 226 | j += k 227 | found = True 228 | break 229 | 230 | if not found: 231 | output.append(self.to_index['⮜unknown⮞']) 232 | 233 | j += 1 234 | 235 | return output 236 | 237 | 238 | def decode(self, tokens: list[int] | npt.NDArray[np.uint16] | torch.Tensor | int, keep_control_tokens: bool = False, 239 | token_array: bool = False) -> str | list[str]: 240 | 241 | if type(tokens) == int: 242 | tokens = [tokens] 243 | if type(tokens) == torch.Tensor: 244 | tokens = tokens.detach().to('cpu').tolist() 245 | elif type(tokens) != list: 246 | tokens = list(tokens) 247 | 248 | text = [] 249 | 250 | for t in tokens: 251 | 252 | if t < 0 or t >= len(self.vocab): 253 | continue 254 | 255 | text.append(unclean_string(self.to_token[t], keep_control_tokens)) 256 | 257 | if token_array: 258 | return text 259 | 260 | return ''.join(text) 261 | -------------------------------------------------------------------------------- /dimgpt/settings.py: -------------------------------------------------------------------------------- 1 | import os, torch 2 | from contextlib import nullcontext 3 | 4 | # ============== 
Dataset ============== # 5 | 6 | DATA_DIR = 'data' 7 | OUTPUT_DIR = 'output' 8 | NUM_THREADS = 16 9 | 10 | TOKENIZER_DATA_SIZE = 300_000_000 11 | MIN_DOCUMENT_SIZE = 64 12 | PRETRAINING_VAL_RATIO = 0.001 13 | MAX_TOKEN_LENGTH = 16 14 | CONTROL_TOKENS = ['⮜unknown⮞', '⮜padding⮞', '⮜start-of-text⮞', '⮜tab⮞', '⮜new-line⮞', '⮜human⮞', '⮜system⮞', '⮜user⮞', '⮜assistant⮞', '⮜end-of-text⮞'] 15 | PADDING_TOKEN = 1 16 | FORCED_TOKENS = ['Dimension', ' Dimension', 'GPT', ' GPT', 'IA', ' IA', 'Generative', ' Generative', 'Pre', ' Pre', 'trained', ' trained', 'Transformer', ' Transformer'] 17 | 18 | FINETUNING_VAL_RATIO = 0.01 19 | 20 | SPLIT_RATIOS = [ 21 | 0.099, # human 22 | 0.9, # chatbot 23 | 0.001 # DimensionGPT 24 | ] 25 | 26 | HUMAN_PREPROMPT_RATIOS = [ 27 | 0.3, # human 28 | 0.0, # chatbot 29 | 0.0, # DimensionGPT 30 | 0.7 # None 31 | ] 32 | 33 | CHATBOT_PREPROMPT_RATIOS = [ 34 | 0.0, # human 35 | 0.5, # chatbot 36 | 0.4, # DimensionGPT 37 | 0.1 # None 38 | ] 39 | 40 | DIMENSION_GPT_PREPROMPT_RATIOS = [ 41 | 0.0, # human 42 | 0.0, # chatbot 43 | 1.0, # DimensionGPT 44 | 0.0 # None 45 | ] 46 | 47 | INSTRUCTION_LOSS_STRENGTH = 0.1 48 | PREPROMPT = "Une discussion entre un utilisateur et DimensionGPT, un modèle de langage conversationnel français créé par le développeur indépendant Dimension et basé sur l'architecture GPT." 49 | 50 | # =============== Model =============== # 51 | 52 | VOCAB_SIZE = 32_000 53 | MAX_CONTEXT = 512 54 | WINDOW_SIZE = 256 55 | EMBEDDING_DIM = 1024 56 | NUM_GROUPED_HEADS = 4 57 | NUM_HEADS = 16 58 | HEAD_DIM = EMBEDDING_DIM // NUM_HEADS 59 | FFN_DIM = int((2.0 / 3.0) * 4 * EMBEDDING_DIM) 60 | NUM_BLOCKS = 16 61 | DROPOUT = 0 62 | INIT_STDDEV = 0.02 63 | ROPE_THETA = 10000.0 64 | 65 | # ============= Training ============== # 66 | 67 | BATCH_SIZE = 16 68 | NUM_ACCUMULATIONS = 64 69 | 70 | MAX_LEARNING_RATE = 6e-4 71 | MIN_LEARNING_RATE = 6e-5 72 | WARMUP_STEPS = 2_000 73 | DECAY_STEPS = 100_000 74 | 75 | BETA_1 = 0.9 76 | BETA_2 = 0.95 77 | EPSILON = 1e-5 78 | WEIGHT_DECAY = 0.1 79 | CLIP_GRADIENT = 1.0 80 | 81 | METRICS_BETA = 0.9 82 | VAL_INTERVAL = 50 83 | 84 | # ===================================== # 85 | 86 | GPU_ENABLED = torch.cuda.is_available() 87 | FLOAT16_ENABLED = GPU_ENABLED and torch.cuda.is_bf16_supported() 88 | DEVICE_NAME = 'cuda:0' if GPU_ENABLED else 'cpu' 89 | DEVICE = torch.device(DEVICE_NAME) 90 | CONTEXT = torch.autocast(device_type='cuda', dtype=torch.bfloat16) if FLOAT16_ENABLED else nullcontext() 91 | -------------------------------------------------------------------------------- /dimgpt/testing/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/dimgpt/testing/__init__.py -------------------------------------------------------------------------------- /dimgpt/testing/sampling.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | import numpy.typing as npt 4 | 5 | from dimgpt.training.model import Model 6 | from dimgpt.data.tokenizer import Tokenizer 7 | from dimgpt.settings import * 8 | 9 | 10 | class Sampler(): 11 | 12 | def __init__(self, model: Model, tokenizer: Tokenizer): 13 | 14 | self.model = model 15 | self.tokenizer = tokenizer 16 | self.preprompt = [self.tokenizer.system_token, *self.tokenizer.encode(PREPROMPT)] 17 | 18 | 19 | def get_probabilities(self, input: list[int]) -> npt.NDArray[np.float32]: 20 | 21 | with 
CONTEXT: 22 | model_input = torch.tensor([input], dtype = torch.long, device = DEVICE) 23 | model_output = self.model(model_input, only_last = True) 24 | 25 | probabilities = model_output[0].float().detach().to('cpu').numpy() 26 | probabilities = np.exp(probabilities) / np.sum(np.exp(probabilities)) 27 | 28 | return probabilities 29 | 30 | 31 | def sample(self, input: list[int], chatbot: bool, temperature: float = 1.0, top_p: float = 1.0, no_repeat_strength: float = 0.0) -> int: 32 | 33 | probabilities = np.log(self.get_probabilities(input)) 34 | proximity = MAX_CONTEXT 35 | 36 | for i in reversed(range(max(len(input) - MAX_CONTEXT, 0), len(input))): 37 | strength = no_repeat_strength * (proximity / MAX_CONTEXT) 38 | probabilities[input[i]] *= (1 + strength) 39 | proximity -= 1 40 | 41 | if temperature == 0.0: 42 | return np.argmax(probabilities) 43 | 44 | probabilities /= temperature 45 | probabilities = np.exp(probabilities) / np.sum(np.exp(probabilities)) 46 | 47 | if chatbot: 48 | probabilities[self.tokenizer.end_of_text_token] += probabilities[self.tokenizer.user_token] 49 | 50 | probabilities[self.tokenizer.unknown_token] = 0.0 51 | probabilities[self.tokenizer.padding_token] = 0.0 52 | probabilities[self.tokenizer.start_of_text_token] = 0.0 53 | probabilities[self.tokenizer.human_token] = 0.0 54 | probabilities[self.tokenizer.system_token] = 0.0 55 | probabilities[self.tokenizer.user_token] = 0.0 56 | probabilities[self.tokenizer.assistant_token] = 0.0 57 | 58 | probabilities /= np.sum(probabilities) 59 | 60 | sorted_indices = np.argsort(-probabilities) 61 | cumsum_probabilities = np.cumsum(probabilities[sorted_indices]) 62 | cutoff_index = np.searchsorted(cumsum_probabilities, max(top_p, cumsum_probabilities[0] + 1e-6)) 63 | temp = np.zeros_like(probabilities) 64 | temp[sorted_indices[:cutoff_index]] = probabilities[sorted_indices[:cutoff_index]] 65 | probabilities = temp / np.sum(temp) 66 | 67 | return np.random.choice(range(len(probabilities)), p = probabilities) 68 | 69 | 70 | def generate(self, input: str, max_length: int, chat_bot: bool = False, temperature: float = 1.0, 71 | top_p: float = 1.0, no_repeat: float = 0.0, verbose: bool = False, max_print_line_length = 0) -> str: 72 | 73 | self.model.eval() 74 | 75 | with torch.no_grad(): 76 | 77 | input = self.tokenizer.encode(input) 78 | 79 | if chat_bot: 80 | input = [self.tokenizer.start_of_text_token, *self.preprompt, self.tokenizer.user_token, *input, self.tokenizer.assistant_token] 81 | else: 82 | input = [self.tokenizer.start_of_text_token, *input] 83 | 84 | output = [] 85 | to_print = [] 86 | last_line_length = 0 87 | 88 | if not chat_bot: 89 | output = input[1:].copy() 90 | to_print = input[1:].copy() 91 | text = self.tokenizer.decode(to_print) 92 | last_line_length = len(text) - 1 - text.rfind('\n') 93 | 94 | for _ in range(max_length): 95 | 96 | index = self.sample(input, chat_bot, temperature, top_p, no_repeat) 97 | 98 | if index == self.tokenizer.end_of_text_token: 99 | break 100 | 101 | input.append(index) 102 | output.append(index) 103 | to_print.append(index) 104 | 105 | if verbose: 106 | 107 | text = self.tokenizer.decode(to_print) 108 | 109 | if '\n' in text: 110 | last_line_length = len(text) - 1 - text.rfind('\n') 111 | else: 112 | last_line_length += len(text) 113 | 114 | if max_print_line_length > 0 and last_line_length >= max_print_line_length and text.startswith(' '): 115 | print() 116 | text = text[1:] 117 | last_line_length = 0 118 | 119 | print(text, end = '') 120 | to_print = [] 121 | 122 | return 
self.tokenizer.decode(output) 123 | -------------------------------------------------------------------------------- /dimgpt/training/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/dimgpt/training/__init__.py -------------------------------------------------------------------------------- /dimgpt/training/datasets/__init__.py: -------------------------------------------------------------------------------- 1 | from .dataset import Dataset 2 | from .pretraining import PretrainingDataset 3 | from .finetuning import FinetuningDataset -------------------------------------------------------------------------------- /dimgpt/training/datasets/dataset.py: -------------------------------------------------------------------------------- 1 | from abc import ABC 2 | import torch 3 | 4 | from dimgpt.data.tokenizer import Tokenizer 5 | from dimgpt.settings import * 6 | 7 | 8 | class Dataset(ABC): 9 | 10 | def __init__(self, tokenizer: Tokenizer): 11 | 12 | self.tokenizer = tokenizer 13 | 14 | 15 | def train_size(self) -> int: 16 | 17 | pass 18 | 19 | 20 | def val_size(self) -> int: 21 | 22 | pass 23 | 24 | 25 | def _random_document(self, val: bool) -> tuple[list[int], list[int]]: 26 | 27 | pass 28 | 29 | 30 | def _get_tokens(self, val: bool) -> tuple[list[int], list[int]]: 31 | 32 | pass 33 | 34 | 35 | def _next(self, val: bool) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]: 36 | 37 | x = [] 38 | y = [] 39 | strengths = [] 40 | 41 | for _ in range(BATCH_SIZE): 42 | 43 | xy, strength = self._get_tokens(val) 44 | 45 | x.append(xy[0:MAX_CONTEXT]) 46 | y.append(xy[1:MAX_CONTEXT + 1]) 47 | strengths.append(strength[1:MAX_CONTEXT + 1]) 48 | 49 | x = torch.tensor(x, dtype = torch.long).pin_memory().to(DEVICE, non_blocking = True) 50 | y = torch.tensor(y, dtype = torch.long).pin_memory().to(DEVICE, non_blocking = True) 51 | strengths = torch.tensor(strengths, dtype = torch.float32).pin_memory().to(DEVICE, non_blocking = True) 52 | 53 | return x, y, strengths 54 | 55 | 56 | def next_train(self) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]: 57 | 58 | return self._next(False) 59 | 60 | 61 | def next_val(self) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]: 62 | 63 | return self._next(True) -------------------------------------------------------------------------------- /dimgpt/training/datasets/finetuning.py: -------------------------------------------------------------------------------- 1 | import os, random, pickle 2 | import torch 3 | import numpy as np 4 | 5 | from dimgpt.data.tokenizer import Tokenizer 6 | from dimgpt.settings import * 7 | from dimgpt.training.datasets import Dataset 8 | 9 | 10 | class FinetuningDataset(Dataset): 11 | 12 | def __init__(self, tokenizer: Tokenizer): 13 | 14 | self.tokenizer = tokenizer 15 | 16 | self.train_data = { 17 | 'human': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'human_conversations_train.pkl'), 'rb')), 18 | 'chatbot': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'chatbot_conversations_train.pkl'), 'rb')), 19 | 'dimension_gpt': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'dimension_gpt_conversations_train.pkl'), 'rb')) 20 | } 21 | 22 | self.val_data = pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'chatbot_conversations_val.pkl'), 'rb')) 23 | 24 | self.train_preprompts = { 25 | 'human': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'human_preprompts_train.pkl'), 
'rb')), 26 | 'chatbot': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'chatbot_preprompts_train.pkl'), 'rb')), 27 | 'dimension_gpt': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'dimension_gpt_preprompts_train.pkl'), 'rb')) 28 | } 29 | 30 | self.final_preprompt = [self.tokenizer.system_token, *self.tokenizer.encode(PREPROMPT)] 31 | 32 | self.preprompt_ratios = { 33 | 'human': HUMAN_PREPROMPT_RATIOS, 34 | 'chatbot': CHATBOT_PREPROMPT_RATIOS, 35 | 'dimension_gpt': DIMENSION_GPT_PREPROMPT_RATIOS 36 | } 37 | 38 | h = [len(i) for i in self.train_data['human']] 39 | c = [len(i) for i in self.train_data['chatbot']] 40 | d = [len(i) for i in self.train_data['dimension_gpt']] 41 | 42 | self.train_data_p = { 43 | 'human': (np.array(h) / np.sum(h)).tolist(), 44 | 'chatbot': (np.array(c) / np.sum(c)).tolist(), 45 | 'dimension_gpt': (np.array(d) / np.sum(d)).tolist() 46 | } 47 | 48 | print(sum(self.train_data_p['human'])) 49 | print(sum(self.train_data_p['chatbot'])) 50 | print(sum(self.train_data_p['dimension_gpt'])) 51 | 52 | v = [len(i) for i in self.val_data] 53 | 54 | self.val_data_p = (np.array(v) / np.sum(v)).tolist() 55 | 56 | self.train_ids = { 57 | 'human': list(range(len(self.train_data['human']))), 58 | 'chatbot': list(range(len(self.train_data['chatbot']))), 59 | 'dimension_gpt': list(range(len(self.train_data['dimension_gpt']))) 60 | } 61 | 62 | self.val_ids = list(range(len(self.val_data))) 63 | 64 | 65 | def train_size(self) -> int: 66 | 67 | return sum([sum([len(i) for i in self.train_data[key]]) for key in self.train_data]) 68 | 69 | 70 | def val_size(self) -> int: 71 | 72 | return sum([len(i) for i in self.val_data]) 73 | 74 | 75 | def __get_strength(self, doc: list[int], val: bool) -> list[int]: 76 | 77 | assistant = False 78 | instruction_loss_strength = 0.0 if val else INSTRUCTION_LOSS_STRENGTH 79 | strength = [] 80 | 81 | for token in doc: 82 | 83 | strength.append(1.0 if assistant else instruction_loss_strength) 84 | 85 | if token == self.tokenizer.user_token or token == self.tokenizer.end_of_text_token: 86 | assistant = False 87 | 88 | if token == self.tokenizer.assistant_token or token == self.tokenizer.human_token: 89 | assistant = True 90 | 91 | return strength 92 | 93 | 94 | def __get_document(self, val: bool, first: bool) -> tuple[list[int], list[int]]: 95 | 96 | if val: 97 | data_ids = self.val_ids 98 | data = self.val_data 99 | data_p = self.val_data_p 100 | 101 | else: 102 | data_split = np.random.choice(['human', 'chatbot', 'dimension_gpt'], p = SPLIT_RATIOS) 103 | data_ids = self.train_ids[data_split] 104 | data = self.train_data[data_split] 105 | data_p = self.train_data_p[data_split] 106 | 107 | if first: 108 | id = np.random.choice(data_ids, p = data_p) 109 | conversation = data[id] 110 | else: 111 | conversation = data[random.randint(0, len(data) - 1)] 112 | 113 | if val: 114 | xy = [self.tokenizer.start_of_text_token, *self.final_preprompt, *conversation, self.tokenizer.end_of_text_token] 115 | strength = self.__get_strength(xy, val) 116 | return xy, strength 117 | 118 | preprompt_ratio = self.preprompt_ratios[data_split] 119 | preprompt_split = np.random.choice(['human', 'chatbot', 'dimension_gpt', 'none'], p = preprompt_ratio) 120 | 121 | if preprompt_split != 'none': 122 | preprompt = self.train_preprompts[preprompt_split][random.randint(0, len(self.train_preprompts[preprompt_split]) - 1)] 123 | conversation = [*preprompt, *conversation] 124 | 125 | xy = [self.tokenizer.start_of_text_token, *conversation, self.tokenizer.end_of_text_token] 126 
| strength = self.__get_strength(xy, val) 127 | 128 | return xy, strength 129 | 130 | 131 | def _get_random_document(self, val: bool) -> tuple[list[int], list[int]]: 132 | 133 | xy, strength = self.__get_document(val, False) 134 | 135 | return xy, strength 136 | 137 | 138 | def _get_tokens(self, val: bool) -> tuple[torch.Tensor, torch.Tensor]: 139 | 140 | xy, strength = self.__get_document(val, True) 141 | 142 | i = random.randint(0, len(xy) - 1) 143 | xy = xy[i:] 144 | strength = strength[i:] 145 | 146 | while len(xy) < MAX_CONTEXT + 1: 147 | 148 | _xy, _strength = self._get_random_document(val) 149 | 150 | xy.extend(_xy) 151 | strength.extend(_strength) 152 | 153 | xy = xy[0:MAX_CONTEXT + 1] 154 | strength = strength[0:MAX_CONTEXT + 1] 155 | 156 | return xy, strength -------------------------------------------------------------------------------- /dimgpt/training/datasets/pretraining.py: -------------------------------------------------------------------------------- 1 | import os, random, pickle 2 | import torch 3 | import numpy as np 4 | 5 | from dimgpt.data.tokenizer import Tokenizer 6 | from dimgpt.settings import * 7 | from dimgpt.training.datasets import Dataset 8 | 9 | 10 | class PretrainingDataset(Dataset): 11 | 12 | def __init__(self, tokenizer: Tokenizer): 13 | 14 | super().__init__(tokenizer) 15 | 16 | datasets = os.listdir(os.path.join(DATA_DIR, 'pretraining')) 17 | self.datasets = [] 18 | 19 | for dataset in datasets: 20 | 21 | if not os.path.isdir(os.path.join(DATA_DIR, 'pretraining', dataset)): 22 | continue 23 | 24 | meta = pickle.load(open(os.path.join(DATA_DIR, 'pretraining', dataset, f'metadata.pkl'), 'rb')) 25 | 26 | self.datasets.append({ 27 | 'train': { 28 | 'data': np.memmap(os.path.join(DATA_DIR, 'pretraining', dataset, f'train.bin'), dtype = np.uint16, mode = 'r'), 29 | 'ids': pickle.load(open(os.path.join(DATA_DIR, 'pretraining', dataset, f'train_ids.pkl'), 'rb')), 30 | 'size': meta['size']['train'] 31 | }, 32 | 'val': { 33 | 'data': np.memmap(os.path.join(DATA_DIR, 'pretraining', dataset, f'val.bin'), dtype = np.uint16, mode = 'r'), 34 | 'ids': pickle.load(open(os.path.join(DATA_DIR, 'pretraining', dataset, f'val_ids.pkl'), 'rb')), 35 | 'size': meta['size']['val'] 36 | }, 37 | 'training_part': meta['training_part'], 38 | 'name': meta['name'], 39 | 'multiplier': meta['multiplier'] 40 | }) 41 | 42 | self.probas = [dataset['train']['size'] * dataset['multiplier'] for dataset in self.datasets] 43 | self.probas = (np.array(self.probas) / np.sum(self.probas)).tolist() 44 | 45 | 46 | def train_size(self) -> int: 47 | 48 | return sum([dataset['train']['size'] for dataset in self.datasets]) 49 | 50 | 51 | def val_size(self) -> int: 52 | 53 | return sum([dataset['val']['size'] for dataset in self.datasets]) 54 | 55 | 56 | def _get_random_document(self, val: bool) -> tuple[list[int], list[int]]: 57 | 58 | dataset = np.random.choice(self.datasets, p = self.probas) 59 | ids = dataset['val']['ids'] if val else dataset['train']['ids'] 60 | data = dataset['val']['data'] if val else dataset['train']['data'] 61 | 62 | i = random.randint(0, len(ids) - 1) 63 | xy = data[ids[i]['start']:ids[i]['start'] + ids[i]['size']] 64 | strength = [1.0] * ids[i]['size'] 65 | 66 | return xy, strength 67 | 68 | 69 | def _get_tokens(self, val: bool) -> tuple[torch.Tensor, torch.Tensor]: 70 | 71 | dataset = np.random.choice(self.datasets, p = self.probas) 72 | data = dataset['val']['data'] if val else dataset['train']['data'] 73 | 74 | start = random.randint(0, len(data) - 1 - (MAX_CONTEXT + 1)) 
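# Copy up to MAX_CONTEXT + 1 tokens starting at the random offset, stopping early at an end-of-text token;
# if the window ends early, it is topped up with extra random documents and truncated to MAX_CONTEXT + 1.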
75 | xy = [] 76 | 77 | for i in range(MAX_CONTEXT + 1): 78 | 79 | token = data[start + i] 80 | xy.append(token) 81 | 82 | if token == self.tokenizer.end_of_text_token: 83 | break 84 | 85 | strength = [1.0] * len(xy) 86 | 87 | while len(xy) < MAX_CONTEXT + 1: 88 | 89 | _xy, _strength = self._get_random_document(val) 90 | 91 | xy.extend(_xy) 92 | strength.extend(_strength) 93 | 94 | xy = xy[0:MAX_CONTEXT + 1] 95 | strength = strength[0:MAX_CONTEXT + 1] 96 | 97 | return xy, strength -------------------------------------------------------------------------------- /dimgpt/training/layers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from torch import nn 4 | 5 | from dimgpt.settings import * 6 | 7 | 8 | # Base class for all layers 9 | class Module(nn.Module): 10 | 11 | # Give the number of parameters of the module 12 | def nb_parameters(self) -> int: 13 | 14 | return sum([np.prod(p.size(), dtype = np.int32) for p in self.parameters()]) 15 | 16 | 17 | # Give the number of trainable parameters of the module 18 | def nb_trainable_parameters(self) -> int: 19 | 20 | return sum([np.prod(p.size(), dtype = np.int32) for p in self.parameters() if p.requires_grad]) 21 | 22 | 23 | # Give the number of non-trainable parameters of the module 24 | def nb_non_trainable_parameters(self) -> int: 25 | 26 | return sum([np.prod(p.size(), dtype = np.int32) for p in self.parameters() if not p.requires_grad]) 27 | 28 | 29 | # Summarize the module 30 | def summary(self) -> None: 31 | 32 | print(f'Number of parameters: {self.nb_parameters():,}') 33 | print(f'Number of trainable parameters: {self.nb_trainable_parameters():,}') 34 | print(f'Number of non-trainable parameters: {self.nb_non_trainable_parameters():,}') 35 | 36 | 37 | # Remove NaNs from the module gradients 38 | def clean_nan(self) -> None: 39 | 40 | for p in self.parameters(): 41 | if p.grad is not None: 42 | torch.nan_to_num(p.grad, nan = 0, posinf = 1e5, neginf = -1e5, out = p.grad) 43 | 44 | 45 | # Clip the module gradients 46 | def clip_gradient(self, max_norm: float) -> None: 47 | 48 | nn.utils.clip_grad_norm_(self.parameters(), max_norm) 49 | 50 | 51 | class Linear(nn.Linear): 52 | 53 | def __init__(self, in_features: int, out_features: int, **kwargs): 54 | 55 | super().__init__(in_features, out_features, False, **kwargs) 56 | nn.init.normal_(self.weight, mean = 0.0, std = INIT_STDDEV) 57 | 58 | 59 | class LayerNorm(Module): 60 | 61 | def __init__(self, shape: int, epsilon: float = 1e-5, **kwargs): 62 | 63 | super().__init__(**kwargs) 64 | 65 | self.shape = (shape,) 66 | self.weight = nn.Parameter(torch.ones(shape)) 67 | self.epsilon = epsilon 68 | 69 | 70 | def _normalize(self, x: torch.Tensor): 71 | 72 | return x * torch.rsqrt(x.pow(2).mean(-1, keepdim = True) + self.epsilon) 73 | 74 | 75 | def forward(self, x: torch.Tensor): 76 | 77 | return self._normalize(x.float()).type_as(x) * self.weight 78 | 79 | 80 | class Embedding(nn.Embedding): 81 | 82 | def __init__(self, num_embeddings: int, embedding_dim: int, **kwargs): 83 | 84 | super().__init__(num_embeddings, embedding_dim, padding_idx = PADDING_TOKEN, **kwargs) 85 | nn.init.normal_(self.weight, mean = 0.0, std = INIT_STDDEV) 86 | 87 | -------------------------------------------------------------------------------- /dimgpt/training/model.py: -------------------------------------------------------------------------------- 1 | import torch, math 2 | from torch import nn 3 | from flash_attn import flash_attn_func 4 | 5 | 
from dimgpt.training.layers import * 6 | from dimgpt.settings import * 7 | from dimgpt.training.rope import * 8 | 9 | 10 | class AttentionBlock(Module): 11 | 12 | def __init__(self, **kwargs): 13 | 14 | super().__init__(**kwargs) 15 | 16 | self.query = Linear(EMBEDDING_DIM, NUM_HEADS * HEAD_DIM) 17 | self.key = Linear(EMBEDDING_DIM, NUM_GROUPED_HEADS * HEAD_DIM) 18 | self.value = Linear(EMBEDDING_DIM, NUM_GROUPED_HEADS * HEAD_DIM) 19 | 20 | self.projection = Linear(NUM_HEADS * HEAD_DIM, EMBEDDING_DIM) 21 | nn.init.normal_(self.projection.weight, mean = 0.0, std = INIT_STDDEV / math.sqrt(2 * NUM_BLOCKS)) 22 | 23 | self.residual_dropout = nn.Dropout(DROPOUT) 24 | 25 | 26 | def forward(self, x: torch.Tensor, rope_frequencies: torch.Tensor) -> torch.Tensor: 27 | 28 | batch_size, context_size, _ = x.shape 29 | 30 | q = self.query(x) 31 | k = self.key(x) 32 | v = self.value(x) 33 | 34 | q = q.view(batch_size, context_size, NUM_HEADS, HEAD_DIM) 35 | k = k.view(batch_size, context_size, NUM_GROUPED_HEADS, HEAD_DIM) 36 | v = v.view(batch_size, context_size, NUM_GROUPED_HEADS, HEAD_DIM) 37 | 38 | q, k = rotary_position_embedding(q, k, rope_frequencies) 39 | 40 | k = torch.repeat_interleave(k, repeats = NUM_HEADS // NUM_GROUPED_HEADS, dim = 2) 41 | v = torch.repeat_interleave(v, repeats = NUM_HEADS // NUM_GROUPED_HEADS, dim = 2) 42 | 43 | x = flash_attn_func(q, k, v, dropout_p = DROPOUT if self.training else 0, causal = True, window_size = (WINDOW_SIZE, 0)) 44 | 45 | x = x.view(batch_size, context_size, NUM_HEADS * HEAD_DIM) 46 | 47 | return self.residual_dropout(self.projection(x)) 48 | 49 | 50 | class FeedForward(Module): 51 | 52 | def __init__(self, **kwargs): 53 | 54 | super().__init__(**kwargs) 55 | 56 | self.linear_1 = Linear(EMBEDDING_DIM, FFN_DIM) 57 | self.linear_2 = Linear(EMBEDDING_DIM, FFN_DIM) 58 | self.linear_3 = Linear(FFN_DIM, EMBEDDING_DIM) 59 | self.activation = nn.SiLU() 60 | self.dropout = nn.Dropout(DROPOUT) 61 | 62 | 63 | def forward(self, x: torch.Tensor) -> torch.Tensor: 64 | 65 | x = self.activation(self.linear_1(x)) * self.linear_2(x) 66 | x = self.dropout(self.linear_3(x)) 67 | 68 | return x 69 | 70 | 71 | # Model block 72 | class TransformerBlock(Module): 73 | 74 | def __init__(self, **kwargs): 75 | 76 | super().__init__(**kwargs) 77 | 78 | self.norm_1 = LayerNorm(EMBEDDING_DIM) 79 | self.attention = AttentionBlock() 80 | self.norm_2 = LayerNorm(EMBEDDING_DIM) 81 | self.feed_forward = FeedForward() 82 | 83 | 84 | def forward(self, x: torch.Tensor, rope_frequencies: torch.Tensor) -> torch.Tensor: 85 | 86 | x = x + self.attention(self.norm_1(x), rope_frequencies) 87 | x = x + self.feed_forward(self.norm_2(x)) 88 | 89 | return x 90 | 91 | 92 | # Model 93 | class Model(Module): 94 | 95 | def __init__(self, **kwargs): 96 | 97 | super().__init__(**kwargs) 98 | 99 | self.token_embedding = Embedding(VOCAB_SIZE, EMBEDDING_DIM) 100 | self.rope_frequencies = create_rope_frequencies(HEAD_DIM, MAX_CONTEXT) 101 | self.init_dropout = nn.Dropout(DROPOUT) 102 | self.blocks = nn.ModuleList([TransformerBlock() for _ in range(NUM_BLOCKS)]) 103 | self.final_norm = LayerNorm(EMBEDDING_DIM) 104 | self.final_linear = Linear(EMBEDDING_DIM, VOCAB_SIZE) 105 | self.token_embedding.weight = self.final_linear.weight 106 | 107 | 108 | def forward(self, input: torch.Tensor, only_last: bool = False) -> torch.Tensor: 109 | 110 | if input.shape[1] > MAX_CONTEXT: 111 | input = input[:, -MAX_CONTEXT:] 112 | 113 | rope_frequencies = self.rope_frequencies[:input.shape[1]] 114 | rope_frequencies = 
rope_frequencies[None, :, None, :] 115 | 116 | x = self.token_embedding(input) 117 | x = self.init_dropout(x) 118 | 119 | for block in self.blocks: 120 | x = block(x, rope_frequencies) 121 | 122 | x = self.final_norm(x) 123 | 124 | if only_last: 125 | return self.final_linear(x[:, -1]) 126 | 127 | return self.final_linear(x) 128 | -------------------------------------------------------------------------------- /dimgpt/training/optimizer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | 4 | from dimgpt.settings import * 5 | 6 | 7 | class AdamW(torch.optim.AdamW): 8 | 9 | def __init__(self, params: list[nn.Parameter], learning_rate: float, **kwargs): 10 | 11 | decay_params = [p for p in params if p.requires_grad and p.dim() >= 2] 12 | other_params = [p for p in params if p.requires_grad and p.dim() < 2] 13 | 14 | groups = [ 15 | {'params': decay_params, 'weight_decay': WEIGHT_DECAY}, 16 | {'params': other_params, 'weight_decay': 0.0} 17 | ] 18 | 19 | super().__init__( 20 | groups, 21 | lr = learning_rate, 22 | betas = (BETA_1, BETA_2), 23 | eps = EPSILON, 24 | fused = GPU_ENABLED, 25 | **kwargs 26 | ) 27 | -------------------------------------------------------------------------------- /dimgpt/training/rope.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | from dimgpt.settings import * 4 | 5 | 6 | def create_rope_frequencies(dim: int, max_length: int, theta: float = ROPE_THETA) -> torch.Tensor: 7 | 8 | frequencies = 1.0 / (theta ** (torch.arange(0, dim, 2, device = DEVICE)[:(dim // 2)].float() / dim)) 9 | t = torch.arange(max_length, device = DEVICE) 10 | frequencies = torch.outer(t, frequencies).float() 11 | 12 | return torch.polar(torch.ones_like(frequencies, device = DEVICE), frequencies) 13 | 14 | 15 | def rotary_position_embedding(q: torch.Tensor, k: torch.Tensor, frequencies: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]: 16 | 17 | q_complex = torch.view_as_complex(q.float().reshape(*q.shape[:-1], -1, 2)) 18 | k_complex = torch.view_as_complex(k.float().reshape(*k.shape[:-1], -1, 2)) 19 | 20 | q_out = torch.view_as_real(q_complex * frequencies).flatten(3) 21 | k_out = torch.view_as_real(k_complex * frequencies).flatten(3) 22 | 23 | return q_out.type_as(q), k_out.type_as(k) -------------------------------------------------------------------------------- /dimgpt/training/trainer.py: -------------------------------------------------------------------------------- 1 | import os, pickle, math, time 2 | import torch 3 | from torch import nn 4 | 5 | from dimgpt.settings import * 6 | from dimgpt.training.datasets import Dataset 7 | from dimgpt.training.model import Model 8 | from dimgpt.training.optimizer import AdamW 9 | 10 | 11 | class Trainer(): 12 | 13 | def __init__(self, model: Model, dataset: Dataset): 14 | 15 | self.model = model 16 | model.train() 17 | 18 | self.dataset = dataset 19 | 20 | self.time = None 21 | self.step = 0 22 | self.tokens = 0 23 | self.epochs = 0.0 24 | self.learning_rate = 0.0 25 | self.loss = 0.0 26 | self.accuracy = 0.0 27 | self.val_loss = 0.0 28 | self.val_accuracy = 0.0 29 | self.loss_ema = None 30 | self.accuracy_ema = None 31 | self.best_val_loss = float('inf') 32 | 33 | self.optimizer = AdamW(self.model.parameters(), self.learning_rate) 34 | 35 | self.metrics_history = { 36 | 'time': [], 37 | 'step': [], 38 | 'tokens': [], 39 | 'epochs': [], 40 | 'loss': [], 41 | 'accuracy': [], 42 | 'val_loss': [], 43 | 
'val_accuracy': [] 44 | } 45 | 46 | 47 | # Save the models 48 | def save_model(self, path: str) -> None: 49 | 50 | if not os.path.exists(path): 51 | os.makedirs(path) 52 | 53 | torch.save(self.model.state_dict(), os.path.join(path, 'model.pt')) 54 | torch.save(self.optimizer.state_dict(), os.path.join(path, 'optimizer.pt')) 55 | 56 | 57 | # Load the models 58 | def load_model(self, path) -> None: 59 | 60 | if not os.path.exists(path): 61 | return 62 | 63 | self.model.load_state_dict(torch.load(os.path.join(path, 'model.pt'), map_location = DEVICE)) 64 | self.optimizer.load_state_dict(torch.load(os.path.join(path, 'optimizer.pt'), map_location = DEVICE)) 65 | 66 | 67 | # Find previous session 68 | def find_previous_session(self) -> None: 69 | 70 | if os.path.exists(os.path.join(OUTPUT_DIR, 'last')): 71 | self.load_model(os.path.join(OUTPUT_DIR, 'last')) 72 | 73 | if os.path.exists(os.path.join(OUTPUT_DIR, 'metrics.pkl')): 74 | self.load_metrics() 75 | 76 | 77 | # Print 78 | def print(self) -> None: 79 | 80 | print(f'Epochs: {self.epochs:.4f} | Steps: {self.step:,} | Tokens: {self.tokens:,} | LR: {self.learning_rate:.5f} || ' \ 81 | f'Loss: {self.loss_ema:.5f} | Accuracy: {self.accuracy_ema * 100.0:.4f} % | ' \ 82 | f'Val loss: {self.val_loss:.5f} | Val accuracy: {self.val_accuracy * 100.0:.4f} % ', end = '\r') 83 | 84 | 85 | # Save metrics 86 | def save_metrics(self) -> None: 87 | 88 | if self.time is None: 89 | self.metrics_history["time"].append(0.0) 90 | else: 91 | self.metrics_history["time"].append(self.metrics_history["time"][-1] + (time.time() - self.time)) 92 | 93 | self.time = time.time() 94 | 95 | self.metrics_history["step"].append(self.step) 96 | self.metrics_history["tokens"].append(self.tokens) 97 | self.metrics_history["epochs"].append(self.epochs) 98 | self.metrics_history["loss"].append(self.loss_ema) 99 | self.metrics_history["accuracy"].append(self.accuracy_ema) 100 | self.metrics_history["val_loss"].append(self.val_loss) 101 | self.metrics_history["val_accuracy"].append(self.val_accuracy) 102 | 103 | if not os.path.exists(OUTPUT_DIR): 104 | os.makedirs(OUTPUT_DIR) 105 | 106 | pickle.dump(self.metrics_history, open(os.path.join(OUTPUT_DIR, 'metrics.pkl'), 'wb')) 107 | 108 | 109 | # Load metrics 110 | def load_metrics(self) -> None: 111 | 112 | self.metrics_history = pickle.load(open(os.path.join(OUTPUT_DIR, 'metrics.pkl'), 'rb')) 113 | 114 | self.step = self.metrics_history["step"][-1] 115 | self.tokens = self.metrics_history["tokens"][-1] 116 | self.epochs = self.metrics_history["epochs"][-1] 117 | self.loss_ema = self.metrics_history["loss"][-1] 118 | self.accuracy_ema = self.metrics_history["accuracy"][-1] 119 | self.val_loss = self.metrics_history["val_loss"][-1] 120 | self.val_accuracy = self.metrics_history["val_accuracy"][-1] 121 | self.best_val_loss = min(self.metrics_history["val_loss"]) 122 | self.time = time.time() 123 | 124 | 125 | # Update learning rate 126 | def update_learning_rate(self) -> None: 127 | 128 | if self.step < WARMUP_STEPS: 129 | ratio = self.step / WARMUP_STEPS 130 | self.learning_rate = MAX_LEARNING_RATE * ratio 131 | elif self.step < WARMUP_STEPS + DECAY_STEPS: 132 | ratio = (self.step - WARMUP_STEPS) / DECAY_STEPS 133 | ratio = 0.5 * (1.0 + math.cos(math.pi * ratio)) 134 | self.learning_rate = ratio * (MAX_LEARNING_RATE - MIN_LEARNING_RATE) + MIN_LEARNING_RATE 135 | else: 136 | self.learning_rate = MIN_LEARNING_RATE 137 | 138 | for g in self.optimizer.param_groups: 139 | g['lr'] = self.learning_rate 140 | 141 | 142 | def apply_ema(self, 
value_1: float, value_2: float) -> float: 143 | 144 | if value_1 is None: 145 | return value_2 146 | 147 | return value_1 * METRICS_BETA + value_2 * (1.0 - METRICS_BETA) 148 | 149 | 150 | # Train the model 151 | def train(self) -> None: 152 | 153 | # Training loop 154 | while True: 155 | 156 | # Update step 157 | self.step += 1 158 | self.tokens += (MAX_CONTEXT + 1) * BATCH_SIZE * NUM_ACCUMULATIONS 159 | self.epochs += ((MAX_CONTEXT + 1) * BATCH_SIZE * NUM_ACCUMULATIONS) / self.dataset.train_size() 160 | 161 | # Update learning rate 162 | self.update_learning_rate() 163 | 164 | # ----- Training ----- # 165 | 166 | self.model.train() 167 | self.loss = 0.0 168 | self.accuracy = 0.0 169 | 170 | # First load data (asyncronous) 171 | x, y, strength = self.dataset.next_train() 172 | 173 | for i in range(NUM_ACCUMULATIONS): 174 | 175 | with CONTEXT: 176 | 177 | # Forward pass 178 | prediction = self.model(x) 179 | 180 | # Loss 181 | loss = nn.functional.cross_entropy( 182 | input = prediction.reshape(-1, prediction.shape[-1]), 183 | target = y.reshape(-1), 184 | ignore_index = PADDING_TOKEN, 185 | reduction = 'none' 186 | ) 187 | loss = ((loss * strength.reshape(-1)).sum() / (strength.sum() + 1e-8)) / NUM_ACCUMULATIONS 188 | self.loss += loss.item() 189 | 190 | # Accuracy 191 | accuracy = (prediction.argmax(dim = 2) == y).to(dtype = torch.float32) 192 | self.accuracy += (((accuracy * strength).sum() / (strength.sum() + 1e-8)) / NUM_ACCUMULATIONS).item() 193 | 194 | # Next load data (asyncronous) 195 | if i < NUM_ACCUMULATIONS - 1: 196 | x, y, strength = self.dataset.next_train() 197 | 198 | # Backward pass 199 | loss.backward() 200 | 201 | # Update weights 202 | self.model.clean_nan() 203 | self.model.clip_gradient(CLIP_GRADIENT) 204 | self.optimizer.step() 205 | self.optimizer.zero_grad(set_to_none = True) 206 | 207 | # Update ema values 208 | self.loss_ema = self.apply_ema(self.loss_ema, self.loss) 209 | self.accuracy_ema = self.apply_ema(self.accuracy_ema, self.accuracy) 210 | 211 | # ----- Validations ----- # 212 | 213 | if self.step % VAL_INTERVAL == 0: 214 | 215 | self.model.eval() 216 | 217 | with torch.no_grad(): 218 | 219 | self.val_loss = 0.0 220 | self.val_accuracy = 0.0 221 | 222 | for _ in range(NUM_ACCUMULATIONS): 223 | 224 | # Load data 225 | x, y, strength = self.dataset.next_val() 226 | 227 | with CONTEXT: 228 | 229 | # Forward pass 230 | prediction = self.model(x) 231 | 232 | # Loss 233 | loss = nn.functional.cross_entropy( 234 | input = prediction.reshape(-1, prediction.shape[-1]), 235 | target = y.reshape(-1), 236 | ignore_index = PADDING_TOKEN, 237 | reduction = 'none' 238 | ) 239 | self.val_loss += (((loss * strength.reshape(-1)).sum() / (strength.sum() + 1e-8)) / NUM_ACCUMULATIONS).item() 240 | 241 | # Accuracy 242 | accuracy = (prediction.argmax(dim = 2) == y).to(dtype = torch.float32) 243 | self.val_accuracy += (((accuracy * strength).sum() / (strength.sum() + 1e-8)) / NUM_ACCUMULATIONS).item() 244 | 245 | # Save 246 | self.save_metrics() 247 | self.save_model(os.path.join(OUTPUT_DIR, 'last')) 248 | 249 | # Save best 250 | if self.val_loss <= self.best_val_loss: 251 | self.best_val_loss = self.val_loss 252 | self.save_model(os.path.join(OUTPUT_DIR, 'best')) 253 | 254 | # -------------------- # 255 | 256 | # Print 257 | self.print() 258 | -------------------------------------------------------------------------------- /dimgpt/utils.py: -------------------------------------------------------------------------------- 1 | import random, platform, psutil, time 2 | import 
datetime as dt 3 | import numpy as np 4 | import torch 5 | from sys import exit 6 | 7 | from dimgpt.settings import * 8 | 9 | 10 | # Reset the random seed 11 | def reset_rand() -> None: 12 | 13 | now = dt.datetime.now() 14 | milliseconds_since_midnight = (now.hour * 3600 + now.minute * 60 + now.second) * 1000 + now.microsecond // 1000 15 | random.seed(milliseconds_since_midnight) 16 | np.random.seed(milliseconds_since_midnight) 17 | torch.manual_seed(milliseconds_since_midnight) 18 | 19 | 20 | # Check if there is a GPU available 21 | def check_gpu() -> None: 22 | 23 | if GPU_ENABLED: 24 | torch.cuda.empty_cache() 25 | nb_gpu = torch.cuda.device_count() 26 | memory = torch.cuda.mem_get_info()[0] / 1024 ** 3 27 | print(f'{nb_gpu} GPU {"are" if nb_gpu > 1 else "is"} available! Using GPU: "{torch.cuda.get_device_name()}" ({memory:.2f} GB available)') 28 | 29 | else: 30 | memory = psutil.virtual_memory().available / 1024 ** 3 31 | print(f'No GPU available... Using CPU: "{platform.processor()}" ({memory:.2f} GB available)') 32 | 33 | 34 | def save_text_array(array: list[str], path: str) -> None: 35 | 36 | with open(path, 'w', encoding = 'utf-8') as f: 37 | 38 | f.truncate(0) 39 | 40 | for i in range(len(array)): 41 | 42 | f.write(array[i]) 43 | 44 | if i != len(array) - 1: 45 | f.write('\n') 46 | 47 | 48 | def load_text_array(path: str) -> list[str]: 49 | 50 | with open(path, 'r', encoding = 'utf-8') as f: 51 | 52 | return f.read().split('\n') 53 | 54 | 55 | def split_keep(text: str, delimiter: str) -> list[str]: 56 | 57 | words = text.split(delimiter) 58 | 59 | temp = [] 60 | 61 | for i in range(len(words) - 1): 62 | temp.extend([words[i], delimiter]) 63 | 64 | temp.append(words[-1]) 65 | 66 | return temp 67 | 68 | 69 | class Timer: 70 | 71 | def __init__(self, wait_steps: int = 0, num_steps: int = 1, exit_on_end: bool = False): 72 | 73 | self.wait_steps = wait_steps 74 | self.num_steps = num_steps 75 | self.exit_on_end = exit_on_end 76 | self.times = [0.0] * num_steps 77 | self.wait_step = 0 78 | self.step = 0 79 | 80 | 81 | def __enter__(self): 82 | 83 | if self.wait_step < self.wait_steps: 84 | return 85 | 86 | self.times[self.step] = time.time() 87 | 88 | 89 | def __exit__(self, exc_type, exc_value, traceback): 90 | 91 | if self.wait_step < self.wait_steps: 92 | self.wait_step += 1 93 | return 94 | 95 | self.times[self.step] = time.time() - self.times[self.step] 96 | self.step += 1 97 | 98 | if self.step >= self.num_steps: 99 | 100 | print(f'\nDuration: {sum(self.times) / self.num_steps:.2f}s') 101 | 102 | if self.exit_on_end: 103 | exit(0) -------------------------------------------------------------------------------- /models/README.md: -------------------------------------------------------------------------------- 1 | # 🎛️ Trained weights 2 | 3 | The trained weights of the different models are available on [**Google Drive**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m), you just need to: 4 | 5 | * Download the `.pt` file of the model you want to use and put it in this folder 6 | * Download the `vocab.txt` file and put it in the `data` folder 7 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | torch 2 | flash-attn 3 | datasets 4 | tokenizers 5 | unidecode 6 | regex 7 | tqdm 8 | psutil -------------------------------------------------------------------------------- /resources/misc/accuracy.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/accuracy.png -------------------------------------------------------------------------------- /resources/misc/loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/loss.png -------------------------------------------------------------------------------- /resources/misc/test_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_1.png -------------------------------------------------------------------------------- /resources/misc/test_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_10.png -------------------------------------------------------------------------------- /resources/misc/test_11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_11.png -------------------------------------------------------------------------------- /resources/misc/test_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_2.png -------------------------------------------------------------------------------- /resources/misc/test_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_3.png -------------------------------------------------------------------------------- /resources/misc/test_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_4.png -------------------------------------------------------------------------------- /resources/misc/test_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_5.png -------------------------------------------------------------------------------- /resources/misc/test_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_6.png -------------------------------------------------------------------------------- /resources/misc/test_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_7.png -------------------------------------------------------------------------------- /resources/misc/test_8.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_8.png -------------------------------------------------------------------------------- /resources/misc/test_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_9.png -------------------------------------------------------------------------------- /resources/misc/thumbnail.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/thumbnail.png -------------------------------------------------------------------------------- /testing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Testing" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Imports" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import os\n", 24 | "\n", 25 | "from dimgpt import utils\n", 26 | "from dimgpt.testing.sampling import *\n", 27 | "from dimgpt.data.tokenizer import *\n", 28 | "from dimgpt.settings import *\n", 29 | "\n", 30 | "utils.reset_rand()" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "### Check GPU" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "utils.check_gpu()" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "### Tokenizer" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "tokenizer = Tokenizer()\n", 63 | "\n", 64 | "print(f'Vocab size: {len(tokenizer.vocab):,}\\n')\n", 65 | "\n", 66 | "for v in tokenizer.vocab:\n", 67 | "\tprint(f'[{v}]', end = ' ')" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### Model" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "MODEL_PATH = './models/DimensionGPT-0.2B-Chat.pt'\n", 84 | "\n", 85 | "model = Model().to(DEVICE)\n", 86 | "model.load_state_dict(torch.load(MODEL_PATH, map_location = DEVICE))\n", 87 | "model.summary()" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "### Testing" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "sampler = Sampler(model, tokenizer)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "_ = sampler.generate(\n", 113 | "\tinput = \"Bonjour\",\n", 114 | "\tmax_length = 512,\n", 115 | "\tchat_bot = True,\n", 116 | "\ttemperature = 0.5,\n", 117 | "\ttop_p = 0.9,\n", 118 | "\tno_repeat = 1.0,\n", 119 | "\tverbose = True,\n", 120 | "\tmax_print_line_length = 150\n", 121 | ")" 122 | ] 123 | } 124 | ], 125 | "metadata": { 126 
| "kernelspec": { 127 | "display_name": "venv", 128 | "language": "python", 129 | "name": "python3" 130 | }, 131 | "language_info": { 132 | "codemirror_mode": { 133 | "name": "ipython", 134 | "version": 3 135 | }, 136 | "file_extension": ".py", 137 | "mimetype": "text/x-python", 138 | "name": "python", 139 | "nbconvert_exporter": "python", 140 | "pygments_lexer": "ipython3", 141 | "version": "3.11.8" 142 | }, 143 | "orig_nbformat": 4 144 | }, 145 | "nbformat": 4, 146 | "nbformat_minor": 2 147 | } 148 | -------------------------------------------------------------------------------- /training.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": {}, 7 | "source": [ 8 | "# Training" 9 | ] 10 | }, 11 | { 12 | "attachments": {}, 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "### Imports" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "from dimgpt import utils\n", 26 | "from dimgpt.training.datasets import *\n", 27 | "from dimgpt.training.model import Model\n", 28 | "from dimgpt.training.trainer import Trainer\n", 29 | "from dimgpt.data.tokenizer import *\n", 30 | "from dimgpt.settings import *\n", 31 | "\n", 32 | "utils.reset_rand()" 33 | ] 34 | }, 35 | { 36 | "attachments": {}, 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "### Check GPU" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "utils.check_gpu()" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "### Tokenizer" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "tokenizer = Tokenizer()\n", 66 | "\n", 67 | "print(f'Vocab size: {len(tokenizer.vocab):,}\\n')\n", 68 | "\n", 69 | "for v in tokenizer.vocab:\n", 70 | "\tprint(f'[{v}]', end = ' ')" 71 | ] 72 | }, 73 | { 74 | "attachments": {}, 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### Dataset" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "dataset = PretrainingDataset(tokenizer)\n", 88 | "#dataset = FinetuningDataset(tokenizer)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "x, y, strength = dataset.next_val()\n", 98 | "\n", 99 | "print(f'Batch shape: {tuple(x.shape)}\\n')\n", 100 | "\n", 101 | "print(tokenizer.decode(x[0]))\n", 102 | "\n", 103 | "del x, y, strength" 104 | ] 105 | }, 106 | { 107 | "attachments": {}, 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### Model" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "model = Model().to(DEVICE)\n", 121 | "model.summary()" 122 | ] 123 | }, 124 | { 125 | "attachments": {}, 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "### Training" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "trainer = Trainer(model, dataset)\n", 139 | "trainer.find_previous_session()\n", 
140 | "\n", 141 | "trainer.train()" 142 | ] 143 | } 144 | ], 145 | "metadata": { 146 | "kernelspec": { 147 | "display_name": "venv", 148 | "language": "python", 149 | "name": "python3" 150 | }, 151 | "language_info": { 152 | "codemirror_mode": { 153 | "name": "ipython", 154 | "version": 3 155 | }, 156 | "file_extension": ".py", 157 | "mimetype": "text/x-python", 158 | "name": "python", 159 | "nbconvert_exporter": "python", 160 | "pygments_lexer": "ipython3", 161 | "version": "3.11.8" 162 | }, 163 | "orig_nbformat": 4 164 | }, 165 | "nbformat": 4, 166 | "nbformat_minor": 2 167 | } 168 | --------------------------------------------------------------------------------