├── .gitignore
├── LICENSE.md
├── README.md
├── create_data.ipynb
├── data
└── README.md
├── dimgpt
├── __init__.py
├── data
│ ├── __init__.py
│ ├── clean.py
│ ├── datasets
│ │ ├── __init__.py
│ │ ├── dataset.py
│ │ └── pretraining
│ │ │ ├── __init__.py
│ │ │ ├── books.py
│ │ │ ├── common_crawl.py
│ │ │ ├── institutions.py
│ │ │ ├── news.py
│ │ │ ├── others.py
│ │ │ └── wikipedia.py
│ ├── finetuning.py
│ ├── pretokenizer.py
│ ├── pretraining.py
│ └── tokenizer.py
├── settings.py
├── testing
│ ├── __init__.py
│ └── sampling.py
├── training
│ ├── __init__.py
│ ├── datasets
│ │ ├── __init__.py
│ │ ├── dataset.py
│ │ ├── finetuning.py
│ │ └── pretraining.py
│ ├── layers.py
│ ├── model.py
│ ├── optimizer.py
│ ├── rope.py
│ └── trainer.py
└── utils.py
├── models
└── README.md
├── requirements.txt
├── resources
└── misc
│ ├── accuracy.png
│ ├── loss.png
│ ├── test_1.png
│ ├── test_10.png
│ ├── test_11.png
│ ├── test_2.png
│ ├── test_3.png
│ ├── test_4.png
│ ├── test_5.png
│ ├── test_6.png
│ ├── test_7.png
│ ├── test_8.png
│ ├── test_9.png
│ └── thumbnail.png
├── testing.ipynb
└── training.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | .vscode
2 | /bacup*
3 | /venv
4 | /data/*
5 | !/data/README.md
6 | /models/*
7 | !/models/README.md
8 | /output
9 | notes.txt
10 | final_SPF.xml
11 | __pycache__
12 | .DS_Store
13 | env.py
14 | /test*.ipynb
15 | /show*.ipynb
16 | /test.txt
17 | /*.whl
18 | /validate.ipynb
19 | /dimgpt/testing/tester.py
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024 Angel Uriot
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # 💬 Language model
2 |
3 | 
4 | 
5 | 
6 | 
7 | 
8 |
9 |
10 |
11 | This repository contains the code to train and test autoregressive language models like [**ChatGPT**](https://openai.com/chatgpt) from scratch. I also used it to train the French open-source [**DimensionGPT**](#-dimensiongpt) models.
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 | # 📋 Summary
22 |
23 | * **[📋 Summary](#-summary)**
24 | * **[🤖 DimensionGPT](#-dimensiongpt)**
25 | 	* [🏗️ Architecture](#%EF%B8%8F-architecture)
26 | 	* [💾 Data](#-data)
27 | 	* [🦾 Training](#-training)
28 | 	* [🪛 Fine-tuning](#-fine-tuning)
29 | 	* [🧪 Tests](#-tests)
30 | 	* [🎛️ Weights](#%EF%B8%8F-weights)
31 | * **[📦 Dependencies](#-dependencies)**
32 | * **[🦾 Training](#-training-1)**
33 | * **[⚗️ Testing](#%EF%B8%8F-testing)**
34 | * **[🙏 Credits](#-credits)**
35 |
36 |
37 |
38 | # 🤖 DimensionGPT
39 |
40 | Using this repository, I trained [**DimensionGPT-0.2B**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m), a small 0.2B-parameter language model, on 50B tokens with my personal RTX 3090 GPU over ≈570 hours.
41 |
42 |
43 |
44 | ## 🏗️ Architecture
45 |
46 | The model is based on the decoder part of the transformer architecture from the paper [**Attention is All You Need**](https://doi.org/10.48550/arXiv.1706.03762) by **Google Brain** (2017), with a few improvements (a simplified sketch of some of them follows the list):
47 |
48 | * I replaced the default normalization layer with the Root Mean Square Layer Normalization (RMSNorm) from the paper [**Root Mean Square Layer Normalization**](https://doi.org/10.48550/arXiv.1910.07467) by **Edinburgh University** (2019)
49 |
50 | * I moved the normalization layers before the attention and feed-forward layers of each block (pre-norm) instead of after, like in the paper [**On Layer Normalization in the Transformer Architecture**](https://doi.org/10.48550/arXiv.2002.04745) by **Microsoft Research** (2020)
51 |
52 | * I replaced the ReLU activation with the SwiGLU activation from the paper [**GLU Variants Improve Transformer**](https://doi.org/10.48550/arXiv.2002.05202) by **Google** (2020)
53 |
54 | * I implemented Grouped-Query Attention (GQA) from the paper [**GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints**](https://doi.org/10.48550/arXiv.2305.13245) by **Google Research** (2023)
55 |
56 | * I replaced the absolute positional embedding with the Rotary Position Embedding (RoPE) from the paper [**RoFormer: Enhanced Transformer with Rotary Position Embedding**](https://doi.org/10.48550/arXiv.2104.09864) by **Zhuiyi Technology** (2021)
57 |
58 | * I implemented the Sliding Window Attention (SWA) from the paper [**Longformer: The Long-Document Transformer**](https://doi.org/10.48550/arXiv.2004.05150) by **Allen Institute** (2020)
59 |
60 |
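To make the first three changes concrete, here is a minimal PyTorch sketch of a pre-norm decoder block with RMSNorm and a SwiGLU feed-forward layer, using the dimensions from the table below. It is only an illustration (the attention module is a plain placeholder, not the GQA + RoPE + sliding-window attention of the real model), not the actual code from `dimgpt/training/layers.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):

	# Root Mean Square Layer Normalization: rescale by the RMS of the activations (no mean subtraction)
	def __init__(self, dim: int, eps: float = 1e-5):
		super().__init__()
		self.eps = eps
		self.weight = nn.Parameter(torch.ones(dim))

	def forward(self, x: torch.Tensor) -> torch.Tensor:
		rms = torch.sqrt(torch.mean(x * x, dim = -1, keepdim = True) + self.eps)
		return self.weight * x / rms


class SwiGLU(nn.Module):

	# SwiGLU feed-forward layer: a SiLU-gated hidden layer instead of a single ReLU layer
	def __init__(self, dim: int, hidden_dim: int):
		super().__init__()
		self.gate = nn.Linear(dim, hidden_dim, bias = False)
		self.up = nn.Linear(dim, hidden_dim, bias = False)
		self.down = nn.Linear(hidden_dim, dim, bias = False)

	def forward(self, x: torch.Tensor) -> torch.Tensor:
		return self.down(F.silu(self.gate(x)) * self.up(x))


class Block(nn.Module):

	# Pre-norm residual block: the normalization is applied before each sub-layer, not after
	def __init__(self, dim: int = 1024, hidden_dim: int = 2730, num_heads: int = 16):
		super().__init__()
		self.attention_norm = RMSNorm(dim)
		self.attention = nn.MultiheadAttention(dim, num_heads, batch_first = True) # placeholder attention
		self.ffn_norm = RMSNorm(dim)
		self.ffn = SwiGLU(dim, hidden_dim)

	def forward(self, x: torch.Tensor) -> torch.Tensor:
		h = self.attention_norm(x)
		x = x + self.attention(h, h, h, need_weights = False)[0]
		return x + self.ffn(self.ffn_norm(x))
```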
61 |
62 | Here are the main parameters of the architecture:
63 |
64 |
65 |
66 |
67 | | Parameter | Value |
68 | |:----------|------:|
69 | | Embedding dimension | 1,024 |
70 | | Number of layers | 16 |
71 | | Heads dimension | 64 |
72 | | Feed forward hidden dimension | 2,730 |
73 | | Number of heads | 16 |
74 | | Number of grouped heads | 4 |
75 | | Window size | 256 |
76 | | Context length | 512 |
77 | | Vocab size | 32,000 |
109 |
110 |
111 |
112 |
113 | The resulting model has 208,929,792 trainable parameters and fits on a single RTX 3090 GPU for mixed-precision training with a batch size of 16. For inference only, the model will probably fit on any modern GPU.
114 |
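As a rough sanity check of that figure (assuming tied input/output embeddings and ignoring the small normalization weights): the token embeddings contribute 32,000 × 1,024 ≈ 32.8M parameters, and each of the 16 layers has about 2 × 1,024 × 1,024 ≈ 2.1M parameters for the query and output projections, 2 × 1,024 × 256 ≈ 0.5M for the grouped key/value projections (4 grouped heads × 64 dimensions), and 3 × 1,024 × 2,730 ≈ 8.4M for the SwiGLU feed-forward, so 16 × 11.0M + 32.8M ≈ 209M.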
115 |
116 |
117 | ## 💾 Data
118 |
119 | The dataset used to train this model is exclusively in French and is a mix of multiple sources:
120 |
121 |
122 |
123 |
124 | | Source | Documents | Tokens | Multiplier | Ratio |
125 | |:-------|----------:|-------:|-----------:|------:|
126 | | Common Crawl (FR) | 21,476,796 | 35,821,271,160 | 1.0 | 76.89 % |
127 | | Wikipedia (FR) | 2,700,373 | 1,626,389,831 | 4.0 | 13.96 % |
128 | | French news articles | 20,446,435 | 11,308,851,150 | 0.3 | 7.28 % |
129 | | French books | 29,322 | 2,796,450,308 | 0.2 | 1.20 % |
130 | | French institutions documents | 87,103 | 147,034,958 | 2.0 | 0.63 % |
131 | | Others | 2,761 | 7,287,322 | 2.0 | 0.03 % |
132 | | **Total** | 44,742,790 | 51,707,284,729 | - | 100.00 % |
180 |
181 |
182 |
183 |
184 |
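The **Multiplier** column controls how much each source is over- or under-sampled relative to its raw size: the **Ratio** column is each source's token count times its multiplier, normalized over all sources, and documents are drawn with these probabilities during data preparation (the same weighting, based on character counts, is implemented in `dimgpt/data/pretraining.py`, shown later in this repository). A minimal sketch:

```python
import numpy as np

# Token counts and multipliers of each source, from the table above
tokens = np.array([35_821_271_160, 1_626_389_831, 11_308_851_150, 2_796_450_308, 147_034_958, 7_287_322], dtype = np.float64)
multipliers = np.array([1.0, 4.0, 0.3, 0.2, 2.0, 2.0])

# Sampling probability of each source: proportional to size × multiplier (this reproduces the Ratio column)
weights = tokens * multipliers
probabilities = weights / weights.sum()

# Index of the source to draw the next document from
source_index = np.random.choice(len(probabilities), p = probabilities)
```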
185 |
186 | For the tokenization, I created my own tokenizer that first cleans the text to keep only a predefined set of characters, then uses the [**Byte Pair Encoding (BPE)**](https://en.wikipedia.org/wiki/Byte_pair_encoding) algorithm to create the vocabulary. I trained the tokenizer on a 300-million-character subset of the dataset to get my 32,000-token vocabulary.
187 |
188 |
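A minimal sketch of this two-step idea (character cleaning, then BPE training) with the 🤗 Tokenizers library; the real implementation in `dimgpt/data/clean.py` and `dimgpt/data/tokenizer.py` uses a much larger character set, a custom pre-tokenizer, token filtering and extra control tokens:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Simplified character set (the real one also keeps accented characters, emojis, etc.)
ALLOWED_CHARS = set('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,;:!?\'"()-\n')

def clean(text: str) -> str:
	# Keep only the predefined characters
	return ''.join(char for char in text if char in ALLOWED_CHARS)

corpus = ['...'] # placeholder: the ~300 million character subset of the dataset

# Train a 32,000-token BPE vocabulary on the cleaned subset
tokenizer = Tokenizer(BPE(unk_token = '⮜unknown⮞'))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size = 32_000, special_tokens = ['⮜unknown⮞', '⮜padding⮞', '⮜start-of-text⮞', '⮜end-of-text⮞'])
tokenizer.train_from_iterator((clean(document) for document in corpus), trainer)

print(tokenizer.encode('Bonjour le monde !').tokens)
```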
189 |
190 | ## 🦾 Training
191 |
192 | For training, I used stochastic gradient descent with a warmup and cosine decay learning rate schedule. Here are the main hyperparameters:
193 |
194 |
195 |
196 |
197 | | Hyperparameter | Value |
198 | |:---------------|------:|
199 | | Batch size (tokens) | 524,288 |
200 | | Optimizer | AdamW |
201 | | Learning rate | 6.0 × 10⁻⁴ |
202 | | Warmup steps | 2,000 |
203 | | Decay steps | 100,000 |
204 | | β₁ | 0.9 |
205 | | β₂ | 0.95 |
206 | | ε | 10⁻⁵ |
207 | | Weight decay | 0.1 |
208 | | Gradient clipping | 1.0 |
241 |
242 |
243 |
244 |
245 |
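For reference, a minimal sketch of the warmup + cosine decay schedule with the values from the table (the constants match `dimgpt/settings.py`; the exact function used by the trainer may differ slightly):

```python
import math

MAX_LEARNING_RATE = 6e-4
MIN_LEARNING_RATE = 6e-5 # floor value, MIN_LEARNING_RATE in dimgpt/settings.py
WARMUP_STEPS = 2_000
DECAY_STEPS = 100_000

def learning_rate(step: int) -> float:

	# Linear warmup from 0 to the maximum learning rate
	if step < WARMUP_STEPS:
		return MAX_LEARNING_RATE * step / WARMUP_STEPS

	# After the decay period, stay at the minimum learning rate
	if step >= DECAY_STEPS:
		return MIN_LEARNING_RATE

	# Cosine decay from the maximum to the minimum learning rate
	progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)

	return MIN_LEARNING_RATE + 0.5 * (1.0 + math.cos(math.pi * progress)) * (MAX_LEARNING_RATE - MIN_LEARNING_RATE)
```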
246 |
247 | I trained the model on my personal RTX 3090 GPU for 1 epoch on the full dataset (13 times the [**Chinchilla optimal**](https://doi.org/10.48550/arXiv.2203.15556)) using mixed precision and gradient accumulation to increase the speed and reduce the memory usage:
248 |
249 |
250 |
251 |
252 | | Training summary | |
253 | |:-----------------|--:|
254 | | Tokens | 52,428,800,000 |
255 | | Steps | 100,000 |
256 | | FLOPs | 6.6 × 10¹⁹ |
257 | | Duration | 573 hours |
258 | | Final loss | 2.19 |
259 | | Final accuracy | 54.8 % |
279 |
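Both the Chinchilla comparison and the FLOPs figure can be checked with the usual rules of thumb: the Chinchilla-optimal token count is about 20 tokens per parameter, so 20 × 0.209B ≈ 4.2B tokens and 52.4B / 4.2B ≈ 13, and the training compute is about 6 × parameters × tokens ≈ 6 × 2.09 × 10⁸ × 5.24 × 10¹⁰ ≈ 6.6 × 10¹⁹ FLOPs.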
280 | *(the training loss and accuracy curves are in `resources/misc/loss.png` and `resources/misc/accuracy.png`)*
283 |
284 |
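A minimal sketch of the mixed-precision + gradient-accumulation training step (a generic PyTorch pattern, not the exact code of `dimgpt/training/trainer.py`; `model`, `optimizer` and `get_batch` are placeholders, and the model is assumed to return its loss):

```python
import torch

NUM_ACCUMULATIONS = 64 # 16 sequences × 512 tokens × 64 accumulations = 524,288 tokens per optimizer step

scaler = torch.cuda.amp.GradScaler()

def training_step(model, optimizer, get_batch) -> None:

	optimizer.zero_grad(set_to_none = True)

	for _ in range(NUM_ACCUMULATIONS):

		x, y = get_batch()

		# Forward pass in mixed precision
		with torch.autocast(device_type = 'cuda', dtype = torch.float16):
			loss = model(x, y) / NUM_ACCUMULATIONS # average the gradients over the accumulation steps

		# Backward pass with loss scaling
		scaler.scale(loss).backward()

	# Gradient clipping (see the hyperparameters table)
	scaler.unscale_(optimizer)
	torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

	scaler.step(optimizer)
	scaler.update()
```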
285 |
286 |
287 |
288 |
289 |
290 |
291 |
292 |
293 | ## 🪛 Fine-tuning
294 |
295 | I fine-tuned the model on the [**French instructions dataset**](https://github.com/angeluriot/French_instruct) I made for this project to create [**DimensionGPT-0.2B-Chat**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m), a 0.2B language model trained to follow instructions and answer questions in French.
296 |
297 |
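The conversations are serialized with the dedicated control tokens of the vocabulary (see `CONTROL_TOKENS` in `dimgpt/settings.py` and `dimgpt/data/finetuning.py`): an optional preprompt after a `⮜system⮞` token, then each message prefixed by its role token. Schematically (placeholder text; the exact assembly of preprompt, conversation and end-of-text token is handled by the fine-tuning dataset code):

```
⮜system⮞<preprompt>⮜user⮞<user message>⮜assistant⮞<assistant answer>⮜user⮞<next user message>⮜assistant⮞...
```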
298 |
299 | ## 🧪 Tests
300 |
301 | Here are some examples of the model outputs:
302 |
303 |
304 | *(the output screenshots are in `resources/misc/test_1.png` to `resources/misc/test_11.png`)*
347 |
348 |
349 | ## 🎛️ Weights
350 |
351 | The trained weights of the different models are available on [**Google Drive**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m); you just need to:
352 |
353 | * Download the `.pt` file of the model you want to use and put it in the `models` folder
354 | * Download the `vocab.txt` file and put it in the `data` folder
355 |
356 |
357 |
358 | # 📦 Dependencies
359 |
360 | * [**Python**](https://www.python.org/)
361 | * [**PyTorch**](https://pytorch.org/)
362 | * [**Flash Attention**](https://github.com/Dao-AILab/Flash-attention)
363 | * [**Datasets 🤗**](https://github.com/huggingface/datasets)
364 | * [**Tokenizers 🤗**](https://github.com/huggingface/tokenizers)
365 | * [**Unidecode**](https://pypi.org/project/Unidecode/)
366 | * [**Regex**](https://github.com/mrabarnett/mrab-regex)
367 | * [**Tqdm**](https://tqdm.github.io/)
368 | * [**PSUtil**](https://github.com/giampaolo/psutil)
369 |
370 |
371 |
372 | Run the following command to install the dependencies:
373 |
374 | ```shell
375 | $ pip install -r requirements.txt
376 | ```
377 |
378 | ⚠️ You may need to use a [**specific command**](https://pytorch.org/get-started/locally/) for PyTorch if you want to use CUDA
379 |
380 | ⚠️ You may need to manually install a [**Flash Attention release**](https://github.com/Dao-AILab/flash-attention/releases) for Windows
381 |
382 |
383 |
384 | # 🦾 Training
385 |
386 | * Run the `create_data.ipynb` file to create the tokenizer and the dataset *(it may take an entire day and consume a few hundred gigabytes of disk space)*
387 |
388 | * Run the `training.ipynb` file *(you can stop the training at any time and resume it later thanks to the checkpoints)*
389 |
390 | * If you don't have an overpriced 24GB GPU like me, the default settings (those used to train [**DimensionGPT**](#-dimensiongpt)) may not work for you. You can try to *(see the example after this list)*:
391 | 	* Reduce the **batch size** *(less stable and a worse final loss)*
392 | 	* Increase the **accumulation steps** *(fixes the previous problems but is slower)*
393 | 	* Reduce some **architecture parameters** *(worse final loss)*
394 |
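These settings live in `dimgpt/settings.py`. For example, on a smaller GPU you could halve the batch size and double the accumulation steps to keep the same effective batch of ≈524K tokens per optimizer step (illustrative values, not tested):

```python
# dimgpt/settings.py (defaults used to train DimensionGPT)
BATCH_SIZE = 16
NUM_ACCUMULATIONS = 64

# Example for a GPU with less memory: same effective batch size (8 × 512 × 128 = 524,288 tokens)
BATCH_SIZE = 8
NUM_ACCUMULATIONS = 128
```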
395 |
396 |
397 |
398 | # ⚗️ Testing
399 |
400 | * Run the `testing.ipynb` file to use the models you downloaded or trained
401 |
402 |
403 |
404 | # 🙏 Credits
405 |
406 | * [**Angel Uriot**](https://github.com/angeluriot): Creator of the project.
407 |
--------------------------------------------------------------------------------
/create_data.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Create training data"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### Imports"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": null,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "from dimgpt.data.tokenizer import *\n",
24 | "from dimgpt.data.pretraining import *\n",
25 | "from dimgpt.data.finetuning import *\n",
26 | "from dimgpt import utils\n",
27 | "from dimgpt.settings import *\n",
28 | "\n",
29 | "utils.reset_rand()"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "### Import pretraining dataset"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": null,
42 | "metadata": {},
43 | "outputs": [],
44 | "source": [
45 | "pretraining = Pretraining()"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "pretraining.summary()"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "print(pretraining.get_document())"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "### Create vocab"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": null,
76 | "metadata": {},
77 | "outputs": [],
78 | "source": [
79 | "sizes, chars = pretraining.create_tokenizer_data()\n",
80 | "\n",
81 | "print()\n",
82 | "\n",
83 | "for i in range(len(pretraining.datasets)):\n",
84 | "\tprint(f'{pretraining.datasets[i].name}: {sizes[i]:,} characters')\n",
85 | "\n",
86 | "print('\\nNb unique characters:', len(chars), '\\n')\n",
87 | "\n",
88 | "for char in chars:\n",
89 | "\tprint(f'[{char}]', end = ' ')"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": null,
95 | "metadata": {},
96 | "outputs": [],
97 | "source": [
98 | "tokenizer = Tokenizer()\n",
99 | "\n",
100 | "print(f'\\nVocab size: {len(tokenizer.vocab):,}\\n')\n",
101 | "\n",
102 | "for v in tokenizer.vocab:\n",
103 | "\tprint(f'[{v}]', end = ' ')"
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "### Encode datasets"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {},
117 | "outputs": [],
118 | "source": [
119 | "pretraining.save(tokenizer)"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {},
126 | "outputs": [],
127 | "source": [
128 | "pretraining.summary()"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": null,
134 | "metadata": {},
135 | "outputs": [],
136 | "source": [
137 | "finetuning = Finetuning()\n",
138 | "finetuning.save(tokenizer)"
139 | ]
140 | }
141 | ],
142 | "metadata": {
143 | "kernelspec": {
144 | "display_name": "venv",
145 | "language": "python",
146 | "name": "python3"
147 | },
148 | "language_info": {
149 | "codemirror_mode": {
150 | "name": "ipython",
151 | "version": 3
152 | },
153 | "file_extension": ".py",
154 | "mimetype": "text/x-python",
155 | "name": "python",
156 | "nbconvert_exporter": "python",
157 | "pygments_lexer": "ipython3",
158 | "version": "3.10.11"
159 | },
160 | "orig_nbformat": 4
161 | },
162 | "nbformat": 4,
163 | "nbformat_minor": 2
164 | }
165 |
--------------------------------------------------------------------------------
/data/README.md:
--------------------------------------------------------------------------------
1 | # 🎛️ Trained weights
2 |
3 | The trained weights of the different models are available on [**Google Drive**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m); you just need to:
4 |
5 | * Download the `.pt` file of the model you want to use and put it in the `models` folder
6 | * Download the `vocab.txt` file and put it in this folder
7 |
--------------------------------------------------------------------------------
/dimgpt/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/dimgpt/__init__.py
--------------------------------------------------------------------------------
/dimgpt/data/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/dimgpt/data/__init__.py
--------------------------------------------------------------------------------
/dimgpt/data/clean.py:
--------------------------------------------------------------------------------
1 | import regex
2 | from unidecode import unidecode
3 |
4 | from dimgpt.settings import *
5 |
6 |
7 | AUTHORIZED_UNICODE = set(
8 | 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' \
9 | '0123456789' \
10 | ' !"#$%&\'`()*+,-./:;<=>?@[\\]^_{|}~' \
11 | 'ÀàÂâÄäÇçÉéÈèÊêËëÎîÏïÔôÖöÙùÛûÜüÆæŒœ' \
12 | '€£¥•·²³≠±×÷√π' \
13 | '😀😃😄😁😆😅😂🤣🥲🥹😊😇🙂🙃😉😌😍🥰😘😗😙😚😋😛😝😜🤪🤨🧐🤓😎🥸🤩🥳😏😒😞😔😟😕🙁😣😖😫😩🥺😢😭😤😠😡🤬🤯😳🥵🥶😱😨😰😥😓🫣🤗🫡🤔🫢🤭🤫🤥😶😐😑😬🫠🙄😯😦😧😮😲🥱😴🤤😪😵🫥🤐🥴🤢🤮🤧😷🤒🤕🤑🤠😈👿👹👺🤡💩👻💀👽👾🤖🎃😺😸😹😻😼😽🙀😿😾' \
14 | '👋🤚🖐✋🖖👌🤌🤏🤞🫰🤟🤘🤙🫵🫱🫲🫳🫴👈👉👆🖕👇👍👎✊👊🤛🤜👏🫶🙌👐🤲🤝🙏💅🤳💪🦾🦵🦿🦶👣👂🦻👃🫀🫁🧠🦷🦴👀👁👅👄🫦💋🩸' \
15 | '👶👧🧒👦👩🧑👨👱🧔👵🧓👴👲👳🧕👮👷💂👰🤵👸🫅🤴🥷🦸🦹🤶🎅🧙🧝🧛🧟🧞🧜🧚🧌👼🤰🤱🙇💁🙅🙆🙋🧏🤦🤷🙎🙍💇💆🧖💅🤳💃🕺👯🕴🚶🧎🏃🧍👭👬👫💑💏👪🗣👤👥🫂' \
16 | '🧳🌂🧵🪡🪢🧶👓🕶🥽🥼🦺👔👕👖🧣🧤🧥🧦👗👘🥻🩴🩱🩲🩳👙👚👛👜👝🎒👞👟🥾🥿👠👡🩰👢👑👒🎩🎓🧢⛑🪖💄💍💼' \
17 | '🐶🐱🐭🐹🐰🦊🐻🐼🐨🐯🦁🐮🐷🐽🐸🐵🙈🙉🙊🐒🐔🐧🐦🐤🐣🐥🦆🦅🦉🦇🐺🐗🐴🦄🐝🪱🐛🦋🐌🐞🐜🪰🪲🪳🦟🦗🕷🕸🦂🐢🐍🦎🦖🦕🐙🦑🦐🦞🦀🪸🐡🐠🐟🐬🐳🐋🦈🐊🐅🐆🦓🦍🦧🦣🐘🦛🦏🐪🐫🦒🦘🦬🐃🐂🐄🐎🐖🐏🐑🦙🐐🦌🐕🐩🦮🐈🪶🐓🦃🦤🦚🦜🦢🦩🕊🐇🦝🦨🦡🦫🦦🦥🐁🐀🐿🦔🐾🐉🐲🌵🎄🌲🌳🌴🪹🪺🪵🌱🌿🍀🎍🪴🎋🍃🍂🍁🍄🐚🪨🌾💐🌷🪷🌹🥀🌺🌸🌼🌻🌞🌝🌛🌜🌚🌕🌖🌗🌘🌑🌒🌓🌔🌙🌎🌍🌏🪐💫⭐🌟✨💥🔥🌪🌈🌤🌥🌦🌧⛈🌩🌨🌬💨💧💦🫧🌊🌫' \
18 | '🍏🍎🍐🍊🍋🍌🍉🍇🍓🫐🍈🍒🍑🥭🍍🥥🥝🍅🍆🥑🥦🥬🥒🌶🫑🌽🥕🫒🧄🧅🥔🍠🫘🥐🥯🍞🥖🥨🧀🥚🍳🧈🥞🧇🥓🥩🍗🍖🦴🌭🍔🍟🍕🫓🥪🥙🧆🌮🌯🫔🥗🥘🫕🥫🍝🍜🍲🍛🍣🍱🥟🦪🍤🍙🍚🍘🍥🥠🥮🍢🍡🍧🍨🍦🥧🧁🍰🎂🍮🍭🍬🍫🍿🍩🍪🌰🥜🍯🥛🍼🫖☕🍵🧃🥤🧋🫙🍶🍺🍻🥂🍷🫗🥃🍸🍹🧉🍾🧊🥄🍴🍽🥣🥡🥢🧂' \
19 | '⚽🏀🏈⚾🥎🎾🏐🏉🥏🎱🪀🏓🏸🏒🏑🥍🏏🪃🥅🪁🏹🎣🤿🥊🥋🎽🛹🛼🛷⛸🥌🎿⛷🏂🪂🤼🤸🤺🤾🏇🧘🏄🏊🤽🚣🧗🚵🚴🏆🥇🥈🥉🏅🎖🏵🎗🎫🎟🎪🤹🎭🩰🎨🎬🎤🎧🎼🎹🥁🪘🎷🎺🪗🎸🪕🎻🎲♟🎯🎳🎮🎰🧩' \
20 | '🚗🚕🚙🚌🚎🏎🚓🚑🚒🚐🛻🚚🚛🚜🦯🦽🦼🛴🚲🛵🏍🛺🚨🚔🚍🚘🚖🛞🚡🚠🚟🚃🚋🚞🚝🚄🚅🚈🚂🚆🚇🚊🚉🛫🛬🛩💺🛰🚀🛸🚁🛶⛵🚤🛥🛳⛴🚢🛟🪝🚧🚦🚥🚏🗺🗿🗽🗼🏰🏯🏟🎡🎢🛝🎠⛱🏖🏝🏜🌋⛰🏔🗻🏕🛖🏠🏡🏘🏚🏗🏭🏢🏬🏣🏤🏥🏦🏨🏪🏫🏩💒🏛🕌🕍🛕🕋⛩🛤🛣🗾🎑🏞🌅🌄🌠🎇🎆🌇🌆🏙🌃🌌🌉🌁' \
21 | '⌚📱📲💻🖥🖨🖱🖲🕹🗜💽💾💿📀📼📷📸📹🎥📽🎞📞📟📠📺📻🎙🎚🎛🧭⏱⏲⏰🕰⌛⏳📡🔋🪫🔌💡🔦🕯🪔🧯🛢💸💵💴💶💷🪙💰💳💎🪜🧰🪛🔧🔨⚒🛠⛏🪚🔩🪤🧱⛓🧲🔫💣🧨🪓🔪🗡🛡🚬🪦🏺🔮📿🧿🪬💈🔭🔬🕳🩹🩺🩻🩼💊💉🩸🧬🦠🧫🧪🌡🧹🪠🧺🧻🚽🚰🚿🛁🛀🧼🪥🪒🧽🪣🧴🛎🔑🗝🚪🪑🛋🛏🛌🧸🪆🖼🪞🪟🛍🛒🎁🎈🎏🎀🪄🪅🎊🎉🪩🎎🏮🎐🧧📩📨📧💌📥📤📦🏷🪧📪📫📬📭📮📯📜📃📄📑🧾📊📈📉🗒🗓📆📅🗑🪪📇🗃🗳🗄📋📁📂🗂🗞📰📓📔📒📕📗📘📙📚📖🔖🧷🔗📎🖇📐📏🧮📌📍🖊🖋🖌🖍📝🔍🔎🔏🔐🔒🔓' \
22 | '🧡💛💚💙💜🖤🤍🤎💔💕💞💓💗💖💘💝💟🔯🕎🛐⛎🆔🉑📴📳🈶🈸🈺🆚💮🉐🈴🈵🈹🈲🆎🆑🆘❌🛑⛔📛🚫💯💢🚷🚯🚳🚱🔞📵🚭🔅🔆🚸🔱🔰✅💹❎🌐💠🌀💤🏧🚾🛗🈳🛂🛃🛄🛅🚹🚺🚼⚧🚻🚮🎦📶🈁🔣🔤🔡🔠🆖🆗🆙🆒🆕🆓🔟🔢⏸⏯⏹⏺⏭⏮⏩⏪⏫⏬🔼🔽🔀🔁🔂🔄🔃🎵🎶➕➖➗🟰♾💲💱➰➿🔚🔙🔛🔝🔜🔘🔴🟠🟡🟢🔵🟣🟤🔺🔻🔸🔹🔶🔷🔳🔲🟥🟧🟨🟩🟦🟪🟫🔈🔇🔉🔊🔔🔕📣📢💬💭🗯🃏🎴🕐🕑🕒🕓🕔🕕🕖🕗🕘🕙🕚🕛🕜🕝🕞🕟🕠🕡🕢🕣🕤🕥🕦🕧' \
23 | '🏴🏁🚩🎌'
24 | )
25 |
26 | AUTHORIZED_ASCII = set(
27 | 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' \
28 | '0123456789' \
29 | ' !"#$%&\'`()*+,-./:;<=>?@[\\]^_{|}~'
30 | )
31 |
32 | REPLACE_UNICODE = {
33 | '« ': '"',
34 | ' »': '"',
35 | '«': '"',
36 | '»': '"',
37 | '❗️': '!',
38 | '❕': '!',
39 | '❓': '?',
40 | '❔': '?',
41 | '‼️': '!!',
42 | '⁉️': '!?',
43 | '✖️': '❌',
44 | '✔️': '✅',
45 | '☺': '😊',
46 | '☺️': '😊',
47 | '☹': '🙁',
48 | '☹️': '🙁'
49 | }
50 |
51 | ENCODE_STRING_EMOJIS = {
52 | '☂️': '☂',
53 | '☀️': '☀',
54 | '❄️': '❄',
55 | '✈️': '✈',
56 | '☎️': '☎',
57 | '⚙️': '⚙',
58 | '⚔️': '⚔',
59 | '✉️': '✉',
60 | '✂️': '✂',
61 | '✒️': '✒',
62 | '❤️': '❤',
63 | '☢️': '☢',
64 | '☣️': '☣',
65 | '⚠️': '⚠',
66 | '♻️': '♻',
67 | '🏳️🌈': '①',
68 | '🏳️⚧️': '②',
69 | '🏴☠️': '③',
70 | '🇺🇸': '④',
71 | '🇨🇳': '⑤',
72 | '🇯🇵': '⑥',
73 | '🇩🇪': '⑦',
74 | '🇮🇳': '⑧',
75 | '🇬🇧': '⑨',
76 | '🇫🇷': '⑩',
77 | '🇮🇹': '⑪',
78 | '🇨🇦': '⑫',
79 | '🇧🇷': '⑬',
80 | '🇷🇺': '⑭',
81 | '🇰🇷': '⑮',
82 | '🇦🇺': '⑯',
83 | '🇲🇽': '⑰',
84 | '🇪🇸': '⑱',
85 | '🏳️': '🏳'
86 | }
87 |
88 | DECODE_STRING_EMOJIS = {value: key for key, value in reversed(ENCODE_STRING_EMOJIS.items())}
89 |
90 | ENCODE_CHARS = list('①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱')
91 |
92 | REPLACE_ASCII_STRING = {
93 | '--': '-'
94 | }
95 |
96 | STRIP_REPLACE = {
97 | ' \n': '\n',
98 | '\t\n': '\n',
99 | '\n\n\n': '\n\n'
100 | }
101 |
102 | CONTROL_REPLACE = {
103 | '\t': '⮜tab⮞',
104 | '\n': '⮜new-line⮞'
105 | }
106 |
107 | POSSIBLE_CHARS = AUTHORIZED_UNICODE | set(DECODE_STRING_EMOJIS.keys())
108 |
109 |
110 | def clean_ascii(char: str) -> str:
111 |
112 | if char in AUTHORIZED_ASCII or char in CONTROL_REPLACE.keys():
113 | return char
114 |
115 | return ''
116 |
117 |
118 | def clean_unicode(char: str) -> str:
119 |
120 | if char in AUTHORIZED_UNICODE or char in DECODE_STRING_EMOJIS or char in CONTROL_REPLACE.keys():
121 | return char
122 |
123 | text = unidecode(char)
124 |
125 | for key, value in REPLACE_ASCII_STRING.items():
126 | text = text.replace(key, value)
127 |
128 | return ''.join([clean_ascii(char) for char in text])
129 |
130 |
131 | def clean_string(text: str, keep_control_tokens: bool = False) -> str:
132 |
133 | if len(text) == 0:
134 | return ''
135 |
136 | text = text.replace('\r', '')
137 |
138 | if keep_control_tokens:
139 |
140 | safe_control_tokens = [regex.escape(c) for c in CONTROL_TOKENS]
141 | reg = r'(' + r'|'.join(safe_control_tokens) + r')' # capture the control tokens so the split keeps them
142 | parts = regex.split(reg, text, flags = regex.UNICODE, concurrent = False)
143 | parts = list(filter(None, parts))
144 |
145 | return ''.join([part if part in CONTROL_TOKENS else clean_string(part) for part in parts])
146 |
147 | for key, value in REPLACE_UNICODE.items():
148 | text = text.replace(key, value)
149 |
150 | for char in ENCODE_CHARS:
151 | text = text.replace(char, unidecode(char))
152 |
153 | for key, value in ENCODE_STRING_EMOJIS.items():
154 | text = text.replace(key, value)
155 |
156 | text = ''.join([clean_unicode(char) for char in text])
157 |
158 | for key, value in STRIP_REPLACE.items():
159 | while key in text:
160 | text = text.replace(key, value)
161 |
162 | text = text.strip()
163 |
164 | for key, value in CONTROL_REPLACE.items():
165 | text = text.replace(key, value)
166 |
167 | return text
168 |
169 |
170 | def unclean_string(text: str, keep_control_tokens: bool = False) -> str:
171 |
172 | for key, value in DECODE_STRING_EMOJIS.items():
173 | text = text.replace(key, value)
174 |
175 | if keep_control_tokens:
176 | return text
177 |
178 | text = text.replace('⮜unknown⮞', '�')
179 | text = text.replace('⮜padding⮞', '')
180 | text = text.replace('⮜start-of-text⮞', '\n\n---------- START OF TEXT ----------\n\n')
181 | text = text.replace('⮜tab⮞', '\t')
182 | text = text.replace('⮜new-line⮞', '\n')
183 | text = text.replace('⮜human⮞', '\n\n--- Human ---\n\n')
184 | text = text.replace('⮜system⮞', '\n\n--- System ---\n\n')
185 | text = text.replace('⮜user⮞', '\n\n--- User ---\n\n')
186 | text = text.replace('⮜assistant⮞', '\n\n--- Assistant ---\n\n')
187 | text = text.replace('⮜end-of-text⮞', '\n\n---------- END OF TEXT ----------\n\n')
188 |
189 | return text
190 |
--------------------------------------------------------------------------------
/dimgpt/data/datasets/__init__.py:
--------------------------------------------------------------------------------
1 | from .dataset import Dataset
--------------------------------------------------------------------------------
/dimgpt/data/datasets/dataset.py:
--------------------------------------------------------------------------------
1 | import os, random, pickle
2 | from tqdm import tqdm
3 | from abc import ABC
4 | import numpy as np
5 | import numpy.typing as npt
6 | from dimgpt.data.clean import *
7 | from dimgpt.data.tokenizer import Tokenizer
8 |
9 |
10 | class Dataset(ABC):
11 |
12 | def __init__(self) -> None:
13 |
14 | self.dataset = None
15 | self.training_part = ''
16 | self.name = ''
17 | self.size = {'train': 0, 'val': 0}
18 | self.multiplier = 1.0
19 |
20 |
21 | def get_document(self, i: int | None = None) -> str:
22 |
23 | if i is None:
24 | i = random.randint(0, len(self.dataset) - 1)
25 |
26 | return '⮜start-of-text⮞' + clean_string(self.dataset[i]['text']) + '⮜end-of-text⮞'
27 |
28 |
29 | def document_to_tokens(self, document: dict[str, str], tokenizer: Tokenizer) -> dict[str, npt.NDArray[np.uint16] | int]:
30 |
31 | tokens = [tokenizer.start_of_text_token, *tokenizer.encode(document['text']), tokenizer.end_of_text_token]
32 |
33 | return {'tokens': np.array(tokens, dtype = np.uint16), 'size': len(tokens)}
34 |
35 |
36 | def save(self, tokenizer: Tokenizer) -> None:
37 |
38 | if os.path.exists(os.path.join(DATA_DIR, self.training_part, self.name, f'train.bin')):
39 | return
40 |
41 | os.makedirs(os.path.join(DATA_DIR, self.training_part, self.name), exist_ok = True)
42 |
43 | split_dataset = self.dataset.train_test_split(test_size = PRETRAINING_VAL_RATIO, shuffle = True)
44 | split_dataset['val'] = split_dataset.pop('test')
45 |
46 | tokenized = split_dataset.map(
47 | lambda doc: self.document_to_tokens(doc, tokenizer),
48 | desc = f'Tokenizing {self.name}',
49 | num_proc = NUM_THREADS
50 | )
51 |
52 | for split, documents in tokenized.items():
53 |
54 | total = 0
55 | ids = []
56 |
57 | for doc in tqdm(documents, desc = f'Saving {self.name} {split} ids'):
58 |
59 | ids.append({
60 | 'start': total,
61 | 'size': doc['size']
62 | })
63 |
64 | total += doc['size']
65 |
66 | with open(os.path.join(DATA_DIR, self.training_part, self.name, f'{split}_ids.pkl'), 'wb') as file:
67 | pickle.dump(ids, file)
68 |
69 | batch_size = 1_024
70 |
71 | while batch_size >= len(documents):
72 | batch_size //= 2
73 |
74 | self.size[split] = int(np.sum(documents['size'], dtype = np.uint64))
75 | path = os.path.join(DATA_DIR, self.training_part, self.name, f'{split}.bin')
76 | file = np.memmap(path, dtype = np.uint16, mode = 'w+', shape = (self.size[split],))
77 | i = 0
78 |
79 | for batch_i in tqdm(range(batch_size), desc = f'Saving {self.name} {split}'):
80 |
81 | batch = documents.shard(num_shards = batch_size, index = batch_i, contiguous = True).with_format('numpy')
82 | file_batch = np.concatenate(batch['tokens'])
83 | file[i:i + len(file_batch)] = file_batch
84 | i += len(file_batch)
85 |
86 | file.flush()
87 |
88 | with open(os.path.join(DATA_DIR, self.training_part, self.name, f'metadata.pkl'), 'wb') as file:
89 | pickle.dump({
90 | 'training_part': self.training_part,
91 | 'name': self.name,
92 | 'size': self.size,
93 | 'multiplier': self.multiplier
94 | }, file)
95 |
--------------------------------------------------------------------------------
/dimgpt/data/datasets/pretraining/__init__.py:
--------------------------------------------------------------------------------
1 | from .common_crawl import CommonCrawlDataset
2 | from .wikipedia import WikipediaDataset
3 | from .books import BooksDataset
4 | from .news import NewsDataset
5 | from .institutions import InstitutionsDataset
6 | from .others import OthersDataset
--------------------------------------------------------------------------------
/dimgpt/data/datasets/pretraining/books.py:
--------------------------------------------------------------------------------
1 | import os, json
2 | from datasets import load_dataset, DownloadConfig, concatenate_datasets
3 | from dimgpt.data.datasets import Dataset
4 | from dimgpt.settings import *
5 |
6 | class BooksDataset(Dataset):
7 |
8 | def __init__(self) -> None:
9 |
10 | super().__init__()
11 |
12 | self.training_part = 'pretraining'
13 | self.name = 'books'
14 | self.multiplier = 0.2
15 |
16 | print('Downloading Books dataset...')
17 |
18 | if not os.path.exists(os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json')):
19 |
20 | dataset = load_dataset(
21 | path = 'PleIAs/French-PD-Books',
22 | split = 'train',
23 | download_config = DownloadConfig(max_retries = 10),
24 | streaming = True
25 | )
26 |
27 | os.makedirs(os.path.join(DATA_DIR, self.training_part, self.name), exist_ok = True)
28 |
29 | with open(os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json'), 'w', encoding = 'utf-8') as file:
30 |
31 | file.truncate(0)
32 | i = 0
33 | self.size['train'] = 0
34 |
35 | for record in dataset:
36 |
37 | text = str(record['complete_text']).strip()
38 |
39 | if len(text) < MIN_DOCUMENT_SIZE:
40 | continue
41 |
42 | file.write(json.dumps({'text': text}, ensure_ascii = False) + '\n')
43 |
44 | self.size['train'] += len(text)
45 | i += 1
46 |
47 | if i % 1_000 == 0:
48 | print(f'{i:,} documents | {self.size["train"]:,} characters ', end = '\r')
49 |
50 | if self.size['train'] >= 10_000_000_000:
51 | break
52 |
53 | self.dataset = load_dataset(
54 | path = 'json',
55 | split = 'train',
56 | data_files = os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json'),
57 | num_proc = NUM_THREADS
58 | )
59 |
60 | if self.size['train'] == 0:
61 | self.size['train'] = 10_000_000_000
62 |
63 | print(f'Books dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters')
--------------------------------------------------------------------------------
/dimgpt/data/datasets/pretraining/common_crawl.py:
--------------------------------------------------------------------------------
1 | import os, json
2 | from datasets import load_dataset, DownloadConfig
3 | from dimgpt.data.datasets import Dataset
4 | from dimgpt.settings import *
5 |
6 | class CommonCrawlDataset(Dataset):
7 |
8 | def __init__(self) -> None:
9 |
10 | super().__init__()
11 |
12 | self.training_part = 'pretraining'
13 | self.name = 'common_crawl'
14 | self.multiplier = 1.0
15 |
16 | print('Downloading Common Crawl dataset...')
17 |
18 | if not os.path.exists(os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json')):
19 |
20 | dataset = load_dataset(
21 | path = 'ontocord/CulturaY',
22 | name = 'fr',
23 | split = 'train',
24 | download_config = DownloadConfig(max_retries = 10),
25 | streaming = True
26 | )
27 |
28 | os.makedirs(os.path.join(DATA_DIR, self.training_part, self.name), exist_ok = True)
29 |
30 | with open(os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json'), 'w', encoding = 'utf-8') as file:
31 |
32 | file.truncate(0)
33 | i = 0
34 | self.size['train'] = 0
35 |
36 | for record in dataset:
37 |
38 | text = str(record['text']).strip()
39 |
40 | if len(text) < MIN_DOCUMENT_SIZE:
41 | continue
42 |
43 | file.write(json.dumps({'text': text}, ensure_ascii = False) + '\n')
44 |
45 | self.size['train'] += len(text)
46 | i += 1
47 |
48 | if i % 1_000 == 0:
49 | print(f'{i:,} documents | {self.size["train"]:,} characters ', end = '\r')
50 |
51 | if self.size['train'] >= 150_000_000_000:
52 | break
53 |
54 | self.dataset = load_dataset(
55 | path = 'json',
56 | split = 'train',
57 | data_files = os.path.join(DATA_DIR, self.training_part, self.name, 'raw.json'),
58 | num_proc = NUM_THREADS
59 | )
60 |
61 | if self.size['train'] == 0:
62 | self.size['train'] = 150_000_000_000
63 |
64 | print(f'Common Crawl dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters')
65 |
--------------------------------------------------------------------------------
/dimgpt/data/datasets/pretraining/institutions.py:
--------------------------------------------------------------------------------
1 | from datasets import load_dataset, DownloadConfig, concatenate_datasets
2 | from dimgpt.data.datasets import Dataset
3 | from dimgpt.settings import *
4 |
5 | class InstitutionsDataset(Dataset):
6 |
7 | def __init__(self) -> None:
8 |
9 | super().__init__()
10 |
11 | self.training_part = 'pretraining'
12 | self.name = 'institutions'
13 | self.multiplier = 2.0
14 |
15 | print('Downloading Institutions dataset...')
16 |
17 | europarl = load_dataset(
18 | path = 'bigscience-data/roots_fr_the_pile_europarl',
19 | split = 'train',
20 | download_config = DownloadConfig(max_retries = 10)
21 | )
22 |
23 | qr_an = load_dataset(
24 | path = 'cassandra-themis/QR-AN',
25 | name = 'qran_generation',
26 | split = 'train+validation+test',
27 | download_config = DownloadConfig(max_retries = 10)
28 | )
29 |
30 | qr_an = qr_an.map(
31 | lambda doc: {'text': (str(doc['question']).strip() + '\n\n' + str(doc['answer']).strip()).strip()},
32 | remove_columns = ['question', 'answer'],
33 | desc = 'Cleaning QR-AN',
34 | num_proc = NUM_THREADS
35 | )
36 |
37 | self.dataset = concatenate_datasets([europarl, qr_an])
38 | self.dataset = self.dataset.filter(lambda doc: len(str(doc['text']).strip()) >= MIN_DOCUMENT_SIZE)
39 | self.size['train'] = 0
40 |
41 | for doc in self.dataset:
42 | self.size['train'] += len(str(doc['text']).strip())
43 |
44 | print(f'Institutions dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters')
--------------------------------------------------------------------------------
/dimgpt/data/datasets/pretraining/news.py:
--------------------------------------------------------------------------------
1 | import re, json
2 | from datasets import load_dataset, DownloadConfig, concatenate_datasets
3 | from dimgpt.data.datasets import Dataset
4 | from dimgpt.settings import *
5 |
6 | class NewsDataset(Dataset):
7 |
8 | def __init__(self) -> None:
9 |
10 | super().__init__()
11 |
12 | self.training_part = 'pretraining'
13 | self.name = 'news'
14 | self.multiplier = 0.3
15 |
16 | print('Downloading News dataset...')
17 |
18 | news_fr = load_dataset(
19 | path = 'eckendoerffer/news_fr',
20 | split = 'train+validation+test',
21 | download_config = DownloadConfig(max_retries = 10)
22 | )
23 |
24 | news_fr = news_fr.map(
25 | lambda doc: {'text': self._clean_news_fr(doc['text'])},
26 | desc = 'Cleaning news_fr',
27 | num_proc = NUM_THREADS
28 | )
29 |
30 | wikinews = load_dataset(
31 | path = 'bigscience-data/roots_fr_wikinews',
32 | split = 'train',
33 | download_config = DownloadConfig(max_retries = 10)
34 | )
35 |
36 | wikinews = wikinews.map(
37 | lambda doc: {'text': self._clean_wikinews(doc)},
38 | remove_columns = ['meta'],
39 | desc = 'Cleaning wikinews',
40 | num_proc = NUM_THREADS
41 | )
42 |
43 | cc_news = load_dataset(
44 | path = 'intfloat/multilingual_cc_news',
45 | name = 'fr',
46 | split = 'train',
47 | download_config = DownloadConfig(max_retries = 10)
48 | )
49 |
50 | cc_news = cc_news.map(
51 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['maintext']).strip()).strip()},
52 | remove_columns = ['title', 'maintext', 'url', 'date_publish'],
53 | desc = 'Cleaning cc_news',
54 | num_proc = NUM_THREADS
55 | )
56 |
57 | xlsum = load_dataset(
58 | path = 'csebuetnlp/xlsum',
59 | name = 'french',
60 | split = 'train+validation+test',
61 | download_config = DownloadConfig(max_retries = 10)
62 | )
63 |
64 | xlsum_summaries = xlsum.map(
65 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['summary']).strip()).strip()},
66 | remove_columns = ['id', 'url', 'title', 'summary'],
67 | desc = 'Cleaning xlsum_summaries',
68 | num_proc = NUM_THREADS
69 | )
70 |
71 | xlsum = xlsum.map(
72 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['text']).strip()).strip()},
73 | remove_columns = ['id', 'url', 'title', 'summary'],
74 | desc = 'Cleaning xlsum',
75 | num_proc = NUM_THREADS
76 | )
77 |
78 | mlsum = load_dataset(
79 | path = 'mlsum',
80 | name = 'fr',
81 | split = 'train+validation+test',
82 | download_config = DownloadConfig(max_retries = 10)
83 | )
84 |
85 | mlsum_summaries = mlsum.map(
86 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['summary']).strip()).strip()},
87 | remove_columns = ['summary', 'topic', 'url', 'title', 'date'],
88 | desc = 'Cleaning mlsum_summaries',
89 | num_proc = NUM_THREADS
90 | )
91 |
92 | mlsum = mlsum.map(
93 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['text']).strip()).strip()},
94 | remove_columns = ['summary', 'topic', 'url', 'title', 'date'],
95 | desc = 'Cleaning mlsum',
96 | num_proc = NUM_THREADS
97 | )
98 |
99 | orange_sum = load_dataset(
100 | path = 'orange_sum',
101 | name = 'title',
102 | split = 'train+validation+test',
103 | download_config = DownloadConfig(max_retries = 10)
104 | )
105 |
106 | orange_sum = orange_sum.map(
107 | lambda doc: {'text': (str(doc['summary']).strip() + '\n\n' + str(doc['text']).strip()).strip()},
108 | remove_columns = ['summary'],
109 | desc = 'Cleaning orange_sum',
110 | num_proc = NUM_THREADS
111 | )
112 |
113 | covid_news = load_dataset(
114 | path = 'gustavecortal/fr_covid_news',
115 | split = 'train',
116 | download_config = DownloadConfig(max_retries = 10)
117 | )
118 |
119 | covid_news = covid_news.map(
120 | lambda doc: {'text': (str(doc['title']).strip() + '\n\n' + str(doc['text']).strip()).strip()},
121 | remove_columns = ['title', 'description', 'domain', 'url', 'labels'],
122 | desc = 'Cleaning covid_news',
123 | num_proc = NUM_THREADS
124 | )
125 |
126 | self.dataset = concatenate_datasets([news_fr, wikinews, cc_news, xlsum, xlsum_summaries, mlsum, mlsum_summaries, orange_sum, covid_news])
127 | self.dataset = self.dataset.filter(lambda doc: len(str(doc['text']).strip()) >= MIN_DOCUMENT_SIZE)
128 | self.size['train'] = 0
129 |
130 | for doc in self.dataset:
131 | self.size['train'] += len(str(doc['text']).strip())
132 |
133 | print(f'News dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters')
134 |
135 |
136 | def _clean_news_fr(self, text: str) -> str:
137 |
138 | text = text.replace(' ,', ',')
139 | text = text.replace(' .', '.')
140 | text = text.replace(' )', ')')
141 | text = text.replace('( ', '(')
142 | text = text.replace(' ]', ']')
143 | text = text.replace('[ ', '[')
144 |
145 | text = re.sub(r'(\d)\s*,\s*(\d)', r'\1,\2', text)
146 |
147 | array = list(text)
148 | start = True
149 |
150 | for i in range(len(array)):
151 | if array[i] == '"':
152 | array[i] = '«' if start else '»'
153 | start = not start
154 |
155 | return ''.join(array)
156 |
157 |
158 | def _clean_wikinews(self, document) -> str:
159 |
160 | meta = str(document['meta']).strip()
161 | start = meta.find(", 'title': ") + 12
162 | end = meta.find(", 'type':") - 1
163 |
164 | if start != 11 and end != -2:
165 | title = meta[start:end].strip()
166 | else:
167 | title = ''
168 |
169 | text = str(document['text']).strip()
170 |
171 | if len(text) < 32:
172 | return text
173 |
174 | index = text[:30].find('–')
175 |
176 | if index != -1:
177 | text = text[index + 1:]
178 |
179 | output = title + '\n\n' + text.strip()
180 |
181 | return output.strip()
--------------------------------------------------------------------------------
/dimgpt/data/datasets/pretraining/others.py:
--------------------------------------------------------------------------------
1 | import re
2 | from datasets import load_dataset, DownloadConfig, concatenate_datasets
3 | from dimgpt.data.datasets import Dataset
4 | from dimgpt.settings import *
5 |
6 | class OthersDataset(Dataset):
7 |
8 | def __init__(self) -> None:
9 |
10 | super().__init__()
11 |
12 | self.training_part = 'pretraining'
13 | self.name = 'others'
14 | self.multiplier = 2.0
15 |
16 | print('Downloading Others dataset...')
17 |
18 | ted_talks = load_dataset(
19 | path = 'bigscience-data/roots_fr_ted_talks_iwslt',
20 | split = 'train',
21 | download_config = DownloadConfig(max_retries = 10)
22 | )
23 |
24 | ted_talks = ted_talks.remove_columns('meta')
25 |
26 | bloom_lm = load_dataset(
27 | path = 'sil-ai/bloom-lm',
28 | name = 'fra',
29 | split = 'train+validation+test',
30 | download_config = DownloadConfig(max_retries = 10)
31 | )
32 |
33 | bloom_lm = bloom_lm.remove_columns(['title', 'license', 'pageCount', 'bookInstanceId', 'bookLineage'])
34 |
35 | self.dataset = concatenate_datasets([ted_talks, bloom_lm])
36 | self.dataset = self.dataset.filter(lambda doc: len(str(doc['text']).strip()) >= MIN_DOCUMENT_SIZE)
37 | self.size['train'] = 0
38 |
39 | for doc in self.dataset:
40 | self.size['train'] += len(str(doc['text']).strip())
41 |
42 | print(f'Others dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters')
43 |
--------------------------------------------------------------------------------
/dimgpt/data/datasets/pretraining/wikipedia.py:
--------------------------------------------------------------------------------
1 | import re
2 | from datasets import load_dataset, DownloadConfig, concatenate_datasets
3 | from dimgpt.data.datasets import Dataset
4 | from dimgpt.settings import *
5 |
6 | class WikipediaDataset(Dataset):
7 |
8 | def __init__(self) -> None:
9 |
10 | super().__init__()
11 |
12 | self.training_part = 'pretraining'
13 | self.name = 'wikipedia'
14 | self.multiplier = 4.0
15 |
16 | print('Downloading Wikipedia dataset...')
17 |
18 | wikipedia_fr = load_dataset(
19 | path = 'eckendoerffer/wikipedia_fr',
20 | split = 'train+validation+test',
21 | download_config = DownloadConfig(max_retries = 10)
22 | )
23 |
24 | wikipedia_fr = wikipedia_fr.map(
25 | lambda doc: {'text': self._clean_wikipedia_fr(doc['text'])},
26 | desc = 'Cleaning wikipedia_fr',
27 | num_proc = NUM_THREADS
28 | )
29 |
30 | roots_fr_wikipedia = load_dataset(
31 | path = 'bigscience-data/roots_fr_wikipedia',
32 | split = 'train',
33 | download_config = DownloadConfig(max_retries = 10)
34 | )
35 |
36 | roots_fr_wikipedia = roots_fr_wikipedia.remove_columns('meta')
37 |
38 | roots_fr_wikivoyage = load_dataset(
39 | path = 'bigscience-data/roots_fr_wikivoyage',
40 | split = 'train',
41 | download_config = DownloadConfig(max_retries = 10)
42 | )
43 |
44 | roots_fr_wikivoyage = roots_fr_wikivoyage.remove_columns('meta')
45 |
46 | self.dataset = concatenate_datasets([wikipedia_fr, roots_fr_wikipedia, roots_fr_wikivoyage])
47 | self.dataset = self.dataset.filter(lambda doc: len(str(doc['text']).strip()) >= MIN_DOCUMENT_SIZE)
48 | self.size['train'] = 0
49 |
50 | for doc in self.dataset:
51 | self.size['train'] += len(str(doc['text']).strip())
52 |
53 | print(f'Wikipedia dataset downloaded: {len(self.dataset):,} documents | {self.size["train"]:,} characters')
54 |
55 |
56 | def _clean_wikipedia_fr(self, text: str) -> str:
57 |
58 | text = text.replace(' ,', ',')
59 | text = text.replace(' .', '.')
60 | text = text.replace(' )', ')')
61 | text = text.replace('( ', '(')
62 | text = text.replace(' ]', ']')
63 | text = text.replace('[ ', '[')
64 |
65 | text = re.sub(r'(\d)\s*,\s*(\d)', r'\1,\2', text)
66 |
67 | array = list(text)
68 | start = True
69 |
70 | for i in range(len(array)):
71 | if array[i] == '"':
72 | array[i] = '«' if start else '»'
73 | start = not start
74 |
75 | return ''.join(array)
--------------------------------------------------------------------------------
/dimgpt/data/finetuning.py:
--------------------------------------------------------------------------------
1 | import os, pickle
2 | from datasets import load_dataset, DownloadConfig
3 | from tqdm import tqdm
4 |
5 | from dimgpt.settings import *
6 | from dimgpt.data.tokenizer import Tokenizer
7 |
8 |
9 | class Finetuning:
10 |
11 | def __init__(self):
12 |
13 | self.import_dataset()
14 |
15 |
16 | def import_dataset(self) -> None:
17 |
18 | if os.path.exists(os.path.join(DATA_DIR, 'finetuning', 'chatbot_conversations_train.pkl')):
19 | return
20 |
21 | self.datasets = {}
22 |
23 | for name in ['human_conversations', 'chatbot_conversations', 'dimension_gpt_conversations', 'human_preprompts', 'chatbot_preprompts', 'dimension_gpt_preprompts']:
24 |
25 | self.datasets[name] = load_dataset(
26 | path = 'angeluriot/DimensionGPT_instruct',
27 | name = name,
28 | download_config = DownloadConfig(max_retries = 10),
29 | num_proc = NUM_THREADS
30 | )
31 |
32 |
33 | def document_to_tokens(self, document: dict[str, str], tokenizer: Tokenizer, preprompts: bool) -> dict[str, list[int] | int]:
34 |
35 | if preprompts:
36 |
37 | tokens = [tokenizer.system_token, *tokenizer.encode(document['preprompt'])]
38 |
39 | return {'tokens': tokens, 'size': len(tokens)}
40 |
41 | tokens = []
42 |
43 | for msg in document['conversation']:
44 |
45 | if msg['role'] == 'user':
46 | tokens.append(tokenizer.user_token)
47 | elif msg['role'] == 'assistant':
48 | tokens.append(tokenizer.assistant_token)
49 | else:
50 | tokens.append(tokenizer.human_token)
51 |
52 | tokens.extend(tokenizer.encode(msg['text']))
53 |
54 | return {'tokens': tokens, 'size': len(tokens)}
55 |
56 |
57 | def save(self, tokenizer: Tokenizer) -> None:
58 |
59 | if os.path.exists(os.path.join(DATA_DIR, 'finetuning', 'chatbot_conversations_train.pkl')):
60 | return
61 |
62 | if not os.path.exists(os.path.join(DATA_DIR, 'finetuning')):
63 | os.makedirs(os.path.join(DATA_DIR, 'finetuning'))
64 |
65 | for name, dataset in self.datasets.items():
66 |
67 | if name == 'chatbot_conversations':
68 | dataset = dataset['train'].train_test_split(test_size = FINETUNING_VAL_RATIO, shuffle = True)
69 | dataset['val'] = dataset.pop('test')
70 |
71 | tokenized = dataset.map(
72 | lambda doc: self.document_to_tokens(doc, tokenizer, name.endswith('preprompts')),
73 | desc = f'Tokenizing {name}',
74 | num_proc = NUM_THREADS
75 | )
76 |
77 | for split, documents in tokenized.items():
78 |
79 | docs = []
80 |
81 | for doc in tqdm(documents, desc = f'Saving finetuning dataset {name}_{split}'):
82 | docs.append(doc['tokens'])
83 |
84 | with open(os.path.join(DATA_DIR, 'finetuning', f'{name}_{split}.pkl'), 'wb') as file:
85 | pickle.dump(docs, file)
86 |
87 |
--------------------------------------------------------------------------------
/dimgpt/data/pretokenizer.py:
--------------------------------------------------------------------------------
1 | import regex
2 |
3 | from tokenizers import *
4 | from dimgpt.settings import *
5 | from dimgpt.utils import *
6 |
7 |
8 | def split(text: str) -> list[str]:
9 |
10 | if text == '':
11 | return []
12 |
13 | # Split in words
14 |
15 | safe_control_tokens = [regex.escape(c) for c in CONTROL_TOKENS]
16 | reg = r'(' + r'|'.join(safe_control_tokens) + r'|\d+|\s+|\p{L}+|[^\d\p{L}\s' + r''.join([f'[{i}]' for i in safe_control_tokens]) + r']+)'
17 | words = regex.split(reg, text, flags = regex.UNICODE, concurrent = False)
18 | words = list(filter(None, words))
19 |
20 | # Add beginning spaces
21 |
22 | temp = []
23 | i = 0
24 |
25 | while i < len(words) - 1:
26 |
27 | if words[i] == ' ' and words[i + 1] not in CONTROL_TOKENS:
28 | temp.append(' ' + words[i + 1])
29 | i += 2
30 | continue
31 |
32 | if words[i].endswith(' ') and words[i + 1] not in CONTROL_TOKENS:
33 | temp.extend([words[i][:-1], ' ' + words[i + 1]])
34 | i += 2
35 | continue
36 |
37 | temp.append(words[i])
38 | i += 1
39 |
40 | if i == len(words) - 1:
41 | temp.append(words[-1])
42 |
43 | words = temp
44 | words = list(filter(None, words))
45 |
46 | return words
47 |
48 |
49 | class PreTokenizer:
50 |
51 | def split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
52 |
53 | print('Pretokenize...')
54 |
55 | words = split(str(normalized_string))
56 | words = [NormalizedString(word) for word in words]
57 |
58 | print('Nb words:', '{:,.0f}'.format(len(words)))
59 | print('Merges...')
60 |
61 | return words
62 |
63 |
64 | def pre_tokenize(self, pretok: PreTokenizedString) -> None:
65 |
66 | pretok.split(self.split)
67 |
--------------------------------------------------------------------------------
/dimgpt/data/pretraining.py:
--------------------------------------------------------------------------------
1 | import os
2 | import numpy as np
3 | from tqdm import tqdm
4 |
5 | from dimgpt.settings import *
6 | from dimgpt.data.clean import *
7 | from dimgpt.data.tokenizer import Tokenizer
8 | from dimgpt.data.datasets.pretraining import *
9 | from dimgpt.data.datasets import Dataset
10 |
11 |
12 | class Pretraining:
13 |
14 | def __init__(self):
15 |
16 | self.datasets: list[Dataset] = [CommonCrawlDataset(), WikipediaDataset(), BooksDataset(), NewsDataset(), InstitutionsDataset(), OthersDataset()]
17 |
18 |
19 | def get_document(self) -> str:
20 |
21 | probabilities = np.array([dataset.size['train'] * dataset.multiplier for dataset in self.datasets])
22 | probabilities /= np.sum(probabilities)
23 |
24 | dataset = np.random.choice(self.datasets, p = probabilities)
25 |
26 | return dataset.get_document()
27 |
28 |
29 | def create_tokenizer_data(self, epsilon: float = 1e-8) -> tuple[list[int], list[str]]:
30 |
31 | if os.path.exists(os.path.join(DATA_DIR, 'tokenizer_data.txt')):
32 |
33 | return [0] * len(self.datasets), ['']
34 |
35 | target_ratios = np.array([dataset.size['train'] * dataset.multiplier for dataset in self.datasets])
36 | target_ratios = (target_ratios / np.sum(target_ratios)).tolist()
37 |
38 | with open(os.path.join(DATA_DIR, 'tokenizer_data.txt'), 'w', encoding = 'utf-8') as file:
39 |
40 | file.truncate(0)
41 | chars = {}
42 | current_sizes = [0] * len(self.datasets)
43 | pbar = tqdm(total = TOKENIZER_DATA_SIZE)
44 |
45 | while True:
46 |
47 | current_ratios = [size / (sum(current_sizes) + epsilon) for size in current_sizes]
48 | ratio_errors = [target_ratios[i] - current_ratios[i] for i in range(len(self.datasets))]
49 | dataset_index = np.argmax(ratio_errors)
50 | dataset = self.datasets[dataset_index]
51 |
52 | document = dataset.get_document()
53 |
54 | if len(document) == 0:
55 | continue
56 |
57 | file.write(document)
58 | current_sizes[dataset_index] += len(document)
59 |
60 | for char in document:
61 | chars[char] = chars.get(char, 0) + 1
62 |
63 | pbar.update(len(document))
64 |
65 | if sum(current_sizes) >= TOKENIZER_DATA_SIZE:
66 | break
67 |
68 | document = ' ' + ' '.join(list(POSSIBLE_CHARS))
69 | file.write(document)
70 |
71 | for char in document:
72 | chars[char] = chars.get(char, 0) + 1
73 |
74 | pbar.close()
75 |
76 | chars = sorted(chars.items(), key = lambda item: item[1], reverse = True)
77 | chars = [char for char, _ in chars]
78 |
79 | return current_sizes, chars
80 |
81 |
82 | def save(self, tokenizer: Tokenizer) -> None:
83 |
84 | for dataset in self.datasets:
85 | dataset.save(tokenizer)
86 |
87 |
88 | def summary(self) -> None:
89 |
90 | for dataset in self.datasets:
91 | print(f'{dataset.name}: {len(dataset.dataset):,} documents | {dataset.size["train"]:,} characters | {dataset.multiplier:.1f}x')
--------------------------------------------------------------------------------
/dimgpt/data/tokenizer.py:
--------------------------------------------------------------------------------
1 | import os
2 | import numpy as np
3 | import numpy.typing as npt
4 | import tokenizers as tk
5 | from tokenizers.models import BPE
6 | from tokenizers.trainers import BpeTrainer
7 | from tokenizers.pre_tokenizers import PreTokenizer
8 | from tqdm import tqdm
9 |
10 | from dimgpt.data.clean import *
11 | from dimgpt.utils import *
12 | import dimgpt.data.pretokenizer as pretk
13 | from dimgpt.settings import *
14 |
15 | class Tokenizer:
16 |
17 | def __init__(self):
18 |
19 | self.vocab: list[str] = []
20 | self.to_index: dict[str, int] = {}
21 | self.to_token: dict[int, str] = {}
22 |
23 | if os.path.exists(os.path.join(DATA_DIR, 'vocab.txt')):
24 | self.load_from_vocab(load_text_array(os.path.join(DATA_DIR, 'vocab.txt')))
25 | else:
26 | self.create(os.path.join(DATA_DIR, 'tokenizer_data.txt'))
27 | save_text_array(self.vocab, os.path.join(DATA_DIR, 'vocab.txt'))
28 |
29 |
30 | def _set_control_tokens(self) -> None:
31 |
32 | self.unknown_token = self.to_index['⮜unknown⮞']
33 | self.padding_token = self.to_index['⮜padding⮞']
34 | self.start_of_text_token = self.to_index['⮜start-of-text⮞']
35 | self.tab_token = self.to_index['⮜tab⮞']
36 | self.new_line_token = self.to_index['⮜new-line⮞']
37 | self.human_token = self.to_index['⮜human⮞']
38 | self.system_token = self.to_index['⮜system⮞']
39 | self.user_token = self.to_index['⮜user⮞']
40 | self.assistant_token = self.to_index['⮜assistant⮞']
41 | self.end_of_text_token = self.to_index['⮜end-of-text⮞']
42 |
43 |
44 | def load_from_vocab(self, vocab: list[str]) -> None:
45 |
46 | self.vocab = vocab.copy()
47 | self.to_index = {v: i for i, v in enumerate(self.vocab)}
48 | self.to_token = {i: v for i, v in enumerate(self.vocab)}
49 | self._set_control_tokens()
50 |
51 |
52 | def create(self, data_path: str) -> None:
53 |
54 | self._create_vocab(data_path)
55 | dataset = open(data_path, 'r', encoding = 'utf-8').read()
56 | self._sort_vocab(dataset)
57 | self._set_control_tokens()
58 |
59 |
60 | def _create_vocab(self, data_path: str) -> None:
61 |
62 | print('Creating vocab...')
63 |
64 | tokenizer = tk.Tokenizer(BPE(unk_token = '⮜unknown⮞'))
65 | tokenizer.pre_tokenizer = PreTokenizer.custom(pretk.PreTokenizer())
66 |
67 | trainer = BpeTrainer(
68 | vocab_size = int(VOCAB_SIZE * 1.1),
69 | show_progress = True,
70 | special_tokens = CONTROL_TOKENS
71 | )
72 |
73 | tokenizer.train([data_path], trainer)
74 |
75 | self.vocab = list(tokenizer.get_vocab().keys())
76 | vocab_size = len(self.vocab)
77 |
78 | def is_valid(word: str) -> bool:
79 |
80 | if len(word) > MAX_TOKEN_LENGTH:
81 | return False
82 |
83 | if word.endswith(' ') and len(word) > 4:
84 | return False
85 |
86 | if any(c not in POSSIBLE_CHARS for c in word):
87 | return False
88 |
89 | nb_digits = 0
90 |
91 | for char in word:
92 | if char.isdigit():
93 | nb_digits += 1
94 |
95 | return nb_digits < 2
96 |
97 | self.vocab = list(filter(lambda v: is_valid(v), self.vocab))
98 |
99 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({vocab_size - len(self.vocab):,} invalid tokens removed)')
100 | vocab_size = len(self.vocab)
101 |
102 | for i in range(10):
103 | if str(i) not in self.vocab:
104 | self.vocab.append(str(i))
105 | if ' ' + str(i) not in self.vocab:
106 | self.vocab.append(' ' + str(i))
107 |
108 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({len(self.vocab) - vocab_size:,} number tokens added)')
109 | vocab_size = len(self.vocab)
110 |
111 | for token in FORCED_TOKENS:
112 | if token not in self.vocab:
113 | self.vocab.append(token)
114 |
115 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({len(self.vocab) - vocab_size:,} forced tokens added)')
116 | vocab_size = len(self.vocab)
117 |
118 | self.vocab = CONTROL_TOKENS + self.vocab
119 |
120 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({len(self.vocab) - vocab_size:,} control tokens added)')
121 |
122 | self.to_index = {v: i for i, v in enumerate(self.vocab)}
123 | self.to_token = {i: v for i, v in enumerate(self.vocab)}
124 |
125 |
126 | def _sort_vocab(self, dataset: str) -> None:
127 |
128 | print('Pretokenize...')
129 | data = pretk.split(dataset)
130 |
131 | print('Sorting vocab...')
132 | vocab = {v: 0 for v in self.vocab}
133 | nb_tokens = 0
134 | total_tokens_length = 0
135 |
136 | for i in tqdm(range(len(data))):
137 |
138 | if data[i] in self.to_index:
139 | vocab[data[i]] += 1
140 | nb_tokens += 1
141 | total_tokens_length += len(data[i])
142 | continue
143 |
144 | j = 0
145 |
146 | while j < len(data[i]):
147 |
148 | found = False
149 |
150 | for k in reversed(range(min(MAX_TOKEN_LENGTH, len(data[i]) - j))):
151 |
152 | word = data[i][j:j + k + 1]
153 |
154 | if word in self.to_index:
155 | vocab[word] += 1
156 | nb_tokens += 1
157 | total_tokens_length += len(word)
158 | j += k
159 | found = True
160 | break
161 |
162 | if not found:
163 | vocab['⮜unknown⮞'] += 1
164 | nb_tokens += 1
165 | total_tokens_length += 5
166 |
167 | j += 1
168 |
169 | self.vocab = list(sorted(vocab.items(), key = lambda x: x[1], reverse = True))
170 | vocab_size = len(self.vocab)
171 | self.vocab = list(filter(lambda x: x[0] not in CONTROL_TOKENS, self.vocab))
172 |
173 | while len(self.vocab) > VOCAB_SIZE - len(CONTROL_TOKENS):
174 |
175 | for i in range(len(self.vocab) - 1, -1, -1):
176 |
177 | if len(self.vocab[i][0]) > 1 and self.vocab[i][0] not in FORCED_TOKENS and not (self.vocab[i][0][-1].isdigit() and len(self.vocab[i][0]) <= 2):
178 | self.vocab.pop(i)
179 | break
180 |
181 | self.vocab = [v[0] for v in self.vocab]
182 | self.vocab = CONTROL_TOKENS + self.vocab
183 |
184 | print(f'Vocab size: {vocab_size:,} -> {len(self.vocab):,} ({vocab_size - len(self.vocab):,} unused tokens removed)')
185 |
186 | self.to_index = {v: i for i, v in enumerate(self.vocab)}
187 | self.to_token = {i: v for i, v in enumerate(self.vocab)}
188 |
189 | print(f'Number of tokens: {nb_tokens:,}')
190 | print(f'Average token length: {total_tokens_length / nb_tokens:.2f}')
191 |
192 |
193 | def encode(self, text: str, clean_text: bool = True, keep_control_tokens: bool = False, verbose: bool = False) -> list[int]:
194 |
195 | if verbose:
196 | print('Pretokenize...')
197 |
198 | if clean_text:
199 | text = clean_string(text, keep_control_tokens)
200 |
201 | data = pretk.split(text)
202 |
203 | if verbose:
204 | print('Encoding dataset...')
205 |
206 | output = []
207 |
208 | for i in tqdm(range(len(data)), disable = not verbose):
209 |
210 | if data[i] in self.to_index:
211 | output.append(self.to_index[data[i]])
212 | continue
213 |
214 | j = 0
215 |
216 | while j < len(data[i]):
217 |
218 | found = False
219 |
220 | for k in reversed(range(min(MAX_TOKEN_LENGTH, len(data[i]) - j))):
221 |
222 | word = data[i][j:j + k + 1]
223 |
224 | if word in self.to_index:
225 | output.append(self.to_index[word])
226 | j += k
227 | found = True
228 | break
229 |
230 | if not found:
231 | output.append(self.to_index['⮜unknown⮞'])
232 |
233 | j += 1
234 |
235 | return output
236 |
237 |
238 | def decode(self, tokens: list[int] | npt.NDArray[np.uint16] | torch.Tensor | int, keep_control_tokens: bool = False,
239 | token_array: bool = False) -> str | list[str]:
240 |
241 | if type(tokens) == int:
242 | tokens = [tokens]
243 | if type(tokens) == torch.Tensor:
244 | tokens = tokens.detach().to('cpu').tolist()
245 | elif type(tokens) != list:
246 | tokens = list(tokens)
247 |
248 | text = []
249 |
250 | for t in tokens:
251 |
252 | if t < 0 or t >= len(self.vocab):
253 | continue
254 |
255 | text.append(unclean_string(self.to_token[t], keep_control_tokens))
256 |
257 | if token_array:
258 | return text
259 |
260 | return ''.join(text)
261 |
--------------------------------------------------------------------------------
/dimgpt/settings.py:
--------------------------------------------------------------------------------
1 | import os, torch
2 | from contextlib import nullcontext
3 |
4 | # ============== Dataset ============== #
5 |
6 | DATA_DIR = 'data'
7 | OUTPUT_DIR = 'output'
8 | NUM_THREADS = 16
9 |
10 | TOKENIZER_DATA_SIZE = 300_000_000
11 | MIN_DOCUMENT_SIZE = 64
12 | PRETRAINING_VAL_RATIO = 0.001
13 | MAX_TOKEN_LENGTH = 16
14 | CONTROL_TOKENS = ['⮜unknown⮞', '⮜padding⮞', '⮜start-of-text⮞', '⮜tab⮞', '⮜new-line⮞', '⮜human⮞', '⮜system⮞', '⮜user⮞', '⮜assistant⮞', '⮜end-of-text⮞']
15 | PADDING_TOKEN = 1
16 | FORCED_TOKENS = ['Dimension', ' Dimension', 'GPT', ' GPT', 'IA', ' IA', 'Generative', ' Generative', 'Pre', ' Pre', 'trained', ' trained', 'Transformer', ' Transformer']
17 |
18 | FINETUNING_VAL_RATIO = 0.01
19 |
20 | SPLIT_RATIOS = [
21 | 0.099, # human
22 | 0.9, # chatbot
23 | 0.001 # DimensionGPT
24 | ]
25 |
26 | HUMAN_PREPROMPT_RATIOS = [
27 | 0.3, # human
28 | 0.0, # chatbot
29 | 0.0, # DimensionGPT
30 | 0.7 # None
31 | ]
32 |
33 | CHATBOT_PREPROMPT_RATIOS = [
34 | 0.0, # human
35 | 0.5, # chatbot
36 | 0.4, # DimensionGPT
37 | 0.1 # None
38 | ]
39 |
40 | DIMENSION_GPT_PREPROMPT_RATIOS = [
41 | 0.0, # human
42 | 0.0, # chatbot
43 | 1.0, # DimensionGPT
44 | 0.0 # None
45 | ]
46 |
47 | INSTRUCTION_LOSS_STRENGTH = 0.1
48 | PREPROMPT = "Une discussion entre un utilisateur et DimensionGPT, un modèle de langage conversationnel français créé par le développeur indépendant Dimension et basé sur l'architecture GPT."
49 |
50 | # =============== Model =============== #
51 |
52 | VOCAB_SIZE = 32_000
53 | MAX_CONTEXT = 512
54 | WINDOW_SIZE = 256
55 | EMBEDDING_DIM = 1024
56 | NUM_GROUPED_HEADS = 4
57 | NUM_HEADS = 16
58 | HEAD_DIM = EMBEDDING_DIM // NUM_HEADS
59 | FFN_DIM = int((2.0 / 3.0) * 4 * EMBEDDING_DIM)
60 | NUM_BLOCKS = 16
61 | DROPOUT = 0
62 | INIT_STDDEV = 0.02
63 | ROPE_THETA = 10000.0
64 |
65 | # ============= Training ============== #
66 |
67 | BATCH_SIZE = 16
68 | NUM_ACCUMULATIONS = 64
69 |
70 | MAX_LEARNING_RATE = 6e-4
71 | MIN_LEARNING_RATE = 6e-5
72 | WARMUP_STEPS = 2_000
73 | DECAY_STEPS = 100_000
74 |
75 | BETA_1 = 0.9
76 | BETA_2 = 0.95
77 | EPSILON = 1e-5
78 | WEIGHT_DECAY = 0.1
79 | CLIP_GRADIENT = 1.0
80 |
81 | METRICS_BETA = 0.9
82 | VAL_INTERVAL = 50
83 |
84 | # ===================================== #
85 |
86 | GPU_ENABLED = torch.cuda.is_available()
87 | FLOAT16_ENABLED = GPU_ENABLED and torch.cuda.is_bf16_supported()
88 | DEVICE_NAME = 'cuda:0' if GPU_ENABLED else 'cpu'
89 | DEVICE = torch.device(DEVICE_NAME)
90 | CONTEXT = torch.autocast(device_type='cuda', dtype=torch.bfloat16) if FLOAT16_ENABLED else nullcontext()
91 |
--------------------------------------------------------------------------------
/dimgpt/testing/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/dimgpt/testing/__init__.py
--------------------------------------------------------------------------------
/dimgpt/testing/sampling.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import numpy as np
3 | import numpy.typing as npt
4 |
5 | from dimgpt.training.model import Model
6 | from dimgpt.data.tokenizer import Tokenizer
7 | from dimgpt.settings import *
8 |
9 |
10 | class Sampler():
11 |
12 | def __init__(self, model: Model, tokenizer: Tokenizer):
13 |
14 | self.model = model
15 | self.tokenizer = tokenizer
16 | self.preprompt = [self.tokenizer.system_token, *self.tokenizer.encode(PREPROMPT)]
17 |
18 |
19 | def get_probabilities(self, input: list[int]) -> npt.NDArray[np.float32]:
20 |
21 | with CONTEXT:
22 | model_input = torch.tensor([input], dtype = torch.long, device = DEVICE)
23 | model_output = self.model(model_input, only_last = True)
24 |
25 | probabilities = model_output[0].float().detach().to('cpu').numpy()
26 | probabilities = np.exp(probabilities) / np.sum(np.exp(probabilities))
27 |
28 | return probabilities
29 |
30 |
31 | def sample(self, input: list[int], chatbot: bool, temperature: float = 1.0, top_p: float = 1.0, no_repeat_strength: float = 0.0) -> int:
32 |
33 | probabilities = np.log(self.get_probabilities(input))
34 | proximity = MAX_CONTEXT
35 |
36 | for i in reversed(range(max(len(input) - MAX_CONTEXT, 0), len(input))):
37 | strength = no_repeat_strength * (proximity / MAX_CONTEXT)
38 | probabilities[input[i]] *= (1 + strength)
39 | proximity -= 1
40 |
41 | if temperature == 0.0:
42 | return np.argmax(probabilities)
43 |
44 | probabilities /= temperature
45 | probabilities = np.exp(probabilities) / np.sum(np.exp(probabilities))
46 |
47 | if chatbot:
48 | probabilities[self.tokenizer.end_of_text_token] += probabilities[self.tokenizer.user_token]
49 |
50 | probabilities[self.tokenizer.unknown_token] = 0.0
51 | probabilities[self.tokenizer.padding_token] = 0.0
52 | probabilities[self.tokenizer.start_of_text_token] = 0.0
53 | probabilities[self.tokenizer.human_token] = 0.0
54 | probabilities[self.tokenizer.system_token] = 0.0
55 | probabilities[self.tokenizer.user_token] = 0.0
56 | probabilities[self.tokenizer.assistant_token] = 0.0
57 |
58 | probabilities /= np.sum(probabilities)
59 |
60 | sorted_indices = np.argsort(-probabilities)
61 | 		cumsum_probabilities = np.cumsum(probabilities[sorted_indices])
62 | 		cutoff_index = np.searchsorted(cumsum_probabilities, max(top_p, cumsum_probabilities[0] + 1e-6))
63 | temp = np.zeros_like(probabilities)
64 | temp[sorted_indices[:cutoff_index]] = probabilities[sorted_indices[:cutoff_index]]
65 | probabilities = temp / np.sum(temp)
66 |
67 | return np.random.choice(range(len(probabilities)), p = probabilities)
68 |
69 |
70 | def generate(self, input: str, max_length: int, chat_bot: bool = False, temperature: float = 1.0,
71 | top_p: float = 1.0, no_repeat: float = 0.0, verbose: bool = False, max_print_line_length = 0) -> str:
72 |
73 | self.model.eval()
74 |
75 | with torch.no_grad():
76 |
77 | input = self.tokenizer.encode(input)
78 |
79 | if chat_bot:
80 | input = [self.tokenizer.start_of_text_token, *self.preprompt, self.tokenizer.user_token, *input, self.tokenizer.assistant_token]
81 | else:
82 | input = [self.tokenizer.start_of_text_token, *input]
83 |
84 | output = []
85 | to_print = []
86 | last_line_length = 0
87 |
88 | if not chat_bot:
89 | output = input[1:].copy()
90 | to_print = input[1:].copy()
91 | text = self.tokenizer.decode(to_print)
92 | last_line_length = len(text) - 1 - text.rfind('\n')
93 |
94 | for _ in range(max_length):
95 |
96 | 				index = self.sample(input, chat_bot, temperature, top_p, no_repeat)
97 |
98 | if index == self.tokenizer.end_of_text_token:
99 | break
100 |
101 | input.append(index)
102 | output.append(index)
103 | to_print.append(index)
104 |
105 | if verbose:
106 |
107 | text = self.tokenizer.decode(to_print)
108 |
109 | if '\n' in text:
110 | last_line_length = len(text) - 1 - text.rfind('\n')
111 | else:
112 | last_line_length += len(text)
113 |
114 | if max_print_line_length > 0 and last_line_length >= max_print_line_length and text.startswith(' '):
115 | print()
116 | text = text[1:]
117 | last_line_length = 0
118 |
119 | print(text, end = '')
120 | to_print = []
121 |
122 | return self.tokenizer.decode(output)
123 |
--------------------------------------------------------------------------------
/dimgpt/training/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/dimgpt/training/__init__.py
--------------------------------------------------------------------------------
/dimgpt/training/datasets/__init__.py:
--------------------------------------------------------------------------------
1 | from .dataset import Dataset
2 | from .pretraining import PretrainingDataset
3 | from .finetuning import FinetuningDataset
--------------------------------------------------------------------------------
/dimgpt/training/datasets/dataset.py:
--------------------------------------------------------------------------------
1 | from abc import ABC
2 | import torch
3 |
4 | from dimgpt.data.tokenizer import Tokenizer
5 | from dimgpt.settings import *
6 |
7 |
8 | class Dataset(ABC):
9 |
10 | def __init__(self, tokenizer: Tokenizer):
11 |
12 | self.tokenizer = tokenizer
13 |
14 |
15 | def train_size(self) -> int:
16 |
17 | pass
18 |
19 |
20 | def val_size(self) -> int:
21 |
22 | pass
23 |
24 |
25 | 	def _get_random_document(self, val: bool) -> tuple[list[int], list[int]]:
26 |
27 | pass
28 |
29 |
30 | def _get_tokens(self, val: bool) -> tuple[list[int], list[int]]:
31 |
32 | pass
33 |
34 |
35 | def _next(self, val: bool) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
36 |
37 | x = []
38 | y = []
39 | strengths = []
40 |
41 | for _ in range(BATCH_SIZE):
42 |
43 | xy, strength = self._get_tokens(val)
44 |
45 | x.append(xy[0:MAX_CONTEXT])
46 | y.append(xy[1:MAX_CONTEXT + 1])
47 | strengths.append(strength[1:MAX_CONTEXT + 1])
48 |
49 | x = torch.tensor(x, dtype = torch.long).pin_memory().to(DEVICE, non_blocking = True)
50 | y = torch.tensor(y, dtype = torch.long).pin_memory().to(DEVICE, non_blocking = True)
51 | strengths = torch.tensor(strengths, dtype = torch.float32).pin_memory().to(DEVICE, non_blocking = True)
52 |
53 | return x, y, strengths
54 |
55 |
56 | def next_train(self) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
57 |
58 | return self._next(False)
59 |
60 |
61 | def next_val(self) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
62 |
63 | return self._next(True)
--------------------------------------------------------------------------------
/dimgpt/training/datasets/finetuning.py:
--------------------------------------------------------------------------------
1 | import os, random, pickle
2 | import torch
3 | import numpy as np
4 |
5 | from dimgpt.data.tokenizer import Tokenizer
6 | from dimgpt.settings import *
7 | from dimgpt.training.datasets import Dataset
8 |
9 |
10 | class FinetuningDataset(Dataset):
11 |
12 | def __init__(self, tokenizer: Tokenizer):
13 |
14 | self.tokenizer = tokenizer
15 |
16 | self.train_data = {
17 | 'human': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'human_conversations_train.pkl'), 'rb')),
18 | 'chatbot': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'chatbot_conversations_train.pkl'), 'rb')),
19 | 'dimension_gpt': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'dimension_gpt_conversations_train.pkl'), 'rb'))
20 | }
21 |
22 | self.val_data = pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'chatbot_conversations_val.pkl'), 'rb'))
23 |
24 | self.train_preprompts = {
25 | 'human': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'human_preprompts_train.pkl'), 'rb')),
26 | 'chatbot': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'chatbot_preprompts_train.pkl'), 'rb')),
27 | 'dimension_gpt': pickle.load(open(os.path.join(DATA_DIR, 'finetuning', 'dimension_gpt_preprompts_train.pkl'), 'rb'))
28 | }
29 |
30 | self.final_preprompt = [self.tokenizer.system_token, *self.tokenizer.encode(PREPROMPT)]
31 |
32 | self.preprompt_ratios = {
33 | 'human': HUMAN_PREPROMPT_RATIOS,
34 | 'chatbot': CHATBOT_PREPROMPT_RATIOS,
35 | 'dimension_gpt': DIMENSION_GPT_PREPROMPT_RATIOS
36 | }
37 |
38 | h = [len(i) for i in self.train_data['human']]
39 | c = [len(i) for i in self.train_data['chatbot']]
40 | d = [len(i) for i in self.train_data['dimension_gpt']]
41 |
42 | self.train_data_p = {
43 | 'human': (np.array(h) / np.sum(h)).tolist(),
44 | 'chatbot': (np.array(c) / np.sum(c)).tolist(),
45 | 'dimension_gpt': (np.array(d) / np.sum(d)).tolist()
46 | }
47 |
48 | print(sum(self.train_data_p['human']))
49 | print(sum(self.train_data_p['chatbot']))
50 | print(sum(self.train_data_p['dimension_gpt']))
51 |
52 | v = [len(i) for i in self.val_data]
53 |
54 | self.val_data_p = (np.array(v) / np.sum(v)).tolist()
55 |
56 | self.train_ids = {
57 | 'human': list(range(len(self.train_data['human']))),
58 | 'chatbot': list(range(len(self.train_data['chatbot']))),
59 | 'dimension_gpt': list(range(len(self.train_data['dimension_gpt'])))
60 | }
61 |
62 | self.val_ids = list(range(len(self.val_data)))
63 |
64 |
65 | def train_size(self) -> int:
66 |
67 | return sum([sum([len(i) for i in self.train_data[key]]) for key in self.train_data])
68 |
69 |
70 | def val_size(self) -> int:
71 |
72 | return sum([len(i) for i in self.val_data])
73 |
74 |
75 | def __get_strength(self, doc: list[int], val: bool) -> list[int]:
76 |
77 | assistant = False
78 | instruction_loss_strength = 0.0 if val else INSTRUCTION_LOSS_STRENGTH
79 | strength = []
80 |
81 | for token in doc:
82 |
83 | strength.append(1.0 if assistant else instruction_loss_strength)
84 |
85 | if token == self.tokenizer.user_token or token == self.tokenizer.end_of_text_token:
86 | assistant = False
87 |
88 | if token == self.tokenizer.assistant_token or token == self.tokenizer.human_token:
89 | assistant = True
90 |
91 | return strength
92 |
93 |
94 | def __get_document(self, val: bool, first: bool) -> tuple[list[int], list[int]]:
95 |
96 | if val:
97 | data_ids = self.val_ids
98 | data = self.val_data
99 | data_p = self.val_data_p
100 |
101 | else:
102 | data_split = np.random.choice(['human', 'chatbot', 'dimension_gpt'], p = SPLIT_RATIOS)
103 | data_ids = self.train_ids[data_split]
104 | data = self.train_data[data_split]
105 | data_p = self.train_data_p[data_split]
106 |
107 | if first:
108 | id = np.random.choice(data_ids, p = data_p)
109 | conversation = data[id]
110 | else:
111 | conversation = data[random.randint(0, len(data) - 1)]
112 |
113 | if val:
114 | xy = [self.tokenizer.start_of_text_token, *self.final_preprompt, *conversation, self.tokenizer.end_of_text_token]
115 | strength = self.__get_strength(xy, val)
116 | return xy, strength
117 |
118 | preprompt_ratio = self.preprompt_ratios[data_split]
119 | preprompt_split = np.random.choice(['human', 'chatbot', 'dimension_gpt', 'none'], p = preprompt_ratio)
120 |
121 | if preprompt_split != 'none':
122 | preprompt = self.train_preprompts[preprompt_split][random.randint(0, len(self.train_preprompts[preprompt_split]) - 1)]
123 | conversation = [*preprompt, *conversation]
124 |
125 | xy = [self.tokenizer.start_of_text_token, *conversation, self.tokenizer.end_of_text_token]
126 | strength = self.__get_strength(xy, val)
127 |
128 | return xy, strength
129 |
130 |
131 | def _get_random_document(self, val: bool) -> tuple[list[int], list[int]]:
132 |
133 | xy, strength = self.__get_document(val, False)
134 |
135 | return xy, strength
136 |
137 |
138 | 	def _get_tokens(self, val: bool) -> tuple[list[int], list[int]]:
139 |
140 | xy, strength = self.__get_document(val, True)
141 |
142 | i = random.randint(0, len(xy) - 1)
143 | xy = xy[i:]
144 | strength = strength[i:]
145 |
146 | while len(xy) < MAX_CONTEXT + 1:
147 |
148 | _xy, _strength = self._get_random_document(val)
149 |
150 | xy.extend(_xy)
151 | strength.extend(_strength)
152 |
153 | xy = xy[0:MAX_CONTEXT + 1]
154 | strength = strength[0:MAX_CONTEXT + 1]
155 |
156 | return xy, strength
--------------------------------------------------------------------------------
/dimgpt/training/datasets/pretraining.py:
--------------------------------------------------------------------------------
1 | import os, random, pickle
2 | import torch
3 | import numpy as np
4 |
5 | from dimgpt.data.tokenizer import Tokenizer
6 | from dimgpt.settings import *
7 | from dimgpt.training.datasets import Dataset
8 |
9 |
10 | class PretrainingDataset(Dataset):
11 |
12 | def __init__(self, tokenizer: Tokenizer):
13 |
14 | super().__init__(tokenizer)
15 |
16 | datasets = os.listdir(os.path.join(DATA_DIR, 'pretraining'))
17 | self.datasets = []
18 |
19 | for dataset in datasets:
20 |
21 | if not os.path.isdir(os.path.join(DATA_DIR, 'pretraining', dataset)):
22 | continue
23 |
24 | meta = pickle.load(open(os.path.join(DATA_DIR, 'pretraining', dataset, f'metadata.pkl'), 'rb'))
25 |
26 | self.datasets.append({
27 | 'train': {
28 | 'data': np.memmap(os.path.join(DATA_DIR, 'pretraining', dataset, f'train.bin'), dtype = np.uint16, mode = 'r'),
29 | 'ids': pickle.load(open(os.path.join(DATA_DIR, 'pretraining', dataset, f'train_ids.pkl'), 'rb')),
30 | 'size': meta['size']['train']
31 | },
32 | 'val': {
33 | 'data': np.memmap(os.path.join(DATA_DIR, 'pretraining', dataset, f'val.bin'), dtype = np.uint16, mode = 'r'),
34 | 'ids': pickle.load(open(os.path.join(DATA_DIR, 'pretraining', dataset, f'val_ids.pkl'), 'rb')),
35 | 'size': meta['size']['val']
36 | },
37 | 'training_part': meta['training_part'],
38 | 'name': meta['name'],
39 | 'multiplier': meta['multiplier']
40 | })
41 |
42 | self.probas = [dataset['train']['size'] * dataset['multiplier'] for dataset in self.datasets]
43 | self.probas = (np.array(self.probas) / np.sum(self.probas)).tolist()
44 |
45 |
46 | def train_size(self) -> int:
47 |
48 | return sum([dataset['train']['size'] for dataset in self.datasets])
49 |
50 |
51 | def val_size(self) -> int:
52 |
53 | return sum([dataset['val']['size'] for dataset in self.datasets])
54 |
55 |
56 | def _get_random_document(self, val: bool) -> tuple[list[int], list[int]]:
57 |
58 | dataset = np.random.choice(self.datasets, p = self.probas)
59 | ids = dataset['val']['ids'] if val else dataset['train']['ids']
60 | data = dataset['val']['data'] if val else dataset['train']['data']
61 |
62 | i = random.randint(0, len(ids) - 1)
63 | xy = data[ids[i]['start']:ids[i]['start'] + ids[i]['size']]
64 | strength = [1.0] * ids[i]['size']
65 |
66 | return xy, strength
67 |
68 |
69 | 	def _get_tokens(self, val: bool) -> tuple[list[int], list[int]]:
70 |
71 | dataset = np.random.choice(self.datasets, p = self.probas)
72 | data = dataset['val']['data'] if val else dataset['train']['data']
73 |
74 | start = random.randint(0, len(data) - 1 - (MAX_CONTEXT + 1))
75 | xy = []
76 |
77 | for i in range(MAX_CONTEXT + 1):
78 |
79 | token = data[start + i]
80 | xy.append(token)
81 |
82 | if token == self.tokenizer.end_of_text_token:
83 | break
84 |
85 | strength = [1.0] * len(xy)
86 |
87 | while len(xy) < MAX_CONTEXT + 1:
88 |
89 | _xy, _strength = self._get_random_document(val)
90 |
91 | xy.extend(_xy)
92 | strength.extend(_strength)
93 |
94 | xy = xy[0:MAX_CONTEXT + 1]
95 | strength = strength[0:MAX_CONTEXT + 1]
96 |
97 | return xy, strength
--------------------------------------------------------------------------------
/dimgpt/training/layers.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | from torch import nn
4 |
5 | from dimgpt.settings import *
6 |
7 |
8 | # Base class for all layers
9 | class Module(nn.Module):
10 |
11 | # Give the number of parameters of the module
12 | def nb_parameters(self) -> int:
13 |
14 | return sum([np.prod(p.size(), dtype = np.int32) for p in self.parameters()])
15 |
16 |
17 | # Give the number of trainable parameters of the module
18 | def nb_trainable_parameters(self) -> int:
19 |
20 | return sum([np.prod(p.size(), dtype = np.int32) for p in self.parameters() if p.requires_grad])
21 |
22 |
23 | # Give the number of non-trainable parameters of the module
24 | def nb_non_trainable_parameters(self) -> int:
25 |
26 | return sum([np.prod(p.size(), dtype = np.int32) for p in self.parameters() if not p.requires_grad])
27 |
28 |
29 | # Summarize the module
30 | def summary(self) -> None:
31 |
32 | print(f'Number of parameters: {self.nb_parameters():,}')
33 | print(f'Number of trainable parameters: {self.nb_trainable_parameters():,}')
34 | print(f'Number of non-trainable parameters: {self.nb_non_trainable_parameters():,}')
35 |
36 |
37 | # Remove NaNs from the module gradients
38 | def clean_nan(self) -> None:
39 |
40 | for p in self.parameters():
41 | if p.grad is not None:
42 | torch.nan_to_num(p.grad, nan = 0, posinf = 1e5, neginf = -1e5, out = p.grad)
43 |
44 |
45 | # Clip the module gradients
46 | def clip_gradient(self, max_norm: float) -> None:
47 |
48 | nn.utils.clip_grad_norm_(self.parameters(), max_norm)
49 |
50 |
51 | class Linear(nn.Linear):
52 |
53 | def __init__(self, in_features: int, out_features: int, **kwargs):
54 |
55 | super().__init__(in_features, out_features, False, **kwargs)
56 | nn.init.normal_(self.weight, mean = 0.0, std = INIT_STDDEV)
57 |
58 |
59 | class LayerNorm(Module):
60 |
61 | def __init__(self, shape: int, epsilon: float = 1e-5, **kwargs):
62 |
63 | super().__init__(**kwargs)
64 |
65 | self.shape = (shape,)
66 | self.weight = nn.Parameter(torch.ones(shape))
67 | self.epsilon = epsilon
68 |
69 |
70 | def _normalize(self, x: torch.Tensor):
71 |
72 | return x * torch.rsqrt(x.pow(2).mean(-1, keepdim = True) + self.epsilon)
73 |
74 |
75 | def forward(self, x: torch.Tensor):
76 |
77 | return self._normalize(x.float()).type_as(x) * self.weight
78 |
79 |
80 | class Embedding(nn.Embedding):
81 |
82 | def __init__(self, num_embeddings: int, embedding_dim: int, **kwargs):
83 |
84 | super().__init__(num_embeddings, embedding_dim, padding_idx = PADDING_TOKEN, **kwargs)
85 | nn.init.normal_(self.weight, mean = 0.0, std = INIT_STDDEV)
86 |
87 |
--------------------------------------------------------------------------------
/dimgpt/training/model.py:
--------------------------------------------------------------------------------
1 | import torch, math
2 | from torch import nn
3 | from flash_attn import flash_attn_func
4 |
5 | from dimgpt.training.layers import *
6 | from dimgpt.settings import *
7 | from dimgpt.training.rope import *
8 |
9 |
10 | class AttentionBlock(Module):
11 |
12 | def __init__(self, **kwargs):
13 |
14 | super().__init__(**kwargs)
15 |
16 | self.query = Linear(EMBEDDING_DIM, NUM_HEADS * HEAD_DIM)
17 | self.key = Linear(EMBEDDING_DIM, NUM_GROUPED_HEADS * HEAD_DIM)
18 | self.value = Linear(EMBEDDING_DIM, NUM_GROUPED_HEADS * HEAD_DIM)
19 |
20 | self.projection = Linear(NUM_HEADS * HEAD_DIM, EMBEDDING_DIM)
21 | nn.init.normal_(self.projection.weight, mean = 0.0, std = INIT_STDDEV / math.sqrt(2 * NUM_BLOCKS))
22 |
23 | self.residual_dropout = nn.Dropout(DROPOUT)
24 |
25 |
26 | def forward(self, x: torch.Tensor, rope_frequencies: torch.Tensor) -> torch.Tensor:
27 |
28 | batch_size, context_size, _ = x.shape
29 |
30 | q = self.query(x)
31 | k = self.key(x)
32 | v = self.value(x)
33 |
34 | q = q.view(batch_size, context_size, NUM_HEADS, HEAD_DIM)
35 | k = k.view(batch_size, context_size, NUM_GROUPED_HEADS, HEAD_DIM)
36 | v = v.view(batch_size, context_size, NUM_GROUPED_HEADS, HEAD_DIM)
37 |
38 | q, k = rotary_position_embedding(q, k, rope_frequencies)
39 |
40 | k = torch.repeat_interleave(k, repeats = NUM_HEADS // NUM_GROUPED_HEADS, dim = 2)
41 | v = torch.repeat_interleave(v, repeats = NUM_HEADS // NUM_GROUPED_HEADS, dim = 2)
42 |
43 | x = flash_attn_func(q, k, v, dropout_p = DROPOUT if self.training else 0, causal = True, window_size = (WINDOW_SIZE, 0))
44 |
45 | x = x.view(batch_size, context_size, NUM_HEADS * HEAD_DIM)
46 |
47 | return self.residual_dropout(self.projection(x))
48 |
49 |
50 | class FeedForward(Module):
51 |
52 | def __init__(self, **kwargs):
53 |
54 | super().__init__(**kwargs)
55 |
56 | self.linear_1 = Linear(EMBEDDING_DIM, FFN_DIM)
57 | self.linear_2 = Linear(EMBEDDING_DIM, FFN_DIM)
58 | self.linear_3 = Linear(FFN_DIM, EMBEDDING_DIM)
59 | self.activation = nn.SiLU()
60 | self.dropout = nn.Dropout(DROPOUT)
61 |
62 |
63 | def forward(self, x: torch.Tensor) -> torch.Tensor:
64 |
65 | x = self.activation(self.linear_1(x)) * self.linear_2(x)
66 | x = self.dropout(self.linear_3(x))
67 |
68 | return x
69 |
70 |
71 | # Model block
72 | class TransformerBlock(Module):
73 |
74 | def __init__(self, **kwargs):
75 |
76 | super().__init__(**kwargs)
77 |
78 | self.norm_1 = LayerNorm(EMBEDDING_DIM)
79 | self.attention = AttentionBlock()
80 | self.norm_2 = LayerNorm(EMBEDDING_DIM)
81 | self.feed_forward = FeedForward()
82 |
83 |
84 | def forward(self, x: torch.Tensor, rope_frequencies: torch.Tensor) -> torch.Tensor:
85 |
86 | x = x + self.attention(self.norm_1(x), rope_frequencies)
87 | x = x + self.feed_forward(self.norm_2(x))
88 |
89 | return x
90 |
91 |
92 | # Model
93 | class Model(Module):
94 |
95 | def __init__(self, **kwargs):
96 |
97 | super().__init__(**kwargs)
98 |
99 | self.token_embedding = Embedding(VOCAB_SIZE, EMBEDDING_DIM)
100 | self.rope_frequencies = create_rope_frequencies(HEAD_DIM, MAX_CONTEXT)
101 | self.init_dropout = nn.Dropout(DROPOUT)
102 | self.blocks = nn.ModuleList([TransformerBlock() for _ in range(NUM_BLOCKS)])
103 | self.final_norm = LayerNorm(EMBEDDING_DIM)
104 | self.final_linear = Linear(EMBEDDING_DIM, VOCAB_SIZE)
105 | self.token_embedding.weight = self.final_linear.weight
106 |
107 |
108 | def forward(self, input: torch.Tensor, only_last: bool = False) -> torch.Tensor:
109 |
110 | if input.shape[1] > MAX_CONTEXT:
111 | input = input[:, -MAX_CONTEXT:]
112 |
113 | rope_frequencies = self.rope_frequencies[:input.shape[1]]
114 | rope_frequencies = rope_frequencies[None, :, None, :]
115 |
116 | x = self.token_embedding(input)
117 | x = self.init_dropout(x)
118 |
119 | for block in self.blocks:
120 | x = block(x, rope_frequencies)
121 |
122 | x = self.final_norm(x)
123 |
124 | if only_last:
125 | return self.final_linear(x[:, -1])
126 |
127 | return self.final_linear(x)
128 |
--------------------------------------------------------------------------------
/dimgpt/training/optimizer.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from torch import nn
3 |
4 | from dimgpt.settings import *
5 |
6 |
7 | class AdamW(torch.optim.AdamW):
8 |
9 | def __init__(self, params: list[nn.Parameter], learning_rate: float, **kwargs):
10 |
11 | decay_params = [p for p in params if p.requires_grad and p.dim() >= 2]
12 | other_params = [p for p in params if p.requires_grad and p.dim() < 2]
13 |
14 | groups = [
15 | {'params': decay_params, 'weight_decay': WEIGHT_DECAY},
16 | {'params': other_params, 'weight_decay': 0.0}
17 | ]
18 |
19 | super().__init__(
20 | groups,
21 | lr = learning_rate,
22 | betas = (BETA_1, BETA_2),
23 | eps = EPSILON,
24 | fused = GPU_ENABLED,
25 | **kwargs
26 | )
27 |
--------------------------------------------------------------------------------
/dimgpt/training/rope.py:
--------------------------------------------------------------------------------
1 | import torch
2 |
3 | from dimgpt.settings import *
4 |
5 |
6 | def create_rope_frequencies(dim: int, max_length: int, theta: float = ROPE_THETA) -> torch.Tensor:
7 |
8 | frequencies = 1.0 / (theta ** (torch.arange(0, dim, 2, device = DEVICE)[:(dim // 2)].float() / dim))
9 | t = torch.arange(max_length, device = DEVICE)
10 | frequencies = torch.outer(t, frequencies).float()
11 |
12 | return torch.polar(torch.ones_like(frequencies, device = DEVICE), frequencies)
13 |
14 |
15 | def rotary_position_embedding(q: torch.Tensor, k: torch.Tensor, frequencies: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
16 |
17 | q_complex = torch.view_as_complex(q.float().reshape(*q.shape[:-1], -1, 2))
18 | k_complex = torch.view_as_complex(k.float().reshape(*k.shape[:-1], -1, 2))
19 |
20 | q_out = torch.view_as_real(q_complex * frequencies).flatten(3)
21 | k_out = torch.view_as_real(k_complex * frequencies).flatten(3)
22 |
23 | return q_out.type_as(q), k_out.type_as(k)
--------------------------------------------------------------------------------
/dimgpt/training/trainer.py:
--------------------------------------------------------------------------------
1 | import os, pickle, math, time
2 | import torch
3 | from torch import nn
4 |
5 | from dimgpt.settings import *
6 | from dimgpt.training.datasets import Dataset
7 | from dimgpt.training.model import Model
8 | from dimgpt.training.optimizer import AdamW
9 |
10 |
11 | class Trainer():
12 |
13 | def __init__(self, model: Model, dataset: Dataset):
14 |
15 | self.model = model
16 | model.train()
17 |
18 | self.dataset = dataset
19 |
20 | self.time = None
21 | self.step = 0
22 | self.tokens = 0
23 | self.epochs = 0.0
24 | self.learning_rate = 0.0
25 | self.loss = 0.0
26 | self.accuracy = 0.0
27 | self.val_loss = 0.0
28 | self.val_accuracy = 0.0
29 | self.loss_ema = None
30 | self.accuracy_ema = None
31 | self.best_val_loss = float('inf')
32 |
33 | self.optimizer = AdamW(self.model.parameters(), self.learning_rate)
34 |
35 | self.metrics_history = {
36 | 'time': [],
37 | 'step': [],
38 | 'tokens': [],
39 | 'epochs': [],
40 | 'loss': [],
41 | 'accuracy': [],
42 | 'val_loss': [],
43 | 'val_accuracy': []
44 | }
45 |
46 |
47 | # Save the models
48 | def save_model(self, path: str) -> None:
49 |
50 | if not os.path.exists(path):
51 | os.makedirs(path)
52 |
53 | torch.save(self.model.state_dict(), os.path.join(path, 'model.pt'))
54 | torch.save(self.optimizer.state_dict(), os.path.join(path, 'optimizer.pt'))
55 |
56 |
57 | # Load the models
58 | def load_model(self, path) -> None:
59 |
60 | if not os.path.exists(path):
61 | return
62 |
63 | self.model.load_state_dict(torch.load(os.path.join(path, 'model.pt'), map_location = DEVICE))
64 | self.optimizer.load_state_dict(torch.load(os.path.join(path, 'optimizer.pt'), map_location = DEVICE))
65 |
66 |
67 | # Find previous session
68 | def find_previous_session(self) -> None:
69 |
70 | if os.path.exists(os.path.join(OUTPUT_DIR, 'last')):
71 | self.load_model(os.path.join(OUTPUT_DIR, 'last'))
72 |
73 | if os.path.exists(os.path.join(OUTPUT_DIR, 'metrics.pkl')):
74 | self.load_metrics()
75 |
76 |
77 | # Print
78 | def print(self) -> None:
79 |
80 | print(f'Epochs: {self.epochs:.4f} | Steps: {self.step:,} | Tokens: {self.tokens:,} | LR: {self.learning_rate:.5f} || ' \
81 | f'Loss: {self.loss_ema:.5f} | Accuracy: {self.accuracy_ema * 100.0:.4f} % | ' \
82 | f'Val loss: {self.val_loss:.5f} | Val accuracy: {self.val_accuracy * 100.0:.4f} % ', end = '\r')
83 |
84 |
85 | # Save metrics
86 | def save_metrics(self) -> None:
87 |
88 | if self.time is None:
89 | self.metrics_history["time"].append(0.0)
90 | else:
91 | self.metrics_history["time"].append(self.metrics_history["time"][-1] + (time.time() - self.time))
92 |
93 | self.time = time.time()
94 |
95 | self.metrics_history["step"].append(self.step)
96 | self.metrics_history["tokens"].append(self.tokens)
97 | self.metrics_history["epochs"].append(self.epochs)
98 | self.metrics_history["loss"].append(self.loss_ema)
99 | self.metrics_history["accuracy"].append(self.accuracy_ema)
100 | self.metrics_history["val_loss"].append(self.val_loss)
101 | self.metrics_history["val_accuracy"].append(self.val_accuracy)
102 |
103 | if not os.path.exists(OUTPUT_DIR):
104 | os.makedirs(OUTPUT_DIR)
105 |
106 | pickle.dump(self.metrics_history, open(os.path.join(OUTPUT_DIR, 'metrics.pkl'), 'wb'))
107 |
108 |
109 | # Load metrics
110 | def load_metrics(self) -> None:
111 |
112 | self.metrics_history = pickle.load(open(os.path.join(OUTPUT_DIR, 'metrics.pkl'), 'rb'))
113 |
114 | self.step = self.metrics_history["step"][-1]
115 | self.tokens = self.metrics_history["tokens"][-1]
116 | self.epochs = self.metrics_history["epochs"][-1]
117 | self.loss_ema = self.metrics_history["loss"][-1]
118 | self.accuracy_ema = self.metrics_history["accuracy"][-1]
119 | self.val_loss = self.metrics_history["val_loss"][-1]
120 | self.val_accuracy = self.metrics_history["val_accuracy"][-1]
121 | self.best_val_loss = min(self.metrics_history["val_loss"])
122 | self.time = time.time()
123 |
124 |
125 | # Update learning rate
126 | def update_learning_rate(self) -> None:
127 |
128 | if self.step < WARMUP_STEPS:
129 | ratio = self.step / WARMUP_STEPS
130 | self.learning_rate = MAX_LEARNING_RATE * ratio
131 | elif self.step < WARMUP_STEPS + DECAY_STEPS:
132 | ratio = (self.step - WARMUP_STEPS) / DECAY_STEPS
133 | ratio = 0.5 * (1.0 + math.cos(math.pi * ratio))
134 | self.learning_rate = ratio * (MAX_LEARNING_RATE - MIN_LEARNING_RATE) + MIN_LEARNING_RATE
135 | else:
136 | self.learning_rate = MIN_LEARNING_RATE
137 |
138 | for g in self.optimizer.param_groups:
139 | g['lr'] = self.learning_rate
140 |
141 |
142 | def apply_ema(self, value_1: float, value_2: float) -> float:
143 |
144 | if value_1 is None:
145 | return value_2
146 |
147 | return value_1 * METRICS_BETA + value_2 * (1.0 - METRICS_BETA)
148 |
149 |
150 | # Train the model
151 | def train(self) -> None:
152 |
153 | # Training loop
154 | while True:
155 |
156 | # Update step
157 | self.step += 1
158 | self.tokens += (MAX_CONTEXT + 1) * BATCH_SIZE * NUM_ACCUMULATIONS
159 | self.epochs += ((MAX_CONTEXT + 1) * BATCH_SIZE * NUM_ACCUMULATIONS) / self.dataset.train_size()
160 |
161 | # Update learning rate
162 | self.update_learning_rate()
163 |
164 | # ----- Training ----- #
165 |
166 | self.model.train()
167 | self.loss = 0.0
168 | self.accuracy = 0.0
169 |
170 | 			# First load data (asynchronous)
171 | x, y, strength = self.dataset.next_train()
172 |
173 | for i in range(NUM_ACCUMULATIONS):
174 |
175 | with CONTEXT:
176 |
177 | # Forward pass
178 | prediction = self.model(x)
179 |
180 | # Loss
181 | loss = nn.functional.cross_entropy(
182 | input = prediction.reshape(-1, prediction.shape[-1]),
183 | target = y.reshape(-1),
184 | ignore_index = PADDING_TOKEN,
185 | reduction = 'none'
186 | )
187 | loss = ((loss * strength.reshape(-1)).sum() / (strength.sum() + 1e-8)) / NUM_ACCUMULATIONS
188 | self.loss += loss.item()
189 |
190 | # Accuracy
191 | accuracy = (prediction.argmax(dim = 2) == y).to(dtype = torch.float32)
192 | self.accuracy += (((accuracy * strength).sum() / (strength.sum() + 1e-8)) / NUM_ACCUMULATIONS).item()
193 |
194 | 				# Next load data (asynchronous)
195 | if i < NUM_ACCUMULATIONS - 1:
196 | x, y, strength = self.dataset.next_train()
197 |
198 | # Backward pass
199 | loss.backward()
200 |
201 | # Update weights
202 | self.model.clean_nan()
203 | self.model.clip_gradient(CLIP_GRADIENT)
204 | self.optimizer.step()
205 | self.optimizer.zero_grad(set_to_none = True)
206 |
207 | # Update ema values
208 | self.loss_ema = self.apply_ema(self.loss_ema, self.loss)
209 | self.accuracy_ema = self.apply_ema(self.accuracy_ema, self.accuracy)
210 |
211 | 			# ----- Validation ----- #
212 |
213 | if self.step % VAL_INTERVAL == 0:
214 |
215 | self.model.eval()
216 |
217 | with torch.no_grad():
218 |
219 | self.val_loss = 0.0
220 | self.val_accuracy = 0.0
221 |
222 | for _ in range(NUM_ACCUMULATIONS):
223 |
224 | # Load data
225 | x, y, strength = self.dataset.next_val()
226 |
227 | with CONTEXT:
228 |
229 | # Forward pass
230 | prediction = self.model(x)
231 |
232 | # Loss
233 | loss = nn.functional.cross_entropy(
234 | input = prediction.reshape(-1, prediction.shape[-1]),
235 | target = y.reshape(-1),
236 | ignore_index = PADDING_TOKEN,
237 | reduction = 'none'
238 | )
239 | self.val_loss += (((loss * strength.reshape(-1)).sum() / (strength.sum() + 1e-8)) / NUM_ACCUMULATIONS).item()
240 |
241 | # Accuracy
242 | accuracy = (prediction.argmax(dim = 2) == y).to(dtype = torch.float32)
243 | self.val_accuracy += (((accuracy * strength).sum() / (strength.sum() + 1e-8)) / NUM_ACCUMULATIONS).item()
244 |
245 | # Save
246 | self.save_metrics()
247 | self.save_model(os.path.join(OUTPUT_DIR, 'last'))
248 |
249 | # Save best
250 | if self.val_loss <= self.best_val_loss:
251 | self.best_val_loss = self.val_loss
252 | self.save_model(os.path.join(OUTPUT_DIR, 'best'))
253 |
254 | # -------------------- #
255 |
256 | # Print
257 | self.print()
258 |
--------------------------------------------------------------------------------
/dimgpt/utils.py:
--------------------------------------------------------------------------------
1 | import random, platform, psutil, time
2 | import datetime as dt
3 | import numpy as np
4 | import torch
5 | from sys import exit
6 |
7 | from dimgpt.settings import *
8 |
9 |
10 | # Reset the random seed
11 | def reset_rand() -> None:
12 |
13 | now = dt.datetime.now()
14 | milliseconds_since_midnight = (now.hour * 3600 + now.minute * 60 + now.second) * 1000 + now.microsecond // 1000
15 | random.seed(milliseconds_since_midnight)
16 | np.random.seed(milliseconds_since_midnight)
17 | torch.manual_seed(milliseconds_since_midnight)
18 |
19 |
20 | # Check if there is a GPU available
21 | def check_gpu() -> None:
22 |
23 | if GPU_ENABLED:
24 | torch.cuda.empty_cache()
25 | nb_gpu = torch.cuda.device_count()
26 | memory = torch.cuda.mem_get_info()[0] / 1024 ** 3
27 | 		print(f'{nb_gpu} GPU{"s are" if nb_gpu > 1 else " is"} available! Using GPU: "{torch.cuda.get_device_name()}" ({memory:.2f} GB available)')
28 |
29 | else:
30 | memory = psutil.virtual_memory().available / 1024 ** 3
31 | print(f'No GPU available... Using CPU: "{platform.processor()}" ({memory:.2f} GB available)')
32 |
33 |
34 | def save_text_array(array: list[str], path: str) -> None:
35 |
36 | with open(path, 'w', encoding = 'utf-8') as f:
37 |
38 | f.truncate(0)
39 |
40 | for i in range(len(array)):
41 |
42 | f.write(array[i])
43 |
44 | if i != len(array) - 1:
45 | f.write('\n')
46 |
47 |
48 | def load_text_array(path: str) -> list[str]:
49 |
50 | with open(path, 'r', encoding = 'utf-8') as f:
51 |
52 | return f.read().split('\n')
53 |
54 |
55 | def split_keep(text: str, delimiter: str) -> list[str]:
56 |
57 | words = text.split(delimiter)
58 |
59 | temp = []
60 |
61 | for i in range(len(words) - 1):
62 | temp.extend([words[i], delimiter])
63 |
64 | temp.append(words[-1])
65 |
66 | return temp
67 |
68 |
69 | class Timer:
70 |
71 | def __init__(self, wait_steps: int = 0, num_steps: int = 1, exit_on_end: bool = False):
72 |
73 | self.wait_steps = wait_steps
74 | self.num_steps = num_steps
75 | self.exit_on_end = exit_on_end
76 | self.times = [0.0] * num_steps
77 | self.wait_step = 0
78 | self.step = 0
79 |
80 |
81 | def __enter__(self):
82 |
83 | if self.wait_step < self.wait_steps:
84 | return
85 |
86 | self.times[self.step] = time.time()
87 |
88 |
89 | def __exit__(self, exc_type, exc_value, traceback):
90 |
91 | if self.wait_step < self.wait_steps:
92 | self.wait_step += 1
93 | return
94 |
95 | self.times[self.step] = time.time() - self.times[self.step]
96 | self.step += 1
97 |
98 | if self.step >= self.num_steps:
99 |
100 | print(f'\nDuration: {sum(self.times) / self.num_steps:.2f}s')
101 |
102 | if self.exit_on_end:
103 | exit(0)
--------------------------------------------------------------------------------
/models/README.md:
--------------------------------------------------------------------------------
1 | # 🎛️ Trained weights
2 |
3 | The trained weights of the different models are available on [**Google Drive**](https://drive.google.com/drive/folders/1XxKdsR33rt6VTFAF8qwyE3uxulK7gK6m). To use one of them (a loading example is given below), you just need to:
4 |
5 | * Download the `.pt` file of the model you want to use and put it in this folder
6 | * Download the `vocab.txt` file and put it in the `data` folder
7 |
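8 | Once the files are in place, the model can be loaded and sampled from Python. The snippet below is a minimal sketch based on the `testing.ipynb` notebook; the model file name is only an example, use the `.pt` file you actually downloaded:
9 | 
10 | ```py
11 | import torch
12 | 
13 | from dimgpt.training.model import Model
14 | from dimgpt.data.tokenizer import Tokenizer
15 | from dimgpt.testing.sampling import Sampler
16 | from dimgpt.settings import DEVICE
17 | 
18 | # Example path: replace it with the weights you put in this folder
19 | MODEL_PATH = './models/DimensionGPT-0.2B-Chat.pt'
20 | 
21 | # The tokenizer uses the vocab file placed in the data folder
22 | tokenizer = Tokenizer()
23 | 
24 | # Create the model and load the trained weights
25 | model = Model().to(DEVICE)
26 | model.load_state_dict(torch.load(MODEL_PATH, map_location = DEVICE))
27 | 
28 | # Generate a chatbot answer from a prompt
29 | sampler = Sampler(model, tokenizer)
30 | answer = sampler.generate(input = 'Bonjour', max_length = 512, chat_bot = True, temperature = 0.5, top_p = 0.9, no_repeat = 1.0)
31 | print(answer)
32 | ```
33 | 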
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | torch
2 | flash-attn
3 | datasets
4 | tokenizers
5 | unidecode
6 | regex
7 | tqdm
8 | psutil
--------------------------------------------------------------------------------
/resources/misc/accuracy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/accuracy.png
--------------------------------------------------------------------------------
/resources/misc/loss.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/loss.png
--------------------------------------------------------------------------------
/resources/misc/test_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_1.png
--------------------------------------------------------------------------------
/resources/misc/test_10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_10.png
--------------------------------------------------------------------------------
/resources/misc/test_11.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_11.png
--------------------------------------------------------------------------------
/resources/misc/test_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_2.png
--------------------------------------------------------------------------------
/resources/misc/test_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_3.png
--------------------------------------------------------------------------------
/resources/misc/test_4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_4.png
--------------------------------------------------------------------------------
/resources/misc/test_5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_5.png
--------------------------------------------------------------------------------
/resources/misc/test_6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_6.png
--------------------------------------------------------------------------------
/resources/misc/test_7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_7.png
--------------------------------------------------------------------------------
/resources/misc/test_8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_8.png
--------------------------------------------------------------------------------
/resources/misc/test_9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/test_9.png
--------------------------------------------------------------------------------
/resources/misc/thumbnail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/angeluriot/Language_model/f1ea817b1c1711cbda1a4d1f28a535e234b6ece7/resources/misc/thumbnail.png
--------------------------------------------------------------------------------
/testing.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Testing"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### Imports"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": null,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "import os\n",
24 | "\n",
25 | "from dimgpt import utils\n",
26 | "from dimgpt.testing.sampling import *\n",
27 | "from dimgpt.data.tokenizer import *\n",
28 | "from dimgpt.settings import *\n",
29 | "\n",
30 | "utils.reset_rand()"
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {},
36 | "source": [
37 | "### Check GPU"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": null,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "utils.check_gpu()"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "### Tokenizer"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "tokenizer = Tokenizer()\n",
63 | "\n",
64 | "print(f'Vocab size: {len(tokenizer.vocab):,}\\n')\n",
65 | "\n",
66 | "for v in tokenizer.vocab:\n",
67 | "\tprint(f'[{v}]', end = ' ')"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "### Model"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": null,
80 | "metadata": {},
81 | "outputs": [],
82 | "source": [
83 | "MODEL_PATH = './models/DimensionGPT-0.2B-Chat.pt'\n",
84 | "\n",
85 | "model = Model().to(DEVICE)\n",
86 | "model.load_state_dict(torch.load(MODEL_PATH, map_location = DEVICE))\n",
87 | "model.summary()"
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "### Testing"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": null,
100 | "metadata": {},
101 | "outputs": [],
102 | "source": [
103 | "sampler = Sampler(model, tokenizer)"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": null,
109 | "metadata": {},
110 | "outputs": [],
111 | "source": [
112 | "_ = sampler.generate(\n",
113 | "\tinput = \"Bonjour\",\n",
114 | "\tmax_length = 512,\n",
115 | "\tchat_bot = True,\n",
116 | "\ttemperature = 0.5,\n",
117 | "\ttop_p = 0.9,\n",
118 | "\tno_repeat = 1.0,\n",
119 | "\tverbose = True,\n",
120 | "\tmax_print_line_length = 150\n",
121 | ")"
122 | ]
123 | }
124 | ],
125 | "metadata": {
126 | "kernelspec": {
127 | "display_name": "venv",
128 | "language": "python",
129 | "name": "python3"
130 | },
131 | "language_info": {
132 | "codemirror_mode": {
133 | "name": "ipython",
134 | "version": 3
135 | },
136 | "file_extension": ".py",
137 | "mimetype": "text/x-python",
138 | "name": "python",
139 | "nbconvert_exporter": "python",
140 | "pygments_lexer": "ipython3",
141 | "version": "3.11.8"
142 | },
143 | "orig_nbformat": 4
144 | },
145 | "nbformat": 4,
146 | "nbformat_minor": 2
147 | }
148 |
--------------------------------------------------------------------------------
/training.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "attachments": {},
5 | "cell_type": "markdown",
6 | "metadata": {},
7 | "source": [
8 | "# Training"
9 | ]
10 | },
11 | {
12 | "attachments": {},
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "### Imports"
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": null,
22 | "metadata": {},
23 | "outputs": [],
24 | "source": [
25 | "from dimgpt import utils\n",
26 | "from dimgpt.training.datasets import *\n",
27 | "from dimgpt.training.model import Model\n",
28 | "from dimgpt.training.trainer import Trainer\n",
29 | "from dimgpt.data.tokenizer import *\n",
30 | "from dimgpt.settings import *\n",
31 | "\n",
32 | "utils.reset_rand()"
33 | ]
34 | },
35 | {
36 | "attachments": {},
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "### Check GPU"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": null,
46 | "metadata": {},
47 | "outputs": [],
48 | "source": [
49 | "utils.check_gpu()"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "### Tokenizer"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": null,
62 | "metadata": {},
63 | "outputs": [],
64 | "source": [
65 | "tokenizer = Tokenizer()\n",
66 | "\n",
67 | "print(f'Vocab size: {len(tokenizer.vocab):,}\\n')\n",
68 | "\n",
69 | "for v in tokenizer.vocab:\n",
70 | "\tprint(f'[{v}]', end = ' ')"
71 | ]
72 | },
73 | {
74 | "attachments": {},
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "### Dataset"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {},
85 | "outputs": [],
86 | "source": [
87 | "dataset = PretrainingDataset(tokenizer)\n",
88 | "#dataset = FinetuningDataset(tokenizer)"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "metadata": {},
95 | "outputs": [],
96 | "source": [
97 | "x, y, strength = dataset.next_val()\n",
98 | "\n",
99 | "print(f'Batch shape: {tuple(x.shape)}\\n')\n",
100 | "\n",
101 | "print(tokenizer.decode(x[0]))\n",
102 | "\n",
103 | "del x, y, strength"
104 | ]
105 | },
106 | {
107 | "attachments": {},
108 | "cell_type": "markdown",
109 | "metadata": {},
110 | "source": [
111 | "### Model"
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": null,
117 | "metadata": {},
118 | "outputs": [],
119 | "source": [
120 | "model = Model().to(DEVICE)\n",
121 | "model.summary()"
122 | ]
123 | },
124 | {
125 | "attachments": {},
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "### Training"
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": null,
135 | "metadata": {},
136 | "outputs": [],
137 | "source": [
138 | "trainer = Trainer(model, dataset)\n",
139 | "trainer.find_previous_session()\n",
140 | "\n",
141 | "trainer.train()"
142 | ]
143 | }
144 | ],
145 | "metadata": {
146 | "kernelspec": {
147 | "display_name": "venv",
148 | "language": "python",
149 | "name": "python3"
150 | },
151 | "language_info": {
152 | "codemirror_mode": {
153 | "name": "ipython",
154 | "version": 3
155 | },
156 | "file_extension": ".py",
157 | "mimetype": "text/x-python",
158 | "name": "python",
159 | "nbconvert_exporter": "python",
160 | "pygments_lexer": "ipython3",
161 | "version": "3.11.8"
162 | },
163 | "orig_nbformat": 4
164 | },
165 | "nbformat": 4,
166 | "nbformat_minor": 2
167 | }
168 |
--------------------------------------------------------------------------------