├── .gitignore
├── MANIFEST.in
├── README.md
├── bpe_surgery_script.py
├── config.ini
├── demo.ipynb
├── fine_tune_title_generation.py
├── nmatheg
│   ├── __init__.py
│   ├── config.ini
│   ├── configs.py
│   ├── dataset.py
│   ├── datasets.ini
│   ├── models.py
│   ├── ner_utils.py
│   ├── nmatheg.py
│   ├── preprocess_ner.py
│   ├── preprocess_qa.py
│   ├── qa_utils.py
│   ├── tests.py
│   └── utils.py
├── nmatheg_logo.PNG
├── playground.ipynb
├── predict.py
├── requirements.txt
├── script.py
└── setup.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .vscode/
2 | .ipynb_checkpoints/
3 | # Byte-compiled / optimized / DLL files
4 | __pycache__/
5 | *.py[cod]
6 | *$py.class
7 |
8 | # C extensions
9 | *.so
10 |
11 | .Python
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | wheels/
24 | share/python-wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .nox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | *.py,cover
51 | .hypothesis/
52 | .pytest_cache/
53 | cover/
54 |
55 | # Translations
56 | *.mo
57 | *.pot
58 |
59 | # Django stuff:
60 | *.log
61 | local_settings.py
62 | db.sqlite3
63 | db.sqlite3-journal
64 |
65 | # Flask stuff:
66 | instance/
67 | .webassets-cache
68 |
69 | # Scrapy stuff:
70 | .scrapy
71 |
72 | # Sphinx documentation
73 | docs/_build/
74 |
75 | # PyBuilder
76 | .pybuilder/
77 | target/
78 |
79 | # Jupyter Notebook
80 | .ipynb_checkpoints
81 |
82 | # IPython
83 | profile_default/
84 | ipython_config.py
85 |
86 | # pyenv
87 | # For a library or package, you might want to ignore these files since the code is
88 | # intended to run in multiple environments; otherwise, check them in:
89 | # .python-version
90 |
91 | # pipenv
92 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
93 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
94 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
95 | # install all needed dependencies.
96 | #Pipfile.lock
97 |
98 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
99 | __pypackages__/
100 |
101 | # Celery stuff
102 | celerybeat-schedule
103 | celerybeat.pid
104 |
105 | # SageMath parsed files
106 | *.sage.py
107 |
108 | # Environments
109 | .env
110 | .venv
111 | env/
112 | venv/
113 | ENV/
114 | env.bak/
115 | venv.bak/
116 |
117 | # Spyder project settings
118 | .spyderproject
119 | .spyproject
120 |
121 | # Rope project settings
122 | .ropeproject
123 |
124 | # mkdocs documentation
125 | /site
126 |
127 | # mypy
128 | .mypy_cache/
129 | .dmypy.json
130 | dmypy.json
131 |
132 | # Pyre type checker
133 | .pyre/
134 |
135 | # pytype static type analyzer
136 | .pytype/
137 |
138 | # Cython debug symbols
139 | cython_debug/
140 |
141 | # data files
142 | *data*.txt
143 | data/
144 | ckpts/
145 | tmp
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include nmatheg/datasets.ini
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 | # nmatheg
8 |
9 | nmatheg `نماذج` is an easy strategy for training Arabic NLP models on Hugging Face datasets. Just specify the name of the dataset, the preprocessing, the tokenization and the training procedure in the config file to train an NLP model for that task.
10 |
11 | ## Install
12 |
13 | ```pip install nmatheg```
14 |
15 | ## Configuration
16 |
17 | Set up a config file for the training strategy.
18 |
19 | ``` ini
20 | [dataset]
21 | dataset_name = ajgt_twitter_ar
22 |
23 | [preprocessing]
24 | segment = False
25 | remove_special_chars = False
26 | remove_english = False
27 | normalize = False
28 | remove_diacritics = False
29 | excluded_chars = []
30 | remove_tatweel = False
31 | remove_html_elements = False
32 | remove_links = False
33 | remove_twitter_meta = False
34 | remove_long_words = False
35 | remove_repeated_chars = False
36 |
37 | [tokenization]
38 | tokenizer_name = WordTokenizer
39 | vocab_size = 1000
40 | max_tokens = 128
41 |
42 | [model]
43 | model_name = rnn
44 |
45 | [log]
46 | print_every = 10
47 |
48 | [train]
49 | save_dir = .
50 | epochs = 10
51 | batch_size = 256
52 | ```
53 |
54 | ### Main Sections
55 |
56 | - `dataset` describes the dataset and the task type. See the Tasks section below for the currently supported tasks.
57 | - `preprocessing` a set of cleaning functions that mainly use our library [tnkeeh](https://github.com/ARBML/tnkeeh).
58 | - `tokenization` describes the tokenizer used for encoding the dataset. It uses our library [tkseem](https://github.com/ARBML/tkseem).
59 | - `train` the training parameters like the number of epochs and the batch size.
60 |
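Since the file is standard INI, it can be sanity-checked with Python's built-in `configparser` before training (a minimal sketch, independent of nmatheg):

```python
import configparser

config = configparser.ConfigParser()
config.read('config.ini')

# configparser returns every value as a string; cast where needed
print(config['dataset']['dataset_name'])            # ajgt_twitter_ar
print(config['tokenization'].getint('vocab_size'))  # 1000
print(config['train'].getint('epochs'))             # 10
```
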
61 | ## Usage
62 |
63 | ### Config Files
64 | ```python
65 | import nmatheg as nm
66 | strategy = nm.TrainStrategy('config.ini')
67 | strategy.start()
68 | ```
69 | ### Benchmarking on multiple datasets and models
70 | ```python
71 | import nmatheg as nm
72 | strategy = nm.TrainStrategy(
73 | datasets = 'arsentd_lev,caner,arcd',
74 | models = 'qarib/bert-base-qarib,aubmindlab/bert-base-arabertv01',
75 | mode = 'finetune',
76 | runs = 5,
77 | lr = 1e-4,
78 | epochs = 1,
79 | batch_size = 8,
80 | max_tokens = 128,
81 | max_train_samples = 1024
82 | )
83 | strategy.start()
84 | ```
85 |
86 | ## Datasets
87 | We support Hugging Face datasets for Arabic. You can find the supported datasets [here](https://github.com/ARBML/nmatheg/blob/main/nmatheg/datasets.ini). A minimal loading example follows the table below.
88 |
89 | | Dataset | Description |
90 | | --- | --- |
91 | | [ajgt_twitter_ar](https://huggingface.co/datasets/ajgt_twitter_ar) | The Arabic Jordanian General Tweets (AJGT) Corpus consists of 1,800 tweets annotated as positive or negative, written in Modern Standard Arabic (MSA) or Jordanian dialect. |
92 | | [metrec](https://huggingface.co/datasets/metrec) | The dataset contains verses and their corresponding meter classes. Meter classes are represented as numbers from 0 to 13. The dataset can be highly useful for further research on Arabic poem meter classification. The train split contains 47,124 records and the test split contains 8,316 records. |
93 | |[labr](https://huggingface.co/datasets/labr) |This dataset contains over 63,000 book reviews in Arabic. It is the largest sentiment analysis dataset for Arabic to date. The book reviews were harvested from the Goodreads website during the month of March 2013. Each book review comes with the Goodreads review id, the user id, the book id, the rating (1 to 5) and the text of the review. |
94 | |[ar_res_reviews](https://huggingface.co/datasets/ar_res_reviews)|A dataset of 8,364 Arabic restaurant reviews from qaym.com for sentiment analysis.|
95 | |[arsentd_lev](https://huggingface.co/datasets/arsentd_lev)|The Arabic Sentiment Twitter Dataset for Levantine dialect (ArSenTD-LEV) contains 4,000 tweets written in Arabic and equally retrieved from Jordan, Lebanon, Palestine and Syria.|
96 | |[oclar](https://huggingface.co/datasets/oclar)|The OCLAR corpus (Marwan et al., 2019) gathers Arabic customer reviews from the Zomato [website](https://www.zomato.com/lebanon) covering a wide range of domains, including restaurants, hotels, hospitals, local shops, etc. The corpus contains 3,916 reviews on a 5-star rating scale. For this research purpose, the positive class covers ratings from 3 to 5 stars (3,465 reviews), and the negative class covers ratings of 1 and 2 stars (about 451 reviews).|
97 | |[emotone_ar](https://huggingface.co/datasets/emotone_ar)|A dataset of 10,065 Arabic tweets for emotion detection.|
98 | |[hard](https://huggingface.co/datasets/hard)|This dataset contains 93,700 hotel reviews in Arabic. The reviews were collected from the Booking.com website during June/July 2016 and are expressed in Modern Standard Arabic as well as dialectal Arabic.|
99 | |[caner](https://huggingface.co/datasets/caner)|The Classical Arabic Named Entity Recognition corpus is a new corpus of tagged data that can be useful for handling the issues in recognition of Arabic named entities.|
100 | |[arcd](https://huggingface.co/datasets/arcd)|The Arabic Reading Comprehension Dataset (ARCD) is composed of 1,395 questions posed by crowdworkers on Wikipedia articles.|
101 | |[mlqa](https://huggingface.co/datasets/mlqa)|MLQA contains QA instances in 7 languages: English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese.|
102 | |[xnli](https://huggingface.co/datasets/xnli)|XNLI is a subset of a few thousand examples from MNLI which has been translated into 14 different languages (some low-ish resource).|
103 | |[tatoeba_mt](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt)|The Tatoeba Translation Challenge is a multilingual dataset of machine translation benchmarks derived from user-contributed translations collected by Tatoeba.org and provided as a parallel corpus from OPUS.|
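
These are standard Hugging Face datasets, so any entry in the table can be inspected directly with the `datasets` library (a minimal sketch, outside of nmatheg):

```python
from datasets import load_dataset

# ajgt_twitter_ar: 1,800 tweets labeled positive or negative
ds = load_dataset('ajgt_twitter_ar')
print(ds['train'][0])
```
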
104 | ## Tasks
105 |
106 | Currently we support text classification, named entity recognition, question answering, machine translation and natural language inference.
107 |
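As shown in the demo notebook below, datasets belonging to different tasks can be mixed in a single run; here is a minimal sketch reusing the benchmarking example above:

```python
import nmatheg as nm

# arsentd_lev -> text classification, caner -> NER, arcd -> question answering
strategy = nm.TrainStrategy(
    datasets = 'arsentd_lev,caner,arcd',
    models = 'qarib/bert-base-qarib',
    epochs = 1,
    batch_size = 8
)
strategy.start()
```
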
108 | ## Demo
109 | Check this [colab notebook](https://colab.research.google.com/github/ARBML/nmatheg/blob/main/demo.ipynb) for a quick demo.
110 |
--------------------------------------------------------------------------------
/bpe_surgery_script.py:
--------------------------------------------------------------------------------
1 | import nmatheg as nm
2 | 
3 | # Benchmark the birnn model with three BPE tokenizer variants
4 | # (BPE, MaT-BPE, Seg-BPE) across several vocabulary sizes on
5 | # classification (ajgt_twitter_ar), NER (caner) and NLI (xnli).
6 | strategy = nm.TrainStrategy(
7 |     datasets = 'ajgt_twitter_ar,caner,xnli',
8 |     models = 'birnn',
9 |     tokenizers = 'BPE,MaT-BPE,Seg-BPE',
10 |     vocab_sizes = '250,500,1000,5000,10000',
11 |     runs = 10,
12 |     lr = 1e-3,
13 |     epochs = 20,
14 |     batch_size = 128,
15 |     max_tokens = 128,
16 |     mode = 'pretrain'
17 | )
18 | output = strategy.start()
--------------------------------------------------------------------------------
/config.ini:
--------------------------------------------------------------------------------
1 |
2 | [dataset]
3 | dataset_name = ajgt_twitter_ar
4 | task = classification
5 |
6 | [preprocessing]
7 | segment = False
8 | remove_special_chars = False
9 | remove_english = False
10 | normalize = False
11 | remove_diacritics = False
12 | excluded_chars = []
13 | remove_tatweel = False
14 | remove_html_elements = False
15 | remove_links = False
16 | remove_twitter_meta = False
17 | remove_long_words = False
18 | remove_repeated_chars = False
19 |
20 | [tokenization]
21 | tokenizer_name = WordTokenizer
22 | vocab_size = 1000
23 | max_tokens = 128
24 |
25 | [model]
26 | model_name = rnn,aubmindlab/bert-base-arabertv01
27 |
28 | [log]
29 | print_every = 10
30 |
31 | [train]
32 | save_dir = .
33 | epochs = 10
34 | batch_size = 256
35 |
--------------------------------------------------------------------------------
/demo.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "id": "Yr3ZFtPMr22x"
8 | },
9 | "outputs": [],
10 | "source": [
11 | "%%capture\n",
12 | "!pip install git+https://github.com/ARBML/nmatheg"
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": null,
18 | "metadata": {
19 | "id": "g9liZeykvsfe"
20 | },
21 | "outputs": [],
22 | "source": [
23 | "import nmatheg as nm\n",
24 | "strategy = nm.TrainStrategy(datasets = 'arsentd_lev,ajgt_twitter_ar,ar_res_reviews,arcd,caner',\n",
25 | " models = 'qarib/bert-base-qarib,\\\n",
26 | " aubmindlab/bert-base-arabertv01,\\\n",
27 | " CAMeL-Lab/bert-base-arabic-camelbert-da,\\\n",
28 | " UBC-NLP/MARBERT,\\\n",
29 | " bashar-talafha/multi-dialect-bert-base-arabic',\n",
30 | " epochs = 5,\n",
31 | " lr = 1e-3,\n",
32 | " batch_size = 8,)\n",
33 | "strategy.start()"
34 | ]
35 | }
36 | ],
37 | "metadata": {
38 | "accelerator": "GPU",
39 | "colab": {
40 | "name": "demo.ipynb",
41 | "provenance": []
42 | },
43 | "kernelspec": {
44 | "display_name": "Python 3.9.5 64-bit",
45 | "language": "python",
46 | "name": "python3"
47 | },
48 | "language_info": {
49 | "codemirror_mode": {
50 | "name": "ipython",
51 | "version": 3
52 | },
53 | "file_extension": ".py",
54 | "mimetype": "text/x-python",
55 | "name": "python",
56 | "nbconvert_exporter": "python",
57 | "pygments_lexer": "ipython3",
58 | "version": "3.9.5"
59 | },
60 | "orig_nbformat": 2,
61 | "vscode": {
62 | "interpreter": {
63 | "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
64 | }
65 | }
66 | },
67 | "nbformat": 4,
68 | "nbformat_minor": 2
69 | }
--------------------------------------------------------------------------------
1584 | "state": {
1585 | "_dom_classes": [],
1586 | "_model_module": "@jupyter-widgets/controls",
1587 | "_model_module_version": "1.5.0",
1588 | "_model_name": "HBoxModel",
1589 | "_view_count": null,
1590 | "_view_module": "@jupyter-widgets/controls",
1591 | "_view_module_version": "1.5.0",
1592 | "_view_name": "HBoxView",
1593 | "box_style": "",
1594 | "children": [
1595 | "IPY_MODEL_b8fda7b5b9cc46f0865fe124fdaed067",
1596 | "IPY_MODEL_169dd26ec2c34af2b23624f81afcebc3"
1597 | ],
1598 | "layout": "IPY_MODEL_cce331dde52f402083d93a66f5b30b8f"
1599 | }
1600 | },
1601 | "cce331dde52f402083d93a66f5b30b8f": {
1602 | "model_module": "@jupyter-widgets/base",
1603 | "model_name": "LayoutModel",
1604 | "state": {
1605 | "_model_module": "@jupyter-widgets/base",
1606 | "_model_module_version": "1.2.0",
1607 | "_model_name": "LayoutModel",
1608 | "_view_count": null,
1609 | "_view_module": "@jupyter-widgets/base",
1610 | "_view_module_version": "1.2.0",
1611 | "_view_name": "LayoutView",
1612 | "align_content": null,
1613 | "align_items": null,
1614 | "align_self": null,
1615 | "border": null,
1616 | "bottom": null,
1617 | "display": null,
1618 | "flex": null,
1619 | "flex_flow": null,
1620 | "grid_area": null,
1621 | "grid_auto_columns": null,
1622 | "grid_auto_flow": null,
1623 | "grid_auto_rows": null,
1624 | "grid_column": null,
1625 | "grid_gap": null,
1626 | "grid_row": null,
1627 | "grid_template_areas": null,
1628 | "grid_template_columns": null,
1629 | "grid_template_rows": null,
1630 | "height": null,
1631 | "justify_content": null,
1632 | "justify_items": null,
1633 | "left": null,
1634 | "margin": null,
1635 | "max_height": null,
1636 | "max_width": null,
1637 | "min_height": null,
1638 | "min_width": null,
1639 | "object_fit": null,
1640 | "object_position": null,
1641 | "order": null,
1642 | "overflow": null,
1643 | "overflow_x": null,
1644 | "overflow_y": null,
1645 | "padding": null,
1646 | "right": null,
1647 | "top": null,
1648 | "visibility": null,
1649 | "width": null
1650 | }
1651 | },
1652 | "cf79d2f4a1d84c2bae5b07fb2839f0ab": {
1653 | "model_module": "@jupyter-widgets/controls",
1654 | "model_name": "HBoxModel",
1655 | "state": {
1656 | "_dom_classes": [],
1657 | "_model_module": "@jupyter-widgets/controls",
1658 | "_model_module_version": "1.5.0",
1659 | "_model_name": "HBoxModel",
1660 | "_view_count": null,
1661 | "_view_module": "@jupyter-widgets/controls",
1662 | "_view_module_version": "1.5.0",
1663 | "_view_name": "HBoxView",
1664 | "box_style": "",
1665 | "children": [
1666 | "IPY_MODEL_8b9043ffe8fb4313869fd9ca78021a93",
1667 | "IPY_MODEL_93d29cab94e84cae9b7a6f8ecccacba6"
1668 | ],
1669 | "layout": "IPY_MODEL_5f26156efe6d49379ee39e75bae41840"
1670 | }
1671 | },
1672 | "d79e114bea844160bc941d81d4581b46": {
1673 | "model_module": "@jupyter-widgets/base",
1674 | "model_name": "LayoutModel",
1675 | "state": {
1676 | "_model_module": "@jupyter-widgets/base",
1677 | "_model_module_version": "1.2.0",
1678 | "_model_name": "LayoutModel",
1679 | "_view_count": null,
1680 | "_view_module": "@jupyter-widgets/base",
1681 | "_view_module_version": "1.2.0",
1682 | "_view_name": "LayoutView",
1683 | "align_content": null,
1684 | "align_items": null,
1685 | "align_self": null,
1686 | "border": null,
1687 | "bottom": null,
1688 | "display": null,
1689 | "flex": null,
1690 | "flex_flow": null,
1691 | "grid_area": null,
1692 | "grid_auto_columns": null,
1693 | "grid_auto_flow": null,
1694 | "grid_auto_rows": null,
1695 | "grid_column": null,
1696 | "grid_gap": null,
1697 | "grid_row": null,
1698 | "grid_template_areas": null,
1699 | "grid_template_columns": null,
1700 | "grid_template_rows": null,
1701 | "height": null,
1702 | "justify_content": null,
1703 | "justify_items": null,
1704 | "left": null,
1705 | "margin": null,
1706 | "max_height": null,
1707 | "max_width": null,
1708 | "min_height": null,
1709 | "min_width": null,
1710 | "object_fit": null,
1711 | "object_position": null,
1712 | "order": null,
1713 | "overflow": null,
1714 | "overflow_x": null,
1715 | "overflow_y": null,
1716 | "padding": null,
1717 | "right": null,
1718 | "top": null,
1719 | "visibility": null,
1720 | "width": null
1721 | }
1722 | },
1723 | "dbaf10b01e9b42299f67b4095b210499": {
1724 | "model_module": "@jupyter-widgets/controls",
1725 | "model_name": "HTMLModel",
1726 | "state": {
1727 | "_dom_classes": [],
1728 | "_model_module": "@jupyter-widgets/controls",
1729 | "_model_module_version": "1.5.0",
1730 | "_model_name": "HTMLModel",
1731 | "_view_count": null,
1732 | "_view_module": "@jupyter-widgets/controls",
1733 | "_view_module_version": "1.5.0",
1734 | "_view_name": "HTMLView",
1735 | "description": "",
1736 | "description_tooltip": null,
1737 | "layout": "IPY_MODEL_f3d35e8d289f4be9870e4c55e695bda8",
1738 | "placeholder": "",
1739 | "style": "IPY_MODEL_5f7bea12320e41e2a5b16da0987d14fa",
1740 | "value": " 379/379 [00:00<00:00, 1.96kB/s]"
1741 | }
1742 | },
1743 | "dbd206abbcac414d8730ff7cf4ef55d8": {
1744 | "model_module": "@jupyter-widgets/base",
1745 | "model_name": "LayoutModel",
1746 | "state": {
1747 | "_model_module": "@jupyter-widgets/base",
1748 | "_model_module_version": "1.2.0",
1749 | "_model_name": "LayoutModel",
1750 | "_view_count": null,
1751 | "_view_module": "@jupyter-widgets/base",
1752 | "_view_module_version": "1.2.0",
1753 | "_view_name": "LayoutView",
1754 | "align_content": null,
1755 | "align_items": null,
1756 | "align_self": null,
1757 | "border": null,
1758 | "bottom": null,
1759 | "display": null,
1760 | "flex": null,
1761 | "flex_flow": null,
1762 | "grid_area": null,
1763 | "grid_auto_columns": null,
1764 | "grid_auto_flow": null,
1765 | "grid_auto_rows": null,
1766 | "grid_column": null,
1767 | "grid_gap": null,
1768 | "grid_row": null,
1769 | "grid_template_areas": null,
1770 | "grid_template_columns": null,
1771 | "grid_template_rows": null,
1772 | "height": null,
1773 | "justify_content": null,
1774 | "justify_items": null,
1775 | "left": null,
1776 | "margin": null,
1777 | "max_height": null,
1778 | "max_width": null,
1779 | "min_height": null,
1780 | "min_width": null,
1781 | "object_fit": null,
1782 | "object_position": null,
1783 | "order": null,
1784 | "overflow": null,
1785 | "overflow_x": null,
1786 | "overflow_y": null,
1787 | "padding": null,
1788 | "right": null,
1789 | "top": null,
1790 | "visibility": null,
1791 | "width": null
1792 | }
1793 | },
1794 | "e24e4163978a4f068ec80beab9ad0362": {
1795 | "model_module": "@jupyter-widgets/controls",
1796 | "model_name": "ProgressStyleModel",
1797 | "state": {
1798 | "_model_module": "@jupyter-widgets/controls",
1799 | "_model_module_version": "1.5.0",
1800 | "_model_name": "ProgressStyleModel",
1801 | "_view_count": null,
1802 | "_view_module": "@jupyter-widgets/base",
1803 | "_view_module_version": "1.2.0",
1804 | "_view_name": "StyleView",
1805 | "bar_color": null,
1806 | "description_width": "initial"
1807 | }
1808 | },
1809 | "ec259b6cb7654b819f6e33be6b99a58e": {
1810 | "model_module": "@jupyter-widgets/controls",
1811 | "model_name": "ProgressStyleModel",
1812 | "state": {
1813 | "_model_module": "@jupyter-widgets/controls",
1814 | "_model_module_version": "1.5.0",
1815 | "_model_name": "ProgressStyleModel",
1816 | "_view_count": null,
1817 | "_view_module": "@jupyter-widgets/base",
1818 | "_view_module_version": "1.2.0",
1819 | "_view_name": "StyleView",
1820 | "bar_color": null,
1821 | "description_width": "initial"
1822 | }
1823 | },
1824 | "f2c5b053dc2e4a7590710c091ca5aa1e": {
1825 | "model_module": "@jupyter-widgets/base",
1826 | "model_name": "LayoutModel",
1827 | "state": {
1828 | "_model_module": "@jupyter-widgets/base",
1829 | "_model_module_version": "1.2.0",
1830 | "_model_name": "LayoutModel",
1831 | "_view_count": null,
1832 | "_view_module": "@jupyter-widgets/base",
1833 | "_view_module_version": "1.2.0",
1834 | "_view_name": "LayoutView",
1835 | "align_content": null,
1836 | "align_items": null,
1837 | "align_self": null,
1838 | "border": null,
1839 | "bottom": null,
1840 | "display": null,
1841 | "flex": null,
1842 | "flex_flow": null,
1843 | "grid_area": null,
1844 | "grid_auto_columns": null,
1845 | "grid_auto_flow": null,
1846 | "grid_auto_rows": null,
1847 | "grid_column": null,
1848 | "grid_gap": null,
1849 | "grid_row": null,
1850 | "grid_template_areas": null,
1851 | "grid_template_columns": null,
1852 | "grid_template_rows": null,
1853 | "height": null,
1854 | "justify_content": null,
1855 | "justify_items": null,
1856 | "left": null,
1857 | "margin": null,
1858 | "max_height": null,
1859 | "max_width": null,
1860 | "min_height": null,
1861 | "min_width": null,
1862 | "object_fit": null,
1863 | "object_position": null,
1864 | "order": null,
1865 | "overflow": null,
1866 | "overflow_x": null,
1867 | "overflow_y": null,
1868 | "padding": null,
1869 | "right": null,
1870 | "top": null,
1871 | "visibility": null,
1872 | "width": null
1873 | }
1874 | },
1875 | "f3d35e8d289f4be9870e4c55e695bda8": {
1876 | "model_module": "@jupyter-widgets/base",
1877 | "model_name": "LayoutModel",
1878 | "state": {
1879 | "_model_module": "@jupyter-widgets/base",
1880 | "_model_module_version": "1.2.0",
1881 | "_model_name": "LayoutModel",
1882 | "_view_count": null,
1883 | "_view_module": "@jupyter-widgets/base",
1884 | "_view_module_version": "1.2.0",
1885 | "_view_name": "LayoutView",
1886 | "align_content": null,
1887 | "align_items": null,
1888 | "align_self": null,
1889 | "border": null,
1890 | "bottom": null,
1891 | "display": null,
1892 | "flex": null,
1893 | "flex_flow": null,
1894 | "grid_area": null,
1895 | "grid_auto_columns": null,
1896 | "grid_auto_flow": null,
1897 | "grid_auto_rows": null,
1898 | "grid_column": null,
1899 | "grid_gap": null,
1900 | "grid_row": null,
1901 | "grid_template_areas": null,
1902 | "grid_template_columns": null,
1903 | "grid_template_rows": null,
1904 | "height": null,
1905 | "justify_content": null,
1906 | "justify_items": null,
1907 | "left": null,
1908 | "margin": null,
1909 | "max_height": null,
1910 | "max_width": null,
1911 | "min_height": null,
1912 | "min_width": null,
1913 | "object_fit": null,
1914 | "object_position": null,
1915 | "order": null,
1916 | "overflow": null,
1917 | "overflow_x": null,
1918 | "overflow_y": null,
1919 | "padding": null,
1920 | "right": null,
1921 | "top": null,
1922 | "visibility": null,
1923 | "width": null
1924 | }
1925 | },
1926 | "fabc123e8ab14205a81fee5e8e6484dc": {
1927 | "model_module": "@jupyter-widgets/base",
1928 | "model_name": "LayoutModel",
1929 | "state": {
1930 | "_model_module": "@jupyter-widgets/base",
1931 | "_model_module_version": "1.2.0",
1932 | "_model_name": "LayoutModel",
1933 | "_view_count": null,
1934 | "_view_module": "@jupyter-widgets/base",
1935 | "_view_module_version": "1.2.0",
1936 | "_view_name": "LayoutView",
1937 | "align_content": null,
1938 | "align_items": null,
1939 | "align_self": null,
1940 | "border": null,
1941 | "bottom": null,
1942 | "display": null,
1943 | "flex": null,
1944 | "flex_flow": null,
1945 | "grid_area": null,
1946 | "grid_auto_columns": null,
1947 | "grid_auto_flow": null,
1948 | "grid_auto_rows": null,
1949 | "grid_column": null,
1950 | "grid_gap": null,
1951 | "grid_row": null,
1952 | "grid_template_areas": null,
1953 | "grid_template_columns": null,
1954 | "grid_template_rows": null,
1955 | "height": null,
1956 | "justify_content": null,
1957 | "justify_items": null,
1958 | "left": null,
1959 | "margin": null,
1960 | "max_height": null,
1961 | "max_width": null,
1962 | "min_height": null,
1963 | "min_width": null,
1964 | "object_fit": null,
1965 | "object_position": null,
1966 | "order": null,
1967 | "overflow": null,
1968 | "overflow_x": null,
1969 | "overflow_y": null,
1970 | "padding": null,
1971 | "right": null,
1972 | "top": null,
1973 | "visibility": null,
1974 | "width": null
1975 | }
1976 | },
1977 | "fb595227f20f4a18aaef097807dede5b": {
1978 | "model_module": "@jupyter-widgets/base",
1979 | "model_name": "LayoutModel",
1980 | "state": {
1981 | "_model_module": "@jupyter-widgets/base",
1982 | "_model_module_version": "1.2.0",
1983 | "_model_name": "LayoutModel",
1984 | "_view_count": null,
1985 | "_view_module": "@jupyter-widgets/base",
1986 | "_view_module_version": "1.2.0",
1987 | "_view_name": "LayoutView",
1988 | "align_content": null,
1989 | "align_items": null,
1990 | "align_self": null,
1991 | "border": null,
1992 | "bottom": null,
1993 | "display": null,
1994 | "flex": null,
1995 | "flex_flow": null,
1996 | "grid_area": null,
1997 | "grid_auto_columns": null,
1998 | "grid_auto_flow": null,
1999 | "grid_auto_rows": null,
2000 | "grid_column": null,
2001 | "grid_gap": null,
2002 | "grid_row": null,
2003 | "grid_template_areas": null,
2004 | "grid_template_columns": null,
2005 | "grid_template_rows": null,
2006 | "height": null,
2007 | "justify_content": null,
2008 | "justify_items": null,
2009 | "left": null,
2010 | "margin": null,
2011 | "max_height": null,
2012 | "max_width": null,
2013 | "min_height": null,
2014 | "min_width": null,
2015 | "object_fit": null,
2016 | "object_position": null,
2017 | "order": null,
2018 | "overflow": null,
2019 | "overflow_x": null,
2020 | "overflow_y": null,
2021 | "padding": null,
2022 | "right": null,
2023 | "top": null,
2024 | "visibility": null,
2025 | "width": null
2026 | }
2027 | }
2028 | }
2029 | }
2030 | },
2031 | "nbformat": 4,
2032 | "nbformat_minor": 0
2033 | }
2034 |
--------------------------------------------------------------------------------
/fine_tune_title_generation.py:
--------------------------------------------------------------------------------
1 | import nmatheg as nm
2 | strategy = nm.TrainStrategy(
3 | datasets = 'ARGEN_title_generation',
4 | models = 'UBC-NLP/AraT5-base',
5 | runs = 10,
6 | lr = 5e-5,
7 | epochs = 20,
8 | batch_size = 4,
9 | max_tokens = 128,
10 | )
11 | output = strategy.start()
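12 |
13 | # 'ARGEN_title_generation' maps to the [ARGEN_title_generation] section of
14 | # nmatheg/datasets.ini (task = sum, text = document,title). A minimal variant
15 | # (a sketch, assuming the same TrainStrategy API) for the xlsum dataset that
16 | # is also registered there:
17 | # strategy = nm.TrainStrategy(
18 | #     datasets = 'xlsum',
19 | #     models = 'UBC-NLP/AraT5-base',
20 | #     epochs = 5,
21 | #     batch_size = 4,
22 | #     max_tokens = 128,
23 | # )
24 | # output = strategy.start()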
--------------------------------------------------------------------------------
/nmatheg/__init__.py:
--------------------------------------------------------------------------------
1 | from nmatheg.nmatheg import TrainStrategy
2 | from nmatheg.nmatheg import predict_from_run
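3 |
4 | # Public API surface of the package; both entry points are defined in
5 | # nmatheg/nmatheg.py.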
--------------------------------------------------------------------------------
/nmatheg/config.ini:
--------------------------------------------------------------------------------
1 | [dataset]
2 | dataset_name = ajgt_twitter_ar
3 |
4 | [preprocessing]
5 | segment = False
6 | remove_special_chars = False
7 | remove_english = False
8 | normalize = False
9 | remove_diacritics = False
10 | excluded_chars = []
11 | remove_tatweel = False
12 | remove_html_elements = False
13 | remove_links = False
14 | remove_twitter_meta = False
15 | remove_long_words = False
16 | remove_repeated_chars = False
17 |
18 | [tokenization]
19 | tokenizer_name = WordTokenizer
20 | vocab_size = 10000
21 | max_tokens = 128
22 |
23 | [model]
24 | model_name = bert-base-arabertv01
25 |
26 | [train]
27 | epochs = 10
28 | batch_size = 256
29 | save_dir = .
30 |
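31 | ; Note: configparser returns every value above as a string; numeric fields are
32 | ; cast where they are consumed (e.g. int(config['tokenization']['max_tokens'])
33 | ; in nmatheg/dataset.py).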
--------------------------------------------------------------------------------
/nmatheg/configs.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | def create_default_config(batch_size = 64, epochs = 5, lr = 5e-5, runs = 10, max_tokens = 64,
3 | max_train_samples = -1, preprocessing = {}, ckpt = 'ckpts'):
4 | config = configparser.ConfigParser()
5 |
6 | config['preprocessing'] = {
7 | 'segment' : False,
8 | 'remove_special_chars' : False,
9 | 'remove_english' : False,
10 | 'normalize' : False,
11 | 'remove_diacritics' : False,
12 | 'excluded_chars' : [],
13 | 'remove_tatweel' : False,
14 | 'remove_html_elements' : False,
15 | 'remove_links' : False,
16 | 'remove_twitter_meta' : False,
17 | 'remove_long_words' : False,
18 | 'remove_repeated_chars' : False,
19 | }
20 |
21 | for arg in preprocessing:
22 | config['preprocessing'][arg] = preprocessing[arg]
23 |
24 | config['tokenization'] = {
25 | 'max_tokens' : max_tokens,
26 |         'tok_save_path': ckpt,
27 | 'max_train_samples': max_train_samples
28 | }
29 |
30 | config['log'] = {'print_every':10}
31 |
32 | config['train'] = {
33 | 'save_dir' : ckpt,
34 | 'epochs' : epochs,
35 | 'batch_size' : batch_size,
36 | 'lr': lr,
37 | 'runs': runs
38 | }
39 | return config
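40 |
41 | # Usage sketch: assigning a dict to a section goes through configparser.read_dict,
42 | # which stringifies every value, so callers cast on read (as dataset.py does
43 | # with int(...)).
44 | # config = create_default_config(batch_size=32, epochs=3)
45 | # config['train']['batch_size']        # -> '32' (a string)
46 | # int(config['train']['batch_size'])   # -> 32
47 | # config['preprocessing']['normalize'] # -> 'False'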
--------------------------------------------------------------------------------
/nmatheg/dataset.py:
--------------------------------------------------------------------------------
1 |
2 | # Dataset loading, cleaning, tokenization and dataloader creation for all tasks.
3 |
4 | import tnkeeh as tn
5 | from datasets import load_dataset, load_from_disk
6 | try:
7 | import bpe_surgery
8 | except:
9 | pass
10 |
11 | import os
12 | from .utils import get_preprocessing_args, get_tokenizer
13 | from transformers import AutoTokenizer
14 | import torch
15 | from .preprocess_ner import aggregate_tokens, tokenize_and_align_labels
16 | from .preprocess_qa import prepare_features
17 | import copy
18 |
19 | def split_dataset(dataset, data_config, seed = 42, max_train_samples = -1):
20 | split_names = ['train', 'valid', 'test']
21 |
22 | for i, split_name in enumerate(['train', 'valid', 'test']):
23 | if split_name in data_config:
24 | split_names[i] = data_config[split_name]
25 | dataset[split_name] = dataset[split_names[i]]
26 |
27 | if max_train_samples < len(dataset['train']) and max_train_samples != -1:
28 | print(f"truncating train samples from {len(dataset['train'])} to {max_train_samples}")
29 | dataset['train'] = dataset['train'].select(range(max_train_samples))
30 |
31 | #create validation split
32 | if 'valid' not in dataset:
33 | train_valid_dataset = dataset['train'].train_test_split(test_size=0.1, seed = seed)
34 | dataset['valid'] = train_valid_dataset.pop('test')
35 | dataset['train'] = train_valid_dataset['train']
36 |
37 |     #create test split
38 | if 'test' not in dataset:
39 | train_valid_dataset = dataset['train'].train_test_split(test_size=0.1, seed = seed)
40 | dataset['test'] = train_valid_dataset.pop('test')
41 | dataset['train'] = train_valid_dataset['train']
42 |
43 | columns = list(dataset.keys())
44 | for key in columns:
45 | if key not in ['train', 'valid', 'test']:
46 | del dataset[key]
47 | return dataset
48 |
49 |
50 | def clean_dataset(dataset, config, data_config, task = 'cls'):
51 | if task == 'mt':
52 | sourceString, targetString = data_config['text'].split(',')
53 | args = get_preprocessing_args(config)
54 | cleaner = tn.Tnkeeh(**args)
55 | dataset = cleaner.clean_hf_dataset(dataset, targetString)
56 | return dataset
57 | elif task == 'qa':
58 | question, context = data_config['text'].split(',')
59 | args = get_preprocessing_args(config)
60 | cleaner = tn.Tnkeeh(**args)
61 | dataset = cleaner.clean_hf_dataset(dataset, question)
62 | return dataset
63 | elif task == 'nli':
64 | premise, hypothesis = data_config['text'].split(',')
65 | args = get_preprocessing_args(config)
66 | cleaner = tn.Tnkeeh(**args)
67 | dataset = cleaner.clean_hf_dataset(dataset, premise)
68 | dataset = cleaner.clean_hf_dataset(dataset, hypothesis)
69 | return dataset
70 | else:
71 | args = get_preprocessing_args(config)
72 | cleaner = tn.Tnkeeh(**args)
73 | dataset = cleaner.clean_hf_dataset(dataset, data_config['text'])
74 | return dataset
75 |
76 | def write_data_for_train(dataset, text, path, task = 'cls'):
77 | data = []
78 | if task == 'cls':
79 | for sample in dataset:
80 | data.append(sample[text])
81 | elif task == 'nli':
82 | for sample in dataset:
83 | hypothesis, premise = text.split(",")
84 | data.append(sample[hypothesis]+" "+sample[premise])
85 | elif task == 'ner':
86 | for sample in dataset:
87 | data.append(' '.join(sample[text]))
88 | elif task == 'qa':
89 | for sample in dataset:
90 | context, question = text.split(",")
91 | data.append(sample[context]+" "+sample[question])
92 | elif task == 'mt':
93 | for sample in dataset:
94 |             src, trg = text.split(",")
95 |             data.append(sample[src]+" "+sample[trg])
96 |
97 |     open(f'{path}/data.txt', 'w').write('\n'.join(data))
98 |
99 | def get_prev_tokenizer(save_dir, tokenizer_name, vocab_size, dataset_name, model_name):
100 | prev_vocab_sizes = [int(v) for v in os.listdir(f"{save_dir}/{tokenizer_name}") if int(v) != vocab_size and dataset_name in os.listdir(f"{save_dir}/{tokenizer_name}/{v}")]
101 |
102 | if len(prev_vocab_sizes) == 0:
103 | return ""
104 | else:
105 | return f"{save_dir}/{tokenizer_name}/{max(prev_vocab_sizes)}/{dataset_name}/{model_name}/tokenizer"
106 |
107 | def create_dataset(config, data_config, vocab_size = 300,
108 | model_name = "birnn", tokenizer_name = "bpe", clean = True, mode = "finetune",
109 | tok_save_path = None, data_save_path = None):
110 |
111 | hf_dataset_name = data_config['name']
112 | dataset_name = hf_dataset_name.split("/")[-1] #in case we have / in the name
113 | max_tokens = int(config['tokenization']['max_tokens'])
114 | max_train_samples = int(config['tokenization']['max_train_samples'])
115 | save_dir = config['train']['save_dir']
116 | prev_tok_save_path = ""
117 | if mode == "pretrain":
118 | prev_tok_save_path = get_prev_tokenizer(save_dir, tokenizer_name, vocab_size, dataset_name, model_name)
119 |
120 | batch_size = int(config['train']['batch_size'])
121 | task_name = data_config['task']
122 |
123 | if 'subset' in data_config:
124 | dataset = load_dataset(hf_dataset_name, data_config['subset'])
125 | else:
126 | dataset = load_dataset(hf_dataset_name)
127 |
128 | if task_name != "qa" and clean:
129 | dataset = clean_dataset(dataset, config, data_config, task = task_name)
130 |
131 | dataset = split_dataset(dataset, data_config, max_train_samples=max_train_samples)
132 | examples = copy.deepcopy(dataset)
133 | print(dataset)
134 | if 'birnn' in model_name:
135 | model_type = 'rnn'
136 | else:
137 | model_type = 'transformer'
138 | if task_name == 'cls':
139 | # tokenize data
140 | if 'birnn' not in model_name:
141 | tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False, model_max_length = 512)
142 | if not os.path.isfile(f"{data_save_path}/dataset_dict.json"):
143 | dataset = dataset.map(lambda examples:tokenizer(examples[data_config['text']], truncation=True, padding='max_length'), batched=True)
144 | dataset = dataset.map(lambda examples:{'labels': examples[data_config['label']]}, batched=True)
145 | dataset.save_to_disk(data_save_path)
146 | else:
147 | dataset = load_from_disk(data_save_path)
148 | columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels']
149 | else:
150 | tokenizer = get_tokenizer(tokenizer_name, vocab_size= vocab_size)
151 |
152 | if os.path.isfile(f"{tok_save_path}/tok.model"):
153 | print('loading pretrained tokenizer')
154 | tokenizer.load(tok_save_path)
155 | dataset = load_from_disk(data_save_path)
156 | else:
157 | write_data_for_train(dataset['train'], data_config['text'], data_save_path)
158 | if prev_tok_save_path != "":
159 | tokenizer.load(prev_tok_save_path)
160 | else:
161 | print('training tokenizer from scratch')
162 | tokenizer.train(file_path = f'{data_save_path}/data.txt')
163 |                     tokenizer.save(tok_save_path)
164 | dataset = dataset.map(lambda examples:{'input_ids': tokenizer.encode_sentences(examples[data_config['text']], out_length= max_tokens)}, batched=True)
165 | dataset = dataset.map(lambda examples:{'labels': examples[data_config['label']]}, batched=True)
166 | dataset.save_to_disk(data_save_path)
167 | columns=['input_ids', 'labels']
168 |
169 | elif task_name == 'nli':
170 | # tokenize data
171 | premise, hypothesis = data_config['text'].split(",")
172 | if 'birnn' not in model_name:
173 | tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False, model_max_length = 512)
174 | def concat(examples):
175 | texts = (examples[premise], examples[hypothesis])
176 | result = tokenizer(*texts, truncation=True, padding='max_length')
177 | return result
178 |
179 | if not os.path.isfile(f"{data_save_path}/dataset_dict.json"):
180 | dataset = dataset.map(concat, batched=True)
181 | dataset.save_to_disk(data_save_path)
182 |             else:
183 |                 dataset = load_from_disk(data_save_path)
184 | columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels']
185 | else:
186 | tokenizer = get_tokenizer(tokenizer_name, vocab_size= vocab_size)
187 | if os.path.isfile(f"{tok_save_path}/tok.model"):
188 | print('loading pretrained tokenizer')
189 | tokenizer.load(tok_save_path)
190 | dataset = load_from_disk(data_save_path)
191 | else:
192 |
193 | write_data_for_train(dataset['train'], data_config['text'], data_save_path, task = 'nli')
194 | if prev_tok_save_path != "":
195 | tokenizer.load(prev_tok_save_path)
196 | else:
197 | print('training tokenizer from scratch')
198 | tokenizer.train(file_path = f"{data_save_path}/data.txt")
199 | tokenizer.save(tok_save_path)
200 |
201 | def concat(example):
202 | example["text"] = example[premise] + ' ' + example[hypothesis]
203 | return example
204 |
205 | dataset = dataset.map(lambda examples:{'input_ids': tokenizer.encode_sentences(sentences1 = examples[premise], sentences2 = examples[hypothesis], out_length= max_tokens)}, batched=True)
206 | dataset = dataset.map(lambda examples:{'labels': examples[data_config['label']]}, batched=True)
207 | dataset.save_to_disk(data_save_path)
208 | columns=['input_ids', 'labels']
209 |
210 | elif task_name in ['ner', 'pos']:
211 | dataset = aggregate_tokens(dataset, config, data_config)
212 | if 'birnn' not in model_name:
213 | tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
214 | if not os.path.isfile(f"{data_save_path}/dataset_dict.json"):
215 |                 print('aligning the tokens ...')
216 | for split in dataset:
217 | dataset[split] = dataset[split].map(lambda x: tokenize_and_align_labels(x, tokenizer, data_config, model_type = model_type)
218 | , batched=True, remove_columns=dataset[split].column_names)
219 | dataset.save_to_disk(data_save_path)
220 | else:
221 | dataset = load_from_disk(data_save_path)
222 | columns=['input_ids', 'attention_mask', 'labels']
223 | else:
224 | tokenizer = get_tokenizer(tokenizer_name, vocab_size= vocab_size)
225 |
226 | if os.path.isfile(f"{tok_save_path}/tok.model"):
227 | print('loading pretrained tokenizer')
228 | tokenizer.load(tok_save_path)
229 | dataset = load_from_disk(data_save_path)
230 | else:
231 | write_data_for_train(dataset['train'], data_config['text'], data_save_path, task = task_name)
232 | if prev_tok_save_path != "":
233 | tokenizer.load(prev_tok_save_path)
234 | else:
235 | print('training tokenizer from scratch')
236 | tokenizer.train(file_path = f'{data_save_path}/data.txt')
237 | tokenizer.save(tok_save_path)
238 |                 print('aligning the tokens ...')
239 | for split in dataset:
240 | dataset[split] = dataset[split].map(lambda x: tokenize_and_align_labels(x, tokenizer, data_config, model_type = model_type)
241 | , batched=True, remove_columns=dataset[split].column_names)
242 | dataset.save_to_disk(data_save_path)
243 |
244 | columns=['input_ids', 'labels']
245 |
246 | elif task_name == 'qa':
247 | if 'birnn' not in model_name:
248 | tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
249 | if not os.path.isfile(f"{data_save_path}/dataset_dict.json"):
250 | for split in dataset:
251 | dataset[split] = dataset[split].map(lambda x: prepare_features(x, tokenizer, data_config, model_type = model_type)
252 | , batched=True, remove_columns=dataset[split].column_names)
253 | dataset.save_to_disk(data_save_path)
254 | else:
255 | dataset = load_from_disk(data_save_path)
256 | columns=['input_ids', 'attention_mask', 'start_positions', 'end_positions']
257 | else:
258 | tokenizer = get_tokenizer(tokenizer_name, vocab_size= vocab_size)
259 |
260 | if os.path.isfile(f"{tok_save_path}/tok.model"):
261 | print('loading pretrained tokenizer')
262 | tokenizer.load(tok_save_path)
263 | dataset = load_from_disk(data_save_path)
264 | else:
265 | write_data_for_train(dataset['train'], data_config['text'], data_save_path, task = task_name)
266 | if prev_tok_save_path != "":
267 | tokenizer.load(prev_tok_save_path)
268 | else:
269 | print('training tokenizer from scratch')
270 | tokenizer.train(file_path = f'{data_save_path}/data.txt')
271 | tokenizer.save(tok_save_path)
272 | for split in dataset:
273 | dataset[split] = dataset[split].map(lambda x: prepare_features(x, tokenizer, data_config, model_type = model_type, max_len = max_tokens)
274 | , batched=True, remove_columns=dataset[split].column_names)
275 | dataset.save_to_disk(data_save_path)
276 | columns=['input_ids', 'start_positions', 'end_positions']
277 |
278 | elif task_name == 'mt':
279 | prefix = "translate English to Arabic: "
280 | src_lang, trg_lang = data_config['text'].split(",")
281 |
282 | if 'birnn' not in model_name:
283 |
284 | tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
285 | def preprocess(dataset):
286 | inputs = [prefix + ex for ex in dataset[src_lang]]
287 | targets = [ex for ex in dataset[trg_lang]]
288 | dataset = tokenizer(inputs, max_length=128, truncation=True, padding = 'max_length')
289 |
290 | # Setup the tokenizer for targets
291 | with tokenizer.as_target_tokenizer():
292 | labels = tokenizer(targets, max_length=128, truncation=True, padding = 'max_length')
293 |
294 | dataset["labels"] = labels["input_ids"]
295 | return dataset
296 | if not os.path.isfile(f"{data_save_path}/dataset_dict.json"):
297 | dataset = dataset.map(preprocess, batched=True)
298 | dataset.save_to_disk(data_save_path)
299 | else:
300 | dataset = load_from_disk(data_save_path)
301 | columns = ['input_ids', 'attention_mask', 'labels']
302 | else:
303 | src_tokenizer = get_tokenizer('BPE', vocab_size= 1000)
304 | trg_tokenizer = get_tokenizer(tokenizer_name, vocab_size= vocab_size)
305 | src_tok_save_path = f"{save_dir}/{tokenizer_name}/1000/{dataset_name}/{model_name}/tokenizer"
306 |
307 | if os.path.isfile(f"{tok_save_path}/trg_tok.model"):
308 | print('loading pretrained tokenizers')
309 | src_tokenizer.load(f"{src_tok_save_path}/", name = "src_tok")
310 | trg_tokenizer.load(f"{tok_save_path}/", name = "trg_tok")
311 | dataset = load_from_disk(data_save_path)
312 | else:
313 | open(f'{data_save_path}/src_data.txt', 'w').write('\n'.join(dataset['train'][src_lang]))
314 | open(f'{data_save_path}/trg_data.txt', 'w').write('\n'.join(dataset['train'][trg_lang]))
315 |
316 | if not os.path.isfile(f"{src_tok_save_path}/src_tok.model"):
317 | src_tokenizer.train(file_path = f'{data_save_path}/src_data.txt')
318 |                     src_tokenizer.save(f"{src_tok_save_path}/", name = 'src_tok')
319 |
320 | if prev_tok_save_path != "":
321 |                     trg_tokenizer.load(prev_tok_save_path)
322 | else:
323 | print('training tokenizer from scratch')
324 |
325 | trg_tokenizer.train(file_path = f'{data_save_path}/trg_data.txt')
326 | trg_tokenizer.save(f"{tok_save_path}/", name = 'trg_tok')
327 |
328 | def preprocess(dataset):
329 | inputs = [ex for ex in dataset[src_lang]]
330 | targets = [ex for ex in dataset[trg_lang]]
331 |
332 | input_ids = src_tokenizer.encode_sentences(inputs, out_length = max_tokens, add_boundry = True)
333 | labels = trg_tokenizer.encode_sentences(targets, out_length = max_tokens, add_boundry = True)
334 | dataset = dataset.add_column("input_ids", input_ids)
335 | dataset = dataset.add_column("labels", labels)
336 | return dataset
337 |
338 | for split in dataset:
339 | dataset[split] = preprocess(dataset[split])
340 |
341 | dataset.save_to_disk(data_save_path)
342 |
343 | columns = ['input_ids', 'labels']
344 | tokenizer = trg_tokenizer
345 |
346 | elif task_name == 'sum':
347 | prefix = ""
348 | text, summary = data_config['text'].split(",")
349 |
350 | if 'birnn' not in model_name:
351 |
352 | tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
353 | def preprocess(dataset):
354 | inputs = [prefix + ex for ex in dataset[text]]
355 | targets = [ex for ex in dataset[summary]]
356 | dataset = tokenizer(inputs, max_length=128, truncation=True, padding = 'max_length')
357 |
358 | # Setup the tokenizer for targets
359 | with tokenizer.as_target_tokenizer():
360 | labels = tokenizer(targets, max_length=128, truncation=True, padding = 'max_length')
361 | labels["input_ids"] = [
362 | [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]]
363 | dataset["labels"] = labels["input_ids"]
364 | return dataset
365 | if not os.path.isfile(f"{data_save_path}/dataset_dict.json"):
366 | dataset = dataset.map(preprocess, batched=True)
367 | dataset.save_to_disk(data_save_path)
368 | else:
369 | dataset = load_from_disk(data_save_path)
370 | columns = ['input_ids', 'attention_mask', 'labels']
371 | else:
372 | src_tokenizer = get_tokenizer('BPE', vocab_size= 1000)
373 | trg_tokenizer = get_tokenizer(tokenizer_name, vocab_size= vocab_size)
374 | src_tok_save_path = f"{save_dir}/{tokenizer_name}/1000/{dataset_name}/{model_name}/tokenizer"
375 |
376 | if os.path.isfile(f"{tok_save_path}/trg_tok.model"):
377 | print('loading pretrained tokenizers')
378 | src_tokenizer.load(f"{src_tok_save_path}/", name = "src_tok")
379 | trg_tokenizer.load(f"{tok_save_path}/", name = "trg_tok")
380 | dataset = load_from_disk(data_save_path)
381 | else:
382 | open(f'{data_save_path}/src_data.txt', 'w').write('\n'.join(dataset['train'][text]))
383 | open(f'{data_save_path}/trg_data.txt', 'w').write('\n'.join(dataset['train'][summary]))
384 |
385 | if not os.path.isfile(f"{src_tok_save_path}/src_tok.model"):
386 | src_tokenizer.train(file_path = f'{data_save_path}/src_data.txt')
387 |                     src_tokenizer.save(f"{src_tok_save_path}/", name = 'src_tok')
388 |
389 | if prev_tok_save_path != "":
390 |                     trg_tokenizer.load(prev_tok_save_path)
391 | else:
392 | print('training tokenizer from scratch')
393 |
394 | trg_tokenizer.train(file_path = f'{data_save_path}/trg_data.txt')
395 | trg_tokenizer.save(f"{tok_save_path}/", name = 'trg_tok')
396 |
397 | def preprocess(dataset):
398 | inputs = [ex for ex in dataset[text]]
399 | targets = [ex for ex in dataset[summary]]
400 |
401 | input_ids = src_tokenizer.encode_sentences(inputs, out_length = max_tokens, add_boundry = True)
402 | labels = trg_tokenizer.encode_sentences(targets, out_length = max_tokens, add_boundry = True)
403 | dataset = dataset.add_column("input_ids", input_ids)
404 | dataset = dataset.add_column("labels", labels)
405 | return dataset
406 |
407 | for split in dataset:
408 | dataset[split] = preprocess(dataset[split])
409 |
410 | dataset.save_to_disk(data_save_path)
411 |
412 | columns = ['input_ids', 'labels']
413 | tokenizer = trg_tokenizer
414 | #create loaders
415 | if task_name != 'qa':
416 | for split in dataset:
417 | dataset[split].set_format(type='torch', columns=columns)
418 | dataset[split] = torch.utils.data.DataLoader(dataset[split], batch_size=batch_size, shuffle = True)
419 |
420 | return tokenizer, [dataset['train'], dataset['valid'], dataset['test']], [examples['train'], examples['valid'], examples['test']]
421 |
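422 | # Usage sketch (hypothetical paths; data_config mirrors the [ajgt_twitter_ar]
423 | # section of nmatheg/datasets.ini):
424 | #
425 | # from nmatheg.configs import create_default_config
426 | # config = create_default_config(batch_size=8, epochs=1)
427 | # data_config = {'name': 'ajgt_twitter_ar', 'task': 'cls',
428 | #                'text': 'text', 'label': 'label'}
429 | # tokenizer, (train_dl, valid_dl, test_dl), _ = create_dataset(
430 | #     config, data_config, model_name='birnn', tokenizer_name='bpe',
431 | #     vocab_size=300, tok_save_path='ckpts/tok', data_save_path='ckpts/data')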
--------------------------------------------------------------------------------
/nmatheg/datasets.ini:
--------------------------------------------------------------------------------
1 | [ajgt_twitter_ar]
2 | name = ajgt_twitter_ar
3 | text = text
4 | label = label
5 | num_labels = 2
6 | train = train
7 | task = cls
8 | labels = Negative,Positive
9 |
10 | [off-eval-ar]
11 | name = Zaid/off-eval-ar
12 | text = tweet
13 | label = label
14 | num_labels = 2
15 | train = train
16 | test = test
17 | task = cls
18 | labels = NOT,OFF
19 |
20 | [off-eval-en]
21 | name = Zaid/off-eval-en
22 | text = tweet
23 | label = label
24 | num_labels = 2
25 | train = train
26 | test = test
27 | task = cls
28 | labels = NOT,OFF
29 |
30 | [metrec]
31 | name = metrec
32 | text = text
33 | label = label
34 | num_labels = 14
35 | train = train
36 | test = test
37 | task = cls
38 | labels = saree,kamel,mutakareb,mutadarak,munsareh,madeed,mujtath,ramal,baseet,khafeef,taweel,wafer,hazaj,rajaz
39 |
40 | [labr]
41 | name = labr
42 | text = text
43 | label = label
44 | num_labels = 5
45 | train = train
46 | test = test
47 | task = cls
48 | labels = 1,2,3,4,5
49 |
50 | [ar_res_reviews]
51 | name = ar_res_reviews
52 | text = text
53 | label = polarity
54 | num_labels = 2
55 | split = train
56 | task = cls
57 | labels = negative,positive
58 |
59 | [arsentd_lev]
60 | name = arsentd_lev
61 | text = Tweet
62 | label = Sentiment
63 | num_labels = 5
64 | train = train
65 | task = cls
66 | labels = negative,neutral,positive,very_negative,very_positive
67 |
68 | [oclar]
69 | name = oclar
70 | text = review
71 | label = rating
72 | num_labels = 5
73 | train = train
74 | task = cls
75 | labels = 1,2,3,4,5
76 |
77 | [emotone_ar]
78 | name = emotone_ar
79 | text = tweet
80 | label = label
81 | num_labels = 8
82 | train = train
83 | task = cls
84 | labels = none,anger,joy,sadness,love,sympathy,surprise,fear
85 |
86 | [hard]
87 | name = hard
88 | text = text
89 | label = label
90 | num_labels = 5
91 | train = train
92 | task = cls
93 | labels = 1,2,3,4,5
94 |
95 | [ar_sarcasm]
96 | name = ar_sarcasm
97 | text = tweet
98 | label = sarcasm
99 | num_labels = 2
100 | train = train
101 | test = test
102 | task = cls
103 | labels = non-sarcastic,sarcastic
104 |
105 | [caner]
106 | name = caner
107 | subset = dummy
108 | text = token
109 | label = ner_tag
110 | num_labels = 21
111 | train = train
112 | task = ner
113 | labels = Allah,Book,Clan,Crime,Date,Day,Hell,Loc,Meas,Mon,Month,NatOb,Number,O,Org,Para,Pers,Prophet,Rlig,Sect,Time
114 |
115 | [arcd]
116 | name = arcd
117 | text = question,context
118 | label = answer
119 | num_labels = 2
120 | train = train
121 | test = validation
122 | task = qa
123 | labels = start_logits,end_logits
124 |
125 | [mlqa]
126 | name = mlqa
127 | subset = mlqa-translate-train.ar
128 | text = question,context
129 | label = answer
130 | num_labels = 2
131 | train = train
132 | test = validation
133 | task = qa
134 | labels = start_logits,end_logits
135 |
136 | [tatoeba_mt]
137 | name = Helsinki-NLP/tatoeba_mt
138 | subset = ara-eng
139 | text = targetString,sourceString
140 | num_labels = 0
141 | train = validation
142 | test = test
143 | task = mt
144 | labels = english,arabic
145 |
146 |
147 | [xnli]
148 | name = xnli
149 | subset = ar
150 | text = premise,hypothesis
151 | label = label
152 | num_labels = 3
153 | train = train
154 | valid = validation
155 | test = test
156 | task = nli
157 | labels = entailment,neutral,contradiction
158 |
159 | [xlsum]
160 | name = csebuetnlp/xlsum
161 | subset = arabic
162 | text = text,summary
163 | num_labels = 0
164 | train = train
165 | valid = validation
166 | test = test
167 | task = sum
168 | labels = text,summary
169 |
170 | [ARGEN_title_generation]
171 | name = arbml/ARGEN_title_generation
172 | text = document,title
173 | num_labels = 0
174 | train = train
175 | test = validation
176 | task = sum
177 | labels = document,title
178 |
179 | [wiki_lingua_ar]
180 | name = arbml/wiki_lingua_ar
181 | text = article,summary
182 | num_labels = 0
183 | train = train
184 | valid = validation
185 | test = test
186 | task = sum
187 | labels = article,summary
188 |
189 | [arabic_pos_dialect]
190 | name = arbml/arabic_pos_dialect
191 | subset = all
192 | text = words
193 | label = pos_tags
194 | num_labels = 22
195 | train = train
196 | task = pos
197 | labels = ADJ,ADV,CASE,CONJ,DET,EMOT,EOS,FOREIGN,FUT_PART,HASH,MENTION,NEG_PART,NOUN,NSUFF,NUM,PART,PREP,PROG_PART,PRON,PUNC,URL,V
198 |
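199 | ; Template (a sketch; values are hypothetical) for registering a new dataset.
200 | ; The keys mirror the sections above and the fields read by nmatheg/dataset.py.
201 | ; [my_dataset]
202 | ; name = username/my_dataset
203 | ; text = text
204 | ; label = label
205 | ; num_labels = 2
206 | ; train = train
207 | ; test = test
208 | ; task = cls
209 | ; labels = negative,positive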
--------------------------------------------------------------------------------
/nmatheg/models.py:
--------------------------------------------------------------------------------
1 | from transformers import (
2 | AutoModelForSequenceClassification,
3 | AutoConfig,
4 | AutoModelForTokenClassification,
5 | AutoModelForQuestionAnswering,
6 | AutoModelForSeq2SeqLM,
7 | get_linear_schedule_with_warmup)
8 | from evaluate import load
9 | import random
10 | import torch.nn.functional as F
11 | import os
12 | import time
13 | import numpy as np
14 | from tqdm.auto import tqdm
15 | import torch
16 | from torch.optim import AdamW
17 | import torch.nn as nn
18 | from accelerate import Accelerator
19 | from datasets import load_metric
20 | import copy
21 | from .ner_utils import get_labels
22 | from .qa_utils import evaluate_metric
23 | from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
24 | import nltk
25 | nltk.download('punkt')
26 |
27 | class BiRNN(nn.Module):
28 | def __init__(self, vocab_size, num_labels, hidden_dim = 128):
29 |
30 | super().__init__()
31 |
32 | self.embedding = nn.Embedding(vocab_size, hidden_dim)
33 | self.bigru1 = nn.GRU(hidden_dim, hidden_dim, bidirectional=True, batch_first = True)
34 | self.bigru2 = nn.GRU(2*hidden_dim, hidden_dim, bidirectional=True, batch_first = True)
35 | self.bigru3 = nn.GRU(2*hidden_dim, hidden_dim, bidirectional=True, batch_first = True)
36 | self.fc = nn.Linear(2*hidden_dim, num_labels)
37 | self.hidden_dim = hidden_dim
38 | self.num_labels = num_labels
39 |
40 | def forward(self,
41 | input_ids,
42 | labels = None):
43 | embedded = self.embedding(input_ids)
44 | out,h = self.bigru1(embedded)
45 | out,h = self.bigru2(out)
46 | out,h = self.bigru3(out)
47 | logits = self.fc(out[:,0,:])
48 | if labels is not None:
49 | loss = self.compute_loss(logits, labels)
50 | return {'loss':loss,
51 | 'logits':logits}
52 | return {'logits': logits}
53 |
54 | def compute_loss(self, logits, labels):
55 | loss_fct = nn.CrossEntropyLoss()
56 | loss = loss_fct(logits, labels)
57 | return loss
58 |
59 | class BaseTextClassficationModel:
60 | def __init__(self, config):
61 | self.model = nn.Module()
62 | self.num_labels = config['num_labels']
63 | self.model_name = config['model_name']
64 | self.vocab_size = config['vocab_size']
65 |
66 | self.device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
67 |
68 | def train(self, datasets, examples, **kwargs):
69 | save_dir = kwargs['save_dir']
70 | epochs = kwargs['epochs']
71 | lr = kwargs['lr']
72 |
73 | train_dataset, valid_dataset, test_dataset = datasets
74 |
75 | self.optimizer = AdamW(self.model.parameters(), lr = lr)
76 | filepath = os.path.join(save_dir, 'pytorch_model.bin')
77 | best_accuracy = 0
78 | pbar = tqdm(total=epochs * len(train_dataset), leave=True)
79 | for epoch in range(epochs):
80 | accuracy = 0
81 |             total_loss = 0
82 | self.model.train().to(self.device)
83 | for _, batch in enumerate(train_dataset):
84 | batch = {k: v.to(self.device) for k, v in batch.items()}
85 | self.optimizer.zero_grad()
86 | outputs = self.model(**batch)
87 | loss = outputs['loss']
88 | loss.backward()
89 | self.optimizer.step()
90 | labels = batch['labels'].cpu()
91 | preds = outputs['logits'].argmax(-1).cpu()
92 |                 accuracy += accuracy_score(labels, preds) / len(train_dataset)
93 |                 total_loss += loss.item() / len(train_dataset)
94 | batch = None
95 | pbar.update(1)
96 |             print(f"Epoch {epoch} Train Loss {total_loss:.4f} Train Accuracy {accuracy:.4f}")
97 |
98 | self.model.eval().to(self.device)
99 | results = self.evaluate_dataset(valid_dataset)
100 | print(f"Epoch {epoch} Valid Loss {results['loss']:.4f} Valid Accuracy {results['accuracy']:.4f}")
101 |
102 | val_accuracy = results['accuracy']
103 | if val_accuracy >= best_accuracy:
104 | best_accuracy = val_accuracy
105 | torch.save(self.model.state_dict(), filepath)
106 |
107 |         # restore the best checkpoint for final evaluation
108 |
109 | self.model.load_state_dict(torch.load(filepath))
110 | self.model.eval()
111 | test_metrics = self.evaluate_dataset(test_dataset)
112 | print(f"Test Loss {test_metrics['loss']:.4f} Test Accuracy {test_metrics['accuracy']:.4f}")
113 | return test_metrics
114 |
115 | def evaluate_dataset(self, dataset, desc = "Eval"):
116 | accuracy = 0
117 | total_loss = 0
118 | pbar = tqdm(total=len(dataset), position=0, leave=False, desc=desc)
119 | refs = []
120 | preds = []
121 | with torch.no_grad():
122 | for _, batch in enumerate(dataset):
123 | batch = {k: v.to(self.device) for k, v in batch.items()}
124 | outputs = self.model(**batch)
125 | loss = outputs['loss']
126 | refs += batch['labels'].cpu()
127 | preds += outputs['logits'].argmax(-1).cpu()
128 | total_loss += loss / len(dataset)
129 | batch = None
130 | pbar.update(1)
131 | return {
132 | "loss":float(total_loss.cpu().detach().numpy()),
133 | "precision": precision_score(refs, preds, average = "macro"),
134 | "recall": recall_score(refs, preds, average = "macro"),
135 | "f1": f1_score(refs, preds, average = "macro"),
136 | "accuracy": accuracy_score(refs, preds),
137 | }
138 |
139 | class SimpleClassificationModel(BaseTextClassficationModel):
140 | def __init__(self, config):
141 | BaseTextClassficationModel.__init__(self, config)
142 | self.model = BiRNN(self.vocab_size, self.num_labels)
143 | self.model.to(self.device)
144 | # self.optimizer = AdamW(self.model.parameters(), lr = 5e-5)
145 |
146 | def wipe_memory(self):
147 | self.model = None
148 | self.optimizer = None
149 | torch.cuda.empty_cache()
150 |
151 | class BERTTextClassificationModel(BaseTextClassficationModel):
152 | def __init__(self, config):
153 | BaseTextClassficationModel.__init__(self, config)
154 | config = AutoConfig.from_pretrained(self.model_name,num_labels=self.num_labels)
155 | self.model = AutoModelForSequenceClassification.from_pretrained(self.model_name, config = config)
156 |
157 | def wipe_memory(self):
158 | self.model = None
159 | self.optimizer = None
160 | torch.cuda.empty_cache()
161 |
162 | class BiRNNForTokenClassification(nn.Module):
163 | def __init__(self, vocab_size, num_labels, hidden_dim = 128):
164 |
165 | super().__init__()
166 |
167 | self.embedding = nn.Embedding(vocab_size, hidden_dim)
168 | self.bigru1 = nn.GRU(hidden_dim, hidden_dim, bidirectional=True, batch_first = True)
169 | self.bigru2 = nn.GRU(2*hidden_dim, hidden_dim, bidirectional=True, batch_first = True)
170 | self.bigru3 = nn.GRU(2*hidden_dim, hidden_dim//2, bidirectional=True, batch_first = True)
171 | self.fc = nn.Linear(hidden_dim, num_labels)
172 | self.hidden_dim = hidden_dim
173 | self.num_labels = num_labels
174 |
175 | def forward(self,
176 | input_ids,
177 | labels = None):
178 |
179 | embedded = self.embedding(input_ids)
180 | out,h = self.bigru1(embedded)
181 | out,h = self.bigru2(out)
182 | out,h = self.bigru3(out)
183 | logits = self.fc(out)
184 | if labels is not None:
185 | loss = self.compute_loss(logits, labels)
186 | return {'loss':loss,
187 | 'logits':logits}
188 | else:
189 | return {'logits':logits}
190 |
191 | def compute_loss(self, logits, labels):
192 | loss_fct = nn.CrossEntropyLoss()
193 | loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
194 | return loss
195 |
196 | class BaseTokenClassficationModel:
197 | def __init__(self, config):
198 | self.model = nn.Module()
199 | self.num_labels = config['num_labels']
200 | self.model_name = config['model_name']
201 | self.vocab_size = config['vocab_size']
202 | self.labels = config['labels']
203 | self.device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
204 |         self.metric = load("seqeval")
205 | self.accelerator = Accelerator()
206 |
207 | def train(self, datasets, examples, **kwargs):
208 | save_dir = kwargs['save_dir']
209 | epochs = kwargs['epochs']
210 | lr = kwargs['lr']
211 | self.optimizer = AdamW(self.model.parameters(), lr = lr)
212 |
213 | train_dataset, valid_dataset, test_dataset = datasets
214 | filepath = os.path.join(save_dir, 'pytorch_model.bin')
215 | best_accuracy = 0
216 | pbar = tqdm(total=epochs * len(train_dataset), leave=True)
217 | for epoch in range(epochs):
218 | accuracy = 0
219 |             total_loss = 0
220 | self.model.train().to(self.device)
221 |             predictions, true_labels = [], []
222 | for _, batch in enumerate(train_dataset):
223 | batch = {k: v.to(self.device) for k, v in batch.items()}
224 | outputs = self.model(**batch)
225 | loss = outputs['loss']
226 | loss.backward()
227 | self.optimizer.step()
228 | self.optimizer.zero_grad()
229 |                 total_loss += loss.item() / len(train_dataset)
230 | batch = None
231 | pbar.update(1)
232 |
233 | train_metrics = self.evaluate_dataset(train_dataset)
234 | print(f"Epoch {epoch} Train Loss {train_metrics['loss']:.4f} Train F1 {train_metrics['f1']:.4f}")
235 |
236 | self.model.eval().to(self.device)
237 | valid_metrics = self.evaluate_dataset(valid_dataset)
238 | print(f"Epoch {epoch} Valid Loss {valid_metrics['loss']:.4f} Valid F1 {valid_metrics['f1']:.4f}")
239 |
240 | val_accuracy = valid_metrics['f1']
241 | if val_accuracy >= best_accuracy:
242 | best_accuracy = val_accuracy
243 | torch.save(self.model.state_dict(), filepath)
244 |
245 | self.model.load_state_dict(torch.load(filepath))
246 | self.model.eval()
247 | test_metrics = self.evaluate_dataset(test_dataset)
248 | print(f"Test Loss {test_metrics['loss']:.4f} Test F1 {test_metrics['f1']:.4f}")
249 | return {
250 | "precision": test_metrics["precision"],
251 | "recall": test_metrics["recall"],
252 | "f1": test_metrics["f1"],
253 | "accuracy": test_metrics["accuracy"],
254 | }
255 |
256 | def evaluate_dataset(self, dataset, desc = "Eval"):
257 | preds = []
258 | refs = []
259 |
260 | total_loss = 0
261 | pbar = tqdm(total=len(dataset), position=0, leave=False, desc=desc)
262 | for _, batch in enumerate(dataset):
263 | batch = {k: v.to(self.device) for k, v in batch.items()}
264 | outputs = self.model(**batch)
265 | loss = outputs['loss']
266 | labels = batch['labels']
267 | predictions = outputs['logits'].argmax(dim=-1)
268 |
269 | predictions_gathered = self.accelerator.gather(predictions)
270 | labels_gathered = self.accelerator.gather(labels)
271 | pred, ref = get_labels(predictions_gathered, labels_gathered, self.labels)
272 | ref = [item for sublist in ref for item in sublist]
273 | pred = [item for sublist in pred for item in sublist]
274 | preds.append(pred)
275 | refs.append(ref)
276 |
277 |             total_loss += loss.detach() / len(dataset)
278 | batch = None
279 | pbar.update(1)
280 |
281 | refs = [item for sublist in refs for item in sublist]
282 | preds = [item for sublist in preds for item in sublist]
283 |
284 | return {
285 | "loss":float(total_loss.cpu().detach().numpy()),
286 | "precision": precision_score(refs, preds, average = "micro"),
287 | "recall": recall_score(refs, preds, average = "micro"),
288 | "f1": f1_score(refs, preds, average = "micro"),
289 | "accuracy": accuracy_score(refs, preds),
290 | }
291 |
292 | class SimpleTokenClassificationModel(BaseTokenClassficationModel):
293 | def __init__(self, config):
294 | BaseTokenClassficationModel.__init__(self, config)
295 | self.model = BiRNNForTokenClassification(self.vocab_size, self.num_labels)
296 | self.model.to(self.device)
297 | # self.optimizer = AdamW(self.model.parameters(), lr = 5e-5)
298 |
299 | def wipe_memory(self):
300 | self.model = None
301 | self.optimizer = None
302 | torch.cuda.empty_cache()
303 |
304 | class BERTTokenClassificationModel(BaseTokenClassficationModel):
305 | def __init__(self, config):
306 | BaseTokenClassficationModel.__init__(self, config)
307 | config = AutoConfig.from_pretrained(self.model_name,num_labels=self.num_labels)
308 | self.model = AutoModelForTokenClassification.from_pretrained(self.model_name, config = config)
309 |
310 | def wipe_memory(self):
311 | self.model = None
312 | self.optimizer = None
313 | torch.cuda.empty_cache()
314 |
315 | class BaseQuestionAnsweringModel:
316 | def __init__(self, config):
317 | self.model = nn.Module()
318 | self.model_name = config['model_name']
319 | self.vocab_size = config['vocab_size']
320 | self.num_labels = config['num_labels']
321 | self.device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
322 | self.accelerator = Accelerator()
323 | if 'bert' in self.model_name:
324 | self.columns = ['input_ids', 'attention_mask', 'start_positions', 'end_positions']
325 | else:
326 | self.columns = ['input_ids', 'start_positions', 'end_positions']
327 |
328 |
329 | def train(self, datasets, examples, **kwargs):
330 | save_dir = kwargs['save_dir']
331 | epochs = kwargs['epochs']
332 | lr = kwargs['lr']
333 | batch_size = kwargs['batch_size']
334 | self.optimizer = AdamW(self.model.parameters(), lr = lr)
335 |
336 | train_dataset, valid_dataset, test_dataset = datasets
337 | train_examples, valid_examples, test_examples = examples
338 | train_loader = copy.deepcopy(train_dataset)
339 | train_loader.set_format(type='torch', columns=self.columns)
340 | train_loader = torch.utils.data.DataLoader(train_loader, batch_size=batch_size, shuffle = True)
341 | filepath = os.path.join(save_dir, 'pytorch_model.bin')
342 | best_accuracy = 0
343 | pbar = tqdm(total=epochs * len(train_dataset), leave=True)
344 |
345 | for epoch in range(epochs):
346 | accuracy = 0
347 |             total_loss = 0
348 | self.model.train().to(self.device)
349 | all_start_logits = []
350 | all_end_logits = []
351 | for _, batch in enumerate(train_loader):
352 | batch = {k: v.to(self.device) for k, v in batch.items()}
353 | val = batch['input_ids']
354 | val[val==-100] = 0
355 | outputs = self.model(**batch)
356 | loss = outputs['loss']
357 | start_logits = outputs['start_logits']
358 | end_logits = outputs['end_logits']
359 |
360 | all_start_logits.append(self.accelerator.gather(start_logits).detach().cpu().numpy())
361 | all_end_logits.append(self.accelerator.gather(end_logits).detach().cpu().numpy())
362 |
363 | loss.backward()
364 | self.optimizer.step()
365 | self.optimizer.zero_grad()
366 |
367 |                 total_loss += loss.item() / len(train_dataset)
368 | batch = None
369 | pbar.update(1)
370 |
371 | train_metrics = self.evaluate_dataset(train_dataset, train_examples, batch_size=batch_size)
372 |             print(f"Epoch {epoch} Train Loss {total_loss:.4f} Train F1 {train_metrics['f1']:.4f}")
373 |
374 | self.model.eval().to(self.device)
375 | valid_metrics = self.evaluate_dataset(valid_dataset, valid_examples, batch_size=batch_size)
376 | print(f"Epoch {epoch} Valid Loss {valid_metrics['loss']:.4f} Valid F1 {valid_metrics['f1']:.4f}")
377 |
378 | val_accuracy = valid_metrics['f1']
379 | if val_accuracy >= best_accuracy:
380 | best_accuracy = val_accuracy
381 | torch.save(self.model.state_dict(), filepath)
382 |
383 | self.model.load_state_dict(torch.load(filepath))
384 | self.model.eval()
385 | test_metrics = self.evaluate_dataset(test_dataset, test_examples, batch_size=batch_size)
386 | print(f"Epoch {epoch} Test Loss {test_metrics['loss']:.4f} Test F1 {test_metrics['f1']:.4f}")
387 | return {'f1':test_metrics['f1'], 'Exact Match':test_metrics['exact_match']}
388 |
389 | def evaluate_dataset(self, dataset, examples, batch_size = 8, desc = "Eval"):
390 | total_loss = 0
391 | all_start_logits = []
392 | all_end_logits = []
393 | data_loader = copy.deepcopy(dataset)
394 | data_loader.set_format(type='torch', columns=self.columns)
395 | data_loader = torch.utils.data.DataLoader(data_loader, batch_size=batch_size)
396 |         pbar = tqdm(total=len(data_loader), position=0, leave=False, desc=desc)
397 | for _, batch in enumerate(data_loader):
398 | batch = {k: v.to(self.device) for k, v in batch.items()}
399 |             val = batch['input_ids']
400 |             val[val==-100] = 0 # map -100 padding markers to a valid token id before the embedding lookup
401 | outputs = self.model(**batch)
402 | loss = outputs['loss']
403 | start_logits = outputs['start_logits']
404 | end_logits = outputs['end_logits']
405 |
406 | all_start_logits.append(self.accelerator.gather(start_logits).detach().cpu().numpy())
407 | all_end_logits.append(self.accelerator.gather(end_logits).detach().cpu().numpy())
408 |
409 |             total_loss += loss.item() / len(dataset)
410 | batch = None
411 | pbar.update(1)
412 | metric = evaluate_metric(dataset, examples, all_start_logits, all_end_logits)
413 | return {'loss':total_loss, 'f1':metric['f1']/100, 'exact_match':metric['exact_match']/100}
414 |
415 | class BERTQuestionAnsweringModel(BaseQuestionAnsweringModel):
416 | def __init__(self, config):
417 | BaseQuestionAnsweringModel.__init__(self, config)
418 | config = AutoConfig.from_pretrained(self.model_name)
419 | self.model = AutoModelForQuestionAnswering.from_pretrained(self.model_name, config = config)
420 | self.model.to(self.device)
421 |
422 | def wipe_memory(self):
423 | self.model = None
424 | self.optimizer = None
425 | torch.cuda.empty_cache()
426 |
427 | class BiRNNForQuestionAnswering(nn.Module):
428 | def __init__(self, vocab_size, num_labels = 2, hidden_dim = 128):
429 |
430 | super().__init__()
431 |
432 | self.embedding = nn.Embedding(vocab_size, hidden_dim)
433 | self.bigru1 = nn.GRU(hidden_dim, hidden_dim, bidirectional=True, batch_first = True)
434 | self.bigru2 = nn.GRU(2*hidden_dim, hidden_dim, bidirectional=True, batch_first = True)
435 | self.bigru3 = nn.GRU(2*hidden_dim, hidden_dim//2, bidirectional=True, batch_first = True)
436 | self.qa_outputs = nn.Linear(hidden_dim, num_labels)
437 | self.hidden_dim = hidden_dim
438 | self.num_labels = num_labels
439 |
440 | def forward(self,
441 | input_ids,
442 | start_positions = None,
443 | end_positions = None):
444 |
445 | embedded = self.embedding(input_ids)
446 | out,h = self.bigru1(embedded)
447 | out,h = self.bigru2(out)
448 | out,h = self.bigru3(out)
449 | logits = self.qa_outputs(out) # (bs, max_query_len, 2)
450 | start_logits, end_logits = logits.split(1, dim=-1)
451 | start_logits = start_logits.squeeze(-1).contiguous() # (bs, max_query_len)
452 | end_logits = end_logits.squeeze(-1).contiguous() # (bs, max_query_len)
453 | if start_positions is not None:
454 | loss = self.compute_loss(start_logits, end_logits, start_positions, end_positions)
455 | return {'loss':loss,
456 | 'logits':logits,
457 | 'start_logits':start_logits,
458 | 'end_logits':end_logits}
459 | else:
460 | return {'logits':logits,
461 | 'start_logits':start_logits,
462 | 'end_logits':end_logits}
463 |
464 | def compute_loss(self, start_logits, end_logits, start_positions, end_positions):
465 | loss_fct = nn.CrossEntropyLoss(ignore_index=0)
466 | start_loss = loss_fct(start_logits, start_positions)
467 | end_loss = loss_fct(end_logits, end_positions)
468 | total_loss = (start_loss + end_loss) / 2
469 | return total_loss
470 |
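# Illustrative sketch (not part of the original source): a quick shape check
# for BiRNNForQuestionAnswering; the vocabulary size and sequence length are
# arbitrary assumptions.
def _sketch_birnn_qa_shapes():
    model = BiRNNForQuestionAnswering(vocab_size=500)
    input_ids = torch.randint(0, 500, (4, 384))    # (batch, seq_len)
    starts = torch.randint(1, 384, (4,))           # 0 is the ignore_index above
    ends = torch.randint(1, 384, (4,))
    out = model(input_ids, start_positions=starts, end_positions=ends)
    assert out['start_logits'].shape == (4, 384)   # one logit per token position
    return out['loss']                             # scalar: mean of start/end cross-entropy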
471 | class SimpleQuestionAnsweringModel(BaseQuestionAnsweringModel):
472 | def __init__(self, config):
473 | BaseQuestionAnsweringModel.__init__(self, config)
474 | self.model = BiRNNForQuestionAnswering(self.vocab_size, self.num_labels)
475 | self.model.to(self.device)
476 | # self.optimizer = AdamW(self.model.parameters(), lr = 5e-5)
477 |
478 | def wipe_memory(self):
479 | self.model = None
480 | self.optimizer = None
481 | torch.cuda.empty_cache()
482 |
483 |
484 | class BaseSeq2SeqModel:
485 | def __init__(self, config, tokenizer = None, task = ""):
486 | self.model = nn.Module()
487 | self.model_name = config['model_name']
488 | self.vocab_size = config['vocab_size']
489 | self.num_labels = config['num_labels']
490 | self.tokenizer = tokenizer
491 | self.device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
492 | self.task = task
493 |
494 | def train(self, datasets, examples, **kwargs):
495 | save_dir = kwargs['save_dir']
496 | epochs = kwargs['epochs']
497 | lr = kwargs['lr']
498 | batch_size = kwargs['batch_size']
499 |
500 | self.optimizer = AdamW(self.model.parameters(), lr = lr)
501 | self.mt_metric = load("sacrebleu")
502 | self.sum_metric = load("rouge")
503 | train_dataset, valid_dataset, test_dataset = datasets
504 |
505 | filepath = os.path.join(save_dir, 'pytorch_model.bin')
506 | best_accuracy = 0
507 | metric_name = "bleu" if self.task == "mt" else "rougeLsum"
508 | pbar = tqdm(total=epochs * len(train_dataset), leave=True)
509 |
510 | for epoch in range(epochs):
511 | loss = 0
512 | self.model.train().to(self.device)
513 | for _, batch in enumerate(train_dataset):
514 | batch = {k: v.to(self.device) for k, v in batch.items()}
515 | outputs = self.model(**batch)
516 | loss = outputs['loss']
517 | loss.backward()
518 | self.optimizer.step()
519 | self.optimizer.zero_grad()
520 | batch = None
521 | pbar.update(1)
522 | self.model.eval().to(self.device)
523 | train_loss, train_metrics = self.evaluate_dataset(train_dataset)
524 | print(f"Epoch {epoch} Train Loss {train_loss:.4f} Train {metric_name} {train_metrics[metric_name]:.4f}")
525 |
526 | valid_loss, valid_metrics = self.evaluate_dataset(valid_dataset)
527 | print(f"Epoch {epoch} Valid Loss {valid_loss:.4f} Valid {metric_name} {valid_metrics[metric_name]:.4f}")
528 |
529 | val_accuracy = valid_metrics[metric_name]
530 | if val_accuracy >= best_accuracy:
531 | best_accuracy = val_accuracy
532 | torch.save(self.model.state_dict(), filepath)
533 |
534 | self.model.load_state_dict(torch.load(filepath))
535 | self.model.eval()
536 | test_loss, test_metrics = self.evaluate_dataset(test_dataset)
537 | print(f"Epoch {epoch} Test Loss {test_loss:.4f} Test {metric_name} {test_metrics[metric_name]:.4f}")
538 | return test_metrics
539 |
540 | def evaluate_dataset(self, dataset, desc="Eval"):
541 | total_loss = 0
542 | bleu_score = 0
543 | pbar = tqdm(total=len(dataset), position=0, leave=False, desc=desc)
544 | for _, batch in enumerate(dataset):
545 | batch = {k: v.to(self.device) for k, v in batch.items()}
546 | if 't5' in self.model_name.lower():
547 | with torch.no_grad():
548 | outputs = self.model(**batch)
549 | generated_tokens = self.model.generate(batch['input_ids'])
550 | else:
551 | with torch.no_grad():
552 | outputs = self.model(**batch, mode ="generate")
553 | generated_tokens = outputs['outputs']
554 |
555 | labels = batch['labels']
556 | loss = outputs['loss']
557 | total_loss += loss.cpu().numpy() / len(dataset)
558 |
559 | if self.task == "mt":
560 | metric = self.compute_metrics(generated_tokens.cpu(), labels.cpu())
561 | bleu_score += metric['bleu'] / len(dataset)
562 |
563 | elif self.task == "sum":
564 | labels = labels.cpu().numpy()
565 | labels = np.where(labels != -100, labels, self.tokenizer.pad_token_id)
566 |
567 | decoded_preds = self.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
568 | decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)
569 |
570 | decoded_preds, decoded_labels = self.postprocess_text_sum(decoded_preds, decoded_labels)
571 | self.sum_metric.add_batch(
572 | predictions=decoded_preds,
573 | references=decoded_labels,)
574 |
575 | pbar.update(1)
576 | if self.task == "sum":
577 | result = self.sum_metric.compute(use_stemmer=True)
578 | result = {k: round(v * 100, 4) for k, v in result.items()}
579 |             return total_loss, result
580 |
581 | elif self.task == "mt":
582 |             return total_loss, {'bleu':bleu_score}
583 |
584 | def compute_metrics(self, preds, labels):
585 | if isinstance(preds, tuple):
586 | preds = preds[0]
587 |
588 | if 't5' in self.model_name.lower():
589 | decoded_preds = self.tokenizer.batch_decode(preds, skip_special_tokens=True)
590 |
591 | # Replace -100 in the labels as we can't decode them.
592 | labels = np.where(labels != -100, labels, self.tokenizer.pad_token_id)
593 | decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)
594 | result = self.mt_metric.compute(predictions=decoded_preds, references=decoded_labels)
595 | result = {"bleu": result["score"]}
596 | result = {k: round(v, 4) for k, v in result.items()}
597 | return result
598 | else:
599 |
600 | preds = self.get_lists(preds)
601 | labels = self.get_lists(labels)
602 |
603 | decoded_preds = self.tokenizer.decode_sentences(preds)
604 | decoded_preds = [stmt.replace(" .", ".") for stmt in decoded_preds]
605 |
606 | decoded_labels = self.tokenizer.decode_sentences(labels)
607 | decoded_labels = [[stmt.replace(" .", ".")] for stmt in decoded_labels]
608 |
609 |             result = self.mt_metric.compute(predictions=decoded_preds, references=decoded_labels)
610 | result = {"bleu": result["score"]}
611 | result = {k: round(v, 4) for k, v in result.items()}
612 | return result
613 |
614 | def postprocess_text_sum(self, preds, labels):
615 | preds = [pred.strip() for pred in preds]
616 | labels = [label.strip() for label in labels]
617 |
618 | # rougeLSum expects newline after each sentence
619 | preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
620 | labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]
621 |
622 | return preds, labels
623 |
624 | def get_lists(self, inputs):
625 | inputs = inputs.cpu().detach().numpy().astype(int).tolist()
626 | output = []
627 | for input in inputs:
628 | current_item =[]
629 | for item in input:
630 | if item == self.tokenizer.eos_idx:
631 | break
632 | else:
633 | current_item.append(item)
634 | output.append(current_item)
635 | return output
636 |
637 |
638 |
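# Note (illustrative, not part of the original source): sacrebleu scores each
# prediction against a *list* of references, which is why the non-T5 branch of
# compute_metrics above wraps every decoded label in its own list.
def _sketch_sacrebleu_references():
    bleu = load("sacrebleu")  # `load` is the same evaluate loader used in train() above
    # identical hypothesis/reference -> BLEU score of 100.0
    return bleu.compute(predictions=["the cat sat on the mat"],
                        references=[["the cat sat on the mat"]])["score"]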
639 | class T5Seq2SeqModel(BaseSeq2SeqModel):
640 | def __init__(self, config, tokenizer = None, task = ""):
641 | BaseSeq2SeqModel.__init__(self, config, tokenizer = tokenizer, task = task)
642 | config = AutoConfig.from_pretrained(self.model_name)
643 | self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_name, config = config)
644 |
645 | def wipe_memory(self):
646 | self.model = None
647 | self.optimizer = None
648 | torch.cuda.empty_cache()
649 |
650 | #https://colab.research.google.com/github/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb#scrollTo=dCK3LIN25n_S
651 | class Encoder(nn.Module):
652 | def __init__(self, input_dim, emb_dim, hid_dim, n_layers, bidirectional = True):
653 | super().__init__()
654 |
655 | self.hid_dim = hid_dim
656 | self.n_layers = n_layers
657 |
658 | self.embedding = nn.Embedding(input_dim, emb_dim)
659 |
660 | self.rnn = nn.GRU(emb_dim, hid_dim, n_layers, bidirectional = bidirectional)
661 |
662 | def forward(self, src):
663 |
664 | #src = [src len, batch size]
665 |
666 | embedded = self.embedding(src)
667 |
668 | #embedded = [src len, batch size, emb dim]
669 |
670 | outputs, hidden = self.rnn(embedded)
671 | #outputs = [src len, batch size, hid dim * n directions]
672 | #hidden = [n layers * n directions, batch size, hid dim]
673 |
674 | #outputs are always from the top hidden layer
675 |
676 | return hidden
677 | class Decoder(nn.Module):
678 | def __init__(self, output_dim, emb_dim, hid_dim, n_layers, bidirectional = True):
679 | super().__init__()
680 |
681 | self.output_dim = output_dim
682 | self.hid_dim = hid_dim
683 | self.n_layers = n_layers
684 |
685 | self.embedding = nn.Embedding(output_dim, emb_dim)
686 |
687 | self.rnn = nn.GRU(emb_dim, hid_dim, n_layers, bidirectional = bidirectional)
688 |
689 | self.fc_out = nn.Linear(hid_dim, output_dim)
690 |
691 |
692 | def forward(self, input, hidden):
693 |
694 | #input = [batch size]
695 | #hidden = [n layers * n directions, batch size, hid dim]
696 |
697 | input = input.unsqueeze(0)
698 |
699 | #input = [1, batch size]
700 |
701 | embedded = self.embedding(input)
702 |
703 | #embedded = [1, batch size, emb dim]
704 |
705 | output, hidden = self.rnn(embedded, hidden)
706 |         #seq len is always 1 in the decoder; the GRU here is bidirectional, so:
707 |         #output = [1, batch size, hid dim*2]
708 |         #hidden = [n layers * 2, batch size, hid dim]
709 | output = (output[:, :, :self.hid_dim] + output[:, :, self.hid_dim:])
710 | prediction = self.fc_out(output.squeeze(0))
711 |
712 | #prediction = [batch size, output dim]
713 |
714 | return prediction, hidden
715 |
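# Shape walkthrough (illustrative, not part of the original source): the
# bidirectional encoder hands its final hidden state, shaped
# [n_layers * 2, batch, hid_dim], straight to the decoder, whose forward sums
# the two directions of its output before the vocabulary projection.
def _sketch_encoder_decoder_shapes():
    enc = Encoder(input_dim=500, emb_dim=128, hid_dim=1024, n_layers=2)
    dec = Decoder(output_dim=500, emb_dim=128, hid_dim=1024, n_layers=2)
    hidden = enc(torch.randint(0, 500, (30, 8)))   # [4, 8, 1024] = [layers*dirs, batch, hid]
    pred, hidden = dec(torch.zeros(8, dtype=torch.long), hidden)
    assert pred.shape == (8, 500)                  # (batch, output_dim)
    return pred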
716 | class Seq2SeqMachineTranslation(nn.Module):
717 | def __init__(self, vocab_size = 500, tokenizer = None):
718 | super().__init__()
719 | ENC_EMB_DIM = 128
720 | DEC_EMB_DIM = 128
721 | HID_DIM = 1024
722 | N_LAYERS = 2
723 | INPUT_DIM = vocab_size
724 | OUTPUT_DIM = vocab_size
725 | self.vocab_size = vocab_size
726 | self.encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS)
727 | self.decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS)
728 | self.device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
729 | self.tokenizer = tokenizer
730 | assert self.encoder.hid_dim == self.decoder.hid_dim, \
731 | "Hidden dimensions of encoder and decoder must be equal!"
732 | assert self.encoder.n_layers == self.decoder.n_layers, \
733 | "Encoder and decoder must have equal number of layers!"
734 |
735 | def forward(self, input_ids, labels = None, teacher_forcing_ratio = 0.5, mode = "train"):
736 | src = torch.transpose(input_ids, 0, 1)
737 |
738 | if labels is not None:
739 | trg = torch.transpose(labels, 0, 1)
740 |
741 | #src = [src len, batch size]
742 | #trg = [trg len, batch size]
743 | #teacher_forcing_ratio is probability to use teacher forcing
744 | #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
745 |
746 | batch_size = src.shape[1]
747 |         trg_len = src.shape[0] # decode for as many steps as the (padded) source length
748 |
749 | trg_vocab_size = self.decoder.output_dim
750 |
751 | #tensor to store decoder outputs
752 | outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
753 | #last hidden state of the encoder is used as the initial hidden state of the decoder
754 | hidden = self.encoder(src)
755 |
756 |         #first input to the decoder is the <sos> token for every sequence in the batch
757 | input = torch.tensor([self.tokenizer.sos_idx]*batch_size).to(self.device)
758 |
759 | for t in range(1, trg_len):
760 |
761 | #insert input token embedding, previous hidden and previous cell states
762 | #receive output tensor (predictions) and new hidden and cell states
763 | output, hidden = self.decoder(input, hidden)
764 |
765 | #decide if we are going to use teacher forcing or not
766 | teacher_force = random.random() < teacher_forcing_ratio
767 |
768 | #get the highest predicted token from our predictions
769 | top1 = output.argmax(1)
770 |
771 | #if teacher forcing, use actual next token as next input
772 | #if not, use predicted token
773 |
774 | if mode == "train" and teacher_force:
775 | input = trg[t]
776 | else:
777 | input = top1
778 |
779 | outputs[t] = output
780 |
781 | if labels is not None:
782 | loss = self.compute_loss(outputs, trg)
783 | return {'loss':loss,
784 | 'outputs':torch.transpose(outputs.argmax(-1), 0, 1)
785 | }
786 | else:
787 | return {'outputs': torch.transpose(outputs.argmax(-1), 0, 1)}
788 |
789 | def compute_loss(self, output, trg):
790 | loss_fct = nn.CrossEntropyLoss(ignore_index = self.tokenizer.pad_idx)
791 | output_dim = output.shape[-1]
792 | output = output[1:].view(-1, output_dim)
793 | trg = trg[1:].reshape(-1)
794 | #trg = [(trg len - 1) * batch size]
795 | #output = [(trg len - 1) * batch size, output dim]
796 |
797 | loss = loss_fct(output, trg)
798 | return loss
799 |
800 |
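# Greedy-decoding sketch (illustrative; `tok` is assumed to expose the
# sos_idx/pad_idx attributes used above). With mode="generate", teacher
# forcing is skipped and each step feeds back the argmax of the previous
# prediction.
def _sketch_greedy_generate(tok):
    model = Seq2SeqMachineTranslation(vocab_size=500, tokenizer=tok)
    model = model.to(model.device)
    input_ids = torch.randint(0, 500, (8, 30)).to(model.device)  # (batch, src_len)
    out = model(input_ids, mode="generate")
    assert out['outputs'].shape == (8, 30)         # one decoded id per source step
    return out['outputs']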
801 | class SimpleMachineTranslationModel(BaseSeq2SeqModel):
802 | def __init__(self, config, tokenizer = None):
803 | BaseSeq2SeqModel.__init__(self, config, tokenizer = tokenizer)
804 | self.model = Seq2SeqMachineTranslation(vocab_size = self.vocab_size, tokenizer = tokenizer)
805 | self.model.to(self.device)
806 | # self.optimizer = AdamW(self.model.parameters(), lr = 5e-5)
807 |
808 | def wipe_memory(self):
809 | self.model = None
810 | self.optimizer = None
811 | torch.cuda.empty_cache()
812 |
--------------------------------------------------------------------------------
/nmatheg/ner_utils.py:
--------------------------------------------------------------------------------
1 | # https://github.com/huggingface/transformers/blob/master/examples/pytorch/token-classification/run_ner_no_trainer.py
2 | def get_labels(predictions, references, labels):
3 | labels = labels.split(',')
4 |     # Transform predictions and references tensors to numpy arrays
5 | y_pred = predictions.detach().cpu().clone().numpy()
6 | y_true = references.detach().cpu().clone().numpy()
7 |
8 | # Remove ignored index (special tokens)
9 | true_predictions = [
10 | [labels[p] for (p, l) in zip(pred, gold_label) if l != -100]
11 | for pred, gold_label in zip(y_pred, y_true)
12 | ]
13 | true_labels = [
14 | [labels[l] for (p, l) in zip(pred, gold_label) if l != -100]
15 | for pred, gold_label in zip(y_pred, y_true)
16 | ]
17 | return true_predictions, true_labels
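# Worked example (illustrative): positions labelled -100 (special tokens) are
# dropped, and the remaining ids are mapped through the comma-separated label
# string.
def _sketch_get_labels():
    import torch
    preds = torch.tensor([[1, 2, 0]])
    refs = torch.tensor([[1, 2, -100]])
    # -> ([['B-PER', 'I-PER']], [['B-PER', 'I-PER']]); the -100 position is skipped
    return get_labels(preds, refs, "O,B-PER,I-PER")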
--------------------------------------------------------------------------------
/nmatheg/nmatheg.py:
--------------------------------------------------------------------------------
1 |
2 | import os
3 | from .dataset import create_dataset
4 | from .models import (SimpleClassificationModel, BERTTextClassificationModel,
5 |     BERTTokenClassificationModel, BERTQuestionAnsweringModel,
6 |     SimpleTokenClassificationModel, SimpleQuestionAnsweringModel,
7 |     SimpleMachineTranslationModel, T5Seq2SeqModel)
8 | from .configs import create_default_config
9 | import configparser
10 | import json
11 | from .utils import save_json, get_tokenizer
12 | from transformers import AutoModelForSequenceClassification, AutoConfig, AutoTokenizer,AutoModelForTokenClassification,AutoModelForQuestionAnswering,AutoModelForSeq2SeqLM
13 | from transformers import pipeline
14 | import pathlib
15 |
16 | import torch
17 | try:
18 | import bpe_surgery
19 | except ImportError: # bpe_surgery is an optional dependency
20 | pass
21 | import numpy as np
22 |
23 |
24 | class TrainStrategy:
25 | def __init__(self, datasets, models, tokenizers= None, vocab_sizes=None, config_path= None,
26 | batch_size = 64, epochs = 5, lr = 5e-5, runs = 10, max_tokens = 128, max_train_samples = -1,
27 | preprocessing = {}, mode = 'finetune', ckpt= 'ckpts'):
28 |
29 | self.mode = mode
30 | modes = ['finetune', 'pretrain']
31 | assert mode in modes , f"mode must be one of the following {modes}"
32 | if self.mode == 'pretrain':
33 | assert tokenizers is not None , "tokenizers must be set"
34 | assert vocab_sizes is not None, "vocab sizes must be set"
35 |
36 |         if config_path is None:
37 | self.config = create_default_config(batch_size=batch_size, epochs = epochs, lr = lr, runs = runs,
38 | max_tokens=max_tokens, max_train_samples = max_train_samples,
39 | preprocessing = preprocessing, ckpt = ckpt)
40 | self.config['dataset'] = {'dataset_name' : datasets}
41 | self.config['model'] = {'model_name' : models}
42 | if self.mode == 'pretrain':
43 | self.config['tokenization']['vocab_size'] = vocab_sizes
44 | self.config['tokenization']['tokenizer_name'] = tokenizers
45 | else:
46 | self.config = configparser.ConfigParser()
47 | self.config.read(config_path)
48 |
49 | self.datasets_config = configparser.ConfigParser()
50 | rel_path = os.path.dirname(__file__)
51 | data_ini_path = os.path.join(rel_path, "datasets.ini")
52 | self.datasets_config.read(data_ini_path)
53 | self.preprocessing = preprocessing
54 |
55 | def start(self):
56 | model_names = [m.strip() for m in self.config['model']['model_name'].split(',')]
57 | dataset_names = [d.strip() for d in self.config['dataset']['dataset_name'].split(',')]
58 | if self.mode == 'pretrain':
59 | tokenizers = [t.strip() for t in self.config['tokenization']['tokenizer_name'].split(',')]
60 | vocab_sizes = [v.strip() for v in self.config['tokenization']['vocab_size'].split(',')]
61 | else:
62 | tokenizers = [m.strip() for m in self.config['model']['model_name'].split(',')]
63 | vocab_sizes = [str(AutoTokenizer.from_pretrained(v.strip()).vocab_size) for v in self.config['model']['model_name'].split(',')]
64 | runs = int(self.config['train']['runs'])
65 | max_tokens = int(self.config['tokenization']['max_tokens'])
66 |
67 | results = {}
68 |
69 | results_path = f"{self.config['train']['save_dir']}/results.json"
70 | if os.path.isfile(results_path):
71 | f = open(results_path)
72 | results = json.load(f)
73 |
74 | if self.mode == "finetune":
75 | for m, model_name in enumerate(model_names):
76 | if not model_name in results:
77 | results[model_name] = {}
78 | for d, dataset_name in enumerate(dataset_names):
79 | if not dataset_name in results[model_name]:
80 | results[model_name][dataset_name] = {}
81 | for run in range(runs):
82 | if os.path.isfile(results_path):
83 | if len(results[model_name][dataset_name].keys()) > 0:
84 | metric_name = list(results[model_name][dataset_name].keys())[0]
85 | curr_run = len(results[model_name][dataset_name][metric_name])
86 | if run < curr_run:
87 |                                     print(f"Run {run} already finished")
88 | continue
89 |
90 | new_model_name = model_name.split("/")[-1]
91 | data_dir = f"{self.config['train']['save_dir']}/{new_model_name}/{dataset_name}/data"
92 | tokenizer_dir = f"{self.config['train']['save_dir']}/{new_model_name}/{dataset_name}/tokenizer"
93 | train_dir = f"{self.config['train']['save_dir']}/{new_model_name}/{dataset_name}/run_{run}"
94 | for path in [data_dir, tokenizer_dir, train_dir]:
95 | pathlib.Path(path).mkdir(parents=True, exist_ok=True)
96 |
97 | self.data_config = self.datasets_config[dataset_name]
98 | print(dict(self.data_config))
99 | task_name = self.data_config['task']
100 | vocab_size = vocab_sizes[m]
101 | tokenizer_name = tokenizers[m]
102 | tokenizer, self.datasets, self.examples = create_dataset(self.config, self.data_config,
103 | vocab_size = int(vocab_size),
104 | model_name = model_name,
105 | tokenizer_name = tokenizer_name,
106 |                                                                    clean = bool(self.preprocessing),
107 | tok_save_path = tokenizer_dir,
108 | data_save_path = data_dir)
109 | self.model_config = {'model_name':model_name,
110 | 'vocab_size':int(vocab_size),
111 | 'num_labels':int(self.data_config['num_labels']),
112 | 'labels':self.data_config['labels']}
113 |
114 | print(self.model_config)
115 | if task_name in ['cls', 'nli']:
116 | self.model = BERTTextClassificationModel(self.model_config)
117 | elif task_name in ['ner', 'pos']:
118 | self.model = BERTTokenClassificationModel(self.model_config)
119 |
120 | elif task_name == 'qa':
121 | self.model = BERTQuestionAnsweringModel(self.model_config)
122 | elif task_name in ['mt', 'sum']:
123 | self.model = T5Seq2SeqModel(self.model_config, tokenizer = tokenizer, task = task_name)
124 |
125 |
126 | self.train_config = {'epochs':int(self.config['train']['epochs']),
127 | 'save_dir':train_dir,
128 | 'batch_size':int(self.config['train']['batch_size']),
129 | 'lr':float(self.config['train']['lr']),
130 | 'runs':run}
131 | self.tokenizer_config = {'name': tokenizer_name, 'vocab_size': vocab_size, 'max_tokens': max_tokens,
132 | 'save_path': tokenizer_dir}
133 | print(self.tokenizer_config)
134 | print(self.train_config)
135 | os.makedirs(self.train_config['save_dir'], exist_ok = True)
136 |
137 |                     metrics = self.model.train(self.datasets, self.examples, **self.train_config)
141 |
142 | save_json(self.train_config, f"{train_dir}/train_config.json")
143 | save_json(self.data_config, f"{data_dir}/data_config.json")
144 | save_json(self.model_config, f"{train_dir}/model_config.json")
145 | save_json(self.tokenizer_config, f"{tokenizer_dir}/tokenizer_config.json")
146 |
147 | for metric_name in metrics:
148 | if metric_name not in results[model_name][dataset_name]:
149 | results[model_name][dataset_name][metric_name] = []
150 | results[model_name][dataset_name][metric_name].append(metrics[metric_name])
151 | self.model.wipe_memory()
152 | with open(f"{self.config['train']['save_dir']}/results.json", 'w') as handle:
153 | json.dump(results, handle)
154 |
155 | elif self.mode == "pretrain":
156 | for t, tokenizer_name in enumerate(tokenizers):
157 | if not tokenizer_name in results:
158 | results[tokenizer_name] = {}
159 | for v, vocab_size in enumerate(vocab_sizes):
160 | if self.mode == 'finetune' and v != t:
161 | continue
162 | if not vocab_size in results[tokenizer_name]:
163 | results[tokenizer_name][vocab_size] = {}
164 | for d, dataset_name in enumerate(dataset_names):
165 | if not dataset_name in results[tokenizer_name][vocab_size]:
166 | results[tokenizer_name][vocab_size][dataset_name] = {}
167 | for m, model_name in enumerate(model_names):
168 | if self.mode == 'finetune' and t != m:
169 | continue
170 | if not model_name in results[tokenizer_name][vocab_size][dataset_name]:
171 | results[tokenizer_name][vocab_size][dataset_name][model_name] = {}
172 | for run in range(runs):
173 | if os.path.isfile(results_path):
174 | if len(results[tokenizer_name][vocab_size][dataset_name][model_name].keys()) > 0:
175 | metric_name = list(results[tokenizer_name][vocab_size][dataset_name][model_name].keys())[0]
176 | curr_run = len(results[tokenizer_name][vocab_size][dataset_name][model_name][metric_name])
177 | if run < curr_run:
178 |                                             print(f"Run {run} already finished")
179 | continue
180 |
181 | data_dir = f"{self.config['train']['save_dir']}/{tokenizer_name}/{vocab_size}/{dataset_name}/{model_name}/data"
182 | tokenizer_dir = f"{self.config['train']['save_dir']}/{tokenizer_name}/{vocab_size}/{dataset_name}/{model_name}/tokenizer"
183 | train_dir = f"{self.config['train']['save_dir']}/{tokenizer_name}/{vocab_size}/{dataset_name}/{model_name}/run_{run}"
184 | for path in [data_dir, tokenizer_dir, train_dir]:
185 | pathlib.Path(path).mkdir(parents=True, exist_ok=True)
186 |
187 |
188 | self.data_config = self.datasets_config[dataset_name]
189 | print(dict(self.data_config))
190 | task_name = self.data_config['task']
191 | tokenizer, self.datasets, self.examples = create_dataset(self.config, self.data_config,
192 | vocab_size = int(vocab_size),
193 | model_name = model_name,
194 | tokenizer_name = tokenizer_name,
195 | data_save_path = data_dir,
196 | tok_save_path = tokenizer_dir,
197 |                                                                        clean = bool(self.preprocessing))
198 | self.model_config = {'model_name':model_name,
199 | 'vocab_size':int(vocab_size),
200 | 'num_labels':int(self.data_config['num_labels']),
201 | 'labels':self.data_config['labels']}
202 |
203 | print(self.model_config)
204 | if task_name in ['cls', 'nli']:
205 | self.model = SimpleClassificationModel(self.model_config)
206 | elif task_name == 'ner':
207 | self.model = SimpleTokenClassificationModel(self.model_config)
208 |
209 | elif task_name == 'qa':
210 | self.model = SimpleQuestionAnsweringModel(self.model_config)
211 | elif task_name == 'mt':
212 | self.model = SimpleMachineTranslationModel(self.model_config, tokenizer = tokenizer)
213 |
214 | self.train_config = {'epochs':int(self.config['train']['epochs']),
215 | 'save_dir':train_dir,
216 | 'batch_size':int(self.config['train']['batch_size']),
217 | 'lr':float(self.config['train']['lr']),
218 | 'runs':run}
219 | self.tokenizer_config = {'name': tokenizer_name, 'vocab_size': vocab_size, 'max_tokens': max_tokens,
220 | 'save_path': tokenizer_dir}
221 | print(self.tokenizer_config)
222 | print(self.train_config)
223 | os.makedirs(self.train_config['save_dir'], exist_ok = True)
224 |
225 |                                 metrics = self.model.train(self.datasets, self.examples, **self.train_config)
229 |
230 | save_json(self.train_config, f"{train_dir}/train_config.json")
231 | save_json(self.data_config, f"{data_dir}/data_config.json")
232 | save_json(self.model_config, f"{train_dir}/model_config.json")
233 | save_json(self.tokenizer_config, f"{tokenizer_dir}/tokenizer_config.json")
234 | for metric_name in metrics:
235 | if metric_name not in results[tokenizer_name][vocab_size][dataset_name][model_name]:
236 | results[tokenizer_name][vocab_size][dataset_name][model_name][metric_name] = []
237 | results[tokenizer_name][vocab_size][dataset_name][model_name][metric_name].append(metrics[metric_name])
238 | self.model.wipe_memory()
239 | with open(f"{self.config['train']['save_dir']}/results.json", 'w') as handle:
240 | json.dump(results, handle)
241 | return results
242 |
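# Usage sketch (illustrative; the dataset and model names below are assumptions,
# not values shipped with the library):
#
#   strategy = TrainStrategy(datasets="ajgt",
#                            models="aubmindlab/bert-base-arabertv2",
#                            epochs=5, batch_size=64, lr=5e-5, runs=1,
#                            mode="finetune")
#   results = strategy.start()  # nested metric lists, also dumped to <save_dir>/results.json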
243 | def predict_from_run(save_dir, run = 0, sentence = "", question = "", context = "", hypothesis = "", premise = ""):
244 | data_config = json.load(open(f"{save_dir}/data/data_config.json"))
245 | tokenizer_config = json.load(open(f"{save_dir}/tokenizer/tokenizer_config.json"))
246 | train_dir = f"{save_dir}/run_{run}"
247 | model_config = json.load(open(f"{train_dir}/model_config.json"))
248 | model_name = model_config["model_name"]
249 | task_name = data_config['task']
250 | tokenizer_name = tokenizer_config["name"]
251 | tokenizer_save_path = tokenizer_config["save_path"]
252 | max_tokens = tokenizer_config["max_tokens"]
253 | vocab_size = tokenizer_config["vocab_size"]
254 | num_labels = model_config["num_labels"]
255 |
256 | if model_name == "birnn":
257 | if task_name == "mt":
258 | src_tokenizer = get_tokenizer(tokenizer_name, vocab_size = vocab_size)
259 | trg_tokenizer = get_tokenizer(tokenizer_name, vocab_size = vocab_size)
260 |
261 | src_tokenizer.load(tokenizer_save_path, name = "src_tok")
262 | trg_tokenizer.load(tokenizer_save_path, name = "trg_tok")
263 |
264 | model = SimpleMachineTranslationModel(model_config, tokenizer = trg_tokenizer)
265 | model.model.load_state_dict(torch.load(f"{train_dir}/pytorch_model.bin"))
266 |
267 | encoding = src_tokenizer.encode_sentences([sentence], add_boundry=True, out_length=max_tokens)
268 | out = model.model(torch.tensor(encoding).to('cuda'), mode = "generate")
269 | return trg_tokenizer.decode_sentences(out['outputs'])
270 |
271 | elif task_name == "cls":
272 | tokenizer = get_tokenizer(tokenizer_name, vocab_size = vocab_size)
273 | tokenizer.load_model(f"{tokenizer_save_path}/m.model")
274 |
275 | model = SimpleClassificationModel(model_config)
276 | model.model.load_state_dict(torch.load(f"{train_dir}/pytorch_model.bin"))
277 |
278 | encoding = tokenizer.encode_sentences([sentence], out_length=max_tokens)
279 | out = model.model(torch.tensor(encoding).to('cuda'))
280 | labels = data_config['labels'].split(",")
281 | return labels[out['logits'].argmax(-1)]
282 |
283 | elif task_name == "nli":
284 | tokenizer = get_tokenizer(tokenizer_name, vocab_size = vocab_size)
285 | tokenizer.load(tokenizer_save_path)
286 |
287 | model = SimpleClassificationModel(model_config)
288 | model.model.load_state_dict(torch.load(f"{train_dir}/pytorch_model.bin"))
289 |
290 |         encoding = tokenizer.encode_sentences([premise + " " + hypothesis], add_boundry=True, out_length=max_tokens)
291 | out = model.model(torch.tensor(encoding).to('cuda'))
292 | labels = data_config['labels'].split(",")
293 | return labels[out['logits'].argmax(-1)]
294 |
295 | elif task_name == "ner":
296 | tokenizer = get_tokenizer(tokenizer_name, vocab_size = vocab_size)
297 | tokenizer.load(tokenizer_save_path)
298 |
299 | model = SimpleTokenClassificationModel(model_config)
300 | model.model.load_state_dict(torch.load(f"{train_dir}/pytorch_model.bin"))
301 | output = []
302 | labels = data_config['labels'].split(",")
303 | out_sentence = ""
304 | sentence_encoding = []
305 | word_lens = []
306 | words = sentence.split(' ')
307 |         for word_id, word in enumerate(words):
308 |             enc_words = tokenizer._encode_word(word)
309 |             sentence_encoding += enc_words
310 |             word_lens.append(len(enc_words))
311 |
312 | while len(sentence_encoding) < max_tokens:
313 | sentence_encoding.append(0)
314 | out = model.model(torch.tensor(sentence_encoding).to('cuda'))['logits'].argmax(-1).cpu().numpy()
315 | i = 0
316 | j = 0
317 | while i < sum(word_lens):
318 | preds = out[i:i+word_lens[j]]
319 | counts = np.bincount(preds)
320 | mj_label = np.argmax(counts)
321 | out_sentence += " "+labels[mj_label]
322 | i += word_lens[j]
323 | j += 1
324 | output.append(out_sentence.strip())
325 | return output
326 |
327 | elif task_name == "qa":
328 | tokenizer = get_tokenizer(tokenizer_name, vocab_size = vocab_size)
329 | tokenizer.load(tokenizer_save_path)
330 |
331 | model = SimpleQuestionAnsweringModel(model_config)
332 | model.model.load_state_dict(torch.load(f"{train_dir}/pytorch_model.bin"))
333 | question_encoding = tokenizer.encode_sentences([question])[0]
334 | context_encoding = tokenizer.encode_sentences([context])[0]
335 | pad_re = max_tokens - (len(question_encoding) + len(context_encoding) + 1)
336 |         encoding = question_encoding + [0] + context_encoding + [0] * pad_re
337 | out = model.model(torch.tensor([encoding]).to('cuda'))
338 | start_preds = out['start_logits'].argmax(-1).cpu().numpy()
339 | end_preds = out['end_logits'].argmax(-1).cpu().numpy()
340 | return tokenizer.decode_sentences([encoding[start_preds[0]:end_preds[0]]])
341 | else:
342 |
343 |
344 | if task_name == "cls":
345 | config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
346 | model = AutoModelForSequenceClassification.from_pretrained(train_dir, config = config)
347 | tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False, model_max_length = 512)
348 | encoded_review = tokenizer.encode_plus(
349 | sentence,
350 | max_length=512,
351 | add_special_tokens=True,
352 | return_token_type_ids=False,
353 |             padding='max_length',
354 | return_attention_mask=True,
355 | return_tensors='pt',
356 | )
357 |
358 | input_ids = encoded_review['input_ids']
359 | attention_mask = encoded_review['attention_mask']
360 | output = model(input_ids, attention_mask)
361 | labels = data_config['labels'].split(",")
362 | return labels[output['logits'].argmax(-1)]
363 |
364 | elif task_name == "nli":
365 | config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
366 | model = AutoModelForSequenceClassification.from_pretrained(train_dir, config = config)
367 | tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False, model_max_length = 512)
368 | encoded_review = tokenizer.encode_plus(
369 | premise,
370 | hypothesis,
371 | max_length=512,
372 | add_special_tokens=True,
373 | return_token_type_ids=False,
374 |             padding='max_length',
375 | return_attention_mask=True,
376 | return_tensors='pt',
377 | )
378 |
379 | input_ids = encoded_review['input_ids']
380 | attention_mask = encoded_review['attention_mask']
381 | output = model(input_ids, attention_mask)
382 | labels = data_config['labels'].split(",")
383 | return labels[output['logits'].argmax(-1)]
384 |
385 | elif task_name in ['ner', 'pos']:
386 | labels = data_config['labels'].split(",")
387 | config = AutoConfig.from_pretrained(model_name, num_labels = num_labels, id2label = labels)
388 | tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length = 512)
389 | model = AutoModelForTokenClassification.from_pretrained(train_dir, config = config)
390 |         nlp = pipeline("token-classification", model=model, tokenizer=tokenizer)
391 | return nlp(sentence)
392 |
393 | elif task_name == "qa":
394 | config = AutoConfig.from_pretrained(model_name)
395 | tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length = 512)
396 | model = AutoModelForQuestionAnswering.from_pretrained(train_dir, config = config)
397 | nlp = pipeline("question-answering", model=model, tokenizer=tokenizer)
398 | return nlp(question=question, context=context)
399 |
400 | elif task_name in ["mt", "sum"]:
401 | config = AutoConfig.from_pretrained(model_name)
402 | tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length = 512)
403 | model = AutoModelForSeq2SeqLM.from_pretrained(train_dir, config = config)
404 | nlp = pipeline('text2text-generation', model=model, tokenizer=tokenizer)
405 | return nlp(sentence)
406 |
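# Usage sketch (illustrative placeholders; save_dir follows the
# "<save_dir>/<model>/<dataset>" layout created by TrainStrategy above):
#
#   predict_from_run("ckpts/<model>/<dataset>", run=0, sentence="...")                 # cls/ner/mt/sum
#   predict_from_run("ckpts/<model>/<dataset>", run=0, question="...", context="...")  # qa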
--------------------------------------------------------------------------------
/nmatheg/preprocess_ner.py:
--------------------------------------------------------------------------------
1 | # Helper functions to pull the words from the columns and aggregate them into sentences
2 | from datasets import Dataset, DatasetDict
3 |
4 | def aggregate_tokens(dataset, config, data_config, max_len = 128):
5 | new_dataset = {}
6 | token_col = data_config['text']
7 | tag_col = data_config['label']
8 |
9 | for split in dataset:
10 | sent_labels = []
11 | sent_label = []
12 | sentence = []
13 | sentences = []
14 |
15 | for i, item in enumerate(dataset[split]):
16 | token, label = item[token_col], item[tag_col]
17 | sent_label.append(label)
18 | sentence.append(token)
19 | if len(sentence) == max_len:
20 | sentences.append(sentence)
21 | sent_labels.append(sent_label)
22 | sentence = []
23 | sent_label = []
24 | new_dataset[split] = Dataset.from_dict({token_col:sentences, tag_col:sent_labels})
25 | return DatasetDict(new_dataset)
26 |
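# Worked example (illustrative): with max_len=2, five token/tag rows become two
# fixed-length "sentences"; a trailing remainder shorter than max_len is
# dropped by the length check above.
def _sketch_aggregate_tokens():
    ds = DatasetDict({"train": Dataset.from_dict(
        {"tokens": ["a", "b", "c", "d", "e"], "ner_tags": [0, 1, 0, 0, 1]})})
    agg = aggregate_tokens(ds, None, {"text": "tokens", "label": "ner_tags"}, max_len=2)
    # agg["train"][0] == {'tokens': ['a', 'b'], 'ner_tags': [0, 1]}; 'e' is dropped
    return agg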
27 | # https://github.com/huggingface/transformers/blob/44f5b260fe7a69cbd82be91b58c62a2879d530fa/examples/pytorch/token-classification/run_ner_no_trainer.py#L353
28 | def tokenize_and_align_labels(dataset, tokenizer, data_config, model_type = 'transformer', max_len = 128):
29 |
30 | token_col = data_config['text']
31 | tag_col = data_config['label']
32 |
33 | if 'transformer' in model_type:
34 | tokenized_inputs = tokenizer(
35 | dataset[token_col],
36 | max_length=max_len,
37 | padding='max_length',
38 | truncation=True,
39 | # We use this argument because the texts in our dataset are lists of words (with a label for each word).
40 | is_split_into_words=True,
41 | )
42 | labels = []
43 | for i, label in enumerate(dataset[tag_col]):
44 | word_ids = tokenized_inputs.word_ids(batch_index=i)
45 | previous_word_idx = None
46 | label_ids = []
47 | for word_idx in word_ids:
48 | if word_idx is None:
49 | label_ids.append(-100)
50 | elif word_idx != previous_word_idx:
51 | label_ids.append(label[word_idx])
52 | else:
53 |                     label_ids.append(label[word_idx]) # label every sub-token of a word
54 | previous_word_idx = word_idx
55 |
56 | labels.append(label_ids)
57 | tokenized_inputs["labels"] = labels
58 | return tokenized_inputs
59 | else:
60 | labels = []
61 | input_ids = []
62 | for i, label in enumerate(dataset[tag_col]):
63 | word_ids = []
64 | tokens = []
65 | for j, word in enumerate(dataset[token_col][i]):
66 | token_ids = tokenizer._encode_word(word)
67 | for token_id in token_ids:
68 | tokens.append(token_id)
69 | word_ids.append(j)
70 | if len(tokens) > max_len:
71 | break
72 |
73 |             # pad up to max_len, then truncate so every example is exactly max_len long
74 |             while len(tokens) < max_len:
75 |                 tokens.append(0)
76 |                 word_ids.append(None)
77 |             tokens = tokens[:max_len]
78 |             word_ids = word_ids[:max_len]
79 |
80 | input_ids.append(tokens)
81 | previous_word_idx = None
82 | label_ids = []
83 | for word_idx in word_ids:
84 | if word_idx is None:
85 | label_ids.append(-100)
86 | elif word_idx != previous_word_idx:
87 | label_ids.append(label[word_idx])
88 | else:
89 |                     label_ids.append(label[word_idx]) # label every sub-token of a word
90 | previous_word_idx = word_idx
91 | labels.append(label_ids)
92 | dataset["labels"] = labels
93 | dataset["input_ids"] = input_ids
94 | return dataset
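# Worked alignment example (illustrative; B-LOC=1 and O=0 are assumed ids): a
# word split into several sub-tokens repeats its word-level label on every
# piece, while special and padding positions get -100 so the loss ignores them:
#
#   words:      Cairo           calling
#   sub-tokens: [CLS] Ca ##iro  call ##ing  [SEP]
#   label_ids:  -100  1  1      0    0      -100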
--------------------------------------------------------------------------------
/nmatheg/preprocess_qa.py:
--------------------------------------------------------------------------------
1 | # https://github.com/huggingface/transformers/blob/master/examples/pytorch/question-answering/run_qa_no_trainer.py
2 | import re
3 | import copy
4 |
5 | def overflow_to_sample_mapping(tokens, offsets, idx, max_len = 384, doc_stride = 128):
6 | fixed_tokens = []
7 | fixed_offsets = []
8 | sep_index = tokens.index(-100)
9 | question = tokens[:sep_index]
10 | context = tokens[sep_index+1:]
11 | q_offsets = offsets[:sep_index]
12 | c_offsets = offsets[sep_index+1:]
13 | q_len = len(question)
14 | c_len = len(context)
15 | st_idx = 0
16 | samplings = []
17 | sequences = []
18 |
19 | while True:
20 | ed_idx = st_idx+max_len-q_len-1
21 |         pad_re = max_len - len(question + [0] + context[st_idx:ed_idx])
22 | 
23 |         if len(context[st_idx:ed_idx]) == 0:
24 |             break
25 |         curr_tokens = question + [0] + context[st_idx:ed_idx] + [0] * pad_re
26 |         curr_offset = q_offsets + [(0,0)] + c_offsets[st_idx:ed_idx] + [(0,0)] * pad_re
27 |         curr_seq = [0]*q_len + [None] + [1]*len(context[st_idx:ed_idx]) + [None] * pad_re
28 |
29 | assert len(curr_tokens) == len(curr_offset) == len(curr_seq) == max_len, f"curr_tokens: {len(curr_tokens)}, curr_seq: {len(curr_seq)}"
30 | fixed_tokens.append(curr_tokens[:max_len])
31 | fixed_offsets.append(curr_offset[:max_len])
32 | samplings.append(idx)
33 | sequences.append(curr_seq)
34 |
35 | st_idx += doc_stride
36 | if pad_re > 0:
37 | break
38 | for i in range(len(fixed_tokens)):
39 | assert len(fixed_tokens[i]) == len(fixed_offsets[i])
40 | return fixed_tokens, fixed_offsets, samplings, sequences
41 |
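# Sliding-window sketch (illustrative, not part of the original source): with a
# 3-token question and max_len=8, each feature keeps the whole question plus a
# max_len - q_len - 1 = 4 token context window that advances by doc_stride.
def _sketch_overflow_windows():
    tokens = [5, 6, 7, -100] + list(range(10))   # -100 separates question/context
    offsets = [(0, 1)] * len(tokens)             # dummy character offsets
    toks, offs, samplings, seqs = overflow_to_sample_mapping(
        tokens, offsets, idx=0, max_len=8, doc_stride=2)
    # context windows [0:4], [2:6], [4:8], ... overlap; every feature records
    # the example index so spans can be mapped back to their source example
    return samplings                             # [0, 0, 0, 0, 0]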
42 | def prepare_features(examples, tokenizer, data_config, model_type = 'transformer', max_len = 384):
43 | # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
44 |     # in one example possibly giving several features when a context is long, each of those features having a
45 |     # context that overlaps a bit with the context of the previous feature.
46 | if 'transformer' in model_type:
47 | tokenized_examples = tokenizer(
48 | examples["question"],
49 | examples["context"],
50 | truncation="only_second",
51 | max_length=max_len,
52 | stride=128,
53 | return_overflowing_tokens=True,
54 | return_offsets_mapping=True,
55 | padding="max_length",
56 | )
57 | # Since one example might give us several features if it has a long context, we need a map from a feature to
58 | # its corresponding example. This key gives us just that.
59 | sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
60 | # The offset mappings will give us a map from token to character position in the original context. This will
61 | # help us compute the start_positions and end_positions.
62 | offset_mapping = tokenized_examples["offset_mapping"]
63 | # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the
64 | # corresponding example_id and we will store the offset mappings.
65 | # Let's label those examples!
66 | tokenized_examples["start_positions"] = []
67 | tokenized_examples["end_positions"] = []
68 |
69 | for i, offsets in enumerate(offset_mapping):
70 | # We will label impossible answers with the index of the CLS token.
71 | input_ids = tokenized_examples["input_ids"][i]
72 | cls_index = input_ids.index(tokenizer.cls_token_id)
73 |
74 | # Grab the sequence corresponding to that example (to know what is the context and what is the question).
75 | sequence_ids = tokenized_examples.sequence_ids(i)
76 |
77 | # One example can give several spans, this is the index of the example containing this span of text.
78 | sample_index = sample_mapping[i]
79 | answers = examples["answers"][sample_index]
80 | # If no answers are given, set the cls_index as answer.
81 | if len(answers["answer_start"]) == 0:
82 | tokenized_examples["start_positions"].append(cls_index)
83 | tokenized_examples["end_positions"].append(cls_index)
84 | else:
85 | # Start/end character index of the answer in the text.
86 | start_char = answers["answer_start"][0]
87 | end_char = start_char + len(answers["text"][0])
88 |
89 | # Start token index of the current span in the text.
90 | token_start_index = 0
91 | while sequence_ids[token_start_index] != 1:
92 | token_start_index += 1
93 |
94 | # End token index of the current span in the text.
95 | token_end_index = len(input_ids) - 1
96 | while sequence_ids[token_end_index] != 1:
97 | token_end_index -= 1
98 |
99 | # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
100 | if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
101 | tokenized_examples["start_positions"].append(cls_index)
102 | tokenized_examples["end_positions"].append(cls_index)
103 | else:
104 | # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
105 | # Note: we could go after the last offset if the answer is the last word (edge case).
106 | while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
107 | token_start_index += 1
108 | tokenized_examples["start_positions"].append(token_start_index - 1)
109 | while offsets[token_end_index][1] >= end_char:
110 | token_end_index -= 1
111 | tokenized_examples["end_positions"].append(token_end_index + 1)
112 |
113 | tokenized_examples["example_id"] = []
114 |
115 | for i in range(len(tokenized_examples["input_ids"])):
116 | # Grab the sequence corresponding to that example (to know what is the context and what is the question).
117 | sequence_ids = tokenized_examples.sequence_ids(i)
118 | context_index = 1
119 |
120 | # One example can give several spans, this is the index of the example containing this span of text.
121 | sample_index = sample_mapping[i]
122 | tokenized_examples["example_id"].append(examples["id"][sample_index])
123 |
124 | # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
125 | # position is part of the context or not.
126 | tokenized_examples["offset_mapping"][i] = [
127 | (o if sequence_ids[k] == context_index else None)
128 | for k, o in enumerate(tokenized_examples["offset_mapping"][i])
129 | ]
130 |
131 | return tokenized_examples
132 |
133 | else:
134 | question_col, context_col = data_config['text'].split(",")
135 | tokenized_examples = copy.deepcopy(examples)
136 | input_ids = []
137 | offset_mapping = []
138 | sequence_ids = []
139 | sample_mapping = []
140 |
141 | for i, (question, context) in enumerate(zip(examples[question_col], examples[context_col])):
142 | offsets = []
143 | tokens = []
144 | sequences = []
145 |
146 |             question_context = question + " " + context
147 | st = 0
148 | for word in question_context.split(" "):
149 | if len(word) == 0:
150 | st += 1
151 | continue
152 |
153 | word = word.strip()
154 |
155 | if word == "":
156 | offsets.append((0, 0))
157 | tokens.append(-100)
158 | st = 0
159 | else:
160 | token_ids = tokenizer._encode_word(word)
161 | token_ids = [token_id for token_id in token_ids]
162 | token_strs = tokenizer._tokenize_word(word, remove_sow=True)
163 | if token_ids[0] == tokenizer.sow_idx:
164 | token_strs = [tokenizer.sow] + token_strs
165 | for j, token_id in enumerate(token_ids):
166 | token_str = token_strs[j]
167 | tokens.append(token_id)
168 | if token_str == tokenizer.sow:
169 | offsets.append((st, st))
170 | else:
171 | offsets.append((st, st+len(token_str)))
172 | st += len(token_str)
173 | st += 1 # for space
174 |
175 |
176 | tokens, offsets, samplings, sequences = overflow_to_sample_mapping(tokens, offsets, i, max_len = max_len)
177 |
178 |
179 | sample_mapping += samplings
180 | input_ids += tokens
181 | offset_mapping += offsets
182 | sequence_ids += sequences
183 |
184 | tokenized_examples = {'input_ids':input_ids, 'sequence_ids':sequence_ids, 'offset_mapping': offset_mapping, 'overflow_to_sample_mapping': sample_mapping}
185 | tokenized_examples["start_positions"] = []
186 | tokenized_examples["end_positions"] = []
187 |
188 | for i, offsets in enumerate(offset_mapping):
189 | # We will label impossible answers with the index of the CLS token.
190 | input_ids = tokenized_examples["input_ids"][i]
191 | # cls_index = input_ids.index(tokenizer.cls_token_id)
192 | cls_index = 0
193 |
194 | # Grab the sequence corresponding to that example (to know what is the context and what is the question).
195 | sequence_ids = tokenized_examples['sequence_ids'][i]
196 |
197 | # One example can give several spans, this is the index of the example containing this span of text.
198 | sample_index = tokenized_examples["overflow_to_sample_mapping"][i]
199 |
200 | answers = examples["answers"][sample_index]
201 | # If no answers are given, set the cls_index as answer.
202 | if len(answers["answer_start"]) == 0:
203 | tokenized_examples["start_positions"].append(cls_index)
204 | tokenized_examples["end_positions"].append(cls_index)
205 | else:
206 | # Start/end character index of the answer in the text.
207 | start_char = answers["answer_start"][0]
208 | end_char = start_char + len(answers["text"][0])
209 |
210 | # Start token index of the current span in the text.
211 | token_start_index = 0
212 | while sequence_ids[token_start_index] != 1:
213 | token_start_index += 1
214 | # End token index of the current span in the text.
215 | token_end_index = len(input_ids) - 1
216 | while sequence_ids[token_end_index] != 1:
217 | token_end_index -= 1
218 |
219 | # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
220 | if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
221 | tokenized_examples["start_positions"].append(cls_index)
222 | tokenized_examples["end_positions"].append(cls_index)
223 |
224 | else:
225 | # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
226 | # Note: we could go after the last offset if the answer is the last word (edge case).
227 | while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
228 | token_start_index += 1
229 | tokenized_examples["start_positions"].append(token_start_index - 1)
230 | while offsets[token_end_index][1] >= end_char:
231 | token_end_index -= 1
232 | tokenized_examples["end_positions"].append(token_end_index + 1)
233 |
234 | tokenized_examples["example_id"] = []
235 |
236 | for i in range(len(tokenized_examples["input_ids"])):
237 | # Grab the sequence corresponding to that example (to know what is the context and what is the question).
238 | sequence_ids = tokenized_examples['sequence_ids'][i]
239 | context_index = 1
240 |
241 | # One example can give several spans, this is the index of the example containing this span of text.
242 | sample_index = sample_mapping[i]
243 | tokenized_examples["example_id"].append(examples["id"][sample_index])
244 |
245 | # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
246 | # position is part of the context or not.
247 | tokenized_examples["offset_mapping"][i] = [
248 | (o if sequence_ids[k] == context_index else None)
249 | for k, o in enumerate(tokenized_examples["offset_mapping"][i])
250 | ]
251 | return tokenized_examples
--------------------------------------------------------------------------------
/nmatheg/qa_utils.py:
--------------------------------------------------------------------------------
1 | import collections
2 | import json
3 | import logging
4 | import os
5 | from typing import Optional, Tuple
6 | from datasets import load_metric
7 | import numpy as np
8 | from tqdm.auto import tqdm
9 |
10 | def evaluate_metric(dataset, examples, all_start_logits, all_end_logits):
11 | metric = load_metric("squad")
12 | max_len = max([x.shape[1] for x in all_start_logits]) # Get the max_length of the tensor
13 |
14 | # concatenate the numpy array
15 | start_logits_concat = create_and_fill_np_array(all_start_logits, dataset, max_len)
16 | end_logits_concat = create_and_fill_np_array(all_end_logits, dataset, max_len)
17 |
18 | # delete the list of numpy arrays
19 | del all_start_logits
20 | del all_end_logits
21 |
22 | outputs_numpy = (start_logits_concat, end_logits_concat)
23 | prediction = post_processing_function(examples, dataset, outputs_numpy)
24 | return metric.compute(predictions=prediction['predictions'], references=prediction['label_ids'])
25 |
26 |
27 | def post_processing_function(examples, features, predictions, stage="eval"):
28 | # Post-processing: we match the start logits and end logits to answers in the original context.
29 | predictions = postprocess_qa_predictions(
30 | examples=examples,
31 | features=features,
32 | predictions=predictions,
33 | prefix=stage,
34 | )
35 | # Format the result to the format the metric expects.
36 | formatted_predictions = [{"id": k, "prediction_text": v} for k, v in predictions.items()]
37 |
38 | references = [{"id": ex["id"], "answers": ex['answers']} for ex in examples]
39 | return {'predictions':formatted_predictions, 'label_ids':references}
40 |
41 | def create_and_fill_np_array(start_or_end_logits, dataset, max_len):
42 | """
43 | Create and fill numpy array of size len_of_validation_data * max_length_of_output_tensor
44 | Args:
45 | start_or_end_logits(:obj:`tensor`):
46 | This is the output predictions of the model. We can only enter either start or end logits.
47 | eval_dataset: Evaluation dataset
48 | max_len(:obj:`int`):
49 | The maximum length of the output tensor. ( See the model.eval() part for more details )
50 | """
51 |
52 | step = 0
53 | # create a numpy array and fill it with -100.
54 | logits_concat = np.full((len(dataset), max_len), -100, dtype=np.float64)
55 |     # Now that we have created the array, populate it with the outputs gathered using accelerator.gather
56 |     for i, output_logit in enumerate(start_or_end_logits): # populate columns
57 |         # Copy each batch of logits into the array at the current row offset,
58 |         # advancing `step` by the batch size after every iteration
59 |
60 | batch_size = output_logit.shape[0]
61 | cols = output_logit.shape[1]
62 |
63 | if step + batch_size < len(dataset):
64 | logits_concat[step : step + batch_size, :cols] = output_logit
65 | else:
66 | logits_concat[step:, :cols] = output_logit[: len(dataset) - step]
67 |
68 | step += batch_size
69 |
70 | return logits_concat
71 |
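# Illustrative example (not part of the original source): ragged per-batch
# logits are written into a single (len(dataset), max_len) array padded with
# -100, so they can be post-processed as one matrix.
def _sketch_fill_np_array():
    batches = [np.ones((2, 5)), np.ones((1, 3))]   # (batch_size, seq_len) logits
    out = create_and_fill_np_array(batches, range(3), max_len=5)
    assert out.shape == (3, 5)                     # row 2 keeps -100 in columns 3-4
    return out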
72 | def postprocess_qa_predictions(
73 | examples,
74 | features,
75 | predictions: Tuple[np.ndarray, np.ndarray],
76 | version_2_with_negative: bool = False,
77 | n_best_size: int = 20,
78 | max_answer_length: int = 30,
79 | null_score_diff_threshold: float = 0.0,
80 | output_dir: Optional[str] = None,
81 | prefix: Optional[str] = None,
82 | log_level: Optional[int] = logging.WARNING,
83 | ):
84 | """
85 | Post-processes the predictions of a question-answering model to convert them to answers that are substrings of the
86 | original contexts. This is the base postprocessing functions for models that only return start and end logits.
87 | Args:
88 | examples: The non-preprocessed dataset (see the main script for more information).
89 | features: The processed dataset (see the main script for more information).
90 | predictions (:obj:`Tuple[np.ndarray, np.ndarray]`):
91 | The predictions of the model: two arrays containing the start logits and the end logits respectively. Its
92 | first dimension must match the number of elements of :obj:`features`.
93 | version_2_with_negative (:obj:`bool`, `optional`, defaults to :obj:`False`):
94 | Whether or not the underlying dataset contains examples with no answers.
95 | n_best_size (:obj:`int`, `optional`, defaults to 20):
96 | The total number of n-best predictions to generate when looking for an answer.
97 | max_answer_length (:obj:`int`, `optional`, defaults to 30):
98 | The maximum length of an answer that can be generated. This is needed because the start and end predictions
99 | are not conditioned on one another.
100 | null_score_diff_threshold (:obj:`float`, `optional`, defaults to 0):
101 | The threshold used to select the null answer: if the best answer has a score that is less than the score of
102 | the null answer minus this threshold, the null answer is selected for this example (note that the score of
103 | the null answer for an example giving several features is the minimum of the scores for the null answer on
104 | each feature: all features must be aligned on the fact they `want` to predict a null answer).
105 | Only useful when :obj:`version_2_with_negative` is :obj:`True`.
106 | output_dir (:obj:`str`, `optional`):
107 | If provided, the dictionaries of predictions, n_best predictions (with their scores and logits) and, if
108 | :obj:`version_2_with_negative=True`, the dictionary of the scores differences between best and null
109 | answers, are saved in `output_dir`.
110 | prefix (:obj:`str`, `optional`):
111 | If provided, the dictionaries mentioned above are saved with `prefix` added to their names.
112 | log_level (:obj:`int`, `optional`, defaults to ``logging.WARNING``):
113 | ``logging`` log level (e.g., ``logging.WARNING``)
114 | """
115 | assert len(predictions) == 2, "`predictions` should be a tuple with two elements (start_logits, end_logits)."
116 | all_start_logits, all_end_logits = predictions
117 |
118 | assert len(predictions[0]) == len(features), f"Got {len(predictions[0])} predictions and {len(features)} features."
119 |
120 |     # Build a map from each example to its corresponding features.
121 | example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
122 | features_per_example = collections.defaultdict(list)
123 | for i, feature in enumerate(features):
124 | features_per_example[example_id_to_index[feature["example_id"]]].append(i)
125 |
126 | # The dictionaries we have to fill.
127 | all_predictions = collections.OrderedDict()
128 | all_nbest_json = collections.OrderedDict()
129 | if version_2_with_negative:
130 | scores_diff_json = collections.OrderedDict()
131 |
132 | # Let's loop over all the examples!
133 | for example_index, example in enumerate(examples):
134 |         # These are the indices of the features associated with the current example.
135 | feature_indices = features_per_example[example_index]
136 |
137 | min_null_prediction = None
138 | prelim_predictions = []
139 |
140 |         # Loop through all the features associated with the current example.
141 | for feature_index in feature_indices:
142 | # We grab the predictions of the model for this feature.
143 | start_logits = all_start_logits[feature_index]
144 | end_logits = all_end_logits[feature_index]
145 |             # This is what will allow us to map some of the positions in our logits to spans of text in the
146 |             # original context.
147 | offset_mapping = features[feature_index]["offset_mapping"]
148 | # Optional `token_is_max_context`, if provided we will remove answers that do not have the maximum context
149 | # available in the current feature.
150 | token_is_max_context = features[feature_index].get("token_is_max_context", None)
151 |
152 | # Update minimum null prediction.
153 | feature_null_score = start_logits[0] + end_logits[0]
154 | if min_null_prediction is None or min_null_prediction["score"] > feature_null_score:
155 | min_null_prediction = {
156 | "offsets": (0, 0),
157 | "score": feature_null_score,
158 | "start_logit": start_logits[0],
159 | "end_logit": end_logits[0],
160 | }
161 |
162 |             # Go through all possibilities for the `n_best_size` greatest start and end logits.
163 | start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
164 | end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
165 | for start_index in start_indexes:
166 | for end_index in end_indexes:
167 | # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
168 | # to part of the input_ids that are not in the context.
169 | if (
170 | start_index >= len(offset_mapping)
171 | or end_index >= len(offset_mapping)
172 | or offset_mapping[start_index] is None
173 | or offset_mapping[end_index] is None
174 | ):
175 | continue
176 | # Don't consider answers with a length that is either < 0 or > max_answer_length.
177 | if end_index < start_index or end_index - start_index + 1 > max_answer_length:
178 | continue
179 |                     # Don't consider answers that don't have the maximum context available (if such information
180 |                     # is provided).
181 | if token_is_max_context is not None and not token_is_max_context.get(str(start_index), False):
182 | continue
183 | prelim_predictions.append(
184 | {
185 | "offsets": (offset_mapping[start_index][0], offset_mapping[end_index][1]),
186 | "score": start_logits[start_index] + end_logits[end_index],
187 | "start_logit": start_logits[start_index],
188 | "end_logit": end_logits[end_index],
189 | }
190 | )
191 | if version_2_with_negative:
192 | # Add the minimum null prediction
193 | prelim_predictions.append(min_null_prediction)
194 | null_score = min_null_prediction["score"]
195 |
196 | # Only keep the best `n_best_size` predictions.
197 | predictions = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)[:n_best_size]
198 |
199 | # Add back the minimum null prediction if it was removed because of its low score.
200 | if version_2_with_negative and not any(p["offsets"] == (0, 0) for p in predictions):
201 | predictions.append(min_null_prediction)
202 |
203 | # Use the offsets to gather the answer text in the original context.
204 | context = example["context"]
205 | for pred in predictions:
206 | offsets = pred.pop("offsets")
207 | pred["text"] = context[offsets[0] : offsets[1]]
208 |
209 |         # In the very rare edge case where we don't have a single non-null prediction, we create a fake
210 |         # prediction to avoid failure.
211 | if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == ""):
212 | predictions.insert(0, {"text": "empty", "start_logit": 0.0, "end_logit": 0.0, "score": 0.0})
213 |
214 | # Compute the softmax of all scores (we do it with numpy to stay independent from torch/tf in this file, using
215 | # the LogSumExp trick).
216 | scores = np.array([pred.pop("score") for pred in predictions])
217 | exp_scores = np.exp(scores - np.max(scores))
218 | probs = exp_scores / exp_scores.sum()
219 |
220 | # Include the probabilities in our predictions.
221 | for prob, pred in zip(probs, predictions):
222 | pred["probability"] = prob
223 |
224 | # Pick the best prediction. If the null answer is not possible, this is easy.
225 | if not version_2_with_negative:
226 | all_predictions[example["id"]] = predictions[0]["text"]
227 | else:
228 | # Otherwise we first need to find the best non-empty prediction.
229 | i = 0
230 | while predictions[i]["text"] == "":
231 | i += 1
232 | best_non_null_pred = predictions[i]
233 |
234 | # Then we compare to the null prediction using the threshold.
235 | score_diff = null_score - best_non_null_pred["start_logit"] - best_non_null_pred["end_logit"]
236 | scores_diff_json[example["id"]] = float(score_diff) # To be JSON-serializable.
237 | if score_diff > null_score_diff_threshold:
238 | all_predictions[example["id"]] = ""
239 | else:
240 | all_predictions[example["id"]] = best_non_null_pred["text"]
241 |
242 | # Make `predictions` JSON-serializable by casting np.float back to float.
243 | all_nbest_json[example["id"]] = [
244 | {k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()}
245 | for pred in predictions
246 | ]
247 | return all_predictions
--------------------------------------------------------------------------------
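Note: the two helpers in qa_utils.py are meant to be used together at evaluation time. The sketch below shows the intended flow; `model`, `eval_dataloader`, `features` (the tokenized dataset, still carrying `offset_mapping` and `example_id`), and `examples` (the raw dataset) are placeholder names, not part of this repo.

import numpy as np
import torch

all_start_logits, all_end_logits = [], []
model.eval()
with torch.no_grad():
    for batch in eval_dataloader:
        outputs = model(**batch)
        all_start_logits.append(outputs.start_logits.cpu().numpy())
        all_end_logits.append(outputs.end_logits.cpu().numpy())

# Batches may have different sequence lengths, hence the -100-padded arrays.
max_len = max(x.shape[1] for x in all_start_logits)
start_logits = create_and_fill_np_array(all_start_logits, features, max_len)
end_logits = create_and_fill_np_array(all_end_logits, features, max_len)

answers = postprocess_qa_predictions(
    examples=examples,
    features=features,
    predictions=(start_logits, end_logits),
)

The wrapper at the top of the file then reshapes this output into the {'predictions', 'label_ids'} dictionary that a SQuAD-style metric expects.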
/nmatheg/tests.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | config = configparser.ConfigParser()
3 | config.read('config.ini')
4 | print(config['preprocessing']['segment'])  # smoke test: the [preprocessing] section must be present
--------------------------------------------------------------------------------
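For reference, tests.py only checks that config.ini is readable and exposes a [preprocessing] section. An illustrative (not authoritative) section that would satisfy it, using the 'True'/'False' strings that get_preprocessing_args in utils.py converts back to booleans:

[preprocessing]
segment = False
remove_special_chars = False
remove_english = False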
/nmatheg/utils.py:
--------------------------------------------------------------------------------
1 | import tkseem as tk
2 | try:
3 | import bpe_surgery
4 | except ImportError:
5 |     pass  # bpe_surgery is optional; only the BPE-based tokenizers below need it
6 | import json
7 |
8 | def get_tokenizer(tok_name, vocab_size = 300, lang = 'ar'):
9 | if tok_name == "WordTokenizer":
10 | return tk.WordTokenizer(vocab_size=vocab_size)
11 | elif tok_name == "SentencePieceTokenizer":
12 | return tk.SentencePieceTokenizer(vocab_size=vocab_size)
13 | elif tok_name == "CharacterTokenizer":
14 | return tk.CharacterTokenizer(vocab_size=vocab_size)
15 | elif tok_name == "RandomTokenizer":
16 | return tk.RandomTokenizer(vocab_size=vocab_size)
17 | elif tok_name == "DisjointLetterTokenizer":
18 | return tk.DisjointLetterTokenizer(vocab_size=vocab_size)
19 | elif tok_name == "MorphologicalTokenizer":
20 | return tk.MorphologicalTokenizer(vocab_size=vocab_size)
21 | elif tok_name == "BPE":
22 | return bpe_surgery.bpe(vocab_size=vocab_size)
23 | elif tok_name == "MaT-BPE":
24 | return bpe_surgery.bpe(vocab_size=vocab_size, morph=True)
25 | elif tok_name == "Seg-BPE":
26 | return bpe_surgery.bpe(vocab_size=vocab_size, seg = True)
27 | else:
28 |         raise ValueError('Unrecognized tokenizer name!')
29 |
30 | def get_preprocessing_args(config):
31 | args = {}
32 | map_bool = {'True':True, 'False':False, '[]': []}
33 | for key in config['preprocessing']:
34 | val = config['preprocessing'][key]
35 | if val in map_bool.keys():
36 | args[key] = map_bool[val]
37 | else:
38 | args[key] = val
39 | return args
40 |
41 | def save_json(ob, save_path):
42 | with open(save_path, 'w') as handle:
43 | json.dump(dict(ob), handle)
--------------------------------------------------------------------------------
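A short usage sketch for the helpers above; the tokenizer name is any key from the dispatch table in get_tokenizer, and config.ini is assumed to contain a [preprocessing] section:

import configparser
from nmatheg.utils import get_tokenizer, get_preprocessing_args

# Construct (but do not train) a tokenizer by name.
tokenizer = get_tokenizer('SentencePieceTokenizer', vocab_size=1000)

# Turn the stringly-typed config section into real Python values.
config = configparser.ConfigParser()
config.read('config.ini')
args = get_preprocessing_args(config)

Training and tokenization then follow the tkseem (or bpe_surgery) APIs, which this module deliberately leaves untouched.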
/nmatheg_logo.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ARBML/nmatheg/209285b0b30780e2bf2b4a6a272cc9b2ac8ba95b/nmatheg_logo.PNG
--------------------------------------------------------------------------------
/predict.py:
--------------------------------------------------------------------------------
1 | from nmatheg import predict_from_run
2 | import argparse
3 | import os
4 | from datasets import load_dataset
5 | import json
6 | # Parse the run directory and the index of the training example to predict.
7 | my_parser = argparse.ArgumentParser()
8 | my_parser.add_argument('-p', '--path', dest='p', type=str)
9 | my_parser.add_argument('-n', '--num', dest='n', type=int)
10 |
11 | args = my_parser.parse_args()
12 | data_config = json.load(open(f"{args.p}/data/data_config.json"))
13 | data = load_dataset(data_config["name"])
14 | src, trg = data_config['text'].split(',')
15 | out = predict_from_run(args.p, run = 0, sentence = data['train'][args.n][src])
16 | print(out[0])
17 | print({'gold_text': data['train'][args.n][trg]})
--------------------------------------------------------------------------------
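The same flow can be driven without the CLI. A sketch, with the run directory as a placeholder and predict_from_run used exactly as in predict.py above:

import json
from datasets import load_dataset
from nmatheg import predict_from_run

run_dir = 'ckpts/run0'  # placeholder: wherever TrainStrategy wrote its outputs
with open(f'{run_dir}/data/data_config.json') as f:
    data_config = json.load(f)

data = load_dataset(data_config['name'])
src, trg = data_config['text'].split(',')

out = predict_from_run(run_dir, run=0, sentence=data['train'][0][src])
print(out[0])
print({'gold_text': data['train'][0][trg]})

With the flags as fixed above, the equivalent CLI call is python predict.py -p ckpts/run0 -n 0.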
/requirements.txt:
--------------------------------------------------------------------------------
1 | tnkeeh
2 | tkseem
3 | tfds-nightly
4 | datasets
5 | transformers[sentencepiece]
6 | accelerate
7 | seqeval
8 | sacrebleu
9 | rouge_score_ar @ git+https://github.com/ARBML/rouge_score_ar
10 | evaluate
11 | pandas
12 | fsspec==2021.10
13 | s3fs==2021.10
14 |
--------------------------------------------------------------------------------
/script.py:
--------------------------------------------------------------------------------
1 | import nmatheg as nm
2 | strategy = nm.TrainStrategy(
3 | datasets = 'caner',
4 | models = 'birnn',
5 | tokenizers = 'bpe',
6 | vocab_sizes = '1000',
7 | runs = 1,
8 | lr = 1e-4,
9 | epochs = 50,
10 | batch_size = 128,
11 | max_tokens = 128,
12 | )
13 | output = strategy.start()
--------------------------------------------------------------------------------
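script.py trains a single configuration. Because every knob is passed as a string, comma-separated values look like the intended way to sweep several settings at once; treat the sketch below as an assumption about the TrainStrategy API rather than verified behavior:

import nmatheg as nm

# Hypothetical sweep: one tokenizer at two vocabulary sizes, three runs each
# (assumes TrainStrategy splits comma-separated strings into a grid).
strategy = nm.TrainStrategy(
    datasets = 'caner',
    models = 'birnn',
    tokenizers = 'bpe',
    vocab_sizes = '1000,5000',
    runs = 3,
    lr = 1e-4,
    epochs = 50,
    batch_size = 128,
    max_tokens = 128,
)
output = strategy.start()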
/setup.py:
--------------------------------------------------------------------------------
1 | import os
2 | from setuptools import setup
3 |
4 | with open('requirements.txt') as f:
5 | required = f.read().splitlines()
6 |
7 | with open('README.md') as readme_file:
8 | readme = readme_file.read()
9 |
10 | setup(name='nmatheg',
11 | version='0.0.4',
12 | url='',
13 |       description="Arabic Training Strategy For NLP Models",
14 | long_description=readme,
15 | long_description_content_type='text/markdown',
16 | author='Zaid Alyafeai, Maged Saeed',
17 | author_email='arabicmachinelearning@gmail.com',
18 | license='MIT',
19 | packages=['nmatheg'],
20 | install_requires=required,
21 | python_requires=">=3.6",
22 | include_package_data=True,
23 | zip_safe=False,
24 | )
25 |
--------------------------------------------------------------------------------