├── 1-NMT-Data-Processing.ipynb
├── 2-NMT-Training.ipynb
├── LICENSE
└── README.md
/1-NMT-Data-Processing.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "provenance": [],
7 | "include_colab_link": true
8 | },
9 | "kernelspec": {
10 | "name": "python3",
11 | "display_name": "Python 3"
12 | },
13 | "language_info": {
14 | "name": "python"
15 | }
16 | },
17 | "cells": [
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "id": "view-in-github",
22 | "colab_type": "text"
23 | },
24 | "source": [
25 | "
"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "source": [
31 | "# Data Gathering and Processing \n",
32 | "\n",
33 | "To build a Machine Translation system, you need bilingual data, i.e. source sentences and their translations. You can use public bilingual corpora/datasets or you can use your translation memories (TMs). However, NMT requires a lot of data to train a good model, that is why most companies start with training a strong baseline model using public bilingual datasets, and then fine-tune this baseline model on their TMs. Sometimes also you can use pre-trained models directly for fine-tuning.\n",
34 | "\n",
35 | "The majority of public bilingual datasets are collected on OPUS: https://opus.nlpl.eu/\n",
36 | "\n",
37 | "Most of the datasets can be used for both commercial and non-commercial uses; however, some of them have more restricted licences. So you have to double-check the licence of a dataset before using it.\n",
38 | "\n",
39 | "On OPUS, go to “Search & download resources” and choose two languages from the drop-down lists. You will see how it will list the available language datasets for this language pair. Try to use non-variant language codes like “en” for English and “fr” for French to get all the variants under this language. To know more details about a specific dataset, click its name.\n",
40 | "\n",
41 | "In Machine Translation, we use the “Moses” format. Go ahead and try to download the “tico-19 v2020-10-28” by clicking “moses”. This will download a *.zip file; when you extract it, the two files that you care about are those whose names ending by the language codes. For example, for English to French, you will have “tico-19.en-fr.en” and “tico-19.en-fr.fr“. You can open these files with any text editor. Each file has a sentence/segment per line, and it is matching translation in the same line in the other file. This is what the \"Moses\" file format means.\n",
42 | "\n",
43 | "Note that not all datasets are of the same quality. Some datasets have lower quality, especially big corpora crawled from the web. Check the provided “sample” before using the dataset. Nevertheless, even high-quality datasets, like those from the UN and EU, require filtering.\n",
44 | "\n"
45 | ],
46 | "metadata": {
47 | "id": "XMk-V8nT9IpU"
48 | }
49 | },
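To make the Moses format concrete, here is a minimal Python sketch (assuming the extracted tico-19 file names above are in the working directory) that reads the two files in parallel and prints the first aligned sentence pairs:

```python
# Minimal sketch: read a Moses-format parallel corpus line by line.
# Paths assume the extracted tico-19 example above; adjust as needed.
from itertools import islice

src_path = "tico-19.en-fr.en"
tgt_path = "tico-19.en-fr.fr"

with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
    # Line i of one file is aligned with line i of the other file.
    for en_line, fr_line in islice(zip(src, tgt), 3):
        print("EN:", en_line.strip())
        print("FR:", fr_line.strip())
        print("---")
```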
50 | {
51 | "cell_type": "code",
52 | "metadata": {
53 | "id": "bLRpsedwqLg4",
54 | "colab": {
55 | "base_uri": "https://localhost:8080/"
56 | },
57 | "outputId": "f0e6d1e6-8ee1-4149-a309-47507d1a6740"
58 | },
59 | "source": [
60 | "# Create a directory and clone the Github MT-Preparation repository\n",
61 | "!mkdir -p nmt\n",
62 | "%cd nmt\n",
63 | "!git clone https://github.com/ymoslem/MT-Preparation.git"
64 | ],
65 | "execution_count": 1,
66 | "outputs": [
67 | {
68 | "output_type": "stream",
69 | "name": "stdout",
70 | "text": [
71 | "/content/nmt\n",
72 | "Cloning into 'MT-Preparation'...\n",
73 | "remote: Enumerating objects: 207, done.\u001b[K\n",
74 | "remote: Counting objects: 100% (207/207), done.\u001b[K\n",
75 | "remote: Compressing objects: 100% (104/104), done.\u001b[K\n",
76 | "remote: Total 207 (delta 104), reused 184 (delta 94), pack-reused 0\u001b[K\n",
77 | "Receiving objects: 100% (207/207), 50.84 KiB | 10.17 MiB/s, done.\n",
78 | "Resolving deltas: 100% (104/104), done.\n"
79 | ]
80 | }
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "metadata": {
86 | "id": "H8d13pqsp3Ii",
87 | "colab": {
88 | "base_uri": "https://localhost:8080/"
89 | },
90 | "outputId": "98dcc2d5-475f-4843-af3f-ce1ee71265af"
91 | },
92 | "source": [
93 | "# Install the requirements\n",
94 | "!pip3 install -r MT-Preparation/requirements.txt"
95 | ],
96 | "execution_count": 2,
97 | "outputs": [
98 | {
99 | "output_type": "stream",
100 | "name": "stdout",
101 | "text": [
102 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
103 | "Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from -r MT-Preparation/requirements.txt (line 1)) (1.21.6)\n",
104 | "Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from -r MT-Preparation/requirements.txt (line 2)) (1.3.5)\n",
105 | "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.7/dist-packages (from -r MT-Preparation/requirements.txt (line 3)) (0.1.97)\n",
106 | "Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->-r MT-Preparation/requirements.txt (line 2)) (2.8.2)\n",
107 | "Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->-r MT-Preparation/requirements.txt (line 2)) (2022.6)\n",
108 | "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->-r MT-Preparation/requirements.txt (line 2)) (1.15.0)\n"
109 | ]
110 | }
111 | ]
112 | },
113 | {
114 | "cell_type": "markdown",
115 | "metadata": {
116 | "id": "G903Vcm7u08U"
117 | },
118 | "source": [
119 | "# Datasets\n",
120 | "\n",
121 | "Example datasets:\n",
122 | "\n",
123 | "* EN-AR: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/ar-en.txt.zip\n",
124 | "* EN-ES: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-es.txt.zip\n",
125 | "* EN-FR: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-fr.txt.zip\n",
126 | "* EN-RU: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-ru.txt.zip\n",
127 | "* EN-ZH: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-zh.txt.zip"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "metadata": {
133 | "id": "6WmiX_xTqqdr",
134 | "colab": {
135 | "base_uri": "https://localhost:8080/"
136 | },
137 | "outputId": "f8e50277-5465-43ed-ec0f-53a8f9d65eff"
138 | },
139 | "source": [
140 | "# Download and unzip a dataset\n",
141 | "!wget https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-fr.txt.zip\n",
142 | "!unzip en-fr.txt.zip"
143 | ],
144 | "execution_count": 3,
145 | "outputs": [
146 | {
147 | "output_type": "stream",
148 | "name": "stdout",
149 | "text": [
150 | "--2022-11-19 18:04:10-- https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-fr.txt.zip\n",
151 | "Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19\n",
152 | "Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.\n",
153 | "HTTP request sent, awaiting response... 200 OK\n",
154 | "Length: 10014972 (9.6M) [application/zip]\n",
155 | "Saving to: ‘en-fr.txt.zip’\n",
156 | "\n",
157 | "en-fr.txt.zip 100%[===================>] 9.55M 5.67MB/s in 1.7s \n",
158 | "\n",
159 | "2022-11-19 18:04:13 (5.67 MB/s) - ‘en-fr.txt.zip’ saved [10014972/10014972]\n",
160 | "\n",
161 | "Archive: en-fr.txt.zip\n",
162 | " inflating: UN.en-fr.en \n",
163 | " inflating: UN.en-fr.fr \n",
164 | " inflating: README \n"
165 | ]
166 | }
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "source": [
172 | "# Data Filtering\n",
173 | "\n",
174 | "Filtering out low-quality segments can help improve the translation quality of the output MT model. This might include misalignments, empty segments, duplicates, among other issues. "
175 | ],
176 | "metadata": {
177 | "id": "5G6GTlXa86Qb"
178 | }
179 | },
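As a rough illustration of what the filtering step does, here is a simplified pandas sketch (an approximation, not the actual `filter.py`, which applies more rules such as HTML removal and true-casing):

```python
# Simplified parallel-corpus filtering sketch with pandas.
import pandas as pd

def filter_parallel(src_file, tgt_file, max_words=200):
    with open(src_file, encoding="utf-8") as f:
        src = [line.strip() for line in f]
    with open(tgt_file, encoding="utf-8") as f:
        tgt = [line.strip() for line in f]
    df = pd.DataFrame({"src": src, "tgt": tgt})

    df = df.replace("", pd.NA).dropna()       # drop empty segments
    df = df.drop_duplicates()                 # drop exact duplicates
    df = df[df["src"] != df["tgt"]]           # drop source-copied rows
    ok_len = (df["src"].str.split().str.len() <= max_words) & \
             (df["tgt"].str.split().str.len() <= max_words)
    df = df[ok_len]                           # drop too-long segments
    return df.sample(frac=1, random_state=1)  # shuffle
```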
180 | {
181 | "cell_type": "code",
182 | "metadata": {
183 | "id": "b-9jDIWarB-9",
184 | "colab": {
185 | "base_uri": "https://localhost:8080/"
186 | },
187 | "outputId": "ba753eac-6fb2-418e-94e7-e588a86b9dd6"
188 | },
189 | "source": [
190 | "# Filter the dataset\n",
191 | "# Arguments: source file, target file, source language, target language\n",
192 | "!python3 MT-Preparation/filtering/filter.py UN.en-fr.fr UN.en-fr.en fr en"
193 | ],
194 | "execution_count": 4,
195 | "outputs": [
196 | {
197 | "output_type": "stream",
198 | "name": "stdout",
199 | "text": [
200 | "Dataframe shape (rows, columns): (74067, 2)\n",
201 | "--- Rows with Empty Cells Deleted\t--> Rows: 74067\n",
202 | "--- Duplicates Deleted\t\t\t--> Rows: 60662\n",
203 | "--- Source-Copied Rows Deleted\t\t--> Rows: 60476\n",
204 | "--- Too Long Source/Target Deleted\t--> Rows: 59719\n",
205 | "--- HTML Removed\t\t\t--> Rows: 59719\n",
206 | "--- Rows will remain in true-cased\t--> Rows: 59719\n",
207 | "--- Rows with Empty Cells Deleted\t--> Rows: 59719\n",
208 | "--- Rows Shuffled\t\t\t--> Rows: 59719\n",
209 | "--- Source Saved: UN.en-fr.fr-filtered.fr\n",
210 | "--- Target Saved: UN.en-fr.en-filtered.en\n"
211 | ]
212 | }
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "source": [
218 | "# Tokenization / Sub-wording\n",
219 | "\n",
220 | "To build a vocabulary for any NLP model, you have to tokenize (i.e. split) sentences into smaller units. Word-based tokenization used to be the way to go; in this case, each word would be a token. However, an MT model can only learn a specific number of vocabulary tokens due to limited hardware resources. To solve this issue, sub-words are used instead of whole words. At translation time, when the model sees a new word/token that looks like a word/token it has in the vocabulary, it still can try to continue the translation instead of marking this word as “unknown” or “unk”.\n",
221 | "\n",
222 | "There are a few approaches to sub-wording such as BPE and the unigram model. One of the famous toolkits that incorporates the most common approaches is [SentencePiece](https://github.com/google/sentencepiece). Note that you have to train a sub-wording model and then use it. After translation, you will have to “desubword” or “decode” your text back using the same SentencePiece model.\n",
223 | "\n"
224 | ],
225 | "metadata": {
226 | "id": "IbRpxXjC78c0"
227 | }
228 | },
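The SentencePiece Python API mirrors the helper scripts used below. A minimal sketch of training a unigram model and then subwording and desubwording a sentence (the input file name `corpus.fr` is hypothetical):

```python
import sentencepiece as spm

# Train a unigram model on a raw-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.fr",       # hypothetical input file
    model_prefix="source",   # writes source.model and source.vocab
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="source.model")

# Subword: split the sentence into vocabulary pieces.
pieces = sp.encode("Ayant examiné le rapport du Comité", out_type=str)
print(pieces)

# Desubword: decoding with the same model restores the original text.
print(sp.decode(pieces))
```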
229 | {
230 | "cell_type": "code",
231 | "metadata": {
232 | "id": "n9c1pqhuru3j",
233 | "colab": {
234 | "base_uri": "https://localhost:8080/"
235 | },
236 | "outputId": "d980a82b-1968-40b4-e875-78c59bc446b9"
237 | },
238 | "source": [
239 | "!ls MT-Preparation/subwording/"
240 | ],
241 | "execution_count": 5,
242 | "outputs": [
243 | {
244 | "output_type": "stream",
245 | "name": "stdout",
246 | "text": [
247 | "1-train_bpe.py\t1-train_unigram.py 2-subword.py 3-desubword.py\n"
248 | ]
249 | }
250 | ]
251 | },
252 | {
253 | "cell_type": "code",
254 | "metadata": {
255 | "id": "weSS6QDPsOUJ",
256 | "colab": {
257 | "base_uri": "https://localhost:8080/"
258 | },
259 | "outputId": "23061a1e-b114-48ee-d2e3-e7e469840d55"
260 | },
261 | "source": [
262 | "# Train a SentencePiece model for subword tokenization\n",
263 | "!python3 MT-Preparation/subwording/1-train_unigram.py UN.en-fr.fr-filtered.fr UN.en-fr.en-filtered.en"
264 | ],
265 | "execution_count": 6,
266 | "outputs": [
267 | {
268 | "output_type": "stream",
269 | "name": "stdout",
270 | "text": [
271 | "sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=UN.en-fr.fr-filtered.fr --model_prefix=source --vocab_size=50000 --hard_vocab_limit=false --split_digits=true\n",
272 | "sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : \n",
273 | "trainer_spec {\n",
274 | " input: UN.en-fr.fr-filtered.fr\n",
275 | " input_format: \n",
276 | " model_prefix: source\n",
277 | " model_type: UNIGRAM\n",
278 | " vocab_size: 50000\n",
279 | " self_test_sample_size: 0\n",
280 | " character_coverage: 0.9995\n",
281 | " input_sentence_size: 0\n",
282 | " shuffle_input_sentence: 1\n",
283 | " seed_sentencepiece_size: 1000000\n",
284 | " shrinking_factor: 0.75\n",
285 | " max_sentence_length: 4192\n",
286 | " num_threads: 16\n",
287 | " num_sub_iterations: 2\n",
288 | " max_sentencepiece_length: 16\n",
289 | " split_by_unicode_script: 1\n",
290 | " split_by_number: 1\n",
291 | " split_by_whitespace: 1\n",
292 | " split_digits: 1\n",
293 | " treat_whitespace_as_suffix: 0\n",
294 | " allow_whitespace_only_pieces: 0\n",
295 | " required_chars: \n",
296 | " byte_fallback: 0\n",
297 | " vocabulary_output_piece_score: 1\n",
298 | " train_extremely_large_corpus: 0\n",
299 | " hard_vocab_limit: 0\n",
300 | " use_all_vocab: 0\n",
301 | " unk_id: 0\n",
302 | " bos_id: 1\n",
303 | " eos_id: 2\n",
304 | " pad_id: -1\n",
305 | " unk_piece: \n",
306 | " bos_piece: \n",
307 | " eos_piece: \n",
308 | " pad_piece: \n",
309 | " unk_surface: ⁇ \n",
310 | " enable_differential_privacy: 0\n",
311 | " differential_privacy_noise_level: 0\n",
312 | " differential_privacy_clipping_threshold: 0\n",
313 | "}\n",
314 | "normalizer_spec {\n",
315 | " name: nmt_nfkc\n",
316 | " add_dummy_prefix: 1\n",
317 | " remove_extra_whitespaces: 1\n",
318 | " escape_whitespaces: 1\n",
319 | " normalization_rule_tsv: \n",
320 | "}\n",
321 | "denormalizer_spec {}\n",
322 | "trainer_interface.cc(350) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.\n",
323 | "trainer_interface.cc(181) LOG(INFO) Loading corpus: UN.en-fr.fr-filtered.fr\n",
324 | "trainer_interface.cc(406) LOG(INFO) Loaded all 59719 sentences\n",
325 | "trainer_interface.cc(422) LOG(INFO) Adding meta_piece: \n",
326 | "trainer_interface.cc(422) LOG(INFO) Adding meta_piece: \n",
327 | "trainer_interface.cc(422) LOG(INFO) Adding meta_piece: \n",
328 | "trainer_interface.cc(427) LOG(INFO) Normalizing sentences...\n",
329 | "trainer_interface.cc(536) LOG(INFO) all chars count=19614832\n",
330 | "trainer_interface.cc(547) LOG(INFO) Done: 99.9546% characters are covered.\n",
331 | "trainer_interface.cc(557) LOG(INFO) Alphabet size=82\n",
332 | "trainer_interface.cc(558) LOG(INFO) Final character coverage=0.999546\n",
333 | "trainer_interface.cc(590) LOG(INFO) Done! preprocessed 59719 sentences.\n",
334 | "unigram_model_trainer.cc(146) LOG(INFO) Making suffix array...\n",
335 | "unigram_model_trainer.cc(150) LOG(INFO) Extracting frequent sub strings...\n",
336 | "unigram_model_trainer.cc(201) LOG(INFO) Initialized 61805 seed sentencepieces\n",
337 | "trainer_interface.cc(596) LOG(INFO) Tokenizing input sentences with whitespace: 59719\n",
338 | "trainer_interface.cc(607) LOG(INFO) Done! 48938\n",
339 | "unigram_model_trainer.cc(491) LOG(INFO) Using 48938 sentences for EM training\n",
340 | "unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=0 size=23065 obj=12.0957 num_tokens=205556 num_tokens/piece=8.91203\n",
341 | "unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=1 size=16778 obj=8.81693 num_tokens=205778 num_tokens/piece=12.2648\n",
342 | "trainer_interface.cc(685) LOG(INFO) Saving model: source.model\n",
343 | "trainer_interface.cc(697) LOG(INFO) Saving vocabs: source.vocab\n",
344 | "Done, training a SentencepPiece model for the Source finished successfully!\n",
345 | "sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=UN.en-fr.en-filtered.en --model_prefix=target --vocab_size=50000 --hard_vocab_limit=false --split_digits=true\n",
346 | "sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : \n",
347 | "trainer_spec {\n",
348 | " input: UN.en-fr.en-filtered.en\n",
349 | " input_format: \n",
350 | " model_prefix: target\n",
351 | " model_type: UNIGRAM\n",
352 | " vocab_size: 50000\n",
353 | " self_test_sample_size: 0\n",
354 | " character_coverage: 0.9995\n",
355 | " input_sentence_size: 0\n",
356 | " shuffle_input_sentence: 1\n",
357 | " seed_sentencepiece_size: 1000000\n",
358 | " shrinking_factor: 0.75\n",
359 | " max_sentence_length: 4192\n",
360 | " num_threads: 16\n",
361 | " num_sub_iterations: 2\n",
362 | " max_sentencepiece_length: 16\n",
363 | " split_by_unicode_script: 1\n",
364 | " split_by_number: 1\n",
365 | " split_by_whitespace: 1\n",
366 | " split_digits: 1\n",
367 | " treat_whitespace_as_suffix: 0\n",
368 | " allow_whitespace_only_pieces: 0\n",
369 | " required_chars: \n",
370 | " byte_fallback: 0\n",
371 | " vocabulary_output_piece_score: 1\n",
372 | " train_extremely_large_corpus: 0\n",
373 | " hard_vocab_limit: 0\n",
374 | " use_all_vocab: 0\n",
375 | " unk_id: 0\n",
376 | " bos_id: 1\n",
377 | " eos_id: 2\n",
378 | " pad_id: -1\n",
379 | " unk_piece: \n",
380 | " bos_piece: \n",
381 | " eos_piece: \n",
382 | " pad_piece: \n",
383 | " unk_surface: ⁇ \n",
384 | " enable_differential_privacy: 0\n",
385 | " differential_privacy_noise_level: 0\n",
386 | " differential_privacy_clipping_threshold: 0\n",
387 | "}\n",
388 | "normalizer_spec {\n",
389 | " name: nmt_nfkc\n",
390 | " add_dummy_prefix: 1\n",
391 | " remove_extra_whitespaces: 1\n",
392 | " escape_whitespaces: 1\n",
393 | " normalization_rule_tsv: \n",
394 | "}\n",
395 | "denormalizer_spec {}\n",
396 | "trainer_interface.cc(350) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.\n",
397 | "trainer_interface.cc(181) LOG(INFO) Loading corpus: UN.en-fr.en-filtered.en\n",
398 | "trainer_interface.cc(406) LOG(INFO) Loaded all 59719 sentences\n",
399 | "trainer_interface.cc(422) LOG(INFO) Adding meta_piece: \n",
400 | "trainer_interface.cc(422) LOG(INFO) Adding meta_piece: \n",
401 | "trainer_interface.cc(422) LOG(INFO) Adding meta_piece: \n",
402 | "trainer_interface.cc(427) LOG(INFO) Normalizing sentences...\n",
403 | "trainer_interface.cc(536) LOG(INFO) all chars count=17772658\n",
404 | "trainer_interface.cc(547) LOG(INFO) Done: 99.9623% characters are covered.\n",
405 | "trainer_interface.cc(557) LOG(INFO) Alphabet size=70\n",
406 | "trainer_interface.cc(558) LOG(INFO) Final character coverage=0.999623\n",
407 | "trainer_interface.cc(590) LOG(INFO) Done! preprocessed 59719 sentences.\n",
408 | "unigram_model_trainer.cc(146) LOG(INFO) Making suffix array...\n",
409 | "unigram_model_trainer.cc(150) LOG(INFO) Extracting frequent sub strings...\n",
410 | "unigram_model_trainer.cc(201) LOG(INFO) Initialized 46525 seed sentencepieces\n",
411 | "trainer_interface.cc(596) LOG(INFO) Tokenizing input sentences with whitespace: 59719\n",
412 | "trainer_interface.cc(607) LOG(INFO) Done! 44916\n",
413 | "unigram_model_trainer.cc(491) LOG(INFO) Using 44916 sentences for EM training\n",
414 | "unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=0 size=19499 obj=11.8498 num_tokens=207239 num_tokens/piece=10.6282\n",
415 | "unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=1 size=13744 obj=8.60033 num_tokens=207645 num_tokens/piece=15.108\n",
416 | "trainer_interface.cc(685) LOG(INFO) Saving model: target.model\n",
417 | "trainer_interface.cc(697) LOG(INFO) Saving vocabs: target.vocab\n",
418 | "Done, training a SentencepPiece model for the Target finished successfully!\n"
419 | ]
420 | }
421 | ]
422 | },
423 | {
424 | "cell_type": "code",
425 | "metadata": {
426 | "id": "T89THXeRslKu",
427 | "colab": {
428 | "base_uri": "https://localhost:8080/"
429 | },
430 | "outputId": "71fd0af6-3d1c-4726-9584-aa89d715d04a"
431 | },
432 | "source": [
433 | "!ls"
434 | ],
435 | "execution_count": 7,
436 | "outputs": [
437 | {
438 | "output_type": "stream",
439 | "name": "stdout",
440 | "text": [
441 | "en-fr.txt.zip\tsource.model target.vocab\t UN.en-fr.fr\n",
442 | "MT-Preparation\tsource.vocab UN.en-fr.en\t UN.en-fr.fr-filtered.fr\n",
443 | "README\t\ttarget.model UN.en-fr.en-filtered.en\n"
444 | ]
445 | }
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "metadata": {
451 | "id": "hBWQoCfBsqlT",
452 | "colab": {
453 | "base_uri": "https://localhost:8080/"
454 | },
455 | "outputId": "3dbc3400-1ab6-4d69-e93b-73bfeb1d21db"
456 | },
457 | "source": [
458 | "# Subword the dataset\n",
459 | "!python3 MT-Preparation/subwording/2-subword.py source.model target.model UN.en-fr.fr-filtered.fr UN.en-fr.en-filtered.en"
460 | ],
461 | "execution_count": 8,
462 | "outputs": [
463 | {
464 | "output_type": "stream",
465 | "name": "stdout",
466 | "text": [
467 | "Source Model: source.model\n",
468 | "Target Model: target.model\n",
469 | "Source Dataset: UN.en-fr.fr-filtered.fr\n",
470 | "Target Dataset: UN.en-fr.en-filtered.en\n",
471 | "Done subwording the source file! Output: UN.en-fr.fr-filtered.fr.subword\n",
472 | "Done subwording the target file! Output: UN.en-fr.en-filtered.en.subword\n"
473 | ]
474 | }
475 | ]
476 | },
477 | {
478 | "cell_type": "code",
479 | "metadata": {
480 | "id": "CnfMRckbvNfZ",
481 | "colab": {
482 | "base_uri": "https://localhost:8080/"
483 | },
484 | "outputId": "b5b967ee-ab5a-4016-b310-031a042647da"
485 | },
486 | "source": [
487 | "# First 3 lines before subwording\n",
488 | "!head -n 3 UN.en-fr.fr-filtered.fr && echo \"-----\" && head -n 3 UN.en-fr.en-filtered.en"
489 | ],
490 | "execution_count": 9,
491 | "outputs": [
492 | {
493 | "output_type": "stream",
494 | "name": "stdout",
495 | "text": [
496 | "Ayant examiné le rapport du Comité des conférences pour 2006Documents officiels de l'Assemblée générale, soixante et unième session, Supplément no 32 (A/61/32). et les rapports pertinents du Secrétaire généralA/61/129 et Add.1 et A/61/300.,\n",
497 | "Consciente en outre du rôle indispensable que jouent les professionnels de la santé dans la promotion et la protection du droit de toute personne de jouir du meilleur état de santé physique et mentale possible,\n",
498 | "2. Réaffirme que la fourchette de 10 à 20 pour cent établie pour la marge entre la rémunération nette des administrateurs et fonctionnaires de rang supérieur en poste à New York et celle des fonctionnaires de la fonction publique de référence occupant des emplois comparables reste applicable, étant entendu que la marge devrait être maintenue aux alentours du niveau souhaitable, le point médian (15 pour cent), pendant un certain temps ;\n",
499 | "-----\n",
500 | "Having considered the report of the Committee on Conferences for 2006Official Records of the General Assembly, Sixty-first Session, Supplement No. 32 (A/61/32). and the relevant reports of the Secretary-General,A/61/129 and Add.1 and A/61/300.\n",
501 | "Recognizing further the indispensable role that health professionals play in the promotion and protection of the right of everyone to the enjoyment of the highest attainable standard of physical and mental health,\n",
502 | "2. Reaffirms that the range of 110 to 120 for the margin between the net remuneration of officials in the Professional and higher categories of the United Nations in New York and the officials in comparable positions in the comparator civil service should continue to apply, on the understanding that the margin would be maintained at a level around the desirable midpoint of 115 over a period of time;\n"
503 | ]
504 | }
505 | ]
506 | },
507 | {
508 | "cell_type": "code",
509 | "metadata": {
510 | "id": "hs_xxKK_vf1Z",
511 | "colab": {
512 | "base_uri": "https://localhost:8080/"
513 | },
514 | "outputId": "80f0650e-cbf9-4855-bdca-2c6424b5423d"
515 | },
516 | "source": [
517 | "# First 3 lines after subwording\n",
518 | "!head -n 3 UN.en-fr.fr-filtered.fr.subword && echo \"---\" && head -n 3 UN.en-fr.en-filtered.en.subword"
519 | ],
520 | "execution_count": 10,
521 | "outputs": [
522 | {
523 | "output_type": "stream",
524 | "name": "stdout",
525 | "text": [
526 | "▁A yant ▁examiné ▁le ▁rapport ▁du ▁Comité ▁des ▁conférences ▁pour ▁ 2 0 0 6 Document s ▁officiels ▁de ▁l ' Assemblée ▁générale , ▁soixante ▁et ▁unième ▁session , ▁S upplément ▁no ▁ 3 2 ▁( A / 6 1 / 3 2 ). ▁et ▁les ▁rapports ▁pertinents ▁du ▁Secrétaire ▁général A / 6 1 / 1 2 9 ▁et ▁Add . 1 ▁et ▁A / 6 1 / 3 0 0 .,\n",
527 | "▁Conscient e ▁en ▁outre ▁du ▁rôle ▁indispensable ▁que ▁jouent ▁les ▁professionnels ▁de ▁la ▁santé ▁d ans ▁la ▁promotion ▁et ▁la ▁protection ▁du ▁droit ▁de ▁toute ▁personne ▁de ▁joui r ▁du ▁meilleur ▁état ▁de ▁santé ▁physique ▁et ▁mentale ▁possible ,\n",
528 | "▁ 2 . ▁Réaffirme ▁que ▁la ▁fourchette ▁de ▁ 1 0 ▁à ▁ 2 0 ▁pour ▁cent ▁établie ▁pour ▁la ▁marge ▁entre ▁la ▁rémunération ▁nette ▁des ▁administrateurs ▁et ▁fonctionnaires ▁de ▁rang ▁supérieur ▁en ▁poste ▁à ▁N ew ▁ Y ork ▁et ▁celle ▁des ▁fonctionnaires ▁de ▁la ▁fonction ▁publique ▁de ▁référence ▁occupant ▁des ▁emplois ▁comparables ▁reste ▁applicable , ▁étant ▁entendu ▁que ▁la ▁marge ▁devrait ▁être ▁maintenue ▁aux ▁alentours ▁du ▁niveau ▁souhaitable , ▁le ▁point ▁médi an ▁( 1 5 ▁pour ▁cent ), ▁ pendant ▁un ▁certain ▁temps ▁;\n",
529 | "---\n",
530 | "▁Hav ing ▁considered ▁the ▁report ▁of ▁the ▁Committee ▁on ▁Conferences ▁for ▁ 2 0 0 6 Official ▁Record s ▁of ▁the ▁General ▁Assembly , ▁S ixty - first ▁Session , ▁Supplement ▁No . ▁ 3 2 ▁( A / 6 1 / 3 2 ). ▁and ▁the ▁relevant ▁reports ▁of ▁the ▁Secretary - General , A / 6 1 / 1 2 9 ▁and ▁Add . 1 ▁and ▁A / 6 1 / 3 0 0 .\n",
531 | "▁Recognizing ▁further ▁the ▁indispensable ▁role ▁that ▁health ▁professionals ▁play ▁in ▁the ▁promotion ▁and ▁protection ▁of ▁the ▁right ▁of ▁everyone ▁to ▁the ▁enjoyment ▁of ▁the ▁high est ▁at tainable ▁standard ▁of ▁physical ▁and ▁mental ▁health ,\n",
532 | "▁ 2 . ▁Reaffirms ▁that ▁the ▁range ▁of ▁ 1 1 0 ▁to ▁ 1 2 0 ▁for ▁the ▁margin ▁between ▁the ▁net ▁remuneration ▁of ▁officials ▁in ▁the ▁Professional ▁and ▁higher ▁categories ▁of ▁the ▁Unit ed ▁Nations ▁in ▁New ▁York ▁and ▁the ▁officials ▁in ▁comparable ▁positions ▁in ▁the ▁comparator ▁civil ▁service ▁should ▁continue ▁to ▁apply , ▁on ▁the ▁understanding ▁that ▁the ▁margin ▁would ▁be ▁maintained ▁at ▁a ▁level ▁a round ▁the ▁desirable ▁midpoint ▁of ▁ 1 1 5 ▁over ▁a ▁period ▁of ▁time ;\n"
533 | ]
534 | }
535 | ]
536 | },
537 | {
538 | "cell_type": "markdown",
539 | "source": [
540 | "# Data Splitting\n",
541 | "\n",
542 | "We usually split our dataset into 3 portions:\n",
543 | "\n",
544 | "1. training dataset - used for training the model;\n",
545 | "2. development dataset - used to run regular validations during the training to help improve the model parameters; and\n",
546 | "3. testing dataset - a holdout dataset used after the model finishes training to finally evaluate the model on unseen data."
547 | ],
548 | "metadata": {
549 | "id": "YgTZ-m718neI"
550 | }
551 | },
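A minimal sketch of such a split (using 2,000-segment development and test sets, as in the cell below) might look like:

```python
import random

def split_parallel(src_lines, tgt_lines, dev_size=2000, test_size=2000, seed=1):
    # Shuffle aligned pairs together so source and target stay in sync.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)
    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test
```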
552 | {
553 | "cell_type": "code",
554 | "metadata": {
555 | "id": "KfQRMGRixBAL",
556 | "colab": {
557 | "base_uri": "https://localhost:8080/"
558 | },
559 | "outputId": "31bde92c-9062-47bd-8d5b-8593c11586c7"
560 | },
561 | "source": [
562 | "# Split the dataset into training set, development set, and test set\n",
563 | "# Development and test sets should be between 1000 and 5000 segments (here we chose 2000)\n",
564 | "!python3 MT-Preparation/train_dev_split/train_dev_test_split.py 2000 2000 UN.en-fr.fr-filtered.fr.subword UN.en-fr.en-filtered.en.subword"
565 | ],
566 | "execution_count": 11,
567 | "outputs": [
568 | {
569 | "output_type": "stream",
570 | "name": "stdout",
571 | "text": [
572 | "Dataframe shape: (59719, 2)\n",
573 | "--- Empty Cells Deleted --> Rows: 59719\n",
574 | "--- Wrote Files\n",
575 | "Done!\n",
576 | "Output files\n",
577 | "UN.en-fr.fr-filtered.fr.subword.train\n",
578 | "UN.en-fr.en-filtered.en.subword.train\n",
579 | "UN.en-fr.fr-filtered.fr.subword.dev\n",
580 | "UN.en-fr.en-filtered.en.subword.dev\n",
581 | "UN.en-fr.fr-filtered.fr.subword.test\n",
582 | "UN.en-fr.en-filtered.en.subword.test\n"
583 | ]
584 | }
585 | ]
586 | },
587 | {
588 | "cell_type": "code",
589 | "metadata": {
590 | "id": "6y3HQr4nxYib",
591 | "colab": {
592 | "base_uri": "https://localhost:8080/"
593 | },
594 | "outputId": "c5fd9355-830a-4904-bf5a-189427fd00bc"
595 | },
596 | "source": [
597 | "# Line count for the subworded train, dev, test datatest\n",
598 | "!wc -l *.subword.*"
599 | ],
600 | "execution_count": 12,
601 | "outputs": [
602 | {
603 | "output_type": "stream",
604 | "name": "stdout",
605 | "text": [
606 | " 2000 UN.en-fr.en-filtered.en.subword.dev\n",
607 | " 2000 UN.en-fr.en-filtered.en.subword.test\n",
608 | " 55719 UN.en-fr.en-filtered.en.subword.train\n",
609 | " 2000 UN.en-fr.fr-filtered.fr.subword.dev\n",
610 | " 2000 UN.en-fr.fr-filtered.fr.subword.test\n",
611 | " 55719 UN.en-fr.fr-filtered.fr.subword.train\n",
612 | " 119438 total\n"
613 | ]
614 | }
615 | ]
616 | },
617 | {
618 | "cell_type": "code",
619 | "metadata": {
620 | "id": "0duUCLP93GKE",
621 | "colab": {
622 | "base_uri": "https://localhost:8080/"
623 | },
624 | "outputId": "42f7d78e-9b55-4bc0-a989-f30dc786af68"
625 | },
626 | "source": [
627 | "# Check the first and last line from each dataset\n",
628 | "\n",
629 | "# -------------------------------------------\n",
630 | "# Change this cell to print your name\n",
631 | "!echo -e \"My name is: FirstName SecondName \\n\"\n",
632 | "# -------------------------------------------\n",
633 | "\n",
634 | "!echo \"---First line---\"\n",
635 | "!head -n 1 *.{train,dev,test}\n",
636 | "\n",
637 | "!echo -e \"\\n---Last line---\"\n",
638 | "!tail -n 1 *.{train,dev,test}"
639 | ],
640 | "execution_count": 13,
641 | "outputs": [
642 | {
643 | "output_type": "stream",
644 | "name": "stdout",
645 | "text": [
646 | "My name is: FirstName SecondName \n",
647 | "\n",
648 | "---First line---\n",
649 | "==> UN.en-fr.en-filtered.en.subword.train <==\n",
650 | "▁Hav ing ▁considered ▁the ▁report ▁of ▁the ▁Committee ▁on ▁Conferences ▁for ▁ 2 0 0 6 Official ▁Record s ▁of ▁the ▁General ▁Assembly , ▁S ixty - first ▁Session , ▁Supplement ▁No . ▁ 3 2 ▁( A / 6 1 / 3 2 ). ▁and ▁the ▁relevant ▁reports ▁of ▁the ▁Secretary - General , A / 6 1 / 1 2 9 ▁and ▁Add . 1 ▁and ▁A / 6 1 / 3 0 0 .\n",
651 | "\n",
652 | "==> UN.en-fr.fr-filtered.fr.subword.train <==\n",
653 | "▁A yant ▁examiné ▁le ▁rapport ▁du ▁Comité ▁des ▁conférences ▁pour ▁ 2 0 0 6 Document s ▁officiels ▁de ▁l ' Assemblée ▁générale , ▁soixante ▁et ▁unième ▁session , ▁S upplément ▁no ▁ 3 2 ▁( A / 6 1 / 3 2 ). ▁et ▁les ▁rapports ▁pertinents ▁du ▁Secrétaire ▁général A / 6 1 / 1 2 9 ▁et ▁Add . 1 ▁et ▁A / 6 1 / 3 0 0 .,\n",
654 | "\n",
655 | "==> UN.en-fr.en-filtered.en.subword.dev <==\n",
656 | "▁ 1 6 . ▁Further ▁reaffirms ▁that ▁preferences ▁granted ▁to ▁developing ▁countries , ▁pursuan t ▁to ▁the ▁\" enabl ing ▁clause \", Decision ▁of ▁the ▁Contract ing ▁Parties ▁of ▁ 2 8 ▁November ▁ 1 9 7 9 ▁( L / 4 9 0 3 ). ▁A vailable ▁from ▁ ht tp :// doc s online . w to . org . ▁should ▁be ▁generalized , ▁non - reciprocal ▁and ▁non - discriminatory ;\n",
657 | "\n",
658 | "==> UN.en-fr.fr-filtered.fr.subword.dev <==\n",
659 | "▁ 1 6 . ▁Réaffirme ▁en ▁outre ▁que ▁les ▁préférences ▁accordées ▁aux ▁pays ▁en ▁développement , ▁conformé ment ▁à ▁la ▁« ▁clause ▁d ' habilitation ▁» Décision ▁L / 4 9 0 3 ▁des ▁Parties ▁contractantes ▁en ▁date ▁du ▁ 2 8 ▁novembre ▁ 1 9 7 9 . ▁D isponible ▁à ▁l ' adresse ▁suivante ▁: ▁http :// docs on line . wt o . org . ▁devraient ▁être ▁généralisées , ▁et ▁n ' être ▁ni ▁réciproque s ▁ni ▁ discriminatoires ▁;\n",
660 | "\n",
661 | "==> UN.en-fr.en-filtered.en.subword.test <==\n",
662 | "▁ 1 1 . ▁Reaffirms ▁the ▁commitments ▁made ▁at ▁the ▁Fourth ▁M inisterial ▁Conference ▁of ▁the ▁World ▁Trade ▁Organization ▁held ▁at ▁Doha ▁and ▁at ▁the ▁Thir d ▁Unit ed ▁Nations ▁Conference ▁on ▁the ▁Lea st ▁Developed ▁Countries , ▁held ▁at ▁Brussels ▁from ▁ 1 4 ▁to ▁ 2 0 ▁May ▁ 2 0 0 1 , See ▁A / CONF . 1 9 1 / 1 1 ▁and ▁ 1 2 . ▁and ▁in ▁this ▁regard ▁calls ▁upon ▁developed ▁countries ▁that ▁have ▁not ▁al ready ▁done ▁so ▁to ▁work ▁towards ▁the ▁objective ▁of ▁duty - free , ▁quota - free ▁market ▁access ▁for ▁all ▁least ▁developed ▁countries ' ▁exports , ▁and ▁notes ▁that ▁consideration ▁of ▁proposals ▁for ▁developing ▁countries ▁to ▁contribute ▁to ▁improved ▁market ▁access ▁for ▁least ▁developed ▁countries ▁would ▁also ▁be ▁helpful ;\n",
663 | "\n",
664 | "==> UN.en-fr.fr-filtered.fr.subword.test <==\n",
665 | "▁ 1 1 . ▁Réaffirme ▁les ▁engagements ▁pris ▁lors ▁de ▁la ▁quatrième ▁Conférence ▁ministérielle ▁de ▁l ' Organisation ▁mondiale ▁du ▁commerce , ▁réunie ▁à ▁Doha , ▁et ▁de ▁la ▁troisième ▁Conférence ▁des ▁Nations ▁Unies ▁sur ▁les ▁pays ▁les ▁moins ▁avancés , ▁tenue ▁à ▁Bruxelles ▁du ▁ 1 4 ▁au ▁ 2 0 ▁mai ▁ 2 0 0 1 Voir ▁A / CONF . 1 9 1 / 1 1 ▁et ▁ 1 2 . ▁et , ▁à ▁ce ▁propos , ▁demande ▁aux ▁pays ▁développés ▁qui ▁ne ▁l ' ont ▁pas ▁encore ▁fait ▁d ' œuvrer ▁pour ▁que ▁tous ▁les ▁produits ▁en ▁provenance ▁des ▁pays ▁les ▁moins ▁avancés ▁aient ▁accès ▁aux ▁marchés ▁en ▁franchise ▁de ▁droits ▁et ▁ hors ▁quota , ▁et ▁note ▁qu ' il ▁serait ▁également ▁utile ▁d ' examiner ▁les ▁propositions ▁tendant ▁à ▁ce ▁que ▁les ▁pays ▁en ▁développement ▁contribuent ▁à ▁faciliter ▁l ' accès ▁des ▁pays ▁les ▁moins ▁avancés ▁aux ▁marchés ▁;\n",
666 | "\n",
667 | "---Last line---\n",
668 | "==> UN.en-fr.en-filtered.en.subword.train <==\n",
669 | "▁ 2 0 . ▁Recalls ▁that ▁the ▁Committee ▁hold s ▁that ▁the ▁prohibition ▁of ▁the ▁dissemination ▁of ▁ideas ▁based ▁on ▁racial ▁superiority ▁or ▁racial ▁hatred ▁is ▁compatib le ▁with ▁the ▁right ▁to ▁freedom ▁of ▁opinion ▁and ▁expression ▁as ▁outlin ed ▁in ▁article ▁ 1 9 ▁of ▁the ▁Universal ▁Declaration ▁of ▁Human ▁Rights ▁and ▁in ▁article ▁ 5 ▁of ▁the ▁Convention ;\n",
670 | "\n",
671 | "==> UN.en-fr.fr-filtered.fr.subword.train <==\n",
672 | "▁ 2 0 . ▁Rappelle ▁que ▁le ▁Comité ▁considère ▁que ▁l ' interdiction ▁de ▁propag er ▁des ▁idées ▁inspirée s ▁par ▁des ▁notion s ▁de ▁supéri orité ▁raciale ▁ou ▁par ▁la ▁haine ▁raciale ▁est ▁compatible ▁avec ▁le ▁droit ▁à ▁la ▁liberté ▁d ' opinion ▁et ▁d ' expression ▁énoncé ▁à ▁l ' article ▁ 1 9 ▁de ▁la ▁Déclaration ▁universelle ▁des ▁droits ▁de ▁l ' homme ▁et ▁à ▁l ' article ▁ 5 ▁de ▁la ▁Convention ▁;\n",
673 | "\n",
674 | "==> UN.en-fr.en-filtered.en.subword.dev <==\n",
675 | "▁ 8 . ▁A lso ▁encourages ▁the ▁involvement ▁of ▁the ▁mass ▁media ▁in ▁education ▁for ▁a ▁culture ▁of ▁peace ▁and ▁non - violence , ▁with ▁particular ▁regard ▁to ▁children ▁and ▁young ▁people , ▁including ▁through ▁the ▁planned ▁expansion ▁of ▁the ▁Culture ▁of ▁Peace ▁New s ▁Network ▁as ▁a ▁global ▁network ▁of ▁Internet ▁sites ▁in ▁ many ▁languages ;\n",
676 | "\n",
677 | "==> UN.en-fr.fr-filtered.fr.subword.dev <==\n",
678 | "▁ 8 . ▁Engage ▁également ▁les ▁médias ▁à ▁participer ▁à ▁l ' éducation ▁en ▁faveur ▁d ' une ▁culture ▁de ▁non - violence ▁et ▁de ▁paix , ▁en ▁particulier ▁en ▁ce ▁qui ▁concerne ▁les ▁enfants ▁et ▁les ▁jeunes , ▁notamment ▁au ▁moyen ▁de ▁l ' élargissement ▁prévu ▁du ▁réseau ▁d ' information ▁relatif ▁à ▁la ▁culture ▁de ▁paix ▁qui ▁de viendrait ▁un ▁réseau ▁mondial ▁de ▁site s ▁Internet ▁multilingue s ▁;\n",
679 | "\n",
680 | "==> UN.en-fr.en-filtered.en.subword.test <==\n",
681 | "▁ 5 . ▁Welcomes ▁the ▁efforts ▁that ▁are ▁being ▁made ▁to ▁improve ▁the ▁utilization ▁of ▁the ▁conference ▁facilities ▁at ▁the ▁Unit ed ▁Nations ▁Office ▁at ▁Nairobi , ▁as ▁set ▁out ▁in ▁the ▁report ▁of ▁the ▁Secretary - General ; A / 5 8 / 5 3 0 .\n",
682 | "\n",
683 | "==> UN.en-fr.fr-filtered.fr.subword.test <==\n",
684 | "▁ 5 ▁Se ▁félicite ▁des ▁efforts ▁qui ▁sont ▁faits ▁pour ▁améliorer ▁le ▁ taux ▁d ' utilisation ▁des ▁installations ▁de ▁conférence ▁à ▁l ' Office ▁des ▁Nations ▁Unies ▁à ▁Nairobi , ▁ainsi ▁qu ' il ▁ressort ▁du ▁rapport ▁du ▁Secrétaire ▁général A / 5 8 / 5 3 0 . ▁;\n"
685 | ]
686 | }
687 | ]
688 | },
689 | {
690 | "cell_type": "markdown",
691 | "metadata": {
692 | "id": "T9GpLuAvIwtT"
693 | },
694 | "source": [
695 | "# Mount your drive to save your data\n",
696 | "\n",
697 | "Click the folder icon to the left, and then click the Google Drive icon.\n",
698 | "\n",
699 | ""
700 | ]
701 | },
702 | {
703 | "cell_type": "code",
704 | "metadata": {
705 | "id": "xZjU1Lx8IxmS"
706 | },
707 | "source": [
708 | "# Copy your data to your Google Drive\n",
709 | "!cp -R /content/nmt/ /content/drive/MyDrive/"
710 | ],
711 | "execution_count": 16,
712 | "outputs": []
713 | }
714 | ]
715 | }
--------------------------------------------------------------------------------
/2-NMT-Training.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "accelerator": "GPU",
6 | "colab": {
7 | "provenance": [],
8 | "machine_shape": "hm",
9 | "mount_file_id": "1QFWTTPHQYjMEMPCqzrS4ayT5Z2nD1WI_",
10 | "authorship_tag": "ABX9TyObE91Y3T+1kwaL2GlL0Zkm",
11 | "include_colab_link": true
12 | },
13 | "kernelspec": {
14 | "display_name": "Python 3",
15 | "name": "python3"
16 | },
17 | "language_info": {
18 | "name": "python"
19 | },
20 | "gpuClass": "standard"
21 | },
22 | "cells": [
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {
26 | "id": "view-in-github",
27 | "colab_type": "text"
28 | },
29 | "source": [
30 | "
"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "metadata": {
36 | "id": "vSUyCs23M_H2"
37 | },
38 | "source": [
39 | "# Install OpenNMT-py 3.x\n",
40 | "!pip3 install OpenNMT-py"
41 | ],
42 | "execution_count": null,
43 | "outputs": []
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "source": [
48 | "# Prepare Your Datasets\n",
49 | "Please make sure you have completed the [first exercise](https://colab.research.google.com/drive/1rsFPnAQu9-_A6e2Aw9JYK3C8mXx9djsF?usp=sharing)."
50 | ],
51 | "metadata": {
52 | "id": "vhgIdJn-cLqu"
53 | }
54 | },
55 | {
56 | "cell_type": "code",
57 | "metadata": {
58 | "id": "dWVOWYedzZ_G"
59 | },
60 | "source": [
61 | "# Open the folder where you saved your prepapred datasets from the first exercise\n",
62 | "# You might need to mount your Google Drive first\n",
63 | "%cd /content/drive/MyDrive/nmt/\n",
64 | "!ls"
65 | ],
66 | "execution_count": null,
67 | "outputs": []
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {
72 | "id": "MPlmhd426B7l"
73 | },
74 | "source": [
75 | "# Create the Training Configuration File\n",
76 | "\n",
77 | "The following config file matches most of the recommended values for the Transformer model [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762). As the current dataset is small, we reduced the following values: \n",
78 | "* `train_steps` - for datasets with a few millions of sentences, consider using a value between 100000 and 200000, or more! Enabling the option `early_stopping` can help stop the training when there is no considerable improvement.\n",
79 | "* `valid_steps` - 10000 can be good if the value `train_steps` is big enough. \n",
80 | "* `warmup_steps` - obviously, its value must be less than `train_steps`. Try 4000 and 8000 values.\n",
81 | "\n",
82 | "Refer to [OpenNMT-py training parameters](https://opennmt.net/OpenNMT-py/options/train.html) for more details. If you are interested in further explanation of the Transformer model, you can check this article, [Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)."
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "metadata": {
88 | "id": "qbW7Xek6UDlY"
89 | },
90 | "source": [
91 | "# Create the YAML configuration file\n",
92 | "# On a regular machine, you can create it manually or with nano\n",
93 | "# Note here we are using some smaller values because the dataset is small\n",
94 | "# For larger datasets, consider increasing: train_steps, valid_steps, warmup_steps, save_checkpoint_steps, keep_checkpoint\n",
95 | "\n",
96 | "config = '''# config.yaml\n",
97 | "\n",
98 | "\n",
99 | "## Where the samples will be written\n",
100 | "save_data: run\n",
101 | "\n",
102 | "# Training files\n",
103 | "data:\n",
104 | " corpus_1:\n",
105 | " path_src: UN.en-fr.fr-filtered.fr.subword.train\n",
106 | " path_tgt: UN.en-fr.en-filtered.en.subword.train\n",
107 | " transforms: [filtertoolong]\n",
108 | " valid:\n",
109 | " path_src: UN.en-fr.fr-filtered.fr.subword.dev\n",
110 | " path_tgt: UN.en-fr.en-filtered.en.subword.dev\n",
111 | " transforms: [filtertoolong]\n",
112 | "\n",
113 | "# Vocabulary files, generated by onmt_build_vocab\n",
114 | "src_vocab: run/source.vocab\n",
115 | "tgt_vocab: run/target.vocab\n",
116 | "\n",
117 | "# Vocabulary size - should be the same as in sentence piece\n",
118 | "src_vocab_size: 50000\n",
119 | "tgt_vocab_size: 50000\n",
120 | "\n",
121 | "# Filter out source/target longer than n if [filtertoolong] enabled\n",
122 | "src_seq_length: 150\n",
123 | "src_seq_length: 150\n",
124 | "\n",
125 | "# Tokenization options\n",
126 | "src_subword_model: source.model\n",
127 | "tgt_subword_model: target.model\n",
128 | "\n",
129 | "# Where to save the log file and the output models/checkpoints\n",
130 | "log_file: train.log\n",
131 | "save_model: models/model.fren\n",
132 | "\n",
133 | "# Stop training if it does not imporve after n validations\n",
134 | "early_stopping: 4\n",
135 | "\n",
136 | "# Default: 5000 - Save a model checkpoint for each n\n",
137 | "save_checkpoint_steps: 1000\n",
138 | "\n",
139 | "# To save space, limit checkpoints to last n\n",
140 | "# keep_checkpoint: 3\n",
141 | "\n",
142 | "seed: 3435\n",
143 | "\n",
144 | "# Default: 100000 - Train the model to max n steps \n",
145 | "# Increase to 200000 or more for large datasets\n",
146 | "# For fine-tuning, add up the required steps to the original steps\n",
147 | "train_steps: 3000\n",
148 | "\n",
149 | "# Default: 10000 - Run validation after n steps\n",
150 | "valid_steps: 1000\n",
151 | "\n",
152 | "# Default: 4000 - for large datasets, try up to 8000\n",
153 | "warmup_steps: 1000\n",
154 | "report_every: 100\n",
155 | "\n",
156 | "# Number of GPUs, and IDs of GPUs\n",
157 | "world_size: 1\n",
158 | "gpu_ranks: [0]\n",
159 | "\n",
160 | "# Batching\n",
161 | "bucket_size: 262144\n",
162 | "num_workers: 0 # Default: 2, set to 0 when RAM out of memory\n",
163 | "batch_type: \"tokens\"\n",
164 | "batch_size: 4096 # Tokens per batch, change when CUDA out of memory\n",
165 | "valid_batch_size: 2048\n",
166 | "max_generator_batches: 2\n",
167 | "accum_count: [4]\n",
168 | "accum_steps: [0]\n",
169 | "\n",
170 | "# Optimization\n",
171 | "model_dtype: \"fp16\"\n",
172 | "optim: \"adam\"\n",
173 | "learning_rate: 2\n",
174 | "# warmup_steps: 8000\n",
175 | "decay_method: \"noam\"\n",
176 | "adam_beta2: 0.998\n",
177 | "max_grad_norm: 0\n",
178 | "label_smoothing: 0.1\n",
179 | "param_init: 0\n",
180 | "param_init_glorot: true\n",
181 | "normalization: \"tokens\"\n",
182 | "\n",
183 | "# Model\n",
184 | "encoder_type: transformer\n",
185 | "decoder_type: transformer\n",
186 | "position_encoding: true\n",
187 | "enc_layers: 6\n",
188 | "dec_layers: 6\n",
189 | "heads: 8\n",
190 | "hidden_size: 512\n",
191 | "word_vec_size: 512\n",
192 | "transformer_ff: 2048\n",
193 | "dropout_steps: [0]\n",
194 | "dropout: [0.1]\n",
195 | "attention_dropout: [0.1]\n",
196 | "'''\n",
197 | "\n",
198 | "with open(\"config.yaml\", \"w+\") as config_yaml:\n",
199 | " config_yaml.write(config)"
200 | ],
201 | "execution_count": 31,
202 | "outputs": []
203 | },
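For reference, the "noam" decay configured above follows the schedule from Vaswani et al. (2017): the learning rate grows linearly over `warmup_steps` and then decays with the inverse square root of the step. A sketch of the formula, assuming `learning_rate` acts as the overall scaling factor:

```python
def noam_lr(step, d_model=512, warmup_steps=1000, factor=2.0):
    # lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises until warmup_steps, then decays.
print(noam_lr(500), noam_lr(1000), noam_lr(3000))
```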
204 | {
205 | "cell_type": "code",
206 | "metadata": {
207 | "id": "vsL4zycvLMUx"
208 | },
209 | "source": [
210 | "# [Optional] Check the content of the configuration file\n",
211 | "!cat config.yaml"
212 | ],
213 | "execution_count": null,
214 | "outputs": []
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {
219 | "id": "F0bcqYkEXhRY"
220 | },
221 | "source": [
222 | "# Build Vocabulary\n",
223 | "\n",
224 | "For large datasets, it is not feasable to use all words/tokens found in the corpus. Instead, a specific set of vocabulary is extracted from the training dataset, usually betweeen 32k and 100k words. This is the main purpose of the vocabulary building step."
225 | ]
226 | },
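Conceptually, vocabulary building just counts token frequencies in the subworded training files and keeps the most frequent entries up to the size limit. A minimal sketch:

```python
from collections import Counter

def build_vocab(path, max_size=50000):
    counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counter.update(line.split())
    # Keep only the most frequent tokens.
    return [token for token, _ in counter.most_common(max_size)]

vocab = build_vocab("UN.en-fr.fr-filtered.fr.subword.train")
print(len(vocab), vocab[:10])
```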
227 | {
228 | "cell_type": "code",
229 | "metadata": {
230 | "id": "AuwltKp_VhnQ",
231 | "colab": {
232 | "base_uri": "https://localhost:8080/"
233 | },
234 | "outputId": "4d9d5e5e-df7b-474b-b281-369424c47603"
235 | },
236 | "source": [
237 | "# Find the number of CPUs/cores on the machine\n",
238 | "!nproc --all"
239 | ],
240 | "execution_count": 7,
241 | "outputs": [
242 | {
243 | "output_type": "stream",
244 | "name": "stdout",
245 | "text": [
246 | "2\n"
247 | ]
248 | }
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "metadata": {
254 | "id": "P2GV1PgyUsJr",
255 | "colab": {
256 | "base_uri": "https://localhost:8080/"
257 | },
258 | "outputId": "e1749505-d1f6-41cd-9420-1876db27e405"
259 | },
260 | "source": [
261 | "# Build Vocabulary\n",
262 | "\n",
263 | "# -config: path to your config.yaml file\n",
264 | "# -n_sample: use -1 to build vocabulary on all the segment in the training dataset\n",
265 | "# -num_threads: change it to match the number of CPUs to run it faster\n",
266 | "\n",
267 | "!onmt_build_vocab -config config.yaml -n_sample -1 -num_threads 2"
268 | ],
269 | "execution_count": 25,
270 | "outputs": [
271 | {
272 | "output_type": "stream",
273 | "name": "stdout",
274 | "text": [
275 | "Corpus corpus_1's weight should be given. We default it to 1 for you.\n",
276 | "[2022-11-19 20:00:16,415 INFO] Counter vocab from -1 samples.\n",
277 | "[2022-11-19 20:00:16,415 INFO] n_sample=-1: Build vocab on full datasets.\n",
278 | "[2022-11-19 20:00:18,967 INFO] * Transform statistics for corpus_1(50.00%):\n",
279 | "\t\t\t* FilterTooLongStats(filtered=2138)\n",
280 | "\n",
281 | "[2022-11-19 20:00:18,976 INFO] * Transform statistics for corpus_1(50.00%):\n",
282 | "\t\t\t* FilterTooLongStats(filtered=2032)\n",
283 | "\n",
284 | "[2022-11-19 20:00:19,035 INFO] Counters src:14705\n",
285 | "[2022-11-19 20:00:19,035 INFO] Counters tgt:11884\n"
286 | ]
287 | }
288 | ]
289 | },
290 | {
291 | "cell_type": "markdown",
292 | "metadata": {
293 | "id": "ncWyNtxiO_Ov"
294 | },
295 | "source": [
296 | "From the **Runtime menu** > **Change runtime type**, make sure that the \"**Hardware accelerator**\" is \"**GPU**\".\n"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "metadata": {
302 | "id": "TMMPeS-pSV8I",
303 | "colab": {
304 | "base_uri": "https://localhost:8080/"
305 | },
306 | "outputId": "ea51133a-beaa-4642-e8ba-7bd9159ada68"
307 | },
308 | "source": [
309 | "# Check if the GPU is active\n",
310 | "!nvidia-smi -L"
311 | ],
312 | "execution_count": 12,
313 | "outputs": [
314 | {
315 | "output_type": "stream",
316 | "name": "stdout",
317 | "text": [
318 | "GPU 0: Tesla T4 (UUID: GPU-1759f39f-df0c-a03f-3066-463f5fec7c38)\n"
319 | ]
320 | }
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "metadata": {
326 | "id": "_3rVQhd4NXNG",
327 | "colab": {
328 | "base_uri": "https://localhost:8080/"
329 | },
330 | "outputId": "181eb6a3-cc09-45b6-de4e-b1e88e45f97b"
331 | },
332 | "source": [
333 | "# Check if the GPU is visable to PyTorch\n",
334 | "\n",
335 | "import torch\n",
336 | "\n",
337 | "print(torch.cuda.is_available())\n",
338 | "print(torch.cuda.get_device_name(0))\n",
339 | "\n",
340 | "gpu_memory = torch.cuda.mem_get_info(0)\n",
341 | "print(\"Free GPU memory:\", gpu_memory[0]/1024**2, \"out of:\", gpu_memory[1]/1024**2)"
342 | ],
343 | "execution_count": 24,
344 | "outputs": [
345 | {
346 | "output_type": "stream",
347 | "name": "stdout",
348 | "text": [
349 | "True\n",
350 | "Tesla T4\n",
351 | "Free GPU memory: 15007.75 out of: 15109.75\n"
352 | ]
353 | }
354 | ]
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "metadata": {
359 | "id": "8aCxETSnXcL-"
360 | },
361 | "source": [
362 | "# Training\n",
363 | "\n",
364 | "Now, start training your NMT model! 🎉 🎉 🎉"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "source": [
370 | "!rm -rf drive/MyDrive/nmt/models/"
371 | ],
372 | "metadata": {
373 | "id": "HZd1o1kIb6Nv"
374 | },
375 | "execution_count": 32,
376 | "outputs": []
377 | },
378 | {
379 | "cell_type": "code",
380 | "metadata": {
381 | "id": "prJCKA2CP-dl"
382 | },
383 | "source": [
384 | "# Train the NMT model\n",
385 | "!onmt_train -config config.yaml"
386 | ],
387 | "execution_count": null,
388 | "outputs": []
389 | },
390 | {
391 | "cell_type": "code",
392 | "source": [
393 | "# For error debugging try:\n",
394 | "# !dmesg -T"
395 | ],
396 | "metadata": {
397 | "id": "XUYAvE8ffK2k"
398 | },
399 | "execution_count": null,
400 | "outputs": []
401 | },
402 | {
403 | "cell_type": "markdown",
404 | "metadata": {
405 | "id": "eShpS01j-Jcp"
406 | },
407 | "source": [
408 | "# Translation\n",
409 | "\n",
410 | "Translation Options:\n",
411 | "* `-model` - specify the last model checkpoint name; try testing the quality of multiple checkpoints\n",
412 | "* `-src` - the subworded test dataset, source file\n",
413 | "* `-output` - give any file name to the new translation output file\n",
414 | "* `-gpu` - GPU ID, usually 0 if you have one GPU. Otherwise, it will translate on CPU, which would be slower.\n",
415 | "* `-min_length` - [optional] to avoid empty translations\n",
416 | "* `-verbose` - [optional] if you want to print translations\n",
417 | "\n",
418 | "Refer to [OpenNMT-py translation options](https://opennmt.net/OpenNMT-py/options/translate.html) for more details."
419 | ]
420 | },
421 | {
422 | "cell_type": "code",
423 | "metadata": {
424 | "id": "MbQEGTj4TybH",
425 | "colab": {
426 | "base_uri": "https://localhost:8080/"
427 | },
428 | "outputId": "b2181f89-47a1-46d3-888c-9facf296bf61"
429 | },
430 | "source": [
431 | "# Translate the \"subworded\" source file of the test dataset\n",
432 | "# Change the model name, if needed.\n",
433 | "!onmt_translate -model models/model.fren_step_3000.pt -src UN.en-fr.fr-filtered.fr.subword.test -output UN.en.translated -gpu 0 -min_length 1"
434 | ],
435 | "execution_count": 35,
436 | "outputs": [
437 | {
438 | "output_type": "stream",
439 | "name": "stdout",
440 | "text": [
441 | "[2022-11-19 21:00:48,673 INFO] PRED SCORE: -0.2185, PRED PPL: 1.24 NB SENTENCES: 2000\n"
442 | ]
443 | }
444 | ]
445 | },
446 | {
447 | "cell_type": "code",
448 | "metadata": {
449 | "id": "XHYihrgfIrIO",
450 | "colab": {
451 | "base_uri": "https://localhost:8080/"
452 | },
453 | "outputId": "980d6dad-3467-4157-a398-8fe05509cb54"
454 | },
455 | "source": [
456 | "# Check the first 5 lines of the translation file\n",
457 | "!head -n 5 UN.en.translated"
458 | ],
459 | "execution_count": 36,
460 | "outputs": [
461 | {
462 | "output_type": "stream",
463 | "name": "stdout",
464 | "text": [
465 | "▁ 1 1 . ▁Reaffirms ▁the ▁commitments ▁made ▁at ▁the ▁Fourth ▁M inisterial ▁Conference ▁of ▁the ▁World ▁Trade ▁Organization , ▁held ▁in ▁Doha , ▁and ▁the ▁Thir d ▁Unit ed ▁Nations ▁Conference ▁on ▁the ▁Lea st ▁Developed ▁Countries , ▁held ▁in ▁Brussels ▁from ▁ 1 4 ▁to ▁ 2 0 ▁May ▁ 2 0 0 1 , See ▁A / CONF . 1 9 1 / 1 1 1 ▁and ▁ 1 2 . ▁and ▁in ▁this ▁regard ▁calls ▁upon ▁developed ▁countries ▁that ▁have ▁not ▁ye t ▁done ▁so ▁to ▁work ▁together ▁to ▁work ▁towards ▁the ▁objective ▁of ▁duty - free ▁and ▁quota - free ▁market ▁access ▁for ▁all ▁least ▁developed ▁countries , ▁and ▁also ▁notes ▁that ▁the ▁proposals ▁for ▁developing ▁countries ▁would ▁be ▁helpful ;\n",
466 | "▁ 1 0 . ▁Recognizes , ▁where ▁appropriate , ▁international ▁cooperation ▁in ▁the ▁field ▁of ▁critical ▁information ▁infrastructures , ▁including ▁the ▁establishment ▁of ▁emergency ▁information ▁and ▁coordination ▁mechanisms , ▁as ▁well ▁as ▁the ▁sharing ▁of ▁experience , ▁information ▁on ▁and ▁ te lecommunications , ▁response ▁and ▁solutions ▁to ▁such ▁incidents , ▁and ▁the ▁identification ▁of ▁incidents ▁in ▁accord ance ▁with ▁domestic ▁law .\n",
467 | "▁Recalling ▁its ▁relevant ▁resolutions , ▁including ▁resolution ▁ 5 8 / 2 9 2 ▁of ▁ 6 ▁May ▁ 2 0 0 4 , ▁as ▁well ▁as ▁th ose ▁adopted ▁at ▁its ▁tenth ▁emergency ▁special ▁session ,\n",
468 | "▁ 1 6 . ▁Further ▁requests ▁the ▁Secretary - General ▁to ▁report ▁to ▁the ▁General ▁Assembly ▁at ▁its ▁ resumed ▁fifty - ninth ▁session ▁on ▁the ▁experience ▁of ▁human ▁resources ▁management ;\n",
469 | "▁ 6 . ▁Encourages ▁all ▁States ▁of ▁the ▁region ▁to ▁foster ▁conditions ▁conduc ive ▁to ▁strengthening ▁mutual ▁confidence - building ▁measures ▁and ▁genuine ▁openness , ▁transparency ▁in ▁all ▁military ▁matters , ▁in ▁particular ▁through ▁the ▁Unit ed ▁Nations ▁system ▁for ▁the ▁standardiz ed ▁reporting ▁of ▁military ▁expenditures ▁and ▁the ▁Unit ed ▁Nations ▁Register ▁of ▁Convention al ▁Arms ▁and ▁L ight ▁Weapons ; See ▁resolution ▁ 4 6 / 3 6 .\n"
470 | ]
471 | }
472 | ]
473 | },
474 | {
475 | "cell_type": "code",
476 | "metadata": {
477 | "id": "zRsJm6UET2C_",
478 | "colab": {
479 | "base_uri": "https://localhost:8080/"
480 | },
481 | "outputId": "f9a410d7-e753-4c43-e5cb-82c62ed00a53"
482 | },
483 | "source": [
484 | "# If needed install/update sentencepiece\n",
485 | "!pip3 install --upgrade -q sentencepiece\n",
486 | "\n",
487 | "# Desubword the translation file\n",
488 | "!python3 MT-Preparation/subwording/3-desubword.py target.model UN.en.translated"
489 | ],
490 | "execution_count": 38,
491 | "outputs": [
492 | {
493 | "output_type": "stream",
494 | "name": "stdout",
495 | "text": [
496 | "(pip install progress bar output omitted)\n",
497 | "Done desubwording! Output: UN.en.translated.desubword\n"
498 | ]
499 | }
500 | ]
501 | },
502 | {
503 | "cell_type": "code",
504 | "metadata": {
505 | "id": "ai4RhhGaKBp1",
506 | "colab": {
507 | "base_uri": "https://localhost:8080/"
508 | },
509 | "outputId": "a833acd0-99d0-48ed-ad63-dac8f91ff4da"
510 | },
511 | "source": [
512 | "# Check the first 5 lines of the desubworded translation file\n",
513 | "!head -n 5 UN.en.translated.desubword"
514 | ],
515 | "execution_count": 39,
516 | "outputs": [
517 | {
518 | "output_type": "stream",
519 | "name": "stdout",
520 | "text": [
521 | "11. Reaffirms the commitments made at the Fourth Ministerial Conference of the World Trade Organization, held in Doha, and the Third United Nations Conference on the Least Developed Countries, held in Brussels from 14 to 20 May 2001,See A/CONF.191/111 and 12. and in this regard calls upon developed countries that have not yet done so to work together to work towards the objective of duty-free and quota-free market access for all least developed countries, and also notes that the proposals for developing countries would be helpful;\n",
522 | "10. Recognizes, where appropriate, international cooperation in the field of critical information infrastructures, including the establishment of emergency information and coordination mechanisms, as well as the sharing of experience, information on and telecommunications, response and solutions to such incidents, and the identification of incidents in accordance with domestic law.\n",
523 | "Recalling its relevant resolutions, including resolution 58/292 of 6 May 2004, as well as those adopted at its tenth emergency special session,\n",
524 | "16. Further requests the Secretary-General to report to the General Assembly at its resumed fifty-ninth session on the experience of human resources management;\n",
525 | "6. Encourages all States of the region to foster conditions conducive to strengthening mutual confidence-building measures and genuine openness, transparency in all military matters, in particular through the United Nations system for the standardized reporting of military expenditures and the United Nations Register of Conventional Arms and Light Weapons;See resolution 46/36.\n"
526 | ]
527 | }
528 | ]
529 | },
530 | {
531 | "cell_type": "code",
532 | "metadata": {
533 | "id": "kOUWB4r3OFOV",
534 | "colab": {
535 | "base_uri": "https://localhost:8080/"
536 | },
537 | "outputId": "c16d31db-0643-4e05-c72a-f3982b4c6ebb"
538 | },
539 | "source": [
540 | "# Desubword the target file (reference) of the test dataset\n",
541 | "# Note: You could also have split the files *before* subwording during dataset preparation, \n",
542 | "# but some datasets have tokenization issues, so this way you are sure the file is really untokenized.\n",
543 | "!python3 MT-Preparation/subwording/3-desubword.py target.model UN.en-fr.en-filtered.en.subword.test"
544 | ],
545 | "execution_count": 40,
546 | "outputs": [
547 | {
548 | "output_type": "stream",
549 | "name": "stdout",
550 | "text": [
551 | "Done desubwording! Output: UN.en-fr.en-filtered.en.subword.test.desubword\n"
552 | ]
553 | }
554 | ]
555 | },
556 | {
557 | "cell_type": "code",
558 | "metadata": {
559 | "id": "0jULN0MwOFeH",
560 | "colab": {
561 | "base_uri": "https://localhost:8080/"
562 | },
563 | "outputId": "c240f715-d7f8-451f-9bf6-1cf344af6810"
564 | },
565 | "source": [
566 | "# Check the first 5 lines of the desubworded reference\n",
567 | "!head -n 5 UN.en-fr.en-filtered.en.subword.test.desubword"
568 | ],
569 | "execution_count": 42,
570 | "outputs": [
571 | {
572 | "output_type": "stream",
573 | "name": "stdout",
574 | "text": [
575 | "11. Reaffirms the commitments made at the Fourth Ministerial Conference of the World Trade Organization held at Doha and at the Third United Nations Conference on the Least Developed Countries, held at Brussels from 14 to 20 May 2001,See A/CONF.191/11 and 12. and in this regard calls upon developed countries that have not already done so to work towards the objective of duty-free, quota-free market access for all least developed countries' exports, and notes that consideration of proposals for developing countries to contribute to improved market access for least developed countries would also be helpful;\n",
576 | "10. Engage in international cooperation, when appropriate, to secure critical information infrastructures, including by developing and coordinating emergency warning systems, sharing and analysing information regarding vulnerabilities, threats and incidents and coordinating investigations of attacks on such infrastructures in accordance with domestic laws.\n",
577 | "Recalling its relevant resolutions, including resolution 58/292 of 6 May 2004, as well as those resolutions adopted at its tenth emergency special session,\n",
578 | "16. Further requests the Secretary-General to report to the General Assembly at its resumed fifty-ninth session on the implications of the experiment for human resources management policies;\n",
579 | "6. Encourages all States of the region to favour the necessary conditions for strengthening the confidence-building measures among them by promoting genuine openness and transparency on all military matters, by participating, inter alia, in the United Nations system for the standardized reporting of military expenditures and by providing accurate data and information to the United Nations Register of Conventional Arms;See resolution 46/36 L.\n"
580 | ]
581 | }
582 | ]
583 | },
584 | {
585 | "cell_type": "markdown",
586 | "metadata": {
587 | "id": "bHMumxqvLDDc"
588 | },
589 | "source": [
590 | "# MT Evaluation\n",
591 | "\n",
592 | "There are several MT evaluation metrics, such as BLEU, TER, METEOR, COMET, and BERTScore.\n",
593 | "\n",
594 | "Here we are using BLEU. Files must be detokenized/desubworded beforehand.\n",
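"\n",
"As an illustration, here is a minimal sketch using the sacrebleu Python API directly (not necessarily identical to the compute-bleu.py script downloaded below; the file names are those produced in the previous steps):\n",
"```\n",
"# Corpus-level BLEU with sacrebleu\n",
"import sacrebleu\n",
"\n",
"# Read the detokenized reference and MT output, one segment per line\n",
"with open(\"UN.en-fr.en-filtered.en.subword.test.desubword\") as f:\n",
"    refs = [line.strip() for line in f]\n",
"with open(\"UN.en.translated.desubword\") as f:\n",
"    hyps = [line.strip() for line in f]\n",
"\n",
"# sacrebleu expects a list of hypotheses and a list of reference streams\n",
"bleu = sacrebleu.corpus_bleu(hyps, [refs])\n",
"print(\"BLEU:\", bleu.score)\n",
"```"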
595 | ]
596 | },
597 | {
598 | "cell_type": "code",
599 | "metadata": {
600 | "id": "w-9XGYnaJ-Nj"
601 | },
602 | "source": [
603 | "# Download the BLEU script\n",
604 | "!wget https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py"
605 | ],
606 | "execution_count": null,
607 | "outputs": []
608 | },
609 | {
610 | "cell_type": "code",
611 | "metadata": {
612 | "id": "rYDG0x0KLk_O"
613 | },
614 | "source": [
615 | "# Install sacrebleu\n",
616 | "!pip3 install sacrebleu"
617 | ],
618 | "execution_count": null,
619 | "outputs": []
620 | },
621 | {
622 | "cell_type": "code",
623 | "metadata": {
624 | "id": "W3V3tZphTzK9",
625 | "colab": {
626 | "base_uri": "https://localhost:8080/"
627 | },
628 | "outputId": "c2a85bb9-9a25-420e-98fd-800a554aae79"
629 | },
630 | "source": [
631 | "# Evaluate the translation (without subwording)\n",
632 | "!python3 compute-bleu.py UN.en-fr.en-filtered.en.subword.test.desubword UN.en.translated.desubword"
633 | ],
634 | "execution_count": 45,
635 | "outputs": [
636 | {
637 | "output_type": "stream",
638 | "name": "stdout",
639 | "text": [
640 | "Reference 1st sentence: 11. Reaffirms the commitments made at the Fourth Ministerial Conference of the World Trade Organization held at Doha and at the Third United Nations Conference on the Least Developed Countries, held at Brussels from 14 to 20 May 2001,See A/CONF.191/11 and 12. and in this regard calls upon developed countries that have not already done so to work towards the objective of duty-free, quota-free market access for all least developed countries' exports, and notes that consideration of proposals for developing countries to contribute to improved market access for least developed countries would also be helpful;\n",
641 | "MTed 1st sentence: 11. Reaffirms the commitments made at the Fourth Ministerial Conference of the World Trade Organization, held in Doha, and the Third United Nations Conference on the Least Developed Countries, held in Brussels from 14 to 20 May 2001,See A/CONF.191/111 and 12. and in this regard calls upon developed countries that have not yet done so to work together to work towards the objective of duty-free and quota-free market access for all least developed countries, and also notes that the proposals for developing countries would be helpful;\n",
642 | "BLEU: 63.13825360368016\n"
643 | ]
644 | }
645 | ]
646 | },
647 | {
648 | "cell_type": "markdown",
649 | "metadata": {
650 | "id": "IBi1PhRv4bX9"
651 | },
652 | "source": [
653 | "# More Features and Directions to Explore\n",
654 | "\n",
655 | "Experiment with the following ideas:\n",
656 | "* Increase `train_steps` and see to what extent new checkpoints provide better translations, in terms of both BLEU and your own human evaluation.\n",
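"For example, a sketch of the relevant keys in `config.yaml` (the step counts here are arbitrary; pick values that suit your data size):\n",
"```\n",
"train_steps: 5000\n",
"valid_steps: 1000\n",
"save_checkpoint_steps: 1000\n",
"```\n",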
657 | "\n",
658 | "* Check MT evaluation metrics other than BLEU, such as [TER](https://github.com/mjpost/sacrebleu#ter), [WER](https://blog.machinetranslation.io/compute-wer-score/), [METEOR](https://blog.machinetranslation.io/compute-bleu-score/#meteor), [COMET](https://github.com/Unbabel/COMET), and [BERTScore](https://github.com/Tiiiger/bert_score). What are the conceptual differences between them? Are there special cases that call for a specific metric?\n",
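"For example, a minimal sketch computing TER with sacrebleu (assuming the same reference and MT-output files as in the evaluation step above):\n",
"```\n",
"from sacrebleu.metrics import TER\n",
"\n",
"with open(\"UN.en-fr.en-filtered.en.subword.test.desubword\") as f:\n",
"    refs = [line.strip() for line in f]\n",
"with open(\"UN.en.translated.desubword\") as f:\n",
"    hyps = [line.strip() for line in f]\n",
"\n",
"print(TER().corpus_score(hyps, [refs]))  # lower TER is better\n",
"```\n",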
659 | "\n",
660 | "* If training stopped and you want to resume it, continue from the last model checkpoint using the `-train_from` option. In this case, `train_steps` in the config file should be larger than the step count of the checkpoint you train from.\n",
661 | "```\n",
662 | "!onmt_train -config config.yaml -train_from models/model.fren_step_3000.pt\n",
663 | "```\n",
664 | "\n",
665 | "* **Ensemble Decoding:** During translation, instead of passing one model/checkpoint to the `-model` argument, pass multiple checkpoints, e.g. the last two. Does it improve translation quality? Does it affect translation speed?\n",
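"For example (a sketch; the checkpoint and file names here are placeholders, so adjust them to your own):\n",
"```\n",
"!onmt_translate -model models/model.fren_step_2000.pt models/model.fren_step_3000.pt \\\n",
"                -src UN.en-fr.fr-filtered.fr.subword.test \\\n",
"                -output UN.en.translated.ensemble -gpu 0\n",
"```\n",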
666 | "\n",
667 | "* **Averaging Models:** Try to average multiple models into one model using the [average_models.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/bin/average_models.py) script, and see how this affects translation quality.\n",
668 | "```\n",
669 | "python3 average_models.py -models model_step_xxx.pt model_step_yyy.pt -output model_avg.pt\n",
670 | "```\n",
671 | "* **Release the model:** Try this command and see how it reduces the model size.\n",
672 | "```\n",
673 | "onmt_release_model --model \"model.pt\" --output \"model_released.pt\"\n",
674 | "```\n",
675 | "* **Use CTranslate2:** For efficient translation, consider using [CTranslate2](https://github.com/OpenNMT/CTranslate2), a fast inference engine. Check out an [example](https://gist.github.com/ymoslem/60e1d1dc44fe006f67e130b6ad703c4b).\n",
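"A minimal sketch of the typical workflow (the model, directory, and token names are illustrative placeholders):\n",
"```\n",
"# Convert the OpenNMT-py checkpoint to the CTranslate2 format\n",
"!ct2-opennmt-py-converter --model_path models/model.fren_step_3000.pt --output_dir ct2_model\n",
"\n",
"import ctranslate2\n",
"translator = ctranslate2.Translator(\"ct2_model\")\n",
"# Input must be subworded tokens, exactly as in training\n",
"results = translator.translate_batch([[\"▁Bon\", \"jour\"]])\n",
"print(results[0].hypotheses[0])\n",
"```\n",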
676 | "\n",
677 | "* **Work on low-resource languages:** Find out more details about [how to train NMT models for low-resource languages](https://blog.machinetranslation.io/low-resource-nmt/).\n",
678 | "\n",
679 | "* **Train a multilingual model:** Find out helpful notes about [training multilingual models](https://blog.machinetranslation.io/multilingual-nmt).\n",
680 | "\n",
681 | "* **Publish a demo:** Show off your work through a [simple demo with CTranslate2 and Streamlit](https://blog.machinetranslation.io/nmt-web-interface/).\n"
682 | ]
683 | }
684 | ]
685 | }
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Yasmin Moslem
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # OpenNMT-py Tutorial
2 | Neural Machine Translation (NMT) tutorial with [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py). Data preprocessing, model training, evaluation, and deployment.
3 |
4 | ## Fundamentals
5 | * Data Processing ([notebook](1-NMT-Data-Processing.ipynb) | [code](https://github.com/ymoslem/MT-Preparation))
6 | * NMT Model Training with OpenNMT-py ([notebook](2-NMT-Training.ipynb))
7 | * Translation/Inference with CTranslate2 ([code](https://gist.github.com/ymoslem/60e1d1dc44fe006f67e130b6ad703c4b))
8 | * MT Evaluation with BLEU and other metrics ([tutorial](https://blog.machinetranslation.io/compute-bleu-score/) | [code](https://github.com/ymoslem/MT-Evaluation) | [notebook](https://github.com/ymoslem/Adaptive-MT-LLM/blob/main/evaluation/Evaluation.ipynb))
9 | * Simple Web UI ([tutorial](https://blog.machinetranslation.io/nmt-web-interface/) | [code](https://github.com/ymoslem/OpenNMT-Web-Interface))
10 |
11 | ## Advanced Topics
12 | * Running TensorBoard with OpenNMT ([tutorial](https://blog.machinetranslation.io/TensorBoard/))
13 | * Low-Resource Neural Machine Translation ([tutorial](https://blog.machinetranslation.io/low-resource-nmt/))
14 | * Domain Adaptation with Mixed Fine-tuning ([tutorial](https://blog.machinetranslation.io/domain-adaptation-mixed-fine-tuning/))
15 | * Overview of Domain Adaptation Techniques ([tutorial](https://amtaweb.org/wp-content/uploads/2020/11/NMTDomainAdaptationTechniques.pdf))
16 | * Multilingual Machine Translation ([tutorial](https://blog.machinetranslation.io/multilingual-nmt/))
17 | * Using Pre-trained NMT models with CTranslate2 ([M2M-100](https://gist.github.com/ymoslem/a414a0ead0d3e50f4d7ff7110b1d1c0d) | [NLLB-200](https://github.com/ymoslem/Adaptive-MT-LLM/blob/main/MT/NLLB.ipynb))
18 | * Domain-Specific Text Generation for Machine Translation ([paper](https://aclanthology.org/2022.amta-research.2/) | [article](https://blog.machinetranslation.io/synthetic-data-machine-translation/) | [code](https://github.com/ymoslem/MT-LM))
19 | * Adaptive Machine Translation with Large Language Models ([paper](https://aclanthology.org/2023.eamt-1.22/) | [code](https://github.com/ymoslem/Adaptive-MT-LLM))
20 | * Fine-tuning Large Language Models for Adaptive Machine Translation ([paper](https://arxiv.org/abs/2312.12740) | [code](https://github.com/ymoslem/Adaptive-MT-LLM-Fine-tuning))
21 |
--------------------------------------------------------------------------------