├── .gitignore ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Yong Shan 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

A collection of transformer guides, implementations and related resources, for those who want to use the transformer as a baseline in their research or simply reproduce the paper's reported performance.

Please feel free to open pull requests or report issues.

* [Why this project](#why-this-project)
* [Papers](#papers)
  * [NMT Basic](#nmt-basic)
  * [Transformer original paper](#transformer-original-paper)
* [Implementations & How to reproduce paper's result](#implementations--how-to-reproduce-papers-result)
  * [Minimal, paper-equivalent but not certainly performance-reproducible implementations(both PyTorch implementations)](#minimal-paper-equivalent-but-not-certainly-performance-reproducible-implementationsboth-pytorch-implementations)
  * [Complex, performance-reproducible implementations](#complex-performance-reproducible-implementations)
    * [Paper's original implementation: tensor2tensor(using TensorFlow)](#papers-original-implementation-tensor2tensorusing-tensorflow)
      * [Code](#code)
      * [Code annotation](#code-annotation)
      * [Steps to reproduce WMT14 English-German result:](#steps-to-reproduce-wmt14-english-german-result)
      * [Resources](#resources)
    * [Harvard NLP Group's implementation: OpenNMT-py(using PyTorch)](#harvard-nlp-groups-implementation-opennmt-pyusing-pytorch)
      * [Code](#code-1)
      * [Steps to reproduce WMT14 English-German result:](#steps-to-reproduce-wmt14-english-german-result-1)
      * [Resources](#resources-1)
    * [FAIR's implementation: fairseq-py(using PyTorch)](#fairs-implementation-fairseq-pyusing-pytorch)
      * [Code](#code-2)
      * [Steps to reproduce WMT14 English-German result:](#steps-to-reproduce-wmt14-english-german-result-2)
      * [Resources](#resources-2)
  * [Complex, not certainly performance-reproducible implementations](#complex-not-certainly-performance-reproducible-implementations)
* [Training tips](#training-tips)
* [Further](#further)
* [Contributors](#contributors)

## Why this project

The transformer is a powerful model for sequence-to-sequence learning. However, when we used it as a baseline in our NMT research, we could not find a good, reliable guide for reproducing results close to those reported in the original paper (not even with the official tensor2tensor implementation), which made our comparisons hard to trust. We therefore collected the major implementations, worked out a performance-reproducible recipe for each, and gathered related materials, which eventually formed this project.

## Papers

### NMT Basic
- seq2seq model: [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)
- seq2seq & attention: [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- refined attention: [Effective Approaches to Attention-based Neural Machine Translation](http://arxiv.org/abs/1508.04025)
- seq2seq using CGRU: [DL4MT](https://github.com/nyu-dl/dl4mt-tutorial)
- GNMT: [Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation](https://arxiv.org/abs/1609.08144)
- bytenet: [Neural Machine Translation in Linear Time](https://arxiv.org/abs/1610.10099)
- convolutional NMT: [Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122)
- bpe: [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909)
- word piece: [Japanese and Korean Voice Search](https://ieeexplore.ieee.org/document/6289079/)
- self attention paper: [A Structured Self-attentive Sentence Embedding](https://arxiv.org/abs/1703.03130)

### Transformer original paper

- [Attention is All You Need](https://arxiv.org/abs/1706.03762)

## Implementations & How to reproduce paper's result

There are lots of transformer implementations on the Internet; to keep the learning curve manageable, we only include **the most valuable** projects here.

>**[Note]**: The original transformer paper reports two results, *WMT14 English-German* and *WMT14 English-French*:
>![transformer result](https://i.loli.net/2018/11/03/5bdd701614ba1.png)
>Here we regard an implementation as performance-reproducible **if there is a known recipe for reproducing the WMT14 English-German BLEU score**, and for each such implementation we give the corresponding recipe for the *WMT14 English-German* result.

### Minimal, paper-equivalent but not certainly performance-reproducible implementations(both *PyTorch* implementations)

1. attention-is-all-you-need-pytorch

   - [code](https://github.com/jadore801120/attention-is-all-you-need-pytorch)

2. Harvard NLP Group's annotation

   - [code](http://nlp.seas.harvard.edu/2018/04/03/attention.html)

### Complex, performance-reproducible implementations

The original implementation runs on **8 GPUs**: each GPU loads one batch, and after the forward pass the losses of the 8 batches are summed before the backward pass. If you **only have 1 GPU**, you can imitate this by **accumulating the loss over every 8 batches** before running the backward pass. **You should choose `gpu_count`, `tokens_on_each_gpu` and `gradient_accumulation_count` so that `gpu_count * tokens_on_each_gpu * gradient_accumulation_count = 4096 * 8`** (about 32k target tokens per update), e.g. 1 GPU × 4096 tokens × 8 accumulation steps, or 4 GPUs × 4096 tokens × 2 accumulation steps. See each implementation's guide for details; a short PyTorch sketch of the idea follows below, just before the tensor2tensor subsection.

Although the original paper used `multi-bleu.perl` to evaluate BLEU, we recommend [sacrebleu](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu), which should be equivalent to `mteval-v13a.pl` but is more convenient. Report its signature, e.g. `BLEU+case.mixed+lang.de-en+test.wmt17 = 32.97 66.1/40.2/26.6/18.1 (BP = 0.980 ratio = 0.980 hyp_len = 63134 ref_len = 64399)`, so that others can reproduce your scoring setup.

```
# calculate lowercase bleu on tokenized text
cat model_prediction | sacrebleu -tok none -lc ground_truth
# calculate lowercase bleu on tokenized text when you have 3 references
cat model_prediction | sacrebleu -tok none -lc ground_truth_1 ground_truth_2 ground_truth_3
# calculate lowercase bleu on untokenized text using v13a tokenization
cat model_prediction | sacrebleu -tok 13a -lc ground_truth
# calculate lowercase bleu on untokenized text using v14 (international) tokenization
cat model_prediction | sacrebleu -tok intl -lc ground_truth
```

The transformer paper's original model settings can be found in [tensor2tensor transformer.py](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py). For example, the base model configuration is defined in the `transformer_base` function.

[OpenNMT-tf](https://github.com/OpenNMT/OpenNMT-tf/tree/master/scripts/wmt) also provides a reproducible recipe, but if we have to use TensorFlow we prefer tensor2tensor as the baseline for reproducing the paper's result, since it is the official implementation.
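
As a concrete reference for the gradient-accumulation trick above, here is a minimal, self-contained PyTorch sketch. It is an illustration only, not taken from any of the toolkits below; the dummy model, random batches and `accum_count` value are placeholders. Each toolkit exposes the same idea through its own switch, e.g. `--update-freq` in fairseq, `-accum_count` in OpenNMT-py, or the `MultistepAdam` hparams in tensor2tensor.

```python
# Minimal gradient-accumulation sketch (illustration only).
# A dummy linear model and random tensors stand in for a real transformer and data loader.
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                       # placeholder for a transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

accum_count = 8                                   # 8 for 1 GPU, 2 for 4 GPUs, ...
optimizer.zero_grad()
for step in range(100):                           # stands in for the real batch loop
    src = torch.randn(32, 512)                    # fake batch
    tgt = torch.randn(32, 512)
    loss = criterion(model(src), tgt)
    (loss / accum_count).backward()               # gradients keep accumulating in .grad
    if (step + 1) % accum_count == 0:             # one optimizer update per accum_count batches,
        optimizer.step()                          # imitating a synchronous 8-GPU update
        optimizer.zero_grad()
```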

#### Paper's original implementation: tensor2tensor(using *TensorFlow*)

##### Code

- [tensor2tensor](https://github.com/tensorflow/tensor2tensor)

##### Code annotation

- [Why the “Transformer” Is So Powerful: A Complete Walkthrough of Google's Tensor2Tensor System, from Model to Code](https://cloud.tencent.com/developer/article/1153079) (in Chinese only; corresponds to tensor2tensor v1.6.3)

##### Steps to reproduce WMT14 English-German result:

**(updated on v1.10.0)**

```shell
# 1. Install the tensor2tensor toolkit
pip install tensor2tensor

# 2. Basic config
# For a BPE model use this problem
PROBLEM=translate_ende_wmt_bpe32k
MODEL=transformer
HPARAMS=transformer_base
# or use transformer_big to reproduce the big model
# HPARAMS=transformer_big
DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# 3. Download and preprocess the corpus
# Note that tensor2tensor has a built-in tokenizer
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM

# 4. Train on 8 GPUs (with --worker_gpu=8 if the GPUs are on a single machine).
#    You'll get close to the expected performance after ~250k steps and the full
#    expected performance after ~500k steps.
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=600000

# 5. Translate
DECODE_FILE=$TMP_DIR/newstest2014.tok.bpe.32000.en
BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE \
  --decode_to_file=$TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode

# 6. Remove BPE
cat $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode | sed 's/@@ //g' > $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode.debpe
# Do compound splitting on the translation
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode.debpe > $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode.debpe.atat
# Do the same compound splitting on the ground truth and then score BLEU
# ...
```

**Note that step 6 is only the postprocessing.** For historical reasons, Google splits compound words before computing the final BLEU score, which brings a moderate increase; see [get_ende_bleu.sh](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/get_ende_bleu.sh) for details.

If you only have 1 GPU, you can use the `transformer_base_multistep8` hparams set to imitate 8 GPUs.

![transformer_base_multistep8](https://i.loli.net/2018/11/03/5bdd6a22ae29a.png)

You can also modify the `transformer_base_multistep8` function to accumulate gradients as many times as you want. Here is an example that uses 4 GPUs to run the transformer big model; note that `hparams.optimizer_multistep_accumulate_steps = 2`, since with 4 GPUs we only need to accumulate gradients twice.

```python
@registry.register_hparams
def transformer_base_multistep8():
  """HParams for simulating 8 GPUs with the MultistepAdam optimizer."""
  # modified: start from the big model and accumulate over 2 steps,
  # so that 4 physical GPUs give the same effective batch as 8
  hparams = transformer_big()
  hparams.optimizer = "MultistepAdam"
  hparams.optimizer_multistep_accumulate_steps = 2
  return hparams
```

##### Resources
- [t2t issue 539](https://github.com/tensorflow/tensor2tensor/issues/539)
- [t2t issue 444](https://github.com/tensorflow/tensor2tensor/issues/444)
- [t2t issue 317](https://github.com/tensorflow/tensor2tensor/issues/317)
- [Tensor2Tensor for Neural Machine Translation](https://arxiv.org/abs/1803.07416)

#### Harvard NLP Group's implementation: OpenNMT-py(using *PyTorch*)

##### Code

- [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py)

##### Steps to reproduce WMT14 English-German result:

**(updated on v0.5.0)**

For the meaning of the command-line arguments, see the [OpenNMT-py doc](http://opennmt.net/OpenNMT-py/main.html) or [OpenNMT-py opts.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/opts.py).

1. Download the [corpus preprocessed by OpenNMT](https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz) and the [sentencepiece model prepared by OpenNMT](https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp_model.tar.gz). Note that the preprocessing includes tokenization and a BPE/word-piece step (here using Google's [sentencepiece](https://github.com/google/sentencepiece), which implements the word-piece algorithm); see the [OpenNMT-tf script](https://github.com/OpenNMT/OpenNMT-tf/blob/master/scripts/wmt/prepare_data.sh) for details.

2. Preprocess. Because English and German are similar languages, we use `-share_vocab` to share the vocabulary between source and target; you don't need this flag for distant language pairs such as Chinese-English. We also use a maximum sequence length of `100`, which covers almost all sentences given the sentence-length distribution of the corpus.
For example:

```shell
python preprocess.py \
    -train_src ../wmt-en-de/train.en.shuf \
    -train_tgt ../wmt-en-de/train.de.shuf \
    -valid_src ../wmt-en-de/valid.en \
    -valid_tgt ../wmt-en-de/valid.de \
    -save_data ../wmt-en-de/processed \
    -src_seq_length 100 \
    -tgt_seq_length 100 \
    -max_shard_size 200000000 \
    -share_vocab
```

3. Train. For example, if you only have 4 GPUs:
```shell
python train.py -data /tmp/de2/data -save_model /tmp/extra \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 4 -gpu_ranks 0 1 2 3
```

Note that `-accum_count` means the loss of every `N` batches is accumulated before the backward pass, so it is 2 for 4 GPUs, 4 for 2 GPUs, 8 for 1 GPU, and so on.

4. Translate. For example (you can increase `-batch_size`, default `30`, to speed up translation):

```shell
python translate.py -gpu 0 -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 -length_penalty wu -coverage_penalty wu \
    -share_vocab vocab_file -max_length 200 -model model_file -src newstest2014.en.32kspe -output model.pred -verbose
```

Note that the test set in the corpus preprocessed by OpenNMT is newstest2017, while the original paper uses newstest2014, which may be a mistake. To obtain the newstest2014 test set as in the paper, we can encode `newstest2014.en` manually with sentencepiece; the sentencepiece model file is in the archive downloaded in step 1.

```shell
spm_encode --model=<sentencepiece_model_file> --output_format=piece < newstest2014.en > newstest2014.en.32kspe
```

5. Detokenize. Since the training data is processed by [sentencepiece](https://github.com/google/sentencepiece), the translation from step 4 is still sentencepiece-encoded, so we need a decoding step to obtain a plain, detokenized prediction.
For example:

```shell
spm_decode --model=<sentencepiece_model_file> --input_format=piece < input > output
```

6. Postprocess. Score the detokenized prediction with sacrebleu as described above, or apply the same compound splitting as in the tensor2tensor section to match the paper's tokenized BLEU (a short scoring sketch is given after the fairseq section below).

There is also a [BPE version](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) of the WMT'16 EN-DE corpus preprocessed by Google; see [subword-nmt](https://github.com/rsennrich/subword-nmt) for BPE encoding and decoding.

##### Resources

- [OpenNMT-py FAQ](http://opennmt.net/OpenNMT-py/FAQ.html)
- ~~[OpenNMT-py issue](https://github.com/OpenNMT/OpenNMT-py/issues/637) (deprecated)~~
- [OpenNMT: Open-Source Toolkit for Neural Machine Translation](https://arxiv.org/abs/1701.02810)

#### FAIR's implementation: fairseq-py(using *PyTorch*)

##### Code

- [fairseq-py](https://github.com/pytorch/fairseq/)

##### Steps to reproduce WMT14 English-German result:

**(updated on commit `7e60d45`)**

For the meaning of the arguments, see the [doc](https://fairseq.readthedocs.io/en/latest/command_line_tools.html). Note that `--update-freq` makes training accumulate the loss of every `N` batches before the backward pass, so it is `8` for 1 GPU, `2` for 4 GPUs, and so on.

1. Download [the preprocessed WMT'16 EN-DE data provided by Google](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) and extract it.

```
TEXT=wmt16_en_de_bpe32k
mkdir $TEXT
tar -xzvf wmt16_en_de.tar.gz -C $TEXT
```

2. Preprocess the dataset with a joined dictionary:

```
python preprocess.py --source-lang en --target-lang de \
  --trainpref $TEXT/train.tok.clean.bpe.32000 \
  --validpref $TEXT/newstest2013.tok.bpe.32000 \
  --testpref $TEXT/newstest2014.tok.bpe.32000 \
  --destdir data-bin/wmt16_en_de_bpe32k \
  --joined-dictionary
```

3. Train. For a base model:

```
# train for about 180k steps
python train.py data-bin/wmt16_en_de_bpe32k \
  --arch transformer_wmt_en_de --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.0007 --min-lr 1e-09 \
  --weight-decay 0.0 --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 --max-tokens 4096 --update-freq 2 \
  --no-progress-bar --log-format json --log-interval 10 --save-interval-updates 1000 \
  --keep-interval-updates 5
# average the last 5 checkpoints
modelfile=checkpoints
python scripts/average_checkpoints.py --inputs $modelfile --num-update-checkpoints 5 \
  --output $modelfile/average-model.pt
```

For a big model:
```
# train for about 270k steps
python train.py data-bin/wmt16_en_de_bpe32k \
  --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.0005 --min-lr 1e-09 \
  --weight-decay 0.0 --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 --max-tokens 4096 --update-freq 2 \
  --no-progress-bar --log-format json --log-interval 10 --save-interval-updates 1000 \
  --keep-interval-updates 20
# average the last 20 checkpoints
modelfile=checkpoints
python scripts/average_checkpoints.py --inputs $modelfile --num-update-checkpoints 20 \
  --output $modelfile/average-model.pt
```

4. Inference
```
model=average-model.pt
subset=test
python generate.py data-bin/wmt16_en_de_bpe32k --path $modelfile/$model \
  --gen-subset $subset --beam 4 --batch-size 128 --remove-bpe --lenpen 0.6 > pred.de
# fairseq's output is not in the original sentence order, so recover it;
# write the result to a new file instead of overwriting pred.de inside the same pipeline
grep ^H pred.de | cut -f1,3- | cut -c3- | sort -k1n | cut -f2- > pred.sorted.de
```

5. Postprocess. Apply the same compound splitting as in the tensor2tensor section before scoring; a short sketch follows the resource list below.

##### Resources

- [fairseq-py example](https://github.com/pytorch/fairseq/tree/master/examples/translation)
- [fairseq-py issue](https://github.com/pytorch/fairseq/issues/202). The corpus problem described in the issue has since been fixed, so the instructions above can be followed directly.
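
Here is a minimal sketch of the postprocessing referenced in step 5 above (and in step 6 of the OpenNMT-py recipe). It only reuses the compound-splitting and sacrebleu commands already shown in this README; `pred.sorted.de` is the reordered output from step 4, and `ref.tok.de` is a placeholder for your tokenized newstest2014 German reference, so adjust the file names and tokenization options to your own setup.

```shell
# 1. Split compound words in both the hypothesis and the reference,
#    as get_ende_bleu.sh does (see the tensor2tensor section)
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < pred.sorted.de > pred.sorted.de.atat
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < ref.tok.de > ref.tok.de.atat
# 2. Score on the already-tokenized, compound-split text; this is roughly comparable
#    to the paper's multi-bleu.perl setup. Report the sacrebleu signature as well.
cat pred.sorted.de.atat | sacrebleu -tok none ref.tok.de.atat
```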

### Complex, not certainly performance-reproducible implementations

- [Marian](https://github.com/marian-nmt/marian-examples/tree/master/transformer) (a pure C++ implementation, with no deep-learning framework)

## Training tips

- [Training Tips for the Transformer Model](https://arxiv.org/abs/1804.00247)

## Further

- RNMT+: [The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation](https://arxiv.org/abs/1804.09849)
- [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187)
- Turing-complete transformer: [Universal Transformers](https://arxiv.org/abs/1807.03819)
- [Self-Attention with Relative Position Representations](https://arxiv.org/abs/1803.02155)
- [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

## Contributors

This project is developed and maintained by the Natural Language Processing Group, ICT/CAS.

- [Yong Shan](https://github.com/SkyAndCloud)
- [Jinchao Zhang](https://github.com/zhangjcqq)
- Shuhao Gu
--------------------------------------------------------------------------------