├── .gitignore ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Yong Shan 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

A collection of transformer guides, implementations and related resources, for those who want to use the transformer as a baseline in their research or simply reproduce the paper's reported performance.

Please feel free to open pull requests or report issues.

* [Why this project](#why-this-project)
* [Papers](#papers)
  * [NMT Basic](#nmt-basic)
  * [Transformer original paper](#transformer-original-paper)
* [Implementations & How to reproduce paper's result](#implementations--how-to-reproduce-papers-result)
  * [Minimal, paper-equivalent but not certainly performance-reproducible implementations(both PyTorch implementations)](#minimal-paper-equivalent-but-not-certainly-performance-reproducible-implementationsboth-pytorch-implementations)
  * [Complex, performance-reproducible implementations](#complex-performance-reproducible-implementations)
    * [Paper's original implementation: tensor2tensor(using TensorFlow)](#papers-original-implementation-tensor2tensorusing-tensorflow)
      * [Code](#code)
      * [Code annotation](#code-annotation)
      * [Steps to reproduce WMT14 English-German result:](#steps-to-reproduce-wmt14-english-german-result)
      * [Resources](#resources)
    * [Harvard NLP Group's implementation: OpenNMT-py(using PyTorch)](#harvard-nlp-groups-implementation-opennmt-pyusing-pytorch)
      * [Code](#code-1)
      * [Steps to reproduce WMT14 English-German result:](#steps-to-reproduce-wmt14-english-german-result-1)
      * [Resources](#resources-1)
    * [FAIR's implementation: fairseq-py(using PyTorch)](#fairs-implementation-fairseq-pyusing-pytorch)
      * [Code](#code-2)
      * [Steps to reproduce WMT14 English-German result:](#steps-to-reproduce-wmt14-english-german-result-2)
      * [Resources](#resources-2)
  * [Complex, not certainly performance-reproducible implementations](#complex-not-certainly-performance-reproducible-implementations)
* [Training tips](#training-tips)
* [Further](#further)
* [Contributors](#contributors)

## Why this project

The transformer is a powerful model for sequence-to-sequence learning. However, when we used it as a baseline in our NMT research, we could not find a good, reliable guide for reproducing results close to those reported in the original paper (not even with the official tensor2tensor implementation), which made our comparisons hard to trust. We therefore collected the major implementations, worked out a performance-reproducible recipe for each, and gathered related materials, which eventually formed this project.

## Papers

### NMT Basic
- seq2seq model: [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)
- seq2seq & attention: [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- refined attention: [Effective Approaches to Attention-based Neural Machine Translation](http://arxiv.org/abs/1508.04025)
- seq2seq using CGRU: [DL4MT](https://github.com/nyu-dl/dl4mt-tutorial)
- GNMT: [Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation](https://arxiv.org/abs/1609.08144)
- bytenet: [Neural Machine Translation in Linear Time](https://arxiv.org/abs/1610.10099)
- convolutional NMT: [Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122)
- bpe: [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909)
- word piece: [Japanese and Korean Voice Search](https://ieeexplore.ieee.org/document/6289079/)
- self attention paper: [A Structured Self-attentive Sentence Embedding](https://arxiv.org/abs/1703.03130)

### Transformer original paper

- [Attention is All You Need](https://arxiv.org/abs/1706.03762)

## Implementations & How to reproduce paper's result

There are lots of transformer implementations on the Internet; to keep the learning curve manageable, we only include **the most valuable** projects here.

>**[Note]**: The original transformer paper reports two results, *WMT14 English-German* and *WMT14 English-French*:
>![transformer result](https://i.loli.net/2018/11/03/5bdd701614ba1.png)
>Here we regard an implementation as performance-reproducible **if there is a known recipe for reproducing the WMT14 English-German BLEU score**, and for each such implementation we give the corresponding recipe for the *WMT14 English-German* result.

### Minimal, paper-equivalent but not certainly performance-reproducible implementations(both *PyTorch* implementations)

1. attention-is-all-you-need-pytorch

   - [code](https://github.com/jadore801120/attention-is-all-you-need-pytorch)

2. Harvard NLP Group's annotation

   - [code](http://nlp.seas.harvard.edu/2018/04/03/attention.html)

### Complex, performance-reproducible implementations

The original implementation runs on **8 GPUs**: each GPU loads one batch, and after the forward pass the losses of the 8 batches are summed before the backward pass. If you **only have 1 GPU**, you can imitate this by **accumulating the loss over every 8 batches** before running the backward pass. **You should choose `gpu_count`, `tokens_on_each_gpu` and `gradient_accumulation_count` so that `gpu_count * tokens_on_each_gpu * gradient_accumulation_count = 4096 * 8`** (about 32k target tokens per update), e.g. 1 GPU × 4096 tokens × 8 accumulation steps, or 4 GPUs × 4096 tokens × 2 accumulation steps. See each implementation's guide for details; a short PyTorch sketch of the idea follows below, just before the tensor2tensor subsection.

Although the original paper used `multi-bleu.perl` to evaluate BLEU, we recommend [sacrebleu](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu), which should be equivalent to `mteval-v13a.pl` but is more convenient. Report its signature, e.g. `BLEU+case.mixed+lang.de-en+test.wmt17 = 32.97 66.1/40.2/26.6/18.1 (BP = 0.980 ratio = 0.980 hyp_len = 63134 ref_len = 64399)`, so that others can reproduce your scoring setup.

```
# calculate lowercase bleu on tokenized text
cat model_prediction | sacrebleu -tok none -lc ground_truth
# calculate lowercase bleu on tokenized text when you have 3 references
cat model_prediction | sacrebleu -tok none -lc ground_truth_1 ground_truth_2 ground_truth_3
# calculate lowercase bleu on untokenized text using v13a tokenization
cat model_prediction | sacrebleu -tok 13a -lc ground_truth
# calculate lowercase bleu on untokenized text using v14 (international) tokenization
cat model_prediction | sacrebleu -tok intl -lc ground_truth
```

The transformer paper's original model settings can be found in [tensor2tensor transformer.py](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py). For example, the base model configuration is defined in the `transformer_base` function.

[OpenNMT-tf](https://github.com/OpenNMT/OpenNMT-tf/tree/master/scripts/wmt) also provides a reproducible recipe, but if we have to use TensorFlow we prefer tensor2tensor as the baseline for reproducing the paper's result, since it is the official implementation.
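
As a concrete reference for the gradient-accumulation trick above, here is a minimal, self-contained PyTorch sketch. It is an illustration only, not taken from any of the toolkits below; the dummy model, random batches and `accum_count` value are placeholders. Each toolkit exposes the same idea through its own switch, e.g. `--update-freq` in fairseq, `-accum_count` in OpenNMT-py, or the `MultistepAdam` hparams in tensor2tensor.

```python
# Minimal gradient-accumulation sketch (illustration only).
# A dummy linear model and random tensors stand in for a real transformer and data loader.
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                       # placeholder for a transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

accum_count = 8                                   # 8 for 1 GPU, 2 for 4 GPUs, ...
optimizer.zero_grad()
for step in range(100):                           # stands in for the real batch loop
    src = torch.randn(32, 512)                    # fake batch
    tgt = torch.randn(32, 512)
    loss = criterion(model(src), tgt)
    (loss / accum_count).backward()               # gradients keep accumulating in .grad
    if (step + 1) % accum_count == 0:             # one optimizer update per accum_count batches,
        optimizer.step()                          # imitating a synchronous 8-GPU update
        optimizer.zero_grad()
```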

#### Paper's original implementation: tensor2tensor(using *TensorFlow*)

##### Code

- [tensor2tensor](https://github.com/tensorflow/tensor2tensor)

##### Code annotation

- [Why the “Transformer” Is So Powerful: A Complete Walkthrough of Google's Tensor2Tensor System, from Model to Code](https://cloud.tencent.com/developer/article/1153079) (in Chinese only; corresponds to tensor2tensor v1.6.3)

##### Steps to reproduce WMT14 English-German result:

**(updated on v1.10.0)**

```shell
# 1. Install the tensor2tensor toolkit
pip install tensor2tensor

# 2. Basic config
# For a BPE model use this problem
PROBLEM=translate_ende_wmt_bpe32k
MODEL=transformer
HPARAMS=transformer_base
# or use transformer_big to reproduce the big model
# HPARAMS=transformer_big
DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# 3. Download and preprocess the corpus
# Note that tensor2tensor has a built-in tokenizer
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM

# 4. Train on 8 GPUs (with --worker_gpu=8 if the GPUs are on a single machine).
#    You'll get close to the expected performance after ~250k steps and the full
#    expected performance after ~500k steps.
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=600000

# 5. Translate
DECODE_FILE=$TMP_DIR/newstest2014.tok.bpe.32000.en
BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE \
  --decode_to_file=$TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode

# 6. Remove BPE
cat $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode | sed 's/@@ //g' > $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode.debpe
# Do compound splitting on the translation
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode.debpe > $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode.debpe.atat
# Do the same compound splitting on the ground truth and then score BLEU
# ...
```

**Note that step 6 is only the postprocessing.** For historical reasons, Google splits compound words before computing the final BLEU score, which brings a moderate increase; see [get_ende_bleu.sh](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/get_ende_bleu.sh) for details.

If you only have 1 GPU, you can use the `transformer_base_multistep8` hparams set to imitate 8 GPUs.

![transformer_base_multistep8](https://i.loli.net/2018/11/03/5bdd6a22ae29a.png)

You can also modify the `transformer_base_multistep8` function to accumulate gradients as many times as you want. Here is an example that uses 4 GPUs to run the transformer big model; note that `hparams.optimizer_multistep_accumulate_steps = 2`, since with 4 GPUs we only need to accumulate gradients twice.

```python
@registry.register_hparams
def transformer_base_multistep8():
  """HParams for simulating 8 GPUs with the MultistepAdam optimizer."""
  # modified: start from the big model and accumulate over 2 steps,
  # so that 4 physical GPUs give the same effective batch as 8
  hparams = transformer_big()
  hparams.optimizer = "MultistepAdam"
  hparams.optimizer_multistep_accumulate_steps = 2
  return hparams
```

##### Resources
- [t2t issue 539](https://github.com/tensorflow/tensor2tensor/issues/539)
- [t2t issue 444](https://github.com/tensorflow/tensor2tensor/issues/444)
- [t2t issue 317](https://github.com/tensorflow/tensor2tensor/issues/317)
- [Tensor2Tensor for Neural Machine Translation](https://arxiv.org/abs/1803.07416)

#### Harvard NLP Group's implementation: OpenNMT-py(using *PyTorch*)

##### Code

- [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py)

##### Steps to reproduce WMT14 English-German result:

**(updated on v0.5.0)**

For the meaning of the command-line arguments, see the [OpenNMT-py doc](http://opennmt.net/OpenNMT-py/main.html) or [OpenNMT-py opts.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/opts.py).

1. Download the [corpus preprocessed by OpenNMT](https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz) and the [sentencepiece model prepared by OpenNMT](https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp_model.tar.gz). Note that the preprocessing includes tokenization and a BPE/word-piece step (here using Google's [sentencepiece](https://github.com/google/sentencepiece), which implements the word-piece algorithm); see the [OpenNMT-tf script](https://github.com/OpenNMT/OpenNMT-tf/blob/master/scripts/wmt/prepare_data.sh) for details.

2. Preprocess. Because English and German are similar languages, we use `-share_vocab` to share the vocabulary between source and target; you don't need this flag for distant language pairs such as Chinese-English. We also use a maximum sequence length of `100`, which covers almost all sentences given the sentence-length distribution of the corpus.
For example:

```shell
python preprocess.py \
    -train_src ../wmt-en-de/train.en.shuf \
    -train_tgt ../wmt-en-de/train.de.shuf \
    -valid_src ../wmt-en-de/valid.en \
    -valid_tgt ../wmt-en-de/valid.de \
    -save_data ../wmt-en-de/processed \
    -src_seq_length 100 \
    -tgt_seq_length 100 \
    -max_shard_size 200000000 \
    -share_vocab
```

3. Train. For example, if you only have 4 GPUs:
```shell
python train.py -data /tmp/de2/data -save_model /tmp/extra \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 4 -gpu_ranks 0 1 2 3
```

Note that `-accum_count` means the loss of every `N` batches is accumulated before the backward pass, so it is 2 for 4 GPUs, 4 for 2 GPUs, 8 for 1 GPU, and so on.

4. Translate. For example (you can increase `-batch_size`, default `30`, to speed up translation):

```shell
python translate.py -gpu 0 -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 -length_penalty wu -coverage_penalty wu \
    -share_vocab vocab_file -max_length 200 -model model_file -src newstest2014.en.32kspe -output model.pred -verbose
```

Note that the test set in the corpus preprocessed by OpenNMT is newstest2017, while the original paper uses newstest2014, which may be a mistake. To obtain the newstest2014 test set as in the paper, we can encode `newstest2014.en` manually with sentencepiece; the sentencepiece model file is in the archive downloaded in step 1.

```shell
spm_encode --model=<sentencepiece_model_file> --output_format=piece < newstest2014.en > newstest2014.en.32kspe
```

5. Detokenize. Since the training data is processed by [sentencepiece](https://github.com/google/sentencepiece), the translation from step 4 is still sentencepiece-encoded, so we need a decoding step to obtain a plain, detokenized prediction.
For example:

```shell
spm_decode --model=<sentencepiece_model_file> --input_format=piece < input > output
```

6. Postprocess. Score the detokenized prediction with sacrebleu as described above, or apply the same compound splitting as in the tensor2tensor section to match the paper's tokenized BLEU (a short scoring sketch is given after the fairseq section below).

There is also a [BPE version](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) of the WMT'16 EN-DE corpus preprocessed by Google; see [subword-nmt](https://github.com/rsennrich/subword-nmt) for BPE encoding and decoding.

##### Resources

- [OpenNMT-py FAQ](http://opennmt.net/OpenNMT-py/FAQ.html)
- ~~[OpenNMT-py issue](https://github.com/OpenNMT/OpenNMT-py/issues/637) (deprecated)~~
- [OpenNMT: Open-Source Toolkit for Neural Machine Translation](https://arxiv.org/abs/1701.02810)

#### FAIR's implementation: fairseq-py(using *PyTorch*)

##### Code

- [fairseq-py](https://github.com/pytorch/fairseq/)

##### Steps to reproduce WMT14 English-German result:

**(updated on commit `7e60d45`)**

For the meaning of the arguments, see the [doc](https://fairseq.readthedocs.io/en/latest/command_line_tools.html). Note that `--update-freq` makes training accumulate the loss of every `N` batches before the backward pass, so it is `8` for 1 GPU, `2` for 4 GPUs, and so on.

1. Download [the preprocessed WMT'16 EN-DE data provided by Google](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) and extract it.

```
TEXT=wmt16_en_de_bpe32k
mkdir $TEXT
tar -xzvf wmt16_en_de.tar.gz -C $TEXT
```

2. Preprocess the dataset with a joined dictionary:

```
python preprocess.py --source-lang en --target-lang de \
  --trainpref $TEXT/train.tok.clean.bpe.32000 \
  --validpref $TEXT/newstest2013.tok.bpe.32000 \
  --testpref $TEXT/newstest2014.tok.bpe.32000 \
  --destdir data-bin/wmt16_en_de_bpe32k \
  --joined-dictionary
```

3. Train. For a base model:

```
# train for about 180k steps
python train.py data-bin/wmt16_en_de_bpe32k \
  --arch transformer_wmt_en_de --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.0007 --min-lr 1e-09 \
  --weight-decay 0.0 --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 --max-tokens 4096 --update-freq 2 \
  --no-progress-bar --log-format json --log-interval 10 --save-interval-updates 1000 \
  --keep-interval-updates 5
# average the last 5 checkpoints
modelfile=checkpoints
python scripts/average_checkpoints.py --inputs $modelfile --num-update-checkpoints 5 \
  --output $modelfile/average-model.pt
```

For a big model:
```
# train for about 270k steps
python train.py data-bin/wmt16_en_de_bpe32k \
  --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.0005 --min-lr 1e-09 \
  --weight-decay 0.0 --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 --max-tokens 4096 --update-freq 2 \
  --no-progress-bar --log-format json --log-interval 10 --save-interval-updates 1000 \
  --keep-interval-updates 20
# average the last 20 checkpoints
modelfile=checkpoints
python scripts/average_checkpoints.py --inputs $modelfile --num-update-checkpoints 20 \
  --output $modelfile/average-model.pt
```

4. Inference
```
model=average-model.pt
subset=test
python generate.py data-bin/wmt16_en_de_bpe32k --path $modelfile/$model \
  --gen-subset $subset --beam 4 --batch-size 128 --remove-bpe --lenpen 0.6 > pred.de
# fairseq's output is not in the original sentence order, so recover it;
# write the result to a new file instead of overwriting pred.de inside the same pipeline
grep ^H pred.de | cut -f1,3- | cut -c3- | sort -k1n | cut -f2- > pred.sorted.de
```

5. Postprocess. Apply the same compound splitting as in the tensor2tensor section before scoring; a short sketch follows the resource list below.

##### Resources

- [fairseq-py example](https://github.com/pytorch/fairseq/tree/master/examples/translation)
- [fairseq-py issue](https://github.com/pytorch/fairseq/issues/202). The corpus problem described in the issue has since been fixed, so the instructions above can be followed directly.
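
Here is a minimal sketch of the postprocessing referenced in step 5 above (and in step 6 of the OpenNMT-py recipe). It only reuses the compound-splitting and sacrebleu commands already shown in this README; `pred.sorted.de` is the reordered output from step 4, and `ref.tok.de` is a placeholder for your tokenized newstest2014 German reference, so adjust the file names and tokenization options to your own setup.

```shell
# 1. Split compound words in both the hypothesis and the reference,
#    as get_ende_bleu.sh does (see the tensor2tensor section)
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < pred.sorted.de > pred.sorted.de.atat
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < ref.tok.de > ref.tok.de.atat
# 2. Score on the already-tokenized, compound-split text; this is roughly comparable
#    to the paper's multi-bleu.perl setup. Report the sacrebleu signature as well.
cat pred.sorted.de.atat | sacrebleu -tok none ref.tok.de.atat
```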

### Complex, not certainly performance-reproducible implementations

- [Marian](https://github.com/marian-nmt/marian-examples/tree/master/transformer) (a pure C++ implementation, with no deep-learning framework)

## Training tips

- [Training Tips for the Transformer Model](https://arxiv.org/abs/1804.00247)

## Further

- RNMT+: [The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation](https://arxiv.org/abs/1804.09849)
- [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187)
- Turing-complete transformer: [Universal Transformers](https://arxiv.org/abs/1807.03819)
- [Self-Attention with Relative Position Representations](https://arxiv.org/abs/1803.02155)
- [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

## Contributors

This project is developed and maintained by the Natural Language Processing Group, ICT/CAS.

- [Yong Shan](https://github.com/SkyAndCloud)
- [Jinchao Zhang](https://github.com/zhangjcqq)
- Shuhao Gu
--------------------------------------------------------------------------------