├── .gitmodules ├── nmt-wizard ├── README.md └── data │ ├── ru_en │ └── train │ │ ├── helloworld.ruen.train.en │ │ └── helloworld.ruen.train.ru │ ├── test │ ├── helloworld.ruen.test.en │ └── helloworld.ruen.test.ru │ └── vocab │ ├── de.dict │ ├── en.dict │ ├── helloworld.ruen.src.dict │ └── helloworld.ruen.tgt.dict └── unsupervised-nmt ├── LICENSE ├── README.md ├── img ├── adversarial.png ├── decoding.png └── encoding.png ├── paper.pdf ├── ref ├── inference.py ├── train.sh └── training.py ├── requirements.txt.cpu └── requirements.txt.gpu /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "nmt-wizard/nmt-wizard"] 2 | path = nmt-wizard/nmt-wizard 3 | url = https://github.com/OpenNMT/nmt-wizard.git 4 | -------------------------------------------------------------------------------- /nmt-wizard/README.md: -------------------------------------------------------------------------------- 1 | *Owner: Jean Senellart (jean.senellart (at) opennmt.net)* 2 | 3 | # NMT-Wizard Hello World 4 | 5 | ## Introduction 6 | 7 | The goal of this tutorial is to configure an nmt-wizard server, launch a task that trains a simple transliteration model from Russian to English on CPU, and test the generated model. 8 | 9 | Reference: [https://github.com/OpenNMT/nmt-wizard](https://github.com/OpenNMT/nmt-wizard) 10 | 11 | ## Server Configuration 12 | 13 | - minimal environment required: `python`, `pip`, `build-essential`, `make` 14 | - please use Python 2.7 15 | 16 | ``` 17 | $ sudo apt-get update 18 | $ sudo apt-get -y install python python-pip 19 | ``` 20 | - Set the environment variable `TUTORIAL` to the path of the working directory for this tutorial, and change to that directory. 21 | 22 | ``` 23 | $ mkdir tutorial-onmt-wizard-1 24 | $ export TUTORIAL=${PWD}/tutorial-onmt-wizard-1 25 | $ cd ${TUTORIAL} 26 | ``` 27 | 28 | - Installation of Redis 29 | 30 | ``` 31 | $ sudo apt-get -y install redis-server 32 | ``` 33 | or 34 | 35 | ``` 36 | $ curl http://download.redis.io/releases/redis-4.0.8.tar.gz > redis-4.0.8.tar.gz 37 | $ tar xzf redis-4.0.8.tar.gz 38 | $ cd redis-4.0.8 39 | $ cd deps 40 | $ make hiredis jemalloc linenoise lua geohash-int 41 | $ cd .. 42 | $ make 43 | ``` 44 | 45 | Launch a server (change to the `src` directory if you installed by compiling): 46 | 47 | ``` 48 | $ redis-server 49 | ``` 50 | 51 | And configure keyspace event handling in a new terminal: 52 | ``` 53 | $ redis-cli config set notify-keyspace-events Klgx 54 | ``` 55 | 56 | The Redis database contains the following fields: 57 | 58 | | Field | Type | Description | 59 | | --- | --- | --- | 60 | | `active` | list | Active tasks | 61 | | `beat:` | int | Specific ttl-key for a given task | 62 | | `lock:` | value | Temporary lock on a resource or task | 63 | | `queued:` | list | Tasks waiting for a resource | 64 | | `resource::` | list | Tasks using this resource | 65 | | `task:` | dict |
  • status: [queued, allocated, running, terminating, stopped]
  • job: JSON of the job id (if status >= waiting)
  • service: the name of the service
  • resource: the name of the resource - or `auto` before one is allocated
  • message: error message (if any), ‘completed’ if successfully finished
  • container_id: container in which the task runs, sent back by the Docker notifier
  • (queued|allocated|running|updated|stopped)_time: time for each event
| 66 | | `files:` | dict | files associated with a task, "log" is generated when training is complete | 67 | | `queue:` | str | expirable timestamp on the task - is used to regularly check status | 68 | | `work` | list | Tasks to process | 69 | 70 | - virtualenv installation 71 | 72 | ``` 73 | $ pip install virtualenv 74 | $ virtualenv ${TUTORIAL} 75 | ``` 76 | 77 | - get the GitHub project 78 | 79 | ``` 80 | $ cd ${TUTORIAL} 81 | $ git clone https://github.com/OpenNMT/nmt-wizard.git 82 | ``` 83 | 84 | - install Docker 85 | 86 | ``` 87 | $ sudo apt-get install \ 88 | apt-transport-https \ 89 | ca-certificates \ 90 | curl \ 91 | software-properties-common 92 | $ sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add - 93 | $ sudo add-apt-repository \ 94 | "deb [arch=amd64] https://download.docker.com/linux/ubuntu \ 95 | $(lsb_release -cs) \ 96 | stable" 97 | $ sudo apt-get update 98 | $ sudo apt-get install docker-ce 99 | $ sudo usermod -aG docker {{YOURUSERNAME}} 100 | ``` 101 | or 102 | for other systems, please see the [installation instructions here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). 103 | 104 | Close the session and open a new one so that the `docker` group membership takes effect: 105 | ``` 106 | $ export TUTORIAL=${PWD}/tutorial-onmt-wizard-1 107 | $ cd ${TUTORIAL} 108 | $ docker run hello-world 109 | ``` 110 | 111 | - installation of Python dependencies 112 | 113 | ``` 114 | $ cd nmt-wizard 115 | $ sudo pip install -r requirements.txt 116 | ``` 117 | 118 | - create a public/private key pair, and add the public key to `.ssh/authorized_keys` in order to enable connecting from your server to itself without authentication (also useful for remote servers). 119 | 120 | ``` 121 | $ ssh-keygen 122 | ``` 123 | command line for a local server: 124 | ``` 125 | $ cat ${HOME}/.ssh/id_rsa.pub >> ${HOME}/.ssh/authorized_keys 126 | ``` 127 | 128 | ## Data preparation 129 | 130 | The data directory contains aligned, space-tokenized Russian-English names, split into a training file and a test file. 131 | 132 | You need to get it from the `Hackathon/nmt-wizard` directory as follows: 133 | 134 | ``` 135 | git clone https://github.com/OpenNMT/Hackathon.git 136 | ``` 137 | 138 | and copy the `nmt-wizard/data` directory into `{{TUTORIAL}}/`. 139 | 140 | 141 | ## Wizard Configuration 142 | 143 | ### Service configuration 144 | The REST server and worker are configured by `nmt-wizard/server/settings.ini`. The LAUNCHER_MODE environment variable (defaulting to Production) can be set to select different sets of options in development or production. 145 | ``` 146 | [DEFAULT] 147 | # config_dir with service configuration 148 | config_dir = ./config 149 | # logging level 150 | log_level = INFO 151 | # refresh rate 152 | refresh = 60 153 | 154 | [Production] 155 | redis_host = localhost 156 | redis_port = 6379 157 | redis_db = 0 158 | #redis_password=xxx 159 | ``` 160 | Here we use the default host and port of the Redis server. 161 | You can choose a different logging level with `log_level`: `INFO`, `WARN`, `DEBUG`, `FATAL`, `ERROR`; for example, `DEBUG` gives the most complete log for debugging purposes. 162 | 163 | ### Local SSH server 164 | 165 | For this tutorial, we will define the local computer as the service, using the `services.ssh` connector. Other connectors are for instance `service.ec2` or `service.torque`. 166 | 167 | Get your IP with `ifconfig` - referred to as `{{YOURIP}}` below. 168 |
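Since the `services.ssh` connector will log into this machine over SSH, it is worth checking first that the passwordless loopback connection set up above actually works. A minimal sanity check (assuming the key pair created earlier and your own user name):

```
$ ssh -i ${HOME}/.ssh/id_rsa {{YOURUSERNAME}}@localhost 'echo connection ok'
```

If this prompts for a password, re-check the `authorized_keys` step above before continuing.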
169 | Copy the following JSON into `nmt-wizard/server/config/default.json`: 170 | 171 | ``` 172 | { 173 | "docker": { 174 | "registries": { 175 | "dockerhub": { 176 | "type": "dockerhub", 177 | "uri": "" 178 | } 179 | } 180 | }, 181 | "storages" : { 182 | "launcher": { 183 | "type": "http", 184 | "get_pattern": "/file//%s", 185 | "post_pattern": "/file//%s" 186 | } 187 | }, 188 | "callback_url": "http://{{YOURIP}}:5000", 189 | "callback_interval": 60 190 | } 191 | ``` 192 | Make sure to replace `{{YOURIP}}` with the actual IP. 193 | * The first part defines the registry named `dockerhub` as the official public Docker Hub registry. You could also define private Docker Hub registries, or use AWS Elastic Container Service (ECS) registries. 194 | * The second part defines the storage named `launcher` - a simple HTTP storage server implemented within the launcher. You will see below how to define other types of storage. 195 | 196 | ``` 197 | $ mkdir ${TUTORIAL}/inftraining_logs 198 | ``` 199 | Copy the following JSON into `nmt-wizard/server/config/myserver.json`. 200 | ``` 201 | { 202 | "name": "myserver", 203 | "description": "My computing server", 204 | "module": "services.ssh", 205 | "variables": { 206 | "server_pool": [ 207 | { 208 | "host": "localhost", 209 | "gpus": [0], 210 | "login": "{{YOURUSERNAME}}", 211 | "log_dir": "${TUTORIAL}/inftraining_logs" 212 | } 213 | ] 214 | }, 215 | "privateKey": "${HOME}/.ssh/id_rsa", 216 | "docker": { 217 | "mount": [ 218 | "${TUTORIAL}/data/:/root/corpus/", 219 | "${TUTORIAL}/tmp:/root/tmp" 220 | ] 221 | } 222 | } 223 | ``` 224 | Make sure to replace `${TUTORIAL}` with the absolute path and `{{YOURUSERNAME}}` with your user name. 225 | 226 | This is a simple configuration of your server. 227 | * `"gpus"` is set to `[0]` (off) since we're not using a GPU in this tutorial 228 | * the log files will be saved under `${TUTORIAL}/inftraining_logs`; make sure this directory exists 229 | * your SSH private key `${HOME}/.ssh/id_rsa` will be used for connecting to the server on which your public key `${HOME}/.ssh/id_rsa.pub` is authorized 230 | * `${TUTORIAL}/data/` is your training corpus directory on the local / remote server (mounted as `/root/corpus/` in the container) 231 | * `${TUTORIAL}/models` is the directory for saving the models of the `train` task 232 | * your custom files will be copied under `${TUTORIAL}/tmp` 233 | 234 | ## Launch the REST server 235 | 236 | For a production system, see the [Flask documentation](http://flask.pocoo.org/docs/0.12/deploying/) on how to deploy it properly. 237 | 238 | In a terminal: 239 | ``` 240 | cd nmt-wizard/server 241 | FLASK_APP=main.py flask run --host=0.0.0.0 242 | ``` 243 | 244 | 245 | ## Launch the worker 246 | 247 | In a new terminal: 248 | ``` 249 | $ export TUTORIAL=${PWD}/tutorial-onmt-wizard-1 250 | $ cd ${TUTORIAL} 251 | $ cd nmt-wizard/server 252 | $ python worker.py 253 | ``` 254 | 255 | ## The client command line 256 | 257 | `{{YOURID}}` is the trainer id, used as a prefix for generated models (default: ENV[`LAUNCHER_TID`]) 258 | ``` 259 | $ export TUTORIAL=${PWD}/tutorial-onmt-wizard-1 260 | $ cd ${TUTORIAL} 261 | $ export LAUNCHER_URL=http://{{YOURIP}}:5000 262 | $ export LAUNCHER_TID={{YOURID}} 263 | $ mkdir nmt-wizard/example 264 | ``` 265 | 266 | Copy the following JSON into `nmt-wizard/example/helloworld.json`. 
267 | 268 | ``` 269 | { 270 | "source": "ru", 271 | "target": "en", 272 | "data": { 273 | "sample_dist": [ 274 | { 275 | "path": "train", 276 | "distribution": { 277 | "helloworld.*": "1" 278 | } 279 | } 280 | ], 281 | "sample": 100000, 282 | "train_dir": "ru_en" 283 | }, 284 | "options": { 285 | "train": { 286 | "rnn_size": "50", 287 | "word_vec_size": "20", 288 | "layers": "1", 289 | "src_vocab": "${CORPUS_DIR}/vocab/helloworld.ruen.src.dict", 290 | "tgt_vocab": "${CORPUS_DIR}/vocab/helloworld.ruen.tgt.dict" 291 | } 292 | } 293 | } 294 | ``` 295 | `${CORPUS_DIR}` is a local environment variable; you don't need to change it. 296 | This is the configuration of a simple transliteration training task; it has two parts: `data` and `options`. 297 | * `data` part: the source language is `ru` and the target language is `en`; the corpus is picked from `${TUTORIAL}/data/<train_dir>/<path>/`; the files with extensions `ru` / `en` matching the pattern `helloworld.*` will be picked. Their coefficient is set to `1` in the total of `100000` samples. See the [sampling documentation](https://github.com/OpenNMT/OpenNMT/blob/master/docs/training/sampling.md) 298 | * `options` part: the training configuration; in this training, a local custom file `${TUTORIAL}/data/vocab/helloworld.ruen.src.dict` will be copied to and used on the server. See the [training option documentation](https://github.com/OpenNMT/OpenNMT/blob/master/docs/options/train.md) 299 | 300 | 301 | Go through the different commands: 302 | 303 | ``` 304 | $ cd nmt-wizard/client 305 | ``` 306 | 307 | 308 | - `ls`: returns available services 309 | 310 | ``` 311 | $ python launcher.py ls 312 | ``` 313 | - `lt`: returns the list of tasks in the database 314 | 315 | ``` 316 | $ python launcher.py lt 317 | ``` 318 | - `launch` `train`: starts a training task, the return value is a task id `taskid_1` 319 | 320 | ``` 321 | $ python launcher.py launch -s myserver -i nmtwizard/opennmt-lua -- -ms /root/tmp/ -c @../example/helloworld.json train 322 | ``` 323 | - `launch` `trans`: transliterates/translates `/root/corpus/test/helloworld.ruen.test.ru` using the model of `taskid_1`, the return value is a task id `taskid_2` 324 | 325 | ``` 326 | $ python launcher.py launch -s myserver -i nmtwizard/opennmt-lua -- -ms /root/tmp/ -m trans -i /root/corpus/test/helloworld.ruen.test.ru -o "launcher:helloworld.ruen.test.ru.out" 327 | ``` 328 | - `file`: gets a file from the translation task 329 | 330 | ``` 331 | $ python launcher.py file -f helloworld.ruen.test.ru.out -k > ${TUTORIAL}/data/test/helloworld.ruen.test.ru.out 332 | ``` 333 | - `terminate`: stops a running/queued task by its `taskid` 334 | 335 | ``` 336 | $ python launcher.py terminate -k 337 | ``` 338 | - `status`: checks the status of a task by its `taskid` 339 | 340 | ``` 341 | $ python launcher.py status -k 342 | ``` 343 | Evaluation: 344 | check whether there is log information at the head of the output file; remove the log lines and the trailing empty line: 345 | 346 | ``` 347 | $ tail -n +3 ${TUTORIAL}/data/test/helloworld.ruen.test.ru.out | head -n -1 > ${TUTORIAL}/data/test/helloworld.ruen.test.ru.out.tmp 348 | $ mv ${TUTORIAL}/data/test/helloworld.ruen.test.ru.out.tmp ${TUTORIAL}/data/test/helloworld.ruen.test.ru.out 349 | ``` 350 | then 351 | 352 | ``` 353 | $ cd ${TUTORIAL} 354 | $ git clone https://github.com/OpenNMT/nmt-benchmark.git 355 | $ perl nmt-benchmark/scripts/multi-bleu.perl ${TUTORIAL}/data/test/helloworld.ruen.test.en < ${TUTORIAL}/data/test/helloworld.ruen.test.ru.out 356 | ``` 357 | A BLEU score will be shown: 358 | 359 | ``` 360 | BLEU = 70.27 +/- 0.44, 
88.5/78.0/69.4/62.1 (BP=0.952, ratio=0.953, hyp_len=6164, ref_len=6469) 361 | ``` 362 | 363 | There are also other alternative storage types: 364 | * S3: connecting to AWS using an access key ID and a secret key 365 | * SSH: connecting to a remote server hostname/IP via SSH 366 | 367 | ``` 368 | "storages" : { 369 | "s3_models": { 370 | "type": "s3", 371 | "bucket": "model-catalog", 372 | "aws_credentials": { 373 | "access_key_id": "XXXXX", 374 | "secret_access_key": "XXXXX", 375 | "region_name": "eu-west-3" 376 | } 377 | }, 378 | "myremoteserver": { 379 | "type": "ssh", 380 | "server": "myserver_url", 381 | "user": "XXXXX", 382 | "password": "XXXXX" 383 | } 384 | } 385 | ``` 386 | ## Using pre-configured launcher with EC2 credentials 387 | 388 | If you are here, you are ready to move to the next stage - let us try to launch the same task on an EC2 instance using a pre-configured launcher. 389 | 390 | First configure access to the launcher: 391 | 392 | ``` 393 | export LAUNCHER_URL=http://stlauncher.opennmt.net 394 | ``` 395 | 396 | Explore the available services: 397 | 398 | ``` 399 | $ python launcher.py ls 400 | SERVICE NAME DESCRIPTION 401 | ec2 Instance on AWS EC2 402 | ``` 403 | 404 | Describe the resources available on EC2: 405 | 406 | ``` 407 | $ python launcher.py describe -s ec2 408 | {"launchTemplateName": {"enum": ["CPU_C5_xlarge_50Gb", "GPU_G3_4xlarge_50Gb"], "type": "string", "description": "The name of the EC2 launch template to use", "title": "EC2 Launch Template"}} 409 | ``` 410 | 411 | There are 2 different resources configured and available on EC2: `CPU_C5_xlarge_50Gb` and `GPU_G3_4xlarge_50Gb` - this is a JSON form for selecting `launchTemplateName`. 412 | 413 | To select the resource, you need to pass a JSON file corresponding to your choice: 414 | 415 | ``` 416 | $ cat > o.json 417 | {"launchTemplateName":"GPU_G3_4xlarge_50Gb"} 418 | ``` 419 | 420 | So let us be bold and launch our task on EC2 using a GPU instance. 421 | 422 | The EC2 service is configured with a mount of the S3 bucket `nmt-wizard-data` on `${CORPUS_DIR}`. 423 | 424 | The S3 bucket contains: 425 | * `ru_en` - which is the same as in the tutorial data 426 | * `wmt17/de_en` - containing all the prepared German-English (DE-EN) data for WMT (a sketch of a configuration adapted to this data follows below). 
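As an aside, if you later want to use the `wmt17/de_en` data from this bucket rather than the transliteration corpus, the `data` part of the training configuration could be adapted along the following lines. This is only a sketch: the sub-directory layout and file patterns inside `wmt17/de_en` are assumptions that you should verify against the actual bucket content.

```
{
  "source": "de",
  "target": "en",
  "data": {
    "sample_dist": [
      {
        "path": "train",
        "distribution": {
          ".*": "1"
        }
      }
    ],
    "sample": 100000,
    "train_dir": "wmt17/de_en"
  }
}
```

The `options` part would stay similar, with `src_vocab` and `tgt_vocab` pointing to German and English dictionaries (for instance the `de.dict` and `en.dict` files shipped with the tutorial data).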
427 | 428 | with this information, modify the helloworld.json and you can launch the same transliteration training on EC2 GPU instance: 429 | 430 | ``` 431 | $ python launcher.py launch -s ec2 -o @o.json -i nmtwizard/opennmt-lua -- -ms s3_models: -c @../example/helloworld.json train 432 | ``` 433 | 434 | --- 435 | 436 | *Congratulations for completing this Hello World tutorial!* 437 | -------------------------------------------------------------------------------- /nmt-wizard/data/test/helloworld.ruen.test.en: -------------------------------------------------------------------------------- 1 | A r n o 2 | G r a n a d o 3 | A l b e r t o 4 | H a m a d 5 | T h u w a i n i 6 | P e g g 7 | D a v i d 8 | A l e x a n d e r 9 | K o s h e t z 10 | V l a d i m i r 11 | G u s i n s k y 12 | V a s k s 13 | P ē t e r i s 14 | S u ñ é 15 | R u b é n 16 | M a n m e e t 17 | B h u l l a r 18 | W o l f g a n g 19 | B o e t t c h e r 20 | B r a u n e r 21 | V i c t o r 22 | L o u i s e 23 | N e t h e r l a n d s 24 | J o h i n 25 | B r a d 26 | M i l l e r 27 | N e e l a m 28 | S a n j i v a 29 | R e d d y 30 | S k l e n a ř í k o v á 31 | A d r i a n a 32 | S h u a i 33 | M a r c 34 | M á r q u e z 35 | M a r c 36 | M á r q u e z 37 | S i n n e t t 38 | A l f r e d 39 | P e r c y 40 | P a r m e n i o n 41 | G a i u s 42 | A s i n i u s 43 | G a l l u s 44 | I r i s 45 | T r e e 46 | A l e x 47 | R o d r í g u e z 48 | C o m p a n s 49 | J e a n 50 | D o m i n i q u e 51 | N i c h o l a s 52 | L e a 53 | F e l i x 54 | B o u r b o n 55 | P a r m a 56 | S i m m o n s 57 | W o o d w a r d 58 | O p p e n h e i m e r 59 | E r n e s t 60 | F i l i p p o 61 | E d u a r d o 62 | B a s a r a b 63 | T e r u o 64 | N a k a m u r a 65 | J o v a n 66 | K i r o v s k i 67 | S h a r m a 68 | A r v i n d 69 | T e m i l e 70 | F r a n k 71 | S a n t o s 72 | G o n ç a l v e s 73 | J o ã o 74 | P e d r o 75 | M i c h a e l 76 | H u t h 77 | G i r a r d o t 78 | H i p p o l y t e 79 | C a t h e r i n e 80 | P e r r e t 81 | K a b a i v a n s k a 82 | R a i n a 83 | R a j 84 | R a j a r a t n a m 85 | E e r o 86 | A a r n i o 87 | A a r n i o 88 | I n a f u n e 89 | K e i j i 90 | B o n n a t 91 | L é o n 92 | A d i t y a 93 | C h o p r a 94 | A n n e n k o v 95 | M i k h a i l 96 | A v r a h a m 97 | P o r a z 98 | B e b e l 99 | G i l b e r t o 100 | D a v i d 101 | W i l l i s 102 | B e c k h a m 103 | G e s s 104 | E d g a r 105 | M o h a m e d 106 | A m s i f 107 | W h i t e 108 | G e o r g e 109 | N ' d y 110 | A s s e m b é 111 | K a l n i ņ š 112 | I v a r s 113 | W i l e ń s k i 114 | K o n s t a n t y 115 | J e o n g 116 | T u c k e r 117 | J o n a t h a n 118 | O l d m a n 119 | A l b e r t 120 | J o h n 121 | D o b r y n i n 122 | V y a c h e s l a v 123 | S i v o r i 124 | C a m i l l o 125 | S i l v a 126 | R a m o s 127 | K l a u s 128 | T e n n s t e d t 129 | C a r m e 130 | E l i a s 131 | I m a n o v 132 | L u t f i y a r 133 | T r i c i a 134 | H e l f e r 135 | A d a m 136 | A n d e r s o n 137 | E w e n 138 | F e r g u s s o n 139 | D i e h l 140 | R i c h a r d 141 | I n n o c e n t 142 | G r a y s o n 143 | H a l l 144 | J o h n 145 | E a t w e l l 146 | E a t w e l l 147 | M a r k o 148 | K e š e l j 149 | B o g u s ł a w 150 | R a d z i w i ł ł 151 | S t a n i s ł a w 152 | K o s t k a 153 | P o t o c k i 154 | A n d y 155 | L a P l e g u a 156 | C u n n i n g h a m 157 | W i l l i a m 158 | A p e r g h i s 159 | G e o r g e s 160 | H a n s 161 | J a k o b 162 | C h r i s t o f f e l 163 | v o n 164 
| G r i m m e l s h a u s e n 165 | C h a n g y 166 | A l a i n 167 | D a v i d 168 | C h a s e 169 | G i r s 170 | N i k o l a y 171 | F i l t s c h 172 | C a r l 173 | T h o m s o n 174 | W a r r e n 175 | L u i s 176 | A l b e r t o 177 | P e r e a 178 | E d u a r d o 179 | S á n c h e z 180 | F u e n t e s 181 | P o i r é 182 | A l a i n 183 | A l e k s e i 184 | S h a p o s h n i k o v 185 | Y r j ö 186 | H i e t a n e n 187 | M o r t a u d 188 | C h l o é 189 | A l e x a n d e r 190 | W a n g 191 | M a c M a n u s 192 | A r t h u r 193 | D r u z h n i k o v 194 | V l a d i m i r 195 | P a n c h e n k o 196 | Y u r i y 197 | B e n e d i c t 198 | M i c h a e l 199 | C e r v e r i s 200 | N e r s e s 201 | Y e r i t s y a n 202 | M i n t o n 203 | F a i t h 204 | G r e t a 205 | K u k k o n e n 206 | C o l e 207 | T a y l o r 208 | D o n a l d 209 | G r a h a m 210 | B u r t 211 | T h o m a s 212 | B u r r o w e s 213 | D i m i t a r 214 | A g u r a 215 | D m i t r y 216 | I v a n o v i c h 217 | K i t ō 218 | B u l l 219 | J a c o b 220 | B r e d a 221 | B l a s c o 222 | I b á ñ e z 223 | V i c e n t e 224 | T i n a 225 | D a v i s 226 | K ř e s a d l o 227 | B u t c h 228 | W a l k e r 229 | L a u r e 230 | J u n o t 231 | M a r t h a 232 | M a d i s o n 233 | P h i l i p 234 | G e o r g e 235 | N e e d h a m 236 | S a d e l e r 237 | M a r k e d o n o v 238 | S e r g e y 239 | M e l i a v a 240 | T a m a z 241 | A p o ñ o 242 | R u b i n a 243 | A l i 244 | K o u n e l l i s 245 | J a n n i s 246 | V a l e r i e 247 | B r i s c o 248 | H o o k s 249 | S a m u e l 250 | C r o w t h e r 251 | O d e n 252 | G r e g 253 | S u b b o t i n 254 | S e r g e y 255 | S u z a n 256 | A n b e h 257 | J e f f r e y 258 | C r a i g 259 | L a B e o u f 260 | H u s a y n 261 | B a y q a r a 262 | C o n s t a n c e 263 | J o h n 264 | C l a p h a m 265 | D e p e s t r e 266 | R e n e 267 | P e r r i n 268 | C l a u d e 269 | V i c t o r 270 | C a n t i u s 271 | P r u d n i k a u 272 | P a v e l 273 | H e r b e r t 274 | B r e n o n 275 | T i b o r 276 | M a c h a n 277 | C a l l i x t u s 278 | R o b e r t 279 | P a t t i n s o n 280 | V a l e n t i n e 281 | N o v i k o v 282 | S e r g e y 283 | A n d r e a s 284 | R o m b e r g 285 | K h a n k e y e v 286 | I g o r 287 | M u r a v s c h i 288 | V a l e r i u 289 | D o n g 290 | F a n g z h u o 291 | A n j a 292 | E r ž e n 293 | C o l e e n 294 | R o o n e y 295 | C h r i s t e r 296 | F u g l e s a n g 297 | J o a q u í n 298 | S á n c h e z 299 | K o s a m b i 300 | A r t h u r 301 | D o v e 302 | O w e n 303 | C h a s e 304 | G a r c í a 305 | S á n c h e z 306 | A b r a h a m 307 | b e n 308 | J a c o b 309 | F r e d d i e 310 | Y o u n g 311 | G o i t e i n 312 | S h e l o m o 313 | D o v 314 | C h a r l e s 315 | B e n n e t t 316 | A u g s b e r g e r 317 | F r a n z 318 | F y d r y c h 319 | W a l d e m a r 320 | A n t o n 321 | A l e x a n d e r 322 | v o n 323 | A u e r s p e r g 324 | T h o m a s 325 | W a l t e r 326 | S e r g e i 327 | D m i t r i y e v i c h 328 | B o g d a n o v 329 | V l a d i m i r 330 | Y e v t u s h e n k o v 331 | C o q u e r e l 332 | M i s s a k 333 | M a n o u c h i a n 334 | F r a n k o w s k i 335 | T o m a s z 336 | P a r a s h a r a 337 | E z r a 338 | H e y w o o d 339 | S h l o m o 340 | L a h a t 341 | W i l l i a m 342 | H a m i l t o n 343 | G i b s o n 344 | G r e g 345 | C o x 346 | M a k a r o v 347 | K o n s t a n t i n 348 | B e r n h a r d t 349 | M o l i q u e 350 | C a r l o s 351 | D i o 
g o 352 | K o v n e r 353 | B r u c e 354 | L u c a s 355 | A l a m á n 356 | S a m a n e z 357 | O c a m p o 358 | D a v i d 359 | V i c t o r 360 | L o r e t 361 | A n a y a 362 | E l e n a 363 | A n a n i a 364 | S h i r a k a t s i 365 | S h i f r i n 366 | K a r i n 367 | R u d i 368 | A r n s t a d t 369 | I r i n a 370 | A b y s o v a 371 | C a t h e r i n e 372 | M i c h e l l e 373 | G a ł c z y ń s k i 374 | K o n s t a n t y 375 | I l d e f o n s 376 | I v o 377 | M i n á ř 378 | A b u 379 | U b a i d a h 380 | i b n 381 | J a r r a h 382 | V a l l i 383 | A l i d a 384 | N i e r o t h 385 | C a r l 386 | X u a n 387 | A r t y o m 388 | K h a c h a t u r o v 389 | P i e r r e 390 | F r a n ç o i s 391 | T i s s o t 392 | V e r l a t 393 | C h a r l e s 394 | E d w a r d 395 | S e y m o u r 396 | H e r t f o r d 397 | M a t t i 398 | H ä y r y 399 | R o m a n 400 | B o r i s e v i c h 401 | L o c k e 402 | J o h n 403 | O v c h i n n i k o v 404 | V a l e r i 405 | S u b h a s i s h 406 | R o y 407 | C h o w d h u r y 408 | G e o r g 409 | M o h r 410 | A d r i á n 411 | R o d r í g u e z 412 | Y e v g e n i 413 | D y a c h k o v 414 | M a r y 415 | F r i t h 416 | G u i d o 417 | L a v e z a r i s 418 | W i l l i a m 419 | W o o l l s 420 | G o d f r i d 421 | F r i s i a 422 | S e r g e i 423 | B e l o v 424 | B l i g h 425 | W i l l i a m 426 | H e n s c h e l 427 | M i l t o n 428 | G e o r g e 429 | M o r r i s o n 430 | A d a m 431 | P y o t r 432 | D u r n o v o 433 | P e t e r 434 | S c h a e f e r 435 | C h e n 436 | B i n g d e 437 | R è n 438 | L y n n 439 | T h o r n d i k e 440 | S i m o n 441 | G a l l u p 442 | K o n s t a n t i n 443 | K o r o v i n 444 | F r a n z 445 | P f e f f e r 446 | v o n 447 | S a l o m o n 448 | B e n a d o 449 | A r i k 450 | J o h n 451 | E a t t o n 452 | C h e m i a k i n 453 | M i h a i l 454 | F e l i p e 455 | J o r g e 456 | L o u r e i r o 457 | N i k i t a 458 | B e l y k h 459 | I v a n o v 460 | S u k h a r e v s k y 461 | A l e k s a n d r 462 | S h a i n o v a 463 | M a r i n a 464 | M e t s h i n 465 | I l s u r 466 | O l e 467 | E l l e f s æ t e r 468 | V i k t o r 469 | K a l i n a 470 | N a t h a n 471 | P a r k e r 472 | F r a n t i š e k 473 | K a b e r l e 474 | H u b b a r d 475 | Q u e n t i n 476 | C o l l e e n 477 | B a c h m a n 478 | K a r i ć 479 | A m i r 480 | E g a n 481 | S e a n 482 | M a r t i n 483 | S t r a n z l 484 | O n d ř í č e k 485 | M i r o s l a v 486 | A l l a n 487 | C o m b s 488 | M o r g a n 489 | B r i a n 490 | G e r t r u d e 491 | B r u n s w i c k 492 | R a f a e l 493 | B o b a n 494 | F r e d e r i c 495 | D u b o i s 496 | O l g a 497 | F y o d o r o v a 498 | O r l o f f 499 | V a i l l a n t 500 | J e a n 501 | B a p t i s t e 502 | P h i l i b e r t 503 | I b r a h i m 504 | S a h a d 505 | J o h a n n 506 | J o a c h i m 507 | L a n g e 508 | K o r y u n 509 | K r i s s y 510 | T a y l o r 511 | G r i m a l d i 512 | D a v i d 513 | S a b i n e 514 | T h i e r r y 515 | S a p r y k i n 516 | O l e g 517 | S a r a 518 | S t o c k b r i d g e 519 | I r e n e 520 | T o m a r 521 | D e r e k 522 | M o r r i s 523 | A l o y s i u s 524 | L i l i u s 525 | L j u b a 526 | K a z a r n o v s k a y a 527 | M i c h a e l 528 | v o n 529 | d e r 530 | H e i d e 531 | K o s t i ć 532 | B o r i s 533 | V a s i l y 534 | K o n s t a n t i n o v i c h 535 | R i c h a r d 536 | S u l í k 537 | P a v e l 538 | T o n k o v 539 | R y a n 540 | C o n r o y 541 | Ć o s i ć 542 | D o b r i c a 
543 | B j ø r n d a l e n 544 | O l e 545 | E i n a r 546 | M u r a v i e v 547 | N i k o l a y 548 | S e a r l e 549 | R o b e r t 550 | T h i e r r y 551 | A m a r 552 | J o e l 553 | B i l l y 554 | A m i r 555 | K h a d i r 556 | B a r b a r a 557 | H a n n i g a n 558 | M i r t a 559 | B u s n e l l i 560 | R u d i 561 | C e r n e 562 | M a z z u c a t o 563 | A l b e r t o 564 | F r a s e r 565 | K r i s t i n 566 | L e o n a r d 567 | M a t l o v i c h 568 | I u l i a n 569 | E r h a n 570 | N a t a l i e 571 | M e n d o z a 572 | G o m e s 573 | I n n a 574 | P a u l 575 | H a r d i n g 576 | S t a n 577 | M a k 578 | P e t r o v 579 | D e n i s 580 | T e i m u r a z 581 | C h i r g a d z e 582 | C h a l a b a l a 583 | Z d e n ě k 584 | M o n t b a z o n 585 | H e r c u l e 586 | F i o n a 587 | V i c t o r y 588 | A n d r e w 589 | T a r b e t 590 | C h o d o w i e c k i 591 | D a n i e l 592 | V a s h a k i d z e 593 | T a m a z 594 | C a r o l i n e 595 | G a r c i a 596 | A n d r e y 597 | B o l s h o y 598 | R y c r o f t 599 | C a r t e r 600 | M i l j k o v i ć 601 | V i j a y 602 | K u m a r 603 | S c o t t 604 | B o o t h 605 | C h é t a r d i e 606 | J a c q u e s 607 | J o a c h i m 608 | T r o t t i 609 | V a r a z d a t 610 | B a r t h e l m e s s 611 | R i c h a r d 612 | P o r t e i r o 613 | F é l i x 614 | O k s a n a 615 | K a l a s h n i k o v a 616 | A d l a n 617 | K h a s a n o v 618 | S e v a k 619 | P a r u y r 620 | L e l l 621 | C h r i s t i a n 622 | M a m m a d 623 | Y u s i f 624 | J a f a r o v 625 | C l a r k 626 | H a d d e n 627 | G u s t a v e 628 | H u m b e r t 629 | M a x i m 630 | Z i m i n 631 | K i m 632 | G w o n 633 | J e n n a 634 | C o l e m a n 635 | R a k a n 636 | R u s h a i d a t 637 | I e r o n i m 638 | U b o r e v i c h 639 | N a t a l i a 640 | G r z e g o r z 641 | K a r n a s 642 | G e n n a d y 643 | M i k h a s e v i c h 644 | L i a m 645 | H e a t h 646 | E n r i c o 647 | B e v i g n a n i 648 | V e r h a a s 649 | B a r b a r a 650 | B r i g g s 651 | E v a n s 652 | V i n c e 653 | K e v i n 654 | C a r o l a n 655 | R o b e r t 656 | A n d r e w s 657 | M i l l i k a n 658 | V y a c h e s l a v 659 | S h e v c h u k 660 | J u r i e t t i 661 | F r a n c k 662 | K a t s u r a 663 | T a r ō 664 | D a v i d 665 | R a t h 666 | K e v i n 667 | K a s h a 668 | A l e x a n d e r 669 | K o r z h a k o v 670 | F o r d 671 | F o r d 672 | F o r d 673 | F o r d 674 | M a d o x 675 | A n a t o l i y 676 | M o r o z o v 677 | Y e g o r 678 | R i d o s h 679 | O a k l a n d 680 | R o d r i g o 681 | A s t u r i a s 682 | S i t a 683 | O l i v e 684 | L e m b e 685 | L i n c a r 686 | E r i k 687 | A l o j z y 688 | F e l i ń s k i 689 | O l g a 690 | Y a k o v l e v a 691 | A f i n o g e n o v 692 | A l e x a n d e r 693 | B r a n k o 694 | G a v e l l a 695 | R a n d i 696 | Z u c k e r b e r g 697 | G i u s e p p e 698 | I m p a s t a t o 699 | I v a n 700 | B e t s k o y 701 | B e n o î t 702 | A n g b w a 703 | G u s t a f 704 | J a d e 705 | L a y l a 706 | P o l y c a r p u s 707 | B o d d i n g 708 | P a u l 709 | O l a f 710 | N a g i m a 711 | E s k a l i e v a 712 | J a n e 713 | F e l l o w e s 714 | F e l l o w e s 715 | B a r o n e s s 716 | F e l l o w e s 717 | F e l l o w e s 718 | H e r s h e l e 719 | O s t r o p o l e r 720 | G r a m s c i 721 | A n t o n i o 722 | C h u r a n d y 723 | M a r t i n a 724 | P a u l 725 | F o n o r o f f 726 | F e l l n e r 727 | H e l m e r 728 | T h o m p s o n 729 | T o n y 730 
| L e o n i d 731 | B r o n e v o y 732 | R e i n a l d 733 | S c h n e l l 734 | P é t e r 735 | B e s e n y e i 736 | A n d r e y 737 | Z a l e s k i 738 | M a n s u r 739 | Y a h y a 740 | l e o n i d 741 | L a z a r e v 742 | Y e v g e n y 743 | F a d e y e v 744 | A l e k s a n d r e 745 | M i r t s k h u l a v a 746 | S e r h i y 747 | R u d y k a 748 | D m i t r y 749 | A n d r e i k i n 750 | J o h n s o n 751 | M i c h a e l 752 | V a l e n t i n 753 | G a n e v 754 | I m o g e n e 755 | B l i s s 756 | D i c k 757 | R y a n 758 | K i t t y 759 | K i r w a n 760 | W i n i t s 761 | D a n i e l l e 762 | J a n n e 763 | J a l a s v a a r a 764 | J u l i a n 765 | T o w n s e n d 766 | L i l i y a 767 | S h o b u k h o v a 768 | Z e l j k a 769 | F r a n u l o v i c 770 | B o r i s 771 | R a j e w s k y 772 | S h a r o n 773 | M a n n 774 | B o u l l é e 775 | É t i e n n e 776 | L o u i s 777 | V i k t o r 778 | F r a y o n o v 779 | É v a 780 | K ó c z i á n 781 | M a r i e 782 | M a x 783 | K e i s e r 784 | Y u r i 785 | I v l e v 786 | S t a n i s l a v 787 | G u s t a v o v i c h 788 | S t r u m i l i n 789 | I g o r 790 | L o l o 791 | T a g i r 792 | K u s i m o v 793 | A n t o i n e 794 | R o b e r t 795 | K y r a 796 | B u s c h o r 797 | J a r i 798 | K o s k i n e n 799 | N e l l 800 | B u r t o n 801 | N a s i r u d d i n 802 | Y o u s u f f 803 | J a g e r 804 | C o r n e l i s 805 | M a r k 806 | R o b e r t s 807 | D e l p h i n e 808 | T o m s o n 809 | C r a i g 810 | A d k i n s 811 | Q u a r t u c c i 812 | P e d r o 813 | F r a n c e n a 814 | M c C o r o r y 815 | J o s e p h i n n e 816 | Y a r o s h e v i c h 817 | R i c h a r d 818 | D u c k e t t 819 | L a r n i 820 | M a r t t i 821 | M a r i y a 822 | K r i v o p o l e n o v a 823 | B a r i 824 | M o r g a n 825 | N e f e r k a r a 826 | R o b e r t 827 | K a l e s k i 828 | L y n c h 829 | Y u r i 830 | A l e x a n d r o v 831 | A v d o t i a 832 | T i m o f e y e v a 833 | L á z a r o 834 | Á l v a r e z 835 | M a r g a r e t 836 | B e e b e 837 | A n n a 838 | P a n i n a 839 | L e o n i d 840 | D o b r o v s k y 841 | S u z a n n e 842 | P r i n g l e 843 | R o g e r 844 | M u n i e r 845 | A l e x a n d e r 846 | D e v o n 847 | H u r t 848 | S u s a n n e 849 | B r ü c k n e r 850 | G r e t a 851 | A m e n d 852 | W a l t e r 853 | B e a k e l 854 | P r i n c e s 855 | T o w e r 856 | V i n c e n z o 857 | C e r a m i 858 | P i e r r e 859 | F u n c k 860 | J o s e f 861 | N a d j 862 | G r a n t 863 | P a u l 864 | A d e l a i d e 865 | S o b i e s k a 866 | F o d h l a 867 | C r o n i n 868 | O ' R e i l l y 869 | C a r l 870 | L u d w i g 871 | B a r 872 | K o r n e y 873 | P e t r e n k o 874 | M i c h a e l 875 | T i n s l e y 876 | A l e k s a n d r 877 | K a z a k e v i č 878 | G a l i n a 879 | N i l o v a 880 | A n t o n 881 | N i l o v 882 | J e n n y 883 | T r i p p 884 | I g o r 885 | P e t r e n k o 886 | R i n a t 887 | I b r a g i m o v 888 | N i k o l a y 889 | P a k h o m o v 890 | I u r i i 891 | K r a k o v e t s k i i 892 | Z i n e d i n e 893 | Z i n e d i n e 894 | Z i d a n e 895 | K e l l e n b e r g e r 896 | E m i l 897 | E m i l 898 | J a n e n z 899 | A u d r i n a 900 | P a t r i d g e 901 | P r i s c i a n 902 | N o a h 903 | A k w u 904 | H a r o l d 905 | M a c m i l l a n 906 | B r e n t a n o 907 | F r a n z 908 | J a c o p o 909 | F o r o n i 910 | H i l a r y 911 | D u f f 912 | I r i n a 913 | A v v a k u m o v a 914 | A n n e t t e 915 | v o n 916 | D r o s t e 
917 | H ü l s h o f f 918 | O m a r 919 | T o r r i j o s 920 | A d a m 921 | T a r n o w s k i 922 | M a r i a 923 | G a n s e v o o r t 924 | M e l v i l l 925 | V l a d i m i r 926 | R a u t b a r t 927 | Y e l i z a v e t a 928 | Y a n k o v s k a y a 929 | E d d a 930 | G ö r i n g 931 | A l e k s a n d r a 932 | N i k o l i c 933 | J ā n i s 934 | S t r e n g a 935 | L a r i s a 936 | G u z e e v a 937 | J o a c h i m 938 | D i t t r i c h 939 | J o h a n n 940 | G e o r g 941 | K a n t 942 | G i u l i a 943 | C o s i m o 944 | A m m a n n a t i 945 | J o h n 946 | S i m m i t 947 | S y l v i a 948 | Z i d e k 949 | L a u r e 950 | B a l z a c 951 | C r i s t i n a 952 | D i l l a 953 | S v e t l a n a 954 | Y u r y e v n a 955 | B l e d n a y a 956 | T e r r y 957 | C h a n 958 | M a x 959 | O p h ü l s 960 | A n a s t a s i y a 961 | B i r c t h o v a 962 | M e d e y a 963 | J u g e l i 964 | E d w a r d 965 | O l e s c h a k 966 | M i k e 967 | T a l e r i c o 968 | T h e o p o m p u s 969 | K o n r a d 970 | H u m m l e r 971 | M a r k 972 | H i d d i n k 973 | W e s t e r m a r c k 974 | E d v a r d 975 | A n d r e y 976 | M e l e n s k y 977 | G e n n a d y 978 | R e b r o v 979 | A r i n a 980 | V l a d i s l a v o v n a 981 | B e z b o r o d o v a 982 | A m i r a n 983 | A n a n i d z e 984 | M a x 985 | M e y e r 986 | R o m a n 987 | K h o k h l o v 988 | D a n i e l 989 | H a n d l i n g 990 | R o b e r t 991 | S t e v e n s o n 992 | V l a d i m i r 993 | K i l b u r g 994 | C h u r i k o v 995 | M i k h a i l 996 | K u z m i c h 997 | D m i t r y 998 | M i s h i n 999 | E k a t e r i n a 1000 | S t o y a n o v a 1001 | -------------------------------------------------------------------------------- /nmt-wizard/data/test/helloworld.ruen.test.ru: -------------------------------------------------------------------------------- 1 | А р н о 2 | Г р а н а д о 3 | А л ь б е р т о 4 | Х а м а д 5 | Т у в а й н и 6 | П е г г 7 | Д э в и д 8 | А л е к с а н д р 9 | К о ш и ц 10 | В л а д и м и р 11 | Г у с и н с к и й 12 | В а с к с 13 | П е т е р и с 14 | С у н ь е 15 | Р у б е н 16 | М а н м и т 17 | Б х у л л а р 18 | В о л ь ф г а н г 19 | Б ё т т х е р 20 | Б р а у н е р 21 | В и к т о р 22 | Л у и з а 23 | Н и д е р л а н д с к а я 24 | Ж о э н 25 | Б р э д 26 | М и л л е р 27 | Н и л а м 28 | С а н д ж и в а 29 | Р е д д и 30 | С к л е н а р и к о в а 31 | А д р и а н а 32 | Ш у а й 33 | М а р к 34 | М а р к 35 | М а р к е с 36 | М а р к е с 37 | С и н н е т т 38 | А л ь ф р е д 39 | П е р с и 40 | П а р м е н и о н 41 | Г а й 42 | А з и н и й 43 | Г а л л 44 | И р и с 45 | Т р и 46 | А л е к с 47 | Р о д р и г е з 48 | К о м п а н 49 | Ж а н 50 | Д о м и н и к 51 | Н и к о л а с 52 | Л е а 53 | Ф е л и ч е 54 | Б у р б о н 55 | П а р м с к и й 56 | С и м м о н с 57 | В у д в а р д 58 | О п п е н г е й м е р 59 | Э р н е с т 60 | Ф и л и п п о 61 | Э д у а р д о 62 | Б а с а р а б 63 | Т э р у о 64 | Н а к а м у р а 65 | Д ж о в а н 66 | К и р о в с к и 67 | Ш а р м а 68 | А р в и н д 69 | Т е м и л е 70 | Ф р а н к 71 | С а н т у ш 72 | Г о н с а л в е ш 73 | Ж у а н 74 | П е д р у 75 | М и х а э л ь 76 | Х у т 77 | Ж и р а р д о 78 | И п п о л и т 79 | К а т р и н 80 | П е р р е 81 | К а б а и в а н с к а 82 | Р а й н а 83 | Р а д ж 84 | Р а д ж а р а т н а м 85 | Э э р о 86 | Э э р о 87 | А а р н и о 88 | И н а ф у н э 89 | К э й д з и 90 | Б о н н а 91 | Л е о н 92 | А д и т ь я 93 | Ч о п р а 94 | А н н е н к о в 95 | М и х а и л 96 | А в р а а м 97 | П о р а з 98 | Б 
е б е л 99 | Ж и л б е р т у 100 | Д э в и д 101 | У и л л и с 102 | Б е к х э м 103 | Г е с с 104 | Э д г а р 105 | М о х а м е д 106 | А м с и ф 107 | У а й т 108 | Д ж о р д ж 109 | Н ’ Д и 110 | А с с е м б е 111 | К а л н ы н ь ш 112 | И в а р 113 | В и л е н с к и й 114 | К о н с т а н т и н 115 | Д ж о н 116 | Т а к е р 117 | Д ж о н а т а н 118 | О л д м а н 119 | А л ь б е р т 120 | И о г а н н 121 | Д о б р ы н и н 122 | В я ч е с л а в 123 | С и в о р и 124 | К а м и л л о 125 | С и л в а 126 | Р а м о с 127 | К л а у с 128 | Т е н н ш т е д т 129 | К а р м е 130 | Э л и а с 131 | И м а н о в 132 | Л ю т ф и я р 133 | Т р и ш и а 134 | Х е л ф е р 135 | А д а м 136 | А н д е р с о н 137 | Ю э н 138 | Ф е р г ю с с о н 139 | Д и л ь 140 | Р и ч а р д 141 | И н н о к е н т и й 142 | Г р э й с о н 143 | Х о л л 144 | Д ж о н 145 | И т у э л л 146 | И т у э л л 147 | М а р к о 148 | К е ш е л ь 149 | Б о г у с л а в 150 | Р а д з и в и л л 151 | С т а н и с л а в 152 | К о с т к а 153 | П о т о ц к и й 154 | Э н д и 155 | П л а г у а 156 | К а н н и н г е м 157 | У и л ь я м 158 | А п е р г и с 159 | Ж о р ж 160 | Г а н с 161 | Я к о б 162 | К р и с т о ф ф е л ь 163 | ф о н 164 | Г р и м м е л ь с г а у з е н 165 | Ш а н ж и 166 | А л е н 167 | Д э в и д 168 | Ч е й з 169 | Г и р с 170 | Н и к о л а й 171 | Ф и л ь ч 172 | К а р л 173 | Т о м с о н 174 | У о р р е н 175 | Л у и с 176 | А л ь б е р т о 177 | П е р е а 178 | Э д у а р д о 179 | С а н ч е с 180 | Ф у э н т е с 181 | П у а р е 182 | А л а н 183 | А л е к с е й 184 | Ш а п о ш н и к о в 185 | И р ь ё 186 | Х и е т а н е н 187 | М о р т о 188 | К л о э 189 | А л е к с а н д р 190 | В а н 191 | М а н у с 192 | А р т у р 193 | Д р у ж н и к о в 194 | В л а д и м и р 195 | П а н ч е н к о 196 | Ю р и й 197 | Б е н е д и к т 198 | М а й к л 199 | С е р в е р и с 200 | Н е р с е с 201 | Е р и ц я н 202 | М и н т о н 203 | Ф э й т 204 | Г р е т а 205 | К у к к о н е н 206 | К о у л 207 | Т е й л о р 208 | Д о н а л ь д 209 | Г р э х э м 210 | Б е р т 211 | Т о м а с 212 | Б е р р о у з 213 | Д м и т р и й 214 | А г у р а 215 | Д м и т р и й 216 | И в а н о в и ч 217 | К и т о 218 | Б у л л ь 219 | Я к о б 220 | Б р е д а 221 | Б л а с к о 222 | И б а н ь е с 223 | В и с е н т е 224 | Т и н а 225 | Д э в и с 226 | К р ш е с а д л о 227 | Б у т ч 228 | У о л к е р 229 | Л о р а 230 | Ж ю н о 231 | М а р т а 232 | М э д и с о н 233 | Ф и л и п п 234 | Д ж о р д ж 235 | Н и д е м 236 | С а д е л е р 237 | М а р к е д о н о в 238 | С е р г е й 239 | М е л и а в а 240 | Т а м а з 241 | А п о н ь о 242 | Р у б и н а 243 | А л и 244 | К у н е л л и с 245 | Я н н и с 246 | В а л е р и 247 | Б р и с к о 248 | Х у к с 249 | С э м ю э л 250 | К р о у т е р 251 | О д е н 252 | Г р е г 253 | С у б б о т и н 254 | С е р г е й 255 | С ь ю з а н 256 | А н б е х 257 | Д ж е ф ф р и 258 | К р э й г 259 | Л а б а ф 260 | Х у с е й н 261 | Б а й к а р а 262 | К о н с т а н с а 263 | Д ж о н 264 | К л э п е м 265 | Д е п е с т р 266 | Р е н е 267 | П е р р е н 268 | К л о д 269 | В и к т о р 270 | К е н т ы 271 | П р у д н и к о в 272 | П а в е л 273 | Г е р б е р т 274 | Б р е н о н 275 | Т и б о р 276 | М а х а н 277 | К а л и к с т 278 | Р о б е р т 279 | П а т т и н с о н 280 | В а л е н т и н 281 | Н о в и к о в 282 | С е р г е й 283 | А н д р е а с 284 | Р о м б е р г 285 | Х а н к е е в 286 | И г о р ь 287 | М у р а в с к и й 288 | В а л е р и й 289 | Д у н 290 | Ф а н ч ж о 291 | А н я 292 | Э р з е н 293 | К о л и н 294 | Р у н и 295 | К р 
и с т е р 296 | Ф у г л е с а н г 297 | Х о а к и н 298 | С а н ч е с 299 | К о с а м б и 300 | А р т у р 301 | Д о у в 302 | О у э н 303 | Ч е й з 304 | Г а р с и я 305 | С а н ч е с 306 | И б р а г и м 307 | и б н 308 | Я к у б 309 | Ф р е д д и 310 | Я н г 311 | Г о й т е й н 312 | Ш л о м о 313 | Д о в 314 | Ч а р л ь з 315 | Б е н н е т 316 | А у г с б е р г е р 317 | Ф р а н ц 318 | Ф и д р и х 319 | В а л ь д е м а р 320 | А н т о н 321 | А л е к с а н д е р 322 | ф о н 323 | А у э р ш п е р г 324 | Т о м а с 325 | У о л т е р 326 | С е р г е й 327 | Д м и т р и е в и ч 328 | Б о г д а н о в 329 | В л а д и м и р 330 | Е в т у ш е н к о в 331 | К о к р е л ь 332 | М и с а к 333 | М а н у ш я н 334 | Ф р а н к о в с к и й 335 | Т о м а ш 336 | П а р а ш а р а 337 | Э з р а 338 | Х е й в у д 339 | Ш л о м о 340 | Л а х а т 341 | У и л ь я м 342 | Г а м и л ь т о н 343 | Г и б с о н 344 | Г р е г 345 | К о к с 346 | М а к а р о в 347 | К о н с т а н т и н 348 | Б е р н а р 349 | М о л и к 350 | К а р л о с 351 | Д и о г о 352 | К о в н е р 353 | Б р ю с 354 | Л у к а с 355 | А л а м а н 356 | С а м а н е с 357 | О к а м п о 358 | Д а в и д 359 | В и к т о р 360 | Л о р е 361 | А н а й я 362 | Е л е н а 363 | А н а н и я 364 | Ш и р а к а ц и 365 | Ш и ф р и н 366 | К а р и н 367 | Р у д и 368 | А р н ш т а д т 369 | И р и н а 370 | А б ы с о в а 371 | К а т а л и н а 372 | М и к а э л а 373 | Г а л ч и н с к и й 374 | К о н с т а н т ы 375 | И л ь д е ф о н с 376 | И в о 377 | М и н а р ж 378 | А б у 379 | У б а й д а 380 | и б н 381 | Д ж а р р а х 382 | В а л л и 383 | А л и д а 384 | Н и р о т 385 | К а р л 386 | К с ю а н 387 | А р т ё м 388 | Х а ч а т у р о в 389 | П ь е р 390 | Ф р а н с у а 391 | Т и с с о 392 | В е р л а 393 | Ш а р л ь 394 | Э д у а р д 395 | С е й м у р 396 | Х е р т ф о р д 397 | М а т т и 398 | Х я у р ю 399 | Р о м а н 400 | Б о р и с е в и ч 401 | Л о к к 402 | Д ж о н 403 | О в ч и н н и к о в 404 | В а л е р и й 405 | С у б х а с и ш 406 | Р о й 407 | Ч о у д х у р и 408 | Г е о р г 409 | М о р 410 | А д р и а н 411 | Р о д р и г е с 412 | Е в г е н и й 413 | Д ь я ч к о в 414 | М э р и 415 | Ф р и т 416 | Г в и д о 417 | Л а в е с а р и с 418 | У и л ь я м 419 | В у л л с 420 | Г о д ф р и д 421 | Ф р и з с к и й 422 | С е р г е й 423 | Б е л о в 424 | Б л а й 425 | У и л ь я м 426 | Х е н ш е л ь 427 | М и л т о н 428 | Д ж о р д ж 429 | М о р р и с о н 430 | А д а м 431 | П ё т р 432 | Д у р н о в о 433 | П и т е р 434 | Ш е ф е р 435 | Ч э н ь 436 | Б и н д э 437 | Ж э н ь 438 | Л и н н 439 | Т о р н д а й к 440 | С а й м о н 441 | Г э л л а п 442 | К о н с т а н т и н 443 | К о р о в и н 444 | Ф р а н ц 445 | П ф е ф ф е р 446 | ф о н 447 | З а л о м о н 448 | Б е н а д о 449 | А р и к 450 | Д ж о н 451 | И т т о н 452 | Ш е м я к и н 453 | М и х а и л 454 | Ф е л и п е 455 | Ж о р ж е 456 | Л о р е й р о 457 | Н и к и т а 458 | Б е л ы х 459 | И в а н о в 460 | С у х а р е в с к и й 461 | А л е к с а н д р 462 | Ш а и н о в а 463 | М а р и н а 464 | М е т ш и н 465 | И л ь с у р 466 | У л е 467 | Э л л е ф с е т е р 468 | В и к т о р 469 | К а л и н а 470 | Н а т а н 471 | П а р к е р 472 | Ф р а н т и ш е к 473 | К а б е р л е 474 | Х а б б а р д 475 | К в е н т и н 476 | К о л л и н 477 | Б а ч м а н 478 | К а р и ч 479 | А м и р 480 | И г а н 481 | Ш о н 482 | М а р т и н 483 | Ш т р а н ц л ь 484 | О н д р ж и ч е к 485 | М и р о с л а в 486 | А л л а н 487 | К о м б с 488 | М о р г а н 489 | Б р а й а н 490 | Г е р т р у д а 491 | Б р а у н 
ш в е й г с к а я 492 | Р а ф а э л ь 493 | Б о б а н 494 | Ф р е д е р и к 495 | Д ю б у а 496 | О л ь г а 497 | Ф ё д о р о в а 498 | О р л о ф ф 499 | В а л ь я н 500 | Ж а н 501 | Б а т и с т 502 | Ф и л и б е р 503 | И б р а г и м 504 | С а х а д 505 | И о г а н н 506 | И о а х и м 507 | Л а н г е 508 | К о р ю н 509 | К р и с с и 510 | Т е й л о р 511 | Г р и м а л ь д и 512 | Д э в и д 513 | С а б и н 514 | Т ь е р р и 515 | С а п р ы к и н 516 | О л е г 517 | С а р а 518 | С т о к б р и д ж 519 | И р и н а 520 | Т о м а р с к а я 521 | Д е р е к 522 | М о р р и с 523 | А л о и з и й 524 | Л и л и у с 525 | Л ю б о в ь 526 | К а з а р н о в с к а я 527 | М и х а э л ь 528 | ф о н 529 | д е р 530 | Х а й д е 531 | К о с т и ч 532 | Б о р а 533 | В а с и л и й 534 | К о н с т а н т и н о в и ч 535 | Р и х а р д 536 | С у л и к 537 | П а в е л 538 | Т о н к о в 539 | Р а й а н 540 | К о н р о й 541 | Ч о с и ч 542 | Д о б р и ц а 543 | Б ь ё р н д а л е н 544 | У л е 545 | Э й н а р 546 | М у р а в ь ё в 547 | Н и к о л а й 548 | С и р л 549 | Р о б е р т 550 | Т ь е р и 551 | А м а р 552 | Д ж о э л 553 | Б и л л и 554 | А м и р 555 | Х а д и р 556 | Б а р б а р а 557 | Х а н н и г а н 558 | М и р т а 559 | Б у с н е л л и 560 | Р у д и 561 | Ц е р н е 562 | М а д з у к а т о 563 | А л ь б е р т о 564 | Ф р е й з е р 565 | К р и с т и н 566 | Л е о н а р д 567 | М э т л о в и ч 568 | Ю л и а н 569 | Е р х а н 570 | Н а т а л и 571 | М е н д о с а 572 | Г о м е с 573 | И н н а 574 | П о л 575 | Х а р д и н г 576 | С т э н 577 | М э к 578 | П е т р о в 579 | Д е н и с 580 | Т е й м у р а з 581 | Ч и р г а д з е 582 | Х а л а б а л а 583 | З д е н е к 584 | М о н б а з о н 585 | Э р к ю л ь 586 | Ф и о н а 587 | В и к т о р и 588 | Э н д р ю 589 | Т а р б е т 590 | Х о д о в е ц к и й 591 | Д а н и э л ь 592 | В а ш а к и д з е 593 | Т а м а з 594 | К а р о л и н 595 | Г а р с и я 596 | А н д р е й 597 | Б о л ь ш о й 598 | Р а й к р о ф т 599 | К а р т е р 600 | М и л ь к о в и ч 601 | В и д ж а й 602 | К у м а р 603 | С к о т т 604 | Б у т 605 | Ш е т а р д и 606 | Ж а к 607 | И о а х и м 608 | Т р о т т и 609 | В а р а з д а т 610 | Б а р т е л м е с с 611 | Р и ч а р д 612 | П о р т е й р о 613 | Ф е л и к с 614 | О к с а н а 615 | К а л а ш н и к о в а 616 | А д л а н 617 | Х а с а н о в 618 | С е в а к 619 | П а р у й р 620 | Л е л л ь 621 | К р и с т и а н 622 | М а м е д 623 | Ю с и ф 624 | Д ж а ф а р о в 625 | К л а р к 626 | Х э д д е н 627 | Г ю с т а в 628 | Х у м б е р т 629 | М а к с и м 630 | З и м и н 631 | К и м 632 | Г в о н 633 | Д ж е н н а 634 | К о у л м а н 635 | Р а к а н 636 | Р у с х а й д а т 637 | И е р о н и м 638 | У б о р е в и ч 639 | Н а т а л и я 640 | Г ж е г о ж 641 | К а р н а с 642 | Г е н н а д и й 643 | М и х а с е в и ч 644 | Л и а м 645 | Х и т 646 | Э н р и к о 647 | Б е в и н ь я н и 648 | В е р х а с 649 | Б а р б а р а 650 | Б р и г г с 651 | Э в а н с 652 | В и н с 653 | К е в и н 654 | К е р о л а н 655 | Р о б е р т 656 | Э н д р ю с 657 | М и л л и к е н 658 | В я ч е с л а в 659 | Ш е в ч у к 660 | Ж ю р ь е т т и 661 | Ф р а н к 662 | К а ц у р а 663 | Т а р о 664 | Д а в и д 665 | Р а т 666 | К е в и н 667 | К а ш а 668 | А л е к с а н д р 669 | К о р ж а к о в 670 | Ф о р д 671 | Ф о р д 672 | Ф о р д 673 | Ф о р д 674 | М э д о к с 675 | А н а т о л и й 676 | М о р о з о в 677 | Е г о р 678 | Р и д о ш 679 | О к л е н д с к и й 680 | Р о д р и г о 681 | А с т у р и а с 682 | С и т а 683 | О л и в 684 | Л е м б е 685 | Л и н к а р 
686 | Э р и к 687 | А л о и з и й 688 | Ф е л и н с к и й 689 | О л ь г а 690 | Я к о в л е в а 691 | А ф и н о г е н о в 692 | А л е к с а н д р 693 | Б р а н к о 694 | Г а в е л л а 695 | Р э н д и 696 | Ц у к е р б е р г 697 | Д ж у з е п п е 698 | И м п а с т а т о 699 | И в а н 700 | Б е ц к о й 701 | Б е н у а 702 | А н г б в а 703 | Г у с т а в 704 | Д ж е й д 705 | Л е й л а 706 | П о л и к а р п 707 | Б о д д и н г 708 | П а у л ь 709 | У л а ф 710 | Н а г и м а 711 | Е с к а л и е в а 712 | Д ж е й н 713 | Ф е л л о у з 714 | Ф е л л о у з 715 | б а р о н е с с а 716 | Ф е л л о у з 717 | Ф е л л о у з 718 | Г е р ш 719 | О с т р о п о л е р 720 | Г р а м ш и 721 | А н т о н и о 722 | Ч у р а н д и 723 | М а р т и н а 724 | П о л 725 | Ф о н о р о ф ф 726 | Ф е л ь н е р 727 | Ф е л ь н е р 728 | Т о м п с о н 729 | Т о н и 730 | Л е о н и д 731 | Б р о н е в о й 732 | Р е й н а л ь д 733 | Ш н е л ь 734 | П е т е р 735 | Б е ш е н ы й 736 | А н д р е й 737 | З а л е с к и й 738 | М а н с у р 739 | Я х ъ я 740 | Л е о н и д 741 | Л а з а р е в 742 | Е в г е н и й 743 | Ф а д е е в 744 | А л е к с а н д р 745 | М и р ц х у л а в а 746 | С е р г е й 747 | Р у д ы к а 748 | Д м и т р и й 749 | А н д р е й к и н 750 | Д ж о н с о н 751 | М а й к л 752 | В а л е н т и н 753 | Г а н е в 754 | И м о д ж и н 755 | Б л и с с 756 | Д и к 757 | Р а й а н 758 | К и т т и 759 | К и р в а н 760 | У и н и т с 761 | Д а н и э л ь 762 | Я н н е 763 | Я л а с в а а р а 764 | Д ж у л и а н 765 | Т а у н с е н д 766 | Л и л и я 767 | Ш о б у х о в а 768 | Ж е л ь к а 769 | Ф р а н у л о в и ч 770 | Б о р и с 771 | Р а е в с к и й 772 | Ш а р о н 773 | М э н н 774 | Б у л л е 775 | Э т ь е н 776 | Л у и 777 | В и к т о р 778 | Ф р а ё н о в 779 | Э в а 780 | К о ц и а н 781 | М а р и я 782 | М а к с 783 | К а й з е р 784 | Ю р и й 785 | И в л е в 786 | С т а н и с л а в 787 | Г у с т а в о в и ч 788 | С т р у м и л и н 789 | И г о р 790 | Л о л о 791 | Т а г и р 792 | К у с и м о в 793 | А н т у а н 794 | Р о б е р т 795 | К а й р а 796 | Б у ш о р 797 | Я а р и 798 | К о с к и н е н 799 | Н е л л 800 | Б ё р т о н 801 | Н а с и р у д д и н 802 | Ю с у ф ф 803 | Я г е р 804 | К о р н е л и с 805 | М а р к 806 | Р о б е р т с 807 | Д э л ь ф и н 808 | Т о м с о н 809 | К р е й г 810 | Э д к и н с 811 | К в а р т у ч ч и 812 | П е д р о 813 | Ф р а н с е н а 814 | М а к к о р о р и 815 | Ж о з е ф и н н а 816 | Я р о ш е в и ч 817 | Р и ч а р д 818 | Д а к е т т 819 | Л а р н и 820 | М а р т т и 821 | М а р и я 822 | К р и в о п о л е н о в а 823 | Б а р и 824 | М о р г а н 825 | Н е ф е р к а р а 826 | Р о б е р т 827 | К а л е с к и 828 | Л и н ч 829 | Ю р и й 830 | А л е к с а н д р о в 831 | А в д о т ь я 832 | Т и м о ф е е в н а 833 | Л а с а р о 834 | А л ь в а р е с 835 | М а р г а р е т 836 | Б и б а 837 | А н н а 838 | П а н и н а 839 | Л е о н и д 840 | Д о б р о в с к и й 841 | С ю з а н н 842 | П р и н г л 843 | Р о ж е 844 | М ю н ь е 845 | А л е к с а н д р 846 | Д е в о н 847 | Х ё р т 848 | С ю з а н н а 849 | Б р ю к н е р 850 | Г р е т а 851 | А м е н д 852 | У о л т е р 853 | Б и к е л 854 | П р и н ц ы 855 | Т а у э р е 856 | В и н ч е н ц о 857 | Ч е р а м и 858 | П ь е р 859 | Ф у н к 860 | Ж о з е ф 861 | Н а д ж 862 | Г р а н т 863 | П о л 864 | А д е л а и д а 865 | С о б е с к а я 866 | Ф о д л а 867 | К р о н и н 868 | О ’ Р е й л и 869 | К а р л 870 | Л ю д в и г 871 | Б а р 872 | К о р н е й 873 | П е т р е н к о 874 | М а й к л 875 | Т и н с л и 876 | А л е к с а н д р 
877 | К а з а к е в и ч 878 | Г а л и н а 879 | Н и л о в а 880 | А н т о н 881 | Н и л о в 882 | Д ж е н н и 883 | Т р и п 884 | И г о р ь 885 | П е т р е н к о 886 | Р и н а т 887 | И б р а г и м о в 888 | Н и к о л а й 889 | П а х о м о в 890 | Ю р и й 891 | К р а к о в е т с к и й 892 | З и н е д и н 893 | З и д а н 894 | З и д а н 895 | К е л л е н б е р г е р 896 | Э м и л ь 897 | Э м и л ь 898 | Я н е н ц 899 | О д р и н а 900 | П э т р и д ж 901 | П р и с ц и а н 902 | Н о й 903 | А к в у 904 | Г а р о л ь д 905 | М а к м и л л а н 906 | Б р е н т а н о 907 | Ф р а н ц 908 | Я к о п о 909 | Ф о р о н и 910 | Х и л а р и 911 | Д а ф ф 912 | И р и н а 913 | А в в а к у м о в а 914 | А н н е т т е 915 | ф о н 916 | Д р о с т е 917 | Х ю л ь с х о ф ф 918 | О м а р 919 | Т о р р и х о с 920 | А д а м 921 | Т а р н о в с к и й 922 | М а р и я 923 | Г а н с в о р т 924 | М е л в и л л 925 | В л а д и м и р 926 | Р а у т б а р т 927 | Е л и з а в е т а 928 | Я н к о в с к а я 929 | Э д д а 930 | Г е р и н г 931 | А л е к с а н д р а 932 | Н и к о л и ч 933 | Я н и с 934 | С т р е н г а 935 | Л а р и с а 936 | Г у з е е в а 937 | Й о а х и м 938 | Д и т р и х 939 | И о г а н н 940 | Г е о р г 941 | К а н т 942 | Д ж у л и я 943 | К о з и м о 944 | А м м а н н а т и 945 | Д ж о н 946 | С и м м и т 947 | С и л ь в и я 948 | З и д е к 949 | Л а у р а 950 | Б а л ь з а к 951 | К р и с т и н а 952 | Д и л ь я 953 | С в е т л а н а 954 | Ю р ь е в н а 955 | Б л е д н а я 956 | Т е р р и 957 | Ч а н 958 | М а к с 959 | О ф ю л ь с 960 | А н а с т а с и я 961 | Б и р ю ч е в а 962 | М е д е я 963 | Д ж у г е л и 964 | Э д в а р д 965 | О л е щ а к 966 | М а й к 967 | Т а л е р и к о 968 | Ф е о п о м п 969 | К о н р а д 970 | Х у м м л е р 971 | М а р к 972 | Х и д д и н к 973 | В е с т е р м а р к 974 | Э д в а р д 975 | А н д р е й 976 | М е л е н с к и й 977 | Г е н н а д и й 978 | Р е б р о в 979 | А р и н а 980 | В л а д и с л а в о в н а 981 | Б е з б о р о д о в а 982 | А м и р а н 983 | А н а н и д з е 984 | М а к с 985 | М а й е р 986 | Р о м а н 987 | Х о х л о в 988 | Д э н н и 989 | Х э н д л и н г 990 | Р о б е р т 991 | С т и в е н с о н 992 | В л а д и м и р 993 | К и л ь б у р г 994 | Ч у р и к о в 995 | М и х а и л 996 | К у з ь м и ч 997 | Д м и т р и й 998 | М и ш и н 999 | Е к а т е р и н а 1000 | С т о я н о в а 1001 | -------------------------------------------------------------------------------- /nmt-wizard/data/vocab/helloworld.ruen.src.dict: -------------------------------------------------------------------------------- 1 | Ф 2 | р 3 | е 4 | д 5 | Р 6 | о 7 | ж 8 | с 9 | Х 10 | а 11 | т 12 | М 13 | л 14 | и 15 | Д 16 | н 17 | Л 18 | у 19 | к 20 | й 21 | ф 22 | б 23 | ш 24 | Э 25 | в 26 | Т 27 | ё 28 | п 29 | Ш 30 | К 31 | А 32 | ч 33 | ь 34 | Е 35 | Б 36 | г 37 | я 38 | х 39 | П 40 | ы 41 | м 42 | У 43 | С 44 | ю 45 | з 46 | В 47 | Ж 48 | э 49 | Н 50 | Я 51 | З 52 | Г 53 | И 54 | О 55 | Ч 56 | ц 57 | Й 58 | ’ 59 | Ю 60 | ъ 61 | Ц 62 | щ 63 | Ё 64 | Ы 65 | ' 66 | Щ 67 | « 68 | » 69 | ` 70 | е́ 71 | и́ 72 | а́ 73 | о́ 74 | ћ 75 | у́ 76 | ј 77 | ‘ 78 | -------------------------------------------------------------------------------- /nmt-wizard/data/vocab/helloworld.ruen.tgt.dict: -------------------------------------------------------------------------------- 1 | F 2 | r 3 | e 4 | d 5 | R 6 | o 7 | g 8 | s 9 | H 10 | a 11 | t 12 | M 13 | l 14 | i 15 | J 16 | n 17 | L 18 | u 19 | D 20 | c 21 | f 22 | h 23 | k 24 | b 25 | E 26 | w 27 | T 28 | P 29 | p 30 | S 31 | K 32 | A 33 
| C 34 | B 35 | z 36 | j 37 | m 38 | W 39 | V 40 | q 41 | x 42 | y 43 | v 44 | N 45 | é 46 | G 47 | è 48 | í 49 | O 50 | ü 51 | ō 52 | U 53 | Đ 54 | I 55 | ' 56 | Z 57 | Æ 58 | ł 59 | á 60 | ó 61 | Š 62 | ě 63 | Y 64 | š 65 | ö 66 | Ž 67 | ć 68 | ç 69 | Ł 70 | ð 71 | ő 72 | ļ 73 | Ø 74 | É 75 | X 76 | Q 77 | ê 78 | Á 79 | ø 80 | ë 81 | ä 82 | ș 83 | č 84 | å 85 | ñ 86 | Å 87 | û 88 | ú 89 | Õ 90 | ā 91 | ṇ 92 | î 93 | ş 94 | â 95 | ă 96 | ž 97 | ò 98 | ı 99 | ń 100 | Ó 101 | ý 102 | ū 103 | ß 104 | ř 105 | Ż 106 | ÿ 107 | ’ 108 | Č 109 | ã 110 | Þ 111 | ś 112 | ę 113 | ô 114 | ī 115 | æ 116 | Ö 117 | đ 118 | Ś 119 | ï 120 | ů 121 | œ 122 | Ș 123 | Ş 124 | Ā 125 | ė 126 | ğ 127 | Ç 128 | ŀ 129 | ņ 130 | ą 131 | ē 132 | à 133 | Ō 134 | ´ 135 | õ 136 | Í 137 | ‘ 138 | Ğ 139 | İ 140 | Ď 141 | Ľ 142 | ŭ 143 | ķ 144 | ț 145 | ż 146 | ţ 147 | ć 148 | Ä 149 | Ñ 150 | ť 151 | Ü 152 | ň 153 | Ć 154 | þ 155 | ì 156 | Ē 157 |  158 | ` 159 | ľ 160 | Ķ 161 | ģ 162 | È 163 | Œ 164 | Ț 165 | ṣ 166 | Ċ 167 | À 168 | Ř 169 | ù 170 | ũ 171 | ĩ 172 | Ô 173 | ṭṭ 174 | ű 175 | ′ 176 | ǎ 177 | ư 178 | ọ 179 | Ź 180 | Ú 181 | Ţ 182 | ź 183 | ‎ 184 | Ð 185 | ạ 186 | ẫ 187 | ụ 188 | ễ 189 | ậ 190 | Ģ 191 | Ṭ 192 | ĭ 193 | Ê 194 | Ī 195 | ĕ 196 | Ņ 197 | ắ 198 | ệ 199 | ứ 200 | ŵ 201 | Ḥ 202 | ṅ 203 | · 204 | Ḫ 205 | ‐ 206 | ả 207 | ď 208 | ṛ 209 | -------------------------------------------------------------------------------- /unsupervised-nmt/LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017-present The OpenNMT Authors. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /unsupervised-nmt/README.md: -------------------------------------------------------------------------------- 1 | *Owner: Guillaume Klein (guillaume.klein (at) systrangroup.com)* 2 | 3 | # Unsupervised NMT with TensorFlow and OpenNMT-tf 4 | 5 | We propose you to implement the paper [*Unsupervised Machine Translation Using Monolingual Corpora Only*](https://arxiv.org/abs/1711.00043) (G. Lample et al. 2017) using [TensorFlow](https://www.tensorflow.org/) and [OpenNMT-tf](https://github.com/OpenNMT/OpenNMT-tf). 
While the project might take more than one day to complete, the goal of this session is to: 6 | 7 | * dive into an interesting research paper applying adversarial training to NMT; 8 | * learn about some TensorFlow mechanics; 9 | * discover OpenNMT-tf concepts and APIs. 10 | 11 | The guide is a step-by-step implementation with some functions left unimplemented for the curious reader. The completed code is available in the `ref/` directory. 12 | 13 | ## Table of contents 14 | 15 | * [Requirements](#requirements) 16 | * [Data](#data) 17 | * [Step-by-step tutorial](#step-by-step-tutorial) 18 | * [Training](#training) 19 | * [Step 0: Base file](#step-0-base-file) 20 | * [Step 1: Reading the data](#step-1-reading-the-data) 21 | * [Step 2: Noise model](#step-2-noise-model) 22 | * [Step 3: Creating embeddings](#step-3-creating-embeddings) 23 | * [Step 4: Encoding noisy inputs](#step-4-encoding-noisy-inputs) 24 | * [Step 5: Denoising noisy encoding](#step-5-denoising-noisy-encoding) 25 | * [Step 6: Discriminating encodings](#step-6-discriminating-encodings) 26 | * [Step 7: Optimization and training loop](#step-7-optimization-and-training-loop) 27 | * [Inference](#inference) 28 | * [Step 0: Base file](#step-0-base-file-1) 29 | * [Step 1: Reading data](#step-1-reading-data) 30 | * [Step 2: Rebuilding the model](#step-2-rebuilding-the-model) 31 | * [Step 3: Encoding and decoding](#step-3-encoding-and-decoding) 32 | * [Step 4: Loading and translating](#step-4-loading-and-translating) 33 | * [Complete training flow](#complete-training-flow) 34 | 35 | ## Requirements 36 | 37 | * `git` 38 | * `python` >= 2.7 39 | * `virtualenv` 40 | 41 | ```bash 42 | git clone https://github.com/OpenNMT/Hackathon.git 43 | cd Hackathon/unsupervised-nmt 44 | virtualenv env 45 | source env/bin/activate 46 | pip install -r requirements.txt.cpu 47 | ``` 48 | 49 | ## Data 50 | 51 | The data are available at: 52 | 53 | * [`unsupervised-nmt-enfr.tar.bz2`](https://s3.amazonaws.com/opennmt-trainingdata/unsupervised-nmt-enfr.tar.bz2) (2.2 GB) 54 | * [`unsupervised-nmt-enfr-dev.tar.bz2`](https://s3.amazonaws.com/opennmt-trainingdata/unsupervised-nmt-enfr-dev.tar.bz2) (2.4 MB) 55 | 56 | To get started, we recommend downloading the `dev` version which contains a small training set with 10K sentences. 57 | 58 | Both packages contain the vocabulary files and a first translation of the training data using [an unsupervised word-by-word translation model](https://github.com/jsenellart/papers/tree/master/WordTranslationWithoutParallelData)\* as described in the paper. The full data additionally contains pretrained word embeddings using [fastText](https://github.com/facebookresearch/fastText). 59 | 60 | \* also see the [MUSE](https://github.com/facebookresearch/MUSE) project that was recently released by Facebook Research team. 61 | 62 | ## Step-by-step tutorial 63 | 64 | For this tutorial, the following resources might come handy: 65 | 66 | * [TensorFlow documentation](https://www.tensorflow.org/api_docs/python/) 67 | * [OpenNMT-tf documentation](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.html) 68 | * [Numpy documentation](https://docs.scipy.org/doc/numpy/reference/index.html) 69 | 70 | and of course the research paper linked above. 71 | 72 | ### Training 73 | 74 | The training script `ref/training.py` implements one training iteration as described in the paper. Follow the next section to implement your own or understand the reference file. 75 | 76 | #### Step 0: Base file 77 | 78 | You can use this file to start implementing. 
It includes some common imports and a minimal command line argument parser: 79 | 80 | ```python 81 | from __future__ import print_function 82 | 83 | import argparse 84 | import sys 85 | 86 | import tensorflow as tf 87 | import opennmt as onmt 88 | import numpy as np 89 | 90 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 91 | parser.add_argument("--model_dir", default="model", help="Checkpoint directory.") 92 | args = parser.parse_args() 93 | ``` 94 | 95 | #### Step 1: Reading the data 96 | 97 | Loading text data in TensorFlow is made easy with the [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) and the [`tf.contrib.lookup.index_table_from_file`](https://www.tensorflow.org/api_docs/python/tf/contrib/lookup/index_table_from_file) function. This code is provided so that we can ensure the data format used as input. 98 | 99 | The required data are: 100 | 101 | * the source and target monolingual datasets 102 | * the source and target monolingual datasets translation from model `M(t-1)` 103 | * the source and target vocabularies 104 | 105 | Let's add the following command line options: 106 | 107 | ```python 108 | parser.add_argument("--src", required=True, help="Source file.") 109 | parser.add_argument("--tgt", required=True, help="Target file.") 110 | parser.add_argument("--src_trans", required=True, help="Source translation at the previous iteration.") 111 | parser.add_argument("--tgt_trans", required=True, help="Target translation at the previous iteration.") 112 | parser.add_argument("--src_vocab", required=True, help="Source vocabulary.") 113 | parser.add_argument("--tgt_vocab", required=True, help="Target vocabulary.") 114 | ``` 115 | 116 | and then create the training iterators: 117 | 118 | ```python 119 | from opennmt import constants 120 | from opennmt.utils.misc import count_lines 121 | 122 | def load_vocab(vocab_file): 123 | """Returns a lookup table and the vocabulary size.""" 124 | vocab_size = count_lines(vocab_file) + 1 # Add UNK. 125 | vocab = tf.contrib.lookup.index_table_from_file( 126 | vocab_file, 127 | vocab_size=vocab_size - 1, 128 | num_oov_buckets=1) 129 | return vocab, vocab_size 130 | 131 | def load_data(input_file, 132 | translated_file, 133 | input_vocab, 134 | translated_vocab, 135 | batch_size=32, 136 | max_seq_len=50, 137 | num_buckets=5): 138 | """Returns an iterator over the training data.""" 139 | 140 | def _make_dataset(text_file, vocab): 141 | dataset = tf.data.TextLineDataset(text_file) 142 | dataset = dataset.map(lambda x: tf.string_split([x]).values) # Split on spaces. 143 | dataset = dataset.map(vocab.lookup) # Lookup token in vocabulary. 144 | return dataset 145 | 146 | def _key_func(x): 147 | bucket_width = (max_seq_len + num_buckets - 1) // num_buckets 148 | bucket_id = x["length"] // bucket_width 149 | bucket_id = tf.minimum(bucket_id, num_buckets) 150 | return tf.to_int64(bucket_id) 151 | 152 | def _reduce_func(unused_key, dataset): 153 | return dataset.padded_batch(batch_size, { 154 | "ids": [None], 155 | "ids_in": [None], 156 | "ids_out": [None], 157 | "length": [], 158 | "trans_ids": [None], 159 | "trans_length": []}) 160 | 161 | bos = tf.constant([constants.START_OF_SENTENCE_ID], dtype=tf.int64) 162 | eos = tf.constant([constants.END_OF_SENTENCE_ID], dtype=tf.int64) 163 | 164 | # Make a dataset from the input and translated file. 
165 | input_dataset = _make_dataset(input_file, input_vocab) 166 | translated_dataset = _make_dataset(translated_file, translated_vocab) 167 | dataset = tf.data.Dataset.zip((input_dataset, translated_dataset)) 168 | dataset = dataset.shuffle(200000) 169 | 170 | # Define the input format. 171 | dataset = dataset.map(lambda x, y: { 172 | "ids": x, 173 | "ids_in": tf.concat([bos, x], axis=0), 174 | "ids_out": tf.concat([x, eos], axis=0), 175 | "length": tf.shape(x)[0], 176 | "trans_ids": y, 177 | "trans_length": tf.shape(y)[0]}) 178 | 179 | # Filter out invalid examples. 180 | dataset = dataset.filter(lambda x: tf.greater(x["length"], 0)) 181 | 182 | # Batch the dataset using a bucketing strategy. 183 | dataset = dataset.apply(tf.contrib.data.group_by_window( 184 | _key_func, 185 | _reduce_func, 186 | window_size=batch_size)) 187 | return dataset.make_initializable_iterator() 188 | 189 | src_vocab, src_vocab_size = load_vocab(args.src_vocab) 190 | tgt_vocab, tgt_vocab_size = load_vocab(args.tgt_vocab) 191 | 192 | with tf.device("/cpu:0"): # Input pipeline should always be place on the CPU. 193 | src_iterator = load_data(args.src, args.src_trans, src_vocab, tgt_vocab) 194 | tgt_iterator = load_data(args.tgt, args.tgt_trans, tgt_vocab, src_vocab) 195 | src = src_iterator.get_next() 196 | tgt = tgt_iterator.get_next() 197 | ``` 198 | 199 | Here we use the bucketing strategy to make sure batches contain sequences of similar length which reduces the amount of padding and makes the training more efficient. For large training sets, we could also use the hard constraint of only batching sentences of the same length. 200 | 201 | You can test by printing the first example: 202 | 203 | ```python 204 | with tf.Session() as sess: 205 | sess.run(tf.global_variables_initializer()) 206 | sess.run(tf.tables_initializer()) 207 | sess.run([src_iterator.initializer, tgt_iterator.initializer]) 208 | print(sess.run(src)) 209 | ``` 210 | 211 | *During development, you can reuse this session creation code to print tensor values.* 212 | 213 | #### Step 2: Noise model 214 | 215 | This refers to the `C(x)` function described in the *Section 2.3* of the paper. As this function does not require backpropagation, we suggest to implement it in pure Python to make things easier: 216 | 217 | ```python 218 | def add_noise_python(words, dropout=0.1, k=3): 219 | """Applies the noise model in input words. 220 | 221 | Args: 222 | words: A numpy vector of word ids. 223 | dropout: The probability to drop words. 224 | k: Maximum distance of the permutation. 225 | 226 | Returns: 227 | A noisy numpy vector of word ids. 
228 | """ 229 | # FIXME 230 | raise NotImplementedError() 231 | 232 | def add_noise(ids, sequence_length): 233 | """Wraps add_noise_python for a batch of tensors.""" 234 | 235 | def _add_noise_single(ids, sequence_length): 236 | noisy_ids = add_noise_python(ids[:sequence_length]) 237 | noisy_sequence_length = len(noisy_ids) 238 | ids[:noisy_sequence_length] = noisy_ids 239 | ids[noisy_sequence_length:] = 0 240 | return ids, np.int32(noisy_sequence_length) 241 | 242 | noisy_ids, noisy_sequence_length = tf.map_fn( 243 | lambda x: tf.py_func(_add_noise_single, x, [ids.dtype, tf.int32]), 244 | [ids, sequence_length], 245 | dtype=[ids.dtype, tf.int32], 246 | back_prop=False) 247 | 248 | noisy_ids.set_shape(ids.get_shape()) 249 | noisy_sequence_length.set_shape(sequence_length.get_shape()) 250 | 251 | return noisy_ids, noisy_sequence_length 252 | ``` 253 | 254 | The wrapper uses [`tf.py_func`](https://www.tensorflow.org/api_docs/python/tf/py_func) to include a Python function in the computation graph, and [`tf.map_fn`](https://www.tensorflow.org/api_docs/python/tf/map_fn) to apply the noise model on each sequence in the batch. 255 | 256 | #### Step 3: Creating embeddings 257 | 258 | The paper uses pretrained embeddings to initialize the embeddings of the model. Pretrained emnbeddings are included in the full data package (see above) and can be easily loaded with the [`load_pretrained_embeddings`](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.inputters.text_inputter.html#opennmt.inputters.text_inputter.load_pretrained_embeddings) function from OpenNMT-tf. 259 | 260 | First you should add new command line arguments to accept pretrained word embeddings: 261 | 262 | ```python 263 | parser.add_argument("--src_emb", default=None, help="Source embedding.") 264 | parser.add_argument("--tgt_emb", default=None, help="Target embedding.") 265 | ``` 266 | 267 | Then, here is the code to load or create the embedding [`tf.Variable`](https://www.tensorflow.org/api_docs/python/tf/Variable): 268 | 269 | ```python 270 | from opennmt.inputters.text_inputter import load_pretrained_embeddings 271 | 272 | def create_embeddings(vocab_size, depth=300): 273 | """Creates an embedding variable.""" 274 | return tf.get_variable("embedding", shape=[vocab_size, depth]) 275 | 276 | def load_embeddings(embedding_file, vocab_file): 277 | """Loads an embedding variable or embeddings file.""" 278 | try: 279 | embeddings = tf.get_variable("embedding") 280 | except ValueError: 281 | pretrained = load_pretrained_embeddings( 282 | embedding_file, 283 | vocab_file, 284 | num_oov_buckets=1, 285 | with_header=True, 286 | case_insensitive_embeddings=True) 287 | embeddings = tf.get_variable( 288 | "embedding", 289 | shape=None, 290 | trainable=False, 291 | initializer=tf.constant(pretrained.astype(np.float32))) 292 | return embeddings 293 | 294 | with tf.variable_scope("src"): 295 | if args.src_emb is not None: 296 | src_emb = load_embeddings(args.src_emb, args.src_vocab) 297 | else: 298 | src_emb = create_embeddings(src_vocab_size) 299 | 300 | with tf.variable_scope("tgt"): 301 | if args.tgt_emb is not None: 302 | tgt_emb = load_embeddings(args.tgt_emb, args.tgt_vocab) 303 | else: 304 | tgt_emb = create_embeddings(tgt_vocab_size) 305 | ``` 306 | 307 | #### Step 4: Encoding noisy inputs 308 | 309 | The encoding uses a standard bidirectional LSTM encoder as described in *Section 2.1*. 
Fortunately, OpenNMT-tf exposes [several encoders](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.encoders.html) that can be used with a simple interface. 310 | 311 | First, create a new encoder instance: 312 | 313 | ```python 314 | hidden_size = 512 315 | encoder = onmt.encoders.BidirectionalRNNEncoder(2, hidden_size) 316 | ``` 317 | 318 | Then, you should implement the function `add_noise_and_encode`: 319 | 320 | ```python 321 | def add_noise_and_encode(ids, sequence_length, embedding, reuse=None): 322 | """Applies the noise model on ids, embeds and encodes. 323 | 324 | Args: 325 | ids: The tensor of words ids of shape [batch_size, max_time]. 326 | sequence_length: The tensor of sequence length of shape [batch_size]. 327 | embedding: The embedding variable. 328 | reuse: If True, reuse the encoder variables. 329 | 330 | Returns: 331 | A tuple (encoder output, encoder state, sequence length). 332 | """ 333 | # FIXME 334 | raise NotImplementedError() 335 | ``` 336 | 337 | **Related resources:** 338 | 339 | * [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) 340 | * [`tf.variable_scope`](https://www.tensorflow.org/api_docs/python/tf/variable_scope) 341 | * [`onmt.encoders.Encoder.encode`](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.encoders.encoder.html#opennmt.encoders.encoder.Encoder.encode) 342 | 343 | At this point, you have everything you need to implement the encoding part shown in *Figure 2*: 344 | 345 |

346 | ![Encoding part of the architecture (Figure 2)](img/encoding.png) 347 |
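If you want to check your solution, here is a minimal `add_noise_and_encode` that matches the reference implementation in `ref/training.py`: apply the noise model, embed the noisy ids, and encode them inside a shared `encoder` variable scope.

```python
def add_noise_and_encode(ids, sequence_length, embedding, reuse=None):
  # Corrupt the inputs with the noise model C(x), then embed them.
  noisy_ids, noisy_sequence_length = add_noise(ids, sequence_length)
  noisy = tf.nn.embedding_lookup(embedding, noisy_ids)
  # The same encoder weights are shared by all four calls below, hence the reuse flag.
  with tf.variable_scope("encoder", reuse=reuse):
    return encoder.encode(noisy, sequence_length=noisy_sequence_length)
```

The four calls below then cover the auto-encoding paths (noisy source and target sentences) and the cross-domain paths (noisy translations produced by the previous model):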
348 | 349 | ```python 350 | src_encoder_auto = add_noise_and_encode( 351 | src["ids"], src["length"], src_emb, reuse=None) 352 | tgt_encoder_auto = add_noise_and_encode( 353 | tgt["ids"], tgt["length"], tgt_emb, reuse=True) 354 | 355 | src_encoder_cross = add_noise_and_encode( 356 | tgt["trans_ids"], tgt["trans_length"], src_emb, reuse=True) 357 | tgt_encoder_cross = add_noise_and_encode( 358 | src["trans_ids"], src["trans_length"], tgt_emb, reuse=True) 359 | ``` 360 | 361 | #### Step 5: Denoising noisy encoding 362 | 363 | This step completes *Section 2.3* and *2.4* of the paper by denoising noisy inputs. It uses a OpenNMT-tf attentional decoder that starts from the encoder final state: 364 | 365 | ```python 366 | decoder = onmt.decoders.AttentionalRNNDecoder( 367 | 2, hidden_size, bridge=onmt.layers.CopyBridge()) 368 | ``` 369 | 370 | You can then implement the `denoise` function: 371 | 372 | ```python 373 | from opennmt.utils.losses import cross_entropy_sequence_loss 374 | 375 | def denoise(x, embedding, encoder_outputs, generator, reuse=None): 376 | """Denoises from the noisy encoding. 377 | 378 | Args: 379 | x: The input data from the dataset. 380 | embedding: The embedding variable. 381 | encoder_outputs: A tuple with the encoder outputs. 382 | generator: A tf.layers.Dense instance for projecting the logits. 383 | reuse: If True, reuse the decoder variables. 384 | 385 | Returns: 386 | The decoder loss. 387 | """ 388 | raise NotImplementedError() 389 | ``` 390 | 391 | **Related resources:** 392 | 393 | * [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) 394 | * [`tf.variable_scope`](https://www.tensorflow.org/api_docs/python/tf/variable_scope) 395 | * [`cross_entropy_sequence_loss`](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.utils.losses.html#opennmt.utils.losses.cross_entropy_sequence_loss) 396 | * [`onmt.decoders.Decoder.decode`](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.decoders.decoder.html#opennmt.decoders.decoder.Decoder.decode) 397 | 398 | and build the `generator` for source and target: 399 | 400 | ```python 401 | with tf.variable_scope("src"): 402 | src_gen = tf.layers.Dense(src_vocab_size) 403 | src_gen.build([None, hidden_size]) 404 | 405 | with tf.variable_scope("tgt"): 406 | tgt_gen = tf.layers.Dense(tgt_vocab_size) 407 | tgt_gen.build([None, hidden_size]) 408 | ``` 409 | 410 | Now, you can implement the decoding part of the architecture presented in *Figure 2*: 411 | 412 |

413 | ![Decoding part of the architecture (Figure 2)](img/decoding.png) 414 |
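As a reference point, the `denoise` function in `ref/training.py` decodes from the BOS-prefixed `ids_in` sequence, attends over the noisy encoding, and returns the cross-entropy loss against `ids_out`, normalized by the number of target tokens:

```python
def denoise(x, embedding, encoder_outputs, generator, reuse=None):
  # Decode the original (clean) sequence from the noisy encoding.
  with tf.variable_scope("decoder", reuse=reuse):
    logits, _, _ = decoder.decode(
        tf.nn.embedding_lookup(embedding, x["ids_in"]),
        x["length"] + 1,  # +1 accounts for the <s> token prepended in ids_in.
        initial_state=encoder_outputs[1],
        output_layer=generator,
        memory=encoder_outputs[0],
        memory_sequence_length=encoder_outputs[2])
    # Average the cumulated cross entropy over the target tokens.
    cumulated_loss, _, normalizer = cross_entropy_sequence_loss(
        logits, x["ids_out"], x["length"] + 1)
    return cumulated_loss / normalizer
```

The calls below then give the auto-encoding losses (`l_auto_*`) and the cross-domain losses (`l_cd_*`):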
415 | 416 | ```python 417 | l_auto_src = denoise(src, src_emb, src_encoder_auto, src_gen, reuse=None) 418 | l_auto_tgt = denoise(tgt, tgt_emb, tgt_encoder_auto, tgt_gen, reuse=True) 419 | 420 | l_cd_src = denoise(src, src_emb, tgt_encoder_cross, src_gen, reuse=True) 421 | l_cd_tgt = denoise(tgt, tgt_emb, src_encoder_cross, tgt_gen, reuse=True) 422 | ``` 423 | 424 | #### Step 6: Discriminating encodings 425 | 426 | This represents the adversarial part of the model as described in Section *2.5*. The architecture of the discriminator in described in *Section 4.4*. 427 | 428 | Here, you are asked to implement the binary cross entropy and the discriminator. 429 | 430 | ```python 431 | def binary_cross_entropy(x, y, smoothing=0, epsilon=1e-12): 432 | """Computes the averaged binary cross entropy. 433 | 434 | bce = y*log(x) + (1-y)*log(1-x) 435 | 436 | Args: 437 | x: The predicted labels. 438 | y: The true labels. 439 | smoothing: The label smoothing coefficient. 440 | 441 | Returns: 442 | The cross entropy. 443 | """ 444 | # FIXME 445 | raise NotImplementedError() 446 | 447 | def discriminator(encodings, 448 | sequence_lengths, 449 | lang_ids, 450 | num_layers=3, 451 | hidden_size=1024, 452 | dropout=0.3): 453 | """Discriminates the encoder outputs against lang_ids. 454 | 455 | Args: 456 | encodings: The encoder outputs of shape [4*batch_size, max_time, hidden_size]. 457 | sequence_lengths: The length of each sequence of shape [4*batch_size]. 458 | lang_ids: The true lang id of each sequence of shape [4*batch_size]. 459 | num_layers: The number of layers of the discriminator. 460 | hidden_size: The hidden size of the discriminator. 461 | dropout: The dropout to apply on each discriminator layer output. 462 | 463 | Returns: 464 | A tuple with: the discriminator loss (L_d) and the adversarial loss (L_adv). 465 | """ 466 | # FIXME 467 | raise NotImplementedError() 468 | ``` 469 | 470 | **Related resources:** 471 | 472 | * [`tf.layers.dense`](https://www.tensorflow.org/api_docs/python/tf/layers/dense) 473 | * [`tf.nn.dropout`](https://www.tensorflow.org/api_docs/python/tf/nn/dropout) 474 | * [`tf.sequence_mask`](https://www.tensorflow.org/api_docs/python/tf/sequence_mask) 475 | * [`tf.reduce_mean`](https://www.tensorflow.org/api_docs/python/tf/reduce_mean) 476 | 477 | To run the discriminator a single time, let's concatenate all encoder outputs (cf. *Figure 2*) and prepare the language identifiers accordingly. 478 | 479 |

480 | ![Adversarial training on the encoder outputs (Figure 2)](img/adversarial.png) 481 |
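Before wiring everything together, here is the `binary_cross_entropy` used by the reference implementation in `ref/training.py`; the discriminator itself (see `ref/training.py` for the full version) stacks `num_layers` dense layers with dropout and turns the masked per-timestep scores into a per-sequence probability.

```python
def binary_cross_entropy(x, y, smoothing=0, epsilon=1e-12):
  y = tf.to_float(y)
  if smoothing > 0:
    # Label smoothing: move the 0/1 targets slightly towards 0.5.
    smoothing *= 2
    y = y * (1 - smoothing) + 0.5 * smoothing
  # epsilon guards against log(0) when the discriminator saturates.
  return -tf.reduce_mean(tf.log(x + epsilon) * y + tf.log(1.0 - x + epsilon) * (1 - y))
```

The snippet below then concatenates the four encoder outputs, pads them to the same time dimension, and runs the discriminator once on the whole batch: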
482 | 483 | ```python 484 | from opennmt.layers.reducer import pad_in_time 485 | 486 | batch_size = tf.shape(src["length"])[0] 487 | all_encoder_outputs = [ 488 | src_encoder_auto, src_encoder_cross, 489 | tgt_encoder_auto, tgt_encoder_cross] 490 | lang_ids = tf.concat([ 491 | tf.fill([batch_size * 2], 0), 492 | tf.fill([batch_size * 2], 1)], 0) 493 | 494 | max_time = tf.reduce_max([tf.shape(output[0])[1] for output in all_encoder_outputs]) 495 | 496 | encodings = tf.concat([ 497 | pad_in_time(output[0], max_time - tf.shape(output[0])[1]) 498 | for output in all_encoder_outputs], 0) 499 | sequence_lengths = tf.concat([output[2] for output in all_encoder_outputs], 0) 500 | 501 | with tf.variable_scope("discriminator"): 502 | l_d, l_adv = discriminator(encodings, sequence_lengths, lang_ids) 503 | ``` 504 | 505 | #### Step 7: Optimization and training loop 506 | 507 | Finally, you can compute the final objective function as described at the end of *Section 2*: 508 | 509 | ```python 510 | lambda_auto = 1 511 | lambda_cd = 1 512 | lambda_adv = 1 513 | 514 | l_auto = l_auto_src + l_auto_tgt 515 | l_cd = l_cd_src + l_cd_tgt 516 | 517 | l_final = (lambda_auto * l_auto + lambda_cd * l_cd + lambda_adv * l_adv) 518 | ``` 519 | 520 | As described in *Section 4.4*, the training alternates "between one encoder-decoder and one discriminator update" and uses 2 different optimizers. You should implement this behavior in the `train_op` function: 521 | 522 | ```python 523 | def build_train_op(global_step, encdec_variables, discri_variables): 524 | """Returns the training Op. 525 | 526 | When global_step % 2 == 0, it minimizes l_final and updates encdec_variables. 527 | Otherwise, it minimizes l_d and updates discri_variables. 528 | 529 | Args: 530 | global_step: The training step. 531 | encdec_variables: The list of variables of the encoder/decoder model. 532 | discri_variables: The list of variables of the discriminator. 533 | 534 | Returns: 535 | The training op. 
536 | """ 537 | # FIXME 538 | raise NotImplementedError() 539 | ``` 540 | 541 | **Related resources:** 542 | 543 | * [`tf.train.AdamOptimizer`](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) 544 | * [`tf.train.RMSPropOptimizer`](https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer) 545 | * [`tf.cond`](https://www.tensorflow.org/api_docs/python/tf/cond) 546 | 547 | And now, we can conclude the training script with the training loop: 548 | 549 | ```python 550 | 551 | encdec_variables = [] 552 | discri_variables = [] 553 | for variable in tf.trainable_variables(): 554 | if variable.name.startswith("discriminator"): 555 | discri_variables.append(variable) 556 | else: 557 | encdec_variables.append(variable) 558 | 559 | global_step = tf.train.get_or_create_global_step() 560 | train_op = build_train_op(global_step, encdec_variables, discri_variables) 561 | 562 | i = 0 563 | with tf.train.MonitoredTrainingSession(checkpoint_dir=args.model_dir) as sess: 564 | sess.run([src_iterator.initializer, tgt_iterator.initializer]) 565 | while not sess.should_stop(): 566 | if i % 2 == 0: 567 | _, step, _l_auto, _l_cd, _l_adv, _l = sess.run( 568 | [train_op, global_step, l_auto, l_cd, l_adv, l_final]) 569 | print("{} - l_auto = {}; l_cd = {}, l_adv = {}; l = {}".format( 570 | step, _l_auto, _l_cd, _l_adv, _l)) 571 | else: 572 | _, step, _l_d = sess.run([train_op, global_step, l_d]) 573 | print("{} - l_d = {}".format(step, _l_d)) 574 | i += 1 575 | sys.stdout.flush() 576 | ``` 577 | 578 | ### Inference 579 | 580 | Inference is not only required for testing the model performance but is also used as part of the training: after one training iteration, the complete monolingual corpus must be translated and used as input to the next training iteration. 581 | 582 | This part is simpler and only requires to build the encoder-decoder model with the same dimensions and variable scoping. 583 | 584 | #### Step 0: Base file 585 | 586 | You can start with this header: 587 | 588 | ```python 589 | import argparse 590 | 591 | import tensorflow as tf 592 | import opennmt as onmt 593 | 594 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 595 | parser.add_argument("--model_dir", default="model", help="Checkpoint directory.") 596 | 597 | args = parser.parse_args() 598 | ``` 599 | 600 | #### Step 1: Reading data 601 | 602 | Let's define the script interface by defining additional command line arguments: 603 | 604 | ```python 605 | parser.add_argument("--src", required=True, help="Source file.") 606 | parser.add_argument("--tgt", required=True, help="Target file.") 607 | parser.add_argument("--src_vocab", required=True, help="Source vocabulary.") 608 | parser.add_argument("--tgt_vocab", required=True, help="Target vocabulary.") 609 | parser.add_argument("--direction", required=True, type=int, 610 | help="1 = translation source, 2 = translate target.") 611 | ``` 612 | 613 | Here, we choose to set both the source and target file and add a `direction` flag to select from which file to translate. 614 | 615 | Based on the input pipeline implemented in the training phase, this time we propose to build the dataset iterator. This should be a textbook usage of the [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) API. 616 | 617 | ```python 618 | def load_data(input_file, input_vocab): 619 | """Returns an iterator over the input file. 620 | 621 | Args: 622 | input_file: The input text file. 623 | input_vocab: The input vocabulary. 
624 | 625 | Returns: 626 | A dataset iterator. 627 | """ 628 | # FIXME 629 | raise NotImplementedError() 630 | ``` 631 | 632 | Batching should be simpler than during the training, see [`tf.data.Dataset.padded_batch`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch). 633 | 634 | Then, the iterator can be used: 635 | 636 | ```python 637 | from opennmt.utils.misc import count_lines 638 | 639 | if args.direction == 1: 640 | src_file, tgt_file = args.src, args.tgt 641 | src_vocab_file, tgt_vocab_file = args.src_vocab, args.tgt_vocab 642 | else: 643 | src_file, tgt_file = args.tgt, args.src 644 | src_vocab_file, tgt_vocab_file = args.tgt_vocab, args.src_vocab 645 | 646 | tgt_vocab_size = count_lines(tgt_vocab_file) + 1 647 | src_vocab_size = count_lines(src_vocab_file) + 1 648 | src_vocab = tf.contrib.lookup.index_table_from_file( 649 | src_vocab_file, 650 | vocab_size=src_vocab_size - 1, 651 | num_oov_buckets=1) 652 | 653 | with tf.device("cpu:0"): 654 | src_iterator = load_data(src_file, src_vocab) 655 | 656 | src = src_iterator.get_next() 657 | ``` 658 | 659 | #### Step 2: Rebuilding the model 660 | 661 | In this step, we need to define the same model that was used during the training, including variable scoping: 662 | 663 | ```python 664 | hidden_size = 512 665 | encoder = onmt.encoders.BidirectionalRNNEncoder(2, hidden_size) 666 | decoder = onmt.decoders.AttentionalRNNDecoder( 667 | 2, hidden_size, bridge=onmt.layers.CopyBridge()) 668 | 669 | with tf.variable_scope("src" if args.direction == 1 else "tgt"): 670 | src_emb = tf.get_variable("embedding", shape=[src_vocab_size, 300]) 671 | src_gen = tf.layers.Dense(src_vocab_size) 672 | src_gen.build([None, hidden_size]) 673 | 674 | with tf.variable_scope("tgt" if args.direction == 1 else "src"): 675 | tgt_emb = tf.get_variable("embedding", shape=[tgt_vocab_size, 300]) 676 | tgt_gen = tf.layers.Dense(tgt_vocab_size) 677 | tgt_gen.build([None, hidden_size]) 678 | ``` 679 | 680 | **Note:** Larger TensorFlow project usually do not handle inference this way. For example OpenNMT-tf shares training, inference, and evaluation code but reads from a [`mode`](https://www.tensorflow.org/api_docs/python/tf/estimator/ModeKeys) variable to implement behavior specific to each phase. The `mode` argument will be **required for encoding and decoding** to disable dropout in the next step. 681 | 682 | #### Step 3: Encoding and decoding 683 | 684 | Encoding and decoding is basically a method call on the encoder and decoder object (including beam search!). Make sure to use the same variable scope that you used during the training phase. 685 | 686 | ```python 687 | from opennmt import constants 688 | 689 | def encode(): 690 | """Encodes src. 691 | 692 | Returns: 693 | A tuple (encoder output, encoder state, sequence length). 694 | """ 695 | # FIXME 696 | raise NotImplementedError() 697 | 698 | def decode(encoder_output): 699 | """Dynamically decodes from the encoder output. 700 | 701 | Args: 702 | encoder_output: The output of encode(). 703 | 704 | Returns: 705 | A tuple with: the decoded word ids and the length of each decoded sequence. 
706 | """ 707 | # FIXME 708 | raise NotImplementedError() 709 | ``` 710 | 711 | **Related resources:** 712 | 713 | * [`onmt.encoders.Encoder.encode`](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.encoders.encoder.html#opennmt.encoders.encoder.Encoder.encode) 714 | * [`onmt.decoders.Decoder.dynamic_decode_and_search`](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.decoders.decoder.html#opennmt.decoders.decoder.Decoder.dynamic_decode_and_search) 715 | 716 | These functions can then be called like this to build the actual translation: 717 | 718 | ```python 719 | encoder_output = encode() 720 | sampled_ids, sampled_length = decode(encoder_output) 721 | 722 | tgt_vocab_rev = tf.contrib.lookup.index_to_string_table_from_file( 723 | tgt_vocab_file, 724 | vocab_size=tgt_vocab_size - 1, 725 | default_value=constants.UNKNOWN_TOKEN) 726 | 727 | tokens = tgt_vocab_rev.lookup(tf.cast(sampled_ids, tf.int64)) 728 | length = sampled_length 729 | ``` 730 | 731 | #### Step 4: Loading and translating 732 | 733 | Finally, the inference script can be concluded with the code that restores variables and run the translation: 734 | 735 | ```python 736 | from opennmt.utils.misc import print_bytes 737 | 738 | saver = tf.train.Saver() 739 | checkpoint_path = tf.train.latest_checkpoint(args.model_dir) 740 | 741 | def session_init_op(_scaffold, sess): 742 | saver.restore(sess, checkpoint_path) 743 | tf.logging.info("Restored model from %s", checkpoint_path) 744 | 745 | scaffold = tf.train.Scaffold(init_fn=session_init_op) 746 | session_creator = tf.train.ChiefSessionCreator(scaffold=scaffold) 747 | 748 | with tf.train.MonitoredSession(session_creator=session_creator) as sess: 749 | sess.run(src_iterator.initializer) 750 | while not sess.should_stop(): 751 | _tokens, _length = sess.run([tokens, length]) 752 | for b in range(_tokens.shape[0]): 753 | pred_toks = _tokens[b][0][:_length[b][0] - 1] 754 | pred_sent = b" ".join(pred_toks) 755 | print_bytes(pred_sent) 756 | ``` 757 | 758 | ### Complete training flow 759 | 760 | Using the training and inference scripts, you can now write the complete training algorithm described in *Section 3.1*. 761 | 762 | See for example the shell script `ref/train.sh` that can be run on the full data package: 763 | 764 | ```bash 765 | # Download data. 766 | mkdir data && cd data 767 | wget https://s3.amazonaws.com/opennmt-trainingdata/unsupervised-nmt-enfr.tar.bz2 768 | tar xf unsupervised-nmt-enfr.tar.bz2 769 | cd .. 770 | 771 | # Download multi-bleu.perl. 772 | wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl 773 | 774 | # Train algorithm. 775 | ./ref/train.sh 776 | ``` 777 | 778 | Here are some results (reporting BLEU score on tokenized *newstest2014* translation): 779 | 780 | | Iteration | ENFR | FREN | 781 | | --- | --- | --- | 782 | | M1 | 6.20 | 9.12 | 783 | | M2 | 13.02 | 10.73 | 784 | | M3 | 13.81 | 14.25 | 785 | 786 | where M1 is the unsupervised word-by-word translation model. 787 | 788 | --- 789 | 790 | *Congratulations for completing the tutorial! 
Wether you implemented the functions on your own or went through the provided implementation, we hope that you learned new things on TensorFlow, OpenNMT-tf and adversarial training applied to unsupervised MT.* 791 | -------------------------------------------------------------------------------- /unsupervised-nmt/img/adversarial.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OpenNMT/Hackathon/e975f378793e6a162be9de2c1d447d48e677989c/unsupervised-nmt/img/adversarial.png -------------------------------------------------------------------------------- /unsupervised-nmt/img/decoding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OpenNMT/Hackathon/e975f378793e6a162be9de2c1d447d48e677989c/unsupervised-nmt/img/decoding.png -------------------------------------------------------------------------------- /unsupervised-nmt/img/encoding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OpenNMT/Hackathon/e975f378793e6a162be9de2c1d447d48e677989c/unsupervised-nmt/img/encoding.png -------------------------------------------------------------------------------- /unsupervised-nmt/paper.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OpenNMT/Hackathon/e975f378793e6a162be9de2c1d447d48e677989c/unsupervised-nmt/paper.pdf -------------------------------------------------------------------------------- /unsupervised-nmt/ref/inference.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | import tensorflow as tf 4 | import opennmt as onmt 5 | 6 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 7 | parser.add_argument("--model_dir", default="model", 8 | help="Checkpoint directory.") 9 | 10 | # Step 1 11 | parser.add_argument("--src", required=True, help="Source file.") 12 | parser.add_argument("--tgt", required=True, help="Target file.") 13 | parser.add_argument("--src_vocab", required=True, help="Source vocabulary.") 14 | parser.add_argument("--tgt_vocab", required=True, help="Target vocabulary.") 15 | parser.add_argument("--direction", required=True, type=int, 16 | help="1 = translation source, 2 = translate target.") 17 | 18 | args = parser.parse_args() 19 | 20 | 21 | # Step 1 22 | 23 | def load_data(input_file, input_vocab): 24 | """Returns an iterator over the input file. 25 | 26 | Args: 27 | input_file: The input text file. 28 | input_vocab: The input vocabulary. 29 | 30 | Returns: 31 | A dataset batch iterator. 
32 | """ 33 | dataset = tf.data.TextLineDataset(input_file) 34 | dataset = dataset.map(lambda x: tf.string_split([x]).values) 35 | dataset = dataset.map(input_vocab.lookup) 36 | dataset = dataset.map(lambda x: { 37 | "ids": x, 38 | "length": tf.shape(x)[0]}) 39 | dataset = dataset.padded_batch(64, { 40 | "ids": [None], 41 | "length": []}) 42 | return dataset.make_initializable_iterator() 43 | 44 | if args.direction == 1: 45 | src_file, tgt_file = args.src, args.tgt 46 | src_vocab_file, tgt_vocab_file = args.src_vocab, args.tgt_vocab 47 | else: 48 | src_file, tgt_file = args.tgt, args.src 49 | src_vocab_file, tgt_vocab_file = args.tgt_vocab, args.src_vocab 50 | 51 | from opennmt.utils.misc import count_lines 52 | 53 | tgt_vocab_size = count_lines(tgt_vocab_file) + 1 54 | src_vocab_size = count_lines(src_vocab_file) + 1 55 | src_vocab = tf.contrib.lookup.index_table_from_file( 56 | src_vocab_file, 57 | vocab_size=src_vocab_size - 1, 58 | num_oov_buckets=1) 59 | 60 | with tf.device("cpu:0"): 61 | src_iterator = load_data(src_file, src_vocab) 62 | 63 | src = src_iterator.get_next() 64 | 65 | 66 | # Step 2 67 | 68 | 69 | hidden_size = 512 70 | encoder = onmt.encoders.BidirectionalRNNEncoder(2, hidden_size) 71 | decoder = onmt.decoders.AttentionalRNNDecoder( 72 | 2, hidden_size, bridge=onmt.layers.CopyBridge()) 73 | 74 | with tf.variable_scope("src" if args.direction == 1 else "tgt"): 75 | src_emb = tf.get_variable("embedding", shape=[src_vocab_size, 300]) 76 | src_gen = tf.layers.Dense(src_vocab_size) 77 | src_gen.build([None, hidden_size]) 78 | 79 | with tf.variable_scope("tgt" if args.direction == 1 else "src"): 80 | tgt_emb = tf.get_variable("embedding", shape=[tgt_vocab_size, 300]) 81 | tgt_gen = tf.layers.Dense(tgt_vocab_size) 82 | tgt_gen.build([None, hidden_size]) 83 | 84 | 85 | # Step 3 86 | 87 | 88 | from opennmt import constants 89 | 90 | def encode(): 91 | """Encodes src. 92 | 93 | Returns: 94 | A tuple (encoder output, encoder state, sequence length). 95 | """ 96 | with tf.variable_scope("encoder"): 97 | return encoder.encode( 98 | tf.nn.embedding_lookup(src_emb, src["ids"]), 99 | sequence_length=src["length"], 100 | mode=tf.estimator.ModeKeys.PREDICT) 101 | 102 | def decode(encoder_output): 103 | """Dynamically decodes from the encoder output. 104 | 105 | Args: 106 | encoder_output: The output of encode(). 107 | 108 | Returns: 109 | A tuple with: the decoded word ids and the length of each decoded sequence. 
110 | """ 111 | batch_size = tf.shape(src["length"])[0] 112 | start_tokens = tf.fill([batch_size], constants.START_OF_SENTENCE_ID) 113 | end_token = constants.END_OF_SENTENCE_ID 114 | 115 | with tf.variable_scope("decoder"): 116 | sampled_ids, _, sampled_length, _ = decoder.dynamic_decode_and_search( 117 | tgt_emb, 118 | start_tokens, 119 | end_token, 120 | vocab_size=tgt_vocab_size, 121 | initial_state=encoder_output[1], 122 | beam_width=5, 123 | maximum_iterations=200, 124 | output_layer=tgt_gen, 125 | mode=tf.estimator.ModeKeys.PREDICT, 126 | memory=encoder_output[0], 127 | memory_sequence_length=encoder_output[2]) 128 | return sampled_ids, sampled_length 129 | 130 | encoder_output = encode() 131 | sampled_ids, sampled_length = decode(encoder_output) 132 | 133 | tgt_vocab_rev = tf.contrib.lookup.index_to_string_table_from_file( 134 | tgt_vocab_file, 135 | vocab_size=tgt_vocab_size - 1, 136 | default_value=constants.UNKNOWN_TOKEN) 137 | 138 | tokens = tgt_vocab_rev.lookup(tf.cast(sampled_ids, tf.int64)) 139 | length = sampled_length 140 | 141 | 142 | # Step 4 143 | 144 | 145 | from opennmt.utils.misc import print_bytes 146 | 147 | saver = tf.train.Saver() 148 | checkpoint_path = tf.train.latest_checkpoint(args.model_dir) 149 | 150 | def session_init_op(_scaffold, sess): 151 | saver.restore(sess, checkpoint_path) 152 | tf.logging.info("Restored model from %s", checkpoint_path) 153 | 154 | scaffold = tf.train.Scaffold(init_fn=session_init_op) 155 | session_creator = tf.train.ChiefSessionCreator(scaffold=scaffold) 156 | 157 | with tf.train.MonitoredSession(session_creator=session_creator) as sess: 158 | sess.run(src_iterator.initializer) 159 | while not sess.should_stop(): 160 | _tokens, _length = sess.run([tokens, length]) 161 | for b in range(_tokens.shape[0]): 162 | pred_toks = _tokens[b][0][:_length[b][0] - 1] 163 | pred_sent = b" ".join(pred_toks) 164 | print_bytes(pred_sent) 165 | -------------------------------------------------------------------------------- /unsupervised-nmt/ref/train.sh: -------------------------------------------------------------------------------- 1 | #! /bin/sh 2 | 3 | model_dir=unsupervised-nmt-enfr 4 | data_dir=data/unsupervised-nmt-enfr 5 | 6 | src_vocab=${data_dir}/en-vocab.txt 7 | tgt_vocab=${data_dir}/fr-vocab.txt 8 | src_emb=${data_dir}/wmt14m.en300.vec 9 | tgt_emb=${data_dir}/wmt14m.fr300.vec 10 | 11 | src=${data_dir}/train.en 12 | tgt=${data_dir}/train.fr 13 | src_trans=${data_dir}/train.en.m1 14 | tgt_trans=${data_dir}/train.fr.m1 15 | 16 | src_test=${data_dir}/newstest2014.en.tok 17 | tgt_test=${data_dir}/newstest2014.fr.tok 18 | src_test_trans=${data_dir}/newstest2014.en.tok.m1 19 | tgt_test_trans=${data_dir}/newstest2014.fr.tok.m1 20 | 21 | timestamp=$(date +%s) 22 | score_file=scores-${timestamp}.txt 23 | 24 | > ${score_file} 25 | 26 | score_test() 27 | { 28 | echo ${src_test_trans} >> ${score_file} 29 | perl multi-bleu.perl ${tgt_test} < ${src_test_trans} >> ${score_file} 30 | echo ${tgt_test_trans} >> ${score_file} 31 | perl multi-bleu.perl ${src_test} < ${tgt_test_trans} >> ${score_file} 32 | } 33 | 34 | score_test 35 | 36 | for i in $(seq 2 5); do 37 | # Train for one epoch. 38 | python ref/training.py \ 39 | --model_dir ${model_dir} \ 40 | --src ${src} \ 41 | --tgt ${tgt} \ 42 | --src_trans ${src_trans} \ 43 | --tgt_trans ${tgt_trans} \ 44 | --src_vocab ${src_vocab} \ 45 | --tgt_vocab ${tgt_vocab} \ 46 | --src_emb ${src_emb} \ 47 | --tgt_emb ${tgt_emb} 48 | 49 | # Evaluate on test files. 
50 | src_test_trans=${src_test}.m${i} 51 | tgt_test_trans=${tgt_test}.m${i} 52 | 53 | python ref/inference.py \ 54 | --model_dir ${model_dir} \ 55 | --src ${src_test} \ 56 | --tgt ${tgt_test} \ 57 | --src_vocab ${src_vocab} \ 58 | --tgt_vocab ${tgt_vocab} \ 59 | --direction 1 \ 60 | > ${src_test_trans} 61 | python ref/inference.py \ 62 | --model_dir ${model_dir} \ 63 | --src ${src_test} \ 64 | --tgt ${tgt_test} \ 65 | --src_vocab ${src_vocab} \ 66 | --tgt_vocab ${tgt_vocab} \ 67 | --direction 2 \ 68 | > ${tgt_test_trans} 69 | 70 | score_test 71 | 72 | # Translate training data. 73 | src_trans=${src}.m${i} 74 | tgt_trans=${tgt}.m${i} 75 | 76 | python ref/inference.py \ 77 | --model_dir ${model_dir} \ 78 | --src ${src} \ 79 | --tgt ${tgt} \ 80 | --src_vocab ${src_vocab} \ 81 | --tgt_vocab ${tgt_vocab} \ 82 | --direction 1 \ 83 | > ${src_trans} 84 | python ref/inference.py \ 85 | --model_dir ${model_dir} \ 86 | --src ${src} \ 87 | --tgt ${tgt} \ 88 | --src_vocab ${src_vocab} \ 89 | --tgt_vocab ${tgt_vocab} \ 90 | --direction 2 \ 91 | > ${tgt_trans} 92 | done 93 | -------------------------------------------------------------------------------- /unsupervised-nmt/ref/training.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import argparse 4 | import sys 5 | 6 | import tensorflow as tf 7 | import opennmt as onmt 8 | import numpy as np 9 | 10 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 11 | parser.add_argument("--model_dir", default="model", 12 | help="Checkpoint directory.") 13 | 14 | ## Step 1 15 | parser.add_argument("--src", required=True, help="Source file.") 16 | parser.add_argument("--tgt", required=True, help="Target file.") 17 | parser.add_argument("--src_trans", required=True, help="Source translation at the previous iteration.") 18 | parser.add_argument("--tgt_trans", required=True, help="Target translation at the previous iteration.") 19 | parser.add_argument("--src_vocab", required=True, help="Source vocabulary.") 20 | parser.add_argument("--tgt_vocab", required=True, help="Target vocabulary.") 21 | 22 | ## Step 3 23 | parser.add_argument("--src_emb", default=None, help="Source embedding.") 24 | parser.add_argument("--tgt_emb", default=None, help="Target embedding.") 25 | 26 | args = parser.parse_args() 27 | 28 | 29 | # Step 1 30 | 31 | 32 | from opennmt import constants 33 | from opennmt.utils.misc import count_lines 34 | 35 | def load_vocab(vocab_file): 36 | """Returns a lookup table and the vocabulary size.""" 37 | vocab_size = count_lines(vocab_file) + 1 # Add UNK. 38 | vocab = tf.contrib.lookup.index_table_from_file( 39 | vocab_file, 40 | vocab_size=vocab_size - 1, 41 | num_oov_buckets=1) 42 | return vocab, vocab_size 43 | 44 | def load_data(input_file, 45 | translated_file, 46 | input_vocab, 47 | translated_vocab, 48 | batch_size=32, 49 | max_seq_len=50, 50 | num_buckets=5): 51 | """Returns an iterator over the training data.""" 52 | 53 | def _make_dataset(text_file, vocab): 54 | dataset = tf.data.TextLineDataset(text_file) 55 | dataset = dataset.map(lambda x: tf.string_split([x]).values) # Split on spaces. 56 | dataset = dataset.map(vocab.lookup) # Lookup token in vocabulary. 
57 | return dataset 58 | 59 | def _key_func(x): 60 | bucket_width = (max_seq_len + num_buckets - 1) // num_buckets 61 | bucket_id = x["length"] // bucket_width 62 | bucket_id = tf.minimum(bucket_id, num_buckets) 63 | return tf.to_int64(bucket_id) 64 | 65 | def _reduce_func(unused_key, dataset): 66 | return dataset.padded_batch(batch_size, { 67 | "ids": [None], 68 | "ids_in": [None], 69 | "ids_out": [None], 70 | "length": [], 71 | "trans_ids": [None], 72 | "trans_length": []}) 73 | 74 | bos = tf.constant([constants.START_OF_SENTENCE_ID], dtype=tf.int64) 75 | eos = tf.constant([constants.END_OF_SENTENCE_ID], dtype=tf.int64) 76 | 77 | # Make a dataset from the input and translated file. 78 | input_dataset = _make_dataset(input_file, input_vocab) 79 | translated_dataset = _make_dataset(translated_file, translated_vocab) 80 | dataset = tf.data.Dataset.zip((input_dataset, translated_dataset)) 81 | dataset = dataset.shuffle(200000) 82 | 83 | # Define the input format. 84 | dataset = dataset.map(lambda x, y: { 85 | "ids": x, 86 | "ids_in": tf.concat([bos, x], axis=0), 87 | "ids_out": tf.concat([x, eos], axis=0), 88 | "length": tf.shape(x)[0], 89 | "trans_ids": y, 90 | "trans_length": tf.shape(y)[0]}) 91 | 92 | # Filter out invalid examples. 93 | dataset = dataset.filter(lambda x: tf.greater(x["length"], 0)) 94 | 95 | # Batch the dataset using a bucketing strategy. 96 | dataset = dataset.apply(tf.contrib.data.group_by_window( 97 | _key_func, 98 | _reduce_func, 99 | window_size=batch_size)) 100 | return dataset.make_initializable_iterator() 101 | 102 | src_vocab, src_vocab_size = load_vocab(args.src_vocab) 103 | tgt_vocab, tgt_vocab_size = load_vocab(args.tgt_vocab) 104 | 105 | with tf.device("/cpu:0"): # Input pipeline should always be place on the CPU. 106 | src_iterator = load_data(args.src, args.src_trans, src_vocab, tgt_vocab) 107 | tgt_iterator = load_data(args.tgt, args.tgt_trans, tgt_vocab, src_vocab) 108 | src = src_iterator.get_next() 109 | tgt = tgt_iterator.get_next() 110 | 111 | 112 | # Step 2 113 | 114 | 115 | def add_noise_python(words, dropout=0.1, k=3): 116 | """Applies the noise model in input words. 117 | 118 | Args: 119 | words: A numpy vector of word ids. 120 | dropout: The probability to drop words. 121 | k: Maximum distance of the permutation. 122 | 123 | Returns: 124 | A noisy numpy vector of word ids. 
125 | """ 126 | 127 | def _drop_words(words, probability): 128 | """Drops words with the given probability.""" 129 | length = len(words) 130 | keep_prob = np.random.uniform(size=length) 131 | keep = np.random.uniform(size=length) > probability 132 | if np.count_nonzero(keep) == 0: 133 | ind = np.random.randint(0, length) 134 | keep[ind] = True 135 | words = np.take(words, keep.nonzero())[0] 136 | return words 137 | 138 | def _rand_perm_with_constraint(words, k): 139 | """Randomly permutes words ensuring that words are no more than k positions 140 | away from their original position.""" 141 | length = len(words) 142 | offset = np.random.uniform(size=length) * (k + 1) 143 | new_pos = np.arange(length) + offset 144 | return np.take(words, np.argsort(new_pos)) 145 | 146 | words = _drop_words(words, dropout) 147 | words = _rand_perm_with_constraint(words, k) 148 | return words 149 | 150 | def add_noise(ids, sequence_length): 151 | """Wraps add_noise_python for a batch of tensors.""" 152 | 153 | def _add_noise_single(ids, sequence_length): 154 | noisy_ids = add_noise_python(ids[:sequence_length]) 155 | noisy_sequence_length = len(noisy_ids) 156 | ids[:noisy_sequence_length] = noisy_ids 157 | ids[noisy_sequence_length:] = 0 158 | return ids, np.int32(noisy_sequence_length) 159 | 160 | noisy_ids, noisy_sequence_length = tf.map_fn( 161 | lambda x: tf.py_func(_add_noise_single, x, [ids.dtype, tf.int32]), 162 | [ids, sequence_length], 163 | dtype=[ids.dtype, tf.int32], 164 | back_prop=False) 165 | 166 | noisy_ids.set_shape(ids.get_shape()) 167 | noisy_sequence_length.set_shape(sequence_length.get_shape()) 168 | 169 | return noisy_ids, noisy_sequence_length 170 | 171 | 172 | # Step 3 173 | 174 | 175 | from opennmt.inputters.text_inputter import load_pretrained_embeddings 176 | 177 | def create_embeddings(vocab_size, depth=300): 178 | """Creates an embedding variable.""" 179 | return tf.get_variable("embedding", shape=[vocab_size, depth]) 180 | 181 | def load_embeddings(embedding_file, vocab_file): 182 | """Loads an embedding variable or embeddings file.""" 183 | try: 184 | embeddings = tf.get_variable("embedding") 185 | except ValueError: 186 | pretrained = load_pretrained_embeddings( 187 | embedding_file, 188 | vocab_file, 189 | num_oov_buckets=1, 190 | with_header=True, 191 | case_insensitive_embeddings=True) 192 | embeddings = tf.get_variable( 193 | "embedding", 194 | shape=None, 195 | trainable=False, 196 | initializer=tf.constant(pretrained.astype(np.float32))) 197 | return embeddings 198 | 199 | with tf.variable_scope("src"): 200 | if args.src_emb is not None: 201 | src_emb = load_embeddings(args.src_emb, args.src_vocab) 202 | else: 203 | src_emb = create_embeddings(src_vocab_size) 204 | 205 | with tf.variable_scope("tgt"): 206 | if args.tgt_emb is not None: 207 | tgt_emb = load_embeddings(args.tgt_emb, args.tgt_vocab) 208 | else: 209 | tgt_emb = create_embeddings(tgt_vocab_size) 210 | 211 | 212 | # Step 4 213 | 214 | 215 | hidden_size = 512 216 | encoder = onmt.encoders.BidirectionalRNNEncoder(2, hidden_size) 217 | 218 | def add_noise_and_encode(ids, sequence_length, embedding, reuse=None): 219 | """Applies the noise model on ids, embeds and encodes. 220 | 221 | Args: 222 | ids: The tensor of words ids of shape [batch_size, max_time]. 223 | sequence_length: The tensor of sequence length of shape [batch_size]. 224 | embedding: The embedding variable. 225 | reuse: If True, reuse the encoder variables. 226 | 227 | Returns: 228 | A tuple (encoder output, encoder state, sequence length). 
229 | """ 230 | noisy_ids, noisy_sequence_length = add_noise(ids, sequence_length) 231 | noisy = tf.nn.embedding_lookup(embedding, noisy_ids) 232 | with tf.variable_scope("encoder", reuse=reuse): 233 | return encoder.encode(noisy, sequence_length=noisy_sequence_length) 234 | 235 | src_encoder_auto = add_noise_and_encode( 236 | src["ids"], src["length"], src_emb, reuse=None) 237 | tgt_encoder_auto = add_noise_and_encode( 238 | tgt["ids"], tgt["length"], tgt_emb, reuse=True) 239 | 240 | src_encoder_cross = add_noise_and_encode( 241 | tgt["trans_ids"], tgt["trans_length"], src_emb, reuse=True) 242 | tgt_encoder_cross = add_noise_and_encode( 243 | src["trans_ids"], src["trans_length"], tgt_emb, reuse=True) 244 | 245 | 246 | # Step 5 247 | 248 | 249 | decoder = onmt.decoders.AttentionalRNNDecoder( 250 | 2, hidden_size, bridge=onmt.layers.CopyBridge()) 251 | 252 | from opennmt.utils.losses import cross_entropy_sequence_loss 253 | 254 | def denoise(x, embedding, encoder_outputs, generator, reuse=None): 255 | """Denoises from the noisy encoding. 256 | 257 | Args: 258 | x: The input data from the dataset. 259 | embedding: The embedding variable. 260 | encoder_outputs: A tuple with the encoder outputs. 261 | generator: A tf.layers.Dense instance for projecting the logits. 262 | reuse: If True, reuse the decoder variables. 263 | 264 | Returns: 265 | The decoder loss. 266 | """ 267 | with tf.variable_scope("decoder", reuse=reuse): 268 | logits, _, _ = decoder.decode( 269 | tf.nn.embedding_lookup(embedding, x["ids_in"]), 270 | x["length"] + 1, 271 | initial_state=encoder_outputs[1], 272 | output_layer=generator, 273 | memory=encoder_outputs[0], 274 | memory_sequence_length=encoder_outputs[2]) 275 | cumulated_loss, _, normalizer = cross_entropy_sequence_loss( 276 | logits, x["ids_out"], x["length"] + 1) 277 | return cumulated_loss / normalizer 278 | 279 | with tf.variable_scope("src"): 280 | src_gen = tf.layers.Dense(src_vocab_size) 281 | src_gen.build([None, hidden_size]) 282 | 283 | with tf.variable_scope("tgt"): 284 | tgt_gen = tf.layers.Dense(tgt_vocab_size) 285 | tgt_gen.build([None, hidden_size]) 286 | 287 | l_auto_src = denoise(src, src_emb, src_encoder_auto, src_gen, reuse=None) 288 | l_auto_tgt = denoise(tgt, tgt_emb, tgt_encoder_auto, tgt_gen, reuse=True) 289 | 290 | l_cd_src = denoise(src, src_emb, tgt_encoder_cross, src_gen, reuse=True) 291 | l_cd_tgt = denoise(tgt, tgt_emb, src_encoder_cross, tgt_gen, reuse=True) 292 | 293 | 294 | # Step 6 295 | 296 | 297 | def binary_cross_entropy(x, y, smoothing=0, epsilon=1e-12): 298 | """Computes the averaged binary cross entropy. 299 | 300 | bce = y*log(x) + (1-y)*log(1-x) 301 | 302 | Args: 303 | x: The predicted labels. 304 | y: The true labels. 305 | smoothing: The label smoothing coefficient. 306 | 307 | Returns: 308 | The cross entropy. 309 | """ 310 | y = tf.to_float(y) 311 | if smoothing > 0: 312 | smoothing *= 2 313 | y = y * (1 - smoothing) + 0.5 * smoothing 314 | return -tf.reduce_mean(tf.log(x + epsilon) * y + tf.log(1.0 - x + epsilon) * (1 - y)) 315 | 316 | def discriminator(encodings, 317 | sequence_lengths, 318 | lang_ids, 319 | num_layers=3, 320 | hidden_size=1024, 321 | dropout=0.3): 322 | """Discriminates the encoder outputs against lang_ids. 323 | 324 | Args: 325 | encodings: The encoder outputs of shape [batch_size, max_time, hidden_size]. 326 | sequence_lengths: The length of each sequence of shape [batch_size]. 327 | lang_ids: The true lang id of each sequence of shape [batch_size]. 
328 | num_layers: The number of layers of the discriminator. 329 | hidden_size: The hidden size of the discriminator. 330 | dropout: The dropout to apply on each discriminator layer output. 331 | 332 | Returns: 333 | A tuple with: the discriminator loss (L_d) and the adversarial loss (L_adv). 334 | """ 335 | x = encodings 336 | for _ in range(num_layers): 337 | x = tf.nn.dropout(x, 1.0 - dropout) 338 | x = tf.layers.dense(x, hidden_size, activation=tf.nn.leaky_relu) 339 | x = tf.nn.dropout(x, 1.0 - dropout) 340 | y = tf.layers.dense(x, 1) 341 | 342 | mask = tf.sequence_mask( 343 | sequence_lengths, maxlen=tf.shape(encodings)[1], dtype=tf.float32) 344 | mask = tf.expand_dims(mask, -1) 345 | 346 | y = tf.log_sigmoid(y) * mask 347 | y = tf.reduce_sum(y, axis=1) 348 | y = tf.exp(y) 349 | 350 | l_d = binary_cross_entropy(y, lang_ids, smoothing=0.1) 351 | l_adv = binary_cross_entropy(y, 1 - lang_ids) 352 | 353 | return l_d, l_adv 354 | 355 | from opennmt.layers.reducer import pad_in_time 356 | 357 | batch_size = tf.shape(src["length"])[0] 358 | all_encoder_outputs = [ 359 | src_encoder_auto, src_encoder_cross, 360 | tgt_encoder_auto, tgt_encoder_cross] 361 | lang_ids = tf.concat([ 362 | tf.fill([batch_size * 2], 0), 363 | tf.fill([batch_size * 2], 1)], 0) 364 | 365 | max_time = tf.reduce_max([tf.shape(output[0])[1] for output in all_encoder_outputs]) 366 | 367 | encodings = tf.concat([ 368 | pad_in_time(output[0], max_time - tf.shape(output[0])[1]) 369 | for output in all_encoder_outputs], 0) 370 | sequence_lengths = tf.concat([output[2] for output in all_encoder_outputs], 0) 371 | 372 | with tf.variable_scope("discriminator"): 373 | l_d, l_adv = discriminator(encodings, sequence_lengths, lang_ids) 374 | 375 | 376 | # Step 7 377 | 378 | 379 | lambda_auto = 1 380 | lambda_cd = 1 381 | lambda_adv = 1 382 | 383 | l_auto = l_auto_src + l_auto_tgt 384 | l_cd = l_cd_src + l_cd_tgt 385 | 386 | l_final = (lambda_auto * l_auto + lambda_cd * l_cd + lambda_adv * l_adv) 387 | 388 | def build_train_op(global_step, encdec_variables, discri_variables): 389 | """Returns the training Op. 390 | 391 | When global_step % 2 == 0, it minimizes l_final and updates encdec_variables. 392 | Otherwise, it minimizes l_d and updates discri_variables. 393 | 394 | Args: 395 | global_step: The training step. 396 | encdec_variables: The list of variables of the encoder/decoder model. 397 | discri_variables: The list of variables of the discriminator. 398 | 399 | Returns: 400 | The training op. 
401 | """ 402 | encdec_opt = tf.train.AdamOptimizer(learning_rate=0.0003, beta1=0.5) 403 | discri_opt = tf.train.RMSPropOptimizer(0.0005) 404 | encdec_gradients = encdec_opt.compute_gradients(l_final, var_list=encdec_variables) 405 | discri_gradients = discri_opt.compute_gradients(l_d, var_list=discri_variables) 406 | return tf.cond( 407 | tf.equal(tf.mod(global_step, 2), 0), 408 | true_fn=lambda: encdec_opt.apply_gradients(encdec_gradients, global_step=global_step), 409 | false_fn=lambda: discri_opt.apply_gradients(discri_gradients, global_step=global_step)) 410 | 411 | encdec_variables = [] 412 | discri_variables = [] 413 | for variable in tf.trainable_variables(): 414 | if variable.name.startswith("discriminator"): 415 | discri_variables.append(variable) 416 | else: 417 | encdec_variables.append(variable) 418 | 419 | global_step = tf.train.get_or_create_global_step() 420 | train_op = build_train_op(global_step, encdec_variables, discri_variables) 421 | 422 | i = 0 423 | with tf.train.MonitoredTrainingSession(checkpoint_dir=args.model_dir) as sess: 424 | sess.run([src_iterator.initializer, tgt_iterator.initializer]) 425 | while not sess.should_stop(): 426 | if i % 2 == 0: 427 | _, step, _l_auto, _l_cd, _l_adv, _l = sess.run( 428 | [train_op, global_step, l_auto, l_cd, l_adv, l_final]) 429 | print("{} - l_auto = {}; l_cd = {}, l_adv = {}; l = {}".format( 430 | step, _l_auto, _l_cd, _l_adv, _l)) 431 | else: 432 | _, step, _l_d = sess.run([train_op, global_step, l_d]) 433 | print("{} - l_d = {}".format(step, _l_d)) 434 | i += 1 435 | sys.stdout.flush() 436 | -------------------------------------------------------------------------------- /unsupervised-nmt/requirements.txt.cpu: -------------------------------------------------------------------------------- 1 | OpenNMT-tf[tensorflow]==1.1.0 2 | -------------------------------------------------------------------------------- /unsupervised-nmt/requirements.txt.gpu: -------------------------------------------------------------------------------- 1 | OpenNMT-tf[tensorflow_gpu]==1.1.0 2 | --------------------------------------------------------------------------------