├── .gitmodules ├── nmt-wizard ├── README.md └── data │ ├── ru_en │ └── train │ │ ├── helloworld.ruen.train.en │ │ └── helloworld.ruen.train.ru │ ├── test │ ├── helloworld.ruen.test.en │ └── helloworld.ruen.test.ru │ └── vocab │ ├── de.dict │ ├── en.dict │ ├── helloworld.ruen.src.dict │ └── helloworld.ruen.tgt.dict └── unsupervised-nmt ├── LICENSE ├── README.md ├── img ├── adversarial.png ├── decoding.png └── encoding.png ├── paper.pdf ├── ref ├── inference.py ├── train.sh └── training.py ├── requirements.txt.cpu └── requirements.txt.gpu /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "nmt-wizard/nmt-wizard"] 2 | path = nmt-wizard/nmt-wizard 3 | url = https://github.com/OpenNMT/nmt-wizard.git 4 | -------------------------------------------------------------------------------- /nmt-wizard/README.md: -------------------------------------------------------------------------------- 1 | *Owner: Jean Senellart (jean.senellart (at) opennmt.net)* 2 | 3 | # NMT-Wizard Hello World 4 | 5 | ## Introduction 6 | 7 | The goal of this tutorial is to configure an nmt-wizard server, launch a task that trains a simple transliteration model from Russian to English on CPU, and test the generated model. 8 | 9 | Reference: [https://github.com/OpenNMT/nmt-wizard](https://github.com/OpenNMT/nmt-wizard) 10 | 11 | ## Server Configuration 12 | 13 | - minimal environment required: `python`, `pip`, `build-essential`, `make` 14 | - please use Python 2.7 15 | 16 | ``` 17 | $ sudo apt-get update 18 | $ sudo apt-get -y install python python-pip 19 | ``` 20 | - Set the environment variable `TUTORIAL` to the path of the working directory for this tutorial, and change to that directory. 21 | 22 | ``` 23 | $ mkdir tutorial-onmt-wizard-1 24 | $ export TUTORIAL=${PWD}/tutorial-onmt-wizard-1 25 | $ cd ${TUTORIAL} 26 | ``` 27 | 28 | - Installation of Redis 29 | 30 | ``` 31 | $ sudo apt-get -y install redis-server 32 | ``` 33 | or 34 | 35 | ``` 36 | $ curl http://download.redis.io/releases/redis-4.0.8.tar.gz > redis-4.0.8.tar.gz 37 | $ tar xzf redis-4.0.8.tar.gz 38 | $ cd redis-4.0.8 39 | $ cd deps 40 | $ make hiredis jemalloc linenoise lua geohash-int 41 | $ cd .. 42 | $ make 43 | ``` 44 | 45 | Launch a server (change to the `src` directory if you installed by compiling): 46 | 47 | ``` 48 | $ redis-server 49 | ``` 50 | 51 | And configure keyspace event handling in a new terminal: 52 | ``` 53 | $ redis-cli config set notify-keyspace-events Klgx 54 | ``` 55 | 56 | The Redis database contains the following fields: 57 | 58 | | Field | Type | Description | 59 | | --- | --- | --- | 60 | | `active` | list | Active tasks | 61 | | `beat:` | int | Specific ttl-key for a given task | 62 | | `lock:` | value | Temporary lock on a resource or task | 63 | | `queued:` | list | Tasks waiting for a resource | 64 | | `resource::` | list | Tasks using this resource | 65 | | `task:` | dict |
  • status: [queued, allocated, running, terminating, stopped]
  • job: JSON of the job id (if status >= waiting)
  • service: the name of the service
  • resource: the name of the resource - or `auto` before one is allocated
  • message: error message (if any), ‘completed’ if successfully finished
  • container_id: container in which the task runs, sent back by the Docker notifier
  • (queued|allocated|running|updated|stopped)_time: time for each event
| 66 | | `files:` | dict | files associated with a task, "log" is generated when training is complete | 67 | | `queue:` | str | expirable timestamp on the task - is used to regularly check status | 68 | | `work` | list | Tasks to process | 69 | 70 | - virtualenv installation 71 | 72 | ``` 73 | $ pip install virtualenv 74 | $ virtualenv ${TUTORIAL} 75 | ``` 76 | 77 | - get the GitHub project 78 | 79 | ``` 80 | $ cd ${TUTORIAL} 81 | $ git clone https://github.com/OpenNMT/nmt-wizard.git 82 | ``` 83 | 84 | - install Docker 85 | 86 | ``` 87 | $ sudo apt-get install \ 88 | apt-transport-https \ 89 | ca-certificates \ 90 | curl \ 91 | software-properties-common 92 | $ sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add - 93 | $ sudo add-apt-repository \ 94 | "deb [arch=amd64] https://download.docker.com/linux/ubuntu \ 95 | $(lsb_release -cs) \ 96 | stable" 97 | $ sudo apt-get update 98 | $ sudo apt-get install docker-ce 99 | $ sudo usermod -aG docker {{YOURUSERNAME}} 100 | ``` 101 | or 102 | for other systems, please see the [installation instructions here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). 103 | 104 | Close the session and open a new one so that the `docker` group membership takes effect: 105 | ``` 106 | $ export TUTORIAL=${PWD}/tutorial-onmt-wizard-1 107 | $ cd ${TUTORIAL} 108 | $ docker run hello-world 109 | ``` 110 | 111 | - installation of Python dependencies 112 | 113 | ``` 114 | $ cd nmt-wizard 115 | $ sudo pip install -r requirements.txt 116 | ``` 117 | 118 | - create a public/private key pair, and add the public key to `.ssh/authorized_keys` in order to enable connecting from your server to itself without authentication (also useful for remote servers). 119 | 120 | ``` 121 | $ ssh-keygen 122 | ``` 123 | command line for a local server: 124 | ``` 125 | $ cat ${HOME}/.ssh/id_rsa.pub >> ${HOME}/.ssh/authorized_keys 126 | ``` 127 | 128 | ## Data preparation 129 | 130 | The data directory contains aligned, space-tokenized Russian-English names, split into a training file and a test file. 131 | 132 | You need to get it from the `Hackathon/nmt-wizard` directory as follows: 133 | 134 | ``` 135 | git clone https://github.com/OpenNMT/Hackathon.git 136 | ``` 137 | 138 | and copy the `nmt-wizard/data` directory into `{{TUTORIAL}}/`. 139 | 140 | 141 | ## Wizard Configuration 142 | 143 | ### Service configuration 144 | The REST server and worker are configured by `nmt-wizard/server/settings.ini`. The LAUNCHER_MODE environment variable (defaulting to Production) can be set to select different sets of options in development or production. 145 | ``` 146 | [DEFAULT] 147 | # config_dir with service configuration 148 | config_dir = ./config 149 | # logging level 150 | log_level = INFO 151 | # refresh rate 152 | refresh = 60 153 | 154 | [Production] 155 | redis_host = localhost 156 | redis_port = 6379 157 | redis_db = 0 158 | #redis_password=xxx 159 | ``` 160 | Here we use the default host and port of the Redis server. 161 | You can choose a different logging level with `log_level`: `INFO`, `WARN`, `DEBUG`, `FATAL`, `ERROR`; for example, `DEBUG` gives the most complete log for debugging purposes. 162 | 163 | ### Local SSH server 164 | 165 | For this tutorial, we will define the local computer as the service, using the `services.ssh` connector. Other connectors are for instance `service.ec2` or `service.torque`. 166 | 167 | Get your IP with `ifconfig` - referred to as `{{YOURIP}}` below. 168 |
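Since the `services.ssh` connector will log into this machine over SSH, it is worth checking first that the passwordless loopback connection set up above actually works. A minimal sanity check (assuming the key pair created earlier and your own user name):

```
$ ssh -i ${HOME}/.ssh/id_rsa {{YOURUSERNAME}}@localhost 'echo connection ok'
```

If this prompts for a password, re-check the `authorized_keys` step above before continuing.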
169 | Copy the following JSON into `nmt-wizard/server/config/default.json`: 170 | 171 | ``` 172 | { 173 | "docker": { 174 | "registries": { 175 | "dockerhub": { 176 | "type": "dockerhub", 177 | "uri": "" 178 | } 179 | } 180 | }, 181 | "storages" : { 182 | "launcher": { 183 | "type": "http", 184 | "get_pattern": "/file//%s", 185 | "post_pattern": "/file//%s" 186 | } 187 | }, 188 | "callback_url": "http://{{YOURIP}}:5000", 189 | "callback_interval": 60 190 | } 191 | ``` 192 | Make sure to replace `{{YOURIP}}` with the actual IP. 193 | * The first part defines the registry named `dockerhub` as the official public Docker Hub registry. You could also define private Docker Hub registries, or use AWS Elastic Container Service (ECS) registries. 194 | * The second part defines the storage named `launcher` - a simple HTTP storage server implemented within the launcher. You will see below how to define other types of storage. 195 | 196 | ``` 197 | $ mkdir ${TUTORIAL}/inftraining_logs 198 | ``` 199 | Copy the following JSON into `nmt-wizard/server/config/myserver.json`. 200 | ``` 201 | { 202 | "name": "myserver", 203 | "description": "My computing server", 204 | "module": "services.ssh", 205 | "variables": { 206 | "server_pool": [ 207 | { 208 | "host": "localhost", 209 | "gpus": [0], 210 | "login": "{{YOURUSERNAME}}", 211 | "log_dir": "${TUTORIAL}/inftraining_logs" 212 | } 213 | ] 214 | }, 215 | "privateKey": "${HOME}/.ssh/id_rsa", 216 | "docker": { 217 | "mount": [ 218 | "${TUTORIAL}/data/:/root/corpus/", 219 | "${TUTORIAL}/tmp:/root/tmp" 220 | ] 221 | } 222 | } 223 | ``` 224 | Make sure to replace `${TUTORIAL}` with the absolute path and `{{YOURUSERNAME}}` with your user name. 225 | 226 | This is a simple configuration of your server. 227 | * `"gpus"` is set to `[0]` (off) since we're not using a GPU in this tutorial 228 | * the log files will be saved under `${TUTORIAL}/inftraining_logs`; make sure this directory exists 229 | * your SSH private key `${HOME}/.ssh/id_rsa` will be used for connecting to the server on which your public key `${HOME}/.ssh/id_rsa.pub` is authorized 230 | * `${TUTORIAL}/data/` is your training corpus directory on the local / remote server (mounted as `/root/corpus/` in the container) 231 | * `${TUTORIAL}/models` is the directory for saving the models of the `train` task 232 | * your custom files will be copied under `${TUTORIAL}/tmp` 233 | 234 | ## Launch the REST server 235 | 236 | For a production system, see the [Flask documentation](http://flask.pocoo.org/docs/0.12/deploying/) on how to deploy it properly. 237 | 238 | In a terminal: 239 | ``` 240 | cd nmt-wizard/server 241 | FLASK_APP=main.py flask run --host=0.0.0.0 242 | ``` 243 | 244 | 245 | ## Launch the worker 246 | 247 | In a new terminal: 248 | ``` 249 | $ export TUTORIAL=${PWD}/tutorial-onmt-wizard-1 250 | $ cd ${TUTORIAL} 251 | $ cd nmt-wizard/server 252 | $ python worker.py 253 | ``` 254 | 255 | ## The client command line 256 | 257 | `{{YOURID}}` is the trainer id, used as a prefix for generated models (default: ENV[`LAUNCHER_TID`]) 258 | ``` 259 | $ export TUTORIAL=${PWD}/tutorial-onmt-wizard-1 260 | $ cd ${TUTORIAL} 261 | $ export LAUNCHER_URL=http://{{YOURIP}}:5000 262 | $ export LAUNCHER_TID={{YOURID}} 263 | $ mkdir nmt-wizard/example 264 | ``` 265 | 266 | Copy the following JSON into `nmt-wizard/example/helloworld.json`. 
267 | 268 | ``` 269 | { 270 | "source": "ru", 271 | "target": "en", 272 | "data": { 273 | "sample_dist": [ 274 | { 275 | "path": "train", 276 | "distribution": { 277 | "helloworld.*": "1" 278 | } 279 | } 280 | ], 281 | "sample": 100000, 282 | "train_dir": "ru_en" 283 | }, 284 | "options": { 285 | "train": { 286 | "rnn_size": "50", 287 | "word_vec_size": "20", 288 | "layers": "1", 289 | "src_vocab": "${CORPUS_DIR}/vocab/helloworld.ruen.src.dict", 290 | "tgt_vocab": "${CORPUS_DIR}/vocab/helloworld.ruen.tgt.dict" 291 | } 292 | } 293 | } 294 | ``` 295 | `${CORPUS_DIR}` is a local environment variable; you don't need to change it. 296 | This is the configuration of a simple transliteration training task; it has two parts: `data` and `options`. 297 | * `data` part: the source language is `ru` and the target language is `en`; the corpus is picked from `${TUTORIAL}/data/<train_dir>/<path>/`; the files with extensions `ru` / `en` matching the pattern `helloworld.*` will be picked. Their coefficient is set to `1` in the total of `100000` samples. See the [sampling documentation](https://github.com/OpenNMT/OpenNMT/blob/master/docs/training/sampling.md) 298 | * `options` part: the training configuration; in this training, a local custom file `${TUTORIAL}/data/vocab/helloworld.ruen.src.dict` will be copied to and used on the server. See the [training option documentation](https://github.com/OpenNMT/OpenNMT/blob/master/docs/options/train.md) 299 | 300 | 301 | Go through the different commands: 302 | 303 | ``` 304 | $ cd nmt-wizard/client 305 | ``` 306 | 307 | 308 | - `ls`: returns available services 309 | 310 | ``` 311 | $ python launcher.py ls 312 | ``` 313 | - `lt`: returns the list of tasks in the database 314 | 315 | ``` 316 | $ python launcher.py lt 317 | ``` 318 | - `launch` `train`: starts a training task, the return value is a task id `taskid_1` 319 | 320 | ``` 321 | $ python launcher.py launch -s myserver -i nmtwizard/opennmt-lua -- -ms /root/tmp/ -c @../example/helloworld.json train 322 | ``` 323 | - `launch` `trans`: transliterates/translates `/root/corpus/test/helloworld.ruen.test.ru` using the model of `taskid_1`, the return value is a task id `taskid_2` 324 | 325 | ``` 326 | $ python launcher.py launch -s myserver -i nmtwizard/opennmt-lua -- -ms /root/tmp/ -m trans -i /root/corpus/test/helloworld.ruen.test.ru -o "launcher:helloworld.ruen.test.ru.out" 327 | ``` 328 | - `file`: gets a file from the translation task 329 | 330 | ``` 331 | $ python launcher.py file -f helloworld.ruen.test.ru.out -k > ${TUTORIAL}/data/test/helloworld.ruen.test.ru.out 332 | ``` 333 | - `terminate`: stops a running/queued task by its `taskid` 334 | 335 | ``` 336 | $ python launcher.py terminate -k 337 | ``` 338 | - `status`: checks the status of a task by its `taskid` 339 | 340 | ``` 341 | $ python launcher.py status -k 342 | ``` 343 | Evaluation: 344 | check whether there is log information at the head of the output file; remove the log lines and the trailing empty line: 345 | 346 | ``` 347 | $ tail -n +3 ${TUTORIAL}/data/test/helloworld.ruen.test.ru.out | head -n -1 > ${TUTORIAL}/data/test/helloworld.ruen.test.ru.out.tmp 348 | $ mv ${TUTORIAL}/data/test/helloworld.ruen.test.ru.out.tmp ${TUTORIAL}/data/test/helloworld.ruen.test.ru.out 349 | ``` 350 | then 351 | 352 | ``` 353 | $ cd ${TUTORIAL} 354 | $ git clone https://github.com/OpenNMT/nmt-benchmark.git 355 | $ perl nmt-benchmark/scripts/multi-bleu.perl ${TUTORIAL}/data/test/helloworld.ruen.test.en < ${TUTORIAL}/data/test/helloworld.ruen.test.ru.out 356 | ``` 357 | A BLEU score will be shown: 358 | 359 | ``` 360 | BLEU = 70.27 +/- 0.44, 
88.5/78.0/69.4/62.1 (BP=0.952, ratio=0.953, hyp_len=6164, ref_len=6469) 361 | ``` 362 | 363 | There are also other alternative storage types: 364 | * S3: connecting to AWS using an access key ID and a secret key 365 | * SSH: connecting to a remote server hostname/IP via SSH 366 | 367 | ``` 368 | "storages" : { 369 | "s3_models": { 370 | "type": "s3", 371 | "bucket": "model-catalog", 372 | "aws_credentials": { 373 | "access_key_id": "XXXXX", 374 | "secret_access_key": "XXXXX", 375 | "region_name": "eu-west-3" 376 | } 377 | }, 378 | "myremoteserver": { 379 | "type": "ssh", 380 | "server": "myserver_url", 381 | "user": "XXXXX", 382 | "password": "XXXXX" 383 | } 384 | } 385 | ``` 386 | ## Using pre-configured launcher with EC2 credentials 387 | 388 | If you are here, you are ready to move to the next stage - let us try to launch the same task on an EC2 instance using a pre-configured launcher. 389 | 390 | First configure access to the launcher: 391 | 392 | ``` 393 | export LAUNCHER_URL=http://stlauncher.opennmt.net 394 | ``` 395 | 396 | Explore the available services: 397 | 398 | ``` 399 | $ python launcher.py ls 400 | SERVICE NAME DESCRIPTION 401 | ec2 Instance on AWS EC2 402 | ``` 403 | 404 | Describe the resources available on EC2: 405 | 406 | ``` 407 | $ python launcher.py describe -s ec2 408 | {"launchTemplateName": {"enum": ["CPU_C5_xlarge_50Gb", "GPU_G3_4xlarge_50Gb"], "type": "string", "description": "The name of the EC2 launch template to use", "title": "EC2 Launch Template"}} 409 | ``` 410 | 411 | There are 2 different resources configured and available on EC2: `CPU_C5_xlarge_50Gb` and `GPU_G3_4xlarge_50Gb` - this is a JSON form for selecting `launchTemplateName`. 412 | 413 | To select the resource, you need to pass a JSON file corresponding to your choice: 414 | 415 | ``` 416 | $ cat > o.json 417 | {"launchTemplateName":"GPU_G3_4xlarge_50Gb"} 418 | ``` 419 | 420 | So let us be bold and launch our task on EC2 using a GPU instance. 421 | 422 | The EC2 service is configured with a mount of the S3 bucket `nmt-wizard-data` on `${CORPUS_DIR}`. 423 | 424 | The S3 bucket contains: 425 | * `ru_en` - which is the same as in the tutorial data 426 | * `wmt17/de_en` - containing all the prepared German-English (DE-EN) data for WMT (a sketch of a configuration adapted to this data follows below). 
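As an aside, if you later want to use the `wmt17/de_en` data from this bucket rather than the transliteration corpus, the `data` part of the training configuration could be adapted along the following lines. This is only a sketch: the sub-directory layout and file patterns inside `wmt17/de_en` are assumptions that you should verify against the actual bucket content.

```
{
  "source": "de",
  "target": "en",
  "data": {
    "sample_dist": [
      {
        "path": "train",
        "distribution": {
          ".*": "1"
        }
      }
    ],
    "sample": 100000,
    "train_dir": "wmt17/de_en"
  }
}
```

The `options` part would stay similar, with `src_vocab` and `tgt_vocab` pointing to German and English dictionaries (for instance the `de.dict` and `en.dict` files shipped with the tutorial data).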
427 | 428 | with this information, modify the helloworld.json and you can launch the same transliteration training on EC2 GPU instance: 429 | 430 | ``` 431 | $ python launcher.py launch -s ec2 -o @o.json -i nmtwizard/opennmt-lua -- -ms s3_models: -c @../example/helloworld.json train 432 | ``` 433 | 434 | --- 435 | 436 | *Congratulations for completing this Hello World tutorial!* 437 | -------------------------------------------------------------------------------- /nmt-wizard/data/test/helloworld.ruen.test.en: -------------------------------------------------------------------------------- 1 | A r n o 2 | G r a n a d o 3 | A l b e r t o 4 | H a m a d 5 | T h u w a i n i 6 | P e g g 7 | D a v i d 8 | A l e x a n d e r 9 | K o s h e t z 10 | V l a d i m i r 11 | G u s i n s k y 12 | V a s k s 13 | P ē t e r i s 14 | S u ñ é 15 | R u b é n 16 | M a n m e e t 17 | B h u l l a r 18 | W o l f g a n g 19 | B o e t t c h e r 20 | B r a u n e r 21 | V i c t o r 22 | L o u i s e 23 | N e t h e r l a n d s 24 | J o h i n 25 | B r a d 26 | M i l l e r 27 | N e e l a m 28 | S a n j i v a 29 | R e d d y 30 | S k l e n a ř í k o v á 31 | A d r i a n a 32 | S h u a i 33 | M a r c 34 | M á r q u e z 35 | M a r c 36 | M á r q u e z 37 | S i n n e t t 38 | A l f r e d 39 | P e r c y 40 | P a r m e n i o n 41 | G a i u s 42 | A s i n i u s 43 | G a l l u s 44 | I r i s 45 | T r e e 46 | A l e x 47 | R o d r í g u e z 48 | C o m p a n s 49 | J e a n 50 | D o m i n i q u e 51 | N i c h o l a s 52 | L e a 53 | F e l i x 54 | B o u r b o n 55 | P a r m a 56 | S i m m o n s 57 | W o o d w a r d 58 | O p p e n h e i m e r 59 | E r n e s t 60 | F i l i p p o 61 | E d u a r d o 62 | B a s a r a b 63 | T e r u o 64 | N a k a m u r a 65 | J o v a n 66 | K i r o v s k i 67 | S h a r m a 68 | A r v i n d 69 | T e m i l e 70 | F r a n k 71 | S a n t o s 72 | G o n ç a l v e s 73 | J o ã o 74 | P e d r o 75 | M i c h a e l 76 | H u t h 77 | G i r a r d o t 78 | H i p p o l y t e 79 | C a t h e r i n e 80 | P e r r e t 81 | K a b a i v a n s k a 82 | R a i n a 83 | R a j 84 | R a j a r a t n a m 85 | E e r o 86 | A a r n i o 87 | A a r n i o 88 | I n a f u n e 89 | K e i j i 90 | B o n n a t 91 | L é o n 92 | A d i t y a 93 | C h o p r a 94 | A n n e n k o v 95 | M i k h a i l 96 | A v r a h a m 97 | P o r a z 98 | B e b e l 99 | G i l b e r t o 100 | D a v i d 101 | W i l l i s 102 | B e c k h a m 103 | G e s s 104 | E d g a r 105 | M o h a m e d 106 | A m s i f 107 | W h i t e 108 | G e o r g e 109 | N ' d y 110 | A s s e m b é 111 | K a l n i ņ š 112 | I v a r s 113 | W i l e ń s k i 114 | K o n s t a n t y 115 | J e o n g 116 | T u c k e r 117 | J o n a t h a n 118 | O l d m a n 119 | A l b e r t 120 | J o h n 121 | D o b r y n i n 122 | V y a c h e s l a v 123 | S i v o r i 124 | C a m i l l o 125 | S i l v a 126 | R a m o s 127 | K l a u s 128 | T e n n s t e d t 129 | C a r m e 130 | E l i a s 131 | I m a n o v 132 | L u t f i y a r 133 | T r i c i a 134 | H e l f e r 135 | A d a m 136 | A n d e r s o n 137 | E w e n 138 | F e r g u s s o n 139 | D i e h l 140 | R i c h a r d 141 | I n n o c e n t 142 | G r a y s o n 143 | H a l l 144 | J o h n 145 | E a t w e l l 146 | E a t w e l l 147 | M a r k o 148 | K e š e l j 149 | B o g u s ł a w 150 | R a d z i w i ł ł 151 | S t a n i s ł a w 152 | K o s t k a 153 | P o t o c k i 154 | A n d y 155 | L a P l e g u a 156 | C u n n i n g h a m 157 | W i l l i a m 158 | A p e r g h i s 159 | G e o r g e s 160 | H a n s 161 | J a k o b 162 | C h r i s t o f f e l 163 | v o n 164 
| G r i m m e l s h a u s e n 165 | C h a n g y 166 | A l a i n 167 | D a v i d 168 | C h a s e 169 | G i r s 170 | N i k o l a y 171 | F i l t s c h 172 | C a r l 173 | T h o m s o n 174 | W a r r e n 175 | L u i s 176 | A l b e r t o 177 | P e r e a 178 | E d u a r d o 179 | S á n c h e z 180 | F u e n t e s 181 | P o i r é 182 | A l a i n 183 | A l e k s e i 184 | S h a p o s h n i k o v 185 | Y r j ö 186 | H i e t a n e n 187 | M o r t a u d 188 | C h l o é 189 | A l e x a n d e r 190 | W a n g 191 | M a c M a n u s 192 | A r t h u r 193 | D r u z h n i k o v 194 | V l a d i m i r 195 | P a n c h e n k o 196 | Y u r i y 197 | B e n e d i c t 198 | M i c h a e l 199 | C e r v e r i s 200 | N e r s e s 201 | Y e r i t s y a n 202 | M i n t o n 203 | F a i t h 204 | G r e t a 205 | K u k k o n e n 206 | C o l e 207 | T a y l o r 208 | D o n a l d 209 | G r a h a m 210 | B u r t 211 | T h o m a s 212 | B u r r o w e s 213 | D i m i t a r 214 | A g u r a 215 | D m i t r y 216 | I v a n o v i c h 217 | K i t ō 218 | B u l l 219 | J a c o b 220 | B r e d a 221 | B l a s c o 222 | I b á ñ e z 223 | V i c e n t e 224 | T i n a 225 | D a v i s 226 | K ř e s a d l o 227 | B u t c h 228 | W a l k e r 229 | L a u r e 230 | J u n o t 231 | M a r t h a 232 | M a d i s o n 233 | P h i l i p 234 | G e o r g e 235 | N e e d h a m 236 | S a d e l e r 237 | M a r k e d o n o v 238 | S e r g e y 239 | M e l i a v a 240 | T a m a z 241 | A p o ñ o 242 | R u b i n a 243 | A l i 244 | K o u n e l l i s 245 | J a n n i s 246 | V a l e r i e 247 | B r i s c o 248 | H o o k s 249 | S a m u e l 250 | C r o w t h e r 251 | O d e n 252 | G r e g 253 | S u b b o t i n 254 | S e r g e y 255 | S u z a n 256 | A n b e h 257 | J e f f r e y 258 | C r a i g 259 | L a B e o u f 260 | H u s a y n 261 | B a y q a r a 262 | C o n s t a n c e 263 | J o h n 264 | C l a p h a m 265 | D e p e s t r e 266 | R e n e 267 | P e r r i n 268 | C l a u d e 269 | V i c t o r 270 | C a n t i u s 271 | P r u d n i k a u 272 | P a v e l 273 | H e r b e r t 274 | B r e n o n 275 | T i b o r 276 | M a c h a n 277 | C a l l i x t u s 278 | R o b e r t 279 | P a t t i n s o n 280 | V a l e n t i n e 281 | N o v i k o v 282 | S e r g e y 283 | A n d r e a s 284 | R o m b e r g 285 | K h a n k e y e v 286 | I g o r 287 | M u r a v s c h i 288 | V a l e r i u 289 | D o n g 290 | F a n g z h u o 291 | A n j a 292 | E r ž e n 293 | C o l e e n 294 | R o o n e y 295 | C h r i s t e r 296 | F u g l e s a n g 297 | J o a q u í n 298 | S á n c h e z 299 | K o s a m b i 300 | A r t h u r 301 | D o v e 302 | O w e n 303 | C h a s e 304 | G a r c í a 305 | S á n c h e z 306 | A b r a h a m 307 | b e n 308 | J a c o b 309 | F r e d d i e 310 | Y o u n g 311 | G o i t e i n 312 | S h e l o m o 313 | D o v 314 | C h a r l e s 315 | B e n n e t t 316 | A u g s b e r g e r 317 | F r a n z 318 | F y d r y c h 319 | W a l d e m a r 320 | A n t o n 321 | A l e x a n d e r 322 | v o n 323 | A u e r s p e r g 324 | T h o m a s 325 | W a l t e r 326 | S e r g e i 327 | D m i t r i y e v i c h 328 | B o g d a n o v 329 | V l a d i m i r 330 | Y e v t u s h e n k o v 331 | C o q u e r e l 332 | M i s s a k 333 | M a n o u c h i a n 334 | F r a n k o w s k i 335 | T o m a s z 336 | P a r a s h a r a 337 | E z r a 338 | H e y w o o d 339 | S h l o m o 340 | L a h a t 341 | W i l l i a m 342 | H a m i l t o n 343 | G i b s o n 344 | G r e g 345 | C o x 346 | M a k a r o v 347 | K o n s t a n t i n 348 | B e r n h a r d t 349 | M o l i q u e 350 | C a r l o s 351 | D i o 
g o 352 | K o v n e r 353 | B r u c e 354 | L u c a s 355 | A l a m á n 356 | S a m a n e z 357 | O c a m p o 358 | D a v i d 359 | V i c t o r 360 | L o r e t 361 | A n a y a 362 | E l e n a 363 | A n a n i a 364 | S h i r a k a t s i 365 | S h i f r i n 366 | K a r i n 367 | R u d i 368 | A r n s t a d t 369 | I r i n a 370 | A b y s o v a 371 | C a t h e r i n e 372 | M i c h e l l e 373 | G a ł c z y ń s k i 374 | K o n s t a n t y 375 | I l d e f o n s 376 | I v o 377 | M i n á ř 378 | A b u 379 | U b a i d a h 380 | i b n 381 | J a r r a h 382 | V a l l i 383 | A l i d a 384 | N i e r o t h 385 | C a r l 386 | X u a n 387 | A r t y o m 388 | K h a c h a t u r o v 389 | P i e r r e 390 | F r a n ç o i s 391 | T i s s o t 392 | V e r l a t 393 | C h a r l e s 394 | E d w a r d 395 | S e y m o u r 396 | H e r t f o r d 397 | M a t t i 398 | H ä y r y 399 | R o m a n 400 | B o r i s e v i c h 401 | L o c k e 402 | J o h n 403 | O v c h i n n i k o v 404 | V a l e r i 405 | S u b h a s i s h 406 | R o y 407 | C h o w d h u r y 408 | G e o r g 409 | M o h r 410 | A d r i á n 411 | R o d r í g u e z 412 | Y e v g e n i 413 | D y a c h k o v 414 | M a r y 415 | F r i t h 416 | G u i d o 417 | L a v e z a r i s 418 | W i l l i a m 419 | W o o l l s 420 | G o d f r i d 421 | F r i s i a 422 | S e r g e i 423 | B e l o v 424 | B l i g h 425 | W i l l i a m 426 | H e n s c h e l 427 | M i l t o n 428 | G e o r g e 429 | M o r r i s o n 430 | A d a m 431 | P y o t r 432 | D u r n o v o 433 | P e t e r 434 | S c h a e f e r 435 | C h e n 436 | B i n g d e 437 | R è n 438 | L y n n 439 | T h o r n d i k e 440 | S i m o n 441 | G a l l u p 442 | K o n s t a n t i n 443 | K o r o v i n 444 | F r a n z 445 | P f e f f e r 446 | v o n 447 | S a l o m o n 448 | B e n a d o 449 | A r i k 450 | J o h n 451 | E a t t o n 452 | C h e m i a k i n 453 | M i h a i l 454 | F e l i p e 455 | J o r g e 456 | L o u r e i r o 457 | N i k i t a 458 | B e l y k h 459 | I v a n o v 460 | S u k h a r e v s k y 461 | A l e k s a n d r 462 | S h a i n o v a 463 | M a r i n a 464 | M e t s h i n 465 | I l s u r 466 | O l e 467 | E l l e f s æ t e r 468 | V i k t o r 469 | K a l i n a 470 | N a t h a n 471 | P a r k e r 472 | F r a n t i š e k 473 | K a b e r l e 474 | H u b b a r d 475 | Q u e n t i n 476 | C o l l e e n 477 | B a c h m a n 478 | K a r i ć 479 | A m i r 480 | E g a n 481 | S e a n 482 | M a r t i n 483 | S t r a n z l 484 | O n d ř í č e k 485 | M i r o s l a v 486 | A l l a n 487 | C o m b s 488 | M o r g a n 489 | B r i a n 490 | G e r t r u d e 491 | B r u n s w i c k 492 | R a f a e l 493 | B o b a n 494 | F r e d e r i c 495 | D u b o i s 496 | O l g a 497 | F y o d o r o v a 498 | O r l o f f 499 | V a i l l a n t 500 | J e a n 501 | B a p t i s t e 502 | P h i l i b e r t 503 | I b r a h i m 504 | S a h a d 505 | J o h a n n 506 | J o a c h i m 507 | L a n g e 508 | K o r y u n 509 | K r i s s y 510 | T a y l o r 511 | G r i m a l d i 512 | D a v i d 513 | S a b i n e 514 | T h i e r r y 515 | S a p r y k i n 516 | O l e g 517 | S a r a 518 | S t o c k b r i d g e 519 | I r e n e 520 | T o m a r 521 | D e r e k 522 | M o r r i s 523 | A l o y s i u s 524 | L i l i u s 525 | L j u b a 526 | K a z a r n o v s k a y a 527 | M i c h a e l 528 | v o n 529 | d e r 530 | H e i d e 531 | K o s t i ć 532 | B o r i s 533 | V a s i l y 534 | K o n s t a n t i n o v i c h 535 | R i c h a r d 536 | S u l í k 537 | P a v e l 538 | T o n k o v 539 | R y a n 540 | C o n r o y 541 | Ć o s i ć 542 | D o b r i c a 
543 | B j ø r n d a l e n 544 | O l e 545 | E i n a r 546 | M u r a v i e v 547 | N i k o l a y 548 | S e a r l e 549 | R o b e r t 550 | T h i e r r y 551 | A m a r 552 | J o e l 553 | B i l l y 554 | A m i r 555 | K h a d i r 556 | B a r b a r a 557 | H a n n i g a n 558 | M i r t a 559 | B u s n e l l i 560 | R u d i 561 | C e r n e 562 | M a z z u c a t o 563 | A l b e r t o 564 | F r a s e r 565 | K r i s t i n 566 | L e o n a r d 567 | M a t l o v i c h 568 | I u l i a n 569 | E r h a n 570 | N a t a l i e 571 | M e n d o z a 572 | G o m e s 573 | I n n a 574 | P a u l 575 | H a r d i n g 576 | S t a n 577 | M a k 578 | P e t r o v 579 | D e n i s 580 | T e i m u r a z 581 | C h i r g a d z e 582 | C h a l a b a l a 583 | Z d e n ě k 584 | M o n t b a z o n 585 | H e r c u l e 586 | F i o n a 587 | V i c t o r y 588 | A n d r e w 589 | T a r b e t 590 | C h o d o w i e c k i 591 | D a n i e l 592 | V a s h a k i d z e 593 | T a m a z 594 | C a r o l i n e 595 | G a r c i a 596 | A n d r e y 597 | B o l s h o y 598 | R y c r o f t 599 | C a r t e r 600 | M i l j k o v i ć 601 | V i j a y 602 | K u m a r 603 | S c o t t 604 | B o o t h 605 | C h é t a r d i e 606 | J a c q u e s 607 | J o a c h i m 608 | T r o t t i 609 | V a r a z d a t 610 | B a r t h e l m e s s 611 | R i c h a r d 612 | P o r t e i r o 613 | F é l i x 614 | O k s a n a 615 | K a l a s h n i k o v a 616 | A d l a n 617 | K h a s a n o v 618 | S e v a k 619 | P a r u y r 620 | L e l l 621 | C h r i s t i a n 622 | M a m m a d 623 | Y u s i f 624 | J a f a r o v 625 | C l a r k 626 | H a d d e n 627 | G u s t a v e 628 | H u m b e r t 629 | M a x i m 630 | Z i m i n 631 | K i m 632 | G w o n 633 | J e n n a 634 | C o l e m a n 635 | R a k a n 636 | R u s h a i d a t 637 | I e r o n i m 638 | U b o r e v i c h 639 | N a t a l i a 640 | G r z e g o r z 641 | K a r n a s 642 | G e n n a d y 643 | M i k h a s e v i c h 644 | L i a m 645 | H e a t h 646 | E n r i c o 647 | B e v i g n a n i 648 | V e r h a a s 649 | B a r b a r a 650 | B r i g g s 651 | E v a n s 652 | V i n c e 653 | K e v i n 654 | C a r o l a n 655 | R o b e r t 656 | A n d r e w s 657 | M i l l i k a n 658 | V y a c h e s l a v 659 | S h e v c h u k 660 | J u r i e t t i 661 | F r a n c k 662 | K a t s u r a 663 | T a r ō 664 | D a v i d 665 | R a t h 666 | K e v i n 667 | K a s h a 668 | A l e x a n d e r 669 | K o r z h a k o v 670 | F o r d 671 | F o r d 672 | F o r d 673 | F o r d 674 | M a d o x 675 | A n a t o l i y 676 | M o r o z o v 677 | Y e g o r 678 | R i d o s h 679 | O a k l a n d 680 | R o d r i g o 681 | A s t u r i a s 682 | S i t a 683 | O l i v e 684 | L e m b e 685 | L i n c a r 686 | E r i k 687 | A l o j z y 688 | F e l i ń s k i 689 | O l g a 690 | Y a k o v l e v a 691 | A f i n o g e n o v 692 | A l e x a n d e r 693 | B r a n k o 694 | G a v e l l a 695 | R a n d i 696 | Z u c k e r b e r g 697 | G i u s e p p e 698 | I m p a s t a t o 699 | I v a n 700 | B e t s k o y 701 | B e n o î t 702 | A n g b w a 703 | G u s t a f 704 | J a d e 705 | L a y l a 706 | P o l y c a r p u s 707 | B o d d i n g 708 | P a u l 709 | O l a f 710 | N a g i m a 711 | E s k a l i e v a 712 | J a n e 713 | F e l l o w e s 714 | F e l l o w e s 715 | B a r o n e s s 716 | F e l l o w e s 717 | F e l l o w e s 718 | H e r s h e l e 719 | O s t r o p o l e r 720 | G r a m s c i 721 | A n t o n i o 722 | C h u r a n d y 723 | M a r t i n a 724 | P a u l 725 | F o n o r o f f 726 | F e l l n e r 727 | H e l m e r 728 | T h o m p s o n 729 | T o n y 730 
| L e o n i d 731 | B r o n e v o y 732 | R e i n a l d 733 | S c h n e l l 734 | P é t e r 735 | B e s e n y e i 736 | A n d r e y 737 | Z a l e s k i 738 | M a n s u r 739 | Y a h y a 740 | l e o n i d 741 | L a z a r e v 742 | Y e v g e n y 743 | F a d e y e v 744 | A l e k s a n d r e 745 | M i r t s k h u l a v a 746 | S e r h i y 747 | R u d y k a 748 | D m i t r y 749 | A n d r e i k i n 750 | J o h n s o n 751 | M i c h a e l 752 | V a l e n t i n 753 | G a n e v 754 | I m o g e n e 755 | B l i s s 756 | D i c k 757 | R y a n 758 | K i t t y 759 | K i r w a n 760 | W i n i t s 761 | D a n i e l l e 762 | J a n n e 763 | J a l a s v a a r a 764 | J u l i a n 765 | T o w n s e n d 766 | L i l i y a 767 | S h o b u k h o v a 768 | Z e l j k a 769 | F r a n u l o v i c 770 | B o r i s 771 | R a j e w s k y 772 | S h a r o n 773 | M a n n 774 | B o u l l é e 775 | É t i e n n e 776 | L o u i s 777 | V i k t o r 778 | F r a y o n o v 779 | É v a 780 | K ó c z i á n 781 | M a r i e 782 | M a x 783 | K e i s e r 784 | Y u r i 785 | I v l e v 786 | S t a n i s l a v 787 | G u s t a v o v i c h 788 | S t r u m i l i n 789 | I g o r 790 | L o l o 791 | T a g i r 792 | K u s i m o v 793 | A n t o i n e 794 | R o b e r t 795 | K y r a 796 | B u s c h o r 797 | J a r i 798 | K o s k i n e n 799 | N e l l 800 | B u r t o n 801 | N a s i r u d d i n 802 | Y o u s u f f 803 | J a g e r 804 | C o r n e l i s 805 | M a r k 806 | R o b e r t s 807 | D e l p h i n e 808 | T o m s o n 809 | C r a i g 810 | A d k i n s 811 | Q u a r t u c c i 812 | P e d r o 813 | F r a n c e n a 814 | M c C o r o r y 815 | J o s e p h i n n e 816 | Y a r o s h e v i c h 817 | R i c h a r d 818 | D u c k e t t 819 | L a r n i 820 | M a r t t i 821 | M a r i y a 822 | K r i v o p o l e n o v a 823 | B a r i 824 | M o r g a n 825 | N e f e r k a r a 826 | R o b e r t 827 | K a l e s k i 828 | L y n c h 829 | Y u r i 830 | A l e x a n d r o v 831 | A v d o t i a 832 | T i m o f e y e v a 833 | L á z a r o 834 | Á l v a r e z 835 | M a r g a r e t 836 | B e e b e 837 | A n n a 838 | P a n i n a 839 | L e o n i d 840 | D o b r o v s k y 841 | S u z a n n e 842 | P r i n g l e 843 | R o g e r 844 | M u n i e r 845 | A l e x a n d e r 846 | D e v o n 847 | H u r t 848 | S u s a n n e 849 | B r ü c k n e r 850 | G r e t a 851 | A m e n d 852 | W a l t e r 853 | B e a k e l 854 | P r i n c e s 855 | T o w e r 856 | V i n c e n z o 857 | C e r a m i 858 | P i e r r e 859 | F u n c k 860 | J o s e f 861 | N a d j 862 | G r a n t 863 | P a u l 864 | A d e l a i d e 865 | S o b i e s k a 866 | F o d h l a 867 | C r o n i n 868 | O ' R e i l l y 869 | C a r l 870 | L u d w i g 871 | B a r 872 | K o r n e y 873 | P e t r e n k o 874 | M i c h a e l 875 | T i n s l e y 876 | A l e k s a n d r 877 | K a z a k e v i č 878 | G a l i n a 879 | N i l o v a 880 | A n t o n 881 | N i l o v 882 | J e n n y 883 | T r i p p 884 | I g o r 885 | P e t r e n k o 886 | R i n a t 887 | I b r a g i m o v 888 | N i k o l a y 889 | P a k h o m o v 890 | I u r i i 891 | K r a k o v e t s k i i 892 | Z i n e d i n e 893 | Z i n e d i n e 894 | Z i d a n e 895 | K e l l e n b e r g e r 896 | E m i l 897 | E m i l 898 | J a n e n z 899 | A u d r i n a 900 | P a t r i d g e 901 | P r i s c i a n 902 | N o a h 903 | A k w u 904 | H a r o l d 905 | M a c m i l l a n 906 | B r e n t a n o 907 | F r a n z 908 | J a c o p o 909 | F o r o n i 910 | H i l a r y 911 | D u f f 912 | I r i n a 913 | A v v a k u m o v a 914 | A n n e t t e 915 | v o n 916 | D r o s t e 
917 | H ü l s h o f f 918 | O m a r 919 | T o r r i j o s 920 | A d a m 921 | T a r n o w s k i 922 | M a r i a 923 | G a n s e v o o r t 924 | M e l v i l l 925 | V l a d i m i r 926 | R a u t b a r t 927 | Y e l i z a v e t a 928 | Y a n k o v s k a y a 929 | E d d a 930 | G ö r i n g 931 | A l e k s a n d r a 932 | N i k o l i c 933 | J ā n i s 934 | S t r e n g a 935 | L a r i s a 936 | G u z e e v a 937 | J o a c h i m 938 | D i t t r i c h 939 | J o h a n n 940 | G e o r g 941 | K a n t 942 | G i u l i a 943 | C o s i m o 944 | A m m a n n a t i 945 | J o h n 946 | S i m m i t 947 | S y l v i a 948 | Z i d e k 949 | L a u r e 950 | B a l z a c 951 | C r i s t i n a 952 | D i l l a 953 | S v e t l a n a 954 | Y u r y e v n a 955 | B l e d n a y a 956 | T e r r y 957 | C h a n 958 | M a x 959 | O p h ü l s 960 | A n a s t a s i y a 961 | B i r c t h o v a 962 | M e d e y a 963 | J u g e l i 964 | E d w a r d 965 | O l e s c h a k 966 | M i k e 967 | T a l e r i c o 968 | T h e o p o m p u s 969 | K o n r a d 970 | H u m m l e r 971 | M a r k 972 | H i d d i n k 973 | W e s t e r m a r c k 974 | E d v a r d 975 | A n d r e y 976 | M e l e n s k y 977 | G e n n a d y 978 | R e b r o v 979 | A r i n a 980 | V l a d i s l a v o v n a 981 | B e z b o r o d o v a 982 | A m i r a n 983 | A n a n i d z e 984 | M a x 985 | M e y e r 986 | R o m a n 987 | K h o k h l o v 988 | D a n i e l 989 | H a n d l i n g 990 | R o b e r t 991 | S t e v e n s o n 992 | V l a d i m i r 993 | K i l b u r g 994 | C h u r i k o v 995 | M i k h a i l 996 | K u z m i c h 997 | D m i t r y 998 | M i s h i n 999 | E k a t e r i n a 1000 | S t o y a n o v a 1001 | -------------------------------------------------------------------------------- /nmt-wizard/data/test/helloworld.ruen.test.ru: -------------------------------------------------------------------------------- 1 | А р н о 2 | Г р а н а д о 3 | А л ь б е р т о 4 | Х а м а д 5 | Т у в а й н и 6 | П е г г 7 | Д э в и д 8 | А л е к с а н д р 9 | К о ш и ц 10 | В л а д и м и р 11 | Г у с и н с к и й 12 | В а с к с 13 | П е т е р и с 14 | С у н ь е 15 | Р у б е н 16 | М а н м и т 17 | Б х у л л а р 18 | В о л ь ф г а н г 19 | Б ё т т х е р 20 | Б р а у н е р 21 | В и к т о р 22 | Л у и з а 23 | Н и д е р л а н д с к а я 24 | Ж о э н 25 | Б р э д 26 | М и л л е р 27 | Н и л а м 28 | С а н д ж и в а 29 | Р е д д и 30 | С к л е н а р и к о в а 31 | А д р и а н а 32 | Ш у а й 33 | М а р к 34 | М а р к 35 | М а р к е с 36 | М а р к е с 37 | С и н н е т т 38 | А л ь ф р е д 39 | П е р с и 40 | П а р м е н и о н 41 | Г а й 42 | А з и н и й 43 | Г а л л 44 | И р и с 45 | Т р и 46 | А л е к с 47 | Р о д р и г е з 48 | К о м п а н 49 | Ж а н 50 | Д о м и н и к 51 | Н и к о л а с 52 | Л е а 53 | Ф е л и ч е 54 | Б у р б о н 55 | П а р м с к и й 56 | С и м м о н с 57 | В у д в а р д 58 | О п п е н г е й м е р 59 | Э р н е с т 60 | Ф и л и п п о 61 | Э д у а р д о 62 | Б а с а р а б 63 | Т э р у о 64 | Н а к а м у р а 65 | Д ж о в а н 66 | К и р о в с к и 67 | Ш а р м а 68 | А р в и н д 69 | Т е м и л е 70 | Ф р а н к 71 | С а н т у ш 72 | Г о н с а л в е ш 73 | Ж у а н 74 | П е д р у 75 | М и х а э л ь 76 | Х у т 77 | Ж и р а р д о 78 | И п п о л и т 79 | К а т р и н 80 | П е р р е 81 | К а б а и в а н с к а 82 | Р а й н а 83 | Р а д ж 84 | Р а д ж а р а т н а м 85 | Э э р о 86 | Э э р о 87 | А а р н и о 88 | И н а ф у н э 89 | К э й д з и 90 | Б о н н а 91 | Л е о н 92 | А д и т ь я 93 | Ч о п р а 94 | А н н е н к о в 95 | М и х а и л 96 | А в р а а м 97 | П о р а з 98 | Б 
е б е л 99 | Ж и л б е р т у 100 | Д э в и д 101 | У и л л и с 102 | Б е к х э м 103 | Г е с с 104 | Э д г а р 105 | М о х а м е д 106 | А м с и ф 107 | У а й т 108 | Д ж о р д ж 109 | Н ’ Д и 110 | А с с е м б е 111 | К а л н ы н ь ш 112 | И в а р 113 | В и л е н с к и й 114 | К о н с т а н т и н 115 | Д ж о н 116 | Т а к е р 117 | Д ж о н а т а н 118 | О л д м а н 119 | А л ь б е р т 120 | И о г а н н 121 | Д о б р ы н и н 122 | В я ч е с л а в 123 | С и в о р и 124 | К а м и л л о 125 | С и л в а 126 | Р а м о с 127 | К л а у с 128 | Т е н н ш т е д т 129 | К а р м е 130 | Э л и а с 131 | И м а н о в 132 | Л ю т ф и я р 133 | Т р и ш и а 134 | Х е л ф е р 135 | А д а м 136 | А н д е р с о н 137 | Ю э н 138 | Ф е р г ю с с о н 139 | Д и л ь 140 | Р и ч а р д 141 | И н н о к е н т и й 142 | Г р э й с о н 143 | Х о л л 144 | Д ж о н 145 | И т у э л л 146 | И т у э л л 147 | М а р к о 148 | К е ш е л ь 149 | Б о г у с л а в 150 | Р а д з и в и л л 151 | С т а н и с л а в 152 | К о с т к а 153 | П о т о ц к и й 154 | Э н д и 155 | П л а г у а 156 | К а н н и н г е м 157 | У и л ь я м 158 | А п е р г и с 159 | Ж о р ж 160 | Г а н с 161 | Я к о б 162 | К р и с т о ф ф е л ь 163 | ф о н 164 | Г р и м м е л ь с г а у з е н 165 | Ш а н ж и 166 | А л е н 167 | Д э в и д 168 | Ч е й з 169 | Г и р с 170 | Н и к о л а й 171 | Ф и л ь ч 172 | К а р л 173 | Т о м с о н 174 | У о р р е н 175 | Л у и с 176 | А л ь б е р т о 177 | П е р е а 178 | Э д у а р д о 179 | С а н ч е с 180 | Ф у э н т е с 181 | П у а р е 182 | А л а н 183 | А л е к с е й 184 | Ш а п о ш н и к о в 185 | И р ь ё 186 | Х и е т а н е н 187 | М о р т о 188 | К л о э 189 | А л е к с а н д р 190 | В а н 191 | М а н у с 192 | А р т у р 193 | Д р у ж н и к о в 194 | В л а д и м и р 195 | П а н ч е н к о 196 | Ю р и й 197 | Б е н е д и к т 198 | М а й к л 199 | С е р в е р и с 200 | Н е р с е с 201 | Е р и ц я н 202 | М и н т о н 203 | Ф э й т 204 | Г р е т а 205 | К у к к о н е н 206 | К о у л 207 | Т е й л о р 208 | Д о н а л ь д 209 | Г р э х э м 210 | Б е р т 211 | Т о м а с 212 | Б е р р о у з 213 | Д м и т р и й 214 | А г у р а 215 | Д м и т р и й 216 | И в а н о в и ч 217 | К и т о 218 | Б у л л ь 219 | Я к о б 220 | Б р е д а 221 | Б л а с к о 222 | И б а н ь е с 223 | В и с е н т е 224 | Т и н а 225 | Д э в и с 226 | К р ш е с а д л о 227 | Б у т ч 228 | У о л к е р 229 | Л о р а 230 | Ж ю н о 231 | М а р т а 232 | М э д и с о н 233 | Ф и л и п п 234 | Д ж о р д ж 235 | Н и д е м 236 | С а д е л е р 237 | М а р к е д о н о в 238 | С е р г е й 239 | М е л и а в а 240 | Т а м а з 241 | А п о н ь о 242 | Р у б и н а 243 | А л и 244 | К у н е л л и с 245 | Я н н и с 246 | В а л е р и 247 | Б р и с к о 248 | Х у к с 249 | С э м ю э л 250 | К р о у т е р 251 | О д е н 252 | Г р е г 253 | С у б б о т и н 254 | С е р г е й 255 | С ь ю з а н 256 | А н б е х 257 | Д ж е ф ф р и 258 | К р э й г 259 | Л а б а ф 260 | Х у с е й н 261 | Б а й к а р а 262 | К о н с т а н с а 263 | Д ж о н 264 | К л э п е м 265 | Д е п е с т р 266 | Р е н е 267 | П е р р е н 268 | К л о д 269 | В и к т о р 270 | К е н т ы 271 | П р у д н и к о в 272 | П а в е л 273 | Г е р б е р т 274 | Б р е н о н 275 | Т и б о р 276 | М а х а н 277 | К а л и к с т 278 | Р о б е р т 279 | П а т т и н с о н 280 | В а л е н т и н 281 | Н о в и к о в 282 | С е р г е й 283 | А н д р е а с 284 | Р о м б е р г 285 | Х а н к е е в 286 | И г о р ь 287 | М у р а в с к и й 288 | В а л е р и й 289 | Д у н 290 | Ф а н ч ж о 291 | А н я 292 | Э р з е н 293 | К о л и н 294 | Р у н и 295 | К р 
и с т е р 296 | Ф у г л е с а н г 297 | Х о а к и н 298 | С а н ч е с 299 | К о с а м б и 300 | А р т у р 301 | Д о у в 302 | О у э н 303 | Ч е й з 304 | Г а р с и я 305 | С а н ч е с 306 | И б р а г и м 307 | и б н 308 | Я к у б 309 | Ф р е д д и 310 | Я н г 311 | Г о й т е й н 312 | Ш л о м о 313 | Д о в 314 | Ч а р л ь з 315 | Б е н н е т 316 | А у г с б е р г е р 317 | Ф р а н ц 318 | Ф и д р и х 319 | В а л ь д е м а р 320 | А н т о н 321 | А л е к с а н д е р 322 | ф о н 323 | А у э р ш п е р г 324 | Т о м а с 325 | У о л т е р 326 | С е р г е й 327 | Д м и т р и е в и ч 328 | Б о г д а н о в 329 | В л а д и м и р 330 | Е в т у ш е н к о в 331 | К о к р е л ь 332 | М и с а к 333 | М а н у ш я н 334 | Ф р а н к о в с к и й 335 | Т о м а ш 336 | П а р а ш а р а 337 | Э з р а 338 | Х е й в у д 339 | Ш л о м о 340 | Л а х а т 341 | У и л ь я м 342 | Г а м и л ь т о н 343 | Г и б с о н 344 | Г р е г 345 | К о к с 346 | М а к а р о в 347 | К о н с т а н т и н 348 | Б е р н а р 349 | М о л и к 350 | К а р л о с 351 | Д и о г о 352 | К о в н е р 353 | Б р ю с 354 | Л у к а с 355 | А л а м а н 356 | С а м а н е с 357 | О к а м п о 358 | Д а в и д 359 | В и к т о р 360 | Л о р е 361 | А н а й я 362 | Е л е н а 363 | А н а н и я 364 | Ш и р а к а ц и 365 | Ш и ф р и н 366 | К а р и н 367 | Р у д и 368 | А р н ш т а д т 369 | И р и н а 370 | А б ы с о в а 371 | К а т а л и н а 372 | М и к а э л а 373 | Г а л ч и н с к и й 374 | К о н с т а н т ы 375 | И л ь д е ф о н с 376 | И в о 377 | М и н а р ж 378 | А б у 379 | У б а й д а 380 | и б н 381 | Д ж а р р а х 382 | В а л л и 383 | А л и д а 384 | Н и р о т 385 | К а р л 386 | К с ю а н 387 | А р т ё м 388 | Х а ч а т у р о в 389 | П ь е р 390 | Ф р а н с у а 391 | Т и с с о 392 | В е р л а 393 | Ш а р л ь 394 | Э д у а р д 395 | С е й м у р 396 | Х е р т ф о р д 397 | М а т т и 398 | Х я у р ю 399 | Р о м а н 400 | Б о р и с е в и ч 401 | Л о к к 402 | Д ж о н 403 | О в ч и н н и к о в 404 | В а л е р и й 405 | С у б х а с и ш 406 | Р о й 407 | Ч о у д х у р и 408 | Г е о р г 409 | М о р 410 | А д р и а н 411 | Р о д р и г е с 412 | Е в г е н и й 413 | Д ь я ч к о в 414 | М э р и 415 | Ф р и т 416 | Г в и д о 417 | Л а в е с а р и с 418 | У и л ь я м 419 | В у л л с 420 | Г о д ф р и д 421 | Ф р и з с к и й 422 | С е р г е й 423 | Б е л о в 424 | Б л а й 425 | У и л ь я м 426 | Х е н ш е л ь 427 | М и л т о н 428 | Д ж о р д ж 429 | М о р р и с о н 430 | А д а м 431 | П ё т р 432 | Д у р н о в о 433 | П и т е р 434 | Ш е ф е р 435 | Ч э н ь 436 | Б и н д э 437 | Ж э н ь 438 | Л и н н 439 | Т о р н д а й к 440 | С а й м о н 441 | Г э л л а п 442 | К о н с т а н т и н 443 | К о р о в и н 444 | Ф р а н ц 445 | П ф е ф ф е р 446 | ф о н 447 | З а л о м о н 448 | Б е н а д о 449 | А р и к 450 | Д ж о н 451 | И т т о н 452 | Ш е м я к и н 453 | М и х а и л 454 | Ф е л и п е 455 | Ж о р ж е 456 | Л о р е й р о 457 | Н и к и т а 458 | Б е л ы х 459 | И в а н о в 460 | С у х а р е в с к и й 461 | А л е к с а н д р 462 | Ш а и н о в а 463 | М а р и н а 464 | М е т ш и н 465 | И л ь с у р 466 | У л е 467 | Э л л е ф с е т е р 468 | В и к т о р 469 | К а л и н а 470 | Н а т а н 471 | П а р к е р 472 | Ф р а н т и ш е к 473 | К а б е р л е 474 | Х а б б а р д 475 | К в е н т и н 476 | К о л л и н 477 | Б а ч м а н 478 | К а р и ч 479 | А м и р 480 | И г а н 481 | Ш о н 482 | М а р т и н 483 | Ш т р а н ц л ь 484 | О н д р ж и ч е к 485 | М и р о с л а в 486 | А л л а н 487 | К о м б с 488 | М о р г а н 489 | Б р а й а н 490 | Г е р т р у д а 491 | Б р а у н 
ш в е й г с к а я 492 | Р а ф а э л ь 493 | Б о б а н 494 | Ф р е д е р и к 495 | Д ю б у а 496 | О л ь г а 497 | Ф ё д о р о в а 498 | О р л о ф ф 499 | В а л ь я н 500 | Ж а н 501 | Б а т и с т 502 | Ф и л и б е р 503 | И б р а г и м 504 | С а х а д 505 | И о г а н н 506 | И о а х и м 507 | Л а н г е 508 | К о р ю н 509 | К р и с с и 510 | Т е й л о р 511 | Г р и м а л ь д и 512 | Д э в и д 513 | С а б и н 514 | Т ь е р р и 515 | С а п р ы к и н 516 | О л е г 517 | С а р а 518 | С т о к б р и д ж 519 | И р и н а 520 | Т о м а р с к а я 521 | Д е р е к 522 | М о р р и с 523 | А л о и з и й 524 | Л и л и у с 525 | Л ю б о в ь 526 | К а з а р н о в с к а я 527 | М и х а э л ь 528 | ф о н 529 | д е р 530 | Х а й д е 531 | К о с т и ч 532 | Б о р а 533 | В а с и л и й 534 | К о н с т а н т и н о в и ч 535 | Р и х а р д 536 | С у л и к 537 | П а в е л 538 | Т о н к о в 539 | Р а й а н 540 | К о н р о й 541 | Ч о с и ч 542 | Д о б р и ц а 543 | Б ь ё р н д а л е н 544 | У л е 545 | Э й н а р 546 | М у р а в ь ё в 547 | Н и к о л а й 548 | С и р л 549 | Р о б е р т 550 | Т ь е р и 551 | А м а р 552 | Д ж о э л 553 | Б и л л и 554 | А м и р 555 | Х а д и р 556 | Б а р б а р а 557 | Х а н н и г а н 558 | М и р т а 559 | Б у с н е л л и 560 | Р у д и 561 | Ц е р н е 562 | М а д з у к а т о 563 | А л ь б е р т о 564 | Ф р е й з е р 565 | К р и с т и н 566 | Л е о н а р д 567 | М э т л о в и ч 568 | Ю л и а н 569 | Е р х а н 570 | Н а т а л и 571 | М е н д о с а 572 | Г о м е с 573 | И н н а 574 | П о л 575 | Х а р д и н г 576 | С т э н 577 | М э к 578 | П е т р о в 579 | Д е н и с 580 | Т е й м у р а з 581 | Ч и р г а д з е 582 | Х а л а б а л а 583 | З д е н е к 584 | М о н б а з о н 585 | Э р к ю л ь 586 | Ф и о н а 587 | В и к т о р и 588 | Э н д р ю 589 | Т а р б е т 590 | Х о д о в е ц к и й 591 | Д а н и э л ь 592 | В а ш а к и д з е 593 | Т а м а з 594 | К а р о л и н 595 | Г а р с и я 596 | А н д р е й 597 | Б о л ь ш о й 598 | Р а й к р о ф т 599 | К а р т е р 600 | М и л ь к о в и ч 601 | В и д ж а й 602 | К у м а р 603 | С к о т т 604 | Б у т 605 | Ш е т а р д и 606 | Ж а к 607 | И о а х и м 608 | Т р о т т и 609 | В а р а з д а т 610 | Б а р т е л м е с с 611 | Р и ч а р д 612 | П о р т е й р о 613 | Ф е л и к с 614 | О к с а н а 615 | К а л а ш н и к о в а 616 | А д л а н 617 | Х а с а н о в 618 | С е в а к 619 | П а р у й р 620 | Л е л л ь 621 | К р и с т и а н 622 | М а м е д 623 | Ю с и ф 624 | Д ж а ф а р о в 625 | К л а р к 626 | Х э д д е н 627 | Г ю с т а в 628 | Х у м б е р т 629 | М а к с и м 630 | З и м и н 631 | К и м 632 | Г в о н 633 | Д ж е н н а 634 | К о у л м а н 635 | Р а к а н 636 | Р у с х а й д а т 637 | И е р о н и м 638 | У б о р е в и ч 639 | Н а т а л и я 640 | Г ж е г о ж 641 | К а р н а с 642 | Г е н н а д и й 643 | М и х а с е в и ч 644 | Л и а м 645 | Х и т 646 | Э н р и к о 647 | Б е в и н ь я н и 648 | В е р х а с 649 | Б а р б а р а 650 | Б р и г г с 651 | Э в а н с 652 | В и н с 653 | К е в и н 654 | К е р о л а н 655 | Р о б е р т 656 | Э н д р ю с 657 | М и л л и к е н 658 | В я ч е с л а в 659 | Ш е в ч у к 660 | Ж ю р ь е т т и 661 | Ф р а н к 662 | К а ц у р а 663 | Т а р о 664 | Д а в и д 665 | Р а т 666 | К е в и н 667 | К а ш а 668 | А л е к с а н д р 669 | К о р ж а к о в 670 | Ф о р д 671 | Ф о р д 672 | Ф о р д 673 | Ф о р д 674 | М э д о к с 675 | А н а т о л и й 676 | М о р о з о в 677 | Е г о р 678 | Р и д о ш 679 | О к л е н д с к и й 680 | Р о д р и г о 681 | А с т у р и а с 682 | С и т а 683 | О л и в 684 | Л е м б е 685 | Л и н к а р 
686 | Э р и к 687 | А л о и з и й 688 | Ф е л и н с к и й 689 | О л ь г а 690 | Я к о в л е в а 691 | А ф и н о г е н о в 692 | А л е к с а н д р 693 | Б р а н к о 694 | Г а в е л л а 695 | Р э н д и 696 | Ц у к е р б е р г 697 | Д ж у з е п п е 698 | И м п а с т а т о 699 | И в а н 700 | Б е ц к о й 701 | Б е н у а 702 | А н г б в а 703 | Г у с т а в 704 | Д ж е й д 705 | Л е й л а 706 | П о л и к а р п 707 | Б о д д и н г 708 | П а у л ь 709 | У л а ф 710 | Н а г и м а 711 | Е с к а л и е в а 712 | Д ж е й н 713 | Ф е л л о у з 714 | Ф е л л о у з 715 | б а р о н е с с а 716 | Ф е л л о у з 717 | Ф е л л о у з 718 | Г е р ш 719 | О с т р о п о л е р 720 | Г р а м ш и 721 | А н т о н и о 722 | Ч у р а н д и 723 | М а р т и н а 724 | П о л 725 | Ф о н о р о ф ф 726 | Ф е л ь н е р 727 | Ф е л ь н е р 728 | Т о м п с о н 729 | Т о н и 730 | Л е о н и д 731 | Б р о н е в о й 732 | Р е й н а л ь д 733 | Ш н е л ь 734 | П е т е р 735 | Б е ш е н ы й 736 | А н д р е й 737 | З а л е с к и й 738 | М а н с у р 739 | Я х ъ я 740 | Л е о н и д 741 | Л а з а р е в 742 | Е в г е н и й 743 | Ф а д е е в 744 | А л е к с а н д р 745 | М и р ц х у л а в а 746 | С е р г е й 747 | Р у д ы к а 748 | Д м и т р и й 749 | А н д р е й к и н 750 | Д ж о н с о н 751 | М а й к л 752 | В а л е н т и н 753 | Г а н е в 754 | И м о д ж и н 755 | Б л и с с 756 | Д и к 757 | Р а й а н 758 | К и т т и 759 | К и р в а н 760 | У и н и т с 761 | Д а н и э л ь 762 | Я н н е 763 | Я л а с в а а р а 764 | Д ж у л и а н 765 | Т а у н с е н д 766 | Л и л и я 767 | Ш о б у х о в а 768 | Ж е л ь к а 769 | Ф р а н у л о в и ч 770 | Б о р и с 771 | Р а е в с к и й 772 | Ш а р о н 773 | М э н н 774 | Б у л л е 775 | Э т ь е н 776 | Л у и 777 | В и к т о р 778 | Ф р а ё н о в 779 | Э в а 780 | К о ц и а н 781 | М а р и я 782 | М а к с 783 | К а й з е р 784 | Ю р и й 785 | И в л е в 786 | С т а н и с л а в 787 | Г у с т а в о в и ч 788 | С т р у м и л и н 789 | И г о р 790 | Л о л о 791 | Т а г и р 792 | К у с и м о в 793 | А н т у а н 794 | Р о б е р т 795 | К а й р а 796 | Б у ш о р 797 | Я а р и 798 | К о с к и н е н 799 | Н е л л 800 | Б ё р т о н 801 | Н а с и р у д д и н 802 | Ю с у ф ф 803 | Я г е р 804 | К о р н е л и с 805 | М а р к 806 | Р о б е р т с 807 | Д э л ь ф и н 808 | Т о м с о н 809 | К р е й г 810 | Э д к и н с 811 | К в а р т у ч ч и 812 | П е д р о 813 | Ф р а н с е н а 814 | М а к к о р о р и 815 | Ж о з е ф и н н а 816 | Я р о ш е в и ч 817 | Р и ч а р д 818 | Д а к е т т 819 | Л а р н и 820 | М а р т т и 821 | М а р и я 822 | К р и в о п о л е н о в а 823 | Б а р и 824 | М о р г а н 825 | Н е ф е р к а р а 826 | Р о б е р т 827 | К а л е с к и 828 | Л и н ч 829 | Ю р и й 830 | А л е к с а н д р о в 831 | А в д о т ь я 832 | Т и м о ф е е в н а 833 | Л а с а р о 834 | А л ь в а р е с 835 | М а р г а р е т 836 | Б и б а 837 | А н н а 838 | П а н и н а 839 | Л е о н и д 840 | Д о б р о в с к и й 841 | С ю з а н н 842 | П р и н г л 843 | Р о ж е 844 | М ю н ь е 845 | А л е к с а н д р 846 | Д е в о н 847 | Х ё р т 848 | С ю з а н н а 849 | Б р ю к н е р 850 | Г р е т а 851 | А м е н д 852 | У о л т е р 853 | Б и к е л 854 | П р и н ц ы 855 | Т а у э р е 856 | В и н ч е н ц о 857 | Ч е р а м и 858 | П ь е р 859 | Ф у н к 860 | Ж о з е ф 861 | Н а д ж 862 | Г р а н т 863 | П о л 864 | А д е л а и д а 865 | С о б е с к а я 866 | Ф о д л а 867 | К р о н и н 868 | О ’ Р е й л и 869 | К а р л 870 | Л ю д в и г 871 | Б а р 872 | К о р н е й 873 | П е т р е н к о 874 | М а й к л 875 | Т и н с л и 876 | А л е к с а н д р 
877 | К а з а к е в и ч 878 | Г а л и н а 879 | Н и л о в а 880 | А н т о н 881 | Н и л о в 882 | Д ж е н н и 883 | Т р и п 884 | И г о р ь 885 | П е т р е н к о 886 | Р и н а т 887 | И б р а г и м о в 888 | Н и к о л а й 889 | П а х о м о в 890 | Ю р и й 891 | К р а к о в е т с к и й 892 | З и н е д и н 893 | З и д а н 894 | З и д а н 895 | К е л л е н б е р г е р 896 | Э м и л ь 897 | Э м и л ь 898 | Я н е н ц 899 | О д р и н а 900 | П э т р и д ж 901 | П р и с ц и а н 902 | Н о й 903 | А к в у 904 | Г а р о л ь д 905 | М а к м и л л а н 906 | Б р е н т а н о 907 | Ф р а н ц 908 | Я к о п о 909 | Ф о р о н и 910 | Х и л а р и 911 | Д а ф ф 912 | И р и н а 913 | А в в а к у м о в а 914 | А н н е т т е 915 | ф о н 916 | Д р о с т е 917 | Х ю л ь с х о ф ф 918 | О м а р 919 | Т о р р и х о с 920 | А д а м 921 | Т а р н о в с к и й 922 | М а р и я 923 | Г а н с в о р т 924 | М е л в и л л 925 | В л а д и м и р 926 | Р а у т б а р т 927 | Е л и з а в е т а 928 | Я н к о в с к а я 929 | Э д д а 930 | Г е р и н г 931 | А л е к с а н д р а 932 | Н и к о л и ч 933 | Я н и с 934 | С т р е н г а 935 | Л а р и с а 936 | Г у з е е в а 937 | Й о а х и м 938 | Д и т р и х 939 | И о г а н н 940 | Г е о р г 941 | К а н т 942 | Д ж у л и я 943 | К о з и м о 944 | А м м а н н а т и 945 | Д ж о н 946 | С и м м и т 947 | С и л ь в и я 948 | З и д е к 949 | Л а у р а 950 | Б а л ь з а к 951 | К р и с т и н а 952 | Д и л ь я 953 | С в е т л а н а 954 | Ю р ь е в н а 955 | Б л е д н а я 956 | Т е р р и 957 | Ч а н 958 | М а к с 959 | О ф ю л ь с 960 | А н а с т а с и я 961 | Б и р ю ч е в а 962 | М е д е я 963 | Д ж у г е л и 964 | Э д в а р д 965 | О л е щ а к 966 | М а й к 967 | Т а л е р и к о 968 | Ф е о п о м п 969 | К о н р а д 970 | Х у м м л е р 971 | М а р к 972 | Х и д д и н к 973 | В е с т е р м а р к 974 | Э д в а р д 975 | А н д р е й 976 | М е л е н с к и й 977 | Г е н н а д и й 978 | Р е б р о в 979 | А р и н а 980 | В л а д и с л а в о в н а 981 | Б е з б о р о д о в а 982 | А м и р а н 983 | А н а н и д з е 984 | М а к с 985 | М а й е р 986 | Р о м а н 987 | Х о х л о в 988 | Д э н н и 989 | Х э н д л и н г 990 | Р о б е р т 991 | С т и в е н с о н 992 | В л а д и м и р 993 | К и л ь б у р г 994 | Ч у р и к о в 995 | М и х а и л 996 | К у з ь м и ч 997 | Д м и т р и й 998 | М и ш и н 999 | Е к а т е р и н а 1000 | С т о я н о в а 1001 | -------------------------------------------------------------------------------- /nmt-wizard/data/vocab/helloworld.ruen.src.dict: -------------------------------------------------------------------------------- 1 | Ф 2 | р 3 | е 4 | д 5 | Р 6 | о 7 | ж 8 | с 9 | Х 10 | а 11 | т 12 | М 13 | л 14 | и 15 | Д 16 | н 17 | Л 18 | у 19 | к 20 | й 21 | ф 22 | б 23 | ш 24 | Э 25 | в 26 | Т 27 | ё 28 | п 29 | Ш 30 | К 31 | А 32 | ч 33 | ь 34 | Е 35 | Б 36 | г 37 | я 38 | х 39 | П 40 | ы 41 | м 42 | У 43 | С 44 | ю 45 | з 46 | В 47 | Ж 48 | э 49 | Н 50 | Я 51 | З 52 | Г 53 | И 54 | О 55 | Ч 56 | ц 57 | Й 58 | ’ 59 | Ю 60 | ъ 61 | Ц 62 | щ 63 | Ё 64 | Ы 65 | ' 66 | Щ 67 | « 68 | » 69 | ` 70 | е́ 71 | и́ 72 | а́ 73 | о́ 74 | ћ 75 | у́ 76 | ј 77 | ‘ 78 | -------------------------------------------------------------------------------- /nmt-wizard/data/vocab/helloworld.ruen.tgt.dict: -------------------------------------------------------------------------------- 1 | F 2 | r 3 | e 4 | d 5 | R 6 | o 7 | g 8 | s 9 | H 10 | a 11 | t 12 | M 13 | l 14 | i 15 | J 16 | n 17 | L 18 | u 19 | D 20 | c 21 | f 22 | h 23 | k 24 | b 25 | E 26 | w 27 | T 28 | P 29 | p 30 | S 31 | K 32 | A 33 
| C 34 | B 35 | z 36 | j 37 | m 38 | W 39 | V 40 | q 41 | x 42 | y 43 | v 44 | N 45 | é 46 | G 47 | è 48 | í 49 | O 50 | ü 51 | ō 52 | U 53 | Đ 54 | I 55 | ' 56 | Z 57 | Æ 58 | ł 59 | á 60 | ó 61 | Š 62 | ě 63 | Y 64 | š 65 | ö 66 | Ž 67 | ć 68 | ç 69 | Ł 70 | ð 71 | ő 72 | ļ 73 | Ø 74 | É 75 | X 76 | Q 77 | ê 78 | Á 79 | ø 80 | ë 81 | ä 82 | ș 83 | č 84 | å 85 | ñ 86 | Å 87 | û 88 | ú 89 | Õ 90 | ā 91 | ṇ 92 | î 93 | ş 94 | â 95 | ă 96 | ž 97 | ò 98 | ı 99 | ń 100 | Ó 101 | ý 102 | ū 103 | ß 104 | ř 105 | Ż 106 | ÿ 107 | ’ 108 | Č 109 | ã 110 | Þ 111 | ś 112 | ę 113 | ô 114 | ī 115 | æ 116 | Ö 117 | đ 118 | Ś 119 | ï 120 | ů 121 | œ 122 | Ș 123 | Ş 124 | Ā 125 | ė 126 | ğ 127 | Ç 128 | ŀ 129 | ņ 130 | ą 131 | ē 132 | à 133 | Ō 134 | ´ 135 | õ 136 | Í 137 | ‘ 138 | Ğ 139 | İ 140 | Ď 141 | Ľ 142 | ŭ 143 | ķ 144 | ț 145 | ż 146 | ţ 147 | ć 148 | Ä 149 | Ñ 150 | ť 151 | Ü 152 | ň 153 | Ć 154 | þ 155 | ì 156 | Ē 157 |  158 | ` 159 | ľ 160 | Ķ 161 | ģ 162 | È 163 | Œ 164 | Ț 165 | ṣ 166 | Ċ 167 | À 168 | Ř 169 | ù 170 | ũ 171 | ĩ 172 | Ô 173 | ṭṭ 174 | ű 175 | ′ 176 | ǎ 177 | ư 178 | ọ 179 | Ź 180 | Ú 181 | Ţ 182 | ź 183 | ‎ 184 | Ð 185 | ạ 186 | ẫ 187 | ụ 188 | ễ 189 | ậ 190 | Ģ 191 | Ṭ 192 | ĭ 193 | Ê 194 | Ī 195 | ĕ 196 | Ņ 197 | ắ 198 | ệ 199 | ứ 200 | ŵ 201 | Ḥ 202 | ṅ 203 | · 204 | Ḫ 205 | ‐ 206 | ả 207 | ď 208 | ṛ 209 | -------------------------------------------------------------------------------- /unsupervised-nmt/LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017-present The OpenNMT Authors. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /unsupervised-nmt/README.md: -------------------------------------------------------------------------------- 1 | *Owner: Guillaume Klein (guillaume.klein (at) systrangroup.com)* 2 | 3 | # Unsupervised NMT with TensorFlow and OpenNMT-tf 4 | 5 | We propose you to implement the paper [*Unsupervised Machine Translation Using Monolingual Corpora Only*](https://arxiv.org/abs/1711.00043) (G. Lample et al. 2017) using [TensorFlow](https://www.tensorflow.org/) and [OpenNMT-tf](https://github.com/OpenNMT/OpenNMT-tf). 
While the project might take more than one day to complete, the goal of this session is to: 6 | 7 | * dive into an interesting research paper applying adversarial training to NMT; 8 | * learn about some TensorFlow mechanics; 9 | * discover OpenNMT-tf concepts and APIs. 10 | 11 | The guide is a step-by-step implementation with some functions left unimplemented for the curious reader. The completed code is available in the `ref/` directory. 12 | 13 | ## Table of contents 14 | 15 | * [Requirements](#requirements) 16 | * [Data](#data) 17 | * [Step-by-step tutorial](#step-by-step-tutorial) 18 | * [Training](#training) 19 | * [Step 0: Base file](#step-0-base-file) 20 | * [Step 1: Reading the data](#step-1-reading-the-data) 21 | * [Step 2: Noise model](#step-2-noise-model) 22 | * [Step 3: Creating embeddings](#step-3-creating-embeddings) 23 | * [Step 4: Encoding noisy inputs](#step-4-encoding-noisy-inputs) 24 | * [Step 5: Denoising noisy encoding](#step-5-denoising-noisy-encoding) 25 | * [Step 6: Discriminating encodings](#step-6-discriminating-encodings) 26 | * [Step 7: Optimization and training loop](#step-7-optimization-and-training-loop) 27 | * [Inference](#inference) 28 | * [Step 0: Base file](#step-0-base-file-1) 29 | * [Step 1: Reading data](#step-1-reading-data) 30 | * [Step 2: Rebuilding the model](#step-2-rebuilding-the-model) 31 | * [Step 3: Encoding and decoding](#step-3-encoding-and-decoding) 32 | * [Step 4: Loading and translating](#step-4-loading-and-translating) 33 | * [Complete training flow](#complete-training-flow) 34 | 35 | ## Requirements 36 | 37 | * `git` 38 | * `python` >= 2.7 39 | * `virtualenv` 40 | 41 | ```bash 42 | git clone https://github.com/OpenNMT/Hackathon.git 43 | cd Hackathon/unsupervised-nmt 44 | virtualenv env 45 | source env/bin/activate 46 | pip install -r requirements.txt.cpu 47 | ``` 48 | 49 | ## Data 50 | 51 | The data are available at: 52 | 53 | * [`unsupervised-nmt-enfr.tar.bz2`](https://s3.amazonaws.com/opennmt-trainingdata/unsupervised-nmt-enfr.tar.bz2) (2.2 GB) 54 | * [`unsupervised-nmt-enfr-dev.tar.bz2`](https://s3.amazonaws.com/opennmt-trainingdata/unsupervised-nmt-enfr-dev.tar.bz2) (2.4 MB) 55 | 56 | To get started, we recommend downloading the `dev` version which contains a small training set with 10K sentences. 57 | 58 | Both packages contain the vocabulary files and a first translation of the training data using [an unsupervised word-by-word translation model](https://github.com/jsenellart/papers/tree/master/WordTranslationWithoutParallelData)\* as described in the paper. The full data additionally contains pretrained word embeddings using [fastText](https://github.com/facebookresearch/fastText). 59 | 60 | \* also see the [MUSE](https://github.com/facebookresearch/MUSE) project that was recently released by Facebook Research team. 61 | 62 | ## Step-by-step tutorial 63 | 64 | For this tutorial, the following resources might come handy: 65 | 66 | * [TensorFlow documentation](https://www.tensorflow.org/api_docs/python/) 67 | * [OpenNMT-tf documentation](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.html) 68 | * [Numpy documentation](https://docs.scipy.org/doc/numpy/reference/index.html) 69 | 70 | and of course the research paper linked above. 71 | 72 | ### Training 73 | 74 | The training script `ref/training.py` implements one training iteration as described in the paper. Follow the next section to implement your own or understand the reference file. 75 | 76 | #### Step 0: Base file 77 | 78 | You can use this file to start implementing. 
It includes some common imports and a minimal command line argument parser: 79 | 80 | ```python 81 | from __future__ import print_function 82 | 83 | import argparse 84 | import sys 85 | 86 | import tensorflow as tf 87 | import opennmt as onmt 88 | import numpy as np 89 | 90 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 91 | parser.add_argument("--model_dir", default="model", help="Checkpoint directory.") 92 | args = parser.parse_args() 93 | ``` 94 | 95 | #### Step 1: Reading the data 96 | 97 | Loading text data in TensorFlow is made easy with the [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) and the [`tf.contrib.lookup.index_table_from_file`](https://www.tensorflow.org/api_docs/python/tf/contrib/lookup/index_table_from_file) function. This code is provided so that we can ensure the data format used as input. 98 | 99 | The required data are: 100 | 101 | * the source and target monolingual datasets 102 | * the source and target monolingual datasets translation from model `M(t-1)` 103 | * the source and target vocabularies 104 | 105 | Let's add the following command line options: 106 | 107 | ```python 108 | parser.add_argument("--src", required=True, help="Source file.") 109 | parser.add_argument("--tgt", required=True, help="Target file.") 110 | parser.add_argument("--src_trans", required=True, help="Source translation at the previous iteration.") 111 | parser.add_argument("--tgt_trans", required=True, help="Target translation at the previous iteration.") 112 | parser.add_argument("--src_vocab", required=True, help="Source vocabulary.") 113 | parser.add_argument("--tgt_vocab", required=True, help="Target vocabulary.") 114 | ``` 115 | 116 | and then create the training iterators: 117 | 118 | ```python 119 | from opennmt import constants 120 | from opennmt.utils.misc import count_lines 121 | 122 | def load_vocab(vocab_file): 123 | """Returns a lookup table and the vocabulary size.""" 124 | vocab_size = count_lines(vocab_file) + 1 # Add UNK. 125 | vocab = tf.contrib.lookup.index_table_from_file( 126 | vocab_file, 127 | vocab_size=vocab_size - 1, 128 | num_oov_buckets=1) 129 | return vocab, vocab_size 130 | 131 | def load_data(input_file, 132 | translated_file, 133 | input_vocab, 134 | translated_vocab, 135 | batch_size=32, 136 | max_seq_len=50, 137 | num_buckets=5): 138 | """Returns an iterator over the training data.""" 139 | 140 | def _make_dataset(text_file, vocab): 141 | dataset = tf.data.TextLineDataset(text_file) 142 | dataset = dataset.map(lambda x: tf.string_split([x]).values) # Split on spaces. 143 | dataset = dataset.map(vocab.lookup) # Lookup token in vocabulary. 144 | return dataset 145 | 146 | def _key_func(x): 147 | bucket_width = (max_seq_len + num_buckets - 1) // num_buckets 148 | bucket_id = x["length"] // bucket_width 149 | bucket_id = tf.minimum(bucket_id, num_buckets) 150 | return tf.to_int64(bucket_id) 151 | 152 | def _reduce_func(unused_key, dataset): 153 | return dataset.padded_batch(batch_size, { 154 | "ids": [None], 155 | "ids_in": [None], 156 | "ids_out": [None], 157 | "length": [], 158 | "trans_ids": [None], 159 | "trans_length": []}) 160 | 161 | bos = tf.constant([constants.START_OF_SENTENCE_ID], dtype=tf.int64) 162 | eos = tf.constant([constants.END_OF_SENTENCE_ID], dtype=tf.int64) 163 | 164 | # Make a dataset from the input and translated file. 
165 | input_dataset = _make_dataset(input_file, input_vocab) 166 | translated_dataset = _make_dataset(translated_file, translated_vocab) 167 | dataset = tf.data.Dataset.zip((input_dataset, translated_dataset)) 168 | dataset = dataset.shuffle(200000) 169 | 170 | # Define the input format. 171 | dataset = dataset.map(lambda x, y: { 172 | "ids": x, 173 | "ids_in": tf.concat([bos, x], axis=0), 174 | "ids_out": tf.concat([x, eos], axis=0), 175 | "length": tf.shape(x)[0], 176 | "trans_ids": y, 177 | "trans_length": tf.shape(y)[0]}) 178 | 179 | # Filter out invalid examples. 180 | dataset = dataset.filter(lambda x: tf.greater(x["length"], 0)) 181 | 182 | # Batch the dataset using a bucketing strategy. 183 | dataset = dataset.apply(tf.contrib.data.group_by_window( 184 | _key_func, 185 | _reduce_func, 186 | window_size=batch_size)) 187 | return dataset.make_initializable_iterator() 188 | 189 | src_vocab, src_vocab_size = load_vocab(args.src_vocab) 190 | tgt_vocab, tgt_vocab_size = load_vocab(args.tgt_vocab) 191 | 192 | with tf.device("/cpu:0"): # Input pipeline should always be place on the CPU. 193 | src_iterator = load_data(args.src, args.src_trans, src_vocab, tgt_vocab) 194 | tgt_iterator = load_data(args.tgt, args.tgt_trans, tgt_vocab, src_vocab) 195 | src = src_iterator.get_next() 196 | tgt = tgt_iterator.get_next() 197 | ``` 198 | 199 | Here we use the bucketing strategy to make sure batches contain sequences of similar length which reduces the amount of padding and makes the training more efficient. For large training sets, we could also use the hard constraint of only batching sentences of the same length. 200 | 201 | You can test by printing the first example: 202 | 203 | ```python 204 | with tf.Session() as sess: 205 | sess.run(tf.global_variables_initializer()) 206 | sess.run(tf.tables_initializer()) 207 | sess.run([src_iterator.initializer, tgt_iterator.initializer]) 208 | print(sess.run(src)) 209 | ``` 210 | 211 | *During development, you can reuse this session creation code to print tensor values.* 212 | 213 | #### Step 2: Noise model 214 | 215 | This refers to the `C(x)` function described in the *Section 2.3* of the paper. As this function does not require backpropagation, we suggest to implement it in pure Python to make things easier: 216 | 217 | ```python 218 | def add_noise_python(words, dropout=0.1, k=3): 219 | """Applies the noise model in input words. 220 | 221 | Args: 222 | words: A numpy vector of word ids. 223 | dropout: The probability to drop words. 224 | k: Maximum distance of the permutation. 225 | 226 | Returns: 227 | A noisy numpy vector of word ids. 
228 | """ 229 | # FIXME 230 | raise NotImplementedError() 231 | 232 | def add_noise(ids, sequence_length): 233 | """Wraps add_noise_python for a batch of tensors.""" 234 | 235 | def _add_noise_single(ids, sequence_length): 236 | noisy_ids = add_noise_python(ids[:sequence_length]) 237 | noisy_sequence_length = len(noisy_ids) 238 | ids[:noisy_sequence_length] = noisy_ids 239 | ids[noisy_sequence_length:] = 0 240 | return ids, np.int32(noisy_sequence_length) 241 | 242 | noisy_ids, noisy_sequence_length = tf.map_fn( 243 | lambda x: tf.py_func(_add_noise_single, x, [ids.dtype, tf.int32]), 244 | [ids, sequence_length], 245 | dtype=[ids.dtype, tf.int32], 246 | back_prop=False) 247 | 248 | noisy_ids.set_shape(ids.get_shape()) 249 | noisy_sequence_length.set_shape(sequence_length.get_shape()) 250 | 251 | return noisy_ids, noisy_sequence_length 252 | ``` 253 | 254 | The wrapper uses [`tf.py_func`](https://www.tensorflow.org/api_docs/python/tf/py_func) to include a Python function in the computation graph, and [`tf.map_fn`](https://www.tensorflow.org/api_docs/python/tf/map_fn) to apply the noise model on each sequence in the batch. 255 | 256 | #### Step 3: Creating embeddings 257 | 258 | The paper uses pretrained embeddings to initialize the embeddings of the model. Pretrained emnbeddings are included in the full data package (see above) and can be easily loaded with the [`load_pretrained_embeddings`](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.inputters.text_inputter.html#opennmt.inputters.text_inputter.load_pretrained_embeddings) function from OpenNMT-tf. 259 | 260 | First you should add new command line arguments to accept pretrained word embeddings: 261 | 262 | ```python 263 | parser.add_argument("--src_emb", default=None, help="Source embedding.") 264 | parser.add_argument("--tgt_emb", default=None, help="Target embedding.") 265 | ``` 266 | 267 | Then, here is the code to load or create the embedding [`tf.Variable`](https://www.tensorflow.org/api_docs/python/tf/Variable): 268 | 269 | ```python 270 | from opennmt.inputters.text_inputter import load_pretrained_embeddings 271 | 272 | def create_embeddings(vocab_size, depth=300): 273 | """Creates an embedding variable.""" 274 | return tf.get_variable("embedding", shape=[vocab_size, depth]) 275 | 276 | def load_embeddings(embedding_file, vocab_file): 277 | """Loads an embedding variable or embeddings file.""" 278 | try: 279 | embeddings = tf.get_variable("embedding") 280 | except ValueError: 281 | pretrained = load_pretrained_embeddings( 282 | embedding_file, 283 | vocab_file, 284 | num_oov_buckets=1, 285 | with_header=True, 286 | case_insensitive_embeddings=True) 287 | embeddings = tf.get_variable( 288 | "embedding", 289 | shape=None, 290 | trainable=False, 291 | initializer=tf.constant(pretrained.astype(np.float32))) 292 | return embeddings 293 | 294 | with tf.variable_scope("src"): 295 | if args.src_emb is not None: 296 | src_emb = load_embeddings(args.src_emb, args.src_vocab) 297 | else: 298 | src_emb = create_embeddings(src_vocab_size) 299 | 300 | with tf.variable_scope("tgt"): 301 | if args.tgt_emb is not None: 302 | tgt_emb = load_embeddings(args.tgt_emb, args.tgt_vocab) 303 | else: 304 | tgt_emb = create_embeddings(tgt_vocab_size) 305 | ``` 306 | 307 | #### Step 4: Encoding noisy inputs 308 | 309 | The encoding uses a standard bidirectional LSTM encoder as described in *Section 2.1*. 
Fortunately, OpenNMT-tf exposes [several encoders](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.encoders.html) that can be used with a simple interface. 310 | 311 | First, create a new encoder instance: 312 | 313 | ```python 314 | hidden_size = 512 315 | encoder = onmt.encoders.BidirectionalRNNEncoder(2, hidden_size) 316 | ``` 317 | 318 | Then, you should implement the function `add_noise_and_encode`: 319 | 320 | ```python 321 | def add_noise_and_encode(ids, sequence_length, embedding, reuse=None): 322 | """Applies the noise model on ids, embeds and encodes. 323 | 324 | Args: 325 | ids: The tensor of words ids of shape [batch_size, max_time]. 326 | sequence_length: The tensor of sequence length of shape [batch_size]. 327 | embedding: The embedding variable. 328 | reuse: If True, reuse the encoder variables. 329 | 330 | Returns: 331 | A tuple (encoder output, encoder state, sequence length). 332 | """ 333 | # FIXME 334 | raise NotImplementedError() 335 | ``` 336 | 337 | **Related resources:** 338 | 339 | * [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) 340 | * [`tf.variable_scope`](https://www.tensorflow.org/api_docs/python/tf/variable_scope) 341 | * [`onmt.encoders.Encoder.encode`](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.encoders.encoder.html#opennmt.encoders.encoder.Encoder.encode) 342 | 343 | At this point, you have everything you need to implement the encoding part shown in *Figure 2*: 344 | 345 |

346 | ![Encoding part of the architecture (Figure 2)](img/encoding.png) 347 |
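If you want to check your solution, here is a minimal `add_noise_and_encode` that matches the reference implementation in `ref/training.py`: apply the noise model, embed the noisy ids, and encode them inside a shared `encoder` variable scope.

```python
def add_noise_and_encode(ids, sequence_length, embedding, reuse=None):
  # Corrupt the inputs with the noise model C(x), then embed them.
  noisy_ids, noisy_sequence_length = add_noise(ids, sequence_length)
  noisy = tf.nn.embedding_lookup(embedding, noisy_ids)
  # The same encoder weights are shared by all four calls below, hence the reuse flag.
  with tf.variable_scope("encoder", reuse=reuse):
    return encoder.encode(noisy, sequence_length=noisy_sequence_length)
```

The four calls below then cover the auto-encoding paths (noisy source and target sentences) and the cross-domain paths (noisy translations produced by the previous model):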
348 | 349 | ```python 350 | src_encoder_auto = add_noise_and_encode( 351 | src["ids"], src["length"], src_emb, reuse=None) 352 | tgt_encoder_auto = add_noise_and_encode( 353 | tgt["ids"], tgt["length"], tgt_emb, reuse=True) 354 | 355 | src_encoder_cross = add_noise_and_encode( 356 | tgt["trans_ids"], tgt["trans_length"], src_emb, reuse=True) 357 | tgt_encoder_cross = add_noise_and_encode( 358 | src["trans_ids"], src["trans_length"], tgt_emb, reuse=True) 359 | ``` 360 | 361 | #### Step 5: Denoising noisy encoding 362 | 363 | This step completes *Section 2.3* and *2.4* of the paper by denoising noisy inputs. It uses a OpenNMT-tf attentional decoder that starts from the encoder final state: 364 | 365 | ```python 366 | decoder = onmt.decoders.AttentionalRNNDecoder( 367 | 2, hidden_size, bridge=onmt.layers.CopyBridge()) 368 | ``` 369 | 370 | You can then implement the `denoise` function: 371 | 372 | ```python 373 | from opennmt.utils.losses import cross_entropy_sequence_loss 374 | 375 | def denoise(x, embedding, encoder_outputs, generator, reuse=None): 376 | """Denoises from the noisy encoding. 377 | 378 | Args: 379 | x: The input data from the dataset. 380 | embedding: The embedding variable. 381 | encoder_outputs: A tuple with the encoder outputs. 382 | generator: A tf.layers.Dense instance for projecting the logits. 383 | reuse: If True, reuse the decoder variables. 384 | 385 | Returns: 386 | The decoder loss. 387 | """ 388 | raise NotImplementedError() 389 | ``` 390 | 391 | **Related resources:** 392 | 393 | * [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) 394 | * [`tf.variable_scope`](https://www.tensorflow.org/api_docs/python/tf/variable_scope) 395 | * [`cross_entropy_sequence_loss`](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.utils.losses.html#opennmt.utils.losses.cross_entropy_sequence_loss) 396 | * [`onmt.decoders.Decoder.decode`](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.decoders.decoder.html#opennmt.decoders.decoder.Decoder.decode) 397 | 398 | and build the `generator` for source and target: 399 | 400 | ```python 401 | with tf.variable_scope("src"): 402 | src_gen = tf.layers.Dense(src_vocab_size) 403 | src_gen.build([None, hidden_size]) 404 | 405 | with tf.variable_scope("tgt"): 406 | tgt_gen = tf.layers.Dense(tgt_vocab_size) 407 | tgt_gen.build([None, hidden_size]) 408 | ``` 409 | 410 | Now, you can implement the decoding part of the architecture presented in *Figure 2*: 411 | 412 |

413 | ![Decoding part of the architecture (Figure 2)](img/decoding.png) 414 |
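As a reference point, the `denoise` function in `ref/training.py` decodes from the BOS-prefixed `ids_in` sequence, attends over the noisy encoding, and returns the cross-entropy loss against `ids_out`, normalized by the number of target tokens:

```python
def denoise(x, embedding, encoder_outputs, generator, reuse=None):
  # Decode the original (clean) sequence from the noisy encoding.
  with tf.variable_scope("decoder", reuse=reuse):
    logits, _, _ = decoder.decode(
        tf.nn.embedding_lookup(embedding, x["ids_in"]),
        x["length"] + 1,  # +1 accounts for the <s> token prepended in ids_in.
        initial_state=encoder_outputs[1],
        output_layer=generator,
        memory=encoder_outputs[0],
        memory_sequence_length=encoder_outputs[2])
    # Average the cumulated cross entropy over the target tokens.
    cumulated_loss, _, normalizer = cross_entropy_sequence_loss(
        logits, x["ids_out"], x["length"] + 1)
    return cumulated_loss / normalizer
```

The calls below then give the auto-encoding losses (`l_auto_*`) and the cross-domain losses (`l_cd_*`):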
415 | 416 | ```python 417 | l_auto_src = denoise(src, src_emb, src_encoder_auto, src_gen, reuse=None) 418 | l_auto_tgt = denoise(tgt, tgt_emb, tgt_encoder_auto, tgt_gen, reuse=True) 419 | 420 | l_cd_src = denoise(src, src_emb, tgt_encoder_cross, src_gen, reuse=True) 421 | l_cd_tgt = denoise(tgt, tgt_emb, src_encoder_cross, tgt_gen, reuse=True) 422 | ``` 423 | 424 | #### Step 6: Discriminating encodings 425 | 426 | This represents the adversarial part of the model as described in Section *2.5*. The architecture of the discriminator in described in *Section 4.4*. 427 | 428 | Here, you are asked to implement the binary cross entropy and the discriminator. 429 | 430 | ```python 431 | def binary_cross_entropy(x, y, smoothing=0, epsilon=1e-12): 432 | """Computes the averaged binary cross entropy. 433 | 434 | bce = y*log(x) + (1-y)*log(1-x) 435 | 436 | Args: 437 | x: The predicted labels. 438 | y: The true labels. 439 | smoothing: The label smoothing coefficient. 440 | 441 | Returns: 442 | The cross entropy. 443 | """ 444 | # FIXME 445 | raise NotImplementedError() 446 | 447 | def discriminator(encodings, 448 | sequence_lengths, 449 | lang_ids, 450 | num_layers=3, 451 | hidden_size=1024, 452 | dropout=0.3): 453 | """Discriminates the encoder outputs against lang_ids. 454 | 455 | Args: 456 | encodings: The encoder outputs of shape [4*batch_size, max_time, hidden_size]. 457 | sequence_lengths: The length of each sequence of shape [4*batch_size]. 458 | lang_ids: The true lang id of each sequence of shape [4*batch_size]. 459 | num_layers: The number of layers of the discriminator. 460 | hidden_size: The hidden size of the discriminator. 461 | dropout: The dropout to apply on each discriminator layer output. 462 | 463 | Returns: 464 | A tuple with: the discriminator loss (L_d) and the adversarial loss (L_adv). 465 | """ 466 | # FIXME 467 | raise NotImplementedError() 468 | ``` 469 | 470 | **Related resources:** 471 | 472 | * [`tf.layers.dense`](https://www.tensorflow.org/api_docs/python/tf/layers/dense) 473 | * [`tf.nn.dropout`](https://www.tensorflow.org/api_docs/python/tf/nn/dropout) 474 | * [`tf.sequence_mask`](https://www.tensorflow.org/api_docs/python/tf/sequence_mask) 475 | * [`tf.reduce_mean`](https://www.tensorflow.org/api_docs/python/tf/reduce_mean) 476 | 477 | To run the discriminator a single time, let's concatenate all encoder outputs (cf. *Figure 2*) and prepare the language identifiers accordingly. 478 | 479 |

480 | ![Adversarial training on the encoder outputs (Figure 2)](img/adversarial.png) 481 |
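Before wiring everything together, here is the `binary_cross_entropy` used by the reference implementation in `ref/training.py`; the discriminator itself (see `ref/training.py` for the full version) stacks `num_layers` dense layers with dropout and turns the masked per-timestep scores into a per-sequence probability.

```python
def binary_cross_entropy(x, y, smoothing=0, epsilon=1e-12):
  y = tf.to_float(y)
  if smoothing > 0:
    # Label smoothing: move the 0/1 targets slightly towards 0.5.
    smoothing *= 2
    y = y * (1 - smoothing) + 0.5 * smoothing
  # epsilon guards against log(0) when the discriminator saturates.
  return -tf.reduce_mean(tf.log(x + epsilon) * y + tf.log(1.0 - x + epsilon) * (1 - y))
```

The snippet below then concatenates the four encoder outputs, pads them to the same time dimension, and runs the discriminator once on the whole batch: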
482 | 483 | ```python 484 | from opennmt.layers.reducer import pad_in_time 485 | 486 | batch_size = tf.shape(src["length"])[0] 487 | all_encoder_outputs = [ 488 | src_encoder_auto, src_encoder_cross, 489 | tgt_encoder_auto, tgt_encoder_cross] 490 | lang_ids = tf.concat([ 491 | tf.fill([batch_size * 2], 0), 492 | tf.fill([batch_size * 2], 1)], 0) 493 | 494 | max_time = tf.reduce_max([tf.shape(output[0])[1] for output in all_encoder_outputs]) 495 | 496 | encodings = tf.concat([ 497 | pad_in_time(output[0], max_time - tf.shape(output[0])[1]) 498 | for output in all_encoder_outputs], 0) 499 | sequence_lengths = tf.concat([output[2] for output in all_encoder_outputs], 0) 500 | 501 | with tf.variable_scope("discriminator"): 502 | l_d, l_adv = discriminator(encodings, sequence_lengths, lang_ids) 503 | ``` 504 | 505 | #### Step 7: Optimization and training loop 506 | 507 | Finally, you can compute the final objective function as described at the end of *Section 2*: 508 | 509 | ```python 510 | lambda_auto = 1 511 | lambda_cd = 1 512 | lambda_adv = 1 513 | 514 | l_auto = l_auto_src + l_auto_tgt 515 | l_cd = l_cd_src + l_cd_tgt 516 | 517 | l_final = (lambda_auto * l_auto + lambda_cd * l_cd + lambda_adv * l_adv) 518 | ``` 519 | 520 | As described in *Section 4.4*, the training alternates "between one encoder-decoder and one discriminator update" and uses 2 different optimizers. You should implement this behavior in the `train_op` function: 521 | 522 | ```python 523 | def build_train_op(global_step, encdec_variables, discri_variables): 524 | """Returns the training Op. 525 | 526 | When global_step % 2 == 0, it minimizes l_final and updates encdec_variables. 527 | Otherwise, it minimizes l_d and updates discri_variables. 528 | 529 | Args: 530 | global_step: The training step. 531 | encdec_variables: The list of variables of the encoder/decoder model. 532 | discri_variables: The list of variables of the discriminator. 533 | 534 | Returns: 535 | The training op. 
536 | """ 537 | # FIXME 538 | raise NotImplementedError() 539 | ``` 540 | 541 | **Related resources:** 542 | 543 | * [`tf.train.AdamOptimizer`](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) 544 | * [`tf.train.RMSPropOptimizer`](https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer) 545 | * [`tf.cond`](https://www.tensorflow.org/api_docs/python/tf/cond) 546 | 547 | And now, we can conclude the training script with the training loop: 548 | 549 | ```python 550 | 551 | encdec_variables = [] 552 | discri_variables = [] 553 | for variable in tf.trainable_variables(): 554 | if variable.name.startswith("discriminator"): 555 | discri_variables.append(variable) 556 | else: 557 | encdec_variables.append(variable) 558 | 559 | global_step = tf.train.get_or_create_global_step() 560 | train_op = build_train_op(global_step, encdec_variables, discri_variables) 561 | 562 | i = 0 563 | with tf.train.MonitoredTrainingSession(checkpoint_dir=args.model_dir) as sess: 564 | sess.run([src_iterator.initializer, tgt_iterator.initializer]) 565 | while not sess.should_stop(): 566 | if i % 2 == 0: 567 | _, step, _l_auto, _l_cd, _l_adv, _l = sess.run( 568 | [train_op, global_step, l_auto, l_cd, l_adv, l_final]) 569 | print("{} - l_auto = {}; l_cd = {}, l_adv = {}; l = {}".format( 570 | step, _l_auto, _l_cd, _l_adv, _l)) 571 | else: 572 | _, step, _l_d = sess.run([train_op, global_step, l_d]) 573 | print("{} - l_d = {}".format(step, _l_d)) 574 | i += 1 575 | sys.stdout.flush() 576 | ``` 577 | 578 | ### Inference 579 | 580 | Inference is not only required for testing the model performance but is also used as part of the training: after one training iteration, the complete monolingual corpus must be translated and used as input to the next training iteration. 581 | 582 | This part is simpler and only requires to build the encoder-decoder model with the same dimensions and variable scoping. 583 | 584 | #### Step 0: Base file 585 | 586 | You can start with this header: 587 | 588 | ```python 589 | import argparse 590 | 591 | import tensorflow as tf 592 | import opennmt as onmt 593 | 594 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 595 | parser.add_argument("--model_dir", default="model", help="Checkpoint directory.") 596 | 597 | args = parser.parse_args() 598 | ``` 599 | 600 | #### Step 1: Reading data 601 | 602 | Let's define the script interface by defining additional command line arguments: 603 | 604 | ```python 605 | parser.add_argument("--src", required=True, help="Source file.") 606 | parser.add_argument("--tgt", required=True, help="Target file.") 607 | parser.add_argument("--src_vocab", required=True, help="Source vocabulary.") 608 | parser.add_argument("--tgt_vocab", required=True, help="Target vocabulary.") 609 | parser.add_argument("--direction", required=True, type=int, 610 | help="1 = translation source, 2 = translate target.") 611 | ``` 612 | 613 | Here, we choose to set both the source and target file and add a `direction` flag to select from which file to translate. 614 | 615 | Based on the input pipeline implemented in the training phase, this time we propose to build the dataset iterator. This should be a textbook usage of the [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) API. 616 | 617 | ```python 618 | def load_data(input_file, input_vocab): 619 | """Returns an iterator over the input file. 620 | 621 | Args: 622 | input_file: The input text file. 623 | input_vocab: The input vocabulary. 
624 | 625 | Returns: 626 | A dataset iterator. 627 | """ 628 | # FIXME 629 | raise NotImplementedError() 630 | ``` 631 | 632 | Batching should be simpler than during the training, see [`tf.data.Dataset.padded_batch`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch). 633 | 634 | Then, the iterator can be used: 635 | 636 | ```python 637 | from opennmt.utils.misc import count_lines 638 | 639 | if args.direction == 1: 640 | src_file, tgt_file = args.src, args.tgt 641 | src_vocab_file, tgt_vocab_file = args.src_vocab, args.tgt_vocab 642 | else: 643 | src_file, tgt_file = args.tgt, args.src 644 | src_vocab_file, tgt_vocab_file = args.tgt_vocab, args.src_vocab 645 | 646 | tgt_vocab_size = count_lines(tgt_vocab_file) + 1 647 | src_vocab_size = count_lines(src_vocab_file) + 1 648 | src_vocab = tf.contrib.lookup.index_table_from_file( 649 | src_vocab_file, 650 | vocab_size=src_vocab_size - 1, 651 | num_oov_buckets=1) 652 | 653 | with tf.device("cpu:0"): 654 | src_iterator = load_data(src_file, src_vocab) 655 | 656 | src = src_iterator.get_next() 657 | ``` 658 | 659 | #### Step 2: Rebuilding the model 660 | 661 | In this step, we need to define the same model that was used during the training, including variable scoping: 662 | 663 | ```python 664 | hidden_size = 512 665 | encoder = onmt.encoders.BidirectionalRNNEncoder(2, hidden_size) 666 | decoder = onmt.decoders.AttentionalRNNDecoder( 667 | 2, hidden_size, bridge=onmt.layers.CopyBridge()) 668 | 669 | with tf.variable_scope("src" if args.direction == 1 else "tgt"): 670 | src_emb = tf.get_variable("embedding", shape=[src_vocab_size, 300]) 671 | src_gen = tf.layers.Dense(src_vocab_size) 672 | src_gen.build([None, hidden_size]) 673 | 674 | with tf.variable_scope("tgt" if args.direction == 1 else "src"): 675 | tgt_emb = tf.get_variable("embedding", shape=[tgt_vocab_size, 300]) 676 | tgt_gen = tf.layers.Dense(tgt_vocab_size) 677 | tgt_gen.build([None, hidden_size]) 678 | ``` 679 | 680 | **Note:** Larger TensorFlow project usually do not handle inference this way. For example OpenNMT-tf shares training, inference, and evaluation code but reads from a [`mode`](https://www.tensorflow.org/api_docs/python/tf/estimator/ModeKeys) variable to implement behavior specific to each phase. The `mode` argument will be **required for encoding and decoding** to disable dropout in the next step. 681 | 682 | #### Step 3: Encoding and decoding 683 | 684 | Encoding and decoding is basically a method call on the encoder and decoder object (including beam search!). Make sure to use the same variable scope that you used during the training phase. 685 | 686 | ```python 687 | from opennmt import constants 688 | 689 | def encode(): 690 | """Encodes src. 691 | 692 | Returns: 693 | A tuple (encoder output, encoder state, sequence length). 694 | """ 695 | # FIXME 696 | raise NotImplementedError() 697 | 698 | def decode(encoder_output): 699 | """Dynamically decodes from the encoder output. 700 | 701 | Args: 702 | encoder_output: The output of encode(). 703 | 704 | Returns: 705 | A tuple with: the decoded word ids and the length of each decoded sequence. 
706 | """ 707 | # FIXME 708 | raise NotImplementedError() 709 | ``` 710 | 711 | **Related resources:** 712 | 713 | * [`onmt.encoders.Encoder.encode`](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.encoders.encoder.html#opennmt.encoders.encoder.Encoder.encode) 714 | * [`onmt.decoders.Decoder.dynamic_decode_and_search`](http://opennmt.net/OpenNMT-tf/v1.1.0/package/opennmt.decoders.decoder.html#opennmt.decoders.decoder.Decoder.dynamic_decode_and_search) 715 | 716 | These functions can then be called like this to build the actual translation: 717 | 718 | ```python 719 | encoder_output = encode() 720 | sampled_ids, sampled_length = decode(encoder_output) 721 | 722 | tgt_vocab_rev = tf.contrib.lookup.index_to_string_table_from_file( 723 | tgt_vocab_file, 724 | vocab_size=tgt_vocab_size - 1, 725 | default_value=constants.UNKNOWN_TOKEN) 726 | 727 | tokens = tgt_vocab_rev.lookup(tf.cast(sampled_ids, tf.int64)) 728 | length = sampled_length 729 | ``` 730 | 731 | #### Step 4: Loading and translating 732 | 733 | Finally, the inference script can be concluded with the code that restores variables and run the translation: 734 | 735 | ```python 736 | from opennmt.utils.misc import print_bytes 737 | 738 | saver = tf.train.Saver() 739 | checkpoint_path = tf.train.latest_checkpoint(args.model_dir) 740 | 741 | def session_init_op(_scaffold, sess): 742 | saver.restore(sess, checkpoint_path) 743 | tf.logging.info("Restored model from %s", checkpoint_path) 744 | 745 | scaffold = tf.train.Scaffold(init_fn=session_init_op) 746 | session_creator = tf.train.ChiefSessionCreator(scaffold=scaffold) 747 | 748 | with tf.train.MonitoredSession(session_creator=session_creator) as sess: 749 | sess.run(src_iterator.initializer) 750 | while not sess.should_stop(): 751 | _tokens, _length = sess.run([tokens, length]) 752 | for b in range(_tokens.shape[0]): 753 | pred_toks = _tokens[b][0][:_length[b][0] - 1] 754 | pred_sent = b" ".join(pred_toks) 755 | print_bytes(pred_sent) 756 | ``` 757 | 758 | ### Complete training flow 759 | 760 | Using the training and inference scripts, you can now write the complete training algorithm described in *Section 3.1*. 761 | 762 | See for example the shell script `ref/train.sh` that can be run on the full data package: 763 | 764 | ```bash 765 | # Download data. 766 | mkdir data && cd data 767 | wget https://s3.amazonaws.com/opennmt-trainingdata/unsupervised-nmt-enfr.tar.bz2 768 | tar xf unsupervised-nmt-enfr.tar.bz2 769 | cd .. 770 | 771 | # Download multi-bleu.perl. 772 | wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl 773 | 774 | # Train algorithm. 775 | ./ref/train.sh 776 | ``` 777 | 778 | Here are some results (reporting BLEU score on tokenized *newstest2014* translation): 779 | 780 | | Iteration | ENFR | FREN | 781 | | --- | --- | --- | 782 | | M1 | 6.20 | 9.12 | 783 | | M2 | 13.02 | 10.73 | 784 | | M3 | 13.81 | 14.25 | 785 | 786 | where M1 is the unsupervised word-by-word translation model. 787 | 788 | --- 789 | 790 | *Congratulations for completing the tutorial! 
Wether you implemented the functions on your own or went through the provided implementation, we hope that you learned new things on TensorFlow, OpenNMT-tf and adversarial training applied to unsupervised MT.* 791 | -------------------------------------------------------------------------------- /unsupervised-nmt/img/adversarial.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OpenNMT/Hackathon/e975f378793e6a162be9de2c1d447d48e677989c/unsupervised-nmt/img/adversarial.png -------------------------------------------------------------------------------- /unsupervised-nmt/img/decoding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OpenNMT/Hackathon/e975f378793e6a162be9de2c1d447d48e677989c/unsupervised-nmt/img/decoding.png -------------------------------------------------------------------------------- /unsupervised-nmt/img/encoding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OpenNMT/Hackathon/e975f378793e6a162be9de2c1d447d48e677989c/unsupervised-nmt/img/encoding.png -------------------------------------------------------------------------------- /unsupervised-nmt/paper.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/OpenNMT/Hackathon/e975f378793e6a162be9de2c1d447d48e677989c/unsupervised-nmt/paper.pdf -------------------------------------------------------------------------------- /unsupervised-nmt/ref/inference.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | import tensorflow as tf 4 | import opennmt as onmt 5 | 6 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 7 | parser.add_argument("--model_dir", default="model", 8 | help="Checkpoint directory.") 9 | 10 | # Step 1 11 | parser.add_argument("--src", required=True, help="Source file.") 12 | parser.add_argument("--tgt", required=True, help="Target file.") 13 | parser.add_argument("--src_vocab", required=True, help="Source vocabulary.") 14 | parser.add_argument("--tgt_vocab", required=True, help="Target vocabulary.") 15 | parser.add_argument("--direction", required=True, type=int, 16 | help="1 = translation source, 2 = translate target.") 17 | 18 | args = parser.parse_args() 19 | 20 | 21 | # Step 1 22 | 23 | def load_data(input_file, input_vocab): 24 | """Returns an iterator over the input file. 25 | 26 | Args: 27 | input_file: The input text file. 28 | input_vocab: The input vocabulary. 29 | 30 | Returns: 31 | A dataset batch iterator. 
32 | """ 33 | dataset = tf.data.TextLineDataset(input_file) 34 | dataset = dataset.map(lambda x: tf.string_split([x]).values) 35 | dataset = dataset.map(input_vocab.lookup) 36 | dataset = dataset.map(lambda x: { 37 | "ids": x, 38 | "length": tf.shape(x)[0]}) 39 | dataset = dataset.padded_batch(64, { 40 | "ids": [None], 41 | "length": []}) 42 | return dataset.make_initializable_iterator() 43 | 44 | if args.direction == 1: 45 | src_file, tgt_file = args.src, args.tgt 46 | src_vocab_file, tgt_vocab_file = args.src_vocab, args.tgt_vocab 47 | else: 48 | src_file, tgt_file = args.tgt, args.src 49 | src_vocab_file, tgt_vocab_file = args.tgt_vocab, args.src_vocab 50 | 51 | from opennmt.utils.misc import count_lines 52 | 53 | tgt_vocab_size = count_lines(tgt_vocab_file) + 1 54 | src_vocab_size = count_lines(src_vocab_file) + 1 55 | src_vocab = tf.contrib.lookup.index_table_from_file( 56 | src_vocab_file, 57 | vocab_size=src_vocab_size - 1, 58 | num_oov_buckets=1) 59 | 60 | with tf.device("cpu:0"): 61 | src_iterator = load_data(src_file, src_vocab) 62 | 63 | src = src_iterator.get_next() 64 | 65 | 66 | # Step 2 67 | 68 | 69 | hidden_size = 512 70 | encoder = onmt.encoders.BidirectionalRNNEncoder(2, hidden_size) 71 | decoder = onmt.decoders.AttentionalRNNDecoder( 72 | 2, hidden_size, bridge=onmt.layers.CopyBridge()) 73 | 74 | with tf.variable_scope("src" if args.direction == 1 else "tgt"): 75 | src_emb = tf.get_variable("embedding", shape=[src_vocab_size, 300]) 76 | src_gen = tf.layers.Dense(src_vocab_size) 77 | src_gen.build([None, hidden_size]) 78 | 79 | with tf.variable_scope("tgt" if args.direction == 1 else "src"): 80 | tgt_emb = tf.get_variable("embedding", shape=[tgt_vocab_size, 300]) 81 | tgt_gen = tf.layers.Dense(tgt_vocab_size) 82 | tgt_gen.build([None, hidden_size]) 83 | 84 | 85 | # Step 3 86 | 87 | 88 | from opennmt import constants 89 | 90 | def encode(): 91 | """Encodes src. 92 | 93 | Returns: 94 | A tuple (encoder output, encoder state, sequence length). 95 | """ 96 | with tf.variable_scope("encoder"): 97 | return encoder.encode( 98 | tf.nn.embedding_lookup(src_emb, src["ids"]), 99 | sequence_length=src["length"], 100 | mode=tf.estimator.ModeKeys.PREDICT) 101 | 102 | def decode(encoder_output): 103 | """Dynamically decodes from the encoder output. 104 | 105 | Args: 106 | encoder_output: The output of encode(). 107 | 108 | Returns: 109 | A tuple with: the decoded word ids and the length of each decoded sequence. 
110 | """ 111 | batch_size = tf.shape(src["length"])[0] 112 | start_tokens = tf.fill([batch_size], constants.START_OF_SENTENCE_ID) 113 | end_token = constants.END_OF_SENTENCE_ID 114 | 115 | with tf.variable_scope("decoder"): 116 | sampled_ids, _, sampled_length, _ = decoder.dynamic_decode_and_search( 117 | tgt_emb, 118 | start_tokens, 119 | end_token, 120 | vocab_size=tgt_vocab_size, 121 | initial_state=encoder_output[1], 122 | beam_width=5, 123 | maximum_iterations=200, 124 | output_layer=tgt_gen, 125 | mode=tf.estimator.ModeKeys.PREDICT, 126 | memory=encoder_output[0], 127 | memory_sequence_length=encoder_output[2]) 128 | return sampled_ids, sampled_length 129 | 130 | encoder_output = encode() 131 | sampled_ids, sampled_length = decode(encoder_output) 132 | 133 | tgt_vocab_rev = tf.contrib.lookup.index_to_string_table_from_file( 134 | tgt_vocab_file, 135 | vocab_size=tgt_vocab_size - 1, 136 | default_value=constants.UNKNOWN_TOKEN) 137 | 138 | tokens = tgt_vocab_rev.lookup(tf.cast(sampled_ids, tf.int64)) 139 | length = sampled_length 140 | 141 | 142 | # Step 4 143 | 144 | 145 | from opennmt.utils.misc import print_bytes 146 | 147 | saver = tf.train.Saver() 148 | checkpoint_path = tf.train.latest_checkpoint(args.model_dir) 149 | 150 | def session_init_op(_scaffold, sess): 151 | saver.restore(sess, checkpoint_path) 152 | tf.logging.info("Restored model from %s", checkpoint_path) 153 | 154 | scaffold = tf.train.Scaffold(init_fn=session_init_op) 155 | session_creator = tf.train.ChiefSessionCreator(scaffold=scaffold) 156 | 157 | with tf.train.MonitoredSession(session_creator=session_creator) as sess: 158 | sess.run(src_iterator.initializer) 159 | while not sess.should_stop(): 160 | _tokens, _length = sess.run([tokens, length]) 161 | for b in range(_tokens.shape[0]): 162 | pred_toks = _tokens[b][0][:_length[b][0] - 1] 163 | pred_sent = b" ".join(pred_toks) 164 | print_bytes(pred_sent) 165 | -------------------------------------------------------------------------------- /unsupervised-nmt/ref/train.sh: -------------------------------------------------------------------------------- 1 | #! /bin/sh 2 | 3 | model_dir=unsupervised-nmt-enfr 4 | data_dir=data/unsupervised-nmt-enfr 5 | 6 | src_vocab=${data_dir}/en-vocab.txt 7 | tgt_vocab=${data_dir}/fr-vocab.txt 8 | src_emb=${data_dir}/wmt14m.en300.vec 9 | tgt_emb=${data_dir}/wmt14m.fr300.vec 10 | 11 | src=${data_dir}/train.en 12 | tgt=${data_dir}/train.fr 13 | src_trans=${data_dir}/train.en.m1 14 | tgt_trans=${data_dir}/train.fr.m1 15 | 16 | src_test=${data_dir}/newstest2014.en.tok 17 | tgt_test=${data_dir}/newstest2014.fr.tok 18 | src_test_trans=${data_dir}/newstest2014.en.tok.m1 19 | tgt_test_trans=${data_dir}/newstest2014.fr.tok.m1 20 | 21 | timestamp=$(date +%s) 22 | score_file=scores-${timestamp}.txt 23 | 24 | > ${score_file} 25 | 26 | score_test() 27 | { 28 | echo ${src_test_trans} >> ${score_file} 29 | perl multi-bleu.perl ${tgt_test} < ${src_test_trans} >> ${score_file} 30 | echo ${tgt_test_trans} >> ${score_file} 31 | perl multi-bleu.perl ${src_test} < ${tgt_test_trans} >> ${score_file} 32 | } 33 | 34 | score_test 35 | 36 | for i in $(seq 2 5); do 37 | # Train for one epoch. 38 | python ref/training.py \ 39 | --model_dir ${model_dir} \ 40 | --src ${src} \ 41 | --tgt ${tgt} \ 42 | --src_trans ${src_trans} \ 43 | --tgt_trans ${tgt_trans} \ 44 | --src_vocab ${src_vocab} \ 45 | --tgt_vocab ${tgt_vocab} \ 46 | --src_emb ${src_emb} \ 47 | --tgt_emb ${tgt_emb} 48 | 49 | # Evaluate on test files. 
50 | src_test_trans=${src_test}.m${i} 51 | tgt_test_trans=${tgt_test}.m${i} 52 | 53 | python ref/inference.py \ 54 | --model_dir ${model_dir} \ 55 | --src ${src_test} \ 56 | --tgt ${tgt_test} \ 57 | --src_vocab ${src_vocab} \ 58 | --tgt_vocab ${tgt_vocab} \ 59 | --direction 1 \ 60 | > ${src_test_trans} 61 | python ref/inference.py \ 62 | --model_dir ${model_dir} \ 63 | --src ${src_test} \ 64 | --tgt ${tgt_test} \ 65 | --src_vocab ${src_vocab} \ 66 | --tgt_vocab ${tgt_vocab} \ 67 | --direction 2 \ 68 | > ${tgt_test_trans} 69 | 70 | score_test 71 | 72 | # Translate training data. 73 | src_trans=${src}.m${i} 74 | tgt_trans=${tgt}.m${i} 75 | 76 | python ref/inference.py \ 77 | --model_dir ${model_dir} \ 78 | --src ${src} \ 79 | --tgt ${tgt} \ 80 | --src_vocab ${src_vocab} \ 81 | --tgt_vocab ${tgt_vocab} \ 82 | --direction 1 \ 83 | > ${src_trans} 84 | python ref/inference.py \ 85 | --model_dir ${model_dir} \ 86 | --src ${src} \ 87 | --tgt ${tgt} \ 88 | --src_vocab ${src_vocab} \ 89 | --tgt_vocab ${tgt_vocab} \ 90 | --direction 2 \ 91 | > ${tgt_trans} 92 | done 93 | -------------------------------------------------------------------------------- /unsupervised-nmt/ref/training.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import argparse 4 | import sys 5 | 6 | import tensorflow as tf 7 | import opennmt as onmt 8 | import numpy as np 9 | 10 | parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter) 11 | parser.add_argument("--model_dir", default="model", 12 | help="Checkpoint directory.") 13 | 14 | ## Step 1 15 | parser.add_argument("--src", required=True, help="Source file.") 16 | parser.add_argument("--tgt", required=True, help="Target file.") 17 | parser.add_argument("--src_trans", required=True, help="Source translation at the previous iteration.") 18 | parser.add_argument("--tgt_trans", required=True, help="Target translation at the previous iteration.") 19 | parser.add_argument("--src_vocab", required=True, help="Source vocabulary.") 20 | parser.add_argument("--tgt_vocab", required=True, help="Target vocabulary.") 21 | 22 | ## Step 3 23 | parser.add_argument("--src_emb", default=None, help="Source embedding.") 24 | parser.add_argument("--tgt_emb", default=None, help="Target embedding.") 25 | 26 | args = parser.parse_args() 27 | 28 | 29 | # Step 1 30 | 31 | 32 | from opennmt import constants 33 | from opennmt.utils.misc import count_lines 34 | 35 | def load_vocab(vocab_file): 36 | """Returns a lookup table and the vocabulary size.""" 37 | vocab_size = count_lines(vocab_file) + 1 # Add UNK. 38 | vocab = tf.contrib.lookup.index_table_from_file( 39 | vocab_file, 40 | vocab_size=vocab_size - 1, 41 | num_oov_buckets=1) 42 | return vocab, vocab_size 43 | 44 | def load_data(input_file, 45 | translated_file, 46 | input_vocab, 47 | translated_vocab, 48 | batch_size=32, 49 | max_seq_len=50, 50 | num_buckets=5): 51 | """Returns an iterator over the training data.""" 52 | 53 | def _make_dataset(text_file, vocab): 54 | dataset = tf.data.TextLineDataset(text_file) 55 | dataset = dataset.map(lambda x: tf.string_split([x]).values) # Split on spaces. 56 | dataset = dataset.map(vocab.lookup) # Lookup token in vocabulary. 
57 | return dataset 58 | 59 | def _key_func(x): 60 | bucket_width = (max_seq_len + num_buckets - 1) // num_buckets 61 | bucket_id = x["length"] // bucket_width 62 | bucket_id = tf.minimum(bucket_id, num_buckets) 63 | return tf.to_int64(bucket_id) 64 | 65 | def _reduce_func(unused_key, dataset): 66 | return dataset.padded_batch(batch_size, { 67 | "ids": [None], 68 | "ids_in": [None], 69 | "ids_out": [None], 70 | "length": [], 71 | "trans_ids": [None], 72 | "trans_length": []}) 73 | 74 | bos = tf.constant([constants.START_OF_SENTENCE_ID], dtype=tf.int64) 75 | eos = tf.constant([constants.END_OF_SENTENCE_ID], dtype=tf.int64) 76 | 77 | # Make a dataset from the input and translated file. 78 | input_dataset = _make_dataset(input_file, input_vocab) 79 | translated_dataset = _make_dataset(translated_file, translated_vocab) 80 | dataset = tf.data.Dataset.zip((input_dataset, translated_dataset)) 81 | dataset = dataset.shuffle(200000) 82 | 83 | # Define the input format. 84 | dataset = dataset.map(lambda x, y: { 85 | "ids": x, 86 | "ids_in": tf.concat([bos, x], axis=0), 87 | "ids_out": tf.concat([x, eos], axis=0), 88 | "length": tf.shape(x)[0], 89 | "trans_ids": y, 90 | "trans_length": tf.shape(y)[0]}) 91 | 92 | # Filter out invalid examples. 93 | dataset = dataset.filter(lambda x: tf.greater(x["length"], 0)) 94 | 95 | # Batch the dataset using a bucketing strategy. 96 | dataset = dataset.apply(tf.contrib.data.group_by_window( 97 | _key_func, 98 | _reduce_func, 99 | window_size=batch_size)) 100 | return dataset.make_initializable_iterator() 101 | 102 | src_vocab, src_vocab_size = load_vocab(args.src_vocab) 103 | tgt_vocab, tgt_vocab_size = load_vocab(args.tgt_vocab) 104 | 105 | with tf.device("/cpu:0"): # Input pipeline should always be place on the CPU. 106 | src_iterator = load_data(args.src, args.src_trans, src_vocab, tgt_vocab) 107 | tgt_iterator = load_data(args.tgt, args.tgt_trans, tgt_vocab, src_vocab) 108 | src = src_iterator.get_next() 109 | tgt = tgt_iterator.get_next() 110 | 111 | 112 | # Step 2 113 | 114 | 115 | def add_noise_python(words, dropout=0.1, k=3): 116 | """Applies the noise model in input words. 117 | 118 | Args: 119 | words: A numpy vector of word ids. 120 | dropout: The probability to drop words. 121 | k: Maximum distance of the permutation. 122 | 123 | Returns: 124 | A noisy numpy vector of word ids. 
125 | """ 126 | 127 | def _drop_words(words, probability): 128 | """Drops words with the given probability.""" 129 | length = len(words) 130 | keep_prob = np.random.uniform(size=length) 131 | keep = np.random.uniform(size=length) > probability 132 | if np.count_nonzero(keep) == 0: 133 | ind = np.random.randint(0, length) 134 | keep[ind] = True 135 | words = np.take(words, keep.nonzero())[0] 136 | return words 137 | 138 | def _rand_perm_with_constraint(words, k): 139 | """Randomly permutes words ensuring that words are no more than k positions 140 | away from their original position.""" 141 | length = len(words) 142 | offset = np.random.uniform(size=length) * (k + 1) 143 | new_pos = np.arange(length) + offset 144 | return np.take(words, np.argsort(new_pos)) 145 | 146 | words = _drop_words(words, dropout) 147 | words = _rand_perm_with_constraint(words, k) 148 | return words 149 | 150 | def add_noise(ids, sequence_length): 151 | """Wraps add_noise_python for a batch of tensors.""" 152 | 153 | def _add_noise_single(ids, sequence_length): 154 | noisy_ids = add_noise_python(ids[:sequence_length]) 155 | noisy_sequence_length = len(noisy_ids) 156 | ids[:noisy_sequence_length] = noisy_ids 157 | ids[noisy_sequence_length:] = 0 158 | return ids, np.int32(noisy_sequence_length) 159 | 160 | noisy_ids, noisy_sequence_length = tf.map_fn( 161 | lambda x: tf.py_func(_add_noise_single, x, [ids.dtype, tf.int32]), 162 | [ids, sequence_length], 163 | dtype=[ids.dtype, tf.int32], 164 | back_prop=False) 165 | 166 | noisy_ids.set_shape(ids.get_shape()) 167 | noisy_sequence_length.set_shape(sequence_length.get_shape()) 168 | 169 | return noisy_ids, noisy_sequence_length 170 | 171 | 172 | # Step 3 173 | 174 | 175 | from opennmt.inputters.text_inputter import load_pretrained_embeddings 176 | 177 | def create_embeddings(vocab_size, depth=300): 178 | """Creates an embedding variable.""" 179 | return tf.get_variable("embedding", shape=[vocab_size, depth]) 180 | 181 | def load_embeddings(embedding_file, vocab_file): 182 | """Loads an embedding variable or embeddings file.""" 183 | try: 184 | embeddings = tf.get_variable("embedding") 185 | except ValueError: 186 | pretrained = load_pretrained_embeddings( 187 | embedding_file, 188 | vocab_file, 189 | num_oov_buckets=1, 190 | with_header=True, 191 | case_insensitive_embeddings=True) 192 | embeddings = tf.get_variable( 193 | "embedding", 194 | shape=None, 195 | trainable=False, 196 | initializer=tf.constant(pretrained.astype(np.float32))) 197 | return embeddings 198 | 199 | with tf.variable_scope("src"): 200 | if args.src_emb is not None: 201 | src_emb = load_embeddings(args.src_emb, args.src_vocab) 202 | else: 203 | src_emb = create_embeddings(src_vocab_size) 204 | 205 | with tf.variable_scope("tgt"): 206 | if args.tgt_emb is not None: 207 | tgt_emb = load_embeddings(args.tgt_emb, args.tgt_vocab) 208 | else: 209 | tgt_emb = create_embeddings(tgt_vocab_size) 210 | 211 | 212 | # Step 4 213 | 214 | 215 | hidden_size = 512 216 | encoder = onmt.encoders.BidirectionalRNNEncoder(2, hidden_size) 217 | 218 | def add_noise_and_encode(ids, sequence_length, embedding, reuse=None): 219 | """Applies the noise model on ids, embeds and encodes. 220 | 221 | Args: 222 | ids: The tensor of words ids of shape [batch_size, max_time]. 223 | sequence_length: The tensor of sequence length of shape [batch_size]. 224 | embedding: The embedding variable. 225 | reuse: If True, reuse the encoder variables. 226 | 227 | Returns: 228 | A tuple (encoder output, encoder state, sequence length). 
229 | """ 230 | noisy_ids, noisy_sequence_length = add_noise(ids, sequence_length) 231 | noisy = tf.nn.embedding_lookup(embedding, noisy_ids) 232 | with tf.variable_scope("encoder", reuse=reuse): 233 | return encoder.encode(noisy, sequence_length=noisy_sequence_length) 234 | 235 | src_encoder_auto = add_noise_and_encode( 236 | src["ids"], src["length"], src_emb, reuse=None) 237 | tgt_encoder_auto = add_noise_and_encode( 238 | tgt["ids"], tgt["length"], tgt_emb, reuse=True) 239 | 240 | src_encoder_cross = add_noise_and_encode( 241 | tgt["trans_ids"], tgt["trans_length"], src_emb, reuse=True) 242 | tgt_encoder_cross = add_noise_and_encode( 243 | src["trans_ids"], src["trans_length"], tgt_emb, reuse=True) 244 | 245 | 246 | # Step 5 247 | 248 | 249 | decoder = onmt.decoders.AttentionalRNNDecoder( 250 | 2, hidden_size, bridge=onmt.layers.CopyBridge()) 251 | 252 | from opennmt.utils.losses import cross_entropy_sequence_loss 253 | 254 | def denoise(x, embedding, encoder_outputs, generator, reuse=None): 255 | """Denoises from the noisy encoding. 256 | 257 | Args: 258 | x: The input data from the dataset. 259 | embedding: The embedding variable. 260 | encoder_outputs: A tuple with the encoder outputs. 261 | generator: A tf.layers.Dense instance for projecting the logits. 262 | reuse: If True, reuse the decoder variables. 263 | 264 | Returns: 265 | The decoder loss. 266 | """ 267 | with tf.variable_scope("decoder", reuse=reuse): 268 | logits, _, _ = decoder.decode( 269 | tf.nn.embedding_lookup(embedding, x["ids_in"]), 270 | x["length"] + 1, 271 | initial_state=encoder_outputs[1], 272 | output_layer=generator, 273 | memory=encoder_outputs[0], 274 | memory_sequence_length=encoder_outputs[2]) 275 | cumulated_loss, _, normalizer = cross_entropy_sequence_loss( 276 | logits, x["ids_out"], x["length"] + 1) 277 | return cumulated_loss / normalizer 278 | 279 | with tf.variable_scope("src"): 280 | src_gen = tf.layers.Dense(src_vocab_size) 281 | src_gen.build([None, hidden_size]) 282 | 283 | with tf.variable_scope("tgt"): 284 | tgt_gen = tf.layers.Dense(tgt_vocab_size) 285 | tgt_gen.build([None, hidden_size]) 286 | 287 | l_auto_src = denoise(src, src_emb, src_encoder_auto, src_gen, reuse=None) 288 | l_auto_tgt = denoise(tgt, tgt_emb, tgt_encoder_auto, tgt_gen, reuse=True) 289 | 290 | l_cd_src = denoise(src, src_emb, tgt_encoder_cross, src_gen, reuse=True) 291 | l_cd_tgt = denoise(tgt, tgt_emb, src_encoder_cross, tgt_gen, reuse=True) 292 | 293 | 294 | # Step 6 295 | 296 | 297 | def binary_cross_entropy(x, y, smoothing=0, epsilon=1e-12): 298 | """Computes the averaged binary cross entropy. 299 | 300 | bce = y*log(x) + (1-y)*log(1-x) 301 | 302 | Args: 303 | x: The predicted labels. 304 | y: The true labels. 305 | smoothing: The label smoothing coefficient. 306 | 307 | Returns: 308 | The cross entropy. 309 | """ 310 | y = tf.to_float(y) 311 | if smoothing > 0: 312 | smoothing *= 2 313 | y = y * (1 - smoothing) + 0.5 * smoothing 314 | return -tf.reduce_mean(tf.log(x + epsilon) * y + tf.log(1.0 - x + epsilon) * (1 - y)) 315 | 316 | def discriminator(encodings, 317 | sequence_lengths, 318 | lang_ids, 319 | num_layers=3, 320 | hidden_size=1024, 321 | dropout=0.3): 322 | """Discriminates the encoder outputs against lang_ids. 323 | 324 | Args: 325 | encodings: The encoder outputs of shape [batch_size, max_time, hidden_size]. 326 | sequence_lengths: The length of each sequence of shape [batch_size]. 327 | lang_ids: The true lang id of each sequence of shape [batch_size]. 
328 | num_layers: The number of layers of the discriminator. 329 | hidden_size: The hidden size of the discriminator. 330 | dropout: The dropout to apply on each discriminator layer output. 331 | 332 | Returns: 333 | A tuple with: the discriminator loss (L_d) and the adversarial loss (L_adv). 334 | """ 335 | x = encodings 336 | for _ in range(num_layers): 337 | x = tf.nn.dropout(x, 1.0 - dropout) 338 | x = tf.layers.dense(x, hidden_size, activation=tf.nn.leaky_relu) 339 | x = tf.nn.dropout(x, 1.0 - dropout) 340 | y = tf.layers.dense(x, 1) 341 | 342 | mask = tf.sequence_mask( 343 | sequence_lengths, maxlen=tf.shape(encodings)[1], dtype=tf.float32) 344 | mask = tf.expand_dims(mask, -1) 345 | 346 | y = tf.log_sigmoid(y) * mask 347 | y = tf.reduce_sum(y, axis=1) 348 | y = tf.exp(y) 349 | 350 | l_d = binary_cross_entropy(y, lang_ids, smoothing=0.1) 351 | l_adv = binary_cross_entropy(y, 1 - lang_ids) 352 | 353 | return l_d, l_adv 354 | 355 | from opennmt.layers.reducer import pad_in_time 356 | 357 | batch_size = tf.shape(src["length"])[0] 358 | all_encoder_outputs = [ 359 | src_encoder_auto, src_encoder_cross, 360 | tgt_encoder_auto, tgt_encoder_cross] 361 | lang_ids = tf.concat([ 362 | tf.fill([batch_size * 2], 0), 363 | tf.fill([batch_size * 2], 1)], 0) 364 | 365 | max_time = tf.reduce_max([tf.shape(output[0])[1] for output in all_encoder_outputs]) 366 | 367 | encodings = tf.concat([ 368 | pad_in_time(output[0], max_time - tf.shape(output[0])[1]) 369 | for output in all_encoder_outputs], 0) 370 | sequence_lengths = tf.concat([output[2] for output in all_encoder_outputs], 0) 371 | 372 | with tf.variable_scope("discriminator"): 373 | l_d, l_adv = discriminator(encodings, sequence_lengths, lang_ids) 374 | 375 | 376 | # Step 7 377 | 378 | 379 | lambda_auto = 1 380 | lambda_cd = 1 381 | lambda_adv = 1 382 | 383 | l_auto = l_auto_src + l_auto_tgt 384 | l_cd = l_cd_src + l_cd_tgt 385 | 386 | l_final = (lambda_auto * l_auto + lambda_cd * l_cd + lambda_adv * l_adv) 387 | 388 | def build_train_op(global_step, encdec_variables, discri_variables): 389 | """Returns the training Op. 390 | 391 | When global_step % 2 == 0, it minimizes l_final and updates encdec_variables. 392 | Otherwise, it minimizes l_d and updates discri_variables. 393 | 394 | Args: 395 | global_step: The training step. 396 | encdec_variables: The list of variables of the encoder/decoder model. 397 | discri_variables: The list of variables of the discriminator. 398 | 399 | Returns: 400 | The training op. 
401 | """ 402 | encdec_opt = tf.train.AdamOptimizer(learning_rate=0.0003, beta1=0.5) 403 | discri_opt = tf.train.RMSPropOptimizer(0.0005) 404 | encdec_gradients = encdec_opt.compute_gradients(l_final, var_list=encdec_variables) 405 | discri_gradients = discri_opt.compute_gradients(l_d, var_list=discri_variables) 406 | return tf.cond( 407 | tf.equal(tf.mod(global_step, 2), 0), 408 | true_fn=lambda: encdec_opt.apply_gradients(encdec_gradients, global_step=global_step), 409 | false_fn=lambda: discri_opt.apply_gradients(discri_gradients, global_step=global_step)) 410 | 411 | encdec_variables = [] 412 | discri_variables = [] 413 | for variable in tf.trainable_variables(): 414 | if variable.name.startswith("discriminator"): 415 | discri_variables.append(variable) 416 | else: 417 | encdec_variables.append(variable) 418 | 419 | global_step = tf.train.get_or_create_global_step() 420 | train_op = build_train_op(global_step, encdec_variables, discri_variables) 421 | 422 | i = 0 423 | with tf.train.MonitoredTrainingSession(checkpoint_dir=args.model_dir) as sess: 424 | sess.run([src_iterator.initializer, tgt_iterator.initializer]) 425 | while not sess.should_stop(): 426 | if i % 2 == 0: 427 | _, step, _l_auto, _l_cd, _l_adv, _l = sess.run( 428 | [train_op, global_step, l_auto, l_cd, l_adv, l_final]) 429 | print("{} - l_auto = {}; l_cd = {}, l_adv = {}; l = {}".format( 430 | step, _l_auto, _l_cd, _l_adv, _l)) 431 | else: 432 | _, step, _l_d = sess.run([train_op, global_step, l_d]) 433 | print("{} - l_d = {}".format(step, _l_d)) 434 | i += 1 435 | sys.stdout.flush() 436 | -------------------------------------------------------------------------------- /unsupervised-nmt/requirements.txt.cpu: -------------------------------------------------------------------------------- 1 | OpenNMT-tf[tensorflow]==1.1.0 2 | -------------------------------------------------------------------------------- /unsupervised-nmt/requirements.txt.gpu: -------------------------------------------------------------------------------- 1 | OpenNMT-tf[tensorflow_gpu]==1.1.0 2 | --------------------------------------------------------------------------------