├── LICENSE ├── README.md ├── __init__.py ├── character_base ├── __init__.py ├── char_base.py ├── char_base_both.py ├── train_wmt15_csen_bpe2char_adam.py ├── train_wmt15_deen_bpe2char_adam.py ├── train_wmt15_deen_bpe2char_both_adam.py ├── train_wmt15_fien_bpe2char_adam.py ├── train_wmt15_ruen_bpe2char_adam.py ├── translate.py ├── translate_both.py ├── translate_bpe2char_ensemble_csen.py ├── translate_bpe2char_ensemble_deen.py ├── translate_bpe2char_ensemble_fien.py ├── translate_bpe2char_ensemble_ruen.py ├── wmt15_csen_bpe2char_adam.txt ├── wmt15_deen_bpe2char_adam.txt ├── wmt15_fien_bpe2char_adam.txt └── wmt15_ruen_bpe2char_adam.txt ├── character_biscale ├── __init__.py ├── char_biscale.py ├── char_biscale_attc.py ├── char_biscale_both.py ├── train_wmt15_csen_adam.py ├── train_wmt15_deen_adam.py ├── train_wmt15_deen_attc_adam.py ├── train_wmt15_deen_both_adam.py ├── train_wmt15_fien_adam.py ├── train_wmt15_ruen_adam.py ├── translate.py ├── translate_attc.py ├── translate_both.py ├── translate_ensemble_csen.py ├── translate_ensemble_deen.py ├── translate_ensemble_fien.py ├── translate_ensemble_ruen.py ├── wmt15_csen_bpe2char_adam.txt ├── wmt15_deen_bpe2char_adam.txt ├── wmt15_fien_bpe2char_adam.txt └── wmt15_ruen_bpe2char_adam.txt ├── data_iterator.py ├── mixer.py ├── nmt.py ├── preprocess ├── build_dictionary_char.py ├── build_dictionary_word.py ├── clean_tags.py ├── fix_appo.sh ├── merge.sh ├── multi-bleu.perl ├── nonbreaking_prefixes │ ├── README.txt │ ├── nonbreaking_prefix.ca │ ├── nonbreaking_prefix.cs │ ├── nonbreaking_prefix.de │ ├── nonbreaking_prefix.el │ ├── nonbreaking_prefix.en │ ├── nonbreaking_prefix.es │ ├── nonbreaking_prefix.fi │ ├── nonbreaking_prefix.fr │ ├── nonbreaking_prefix.hu │ ├── nonbreaking_prefix.is │ ├── nonbreaking_prefix.it │ ├── nonbreaking_prefix.lv │ ├── nonbreaking_prefix.nl │ ├── nonbreaking_prefix.pl │ ├── nonbreaking_prefix.pt │ ├── nonbreaking_prefix.ro │ ├── nonbreaking_prefix.ru │ ├── nonbreaking_prefix.sk │ ├── nonbreaking_prefix.sl │ ├── nonbreaking_prefix.sv │ └── nonbreaking_prefix.ta ├── normalize-punctuation.perl ├── preprocess.sh ├── shuffle.py ├── tokenizer.perl └── tokenizer_apos.perl ├── presentation └── appendix.pdf ├── subword_base ├── subword_base.py ├── subword_base_both.py ├── train_wmt15_csen_bpe2bpe_both_adam.py ├── train_wmt15_deen_bpe2bpe_adam.py ├── train_wmt15_deen_bpe2bpe_both_adam.py ├── train_wmt15_fien_bpe2bpe_both_adam.py ├── train_wmt15_ruen_bpe2bpe_both_adam.py ├── translate.py ├── translate_both.py ├── translate_both_bpe2bpe_ensemble_csen.py ├── translate_both_bpe2bpe_ensemble_deen.py ├── translate_both_bpe2bpe_ensemble_fien.py ├── translate_both_bpe2bpe_ensemble_ruen.py ├── wmt15_csen_bpe2bpe_adam.txt ├── wmt15_deen_bpe2bpe_adam.txt ├── wmt15_fien_bpe2bpe_adam.txt └── wmt15_ruen_bpe2bpe_adam.txt └── translate_readme.txt /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2016, Junyoung Chung, Kyunghyun Cho 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | * Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 10 | * Redistributions in binary form must reproduce the above copyright notice, 11 | this list of conditions and the following disclaimer in the documentation 12 | and/or other materials provided with the distribution. 
13 | 14 | * Neither the name of dl4mt-cdec nor the names of its 15 | contributors may be used to endorse or promote products derived from 16 | this software without specific prior written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 22 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 23 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 24 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 25 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 26 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 27 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 28 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Character-Level Neural Machine Translation This is an implementation of the models described in the paper "A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation". http://arxiv.org/abs/1603.06147 Dependencies: ------------- The majority of the script files are written in pure Theano.
The preprocessing pipeline has the following dependencies:
Python Libraries: NLTK
MOSES: https://github.com/moses-smt/mosesdecoder
Subword-NMT (http://arxiv.org/abs/1508.07909): https://github.com/rsennrich/subword-nmt
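For convenience, one way to fetch these preprocessing dependencies is sketched below; the clone destinations are placeholders, and only the Moses scripts (tokenizer, nonbreaking prefixes, multi-bleu) are needed here rather than a full Moses build.

```bash
# Sketch: fetch the preprocessing dependencies (destination directories are placeholders).
pip install nltk                                         # Python tokenization helpers
git clone https://github.com/moses-smt/mosesdecoder.git  # tokenizer.perl, nonbreaking prefixes, multi-bleu.perl
git clone https://github.com/rsennrich/subword-nmt.git   # BPE learning/application scripts
```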
This code is based on the dl4mt library.
link: https://github.com/nyu-dl/dl4mt-tutorial Be sure to include the path to this library in your PYTHONPATH. We recommend using the latest version of Theano.
However, if you want an exact reproduction, please use the following version of Theano.
commit hash: fdfbab37146ee475b3fd17d8d104fb09bf3a8d5c Preparing Text Corpora: ----------------------- The original text corpora can be downloaded from http://www.statmt.org/wmt15/translation-task.html
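Before preparing the data, the dependencies above can be set up roughly as follows. This is only a sketch: the clone locations are placeholders, and checking out the specific Theano commit is needed only for exact reproduction.

```bash
# Sketch: make the dl4mt-tutorial library importable and (optionally) pin Theano.
git clone https://github.com/nyu-dl/dl4mt-tutorial.git
export PYTHONPATH=$PYTHONPATH:$PWD/dl4mt-tutorial

# Optional: install Theano from source at the commit mentioned above.
git clone https://github.com/Theano/Theano.git
cd Theano
git checkout fdfbab37146ee475b3fd17d8d104fb09bf3a8d5c
pip install -e .
cd ..
```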
Once the download is finished, use 'preprocess.sh' in the 'preprocess' directory to preprocess the text files. For the character-level decoders, preprocessing is not strictly necessary; however, in order to compare the results with subword-level decoders and other word-level approaches, we apply the same preprocessing to all of the target corpora. Finally, use 'build_dictionary_char.py' for the character case and 'build_dictionary_word.py' for the subword case to build the vocabularies.
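As a rough usage sketch: the training scripts read the plain-text config files shipped next to them, and the translation scripts take the model, dictionaries, source file and output file on the command line. The data paths inside the config files and all /path/to/... entries below are placeholders that must be edited, and it is assumed that the repository root and the relevant subdirectory are both on PYTHONPATH so that nmt.py, mixer.py and char_base.py can be imported.

```bash
# Sketch: train a bpe2char model and translate with it (all /path/to/... entries are placeholders).
export PYTHONPATH=$PYTHONPATH:/path/to/dl4mt-cdec:/path/to/dl4mt-cdec/character_base
cd /path/to/dl4mt-cdec/character_base

# Training: the last command-line argument is the config file
# (this script defaults to 'wmt15_deen_bpe2char_adam.txt' if none is given).
python train_wmt15_deen_bpe2char_adam.py wmt15_deen_bpe2char_adam.txt

# Translation: beam search with length normalization (-n), beam size 20 (-k 20),
# character-level decoder output (-dec_c) and UTF-8 output (-utf8).
# A matching '.pkl' options file is expected next to the '.npz' model file.
python translate.py -n -k 20 -dec_c -utf8 \
    /path/to/model.npz \
    /path/to/source_dictionary.pkl /path/to/target_dictionary.pkl \
    /path/to/newstest2013.en.tok.bpe /path/to/output.txt
```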
Updating... -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nyu-dl/dl4mt-cdec/e738dc7235cb2819ad2b4e8e5837e97b2fb41de2/__init__.py -------------------------------------------------------------------------------- /character_base/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nyu-dl/dl4mt-cdec/e738dc7235cb2819ad2b4e8e5837e97b2fb41de2/character_base/__init__.py -------------------------------------------------------------------------------- /character_base/train_wmt15_csen_bpe2char_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_base import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder', 11 | 'two_layer_gru_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_two_layer_gru_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 
74 | config_file_name = 'wmt15_csen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_base/train_wmt15_deen_bpe2char_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_base import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder', 11 | 'two_layer_gru_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_two_layer_gru_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | 
param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_base/train_wmt15_deen_bpe2char_both_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from char_base_both import train 5 | from nmt_both import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both', 11 | 'two_layer_gru_decoder_both'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_two_layer_gru_decoder_both_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | 
-------------------------------------------------------------------------------- /character_base/train_wmt15_fien_bpe2char_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_base import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder', 11 | 'two_layer_gru_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_two_layer_gru_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_fien_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_base/train_wmt15_ruen_bpe2char_adam.py: -------------------------------------------------------------------------------- 1 | import 
os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_base import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder', 11 | 'two_layer_gru_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = True 17 | save_file_name = 'bpe2char_two_layer_gru_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_ruen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_base/translate.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 
3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from char_base import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | queue.put((idx, x)) 126 | 
return idx+1 127 | 128 | def _finish_processes(): 129 | for midx in xrange(n_process): 130 | queue.put(None) 131 | 132 | def _retrieve_jobs(n_samples): 133 | trans = [None] * n_samples 134 | for idx in xrange(n_samples): 135 | resp = rqueue.get() 136 | trans[resp[0]] = resp[1] 137 | if numpy.mod(idx, 10) == 0: 138 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 139 | return trans 140 | 141 | print 'Translating ', source_file, '...' 142 | n_samples = _send_jobs(source_file) 143 | trans = _seqs2words(_retrieve_jobs(n_samples)) 144 | _finish_processes() 145 | with open(saveto, 'w') as f: 146 | if decoder_bpe_to_tok: 147 | print >>f, '\n'.join(trans).replace('@@ ', '') 148 | else: 149 | print >>f, '\n'.join(trans) 150 | print 'Done' 151 | 152 | 153 | if __name__ == "__main__": 154 | parser = argparse.ArgumentParser() 155 | parser.add_argument('-k', type=int, default=5) 156 | parser.add_argument('-p', type=int, default=5) 157 | parser.add_argument('-n', action="store_true", default=False) 158 | parser.add_argument('-bpe', action="store_true", default=False) 159 | parser.add_argument('-enc_c', action="store_true", default=False) 160 | parser.add_argument('-dec_c', action="store_true", default=False) 161 | parser.add_argument('-utf8', action="store_true", default=False) 162 | parser.add_argument('model', type=str) 163 | parser.add_argument('dictionary', type=str) 164 | parser.add_argument('dictionary_target', type=str) 165 | parser.add_argument('source', type=str) 166 | parser.add_argument('saveto', type=str) 167 | 168 | args = parser.parse_args() 169 | 170 | main(args.model, args.dictionary, args.dictionary_target, args.source, 171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 172 | encoder_chr_level=args.enc_c, 173 | decoder_chr_level=args.dec_c, 174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 175 | -------------------------------------------------------------------------------- /character_base/translate_both.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from char_base_both import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 
33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | queue.put((idx, x)) 126 | return idx+1 127 | 128 | def _finish_processes(): 129 | for midx in xrange(n_process): 130 | queue.put(None) 131 | 132 | def _retrieve_jobs(n_samples): 133 | trans = [None] * n_samples 134 | for idx in xrange(n_samples): 135 | resp = rqueue.get() 136 | trans[resp[0]] = resp[1] 137 | if numpy.mod(idx, 10) == 0: 138 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 139 | return trans 140 | 141 | print 'Translating ', source_file, '...' 
142 | n_samples = _send_jobs(source_file) 143 | trans = _seqs2words(_retrieve_jobs(n_samples)) 144 | _finish_processes() 145 | with open(saveto, 'w') as f: 146 | if decoder_bpe_to_tok: 147 | print >>f, '\n'.join(trans).replace('@@ ', '') 148 | else: 149 | print >>f, '\n'.join(trans) 150 | print 'Done' 151 | 152 | 153 | if __name__ == "__main__": 154 | parser = argparse.ArgumentParser() 155 | parser.add_argument('-k', type=int, default=5) 156 | parser.add_argument('-p', type=int, default=5) 157 | parser.add_argument('-n', action="store_true", default=False) 158 | parser.add_argument('-bpe', action="store_true", default=False) 159 | parser.add_argument('-enc_c', action="store_true", default=False) 160 | parser.add_argument('-dec_c', action="store_true", default=False) 161 | parser.add_argument('-utf8', action="store_true", default=False) 162 | parser.add_argument('model', type=str) 163 | parser.add_argument('dictionary', type=str) 164 | parser.add_argument('dictionary_target', type=str) 165 | parser.add_argument('source', type=str) 166 | parser.add_argument('saveto', type=str) 167 | 168 | args = parser.parse_args() 169 | 170 | main(args.model, args.dictionary, args.dictionary_target, args.source, 171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 172 | encoder_chr_level=args.enc_c, 173 | decoder_chr_level=args.dec_c, 174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 175 | -------------------------------------------------------------------------------- /character_base/translate_bpe2char_ensemble_deen.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from nmt import (build_sampler, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def gen_sample(tparams, f_inits, f_nexts, x, options, trng=None, 16 | k=1, maxlen=500, stochastic=True, argmax=False): 17 | 18 | # k is the beam size we have 19 | if k > 1: 20 | assert not stochastic, \ 21 | 'Beam search does not support stochastic sampling' 22 | 23 | sample = [] 24 | sample_score = [] 25 | if stochastic: 26 | sample_score = 0 27 | 28 | live_k = 1 29 | dead_k = 0 30 | 31 | hyp_samples = [[]] * live_k 32 | hyp_scores = numpy.zeros(live_k).astype('float32') 33 | hyp_states = [] 34 | 35 | # get initial state of decoder rnn and encoder context 36 | rets = [] 37 | next_state_chars = [] 38 | next_state_words = [] 39 | ctx0s = [] 40 | 41 | for i in xrange(len(f_inits)): 42 | ret = f_inits[i](x) 43 | next_state_chars.append(ret[0]) 44 | next_state_words.append(ret[1]) 45 | ctx0s.append(ret[2]) 46 | next_w = -1 * numpy.ones((1,)).astype('int64') # bos indicator 47 | 48 | num_models = len(f_inits) 49 | 50 | for ii in xrange(maxlen): 51 | 52 | temp_next_p = [] 53 | temp_next_state_char = [] 54 | temp_next_state_word = [] 55 | 56 | for i in xrange(num_models): 57 | 58 | ctx = numpy.tile(ctx0s[i], [live_k, 1]) 59 | inps = [next_w, ctx, next_state_chars[i], next_state_words[i]] 60 | ret = f_nexts[i](*inps) 61 | next_p, _, next_state_char, next_state_word = ret[0], ret[1], ret[2], ret[3] 62 | temp_next_p.append(next_p) 63 | temp_next_state_char.append(next_state_char) 64 | temp_next_state_word.append(next_state_word) 65 | #next_p = numpy.log(numpy.array(temp_next_p)).sum(axis=0) / num_models 66 | next_p = numpy.log(numpy.array(temp_next_p).mean(axis=0)) 67 | 68 | if stochastic: 69 | if argmax: 70 | nw = 
next_p[0].argmax() 71 | else: 72 | nw = next_w[0] 73 | sample.append(nw) 74 | sample_score += next_p[0, nw] 75 | if nw == 0: 76 | break 77 | else: 78 | cand_scores = hyp_scores[:, None] - next_p 79 | cand_flat = cand_scores.flatten() 80 | ranks_flat = cand_flat.argsort()[:(k - dead_k)] 81 | 82 | voc_size = next_p.shape[1] 83 | trans_indices = ranks_flat / voc_size 84 | word_indices = ranks_flat % voc_size 85 | costs = cand_flat[ranks_flat] 86 | 87 | new_hyp_samples = [] 88 | new_hyp_scores = numpy.zeros(k - dead_k).astype('float32') 89 | new_hyp_states_chars = [] 90 | new_hyp_states_words = [] 91 | 92 | for idx, [ti, wi] in enumerate(zip(trans_indices, word_indices)): 93 | new_hyp_samples.append(hyp_samples[ti] + [wi]) 94 | new_hyp_scores[idx] = copy.copy(costs[idx]) 95 | 96 | for i in xrange(num_models): 97 | new_hyp_states_char = [] 98 | new_hyp_states_word = [] 99 | 100 | for ti in trans_indices: 101 | new_hyp_states_char.append(copy.copy(temp_next_state_char[i][ti])) 102 | new_hyp_states_word.append(copy.copy(temp_next_state_word[i][ti])) 103 | 104 | new_hyp_states_chars.append(new_hyp_states_char) 105 | new_hyp_states_words.append(new_hyp_states_word) 106 | 107 | # check the finished samples 108 | new_live_k = 0 109 | hyp_samples = [] 110 | hyp_scores = [] 111 | 112 | for idx in xrange(len(new_hyp_samples)): 113 | if new_hyp_samples[idx][-1] == 0: 114 | sample.append(new_hyp_samples[idx]) 115 | sample_score.append(new_hyp_scores[idx]) 116 | dead_k += 1 117 | else: 118 | new_live_k += 1 119 | hyp_samples.append(new_hyp_samples[idx]) 120 | hyp_scores.append(new_hyp_scores[idx]) 121 | 122 | for i in xrange(num_models): 123 | hyp_states_char = [] 124 | hyp_states_word = [] 125 | 126 | for idx in xrange(len(new_hyp_samples)): 127 | if new_hyp_samples[idx][-1] != 0: 128 | hyp_states_char.append(new_hyp_states_chars[i][idx]) 129 | hyp_states_word.append(new_hyp_states_words[i][idx]) 130 | 131 | next_state_chars[i] = numpy.array(hyp_states_char) 132 | next_state_words[i] = numpy.array(hyp_states_word) 133 | 134 | hyp_scores = numpy.array(hyp_scores) 135 | live_k = new_live_k 136 | 137 | if new_live_k < 1: 138 | break 139 | if dead_k >= k: 140 | break 141 | 142 | next_w = numpy.array([w[-1] for w in hyp_samples]) 143 | 144 | if not stochastic: 145 | # dump every remaining one 146 | if live_k > 0: 147 | for idx in xrange(live_k): 148 | sample.append(hyp_samples[idx]) 149 | sample_score.append(hyp_scores[idx]) 150 | 151 | return sample, sample_score 152 | 153 | 154 | def translate_model(queue, rqueue, pid, models, options, k, normalize): 155 | 156 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 157 | trng = RandomStreams(1234) 158 | 159 | # allocate model parameters 160 | params = [] 161 | for i in xrange(len(models)): 162 | params.append(init_params(options)) 163 | 164 | # load model parameters and set theano shared variables 165 | tparams = [] 166 | for i in xrange(len(params)): 167 | params[i] = load_params(models[i], params[i]) 168 | tparams.append(init_tparams(params[i])) 169 | 170 | # word index 171 | use_noise = theano.shared(numpy.float32(0.)) 172 | f_inits = [] 173 | f_nexts = [] 174 | for i in xrange(len(tparams)): 175 | f_init, f_next = build_sampler(tparams[i], options, trng, use_noise) 176 | f_inits.append(f_init) 177 | f_nexts.append(f_next) 178 | 179 | def _translate(seq): 180 | use_noise.set_value(0.) 
181 | # sample given an input sequence and obtain scores 182 | sample, score = gen_sample(tparams, f_inits, f_nexts, 183 | numpy.array(seq).reshape([len(seq), 1]), 184 | options, trng=trng, k=k, maxlen=500, 185 | stochastic=False, argmax=False) 186 | 187 | # normalize scores according to sequence lengths 188 | if normalize: 189 | lengths = numpy.array([len(s) for s in sample]) 190 | score = score / lengths 191 | sidx = numpy.argmin(score) 192 | return sample[sidx] 193 | 194 | while True: 195 | req = queue.get() 196 | if req is None: 197 | break 198 | 199 | idx, x = req[0], req[1] 200 | print pid, '-', idx 201 | seq = _translate(x) 202 | 203 | rqueue.put((idx, seq)) 204 | 205 | return 206 | 207 | 208 | def main(models, dictionary, dictionary_target, source_file, saveto, k=5, 209 | normalize=False, n_process=5, encoder_chr_level=False, 210 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 211 | 212 | # load model model_options 213 | pkl_file = models[0].split('.')[0] + '.pkl' 214 | with open(pkl_file, 'rb') as f: 215 | options = pkl.load(f) 216 | 217 | # load source dictionary and invert 218 | with open(dictionary, 'rb') as f: 219 | word_dict = pkl.load(f) 220 | word_idict = dict() 221 | for kk, vv in word_dict.iteritems(): 222 | word_idict[vv] = kk 223 | word_idict[0] = '' 224 | word_idict[1] = 'UNK' 225 | 226 | # load target dictionary and invert 227 | with open(dictionary_target, 'rb') as f: 228 | word_dict_trg = pkl.load(f) 229 | word_idict_trg = dict() 230 | for kk, vv in word_dict_trg.iteritems(): 231 | word_idict_trg[vv] = kk 232 | word_idict_trg[0] = '' 233 | word_idict_trg[1] = 'UNK' 234 | 235 | # create input and output queues for processes 236 | queue = Queue() 237 | rqueue = Queue() 238 | processes = [None] * n_process 239 | for midx in xrange(n_process): 240 | processes[midx] = Process( 241 | target=translate_model, 242 | args=(queue, rqueue, midx, models, options, k, normalize)) 243 | processes[midx].start() 244 | 245 | # utility function 246 | def _seqs2words(caps): 247 | capsw = [] 248 | for cc in caps: 249 | ww = [] 250 | for w in cc: 251 | if w == 0: 252 | break 253 | if utf8: 254 | ww.append(word_idict_trg[w].encode('utf-8')) 255 | else: 256 | ww.append(word_idict_trg[w]) 257 | if decoder_chr_level: 258 | capsw.append(''.join(ww)) 259 | else: 260 | capsw.append(' '.join(ww)) 261 | return capsw 262 | 263 | def _send_jobs(fname): 264 | with open(fname, 'r') as f: 265 | for idx, line in enumerate(f): 266 | if encoder_chr_level: 267 | words = list(line.decode('utf-8').strip()) 268 | else: 269 | words = line.strip().split() 270 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 271 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 272 | x += [0] 273 | queue.put((idx, x)) 274 | return idx+1 275 | 276 | def _finish_processes(): 277 | for midx in xrange(n_process): 278 | queue.put(None) 279 | 280 | def _retrieve_jobs(n_samples): 281 | trans = [None] * n_samples 282 | for idx in xrange(n_samples): 283 | resp = rqueue.get() 284 | trans[resp[0]] = resp[1] 285 | if numpy.mod(idx, 10) == 0: 286 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 287 | return trans 288 | 289 | print 'Translating ', source_file, '...' 
290 | n_samples = _send_jobs(source_file) 291 | trans = _seqs2words(_retrieve_jobs(n_samples)) 292 | _finish_processes() 293 | with open(saveto, 'w') as f: 294 | if decoder_bpe_to_tok: 295 | print >>f, '\n'.join(trans).replace('@@ ', '') 296 | else: 297 | print >>f, '\n'.join(trans) 298 | print 'Done' 299 | 300 | 301 | if __name__ == "__main__": 302 | parser = argparse.ArgumentParser() 303 | parser.add_argument('-k', type=int, default=5) 304 | parser.add_argument('-p', type=int, default=5) 305 | parser.add_argument('-n', action="store_true", default=False) 306 | parser.add_argument('-bpe', action="store_true", default=False) 307 | parser.add_argument('-enc_c', action="store_true", default=False) 308 | parser.add_argument('-dec_c', action="store_true", default=False) 309 | parser.add_argument('-utf8', action="store_true", default=False) 310 | parser.add_argument('saveto', type=str) 311 | 312 | model_path = '/misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/deen/bpe2char_two_layer_gru_decoder/0209/' 313 | model1 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en1.380000.npz' 314 | model2 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en2.425000.npz' 315 | model3 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en3.400000.npz' 316 | model4 = model_path + 'bpe2char_two_layer_gru_decoder_adam.365000.npz' 317 | models = [model1, model2, model3, model4] 318 | dictionary = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/all_de-en.en.tok.bpe.word.pkl' 319 | dictionary_target = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/all_de-en.de.tok.300.pkl' 320 | source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/dev/newstest2013.en.tok.bpe' 321 | #source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/test/newstest2014-deen-src.en.tok.bpe' 322 | #source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/test/newstest2015-deen-src.en.tok.bpe' 323 | 324 | args = parser.parse_args() 325 | 326 | main(models, dictionary, dictionary_target, source, 327 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 328 | encoder_chr_level=args.enc_c, 329 | decoder_chr_level=args.dec_c, 330 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 331 | -------------------------------------------------------------------------------- /character_base/translate_bpe2char_ensemble_fien.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 
3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from nmt import (build_sampler, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def gen_sample(tparams, f_inits, f_nexts, x, options, trng=None, 16 | k=1, maxlen=500, stochastic=True, argmax=False): 17 | 18 | # k is the beam size we have 19 | if k > 1: 20 | assert not stochastic, \ 21 | 'Beam search does not support stochastic sampling' 22 | 23 | sample = [] 24 | sample_score = [] 25 | if stochastic: 26 | sample_score = 0 27 | 28 | live_k = 1 29 | dead_k = 0 30 | 31 | hyp_samples = [[]] * live_k 32 | hyp_scores = numpy.zeros(live_k).astype('float32') 33 | hyp_states = [] 34 | 35 | # get initial state of decoder rnn and encoder context 36 | rets = [] 37 | next_state_chars = [] 38 | next_state_words = [] 39 | ctx0s = [] 40 | 41 | for i in xrange(len(f_inits)): 42 | ret = f_inits[i](x) 43 | next_state_chars.append(ret[0]) 44 | next_state_words.append(ret[1]) 45 | ctx0s.append(ret[2]) 46 | next_w = -1 * numpy.ones((1,)).astype('int64') # bos indicator 47 | 48 | num_models = len(f_inits) 49 | 50 | for ii in xrange(maxlen): 51 | 52 | temp_next_p = [] 53 | temp_next_state_char = [] 54 | temp_next_state_word = [] 55 | 56 | for i in xrange(num_models): 57 | 58 | ctx = numpy.tile(ctx0s[i], [live_k, 1]) 59 | inps = [next_w, ctx, next_state_chars[i], next_state_words[i]] 60 | ret = f_nexts[i](*inps) 61 | next_p, _, next_state_char, next_state_word = ret[0], ret[1], ret[2], ret[3] 62 | temp_next_p.append(next_p) 63 | temp_next_state_char.append(next_state_char) 64 | temp_next_state_word.append(next_state_word) 65 | #next_p = numpy.log(numpy.array(temp_next_p)).sum(axis=0) / num_models 66 | next_p = numpy.log(numpy.array(temp_next_p).mean(axis=0)) 67 | 68 | if stochastic: 69 | if argmax: 70 | nw = next_p[0].argmax() 71 | else: 72 | nw = next_w[0] 73 | sample.append(nw) 74 | sample_score += next_p[0, nw] 75 | if nw == 0: 76 | break 77 | else: 78 | cand_scores = hyp_scores[:, None] - next_p 79 | cand_flat = cand_scores.flatten() 80 | ranks_flat = cand_flat.argsort()[:(k - dead_k)] 81 | 82 | voc_size = next_p.shape[1] 83 | trans_indices = ranks_flat / voc_size 84 | word_indices = ranks_flat % voc_size 85 | costs = cand_flat[ranks_flat] 86 | 87 | new_hyp_samples = [] 88 | new_hyp_scores = numpy.zeros(k - dead_k).astype('float32') 89 | new_hyp_states_chars = [] 90 | new_hyp_states_words = [] 91 | 92 | for idx, [ti, wi] in enumerate(zip(trans_indices, word_indices)): 93 | new_hyp_samples.append(hyp_samples[ti] + [wi]) 94 | new_hyp_scores[idx] = copy.copy(costs[idx]) 95 | 96 | for i in xrange(num_models): 97 | new_hyp_states_char = [] 98 | new_hyp_states_word = [] 99 | 100 | for ti in trans_indices: 101 | new_hyp_states_char.append(copy.copy(temp_next_state_char[i][ti])) 102 | new_hyp_states_word.append(copy.copy(temp_next_state_word[i][ti])) 103 | 104 | new_hyp_states_chars.append(new_hyp_states_char) 105 | new_hyp_states_words.append(new_hyp_states_word) 106 | 107 | # check the finished samples 108 | new_live_k = 0 109 | hyp_samples = [] 110 | hyp_scores = [] 111 | 112 | for idx in xrange(len(new_hyp_samples)): 113 | if new_hyp_samples[idx][-1] == 0: 114 | sample.append(new_hyp_samples[idx]) 115 | sample_score.append(new_hyp_scores[idx]) 116 | dead_k += 1 117 | else: 118 | new_live_k += 1 119 | hyp_samples.append(new_hyp_samples[idx]) 120 | hyp_scores.append(new_hyp_scores[idx]) 121 | 122 | for i in xrange(num_models): 123 | hyp_states_char = [] 124 | hyp_states_word = 
[] 125 | 126 | for idx in xrange(len(new_hyp_samples)): 127 | if new_hyp_samples[idx][-1] != 0: 128 | hyp_states_char.append(new_hyp_states_chars[i][idx]) 129 | hyp_states_word.append(new_hyp_states_words[i][idx]) 130 | 131 | next_state_chars[i] = numpy.array(hyp_states_char) 132 | next_state_words[i] = numpy.array(hyp_states_word) 133 | 134 | hyp_scores = numpy.array(hyp_scores) 135 | live_k = new_live_k 136 | 137 | if new_live_k < 1: 138 | break 139 | if dead_k >= k: 140 | break 141 | 142 | next_w = numpy.array([w[-1] for w in hyp_samples]) 143 | 144 | if not stochastic: 145 | # dump every remaining one 146 | if live_k > 0: 147 | for idx in xrange(live_k): 148 | sample.append(hyp_samples[idx]) 149 | sample_score.append(hyp_scores[idx]) 150 | 151 | return sample, sample_score 152 | 153 | 154 | def translate_model(queue, rqueue, pid, models, options, k, normalize): 155 | 156 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 157 | trng = RandomStreams(1234) 158 | 159 | # allocate model parameters 160 | params = [] 161 | for i in xrange(len(models)): 162 | params.append(init_params(options)) 163 | 164 | # load model parameters and set theano shared variables 165 | tparams = [] 166 | for i in xrange(len(params)): 167 | params[i] = load_params(models[i], params[i]) 168 | tparams.append(init_tparams(params[i])) 169 | 170 | # word index 171 | use_noise = theano.shared(numpy.float32(0.)) 172 | f_inits = [] 173 | f_nexts = [] 174 | for i in xrange(len(tparams)): 175 | f_init, f_next = build_sampler(tparams[i], options, trng, use_noise) 176 | f_inits.append(f_init) 177 | f_nexts.append(f_next) 178 | 179 | def _translate(seq): 180 | use_noise.set_value(0.) 181 | # sample given an input sequence and obtain scores 182 | sample, score = gen_sample(tparams, f_inits, f_nexts, 183 | numpy.array(seq).reshape([len(seq), 1]), 184 | options, trng=trng, k=k, maxlen=500, 185 | stochastic=False, argmax=False) 186 | 187 | # normalize scores according to sequence lengths 188 | if normalize: 189 | lengths = numpy.array([len(s) for s in sample]) 190 | score = score / lengths 191 | sidx = numpy.argmin(score) 192 | return sample[sidx] 193 | 194 | while True: 195 | req = queue.get() 196 | if req is None: 197 | break 198 | 199 | idx, x = req[0], req[1] 200 | print pid, '-', idx 201 | seq = _translate(x) 202 | 203 | rqueue.put((idx, seq)) 204 | 205 | return 206 | 207 | 208 | def main(models, dictionary, dictionary_target, source_file, saveto, k=5, 209 | normalize=False, n_process=5, encoder_chr_level=False, 210 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 211 | 212 | # load model model_options 213 | pkl_file = models[0].split('.')[0] + '.pkl' 214 | with open(pkl_file, 'rb') as f: 215 | options = pkl.load(f) 216 | 217 | # load source dictionary and invert 218 | with open(dictionary, 'rb') as f: 219 | word_dict = pkl.load(f) 220 | word_idict = dict() 221 | for kk, vv in word_dict.iteritems(): 222 | word_idict[vv] = kk 223 | word_idict[0] = '' 224 | word_idict[1] = 'UNK' 225 | 226 | # load target dictionary and invert 227 | with open(dictionary_target, 'rb') as f: 228 | word_dict_trg = pkl.load(f) 229 | word_idict_trg = dict() 230 | for kk, vv in word_dict_trg.iteritems(): 231 | word_idict_trg[vv] = kk 232 | word_idict_trg[0] = '' 233 | word_idict_trg[1] = 'UNK' 234 | 235 | # create input and output queues for processes 236 | queue = Queue() 237 | rqueue = Queue() 238 | processes = [None] * n_process 239 | for midx in xrange(n_process): 240 | processes[midx] = Process( 241 | 
target=translate_model, 242 | args=(queue, rqueue, midx, models, options, k, normalize)) 243 | processes[midx].start() 244 | 245 | # utility function 246 | def _seqs2words(caps): 247 | capsw = [] 248 | for cc in caps: 249 | ww = [] 250 | for w in cc: 251 | if w == 0: 252 | break 253 | if utf8: 254 | ww.append(word_idict_trg[w].encode('utf-8')) 255 | else: 256 | ww.append(word_idict_trg[w]) 257 | if decoder_chr_level: 258 | capsw.append(''.join(ww)) 259 | else: 260 | capsw.append(' '.join(ww)) 261 | return capsw 262 | 263 | def _send_jobs(fname): 264 | with open(fname, 'r') as f: 265 | for idx, line in enumerate(f): 266 | if encoder_chr_level: 267 | words = list(line.decode('utf-8').strip()) 268 | else: 269 | words = line.strip().split() 270 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 271 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 272 | x += [0] 273 | queue.put((idx, x)) 274 | return idx+1 275 | 276 | def _finish_processes(): 277 | for midx in xrange(n_process): 278 | queue.put(None) 279 | 280 | def _retrieve_jobs(n_samples): 281 | trans = [None] * n_samples 282 | for idx in xrange(n_samples): 283 | resp = rqueue.get() 284 | trans[resp[0]] = resp[1] 285 | if numpy.mod(idx, 10) == 0: 286 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 287 | return trans 288 | 289 | print 'Translating ', source_file, '...' 290 | n_samples = _send_jobs(source_file) 291 | trans = _seqs2words(_retrieve_jobs(n_samples)) 292 | _finish_processes() 293 | with open(saveto, 'w') as f: 294 | if decoder_bpe_to_tok: 295 | print >>f, '\n'.join(trans).replace('@@ ', '') 296 | else: 297 | print >>f, '\n'.join(trans) 298 | print 'Done' 299 | 300 | 301 | if __name__ == "__main__": 302 | parser = argparse.ArgumentParser() 303 | parser.add_argument('-k', type=int, default=5) 304 | parser.add_argument('-p', type=int, default=5) 305 | parser.add_argument('-n', action="store_true", default=False) 306 | parser.add_argument('-bpe', action="store_true", default=False) 307 | parser.add_argument('-enc_c', action="store_true", default=False) 308 | parser.add_argument('-dec_c', action="store_true", default=False) 309 | parser.add_argument('-utf8', action="store_true", default=False) 310 | parser.add_argument('saveto', type=str) 311 | 312 | model_path = '/scratch/jc7382/acl2016/wmt15/fien/bpe2char_two_layer_gru_decoder/0209/' 313 | model1 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en1.205000.npz' 314 | model2 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en2.200000.npz' 315 | model3 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en3.200000.npz' 316 | model4 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en4.200000.npz' 317 | model5 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en5.210000.npz' 318 | model6 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en6.205000.npz' 319 | model7 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en7.200000.npz' 320 | model8 = model_path + 'new_bpe2char_two_layer_gru_decoder_adam.240000.npz' 321 | models = [model1, model2, model3, model4, model5, model6, model7, model8] 322 | dictionary = '/scratch/jc7382/data/wmt15/fien/train/all_fi-en.en.tok.bpe.word.pkl' 323 | dictionary_target = '/scratch/jc7382/data/wmt15/fien/train/all_fi-en.fi.tok.300.pkl' 324 | source = '/scratch/jc7382/data/wmt15/fien/dev/newsdev2015-enfi-src.en.tok.bpe' 325 | #source = '/scratch/jc7382/data/wmt15/fien/test/newstest2015-fien-src.en.tok.bpe' 326 | 327 | args = parser.parse_args() 328 | 329 | main(models, dictionary, dictionary_target, source, 330 
| args.saveto, k=args.k, normalize=args.n, n_process=args.p, 331 | encoder_chr_level=args.enc_c, 332 | decoder_chr_level=args.dec_c, 333 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 334 | -------------------------------------------------------------------------------- /character_base/wmt15_csen_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/csen/bpe2char_two_layer_gru_decoder/0328/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 302 14 | n_words_src 21907 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_cs-en.en.tok.bpe 30 | target_dataset all_cs-en.cs.tok 31 | valid_source_dataset newstest2013-src.en.tok.bpe 32 | valid_target_dataset newstest2013-ref.cs.tok 33 | source_dictionary all_cs-en.en.tok.bpe.word.pkl 34 | target_dictionary all_cs-en.cs.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_base/wmt15_deen_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /raid/chungjun/acl2016/wmt15/deen/bpe2char_two_layer_gru_decoder/0417/ 2 | train_data_path /raid/chungjun/data/wmt15/deen/train/ 3 | dev_data_path /raid/chungjun/data/wmt15/deen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 302 14 | n_words_src 24440 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_de-en.en.tok.bpe.shuf 30 | target_dataset all_de-en.de.tok.shuf 31 | valid_source_dataset newstest2013.en.tok.bpe 32 | valid_target_dataset newstest2013.de.tok 33 | source_dictionary all_de-en.en.tok.bpe.word.pkl 34 | target_dictionary all_de-en.de.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_base/wmt15_fien_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/fien/bpe2char_two_layer_gru_decoder/0328/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 292 14 | n_words_src 20174 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | 
maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_fi-en.en.tok.bpe.shuf 30 | target_dataset all_fi-en.fi.tok.shuf 31 | valid_source_dataset newsdev2015-enfi-src.en.tok.bpe 32 | valid_target_dataset newsdev2015-enfi-ref.fi.tok 33 | source_dictionary all_fi-en.en.tok.bpe.word.pkl 34 | target_dictionary all_fi-en.fi.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_base/wmt15_ruen_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/ruen/bpe2char_two_layer_gru_decoder/0328/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 302 14 | n_words_src 22030 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_ru-en.en.tok.bpe 30 | target_dataset all_ru-en.ru.tok 31 | valid_source_dataset newstest2013-src.en.tok.bpe 32 | valid_target_dataset newstest2013-ref.ru.tok 33 | source_dictionary all_ru-en.en.tok.bpe.word.pkl 34 | target_dictionary all_ru-en.ru.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_biscale/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nyu-dl/dl4mt-cdec/e738dc7235cb2819ad2b4e8e5837e97b2fb41de2/character_biscale/__init__.py -------------------------------------------------------------------------------- /character_biscale/train_wmt15_csen_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_biscale import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'biscale_decoder': ('param_init_biscale_decoder', 11 | 'biscale_decoder_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_biscale_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | 
n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_csen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_biscale/train_wmt15_deen_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_biscale import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'biscale_decoder': ('param_init_biscale_decoder', 11 | 'biscale_decoder_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_biscale_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | 
maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_biscale/train_wmt15_deen_attc_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_biscale_attc import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'biscale_decoder_attc': ('param_init_biscale_decoder_attc', 11 | 'biscale_decoder_attc_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_biscale_decoder_attc_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | 
sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample, 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_biscale/train_wmt15_deen_both_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_biscale_both import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'biscale_decoder_both': ('param_init_biscale_decoder_both', 11 | 'biscale_decoder_both_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_biscale_decoder_both_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | 
clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample, 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_biscale/train_wmt15_fien_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_biscale import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'biscale_decoder': ('param_init_biscale_decoder', 11 | 'biscale_decoder_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_biscale_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | 
use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_fien_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_biscale/train_wmt15_ruen_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_biscale import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'biscale_decoder': ('param_init_biscale_decoder', 11 | 'biscale_decoder_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_biscale_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | 
init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_ruen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_biscale/translate.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from char_biscale import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | 
target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | queue.put((idx, x)) 126 | return idx+1 127 | 128 | def _finish_processes(): 129 | for midx in xrange(n_process): 130 | queue.put(None) 131 | 132 | def _retrieve_jobs(n_samples): 133 | trans = [None] * n_samples 134 | for idx in xrange(n_samples): 135 | resp = rqueue.get() 136 | trans[resp[0]] = resp[1] 137 | if numpy.mod(idx, 10) == 0: 138 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 139 | return trans 140 | 141 | print 'Translating ', source_file, '...' 142 | n_samples = _send_jobs(source_file) 143 | trans = _seqs2words(_retrieve_jobs(n_samples)) 144 | _finish_processes() 145 | with open(saveto, 'w') as f: 146 | if decoder_bpe_to_tok: 147 | print >>f, '\n'.join(trans).replace('@@ ', '') 148 | else: 149 | print >>f, '\n'.join(trans) 150 | print 'Done' 151 | 152 | 153 | if __name__ == "__main__": 154 | parser = argparse.ArgumentParser() 155 | parser.add_argument('-k', type=int, default=5) 156 | parser.add_argument('-p', type=int, default=5) 157 | parser.add_argument('-n', action="store_true", default=False) 158 | parser.add_argument('-bpe', action="store_true", default=False) 159 | parser.add_argument('-enc_c', action="store_true", default=False) 160 | parser.add_argument('-dec_c', action="store_true", default=False) 161 | parser.add_argument('-utf8', action="store_true", default=False) 162 | parser.add_argument('model', type=str) 163 | parser.add_argument('dictionary', type=str) 164 | parser.add_argument('dictionary_target', type=str) 165 | parser.add_argument('source', type=str) 166 | parser.add_argument('saveto', type=str) 167 | 168 | args = parser.parse_args() 169 | 170 | main(args.model, args.dictionary, args.dictionary_target, args.source, 171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 172 | encoder_chr_level=args.enc_c, 173 | decoder_chr_level=args.dec_c, 174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 175 | -------------------------------------------------------------------------------- /character_biscale/translate_attc.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 
3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from char_biscale_attc import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | #print '==============================' 126 | 
#print line 127 | #print '------------------------------' 128 | #print ' '.join([word_idict[wx] for wx in x]) 129 | #print '==============================' 130 | queue.put((idx, x)) 131 | return idx+1 132 | 133 | def _finish_processes(): 134 | for midx in xrange(n_process): 135 | queue.put(None) 136 | 137 | def _retrieve_jobs(n_samples): 138 | trans = [None] * n_samples 139 | for idx in xrange(n_samples): 140 | resp = rqueue.get() 141 | trans[resp[0]] = resp[1] 142 | if numpy.mod(idx, 10) == 0: 143 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 144 | return trans 145 | 146 | print 'Translating ', source_file, '...' 147 | n_samples = _send_jobs(source_file) 148 | trans = _seqs2words(_retrieve_jobs(n_samples)) 149 | _finish_processes() 150 | with open(saveto, 'w') as f: 151 | print >>f, '\n'.join(trans) 152 | print 'Done' 153 | 154 | 155 | if __name__ == "__main__": 156 | parser = argparse.ArgumentParser() 157 | parser.add_argument('-k', type=int, default=5) 158 | parser.add_argument('-p', type=int, default=5) 159 | parser.add_argument('-n', action="store_true", default=False) 160 | parser.add_argument('-enc_c', action="store_true", default=False) 161 | parser.add_argument('-dec_c', action="store_true", default=False) 162 | parser.add_argument('-utf8', action="store_true", default=False) 163 | parser.add_argument('model', type=str) 164 | parser.add_argument('dictionary', type=str) 165 | parser.add_argument('dictionary_target', type=str) 166 | parser.add_argument('source', type=str) 167 | parser.add_argument('saveto', type=str) 168 | 169 | args = parser.parse_args() 170 | 171 | main(args.model, args.dictionary, args.dictionary_target, args.source, 172 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 173 | encoder_chr_level=args.enc_c, 174 | decoder_chr_level=args.dec_c, 175 | utf8=args.utf8) 176 | -------------------------------------------------------------------------------- /character_biscale/translate_both.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from char_biscale_both import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 
33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | queue.put((idx, x)) 126 | return idx+1 127 | 128 | def _finish_processes(): 129 | for midx in xrange(n_process): 130 | queue.put(None) 131 | 132 | def _retrieve_jobs(n_samples): 133 | trans = [None] * n_samples 134 | for idx in xrange(n_samples): 135 | resp = rqueue.get() 136 | trans[resp[0]] = resp[1] 137 | if numpy.mod(idx, 10) == 0: 138 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 139 | return trans 140 | 141 | print 'Translating ', source_file, '...' 
142 | n_samples = _send_jobs(source_file) 143 | trans = _seqs2words(_retrieve_jobs(n_samples)) 144 | _finish_processes() 145 | with open(saveto, 'w') as f: 146 | if decoder_bpe_to_tok: 147 | print >>f, '\n'.join(trans).replace('@@ ', '') 148 | else: 149 | print >>f, '\n'.join(trans) 150 | print 'Done' 151 | 152 | 153 | if __name__ == "__main__": 154 | parser = argparse.ArgumentParser() 155 | parser.add_argument('-k', type=int, default=5) 156 | parser.add_argument('-p', type=int, default=5) 157 | parser.add_argument('-n', action="store_true", default=False) 158 | parser.add_argument('-bpe', action="store_true", default=False) 159 | parser.add_argument('-enc_c', action="store_true", default=False) 160 | parser.add_argument('-dec_c', action="store_true", default=False) 161 | parser.add_argument('-utf8', action="store_true", default=False) 162 | parser.add_argument('model', type=str) 163 | parser.add_argument('dictionary', type=str) 164 | parser.add_argument('dictionary_target', type=str) 165 | parser.add_argument('source', type=str) 166 | parser.add_argument('saveto', type=str) 167 | 168 | args = parser.parse_args() 169 | 170 | main(args.model, args.dictionary, args.dictionary_target, args.source, 171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 172 | encoder_chr_level=args.enc_c, 173 | decoder_chr_level=args.dec_c, 174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 175 | -------------------------------------------------------------------------------- /character_biscale/wmt15_csen_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/csen/bpe2char_seg_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 302 14 | n_words_src 21907 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_cs-en.en.tok.bpe 30 | target_dataset all_cs-en.cs.tok 31 | valid_source_dataset newstest2013-src.en.tok.bpe 32 | valid_target_dataset newstest2013-ref.cs.tok 33 | source_dictionary all_cs-en.en.tok.bpe.word.pkl 34 | target_dictionary all_cs-en.cs.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_biscale/wmt15_deen_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/deen/bpe2char_seg_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 302 14 | n_words_src 24440 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 
500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_de-en.en.tok.bpe.shuf 30 | target_dataset all_de-en.de.tok.shuf 31 | valid_source_dataset newstest2013.en.tok.bpe 32 | valid_target_dataset newstest2013.de.tok 33 | source_dictionary all_de-en.en.tok.bpe.word.pkl 34 | target_dictionary all_de-en.de.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_biscale/wmt15_fien_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/fien/bpe2char_seg_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 292 14 | n_words_src 20174 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_fi-en.en.tok.bpe.shuf 30 | target_dataset all_fi-en.fi.tok.shuf 31 | valid_source_dataset newsdev2015-enfi-src.en.tok.bpe 32 | valid_target_dataset newsdev2015-enfi-ref.fi.tok 33 | source_dictionary all_fi-en.en.tok.bpe.word.pkl 34 | target_dictionary all_fi-en.fi.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_biscale/wmt15_ruen_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/ruen/bpe2char_seg_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 302 14 | n_words_src 22030 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_ru-en.en.tok.bpe 30 | target_dataset all_ru-en.ru.tok 31 | valid_source_dataset newstest2013-src.en.tok.bpe 32 | valid_target_dataset newstest2013-ref.ru.tok 33 | source_dictionary all_ru-en.en.tok.bpe.word.pkl 34 | target_dictionary all_ru-en.ru.tok.300.pkl 35 | -------------------------------------------------------------------------------- /data_iterator.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | import numpy 3 | import os 4 | import random 5 | 6 | import cPickle 7 | import gzip 8 | import codecs 9 | 10 | from tempfile import mkstemp 11 | 12 | 13 | def fopen(filename, mode='r'): 14 | if filename.endswith('.gz'): 15 | return gzip.open(filename, mode) 16 | return open(filename, mode) 17 | 18 | 19 | class TextIterator: 20 | """Simple Bitext iterator.""" 21 | def __init__(self, 22 | source, source_dict, 23 | target=None, target_dict=None, 24 | 
source_word_level=0, 25 | target_word_level=0, 26 | batch_size=128, 27 | job_id=0, 28 | sort_size=20, 29 | n_words_source=-1, 30 | n_words_target=-1, 31 | shuffle_per_epoch=False): 32 | self.source_file = source 33 | self.target_file = target 34 | self.source = fopen(source, 'r') 35 | with open(source_dict, 'rb') as f: 36 | self.source_dict = cPickle.load(f) 37 | if target is not None: 38 | self.target = fopen(target, 'r') 39 | if target_dict is not None: 40 | with open(target_dict, 'rb') as f: 41 | self.target_dict = cPickle.load(f) 42 | else: 43 | self.target = None 44 | 45 | self.source_word_level = source_word_level 46 | self.target_word_level = target_word_level 47 | self.batch_size = batch_size 48 | 49 | self.n_words_source = n_words_source 50 | self.n_words_target = n_words_target 51 | self.shuffle_per_epoch = shuffle_per_epoch 52 | 53 | self.source_buffer = [] 54 | self.target_buffer = [] 55 | self.k = batch_size * sort_size 56 | 57 | self.end_of_data = False 58 | self.job_id = job_id 59 | 60 | def __iter__(self): 61 | return self 62 | 63 | def reset(self): 64 | if self.shuffle_per_epoch: 65 | # close current files 66 | self.source.close() 67 | if self.target is None: 68 | self.shuffle([self.source_file]) 69 | self.source = fopen(self.source_file + '.reshuf_%d' % self.job_id, 'r') 70 | else: 71 | self.target.close() 72 | # shuffle *original* source files, 73 | self.shuffle([self.source_file, self.target_file]) 74 | # open newly 're-shuffled' file as input 75 | self.source = fopen(self.source_file + '.reshuf_%d' % self.job_id, 'r') 76 | self.target = fopen(self.target_file + '.reshuf_%d' % self.job_id, 'r') 77 | else: 78 | self.source.seek(0) 79 | if self.target is not None: 80 | self.target.seek(0) 81 | 82 | @staticmethod 83 | def shuffle(files): 84 | tf_os, tpath = mkstemp() 85 | tf = open(tpath, 'w') 86 | fds = [open(ff) for ff in files] 87 | for l in fds[0]: 88 | lines = [l.strip()] + [ff.readline().strip() for ff in fds[1:]] 89 | print >>tf, "|||".join(lines) 90 | [ff.close() for ff in fds] 91 | tf.close() 92 | tf = open(tpath, 'r') 93 | lines = tf.readlines() 94 | random.shuffle(lines) 95 | fds = [open(ff+'.reshuf','w') for ff in files] 96 | for l in lines: 97 | s = l.strip().split('|||') 98 | for ii, fd in enumerate(fds): 99 | print >>fd, s[ii] 100 | [ff.close() for ff in fds] 101 | os.remove(tpath) 102 | return 103 | 104 | def next(self): 105 | if self.end_of_data: 106 | self.end_of_data = False 107 | self.reset() 108 | raise StopIteration 109 | 110 | source = [] 111 | target = [] 112 | 113 | # fill buffer, if it's empty 114 | if self.target is not None: 115 | assert len(self.source_buffer) == len(self.target_buffer), 'Buffer size mismatch!' 
116 | 117 | if len(self.source_buffer) == 0: 118 | for k_ in xrange(self.k): 119 | ss = self.source.readline() 120 | 121 | if ss == "": 122 | break 123 | 124 | if self.source_word_level: 125 | ss = ss.strip().split() 126 | else: 127 | ss = ss.strip() 128 | ss = list(ss.decode('utf8')) 129 | 130 | self.source_buffer.append(ss) 131 | 132 | if self.target is not None: 133 | tt = self.target.readline() 134 | 135 | if tt == "": 136 | break 137 | 138 | if self.target_word_level: 139 | tt = tt.strip().split() 140 | else: 141 | tt = tt.strip() 142 | tt = list(tt.decode('utf8')) 143 | 144 | self.target_buffer.append(tt) 145 | 146 | if self.target is not None: 147 | # sort by target buffer 148 | tlen = numpy.array([len(t) for t in self.target_buffer]) 149 | tidx = tlen.argsort() 150 | _sbuf = [self.source_buffer[i] for i in tidx] 151 | _tbuf = [self.target_buffer[i] for i in tidx] 152 | self.target_buffer = _tbuf 153 | else: 154 | slen = numpy.array([len(s) for s in self.source_buffer]) 155 | sidx = slen.argsort() 156 | _sbuf = [self.source_buffer[i] for i in sidx] 157 | 158 | self.source_buffer = _sbuf 159 | 160 | if self.target is not None: 161 | if len(self.source_buffer) == 0 or len(self.target_buffer) == 0: 162 | self.end_of_data = False 163 | self.reset() 164 | raise StopIteration 165 | elif len(self.source_buffer) == 0: 166 | self.end_of_data = False 167 | self.reset() 168 | raise StopIteration 169 | 170 | try: 171 | # actual work here 172 | while True: 173 | # read from source file and map to word index 174 | try: 175 | ss_ = self.source_buffer.pop() 176 | except IndexError: 177 | break 178 | ss = [self.source_dict[w] if w in self.source_dict else 1 for w in ss_] 179 | if self.n_words_source > 0: 180 | ss = [w if w < self.n_words_source else 1 for w in ss] 181 | source.append(ss) 182 | if self.target is not None: 183 | # read from target file and map to word index 184 | tt_ = self.target_buffer.pop() 185 | tt = [self.target_dict[w] if w in self.target_dict else 1 for w in tt_] 186 | if self.n_words_target > 0: 187 | tt = [w if w < self.n_words_target else 1 for w in tt] 188 | target.append(tt) 189 | 190 | if len(source) >= self.batch_size: 191 | break 192 | except IOError: 193 | self.end_of_data = True 194 | 195 | if self.target is not None: 196 | if len(source) <= 0 or len(target) <= 0: 197 | self.end_of_data = False 198 | self.reset() 199 | raise StopIteration 200 | return source, target 201 | else: 202 | if len(source) <= 0: 203 | self.end_of_data = False 204 | self.reset() 205 | raise StopIteration 206 | return source 207 | -------------------------------------------------------------------------------- /preprocess/build_dictionary_char.py: -------------------------------------------------------------------------------- 1 | import cPickle as pkl 2 | import fileinput 3 | import numpy 4 | import sys 5 | import codecs 6 | 7 | from collections import OrderedDict 8 | 9 | 10 | short_list = 300 11 | 12 | def main(): 13 | for filename in sys.argv[1:]: 14 | print 'Processing', filename 15 | word_freqs = OrderedDict() 16 | 17 | with open(filename, 'r') as f: 18 | for line in f: 19 | words_in = line.strip() 20 | words_in = list(words_in.decode('utf8')) 21 | for w in words_in: 22 | if w not in word_freqs: 23 | word_freqs[w] = 0 24 | word_freqs[w] += 1 25 | 26 | words = word_freqs.keys() 27 | freqs = word_freqs.values() 28 | 29 | sorted_idx = numpy.argsort(freqs) 30 | sorted_words = [words[ii] for ii in sorted_idx[::-1]] 31 | 32 | worddict = OrderedDict() 33 | worddict['eos'] = 0 34 | worddict['UNK'] 
= 1 35 | 36 | if short_list is not None: 37 | for ii in xrange(min(short_list, len(sorted_words))): 38 | worddict[sorted_words[ii]] = ii + 2 39 | else: 40 | for ii, ww in enumerate(sorted_words): 41 | worddict[ww] = ii + 2 42 | 43 | with open('%s.%d.pkl' % (filename, short_list), 'wb') as f: 44 | pkl.dump(worddict, f) 45 | 46 | f.close() 47 | print 'Done' 48 | print len(worddict) 49 | 50 | if __name__ == '__main__': 51 | main() 52 | -------------------------------------------------------------------------------- /preprocess/build_dictionary_word.py: -------------------------------------------------------------------------------- 1 | import cPickle as pkl 2 | import fileinput 3 | import numpy 4 | import sys 5 | import codecs 6 | 7 | from collections import OrderedDict 8 | 9 | 10 | def main(): 11 | for filename in sys.argv[1:]: 12 | print 'Processing', filename 13 | word_freqs = OrderedDict() 14 | 15 | with open(filename, 'r') as f: 16 | for line in f: 17 | words_in = line.strip().split(' ') 18 | for w in words_in: 19 | if w not in word_freqs: 20 | word_freqs[w] = 0 21 | word_freqs[w] += 1 22 | 23 | words = word_freqs.keys() 24 | freqs = word_freqs.values() 25 | 26 | sorted_idx = numpy.argsort(freqs) 27 | sorted_words = [words[ii] for ii in sorted_idx[::-1]] 28 | 29 | worddict = OrderedDict() 30 | worddict['eos'] = 0 31 | worddict['UNK'] = 1 32 | 33 | for ii, ww in enumerate(sorted_words): 34 | worddict[ww] = ii + 2 35 | 36 | with open('%s.word.pkl' % filename, 'wb') as f: 37 | pkl.dump(worddict, f) 38 | 39 | f.close() 40 | print 'Done' 41 | print len(worddict) 42 | 43 | if __name__ == '__main__': 44 | main() 45 | -------------------------------------------------------------------------------- /preprocess/clean_tags.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import re 3 | 4 | from_file = sys.argv[1] 5 | to_file = sys.argv[2] 6 | to_file_out = open(to_file, "w") 7 | 8 | regex = "<.*>" 9 | 10 | tag_match = re.compile(regex) 11 | matched_lines = [] 12 | 13 | with open(from_file) as from_file: 14 | content = from_file.readlines() 15 | for line in content: 16 | if (tag_match.match(line)): 17 | pass 18 | else: 19 | matched_lines.append(line) 20 | 21 | matched_lines = "".join(matched_lines) 22 | to_file_out.write(matched_lines) 23 | to_file_out.close() 24 | 25 | -------------------------------------------------------------------------------- /preprocess/fix_appo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # {1} is the directory name 3 | 4 | 5 | for f in ${1}/*.xml 6 | do 7 | cat $f | grep "" | sed "s/’/'/g" | sed "s/“/\"/g" | sed "s/”/\"/g" > ${f}.fixed 8 | done 9 | 10 | -------------------------------------------------------------------------------- /preprocess/merge.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | SRC=$1 5 | TRG=$2 6 | 7 | FSRC=all_${1}-${2}.${1} 8 | FTRG=all_${1}-${2}.${2} 9 | 10 | echo "" > $FSRC 11 | for F in *${1}-${2}.${1} 12 | do 13 | if [ "$F" = "$FSRC" ]; then 14 | echo "pass" 15 | else 16 | cat $F >> $FSRC 17 | fi 18 | done 19 | 20 | 21 | echo "" > $FTRG 22 | for F in *${1}-${2}.${2} 23 | do 24 | if [ "$F" = "$FTRG" ]; then 25 | echo "pass" 26 | else 27 | cat $F >> $FTRG 28 | fi 29 | done 30 | -------------------------------------------------------------------------------- /preprocess/multi-bleu.perl: -------------------------------------------------------------------------------- 1 | 
#!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 5 | 6 | # $Id$ 7 | use warnings; 8 | use strict; 9 | 10 | my $lowercase = 0; 11 | if ($ARGV[0] eq "-lc") { 12 | $lowercase = 1; 13 | shift; 14 | } 15 | 16 | my $stem = $ARGV[0]; 17 | if (!defined $stem) { 18 | print STDERR "usage: multi-bleu.pl [-lc] reference < hypothesis\n"; 19 | print STDERR "Reads the references from reference or reference0, reference1, ...\n"; 20 | exit(1); 21 | } 22 | 23 | $stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0"; 24 | 25 | my @REF; 26 | my $ref=0; 27 | while(-e "$stem$ref") { 28 | &add_to_ref("$stem$ref",\@REF); 29 | $ref++; 30 | } 31 | &add_to_ref($stem,\@REF) if -e $stem; 32 | die("ERROR: could not find reference file $stem") unless scalar @REF; 33 | 34 | sub add_to_ref { 35 | my ($file,$REF) = @_; 36 | my $s=0; 37 | open(REF,$file) or die "Can't read $file"; 38 | while(<REF>) { 39 | chop; 40 | push @{$$REF[$s++]}, $_; 41 | } 42 | close(REF); 43 | } 44 | 45 | my(@CORRECT,@TOTAL,$length_translation,$length_reference); 46 | my $s=0; 47 | while(<STDIN>) { 48 | chop; 49 | $_ = lc if $lowercase; 50 | my @WORD = split; 51 | my %REF_NGRAM = (); 52 | my $length_translation_this_sentence = scalar(@WORD); 53 | my ($closest_diff,$closest_length) = (9999,9999); 54 | foreach my $reference (@{$REF[$s]}) { 55 | # print "$s $_ <=> $reference\n"; 56 | $reference = lc($reference) if $lowercase; 57 | my @WORD = split(' ',$reference); 58 | my $length = scalar(@WORD); 59 | my $diff = abs($length_translation_this_sentence-$length); 60 | if ($diff < $closest_diff) { 61 | $closest_diff = $diff; 62 | $closest_length = $length; 63 | # print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." = abs($length_translation_this_sentence-$length), setting len: $closest_length\n"; 64 | } elsif ($diff == $closest_diff) { 65 | $closest_length = $length if $length < $closest_length; 66 | # from two references with the same closeness to me 67 | # take the *shorter* into account, not the "first" one. 68 | } 69 | for(my $n=1;$n<=4;$n++) { 70 | my %REF_NGRAM_N = (); 71 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 72 | my $ngram = "$n"; 73 | for(my $w=0;$w<$n;$w++) { 74 | $ngram .= " ".$WORD[$start+$w]; 75 | } 76 | $REF_NGRAM_N{$ngram}++; 77 | } 78 | foreach my $ngram (keys %REF_NGRAM_N) { 79 | if (!defined($REF_NGRAM{$ngram}) || 80 | $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) { 81 | $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram}; 82 | # print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram}
\n"; 83 | } 84 | } 85 | } 86 | } 87 | $length_translation += $length_translation_this_sentence; 88 | $length_reference += $closest_length; 89 | for(my $n=1;$n<=4;$n++) { 90 | my %T_NGRAM = (); 91 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 92 | my $ngram = "$n"; 93 | for(my $w=0;$w<$n;$w++) { 94 | $ngram .= " ".$WORD[$start+$w]; 95 | } 96 | $T_NGRAM{$ngram}++; 97 | } 98 | foreach my $ngram (keys %T_NGRAM) { 99 | $ngram =~ /^(\d+) /; 100 | my $n = $1; 101 | # my $corr = 0; 102 | # print "$i e $ngram $T_NGRAM{$ngram}
\n"; 103 | $TOTAL[$n] += $T_NGRAM{$ngram}; 104 | if (defined($REF_NGRAM{$ngram})) { 105 | if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) { 106 | $CORRECT[$n] += $T_NGRAM{$ngram}; 107 | # $corr = $T_NGRAM{$ngram}; 108 | # print "$i e correct1 $T_NGRAM{$ngram}
\n"; 109 | } 110 | else { 111 | $CORRECT[$n] += $REF_NGRAM{$ngram}; 112 | # $corr = $REF_NGRAM{$ngram}; 113 | # print "$i e correct2 $REF_NGRAM{$ngram}
\n"; 114 | } 115 | } 116 | # $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram}; 117 | # print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n" 118 | } 119 | } 120 | $s++; 121 | } 122 | my $brevity_penalty = 1; 123 | my $bleu = 0; 124 | 125 | my @bleu=(); 126 | 127 | for(my $n=1;$n<=4;$n++) { 128 | if (defined ($TOTAL[$n]) && defined ($CORRECT[$n]) && $TOTAL[$n] > 0){ 129 | $bleu[$n]=($TOTAL[$n]>0)?$CORRECT[$n]/$TOTAL[$n]:0; 130 | # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n"; 131 | }else{ 132 | $bleu[$n]=0; 133 | } 134 | } 135 | 136 | if ($length_reference==0){ 137 | printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n"; 138 | exit(1); 139 | } 140 | 141 | if ($length_translation<$length_reference) { 142 | $brevity_penalty = exp(1-$length_reference/$length_translation); 143 | } 144 | $bleu = $brevity_penalty * exp((my_log( $bleu[1] ) + 145 | my_log( $bleu[2] ) + 146 | my_log( $bleu[3] ) + 147 | my_log( $bleu[4] ) ) / 4) ; 148 | printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n", 149 | 100*$bleu, 150 | 100*$bleu[1], 151 | 100*$bleu[2], 152 | 100*$bleu[3], 153 | 100*$bleu[4], 154 | $brevity_penalty, 155 | $length_translation / $length_reference, 156 | $length_translation, 157 | $length_reference; 158 | 159 | sub my_log { 160 | return -9999999999 unless $_[0]; 161 | return log($_[0]); 162 | } 163 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/README.txt: -------------------------------------------------------------------------------- 1 | The language suffix can be found here: 2 | 3 | http://www.loc.gov/standards/iso639-2/php/code_list.php 4 | 5 | This code includes data from Daniel Naber's Language Tools (czech abbreviations). 6 | This code includes data from czech wiktionary (also czech abbreviations). 7 | 8 | 9 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.ca: -------------------------------------------------------------------------------- 1 | Dr 2 | Dra 3 | pàg 4 | p 5 | c 6 | av 7 | Sr 8 | Sra 9 | adm 10 | esq 11 | Prof 12 | S.A 13 | S.L 14 | p.e 15 | ptes 16 | Sta 17 | St 18 | pl 19 | màx 20 | cast 21 | dir 22 | nre 23 | fra 24 | admdora 25 | Emm 26 | Excma 27 | espf 28 | dc 29 | admdor 30 | tel 31 | angl 32 | aprox 33 | ca 34 | dept 35 | dj 36 | dl 37 | dt 38 | ds 39 | dg 40 | dv 41 | ed 42 | entl 43 | al 44 | i.e 45 | maj 46 | smin 47 | n 48 | núm 49 | pta 50 | A 51 | B 52 | C 53 | D 54 | E 55 | F 56 | G 57 | H 58 | I 59 | J 60 | K 61 | L 62 | M 63 | N 64 | O 65 | P 66 | Q 67 | R 68 | S 69 | T 70 | U 71 | V 72 | W 73 | X 74 | Y 75 | Z 76 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.cs: -------------------------------------------------------------------------------- 1 | Bc 2 | BcA 3 | Ing 4 | Ing.arch 5 | MUDr 6 | MVDr 7 | MgA 8 | Mgr 9 | JUDr 10 | PhDr 11 | RNDr 12 | PharmDr 13 | ThLic 14 | ThDr 15 | Ph.D 16 | Th.D 17 | prof 18 | doc 19 | CSc 20 | DrSc 21 | dr. h. 
c 22 | PaedDr 23 | Dr 24 | PhMr 25 | DiS 26 | abt 27 | ad 28 | a.i 29 | aj 30 | angl 31 | anon 32 | apod 33 | atd 34 | atp 35 | aut 36 | bd 37 | biogr 38 | b.m 39 | b.p 40 | b.r 41 | cca 42 | cit 43 | cizojaz 44 | c.k 45 | col 46 | čes 47 | čín 48 | čj 49 | ed 50 | facs 51 | fasc 52 | fol 53 | fot 54 | franc 55 | h.c 56 | hist 57 | hl 58 | hrsg 59 | ibid 60 | il 61 | ind 62 | inv.č 63 | jap 64 | jhdt 65 | jv 66 | koed 67 | kol 68 | korej 69 | kl 70 | krit 71 | lat 72 | lit 73 | m.a 74 | maď 75 | mj 76 | mp 77 | násl 78 | např 79 | nepubl 80 | něm 81 | no 82 | nr 83 | n.s 84 | okr 85 | odd 86 | odp 87 | obr 88 | opr 89 | orig 90 | phil 91 | pl 92 | pokrač 93 | pol 94 | port 95 | pozn 96 | př.kr 97 | př.n.l 98 | přel 99 | přeprac 100 | příl 101 | pseud 102 | pt 103 | red 104 | repr 105 | resp 106 | revid 107 | rkp 108 | roč 109 | roz 110 | rozš 111 | samost 112 | sect 113 | sest 114 | seš 115 | sign 116 | sl 117 | srv 118 | stol 119 | sv 120 | šk 121 | šk.ro 122 | špan 123 | tab 124 | t.č 125 | tis 126 | tj 127 | tř 128 | tzv 129 | univ 130 | uspoř 131 | vol 132 | vl.jm 133 | vs 134 | vyd 135 | vyobr 136 | zal 137 | zejm 138 | zkr 139 | zprac 140 | zvl 141 | n.p 142 | např 143 | než 144 | MUDr 145 | abl 146 | absol 147 | adj 148 | adv 149 | ak 150 | ak. sl 151 | akt 152 | alch 153 | amer 154 | anat 155 | angl 156 | anglosas 157 | arab 158 | arch 159 | archit 160 | arg 161 | astr 162 | astrol 163 | att 164 | bás 165 | belg 166 | bibl 167 | biol 168 | boh 169 | bot 170 | bulh 171 | círk 172 | csl 173 | č 174 | čas 175 | čes 176 | dat 177 | děj 178 | dep 179 | dět 180 | dial 181 | dór 182 | dopr 183 | dosl 184 | ekon 185 | epic 186 | etnonym 187 | eufem 188 | f 189 | fam 190 | fem 191 | fil 192 | film 193 | form 194 | fot 195 | fr 196 | fut 197 | fyz 198 | gen 199 | geogr 200 | geol 201 | geom 202 | germ 203 | gram 204 | hebr 205 | herald 206 | hist 207 | hl 208 | hovor 209 | hud 210 | hut 211 | chcsl 212 | chem 213 | ie 214 | imp 215 | impf 216 | ind 217 | indoevr 218 | inf 219 | instr 220 | interj 221 | ión 222 | iron 223 | it 224 | kanad 225 | katalán 226 | klas 227 | kniž 228 | komp 229 | konj 230 | 231 | konkr 232 | kř 233 | kuch 234 | lat 235 | lék 236 | les 237 | lid 238 | lit 239 | liturg 240 | lok 241 | log 242 | m 243 | mat 244 | meteor 245 | metr 246 | mod 247 | ms 248 | mysl 249 | n 250 | náb 251 | námoř 252 | neklas 253 | něm 254 | nesklon 255 | nom 256 | ob 257 | obch 258 | obyč 259 | ojed 260 | opt 261 | part 262 | pas 263 | pejor 264 | pers 265 | pf 266 | pl 267 | plpf 268 | 269 | práv 270 | prep 271 | předl 272 | přivl 273 | r 274 | rcsl 275 | refl 276 | reg 277 | rkp 278 | ř 279 | řec 280 | s 281 | samohl 282 | sg 283 | sl 284 | souhl 285 | spec 286 | srov 287 | stfr 288 | střv 289 | stsl 290 | subj 291 | subst 292 | superl 293 | sv 294 | sz 295 | táz 296 | tech 297 | telev 298 | teol 299 | trans 300 | typogr 301 | var 302 | vedl 303 | verb 304 | vl. jm 305 | voj 306 | vok 307 | vůb 308 | vulg 309 | výtv 310 | vztaž 311 | zahr 312 | zájm 313 | zast 314 | zejm 315 | 316 | zeměd 317 | zkr 318 | zř 319 | mj 320 | dl 321 | atp 322 | sport 323 | Mgr 324 | horn 325 | MVDr 326 | JUDr 327 | RSDr 328 | Bc 329 | PhDr 330 | ThDr 331 | Ing 332 | aj 333 | apod 334 | PharmDr 335 | pomn 336 | ev 337 | slang 338 | nprap 339 | odp 340 | dop 341 | pol 342 | st 343 | stol 344 | p. n. l 345 | před n. l 346 | n. l 347 | př. Kr 348 | po Kr 349 | př. n. l 350 | odd 351 | RNDr 352 | tzv 353 | atd 354 | tzn 355 | resp 356 | tj 357 | p 358 | br 359 | č. j 360 | čj 361 | č. p 362 | čp 363 | a. 
s 364 | s. r. o 365 | spol. s r. o 366 | p. o 367 | s. p 368 | v. o. s 369 | k. s 370 | o. p. s 371 | o. s 372 | v. r 373 | v z 374 | ml 375 | vč 376 | kr 377 | mld 378 | hod 379 | popř 380 | ap 381 | event 382 | rus 383 | slov 384 | rum 385 | švýc 386 | P. T 387 | zvl 388 | hor 389 | dol 390 | S.O.S -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.de: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | #no german words end in single lower-case letters, so we throw those in too. 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | 61 | #Roman Numerals. A dot after one of these is not a sentence break in German. 62 | I 63 | II 64 | III 65 | IV 66 | V 67 | VI 68 | VII 69 | VIII 70 | IX 71 | X 72 | XI 73 | XII 74 | XIII 75 | XIV 76 | XV 77 | XVI 78 | XVII 79 | XVIII 80 | XIX 81 | XX 82 | i 83 | ii 84 | iii 85 | iv 86 | v 87 | vi 88 | vii 89 | viii 90 | ix 91 | x 92 | xi 93 | xii 94 | xiii 95 | xiv 96 | xv 97 | xvi 98 | xvii 99 | xviii 100 | xix 101 | xx 102 | 103 | #Titles and Honorifics 104 | Adj 105 | Adm 106 | Adv 107 | Asst 108 | Bart 109 | Bldg 110 | Brig 111 | Bros 112 | Capt 113 | Cmdr 114 | Col 115 | Comdr 116 | Con 117 | Corp 118 | Cpl 119 | DR 120 | Dr 121 | Ens 122 | Gen 123 | Gov 124 | Hon 125 | Hosp 126 | Insp 127 | Lt 128 | MM 129 | MR 130 | MRS 131 | MS 132 | Maj 133 | Messrs 134 | Mlle 135 | Mme 136 | Mr 137 | Mrs 138 | Ms 139 | Msgr 140 | Op 141 | Ord 142 | Pfc 143 | Ph 144 | Prof 145 | Pvt 146 | Rep 147 | Reps 148 | Res 149 | Rev 150 | Rt 151 | Sen 152 | Sens 153 | Sfc 154 | Sgt 155 | Sr 156 | St 157 | Supt 158 | Surg 159 | 160 | #Misc symbols 161 | Mio 162 | Mrd 163 | bzw 164 | v 165 | vs 166 | usw 167 | d.h 168 | z.B 169 | u.a 170 | etc 171 | Mrd 172 | MwSt 173 | ggf 174 | d.J 175 | D.h 176 | m.E 177 | vgl 178 | I.F 179 | z.T 180 | sogen 181 | ff 182 | u.E 183 | g.U 184 | g.g.A 185 | c.-à-d 186 | Buchst 187 | u.s.w 188 | sog 189 | u.ä 190 | Std 191 | evtl 192 | Zt 193 | Chr 194 | u.U 195 | o.ä 196 | Ltd 197 | b.A 198 | z.Zt 199 | spp 200 | sen 201 | SA 202 | k.o 203 | jun 204 | i.H.v 205 | dgl 206 | dergl 207 | Co 208 | zzt 209 | usf 210 | s.p.a 211 | Dkr 212 | Corp 213 | bzgl 214 | BSE 215 | 216 | #Number indicators 217 | # add #NUMERIC_ONLY# after the word if it should ONLY be non-breaking when a 0-9 digit follows it 218 | No 219 | Nos 220 | Art 221 | Nr 222 | pp 223 | ca 224 | Ca 225 | 226 | #Ordinals are done with . in German - "1." 
= "1st" in English 227 | 1 228 | 2 229 | 3 230 | 4 231 | 5 232 | 6 233 | 7 234 | 8 235 | 9 236 | 10 237 | 11 238 | 12 239 | 13 240 | 14 241 | 15 242 | 16 243 | 17 244 | 18 245 | 19 246 | 20 247 | 21 248 | 22 249 | 23 250 | 24 251 | 25 252 | 26 253 | 27 254 | 28 255 | 29 256 | 30 257 | 31 258 | 32 259 | 33 260 | 34 261 | 35 262 | 36 263 | 37 264 | 38 265 | 39 266 | 40 267 | 41 268 | 42 269 | 43 270 | 44 271 | 45 272 | 46 273 | 47 274 | 48 275 | 49 276 | 50 277 | 51 278 | 52 279 | 53 280 | 54 281 | 55 282 | 56 283 | 57 284 | 58 285 | 59 286 | 60 287 | 61 288 | 62 289 | 63 290 | 64 291 | 65 292 | 66 293 | 67 294 | 68 295 | 69 296 | 70 297 | 71 298 | 72 299 | 73 300 | 74 301 | 75 302 | 76 303 | 77 304 | 78 305 | 79 306 | 80 307 | 81 308 | 82 309 | 83 310 | 84 311 | 85 312 | 86 313 | 87 314 | 88 315 | 89 316 | 90 317 | 91 318 | 92 319 | 93 320 | 94 321 | 95 322 | 96 323 | 97 324 | 98 325 | 99 326 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.en: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 34 | Adj 35 | Adm 36 | Adv 37 | Asst 38 | Bart 39 | Bldg 40 | Brig 41 | Bros 42 | Capt 43 | Cmdr 44 | Col 45 | Comdr 46 | Con 47 | Corp 48 | Cpl 49 | DR 50 | Dr 51 | Drs 52 | Ens 53 | Gen 54 | Gov 55 | Hon 56 | Hr 57 | Hosp 58 | Insp 59 | Lt 60 | MM 61 | MR 62 | MRS 63 | MS 64 | Maj 65 | Messrs 66 | Mlle 67 | Mme 68 | Mr 69 | Mrs 70 | Ms 71 | Msgr 72 | Op 73 | Ord 74 | Pfc 75 | Ph 76 | Prof 77 | Pvt 78 | Rep 79 | Reps 80 | Res 81 | Rev 82 | Rt 83 | Sen 84 | Sens 85 | Sfc 86 | Sgt 87 | Sr 88 | St 89 | Supt 90 | Surg 91 | 92 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 93 | v 94 | vs 95 | i.e 96 | rev 97 | e.g 98 | 99 | #Numbers only. These should only induce breaks when followed by a numeric sequence 100 | # add NUMERIC_ONLY after the word for this function 101 | #This case is mostly for the english "No." which can either be a sentence of its own, or 102 | #if followed by a number, a non-breaking prefix 103 | No #NUMERIC_ONLY# 104 | Nos 105 | Art #NUMERIC_ONLY# 106 | Nr 107 | pp #NUMERIC_ONLY# 108 | 109 | #month abbreviations 110 | Jan 111 | Feb 112 | Mar 113 | Apr 114 | #May is a full word 115 | Jun 116 | Jul 117 | Aug 118 | Sep 119 | Oct 120 | Nov 121 | Dec 122 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.es: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 
3 | 4 | #any single upper case letter followed by a period is not a sentence ender 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | # Period-final abbreviation list from http://www.ctspanish.com/words/abbreviations.htm 34 | 35 | A.C 36 | Apdo 37 | Av 38 | Bco 39 | CC.AA 40 | Da 41 | Dep 42 | Dn 43 | Dr 44 | Dra 45 | EE.UU 46 | Excmo 47 | FF.CC 48 | Fil 49 | Gral 50 | J.C 51 | Let 52 | Lic 53 | N.B 54 | P.D 55 | P.V.P 56 | Prof 57 | Pts 58 | Rte 59 | S.A 60 | S.A.R 61 | S.E 62 | S.L 63 | S.R.C 64 | Sr 65 | Sra 66 | Srta 67 | Sta 68 | Sto 69 | T.V.E 70 | Tel 71 | Ud 72 | Uds 73 | V.B 74 | V.E 75 | Vd 76 | Vds 77 | a/c 78 | adj 79 | admón 80 | afmo 81 | apdo 82 | av 83 | c 84 | c.f 85 | c.g 86 | cap 87 | cm 88 | cta 89 | dcha 90 | doc 91 | ej 92 | entlo 93 | esq 94 | etc 95 | f.c 96 | gr 97 | grs 98 | izq 99 | kg 100 | km 101 | mg 102 | mm 103 | núm 104 | núm 105 | p 106 | p.a 107 | p.ej 108 | ptas 109 | pág 110 | págs 111 | pág 112 | págs 113 | q.e.g.e 114 | q.e.s.m 115 | s 116 | s.s.s 117 | vid 118 | vol 119 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.fi: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT 2 | #indicate an end-of-sentence marker. Special cases are included for prefixes 3 | #that ONLY appear before 0-9 numbers. 4 | 5 | #This list is compiled from omorfi database 6 | #by Tommi A Pirinen. 7 | 8 | 9 | #any single upper case letter followed by a period is not a sentence ender 10 | A 11 | B 12 | C 13 | D 14 | E 15 | F 16 | G 17 | H 18 | I 19 | J 20 | K 21 | L 22 | M 23 | N 24 | O 25 | P 26 | Q 27 | R 28 | S 29 | T 30 | U 31 | V 32 | W 33 | X 34 | Y 35 | Z 36 | Å 37 | Ä 38 | Ö 39 | 40 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 41 | alik 42 | alil 43 | amir 44 | apul 45 | apul.prof 46 | arkkit 47 | ass 48 | assist 49 | dipl 50 | dipl.arkkit 51 | dipl.ekon 52 | dipl.ins 53 | dipl.kielenk 54 | dipl.kirjeenv 55 | dipl.kosm 56 | dipl.urk 57 | dos 58 | erikoiseläinl 59 | erikoishammasl 60 | erikoisl 61 | erikoist 62 | ev.luutn 63 | evp 64 | fil 65 | ft 66 | hallinton 67 | hallintot 68 | hammaslääket 69 | jatk 70 | jääk 71 | kansaned 72 | kapt 73 | kapt.luutn 74 | kenr 75 | kenr.luutn 76 | kenr.maj 77 | kers 78 | kirjeenv 79 | kom 80 | kom.kapt 81 | komm 82 | konst 83 | korpr 84 | luutn 85 | maist 86 | maj 87 | Mr 88 | Mrs 89 | Ms 90 | M.Sc 91 | neuv 92 | nimim 93 | Ph.D 94 | prof 95 | puh.joht 96 | pääll 97 | res 98 | san 99 | siht 100 | suom 101 | sähköp 102 | säv 103 | toht 104 | toim 105 | toim.apul 106 | toim.joht 107 | toim.siht 108 | tuom 109 | ups 110 | vänr 111 | vääp 112 | ye.ups 113 | ylik 114 | ylil 115 | ylim 116 | ylimatr 117 | yliop 118 | yliopp 119 | ylip 120 | yliv 121 | 122 | #misc - odd period-ending items that NEVER indicate breaks (p.m. 
does NOT fall 123 | #into this category - it sometimes ends a sentence) 124 | e.g 125 | ent 126 | esim 127 | huom 128 | i.e 129 | ilm 130 | l 131 | mm 132 | myöh 133 | nk 134 | nyk 135 | par 136 | po 137 | t 138 | v 139 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.fr: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | # 4 | #any single upper case letter followed by a period is not a sentence ender 5 | #usually upper case letters are initials in a name 6 | #no French words end in single lower-case letters, so we throw those in too? 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | # Period-final abbreviation list for French 61 | A.C.N 62 | A.M 63 | art 64 | ann 65 | apr 66 | av 67 | auj 68 | lib 69 | B.P 70 | boul 71 | ca 72 | c.-à-d 73 | cf 74 | ch.-l 75 | chap 76 | contr 77 | C.P.I 78 | C.Q.F.D 79 | C.N 80 | C.N.S 81 | C.S 82 | dir 83 | éd 84 | e.g 85 | env 86 | al 87 | etc 88 | E.V 89 | ex 90 | fasc 91 | fém 92 | fig 93 | fr 94 | hab 95 | ibid 96 | id 97 | i.e 98 | inf 99 | LL.AA 100 | LL.AA.II 101 | LL.AA.RR 102 | LL.AA.SS 103 | L.D 104 | LL.EE 105 | LL.MM 106 | LL.MM.II.RR 107 | loc.cit 108 | masc 109 | MM 110 | ms 111 | N.B 112 | N.D.A 113 | N.D.L.R 114 | N.D.T 115 | n/réf 116 | NN.SS 117 | N.S 118 | N.D 119 | N.P.A.I 120 | p.c.c 121 | pl 122 | pp 123 | p.ex 124 | p.j 125 | P.S 126 | R.A.S 127 | R.-V 128 | R.P 129 | R.I.P 130 | SS 131 | S.S 132 | S.A 133 | S.A.I 134 | S.A.R 135 | S.A.S 136 | S.E 137 | sec 138 | sect 139 | sing 140 | S.M 141 | S.M.I.R 142 | sq 143 | sqq 144 | suiv 145 | sup 146 | suppl 147 | tél 148 | T.S.V.P 149 | vb 150 | vol 151 | vs 152 | X.O 153 | Z.I 154 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.hu: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | Á 33 | É 34 | Í 35 | Ó 36 | Ö 37 | Ő 38 | Ú 39 | Ü 40 | Ű 41 | 42 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 43 | Dr 44 | dr 45 | kb 46 | Kb 47 | vö 48 | Vö 49 | pl 50 | Pl 51 | ca 52 | Ca 53 | min 54 | Min 55 | max 56 | Max 57 | ún 58 | Ún 59 | prof 60 | Prof 61 | de 62 | De 63 | du 64 | Du 65 | Szt 66 | St 67 | 68 | #Numbers only. 
These should only induce breaks when followed by a numeric sequence 69 | # add NUMERIC_ONLY after the word for this function 70 | #This case is mostly for the english "No." which can either be a sentence of its own, or 71 | #if followed by a number, a non-breaking prefix 72 | 73 | # Month name abbreviations 74 | jan #NUMERIC_ONLY# 75 | Jan #NUMERIC_ONLY# 76 | Feb #NUMERIC_ONLY# 77 | feb #NUMERIC_ONLY# 78 | márc #NUMERIC_ONLY# 79 | Márc #NUMERIC_ONLY# 80 | ápr #NUMERIC_ONLY# 81 | Ápr #NUMERIC_ONLY# 82 | máj #NUMERIC_ONLY# 83 | Máj #NUMERIC_ONLY# 84 | jún #NUMERIC_ONLY# 85 | Jún #NUMERIC_ONLY# 86 | Júl #NUMERIC_ONLY# 87 | júl #NUMERIC_ONLY# 88 | aug #NUMERIC_ONLY# 89 | Aug #NUMERIC_ONLY# 90 | Szept #NUMERIC_ONLY# 91 | szept #NUMERIC_ONLY# 92 | okt #NUMERIC_ONLY# 93 | Okt #NUMERIC_ONLY# 94 | nov #NUMERIC_ONLY# 95 | Nov #NUMERIC_ONLY# 96 | dec #NUMERIC_ONLY# 97 | Dec #NUMERIC_ONLY# 98 | 99 | # Other abbreviations 100 | tel #NUMERIC_ONLY# 101 | Tel #NUMERIC_ONLY# 102 | Fax #NUMERIC_ONLY# 103 | fax #NUMERIC_ONLY# 104 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.is: -------------------------------------------------------------------------------- 1 | no #NUMERIC_ONLY# 2 | No #NUMERIC_ONLY# 3 | nr #NUMERIC_ONLY# 4 | Nr #NUMERIC_ONLY# 5 | nR #NUMERIC_ONLY# 6 | NR #NUMERIC_ONLY# 7 | a 8 | b 9 | c 10 | d 11 | e 12 | f 13 | g 14 | h 15 | i 16 | j 17 | k 18 | l 19 | m 20 | n 21 | o 22 | p 23 | q 24 | r 25 | s 26 | t 27 | u 28 | v 29 | w 30 | x 31 | y 32 | z 33 | ^ 34 | í 35 | á 36 | ó 37 | æ 38 | A 39 | B 40 | C 41 | D 42 | E 43 | F 44 | G 45 | H 46 | I 47 | J 48 | K 49 | L 50 | M 51 | N 52 | O 53 | P 54 | Q 55 | R 56 | S 57 | T 58 | U 59 | V 60 | W 61 | X 62 | Y 63 | Z 64 | ab.fn 65 | a.fn 66 | afs 67 | al 68 | alm 69 | alg 70 | andh 71 | ath 72 | aths 73 | atr 74 | ao 75 | au 76 | aukaf 77 | áfn 78 | áhrl.s 79 | áhrs 80 | ákv.gr 81 | ákv 82 | bh 83 | bls 84 | dr 85 | e.Kr 86 | et 87 | ef 88 | efn 89 | ennfr 90 | eink 91 | end 92 | e.st 93 | erl 94 | fél 95 | fskj 96 | fh 97 | f.hl 98 | físl 99 | fl 100 | fn 101 | fo 102 | forl 103 | frb 104 | frl 105 | frh 106 | frt 107 | fsl 108 | fsh 109 | fs 110 | fsk 111 | fst 112 | f.Kr 113 | ft 114 | fv 115 | fyrrn 116 | fyrrv 117 | germ 118 | gm 119 | gr 120 | hdl 121 | hdr 122 | hf 123 | hl 124 | hlsk 125 | hljsk 126 | hljv 127 | hljóðv 128 | hr 129 | hv 130 | hvk 131 | holl 132 | Hos 133 | höf 134 | hk 135 | hrl 136 | ísl 137 | kaf 138 | kap 139 | Khöfn 140 | kk 141 | kg 142 | kk 143 | km 144 | kl 145 | klst 146 | kr 147 | kt 148 | kgúrsk 149 | kvk 150 | leturbr 151 | lh 152 | lh.nt 153 | lh.þt 154 | lo 155 | ltr 156 | mlja 157 | mljó 158 | millj 159 | mm 160 | mms 161 | m.fl 162 | miðm 163 | mgr 164 | mst 165 | mín 166 | nf 167 | nh 168 | nhm 169 | nl 170 | nk 171 | nmgr 172 | no 173 | núv 174 | nt 175 | o.áfr 176 | o.m.fl 177 | ohf 178 | o.fl 179 | o.s.frv 180 | ófn 181 | ób 182 | óákv.gr 183 | óákv 184 | pfn 185 | PR 186 | pr 187 | Ritstj 188 | Rvík 189 | Rvk 190 | samb 191 | samhlj 192 | samn 193 | samn 194 | sbr 195 | sek 196 | sérn 197 | sf 198 | sfn 199 | sh 200 | sfn 201 | sh 202 | s.hl 203 | sk 204 | skv 205 | sl 206 | sn 207 | so 208 | ss.us 209 | s.st 210 | samþ 211 | sbr 212 | shlj 213 | sign 214 | skál 215 | st 216 | st.s 217 | stk 218 | sþ 219 | teg 220 | tbl 221 | tfn 222 | tl 223 | tvíhlj 224 | tvt 225 | till 226 | to 227 | umr 228 | uh 229 | us 230 | uppl 231 | útg 232 | vb 233 | Vf 234 | vh 235 | vkf 236 | Vl 237 | vl 238 | vlf 239 | vmf 240 | 8vo 241 | 
vsk 242 | vth 243 | þt 244 | þf 245 | þjs 246 | þgf 247 | þlt 248 | þolm 249 | þm 250 | þml 251 | þýð 252 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.it: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 34 | Adj 35 | Adm 36 | Adv 37 | Amn 38 | Arch 39 | Asst 40 | Avv 41 | Bart 42 | Bcc 43 | Bldg 44 | Brig 45 | Bros 46 | C.A.P 47 | C.P 48 | Capt 49 | Cc 50 | Cmdr 51 | Co 52 | Col 53 | Comdr 54 | Con 55 | Corp 56 | Cpl 57 | DR 58 | Dott 59 | Dr 60 | Drs 61 | Egr 62 | Ens 63 | Gen 64 | Geom 65 | Gov 66 | Hon 67 | Hosp 68 | Hr 69 | Id 70 | Ing 71 | Insp 72 | Lt 73 | MM 74 | MR 75 | MRS 76 | MS 77 | Maj 78 | Messrs 79 | Mlle 80 | Mme 81 | Mo 82 | Mons 83 | Mr 84 | Mrs 85 | Ms 86 | Msgr 87 | N.B 88 | Op 89 | Ord 90 | P.S 91 | P.T 92 | Pfc 93 | Ph 94 | Prof 95 | Pvt 96 | RP 97 | RSVP 98 | Rag 99 | Rep 100 | Reps 101 | Res 102 | Rev 103 | Rif 104 | Rt 105 | S.A 106 | S.B.F 107 | S.P.M 108 | S.p.A 109 | S.r.l 110 | Sen 111 | Sens 112 | Sfc 113 | Sgt 114 | Sig 115 | Sigg 116 | Soc 117 | Spett 118 | Sr 119 | St 120 | Supt 121 | Surg 122 | V.P 123 | 124 | # other 125 | a.c 126 | acc 127 | all 128 | banc 129 | c.a 130 | c.c.p 131 | c.m 132 | c.p 133 | c.s 134 | c.v 135 | corr 136 | dott 137 | e.p.c 138 | ecc 139 | es 140 | fatt 141 | gg 142 | int 143 | lett 144 | ogg 145 | on 146 | p.c 147 | p.c.c 148 | p.es 149 | p.f 150 | p.r 151 | p.v 152 | post 153 | pp 154 | racc 155 | ric 156 | s.n.c 157 | seg 158 | sgg 159 | ss 160 | tel 161 | u.s 162 | v.r 163 | v.s 164 | 165 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 166 | v 167 | vs 168 | i.e 169 | rev 170 | e.g 171 | 172 | #Numbers only. These should only induce breaks when followed by a numeric sequence 173 | # add NUMERIC_ONLY after the word for this function 174 | #This case is mostly for the english "No." which can either be a sentence of its own, or 175 | #if followed by a number, a non-breaking prefix 176 | No #NUMERIC_ONLY# 177 | Nos 178 | Art #NUMERIC_ONLY# 179 | Nr 180 | pp #NUMERIC_ONLY# 181 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.lv: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 
3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | Ā 8 | B 9 | C 10 | Č 11 | D 12 | E 13 | Ē 14 | F 15 | G 16 | Ģ 17 | H 18 | I 19 | Ī 20 | J 21 | K 22 | Ķ 23 | L 24 | Ļ 25 | M 26 | N 27 | Ņ 28 | O 29 | P 30 | Q 31 | R 32 | S 33 | Š 34 | T 35 | U 36 | Ū 37 | V 38 | W 39 | X 40 | Y 41 | Z 42 | Ž 43 | 44 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 45 | dr 46 | Dr 47 | med 48 | prof 49 | Prof 50 | inž 51 | Inž 52 | ist.loc 53 | Ist.loc 54 | kor.loc 55 | Kor.loc 56 | v.i 57 | vietn 58 | Vietn 59 | 60 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 61 | a.l 62 | t.p 63 | pārb 64 | Pārb 65 | vec 66 | Vec 67 | inv 68 | Inv 69 | sk 70 | Sk 71 | spec 72 | Spec 73 | vienk 74 | Vienk 75 | virz 76 | Virz 77 | māksl 78 | Māksl 79 | mūz 80 | Mūz 81 | akad 82 | Akad 83 | soc 84 | Soc 85 | galv 86 | Galv 87 | vad 88 | Vad 89 | sertif 90 | Sertif 91 | folkl 92 | Folkl 93 | hum 94 | Hum 95 | 96 | #Numbers only. These should only induce breaks when followed by a numeric sequence 97 | # add NUMERIC_ONLY after the word for this function 98 | #This case is mostly for the english "No." which can either be a sentence of its own, or 99 | #if followed by a number, a non-breaking prefix 100 | Nr #NUMERIC_ONLY# 101 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.nl: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | #Sources: http://nl.wikipedia.org/wiki/Lijst_van_afkortingen 4 | # http://nl.wikipedia.org/wiki/Aanspreekvorm 5 | # http://nl.wikipedia.org/wiki/Titulatuur_in_het_Nederlands_hoger_onderwijs 6 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 7 | #usually upper case letters are initials in a name 8 | A 9 | B 10 | C 11 | D 12 | E 13 | F 14 | G 15 | H 16 | I 17 | J 18 | K 19 | L 20 | M 21 | N 22 | O 23 | P 24 | Q 25 | R 26 | S 27 | T 28 | U 29 | V 30 | W 31 | X 32 | Y 33 | Z 34 | 35 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 36 | bacc 37 | bc 38 | bgen 39 | c.i 40 | dhr 41 | dr 42 | dr.h.c 43 | drs 44 | drs 45 | ds 46 | eint 47 | fa 48 | Fa 49 | fam 50 | gen 51 | genm 52 | ing 53 | ir 54 | jhr 55 | jkvr 56 | jr 57 | kand 58 | kol 59 | lgen 60 | lkol 61 | Lt 62 | maj 63 | Mej 64 | mevr 65 | Mme 66 | mr 67 | mr 68 | Mw 69 | o.b.s 70 | plv 71 | prof 72 | ritm 73 | tint 74 | Vz 75 | Z.D 76 | Z.D.H 77 | Z.E 78 | Z.Em 79 | Z.H 80 | Z.K.H 81 | Z.K.M 82 | Z.M 83 | z.v 84 | 85 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 86 | #we seem to have a lot of these in dutch i.e.: i.p.v - in plaats van (in stead of) never ends a sentence 87 | a.g.v 88 | bijv 89 | bijz 90 | bv 91 | d.w.z 92 | e.c 93 | e.g 94 | e.k 95 | ev 96 | i.p.v 97 | i.s.m 98 | i.t.t 99 | i.v.m 100 | m.a.w 101 | m.b.t 102 | m.b.v 103 | m.h.o 104 | m.i 105 | m.i.v 106 | v.w.t 107 | 108 | #Numbers only. 
These should only induce breaks when followed by a numeric sequence 109 | # add NUMERIC_ONLY after the word for this function 110 | #This case is mostly for the english "No." which can either be a sentence of its own, or 111 | #if followed by a number, a non-breaking prefix 112 | Nr #NUMERIC_ONLY# 113 | Nrs 114 | nrs 115 | nr #NUMERIC_ONLY# 116 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.pl: -------------------------------------------------------------------------------- 1 | adw 2 | afr 3 | akad 4 | al 5 | Al 6 | am 7 | amer 8 | arch 9 | art 10 | Art 11 | artyst 12 | astr 13 | austr 14 | bałt 15 | bdb 16 | bł 17 | bm 18 | br 19 | bryg 20 | bryt 21 | centr 22 | ces 23 | chem 24 | chiń 25 | chir 26 | c.k 27 | c.o 28 | cyg 29 | cyw 30 | cyt 31 | czes 32 | czw 33 | cd 34 | Cd 35 | czyt 36 | ćw 37 | ćwicz 38 | daw 39 | dcn 40 | dekl 41 | demokr 42 | det 43 | diec 44 | dł 45 | dn 46 | dot 47 | dol 48 | dop 49 | dost 50 | dosł 51 | h.c 52 | ds 53 | dst 54 | duszp 55 | dypl 56 | egz 57 | ekol 58 | ekon 59 | elektr 60 | em 61 | ew 62 | fab 63 | farm 64 | fot 65 | fr 66 | gat 67 | gastr 68 | geogr 69 | geol 70 | gimn 71 | głęb 72 | gm 73 | godz 74 | górn 75 | gosp 76 | gr 77 | gram 78 | hist 79 | hiszp 80 | hr 81 | Hr 82 | hot 83 | id 84 | in 85 | im 86 | iron 87 | jn 88 | kard 89 | kat 90 | katol 91 | k.k 92 | kk 93 | kol 94 | kl 95 | k.p.a 96 | kpc 97 | k.p.c 98 | kpt 99 | kr 100 | k.r 101 | krak 102 | k.r.o 103 | kryt 104 | kult 105 | laic 106 | łac 107 | niem 108 | woj 109 | nb 110 | np 111 | Nb 112 | Np 113 | pol 114 | pow 115 | m.in 116 | pt 117 | ps 118 | Pt 119 | Ps 120 | cdn 121 | jw 122 | ryc 123 | rys 124 | Ryc 125 | Rys 126 | tj 127 | tzw 128 | Tzw 129 | tzn 130 | zob 131 | ang 132 | ub 133 | ul 134 | pw 135 | pn 136 | pl 137 | al 138 | k 139 | n 140 | nr #NUMERIC_ONLY# 141 | Nr #NUMERIC_ONLY# 142 | ww 143 | wł 144 | ur 145 | zm 146 | żyd 147 | żarg 148 | żyw 149 | wył 150 | bp 151 | bp 152 | wyst 153 | tow 154 | Tow 155 | o 156 | sp 157 | Sp 158 | st 159 | spółdz 160 | Spółdz 161 | społ 162 | spółgł 163 | stoł 164 | stow 165 | Stoł 166 | Stow 167 | zn 168 | zew 169 | zewn 170 | zdr 171 | zazw 172 | zast 173 | zaw 174 | zał 175 | zal 176 | zam 177 | zak 178 | zakł 179 | zagr 180 | zach 181 | adw 182 | Adw 183 | lek 184 | Lek 185 | med 186 | mec 187 | Mec 188 | doc 189 | Doc 190 | dyw 191 | dyr 192 | Dyw 193 | Dyr 194 | inż 195 | Inż 196 | mgr 197 | Mgr 198 | dh 199 | dr 200 | Dh 201 | Dr 202 | p 203 | P 204 | red 205 | Red 206 | prof 207 | prok 208 | Prof 209 | Prok 210 | hab 211 | płk 212 | Płk 213 | nadkom 214 | Nadkom 215 | podkom 216 | Podkom 217 | ks 218 | Ks 219 | gen 220 | Gen 221 | por 222 | Por 223 | reż 224 | Reż 225 | przyp 226 | Przyp 227 | śp 228 | św 229 | śW 230 | Śp 231 | Św 232 | ŚW 233 | szer 234 | Szer 235 | pkt #NUMERIC_ONLY# 236 | str #NUMERIC_ONLY# 237 | tab #NUMERIC_ONLY# 238 | Tab #NUMERIC_ONLY# 239 | tel 240 | ust #NUMERIC_ONLY# 241 | par #NUMERIC_ONLY# 242 | poz 243 | pok 244 | oo 245 | oO 246 | Oo 247 | OO 248 | r #NUMERIC_ONLY# 249 | l #NUMERIC_ONLY# 250 | s #NUMERIC_ONLY# 251 | najśw 252 | Najśw 253 | A 254 | B 255 | C 256 | D 257 | E 258 | F 259 | G 260 | H 261 | I 262 | J 263 | K 264 | L 265 | M 266 | N 267 | O 268 | P 269 | Q 270 | R 271 | S 272 | T 273 | U 274 | V 275 | W 276 | X 277 | Y 278 | Z 279 | Ś 280 | Ć 281 | Ż 282 | Ź 283 | Dz 284 | -------------------------------------------------------------------------------- 
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.pt: -------------------------------------------------------------------------------- 1 | #File adapted for PT by H. Leal Fontes from the EN & DE versions published with moses-2009-04-13. Last update: 10.11.2009. 2 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 3 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 4 | 5 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 6 | #usually upper case letters are initials in a name 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | 61 | #Roman Numerals. A dot after one of these is not a sentence break in Portuguese. 62 | I 63 | II 64 | III 65 | IV 66 | V 67 | VI 68 | VII 69 | VIII 70 | IX 71 | X 72 | XI 73 | XII 74 | XIII 75 | XIV 76 | XV 77 | XVI 78 | XVII 79 | XVIII 80 | XIX 81 | XX 82 | i 83 | ii 84 | iii 85 | iv 86 | v 87 | vi 88 | vii 89 | viii 90 | ix 91 | x 92 | xi 93 | xii 94 | xiii 95 | xiv 96 | xv 97 | xvi 98 | xvii 99 | xviii 100 | xix 101 | xx 102 | 103 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 104 | Adj 105 | Adm 106 | Adv 107 | Art 108 | Ca 109 | Capt 110 | Cmdr 111 | Col 112 | Comdr 113 | Con 114 | Corp 115 | Cpl 116 | DR 117 | DRA 118 | Dr 119 | Dra 120 | Dras 121 | Drs 122 | Eng 123 | Enga 124 | Engas 125 | Engos 126 | Ex 127 | Exo 128 | Exmo 129 | Fig 130 | Gen 131 | Hosp 132 | Insp 133 | Lda 134 | MM 135 | MR 136 | MRS 137 | MS 138 | Maj 139 | Mrs 140 | Ms 141 | Msgr 142 | Op 143 | Ord 144 | Pfc 145 | Ph 146 | Prof 147 | Pvt 148 | Rep 149 | Reps 150 | Res 151 | Rev 152 | Rt 153 | Sen 154 | Sens 155 | Sfc 156 | Sgt 157 | Sr 158 | Sra 159 | Sras 160 | Srs 161 | Sto 162 | Supt 163 | Surg 164 | adj 165 | adm 166 | adv 167 | art 168 | cit 169 | col 170 | con 171 | corp 172 | cpl 173 | dr 174 | dra 175 | dras 176 | drs 177 | eng 178 | enga 179 | engas 180 | engos 181 | ex 182 | exo 183 | exmo 184 | fig 185 | op 186 | prof 187 | sr 188 | sra 189 | sras 190 | srs 191 | sto 192 | 193 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 194 | v 195 | vs 196 | i.e 197 | rev 198 | e.g 199 | 200 | #Numbers only. These should only induce breaks when followed by a numeric sequence 201 | # add NUMERIC_ONLY after the word for this function 202 | #This case is mostly for the english "No." 
which can either be a sentence of its own, or 203 | #if followed by a number, a non-breaking prefix 204 | No #NUMERIC_ONLY# 205 | Nos 206 | Art #NUMERIC_ONLY# 207 | Nr 208 | p #NUMERIC_ONLY# 209 | pp #NUMERIC_ONLY# 210 | 211 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.ro: -------------------------------------------------------------------------------- 1 | A 2 | B 3 | C 4 | D 5 | E 6 | F 7 | G 8 | H 9 | I 10 | J 11 | K 12 | L 13 | M 14 | N 15 | O 16 | P 17 | Q 18 | R 19 | S 20 | T 21 | U 22 | V 23 | W 24 | X 25 | Y 26 | Z 27 | dpdv 28 | etc 29 | șamd 30 | M.Ap.N 31 | dl 32 | Dl 33 | d-na 34 | D-na 35 | dvs 36 | Dvs 37 | pt 38 | Pt 39 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.ru: -------------------------------------------------------------------------------- 1 | # added Cyrillic uppercase letters [А-Я] 2 | # removed 000D carriage return (this is not removed by chomp in tokenizer.perl, and prevents recognition of the prefixes) 3 | # edited by Kate Young (nspaceanalysis@earthlink.net) 21 May 2013 4 | А 5 | Б 6 | В 7 | Г 8 | Д 9 | Е 10 | Ж 11 | З 12 | И 13 | Й 14 | К 15 | Л 16 | М 17 | Н 18 | О 19 | П 20 | Р 21 | С 22 | Т 23 | У 24 | Ф 25 | Х 26 | Ц 27 | Ч 28 | Ш 29 | Щ 30 | Ъ 31 | Ы 32 | Ь 33 | Э 34 | Ю 35 | Я 36 | A 37 | B 38 | C 39 | D 40 | E 41 | F 42 | G 43 | H 44 | I 45 | J 46 | K 47 | L 48 | M 49 | N 50 | O 51 | P 52 | Q 53 | R 54 | S 55 | T 56 | U 57 | V 58 | W 59 | X 60 | Y 61 | Z 62 | 0гг 63 | 1гг 64 | 2гг 65 | 3гг 66 | 4гг 67 | 5гг 68 | 6гг 69 | 7гг 70 | 8гг 71 | 9гг 72 | 0г 73 | 1г 74 | 2г 75 | 3г 76 | 4г 77 | 5г 78 | 6г 79 | 7г 80 | 8г 81 | 9г 82 | Xвв 83 | Vвв 84 | Iвв 85 | Lвв 86 | Mвв 87 | Cвв 88 | Xв 89 | Vв 90 | Iв 91 | Lв 92 | Mв 93 | Cв 94 | 0м 95 | 1м 96 | 2м 97 | 3м 98 | 4м 99 | 5м 100 | 6м 101 | 7м 102 | 8м 103 | 9м 104 | 0мм 105 | 1мм 106 | 2мм 107 | 3мм 108 | 4мм 109 | 5мм 110 | 6мм 111 | 7мм 112 | 8мм 113 | 9мм 114 | 0см 115 | 1см 116 | 2см 117 | 3см 118 | 4см 119 | 5см 120 | 6см 121 | 7см 122 | 8см 123 | 9см 124 | 0дм 125 | 1дм 126 | 2дм 127 | 3дм 128 | 4дм 129 | 5дм 130 | 6дм 131 | 7дм 132 | 8дм 133 | 9дм 134 | 0л 135 | 1л 136 | 2л 137 | 3л 138 | 4л 139 | 5л 140 | 6л 141 | 7л 142 | 8л 143 | 9л 144 | 0км 145 | 1км 146 | 2км 147 | 3км 148 | 4км 149 | 5км 150 | 6км 151 | 7км 152 | 8км 153 | 9км 154 | 0га 155 | 1га 156 | 2га 157 | 3га 158 | 4га 159 | 5га 160 | 6га 161 | 7га 162 | 8га 163 | 9га 164 | 0кг 165 | 1кг 166 | 2кг 167 | 3кг 168 | 4кг 169 | 5кг 170 | 6кг 171 | 7кг 172 | 8кг 173 | 9кг 174 | 0т 175 | 1т 176 | 2т 177 | 3т 178 | 4т 179 | 5т 180 | 6т 181 | 7т 182 | 8т 183 | 9т 184 | 0г 185 | 1г 186 | 2г 187 | 3г 188 | 4г 189 | 5г 190 | 6г 191 | 7г 192 | 8г 193 | 9г 194 | 0мг 195 | 1мг 196 | 2мг 197 | 3мг 198 | 4мг 199 | 5мг 200 | 6мг 201 | 7мг 202 | 8мг 203 | 9мг 204 | бульв 205 | в 206 | вв 207 | г 208 | га 209 | гг 210 | гл 211 | гос 212 | д 213 | дм 214 | доп 215 | др 216 | е 217 | ед 218 | ед 219 | зам 220 | и 221 | инд 222 | исп 223 | Исп 224 | к 225 | кап 226 | кг 227 | кв 228 | кл 229 | км 230 | кол 231 | комн 232 | коп 233 | куб 234 | л 235 | лиц 236 | лл 237 | м 238 | макс 239 | мг 240 | мин 241 | мл 242 | млн 243 | млрд 244 | мм 245 | н 246 | наб 247 | нач 248 | неуд 249 | ном 250 | о 251 | обл 252 | обр 253 | общ 254 | ок 255 | ост 256 | отл 257 | п 258 | пер 259 | перераб 260 | пл 261 | пос 262 | пр 263 | просп 264 | проф 265 | р 266 | ред 267 | руб 268 | с 269 | сб 270 | св 
271 | см 272 | соч 273 | ср 274 | ст 275 | стр 276 | т 277 | тел 278 | Тел 279 | тех 280 | тт 281 | туп 282 | тыс 283 | уд 284 | ул 285 | уч 286 | физ 287 | х 288 | хор 289 | ч 290 | чел 291 | шт 292 | экз 293 | э 294 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.sk: -------------------------------------------------------------------------------- 1 | Bc 2 | Mgr 3 | RNDr 4 | PharmDr 5 | PhDr 6 | JUDr 7 | PaedDr 8 | ThDr 9 | Ing 10 | MUDr 11 | MDDr 12 | MVDr 13 | Dr 14 | ThLic 15 | PhD 16 | ArtD 17 | ThDr 18 | Dr 19 | DrSc 20 | CSs 21 | prof 22 | obr 23 | Obr 24 | Č 25 | č 26 | absol 27 | adj 28 | admin 29 | adr 30 | Adr 31 | adv 32 | advok 33 | afr 34 | ak 35 | akad 36 | akc 37 | akuz 38 | et 39 | al 40 | alch 41 | amer 42 | anat 43 | angl 44 | Angl 45 | anglosas 46 | anorg 47 | ap 48 | apod 49 | arch 50 | archeol 51 | archit 52 | arg 53 | art 54 | astr 55 | astrol 56 | astron 57 | atp 58 | atď 59 | austr 60 | Austr 61 | aut 62 | belg 63 | Belg 64 | bibl 65 | Bibl 66 | biol 67 | bot 68 | bud 69 | bás 70 | býv 71 | cest 72 | chem 73 | cirk 74 | csl 75 | čs 76 | Čs 77 | dat 78 | dep 79 | det 80 | dial 81 | diaľ 82 | dipl 83 | distrib 84 | dokl 85 | dosl 86 | dopr 87 | dram 88 | duš 89 | dv 90 | dvojčl 91 | dór 92 | ekol 93 | ekon 94 | el 95 | elektr 96 | elektrotech 97 | energet 98 | epic 99 | est 100 | etc 101 | etonym 102 | eufem 103 | európ 104 | Európ 105 | ev 106 | evid 107 | expr 108 | fa 109 | fam 110 | farm 111 | fem 112 | feud 113 | fil 114 | filat 115 | filoz 116 | fi 117 | fon 118 | form 119 | fot 120 | fr 121 | Fr 122 | franc 123 | Franc 124 | fraz 125 | fut 126 | fyz 127 | fyziol 128 | garb 129 | gen 130 | genet 131 | genpor 132 | geod 133 | geogr 134 | geol 135 | geom 136 | germ 137 | gr 138 | Gr 139 | gréc 140 | Gréc 141 | gréckokat 142 | hebr 143 | herald 144 | hist 145 | hlav 146 | hosp 147 | hromad 148 | hud 149 | hypok 150 | ident 151 | i.e 152 | ident 153 | imp 154 | impf 155 | indoeur 156 | inf 157 | inform 158 | instr 159 | int 160 | interj 161 | inšt 162 | inštr 163 | iron 164 | jap 165 | Jap 166 | jaz 167 | jedn 168 | juhoamer 169 | juhových 170 | juhozáp 171 | juž 172 | kanad 173 | Kanad 174 | kanc 175 | kapit 176 | kpt 177 | kart 178 | katastr 179 | knih 180 | kniž 181 | komp 182 | konj 183 | konkr 184 | kozmet 185 | krajč 186 | kresť 187 | kt 188 | kuch 189 | lat 190 | latinskoamer 191 | lek 192 | lex 193 | lingv 194 | lit 195 | litur 196 | log 197 | lok 198 | max 199 | Max 200 | maď 201 | Maď 202 | medzinár 203 | mest 204 | metr 205 | mil 206 | Mil 207 | min 208 | Min 209 | miner 210 | ml 211 | mld 212 | mn 213 | mod 214 | mytol 215 | napr 216 | nar 217 | Nar 218 | nasl 219 | nedok 220 | neg 221 | negat 222 | neklas 223 | nem 224 | Nem 225 | neodb 226 | neos 227 | neskl 228 | nesklon 229 | nespis 230 | nespráv 231 | neved 232 | než 233 | niekt 234 | niž 235 | nom 236 | náb 237 | nákl 238 | námor 239 | nár 240 | obch 241 | obj 242 | obv 243 | obyč 244 | obč 245 | občian 246 | odb 247 | odd 248 | ods 249 | ojed 250 | okr 251 | Okr 252 | opt 253 | opyt 254 | org 255 | os 256 | osob 257 | ot 258 | ovoc 259 | par 260 | part 261 | pejor 262 | pers 263 | pf 264 | Pf 265 | P.f 266 | p.f 267 | pl 268 | Plk 269 | pod 270 | podst 271 | pokl 272 | polit 273 | politol 274 | polygr 275 | pomn 276 | popl 277 | por 278 | porad 279 | porov 280 | posch 281 | potrav 282 | použ 283 | poz 284 | pozit 285 | poľ 286 | poľno 287 | poľnohosp 288 | poľov 289 | pošt 290 | pož 291 | prac 292 | predl 293 | 
pren 294 | prep 295 | preuk 296 | priezv 297 | Priezv 298 | privl 299 | prof 300 | práv 301 | príd 302 | príj 303 | prík 304 | príp 305 | prír 306 | prísl 307 | príslov 308 | príč 309 | psych 310 | publ 311 | pís 312 | písm 313 | pôv 314 | refl 315 | reg 316 | rep 317 | resp 318 | rozk 319 | rozlič 320 | rozpráv 321 | roč 322 | Roč 323 | ryb 324 | rádiotech 325 | rím 326 | samohl 327 | semest 328 | sev 329 | severoamer 330 | severových 331 | severozáp 332 | sg 333 | skr 334 | skup 335 | sl 336 | Sloven 337 | soc 338 | soch 339 | sociol 340 | sp 341 | spol 342 | Spol 343 | spoloč 344 | spoluhl 345 | správ 346 | spôs 347 | st 348 | star 349 | starogréc 350 | starorím 351 | s.r.o 352 | stol 353 | stor 354 | str 355 | stredoamer 356 | stredoškol 357 | subj 358 | subst 359 | superl 360 | sv 361 | sz 362 | súkr 363 | súp 364 | súvzť 365 | tal 366 | Tal 367 | tech 368 | tel 369 | Tel 370 | telef 371 | teles 372 | telev 373 | teol 374 | trans 375 | turist 376 | tuzem 377 | typogr 378 | tzn 379 | tzv 380 | ukaz 381 | ul 382 | Ul 383 | umel 384 | univ 385 | ust 386 | ved 387 | vedľ 388 | verb 389 | veter 390 | vin 391 | viď 392 | vl 393 | vod 394 | vodohosp 395 | pnl 396 | vulg 397 | vyj 398 | vys 399 | vysokoškol 400 | vzťaž 401 | vôb 402 | vých 403 | výd 404 | výrob 405 | výsk 406 | výsl 407 | výtv 408 | výtvar 409 | význ 410 | včel 411 | vš 412 | všeob 413 | zahr 414 | zar 415 | zariad 416 | zast 417 | zastar 418 | zastaráv 419 | zb 420 | zdravot 421 | združ 422 | zjemn 423 | zlat 424 | zn 425 | Zn 426 | zool 427 | zr 428 | zried 429 | zv 430 | záhr 431 | zák 432 | zákl 433 | zám 434 | záp 435 | západoeur 436 | zázn 437 | územ 438 | účt 439 | čast 440 | čes 441 | Čes 442 | čl 443 | čísl 444 | živ 445 | pr 446 | fak 447 | Kr 448 | p.n.l 449 | A 450 | B 451 | C 452 | D 453 | E 454 | F 455 | G 456 | H 457 | I 458 | J 459 | K 460 | L 461 | M 462 | N 463 | O 464 | P 465 | Q 466 | R 467 | S 468 | T 469 | U 470 | V 471 | W 472 | X 473 | Y 474 | Z 475 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.sl: -------------------------------------------------------------------------------- 1 | dr 2 | Dr 3 | itd 4 | itn 5 | št #NUMERIC_ONLY# 6 | Št #NUMERIC_ONLY# 7 | d 8 | jan 9 | Jan 10 | feb 11 | Feb 12 | mar 13 | Mar 14 | apr 15 | Apr 16 | jun 17 | Jun 18 | jul 19 | Jul 20 | avg 21 | Avg 22 | sept 23 | Sept 24 | sep 25 | Sep 26 | okt 27 | Okt 28 | nov 29 | Nov 30 | dec 31 | Dec 32 | tj 33 | Tj 34 | npr 35 | Npr 36 | sl 37 | Sl 38 | op 39 | Op 40 | gl 41 | Gl 42 | oz 43 | Oz 44 | prev 45 | dipl 46 | ing 47 | prim 48 | Prim 49 | cf 50 | Cf 51 | gl 52 | Gl 53 | A 54 | B 55 | C 56 | D 57 | E 58 | F 59 | G 60 | H 61 | I 62 | J 63 | K 64 | L 65 | M 66 | N 67 | O 68 | P 69 | Q 70 | R 71 | S 72 | T 73 | U 74 | V 75 | W 76 | X 77 | Y 78 | Z 79 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.sv: -------------------------------------------------------------------------------- 1 | #single upper case letter are usually initials 2 | A 3 | B 4 | C 5 | D 6 | E 7 | F 8 | G 9 | H 10 | I 11 | J 12 | K 13 | L 14 | M 15 | N 16 | O 17 | P 18 | Q 19 | R 20 | S 21 | T 22 | U 23 | V 24 | W 25 | X 26 | Y 27 | Z 28 | #misc abbreviations 29 | AB 30 | G 31 | VG 32 | dvs 33 | etc 34 | from 35 | iaf 36 | jfr 37 | kl 38 | kr 39 | mao 40 | mfl 41 | mm 42 | osv 43 | pga 44 | tex 45 | tom 46 | vs 47 | 
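How these prefix lists are consumed: the Moses-style tokenizer scripts under preprocess/ (tokenizer.perl, tokenizer_apos.perl) load the nonbreaking_prefix file for the chosen language; broadly, a word listed here that is followed by a period does not signal an end of sentence and keeps its period attached, while entries tagged #NUMERIC_ONLY# are only protected when the next token starts with a digit. The Python sketch below is an illustrative approximation only, not part of this repository; load_prefixes and is_protected_period are hypothetical names, and the real Perl tokenizer applies additional context checks (for example, the casing of the following word).

import codecs

def load_prefixes(path):
    # One prefix per line; lines starting with '#' are comments.
    # '#NUMERIC_ONLY#' marks prefixes that only protect the period
    # when the next token starts with a digit.
    prefixes = {}
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            if '#NUMERIC_ONLY#' in line:
                prefixes[line.split()[0]] = 'numeric_only'
            else:
                prefixes[line] = 'always'
    return prefixes

def is_protected_period(token, next_token, prefixes):
    # True if the trailing period of `token` should be kept attached
    # (i.e. not treated as a sentence end), according to the list.
    if not token.endswith('.'):
        return False
    rule = prefixes.get(token[:-1])
    if rule == 'always':
        return True
    if rule == 'numeric_only' and next_token[:1].isdigit():
        return True
    return False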
-------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.ta: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | அ 7 | ஆ 8 | இ 9 | ஈ 10 | உ 11 | ஊ 12 | எ 13 | ஏ 14 | ஐ 15 | ஒ 16 | ஓ 17 | ஔ 18 | ஃ 19 | க 20 | கா 21 | கி 22 | கீ 23 | கு 24 | கூ 25 | கெ 26 | கே 27 | கை 28 | கொ 29 | கோ 30 | கௌ 31 | க் 32 | ச 33 | சா 34 | சி 35 | சீ 36 | சு 37 | சூ 38 | செ 39 | சே 40 | சை 41 | சொ 42 | சோ 43 | சௌ 44 | ச் 45 | ட 46 | டா 47 | டி 48 | டீ 49 | டு 50 | டூ 51 | டெ 52 | டே 53 | டை 54 | டொ 55 | டோ 56 | டௌ 57 | ட் 58 | த 59 | தா 60 | தி 61 | தீ 62 | து 63 | தூ 64 | தெ 65 | தே 66 | தை 67 | தொ 68 | தோ 69 | தௌ 70 | த் 71 | ப 72 | பா 73 | பி 74 | பீ 75 | பு 76 | பூ 77 | பெ 78 | பே 79 | பை 80 | பொ 81 | போ 82 | பௌ 83 | ப் 84 | ற 85 | றா 86 | றி 87 | றீ 88 | று 89 | றூ 90 | றெ 91 | றே 92 | றை 93 | றொ 94 | றோ 95 | றௌ 96 | ற் 97 | ய 98 | யா 99 | யி 100 | யீ 101 | யு 102 | யூ 103 | யெ 104 | யே 105 | யை 106 | யொ 107 | யோ 108 | யௌ 109 | ய் 110 | ர 111 | ரா 112 | ரி 113 | ரீ 114 | ரு 115 | ரூ 116 | ரெ 117 | ரே 118 | ரை 119 | ரொ 120 | ரோ 121 | ரௌ 122 | ர் 123 | ல 124 | லா 125 | லி 126 | லீ 127 | லு 128 | லூ 129 | லெ 130 | லே 131 | லை 132 | லொ 133 | லோ 134 | லௌ 135 | ல் 136 | வ 137 | வா 138 | வி 139 | வீ 140 | வு 141 | வூ 142 | வெ 143 | வே 144 | வை 145 | வொ 146 | வோ 147 | வௌ 148 | வ் 149 | ள 150 | ளா 151 | ளி 152 | ளீ 153 | ளு 154 | ளூ 155 | ளெ 156 | ளே 157 | ளை 158 | ளொ 159 | ளோ 160 | ளௌ 161 | ள் 162 | ழ 163 | ழா 164 | ழி 165 | ழீ 166 | ழு 167 | ழூ 168 | ழெ 169 | ழே 170 | ழை 171 | ழொ 172 | ழோ 173 | ழௌ 174 | ழ் 175 | ங 176 | ஙா 177 | ஙி 178 | ஙீ 179 | ஙு 180 | ஙூ 181 | ஙெ 182 | ஙே 183 | ஙை 184 | ஙொ 185 | ஙோ 186 | ஙௌ 187 | ங் 188 | ஞ 189 | ஞா 190 | ஞி 191 | ஞீ 192 | ஞு 193 | ஞூ 194 | ஞெ 195 | ஞே 196 | ஞை 197 | ஞொ 198 | ஞோ 199 | ஞௌ 200 | ஞ் 201 | ண 202 | ணா 203 | ணி 204 | ணீ 205 | ணு 206 | ணூ 207 | ணெ 208 | ணே 209 | ணை 210 | ணொ 211 | ணோ 212 | ணௌ 213 | ண் 214 | ந 215 | நா 216 | நி 217 | நீ 218 | நு 219 | நூ 220 | நெ 221 | நே 222 | நை 223 | நொ 224 | நோ 225 | நௌ 226 | ந் 227 | ம 228 | மா 229 | மி 230 | மீ 231 | மு 232 | மூ 233 | மெ 234 | மே 235 | மை 236 | மொ 237 | மோ 238 | மௌ 239 | ம் 240 | ன 241 | னா 242 | னி 243 | னீ 244 | னு 245 | னூ 246 | னெ 247 | னே 248 | னை 249 | னொ 250 | னோ 251 | னௌ 252 | ன் 253 | 254 | 255 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 256 | திரு 257 | திருமதி 258 | வண 259 | கௌரவ 260 | 261 | 262 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 263 | உ.ம் 264 | #கா.ம் 265 | #எ.ம் 266 | 267 | 268 | #Numbers only. These should only induce breaks when followed by a numeric sequence 269 | # add NUMERIC_ONLY after the word for this function 270 | #This case is mostly for the english "No." 
which can either be a sentence of its own, or 271 | #if followed by a number, a non-breaking prefix 272 | No #NUMERIC_ONLY# 273 | Nos 274 | Art #NUMERIC_ONLY# 275 | Nr 276 | pp #NUMERIC_ONLY# 277 | -------------------------------------------------------------------------------- /preprocess/normalize-punctuation.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl -w 2 | 3 | use strict; 4 | 5 | my ($language) = @ARGV; 6 | 7 | while(<STDIN>) { 8 | s/\r//g; 9 | # remove extra spaces 10 | s/\(/ \(/g; 11 | s/\)/\) /g; s/ +/ /g; 12 | s/\) ([\.\!\:\?\;\,])/\)$1/g; 13 | s/\( /\(/g; 14 | s/ \)/\)/g; 15 | s/(\d) \%/$1\%/g; 16 | s/ :/:/g; 17 | s/ ;/;/g; 18 | # normalize unicode punctuation 19 | s/„/\"/g; 20 | s/“/\"/g; 21 | s/”/\"/g; 22 | s/–/-/g; 23 | s/—/ - /g; s/ +/ /g; 24 | s/´/\'/g; 25 | s/([a-z])‘([a-z])/$1\'$2/gi; 26 | s/([a-z])’([a-z])/$1\'$2/gi; 27 | s/‘/\"/g; 28 | s/‚/\"/g; 29 | s/’/\"/g; 30 | s/''/\"/g; 31 | s/´´/\"/g; 32 | s/…/.../g; 33 | # French quotes 34 | s/ « / \"/g; 35 | s/« /\"/g; 36 | s/«/\"/g; 37 | s/ » /\" /g; 38 | s/ »/\"/g; 39 | s/»/\"/g; 40 | # handle pseudo-spaces 41 | s/ \%/\%/g; 42 | s/nº /nº /g; 43 | s/ :/:/g; 44 | s/ ºC/ ºC/g; 45 | s/ cm/ cm/g; 46 | s/ \?/\?/g; 47 | s/ \!/\!/g; 48 | s/ ;/;/g; 49 | s/, /, /g; s/ +/ /g; 50 | 51 | # English "quotation," followed by comma, style 52 | if ($language eq "en") { 53 | s/\"([,\.]+)/$1\"/g; 54 | } 55 | # Czech is confused 56 | elsif ($language eq "cs" || $language eq "cz") { 57 | } 58 | # German/Spanish/French "quotation", followed by comma, style 59 | else { 60 | s/,\"/\",/g; 61 | s/(\.+)\"(\s*[^<])/\"$1$2/g; # don't fix period at end of sentence 62 | } 63 | 64 | print STDERR $_ if //; 65 | 66 | if ($language eq "de" || $language eq "es" || $language eq "cz" || $language eq "cs" || $language eq "fr") { 67 | s/(\d) (\d)/$1,$2/g; 68 | } 69 | else { 70 | s/(\d) (\d)/$1.$2/g; 71 | } 72 | print $_; 73 | } 74 | -------------------------------------------------------------------------------- /preprocess/preprocess.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # source language (example: fr) 4 | S=$1 5 | # target language (example: en) 6 | T=$2 7 | 8 | # path to dl4mt/data 9 | P1=$3 10 | 11 | # path to subword NMT scripts (can be downloaded from https://github.com/rsennrich/subword-nmt) 12 | P2=$4 13 | 14 | ## merge all parallel corpora 15 | #./merge.sh $1 $2 16 | 17 | perl $P1/normalize-punctuation.perl -l ${S} < all_${S}-${T}.${S} > all_${S}-${T}.${S}.norm # do this for validation and test 18 | perl $P1/normalize-punctuation.perl -l ${T} < all_${S}-${T}.${T} > all_${S}-${T}.${T}.norm # do this for validation and test 19 | 20 | # tokenize 21 | perl $P1/tokenizer_apos.perl -threads 5 -l $S < all_${S}-${T}.${S}.norm > all_${S}-${T}.${S}.tok # do this for validation and test 22 | perl $P1/tokenizer_apos.perl -threads 5 -l $T < all_${S}-${T}.${T}.norm > all_${S}-${T}.${T}.tok # do this for validation and test 23 | 24 | # BPE 25 | if [ ! -f "../${S}.bpe" ]; then 26 | python $P2/learn_bpe.py -s 20000 < all_${S}-${T}.${S}.tok > ../${S}.bpe 27 | fi 28 | if [ !
-f "../${T}.bpe" ]; then 29 | python $P2/learn_bpe.py -s 20000 < all_${S}-${T}.${T}.tok > ../${T}.bpe 30 | fi 31 | 32 | python $P2/apply_bpe.py -c ../${S}.bpe < all_${S}-${T}.${S}.tok > all_${S}-${T}.${S}.tok.bpe # do this for validation and test 33 | python $P2/apply_bpe.py -c ../${T}.bpe < all_${S}-${T}.${T}.tok > all_${S}-${T}.${T}.tok.bpe # do this for validation and test 34 | 35 | # shuffle 36 | python $P1/shuffle.py all_${S}-${T}.${S}.tok.bpe all_${S}-${T}.${T}.tok.bpe all_${S}-${T}.${S}.tok all_${S}-${T}.${T}.tok 37 | 38 | # build dictionary 39 | #python $P1/build_dictionary.py all_${S}-${T}.${S}.tok & 40 | #python $P1/build_dictionary.py all_${S}-${T}.${T}.tok & 41 | #python $P1/build_dictionary_word.py all_${S}-${T}.${S}.tok.bpe & 42 | #python $P1/build_dictionary_word.py all_${S}-${T}.${T}.tok.bpe & 43 | -------------------------------------------------------------------------------- /preprocess/shuffle.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import random 4 | 5 | from tempfile import mkstemp 6 | from subprocess import call 7 | 8 | 9 | 10 | def main(files): 11 | 12 | tf_os, tpath = mkstemp() 13 | tf = open(tpath, 'w') 14 | 15 | fds = [open(ff) for ff in files] 16 | 17 | for l in fds[0]: 18 | lines = [l.strip()] + [ff.readline().strip() for ff in fds[1:]] 19 | print >>tf, "|||".join(lines) 20 | 21 | [ff.close() for ff in fds] 22 | tf.close() 23 | 24 | tf = open(tpath, 'r') 25 | lines = tf.readlines() 26 | random.shuffle(lines) 27 | 28 | fds = [open(ff+'.shuf','w') for ff in files] 29 | 30 | for l in lines: 31 | s = l.strip().split('|||') 32 | for ii, fd in enumerate(fds): 33 | print >>fd, s[ii] 34 | 35 | [ff.close() for ff in fds] 36 | 37 | os.remove(tpath) 38 | 39 | if __name__ == '__main__': 40 | main(sys.argv[1:]) 41 | 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /presentation/appendix.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nyu-dl/dl4mt-cdec/e738dc7235cb2819ad2b4e8e5837e97b2fb41de2/presentation/appendix.pdf -------------------------------------------------------------------------------- /subword_base/train_wmt15_csen_bpe2bpe_both_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from subword_base_both import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both', 11 | 'two_layer_gru_decoder_both'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_both_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | 
dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_csen_bpe2bpe_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /subword_base/train_wmt15_deen_bpe2bpe_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from subword_base import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder', 11 | 'two_layer_gru_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | 
dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2bpe_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /subword_base/train_wmt15_deen_bpe2bpe_both_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from subword_base_both import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both', 11 | 'two_layer_gru_decoder_both'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_both_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 
40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2bpe_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /subword_base/train_wmt15_fien_bpe2bpe_both_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from subword_base_both import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both', 11 | 'two_layer_gru_decoder_both'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_both_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | 
batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_fien_bpe2bpe_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /subword_base/train_wmt15_ruen_bpe2bpe_both_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from subword_base_both import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both', 11 | 'two_layer_gru_decoder_both'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_both_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | 
dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_ruen_bpe2bpe_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /subword_base/translate.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from subword_base import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 
33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | queue.put((idx, x)) 126 | return idx+1 127 | 128 | def _finish_processes(): 129 | for midx in xrange(n_process): 130 | queue.put(None) 131 | 132 | def _retrieve_jobs(n_samples): 133 | trans = [None] * n_samples 134 | for idx in xrange(n_samples): 135 | resp = rqueue.get() 136 | trans[resp[0]] = resp[1] 137 | if numpy.mod(idx, 10) == 0: 138 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 139 | return trans 140 | 141 | print 'Translating ', source_file, '...' 
142 | n_samples = _send_jobs(source_file) 143 | trans = _seqs2words(_retrieve_jobs(n_samples)) 144 | _finish_processes() 145 | with open(saveto, 'w') as f: 146 | if decoder_bpe_to_tok: 147 | print >>f, '\n'.join(trans).replace('@@ ', '') 148 | else: 149 | print >>f, '\n'.join(trans) 150 | print 'Done' 151 | 152 | 153 | if __name__ == "__main__": 154 | parser = argparse.ArgumentParser() 155 | parser.add_argument('-k', type=int, default=5) 156 | parser.add_argument('-p', type=int, default=5) 157 | parser.add_argument('-n', action="store_true", default=False) 158 | parser.add_argument('-bpe', action="store_true", default=False) 159 | parser.add_argument('-enc_c', action="store_true", default=False) 160 | parser.add_argument('-dec_c', action="store_true", default=False) 161 | parser.add_argument('-utf8', action="store_true", default=False) 162 | parser.add_argument('model', type=str) 163 | parser.add_argument('dictionary', type=str) 164 | parser.add_argument('dictionary_target', type=str) 165 | parser.add_argument('source', type=str) 166 | parser.add_argument('saveto', type=str) 167 | 168 | args = parser.parse_args() 169 | 170 | main(args.model, args.dictionary, args.dictionary_target, args.source, 171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 172 | encoder_chr_level=args.enc_c, 173 | decoder_chr_level=args.dec_c, 174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 175 | -------------------------------------------------------------------------------- /subword_base/translate_both.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from subword_base_both import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 
33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | queue.put((idx, x)) 126 | return idx+1 127 | 128 | def _finish_processes(): 129 | for midx in xrange(n_process): 130 | queue.put(None) 131 | 132 | def _retrieve_jobs(n_samples): 133 | trans = [None] * n_samples 134 | for idx in xrange(n_samples): 135 | resp = rqueue.get() 136 | trans[resp[0]] = resp[1] 137 | if numpy.mod(idx, 10) == 0: 138 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 139 | return trans 140 | 141 | print 'Translating ', source_file, '...' 
142 | n_samples = _send_jobs(source_file) 143 | trans = _seqs2words(_retrieve_jobs(n_samples)) 144 | _finish_processes() 145 | with open(saveto, 'w') as f: 146 | if decoder_bpe_to_tok: 147 | print >>f, '\n'.join(trans).replace('@@ ', '') 148 | else: 149 | print >>f, '\n'.join(trans) 150 | print 'Done' 151 | 152 | 153 | if __name__ == "__main__": 154 | parser = argparse.ArgumentParser() 155 | parser.add_argument('-k', type=int, default=5) 156 | parser.add_argument('-p', type=int, default=5) 157 | parser.add_argument('-n', action="store_true", default=False) 158 | parser.add_argument('-bpe', action="store_true", default=False) 159 | parser.add_argument('-enc_c', action="store_true", default=False) 160 | parser.add_argument('-dec_c', action="store_true", default=False) 161 | parser.add_argument('-utf8', action="store_true", default=False) 162 | parser.add_argument('model', type=str) 163 | parser.add_argument('dictionary', type=str) 164 | parser.add_argument('dictionary_target', type=str) 165 | parser.add_argument('source', type=str) 166 | parser.add_argument('saveto', type=str) 167 | 168 | args = parser.parse_args() 169 | 170 | main(args.model, args.dictionary, args.dictionary_target, args.source, 171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 172 | encoder_chr_level=args.enc_c, 173 | decoder_chr_level=args.dec_c, 174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 175 | -------------------------------------------------------------------------------- /subword_base/translate_both_bpe2bpe_ensemble_deen.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | import ipdb 9 | 10 | from nmt_both import (build_sampler, init_params) 11 | from mixer import * 12 | 13 | from multiprocessing import Process, Queue 14 | 15 | 16 | def gen_sample(tparams, f_inits, f_nexts, x, options, trng=None, 17 | k=1, maxlen=500, stochastic=True, argmax=False): 18 | 19 | # k is the beam size we have 20 | if k > 1: 21 | assert not stochastic, \ 22 | 'Beam search does not support stochastic sampling' 23 | 24 | sample = [] 25 | sample_score = [] 26 | if stochastic: 27 | sample_score = 0 28 | 29 | live_k = 1 30 | dead_k = 0 31 | 32 | hyp_samples = [[]] * live_k 33 | hyp_scores = numpy.zeros(live_k).astype('float32') 34 | hyp_states = [] 35 | 36 | # get initial state of decoder rnn and encoder context 37 | rets = [] 38 | next_state_chars = [] 39 | next_state_words = [] 40 | ctx0s = [] 41 | 42 | for i in xrange(len(f_inits)): 43 | ret = f_inits[i](x) 44 | next_state_chars.append(ret[0]) 45 | next_state_words.append(ret[1]) 46 | ctx0s.append(ret[2]) 47 | next_w = -1 * numpy.ones((1,)).astype('int64') # bos indicator 48 | 49 | num_models = len(f_inits) 50 | 51 | for ii in xrange(maxlen): 52 | 53 | temp_next_p = [] 54 | temp_next_state_char = [] 55 | temp_next_state_word = [] 56 | 57 | for i in xrange(num_models): 58 | 59 | ctx = numpy.tile(ctx0s[i], [live_k, 1]) 60 | inps = [next_w, ctx, next_state_chars[i], next_state_words[i]] 61 | ret = f_nexts[i](*inps) 62 | next_p, _, next_state_char, next_state_word = ret[0], ret[1], ret[2], ret[3] 63 | temp_next_p.append(next_p) 64 | temp_next_state_char.append(next_state_char) 65 | temp_next_state_word.append(next_state_word) 66 | #next_p = numpy.log(numpy.array(temp_next_p)).sum(axis=0) / num_models 67 | next_p = numpy.log(numpy.array(temp_next_p).mean(axis=0)) 68 | 69 | if stochastic: 70 | if argmax: 
71 | nw = next_p[0].argmax() 72 | else: 73 | nw = next_w[0] 74 | sample.append(nw) 75 | sample_score += next_p[0, nw] 76 | if nw == 0: 77 | break 78 | else: 79 | cand_scores = hyp_scores[:, None] - next_p 80 | cand_flat = cand_scores.flatten() 81 | ranks_flat = cand_flat.argsort()[:(k - dead_k)] 82 | 83 | voc_size = next_p.shape[1] 84 | trans_indices = ranks_flat / voc_size 85 | word_indices = ranks_flat % voc_size 86 | costs = cand_flat[ranks_flat] 87 | 88 | new_hyp_samples = [] 89 | new_hyp_scores = numpy.zeros(k - dead_k).astype('float32') 90 | new_hyp_states_chars = [] 91 | new_hyp_states_words = [] 92 | 93 | for idx, [ti, wi] in enumerate(zip(trans_indices, word_indices)): 94 | new_hyp_samples.append(hyp_samples[ti] + [wi]) 95 | new_hyp_scores[idx] = copy.copy(costs[idx]) 96 | 97 | for i in xrange(num_models): 98 | new_hyp_states_char = [] 99 | new_hyp_states_word = [] 100 | 101 | for ti in trans_indices: 102 | new_hyp_states_char.append(copy.copy(temp_next_state_char[i][ti])) 103 | new_hyp_states_word.append(copy.copy(temp_next_state_word[i][ti])) 104 | 105 | new_hyp_states_chars.append(new_hyp_states_char) 106 | new_hyp_states_words.append(new_hyp_states_word) 107 | 108 | # check the finished samples 109 | new_live_k = 0 110 | hyp_samples = [] 111 | hyp_scores = [] 112 | 113 | for idx in xrange(len(new_hyp_samples)): 114 | if new_hyp_samples[idx][-1] == 0: 115 | sample.append(new_hyp_samples[idx]) 116 | sample_score.append(new_hyp_scores[idx]) 117 | dead_k += 1 118 | else: 119 | new_live_k += 1 120 | hyp_samples.append(new_hyp_samples[idx]) 121 | hyp_scores.append(new_hyp_scores[idx]) 122 | 123 | for i in xrange(num_models): 124 | hyp_states_char = [] 125 | hyp_states_word = [] 126 | 127 | for idx in xrange(len(new_hyp_samples)): 128 | if new_hyp_samples[idx][-1] != 0: 129 | hyp_states_char.append(new_hyp_states_chars[i][idx]) 130 | hyp_states_word.append(new_hyp_states_words[i][idx]) 131 | 132 | next_state_chars[i] = numpy.array(hyp_states_char) 133 | next_state_words[i] = numpy.array(hyp_states_word) 134 | 135 | hyp_scores = numpy.array(hyp_scores) 136 | live_k = new_live_k 137 | 138 | if new_live_k < 1: 139 | break 140 | if dead_k >= k: 141 | break 142 | 143 | next_w = numpy.array([w[-1] for w in hyp_samples]) 144 | 145 | if not stochastic: 146 | # dump every remaining one 147 | if live_k > 0: 148 | for idx in xrange(live_k): 149 | sample.append(hyp_samples[idx]) 150 | sample_score.append(hyp_scores[idx]) 151 | 152 | return sample, sample_score 153 | 154 | 155 | def translate_model(queue, rqueue, pid, models, options, k, normalize): 156 | 157 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 158 | trng = RandomStreams(1234) 159 | 160 | # allocate model parameters 161 | params = [] 162 | for i in xrange(len(models)): 163 | params.append(init_params(options)) 164 | 165 | # load model parameters and set theano shared variables 166 | tparams = [] 167 | for i in xrange(len(params)): 168 | params[i] = load_params(models[i], params[i]) 169 | tparams.append(init_tparams(params[i])) 170 | 171 | # word index 172 | use_noise = theano.shared(numpy.float32(0.)) 173 | f_inits = [] 174 | f_nexts = [] 175 | for i in xrange(len(tparams)): 176 | f_init, f_next = build_sampler(tparams[i], options, trng, use_noise) 177 | f_inits.append(f_init) 178 | f_nexts.append(f_next) 179 | 180 | def _translate(seq): 181 | use_noise.set_value(0.) 
182 | # sample given an input sequence and obtain scores 183 | sample, score = gen_sample(tparams, f_inits, f_nexts, 184 | numpy.array(seq).reshape([len(seq), 1]), 185 | options, trng=trng, k=k, maxlen=500, 186 | stochastic=False, argmax=False) 187 | 188 | # normalize scores according to sequence lengths 189 | if normalize: 190 | lengths = numpy.array([len(s) for s in sample]) 191 | score = score / lengths 192 | sidx = numpy.argmin(score) 193 | return sample[sidx] 194 | 195 | while True: 196 | req = queue.get() 197 | if req is None: 198 | break 199 | 200 | idx, x = req[0], req[1] 201 | print pid, '-', idx 202 | seq = _translate(x) 203 | 204 | rqueue.put((idx, seq)) 205 | 206 | return 207 | 208 | 209 | def main(models, dictionary, dictionary_target, source_file, saveto, k=5, 210 | normalize=False, n_process=5, encoder_chr_level=False, 211 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 212 | 213 | # load model model_options 214 | pkl_file = models[0].split('.')[0] + '.pkl' 215 | with open(pkl_file, 'rb') as f: 216 | options = pkl.load(f) 217 | 218 | # load source dictionary and invert 219 | with open(dictionary, 'rb') as f: 220 | word_dict = pkl.load(f) 221 | word_idict = dict() 222 | for kk, vv in word_dict.iteritems(): 223 | word_idict[vv] = kk 224 | word_idict[0] = '' 225 | word_idict[1] = 'UNK' 226 | 227 | # load target dictionary and invert 228 | with open(dictionary_target, 'rb') as f: 229 | word_dict_trg = pkl.load(f) 230 | word_idict_trg = dict() 231 | for kk, vv in word_dict_trg.iteritems(): 232 | word_idict_trg[vv] = kk 233 | word_idict_trg[0] = '' 234 | word_idict_trg[1] = 'UNK' 235 | 236 | # create input and output queues for processes 237 | queue = Queue() 238 | rqueue = Queue() 239 | processes = [None] * n_process 240 | for midx in xrange(n_process): 241 | processes[midx] = Process( 242 | target=translate_model, 243 | args=(queue, rqueue, midx, models, options, k, normalize)) 244 | processes[midx].start() 245 | 246 | # utility function 247 | def _seqs2words(caps): 248 | capsw = [] 249 | for cc in caps: 250 | ww = [] 251 | for w in cc: 252 | if w == 0: 253 | break 254 | if utf8: 255 | ww.append(word_idict_trg[w].encode('utf-8')) 256 | else: 257 | ww.append(word_idict_trg[w]) 258 | if decoder_chr_level: 259 | capsw.append(''.join(ww)) 260 | else: 261 | capsw.append(' '.join(ww)) 262 | return capsw 263 | 264 | def _send_jobs(fname): 265 | with open(fname, 'r') as f: 266 | for idx, line in enumerate(f): 267 | if encoder_chr_level: 268 | words = list(line.decode('utf-8').strip()) 269 | else: 270 | words = line.strip().split() 271 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 272 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 273 | x += [0] 274 | queue.put((idx, x)) 275 | return idx+1 276 | 277 | def _finish_processes(): 278 | for midx in xrange(n_process): 279 | queue.put(None) 280 | 281 | def _retrieve_jobs(n_samples): 282 | trans = [None] * n_samples 283 | for idx in xrange(n_samples): 284 | resp = rqueue.get() 285 | trans[resp[0]] = resp[1] 286 | if numpy.mod(idx, 10) == 0: 287 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 288 | return trans 289 | 290 | print 'Translating ', source_file, '...' 
291 | n_samples = _send_jobs(source_file) 292 | trans = _seqs2words(_retrieve_jobs(n_samples)) 293 | _finish_processes() 294 | with open(saveto, 'w') as f: 295 | if decoder_bpe_to_tok: 296 | print >>f, '\n'.join(trans).replace('@@ ', '') 297 | else: 298 | print >>f, '\n'.join(trans) 299 | print 'Done' 300 | 301 | 302 | if __name__ == "__main__": 303 | parser = argparse.ArgumentParser() 304 | parser.add_argument('-k', type=int, default=5) 305 | parser.add_argument('-p', type=int, default=5) 306 | parser.add_argument('-n', action="store_true", default=False) 307 | parser.add_argument('-bpe', action="store_true", default=False) 308 | parser.add_argument('-enc_c', action="store_true", default=False) 309 | parser.add_argument('-dec_c', action="store_true", default=False) 310 | parser.add_argument('-utf8', action="store_true", default=False) 311 | parser.add_argument('saveto', type=str) 312 | 313 | model_path = '/misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/deen/bpe2bpe_two_layer_gru_decoder/0209/' 314 | model1 = model_path + 'bpe2bpe_two_layer_gru_decoder_both_adam_en1.290000.npz' 315 | model2 = model_path + 'bpe2bpe_two_layer_gru_decoder_both_adam_en2.260000.npz' 316 | model3 = model_path + 'bpe2bpe_two_layer_gru_decoder_both_adam_en3.290000.npz' 317 | model4 = model_path + 'bpe2bpe_two_layer_gru_decoder_both_adam.335000.npz' 318 | models = [model1, model2, model3, model4] 319 | dictionary = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/all_de-en.en.tok.bpe.word.pkl' 320 | dictionary_target = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/all_de-en.de.tok.bpe.word.pkl' 321 | source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/dev/newstest2013.en.tok.bpe' 322 | #source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/test/newstest2014-deen-src.en.tok.bpe' 323 | #source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/test/newstest2015-deen-src.en.tok.bpe' 324 | 325 | args = parser.parse_args() 326 | 327 | main(models, dictionary, dictionary_target, source, 328 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 329 | encoder_chr_level=args.enc_c, 330 | decoder_chr_level=args.dec_c, 331 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 332 | -------------------------------------------------------------------------------- /subword_base/wmt15_csen_bpe2bpe_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/csen/bpe2bpe_two_layer_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 21816 14 | n_words_src 21907 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 100 26 | maxlen_sample 100 27 | source_word_level 1 28 | target_word_level 1 29 | source_dataset all_cs-en.en.tok.bpe 30 | target_dataset all_cs-en.cs.tok.bpe 31 | valid_source_dataset newstest2013-src.en.tok.bpe 32 | valid_target_dataset newstest2013-ref.cs.tok.bpe 33 | source_dictionary all_cs-en.en.tok.bpe.word.pkl 34 | target_dictionary all_cs-en.cs.tok.bpe.word.pkl 35 | 
-------------------------------------------------------------------------------- /subword_base/wmt15_deen_bpe2bpe_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/deen/bpe2bpe_two_layer_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 24254 14 | n_words_src 24440 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 100 26 | maxlen_sample 100 27 | source_word_level 1 28 | target_word_level 1 29 | source_dataset all_de-en.en.tok.bpe.shuf 30 | target_dataset all_de-en.de.tok.bpe.shuf 31 | valid_source_dataset newstest2013.en.tok.bpe 32 | valid_target_dataset newstest2013.de.tok.bpe 33 | source_dictionary all_de-en.en.tok.bpe.word.pkl 34 | target_dictionary all_de-en.de.tok.bpe.word.pkl 35 | -------------------------------------------------------------------------------- /subword_base/wmt15_fien_bpe2bpe_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/fien/bpe2bpe_two_layer_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 20783 14 | n_words_src 20174 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 100 26 | maxlen_sample 100 27 | source_word_level 1 28 | target_word_level 1 29 | source_dataset all_fi-en.en.tok.bpe.shuf 30 | target_dataset all_fi-en.fi.tok.bpe.shuf 31 | valid_source_dataset newsdev2015-enfi-src.en.tok.bpe 32 | valid_target_dataset newsdev2015-enfi-ref.fi.tok.bpe 33 | source_dictionary all_fi-en.en.tok.bpe.word.pkl 34 | target_dictionary all_fi-en.fi.tok.bpe.word.pkl 35 | -------------------------------------------------------------------------------- /subword_base/wmt15_ruen_bpe2bpe_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/ruen/bpe2bpe_two_layer_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 22106 14 | n_words_src 22030 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 100 26 | maxlen_sample 100 27 | source_word_level 1 28 | target_word_level 1 29 | source_dataset 
all_ru-en.en.tok.bpe 30 | target_dataset all_ru-en.ru.tok.bpe 31 | valid_source_dataset newstest2013-src.en.tok.bpe 32 | valid_target_dataset newstest2013-ref.ru.tok.bpe 33 | source_dictionary all_ru-en.en.tok.bpe.word.pkl 34 | target_dictionary all_ru-en.ru.tok.bpe.word.pkl 35 | -------------------------------------------------------------------------------- /translate_readme.txt: -------------------------------------------------------------------------------- 1 | Command for using translate.py BPE-case: 2 | python translate.py -k {beam_width} -p {number_of_processors} -n -bpe {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name} 3 | 4 | Command for using translate.py Char-case: 5 | python translate.py -k {beam_width} -p {number_of_processors} -n -dec_c -utf8 {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name} 6 | 7 | Command for using translate_both.py BPE-case: 8 | python translate_both.py -k {beam_width} -p {number_of_processors} -n -bpe {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name} 9 | 10 | Command for using translate_both.py Char-case: 11 | python translate_both.py -k {beam_width} -p {number_of_processors} -n -dec_c -utf8 {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name} 12 | 13 | Command for using translate_attc.py Char-case: 14 | python translate_attc.py -k {beam_width} -p {number_of_processors} -n -dec_c -utf8 {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name} 15 | 16 | Command for using `multi-bleu.perl': 17 | perl multi-bleu.perl {reference.txt} < {translated.txt} 18 | --------------------------------------------------------------------------------
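Worked example for the BPE case (illustrative only): the braces in the commands above are placeholders, and every file name below is hypothetical, modelled on the entries in subword_base/wmt15_deen_bpe2bpe_adam.txt and the save_file_name used by the training scripts rather than on files shipped with this repository. With a beam width of 12 and 5 decoding processes, a decoding run followed by BLEU scoring might look like:

python translate.py -k 12 -p 5 -n -bpe bpe2bpe_two_layer_gru_decoder_adam.npz all_de-en.en.tok.bpe.word.pkl all_de-en.de.tok.bpe.word.pkl newstest2013.en.tok.bpe newstest2013.trans.de

perl multi-bleu.perl newstest2013.de.tok < newstest2013.trans.de

The -n flag length-normalizes beam scores before picking the best hypothesis, and -bpe strips the "@@ " BPE separators from the output so it can be scored against the tokenized (non-BPE) reference.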