├── LICENSE
├── README.md
├── __init__.py
├── character_base
│   ├── __init__.py
│   ├── char_base.py
│   ├── char_base_both.py
│   ├── train_wmt15_csen_bpe2char_adam.py
│   ├── train_wmt15_deen_bpe2char_adam.py
│   ├── train_wmt15_deen_bpe2char_both_adam.py
│   ├── train_wmt15_fien_bpe2char_adam.py
│   ├── train_wmt15_ruen_bpe2char_adam.py
│   ├── translate.py
│   ├── translate_both.py
│   ├── translate_bpe2char_ensemble_csen.py
│   ├── translate_bpe2char_ensemble_deen.py
│   ├── translate_bpe2char_ensemble_fien.py
│   ├── translate_bpe2char_ensemble_ruen.py
│   ├── wmt15_csen_bpe2char_adam.txt
│   ├── wmt15_deen_bpe2char_adam.txt
│   ├── wmt15_fien_bpe2char_adam.txt
│   └── wmt15_ruen_bpe2char_adam.txt
├── character_biscale
│   ├── __init__.py
│   ├── char_biscale.py
│   ├── char_biscale_attc.py
│   ├── char_biscale_both.py
│   ├── train_wmt15_csen_adam.py
│   ├── train_wmt15_deen_adam.py
│   ├── train_wmt15_deen_attc_adam.py
│   ├── train_wmt15_deen_both_adam.py
│   ├── train_wmt15_fien_adam.py
│   ├── train_wmt15_ruen_adam.py
│   ├── translate.py
│   ├── translate_attc.py
│   ├── translate_both.py
│   ├── translate_ensemble_csen.py
│   ├── translate_ensemble_deen.py
│   ├── translate_ensemble_fien.py
│   ├── translate_ensemble_ruen.py
│   ├── wmt15_csen_bpe2char_adam.txt
│   ├── wmt15_deen_bpe2char_adam.txt
│   ├── wmt15_fien_bpe2char_adam.txt
│   └── wmt15_ruen_bpe2char_adam.txt
├── data_iterator.py
├── mixer.py
├── nmt.py
├── preprocess
│   ├── build_dictionary_char.py
│   ├── build_dictionary_word.py
│   ├── clean_tags.py
│   ├── fix_appo.sh
│   ├── merge.sh
│   ├── multi-bleu.perl
│   ├── nonbreaking_prefixes
│   │   ├── README.txt
│   │   ├── nonbreaking_prefix.ca
│   │   ├── nonbreaking_prefix.cs
│   │   ├── nonbreaking_prefix.de
│   │   ├── nonbreaking_prefix.el
│   │   ├── nonbreaking_prefix.en
│   │   ├── nonbreaking_prefix.es
│   │   ├── nonbreaking_prefix.fi
│   │   ├── nonbreaking_prefix.fr
│   │   ├── nonbreaking_prefix.hu
│   │   ├── nonbreaking_prefix.is
│   │   ├── nonbreaking_prefix.it
│   │   ├── nonbreaking_prefix.lv
│   │   ├── nonbreaking_prefix.nl
│   │   ├── nonbreaking_prefix.pl
│   │   ├── nonbreaking_prefix.pt
│   │   ├── nonbreaking_prefix.ro
│   │   ├── nonbreaking_prefix.ru
│   │   ├── nonbreaking_prefix.sk
│   │   ├── nonbreaking_prefix.sl
│   │   ├── nonbreaking_prefix.sv
│   │   └── nonbreaking_prefix.ta
│   ├── normalize-punctuation.perl
│   ├── preprocess.sh
│   ├── shuffle.py
│   ├── tokenizer.perl
│   └── tokenizer_apos.perl
├── presentation
│   └── appendix.pdf
├── subword_base
│   ├── subword_base.py
│   ├── subword_base_both.py
│   ├── train_wmt15_csen_bpe2bpe_both_adam.py
│   ├── train_wmt15_deen_bpe2bpe_adam.py
│   ├── train_wmt15_deen_bpe2bpe_both_adam.py
│   ├── train_wmt15_fien_bpe2bpe_both_adam.py
│   ├── train_wmt15_ruen_bpe2bpe_both_adam.py
│   ├── translate.py
│   ├── translate_both.py
│   ├── translate_both_bpe2bpe_ensemble_csen.py
│   ├── translate_both_bpe2bpe_ensemble_deen.py
│   ├── translate_both_bpe2bpe_ensemble_fien.py
│   ├── translate_both_bpe2bpe_ensemble_ruen.py
│   ├── wmt15_csen_bpe2bpe_adam.txt
│   ├── wmt15_deen_bpe2bpe_adam.txt
│   ├── wmt15_fien_bpe2bpe_adam.txt
│   └── wmt15_ruen_bpe2bpe_adam.txt
└── translate_readme.txt
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2016, Junyoung Chung, Kyunghyun Cho
2 | All rights reserved.
3 |
4 | Redistribution and use in source and binary forms, with or without
5 | modification, are permitted provided that the following conditions are met:
6 |
7 | * Redistributions of source code must retain the above copyright notice, this
8 | list of conditions and the following disclaimer.
9 |
10 | * Redistributions in binary form must reproduce the above copyright notice,
11 | this list of conditions and the following disclaimer in the documentation
12 | and/or other materials provided with the distribution.
13 |
14 | * Neither the name of dl4mt-cdec nor the names of its
15 | contributors may be used to endorse or promote products derived from
16 | this software without specific prior written permission.
17 |
18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
19 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
20 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
21 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
22 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
23 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
24 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
25 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
26 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
27 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
28 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Character-Level Neural Machine Translation
This is an implementation of the models described in the paper "A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation".
http://arxiv.org/abs/1603.06147
Dependencies:
-------------
Most of the scripts are written in pure Theano.
The preprocessing pipeline additionally depends on the following.
Python Libraries: NLTK
MOSES: https://github.com/moses-smt/mosesdecoder
Subword-NMT (http://arxiv.org/abs/1508.07909): https://github.com/rsennrich/subword-nmt
This code is based on the dl4mt library.
link: https://github.com/nyu-dl/dl4mt-tutorial
Be sure to include the path to this library in your PYTHONPATH.
We recommend using the latest version of Theano.
If you want an exact reproduction, however, please use the following version of Theano:
commit hash: fdfbab37146ee475b3fd17d8d104fb09bf3a8d5c
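A minimal setup sketch, assuming a pip-managed Theano install; the checkout paths below are placeholders, not part of this repository:
```bash
# make this repository's root (nmt.py, mixer.py) and the dl4mt-tutorial code importable;
# adjust the placeholder paths to your own checkouts
git clone https://github.com/nyu-dl/dl4mt-tutorial.git
export PYTHONPATH=$PYTHONPATH:/path/to/dl4mt-cdec:/path/to/dl4mt-tutorial

# for exact reproduction, pin Theano to the commit above
pip install git+https://github.com/Theano/Theano.git@fdfbab37146ee475b3fd17d8d104fb09bf3a8d5c
```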
Preparing Text Corpora:
-----------------------
The original text corpora can be downloaded from http://www.statmt.org/wmt15/translation-task.html
Once the download is finished, use 'preprocess.sh' in the 'preprocess' directory to preprocess the text files.
Preprocessing is not strictly necessary for the character-level decoders; however,
in order to compare the results with subword-level decoders and other word-level approaches, we apply the same preprocessing to all of the target corpora.
Finally, use 'build_dictionary_char.py' for the character-level case and 'build_dictionary_word.py' for the subword-level case to build the vocabularies, as sketched below.
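A rough sketch of the intended flow (the exact command-line arguments of 'preprocess.sh' and the two dictionary scripts are assumptions here; the file names follow those referenced in the wmt15_*_adam.txt configs):
```bash
# tokenize/normalize the raw WMT'15 corpora and apply BPE to the source side
bash preprocess/preprocess.sh

# build the source (BPE/subword) and target (character) vocabularies;
# the resulting .pkl files are the dictionaries the configs point to,
# e.g. all_de-en.en.tok.bpe.word.pkl and all_de-en.de.tok.300.pkl
python preprocess/build_dictionary_word.py all_de-en.en.tok.bpe
python preprocess/build_dictionary_char.py all_de-en.de.tok
```
Each train_wmt15_*_adam.py script then reads one of the wmt15_*_adam.txt files (space-separated key/value pairs, one per line) and passes those settings to train().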
Updating...
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nyu-dl/dl4mt-cdec/e738dc7235cb2819ad2b4e8e5837e97b2fb41de2/__init__.py
--------------------------------------------------------------------------------
/character_base/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nyu-dl/dl4mt-cdec/e738dc7235cb2819ad2b4e8e5837e97b2fb41de2/character_base/__init__.py
--------------------------------------------------------------------------------
/character_base/train_wmt15_csen_bpe2char_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from char_base import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder',
11 | 'two_layer_gru_decoder'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2char_two_layer_gru_decoder_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_csen_bpe2char_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/character_base/train_wmt15_deen_bpe2char_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from char_base import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder',
11 | 'two_layer_gru_decoder'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2char_two_layer_gru_decoder_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/character_base/train_wmt15_deen_bpe2char_both_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from char_base_both import train
5 | from nmt_both import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both',
11 | 'two_layer_gru_decoder_both'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2char_two_layer_gru_decoder_both_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/character_base/train_wmt15_fien_bpe2char_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from char_base import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder',
11 | 'two_layer_gru_decoder'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2char_two_layer_gru_decoder_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_fien_bpe2char_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/character_base/train_wmt15_ruen_bpe2char_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from char_base import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder',
11 | 'two_layer_gru_decoder'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = True
17 | save_file_name = 'bpe2char_two_layer_gru_decoder_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_ruen_bpe2char_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/character_base/translate.py:
--------------------------------------------------------------------------------
1 | '''
2 | Translates a source file using a translation model.
3 | '''
4 | import argparse
5 |
6 | import numpy
7 | import cPickle as pkl
8 |
9 | from char_base import (build_sampler, gen_sample, init_params)
10 | from mixer import *
11 |
12 | from multiprocessing import Process, Queue
13 |
14 |
15 | def translate_model(queue, rqueue, pid, model, options, k, normalize):
16 |
17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
18 | trng = RandomStreams(1234)
19 |
20 | # allocate model parameters
21 | params = init_params(options)
22 |
23 | # load model parameters and set theano shared variables
24 | params = load_params(model, params)
25 | tparams = init_tparams(params)
26 |
27 | # word index
28 | use_noise = theano.shared(numpy.float32(0.))
29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise)
30 |
31 | def _translate(seq):
32 | use_noise.set_value(0.)
33 | # sample given an input sequence and obtain scores
34 | sample, score = gen_sample(tparams, f_init, f_next,
35 | numpy.array(seq).reshape([len(seq), 1]),
36 | options, trng=trng, k=k, maxlen=500,
37 | stochastic=False, argmax=False)
38 |
39 | # normalize scores according to sequence lengths
40 | if normalize:
41 | lengths = numpy.array([len(s) for s in sample])
42 | score = score / lengths
43 | sidx = numpy.argmin(score)
44 | return sample[sidx]
45 |
46 | while True:
47 | req = queue.get()
48 | if req is None:
49 | break
50 |
51 | idx, x = req[0], req[1]
52 | print pid, '-', idx
53 | seq = _translate(x)
54 |
55 | rqueue.put((idx, seq))
56 |
57 | return
58 |
59 |
60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5,
61 | normalize=False, n_process=5, encoder_chr_level=False,
62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False):
63 |
64 | # load model model_options
65 | pkl_file = model.split('.')[0] + '.pkl'
66 | with open(pkl_file, 'rb') as f:
67 | options = pkl.load(f)
68 |
69 | # load source dictionary and invert
70 | with open(dictionary, 'rb') as f:
71 | word_dict = pkl.load(f)
72 | word_idict = dict()
73 | for kk, vv in word_dict.iteritems():
74 | word_idict[vv] = kk
75 | word_idict[0] = ''
76 | word_idict[1] = 'UNK'
77 |
78 | # load target dictionary and invert
79 | with open(dictionary_target, 'rb') as f:
80 | word_dict_trg = pkl.load(f)
81 | word_idict_trg = dict()
82 | for kk, vv in word_dict_trg.iteritems():
83 | word_idict_trg[vv] = kk
84 | word_idict_trg[0] = ''
85 | word_idict_trg[1] = 'UNK'
86 |
87 | # create input and output queues for processes
88 | queue = Queue()
89 | rqueue = Queue()
90 | processes = [None] * n_process
91 | for midx in xrange(n_process):
92 | processes[midx] = Process(
93 | target=translate_model,
94 | args=(queue, rqueue, midx, model, options, k, normalize))
95 | processes[midx].start()
96 |
97 | # utility function
98 | def _seqs2words(caps):
99 | capsw = []
100 | for cc in caps:
101 | ww = []
102 | for w in cc:
103 | if w == 0:
104 | break
105 | if utf8:
106 | ww.append(word_idict_trg[w].encode('utf-8'))
107 | else:
108 | ww.append(word_idict_trg[w])
109 | if decoder_chr_level:
110 | capsw.append(''.join(ww))
111 | else:
112 | capsw.append(' '.join(ww))
113 | return capsw
114 |
115 | def _send_jobs(fname):
116 | with open(fname, 'r') as f:
117 | for idx, line in enumerate(f):
118 | if encoder_chr_level:
119 | words = list(line.decode('utf-8').strip())
120 | else:
121 | words = line.strip().split()
122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words)
123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x)
124 | x += [0]
125 | queue.put((idx, x))
126 | return idx+1
127 |
128 | def _finish_processes():
129 | for midx in xrange(n_process):
130 | queue.put(None)
131 |
132 | def _retrieve_jobs(n_samples):
133 | trans = [None] * n_samples
134 | for idx in xrange(n_samples):
135 | resp = rqueue.get()
136 | trans[resp[0]] = resp[1]
137 | if numpy.mod(idx, 10) == 0:
138 | print 'Sample ', (idx+1), '/', n_samples, ' Done'
139 | return trans
140 |
141 | print 'Translating ', source_file, '...'
142 | n_samples = _send_jobs(source_file)
143 | trans = _seqs2words(_retrieve_jobs(n_samples))
144 | _finish_processes()
145 | with open(saveto, 'w') as f:
146 | if decoder_bpe_to_tok:
147 | print >>f, '\n'.join(trans).replace('@@ ', '')
148 | else:
149 | print >>f, '\n'.join(trans)
150 | print 'Done'
151 |
152 |
153 | if __name__ == "__main__":
154 | parser = argparse.ArgumentParser()
155 | parser.add_argument('-k', type=int, default=5)
156 | parser.add_argument('-p', type=int, default=5)
157 | parser.add_argument('-n', action="store_true", default=False)
158 | parser.add_argument('-bpe', action="store_true", default=False)
159 | parser.add_argument('-enc_c', action="store_true", default=False)
160 | parser.add_argument('-dec_c', action="store_true", default=False)
161 | parser.add_argument('-utf8', action="store_true", default=False)
162 | parser.add_argument('model', type=str)
163 | parser.add_argument('dictionary', type=str)
164 | parser.add_argument('dictionary_target', type=str)
165 | parser.add_argument('source', type=str)
166 | parser.add_argument('saveto', type=str)
167 |
168 | args = parser.parse_args()
169 |
170 | main(args.model, args.dictionary, args.dictionary_target, args.source,
171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p,
172 | encoder_chr_level=args.enc_c,
173 | decoder_chr_level=args.dec_c,
174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe)
175 |
--------------------------------------------------------------------------------
/character_base/translate_both.py:
--------------------------------------------------------------------------------
1 | '''
2 | Translates a source file using a translation model.
3 | '''
4 | import argparse
5 |
6 | import numpy
7 | import cPickle as pkl
8 |
9 | from char_base_both import (build_sampler, gen_sample, init_params)
10 | from mixer import *
11 |
12 | from multiprocessing import Process, Queue
13 |
14 |
15 | def translate_model(queue, rqueue, pid, model, options, k, normalize):
16 |
17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
18 | trng = RandomStreams(1234)
19 |
20 | # allocate model parameters
21 | params = init_params(options)
22 |
23 | # load model parameters and set theano shared variables
24 | params = load_params(model, params)
25 | tparams = init_tparams(params)
26 |
27 | # word index
28 | use_noise = theano.shared(numpy.float32(0.))
29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise)
30 |
31 | def _translate(seq):
32 | use_noise.set_value(0.)
33 | # sample given an input sequence and obtain scores
34 | sample, score = gen_sample(tparams, f_init, f_next,
35 | numpy.array(seq).reshape([len(seq), 1]),
36 | options, trng=trng, k=k, maxlen=500,
37 | stochastic=False, argmax=False)
38 |
39 | # normalize scores according to sequence lengths
40 | if normalize:
41 | lengths = numpy.array([len(s) for s in sample])
42 | score = score / lengths
43 | sidx = numpy.argmin(score)
44 | return sample[sidx]
45 |
46 | while True:
47 | req = queue.get()
48 | if req is None:
49 | break
50 |
51 | idx, x = req[0], req[1]
52 | print pid, '-', idx
53 | seq = _translate(x)
54 |
55 | rqueue.put((idx, seq))
56 |
57 | return
58 |
59 |
60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5,
61 | normalize=False, n_process=5, encoder_chr_level=False,
62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False):
63 |
64 | # load model model_options
65 | pkl_file = model.split('.')[0] + '.pkl'
66 | with open(pkl_file, 'rb') as f:
67 | options = pkl.load(f)
68 |
69 | # load source dictionary and invert
70 | with open(dictionary, 'rb') as f:
71 | word_dict = pkl.load(f)
72 | word_idict = dict()
73 | for kk, vv in word_dict.iteritems():
74 | word_idict[vv] = kk
75 | word_idict[0] = ''
76 | word_idict[1] = 'UNK'
77 |
78 | # load target dictionary and invert
79 | with open(dictionary_target, 'rb') as f:
80 | word_dict_trg = pkl.load(f)
81 | word_idict_trg = dict()
82 | for kk, vv in word_dict_trg.iteritems():
83 | word_idict_trg[vv] = kk
84 | word_idict_trg[0] = ''
85 | word_idict_trg[1] = 'UNK'
86 |
87 | # create input and output queues for processes
88 | queue = Queue()
89 | rqueue = Queue()
90 | processes = [None] * n_process
91 | for midx in xrange(n_process):
92 | processes[midx] = Process(
93 | target=translate_model,
94 | args=(queue, rqueue, midx, model, options, k, normalize))
95 | processes[midx].start()
96 |
97 | # utility function
98 | def _seqs2words(caps):
99 | capsw = []
100 | for cc in caps:
101 | ww = []
102 | for w in cc:
103 | if w == 0:
104 | break
105 | if utf8:
106 | ww.append(word_idict_trg[w].encode('utf-8'))
107 | else:
108 | ww.append(word_idict_trg[w])
109 | if decoder_chr_level:
110 | capsw.append(''.join(ww))
111 | else:
112 | capsw.append(' '.join(ww))
113 | return capsw
114 |
115 | def _send_jobs(fname):
116 | with open(fname, 'r') as f:
117 | for idx, line in enumerate(f):
118 | if encoder_chr_level:
119 | words = list(line.decode('utf-8').strip())
120 | else:
121 | words = line.strip().split()
122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words)
123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x)
124 | x += [0]
125 | queue.put((idx, x))
126 | return idx+1
127 |
128 | def _finish_processes():
129 | for midx in xrange(n_process):
130 | queue.put(None)
131 |
132 | def _retrieve_jobs(n_samples):
133 | trans = [None] * n_samples
134 | for idx in xrange(n_samples):
135 | resp = rqueue.get()
136 | trans[resp[0]] = resp[1]
137 | if numpy.mod(idx, 10) == 0:
138 | print 'Sample ', (idx+1), '/', n_samples, ' Done'
139 | return trans
140 |
141 | print 'Translating ', source_file, '...'
142 | n_samples = _send_jobs(source_file)
143 | trans = _seqs2words(_retrieve_jobs(n_samples))
144 | _finish_processes()
145 | with open(saveto, 'w') as f:
146 | if decoder_bpe_to_tok:
147 | print >>f, '\n'.join(trans).replace('@@ ', '')
148 | else:
149 | print >>f, '\n'.join(trans)
150 | print 'Done'
151 |
152 |
153 | if __name__ == "__main__":
154 | parser = argparse.ArgumentParser()
155 | parser.add_argument('-k', type=int, default=5)
156 | parser.add_argument('-p', type=int, default=5)
157 | parser.add_argument('-n', action="store_true", default=False)
158 | parser.add_argument('-bpe', action="store_true", default=False)
159 | parser.add_argument('-enc_c', action="store_true", default=False)
160 | parser.add_argument('-dec_c', action="store_true", default=False)
161 | parser.add_argument('-utf8', action="store_true", default=False)
162 | parser.add_argument('model', type=str)
163 | parser.add_argument('dictionary', type=str)
164 | parser.add_argument('dictionary_target', type=str)
165 | parser.add_argument('source', type=str)
166 | parser.add_argument('saveto', type=str)
167 |
168 | args = parser.parse_args()
169 |
170 | main(args.model, args.dictionary, args.dictionary_target, args.source,
171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p,
172 | encoder_chr_level=args.enc_c,
173 | decoder_chr_level=args.dec_c,
174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe)
175 |
--------------------------------------------------------------------------------
/character_base/translate_bpe2char_ensemble_deen.py:
--------------------------------------------------------------------------------
1 | '''
2 | Translates a source file using a translation model.
3 | '''
4 | import argparse
5 |
6 | import numpy
7 | import cPickle as pkl
8 |
9 | from nmt import (build_sampler, init_params)
10 | from mixer import *
11 |
12 | from multiprocessing import Process, Queue
13 |
14 |
15 | def gen_sample(tparams, f_inits, f_nexts, x, options, trng=None,
16 | k=1, maxlen=500, stochastic=True, argmax=False):
17 |
18 | # k is the beam size we have
19 | if k > 1:
20 | assert not stochastic, \
21 | 'Beam search does not support stochastic sampling'
22 |
23 | sample = []
24 | sample_score = []
25 | if stochastic:
26 | sample_score = 0
27 |
28 | live_k = 1
29 | dead_k = 0
30 |
31 | hyp_samples = [[]] * live_k
32 | hyp_scores = numpy.zeros(live_k).astype('float32')
33 | hyp_states = []
34 |
35 | # get initial state of decoder rnn and encoder context
36 | rets = []
37 | next_state_chars = []
38 | next_state_words = []
39 | ctx0s = []
40 |
41 | for i in xrange(len(f_inits)):
42 | ret = f_inits[i](x)
43 | next_state_chars.append(ret[0])
44 | next_state_words.append(ret[1])
45 | ctx0s.append(ret[2])
46 | next_w = -1 * numpy.ones((1,)).astype('int64') # bos indicator
47 |
48 | num_models = len(f_inits)
49 |
50 | for ii in xrange(maxlen):
51 |
52 | temp_next_p = []
53 | temp_next_state_char = []
54 | temp_next_state_word = []
55 |
56 | for i in xrange(num_models):
57 |
58 | ctx = numpy.tile(ctx0s[i], [live_k, 1])
59 | inps = [next_w, ctx, next_state_chars[i], next_state_words[i]]
60 | ret = f_nexts[i](*inps)
61 | next_p, _, next_state_char, next_state_word = ret[0], ret[1], ret[2], ret[3]
62 | temp_next_p.append(next_p)
63 | temp_next_state_char.append(next_state_char)
64 | temp_next_state_word.append(next_state_word)
65 | #next_p = numpy.log(numpy.array(temp_next_p)).sum(axis=0) / num_models
66 | next_p = numpy.log(numpy.array(temp_next_p).mean(axis=0))
67 |
68 | if stochastic:
69 | if argmax:
70 | nw = next_p[0].argmax()
71 | else:
72 | nw = next_w[0]
73 | sample.append(nw)
74 | sample_score += next_p[0, nw]
75 | if nw == 0:
76 | break
77 | else:
78 | cand_scores = hyp_scores[:, None] - next_p
79 | cand_flat = cand_scores.flatten()
80 | ranks_flat = cand_flat.argsort()[:(k - dead_k)]
81 |
82 | voc_size = next_p.shape[1]
83 | trans_indices = ranks_flat / voc_size
84 | word_indices = ranks_flat % voc_size
85 | costs = cand_flat[ranks_flat]
86 |
87 | new_hyp_samples = []
88 | new_hyp_scores = numpy.zeros(k - dead_k).astype('float32')
89 | new_hyp_states_chars = []
90 | new_hyp_states_words = []
91 |
92 | for idx, [ti, wi] in enumerate(zip(trans_indices, word_indices)):
93 | new_hyp_samples.append(hyp_samples[ti] + [wi])
94 | new_hyp_scores[idx] = copy.copy(costs[idx])
95 |
96 | for i in xrange(num_models):
97 | new_hyp_states_char = []
98 | new_hyp_states_word = []
99 |
100 | for ti in trans_indices:
101 | new_hyp_states_char.append(copy.copy(temp_next_state_char[i][ti]))
102 | new_hyp_states_word.append(copy.copy(temp_next_state_word[i][ti]))
103 |
104 | new_hyp_states_chars.append(new_hyp_states_char)
105 | new_hyp_states_words.append(new_hyp_states_word)
106 |
107 | # check the finished samples
108 | new_live_k = 0
109 | hyp_samples = []
110 | hyp_scores = []
111 |
112 | for idx in xrange(len(new_hyp_samples)):
113 | if new_hyp_samples[idx][-1] == 0:
114 | sample.append(new_hyp_samples[idx])
115 | sample_score.append(new_hyp_scores[idx])
116 | dead_k += 1
117 | else:
118 | new_live_k += 1
119 | hyp_samples.append(new_hyp_samples[idx])
120 | hyp_scores.append(new_hyp_scores[idx])
121 |
122 | for i in xrange(num_models):
123 | hyp_states_char = []
124 | hyp_states_word = []
125 |
126 | for idx in xrange(len(new_hyp_samples)):
127 | if new_hyp_samples[idx][-1] != 0:
128 | hyp_states_char.append(new_hyp_states_chars[i][idx])
129 | hyp_states_word.append(new_hyp_states_words[i][idx])
130 |
131 | next_state_chars[i] = numpy.array(hyp_states_char)
132 | next_state_words[i] = numpy.array(hyp_states_word)
133 |
134 | hyp_scores = numpy.array(hyp_scores)
135 | live_k = new_live_k
136 |
137 | if new_live_k < 1:
138 | break
139 | if dead_k >= k:
140 | break
141 |
142 | next_w = numpy.array([w[-1] for w in hyp_samples])
143 |
144 | if not stochastic:
145 | # dump every remaining one
146 | if live_k > 0:
147 | for idx in xrange(live_k):
148 | sample.append(hyp_samples[idx])
149 | sample_score.append(hyp_scores[idx])
150 |
151 | return sample, sample_score
152 |
153 |
154 | def translate_model(queue, rqueue, pid, models, options, k, normalize):
155 |
156 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
157 | trng = RandomStreams(1234)
158 |
159 | # allocate model parameters
160 | params = []
161 | for i in xrange(len(models)):
162 | params.append(init_params(options))
163 |
164 | # load model parameters and set theano shared variables
165 | tparams = []
166 | for i in xrange(len(params)):
167 | params[i] = load_params(models[i], params[i])
168 | tparams.append(init_tparams(params[i]))
169 |
170 | # word index
171 | use_noise = theano.shared(numpy.float32(0.))
172 | f_inits = []
173 | f_nexts = []
174 | for i in xrange(len(tparams)):
175 | f_init, f_next = build_sampler(tparams[i], options, trng, use_noise)
176 | f_inits.append(f_init)
177 | f_nexts.append(f_next)
178 |
179 | def _translate(seq):
180 | use_noise.set_value(0.)
181 | # sample given an input sequence and obtain scores
182 | sample, score = gen_sample(tparams, f_inits, f_nexts,
183 | numpy.array(seq).reshape([len(seq), 1]),
184 | options, trng=trng, k=k, maxlen=500,
185 | stochastic=False, argmax=False)
186 |
187 | # normalize scores according to sequence lengths
188 | if normalize:
189 | lengths = numpy.array([len(s) for s in sample])
190 | score = score / lengths
191 | sidx = numpy.argmin(score)
192 | return sample[sidx]
193 |
194 | while True:
195 | req = queue.get()
196 | if req is None:
197 | break
198 |
199 | idx, x = req[0], req[1]
200 | print pid, '-', idx
201 | seq = _translate(x)
202 |
203 | rqueue.put((idx, seq))
204 |
205 | return
206 |
207 |
208 | def main(models, dictionary, dictionary_target, source_file, saveto, k=5,
209 | normalize=False, n_process=5, encoder_chr_level=False,
210 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False):
211 |
212 | # load model model_options
213 | pkl_file = models[0].split('.')[0] + '.pkl'
214 | with open(pkl_file, 'rb') as f:
215 | options = pkl.load(f)
216 |
217 | # load source dictionary and invert
218 | with open(dictionary, 'rb') as f:
219 | word_dict = pkl.load(f)
220 | word_idict = dict()
221 | for kk, vv in word_dict.iteritems():
222 | word_idict[vv] = kk
223 | word_idict[0] = ''
224 | word_idict[1] = 'UNK'
225 |
226 | # load target dictionary and invert
227 | with open(dictionary_target, 'rb') as f:
228 | word_dict_trg = pkl.load(f)
229 | word_idict_trg = dict()
230 | for kk, vv in word_dict_trg.iteritems():
231 | word_idict_trg[vv] = kk
232 | word_idict_trg[0] = ''
233 | word_idict_trg[1] = 'UNK'
234 |
235 | # create input and output queues for processes
236 | queue = Queue()
237 | rqueue = Queue()
238 | processes = [None] * n_process
239 | for midx in xrange(n_process):
240 | processes[midx] = Process(
241 | target=translate_model,
242 | args=(queue, rqueue, midx, models, options, k, normalize))
243 | processes[midx].start()
244 |
245 | # utility function
246 | def _seqs2words(caps):
247 | capsw = []
248 | for cc in caps:
249 | ww = []
250 | for w in cc:
251 | if w == 0:
252 | break
253 | if utf8:
254 | ww.append(word_idict_trg[w].encode('utf-8'))
255 | else:
256 | ww.append(word_idict_trg[w])
257 | if decoder_chr_level:
258 | capsw.append(''.join(ww))
259 | else:
260 | capsw.append(' '.join(ww))
261 | return capsw
262 |
263 | def _send_jobs(fname):
264 | with open(fname, 'r') as f:
265 | for idx, line in enumerate(f):
266 | if encoder_chr_level:
267 | words = list(line.decode('utf-8').strip())
268 | else:
269 | words = line.strip().split()
270 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words)
271 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x)
272 | x += [0]
273 | queue.put((idx, x))
274 | return idx+1
275 |
276 | def _finish_processes():
277 | for midx in xrange(n_process):
278 | queue.put(None)
279 |
280 | def _retrieve_jobs(n_samples):
281 | trans = [None] * n_samples
282 | for idx in xrange(n_samples):
283 | resp = rqueue.get()
284 | trans[resp[0]] = resp[1]
285 | if numpy.mod(idx, 10) == 0:
286 | print 'Sample ', (idx+1), '/', n_samples, ' Done'
287 | return trans
288 |
289 | print 'Translating ', source_file, '...'
290 | n_samples = _send_jobs(source_file)
291 | trans = _seqs2words(_retrieve_jobs(n_samples))
292 | _finish_processes()
293 | with open(saveto, 'w') as f:
294 | if decoder_bpe_to_tok:
295 | print >>f, '\n'.join(trans).replace('@@ ', '')
296 | else:
297 | print >>f, '\n'.join(trans)
298 | print 'Done'
299 |
300 |
301 | if __name__ == "__main__":
302 | parser = argparse.ArgumentParser()
303 | parser.add_argument('-k', type=int, default=5)
304 | parser.add_argument('-p', type=int, default=5)
305 | parser.add_argument('-n', action="store_true", default=False)
306 | parser.add_argument('-bpe', action="store_true", default=False)
307 | parser.add_argument('-enc_c', action="store_true", default=False)
308 | parser.add_argument('-dec_c', action="store_true", default=False)
309 | parser.add_argument('-utf8', action="store_true", default=False)
310 | parser.add_argument('saveto', type=str)
311 |
312 | model_path = '/misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/deen/bpe2char_two_layer_gru_decoder/0209/'
313 | model1 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en1.380000.npz'
314 | model2 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en2.425000.npz'
315 | model3 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en3.400000.npz'
316 | model4 = model_path + 'bpe2char_two_layer_gru_decoder_adam.365000.npz'
317 | models = [model1, model2, model3, model4]
318 | dictionary = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/all_de-en.en.tok.bpe.word.pkl'
319 | dictionary_target = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/all_de-en.de.tok.300.pkl'
320 | source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/dev/newstest2013.en.tok.bpe'
321 | #source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/test/newstest2014-deen-src.en.tok.bpe'
322 | #source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/test/newstest2015-deen-src.en.tok.bpe'
323 |
324 | args = parser.parse_args()
325 |
326 | main(models, dictionary, dictionary_target, source,
327 | args.saveto, k=args.k, normalize=args.n, n_process=args.p,
328 | encoder_chr_level=args.enc_c,
329 | decoder_chr_level=args.dec_c,
330 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe)
331 |
--------------------------------------------------------------------------------
/character_base/translate_bpe2char_ensemble_fien.py:
--------------------------------------------------------------------------------
1 | '''
2 | Translates a source file using a translation model.
3 | '''
4 | import argparse
5 |
6 | import numpy
7 | import cPickle as pkl
8 |
9 | from nmt import (build_sampler, init_params)
10 | from mixer import *
11 |
12 | from multiprocessing import Process, Queue
13 |
14 |
15 | def gen_sample(tparams, f_inits, f_nexts, x, options, trng=None,
16 | k=1, maxlen=500, stochastic=True, argmax=False):
17 |
18 | # k is the beam size we have
19 | if k > 1:
20 | assert not stochastic, \
21 | 'Beam search does not support stochastic sampling'
22 |
23 | sample = []
24 | sample_score = []
25 | if stochastic:
26 | sample_score = 0
27 |
28 | live_k = 1
29 | dead_k = 0
30 |
31 | hyp_samples = [[]] * live_k
32 | hyp_scores = numpy.zeros(live_k).astype('float32')
33 | hyp_states = []
34 |
35 | # get initial state of decoder rnn and encoder context
36 | rets = []
37 | next_state_chars = []
38 | next_state_words = []
39 | ctx0s = []
40 |
41 | for i in xrange(len(f_inits)):
42 | ret = f_inits[i](x)
43 | next_state_chars.append(ret[0])
44 | next_state_words.append(ret[1])
45 | ctx0s.append(ret[2])
46 | next_w = -1 * numpy.ones((1,)).astype('int64') # bos indicator
47 |
48 | num_models = len(f_inits)
49 |
50 | for ii in xrange(maxlen):
51 |
52 | temp_next_p = []
53 | temp_next_state_char = []
54 | temp_next_state_word = []
55 |
56 | for i in xrange(num_models):
57 |
58 | ctx = numpy.tile(ctx0s[i], [live_k, 1])
59 | inps = [next_w, ctx, next_state_chars[i], next_state_words[i]]
60 | ret = f_nexts[i](*inps)
61 | next_p, _, next_state_char, next_state_word = ret[0], ret[1], ret[2], ret[3]
62 | temp_next_p.append(next_p)
63 | temp_next_state_char.append(next_state_char)
64 | temp_next_state_word.append(next_state_word)
65 | #next_p = numpy.log(numpy.array(temp_next_p)).sum(axis=0) / num_models
66 | next_p = numpy.log(numpy.array(temp_next_p).mean(axis=0))
67 |
68 | if stochastic:
69 | if argmax:
70 | nw = next_p[0].argmax()
71 | else:
72 | nw = next_w[0]
73 | sample.append(nw)
74 | sample_score += next_p[0, nw]
75 | if nw == 0:
76 | break
77 | else:
78 | cand_scores = hyp_scores[:, None] - next_p
79 | cand_flat = cand_scores.flatten()
80 | ranks_flat = cand_flat.argsort()[:(k - dead_k)]
81 |
82 | voc_size = next_p.shape[1]
83 | trans_indices = ranks_flat / voc_size
84 | word_indices = ranks_flat % voc_size
85 | costs = cand_flat[ranks_flat]
86 |
87 | new_hyp_samples = []
88 | new_hyp_scores = numpy.zeros(k - dead_k).astype('float32')
89 | new_hyp_states_chars = []
90 | new_hyp_states_words = []
91 |
92 | for idx, [ti, wi] in enumerate(zip(trans_indices, word_indices)):
93 | new_hyp_samples.append(hyp_samples[ti] + [wi])
94 | new_hyp_scores[idx] = copy.copy(costs[idx])
95 |
96 | for i in xrange(num_models):
97 | new_hyp_states_char = []
98 | new_hyp_states_word = []
99 |
100 | for ti in trans_indices:
101 | new_hyp_states_char.append(copy.copy(temp_next_state_char[i][ti]))
102 | new_hyp_states_word.append(copy.copy(temp_next_state_word[i][ti]))
103 |
104 | new_hyp_states_chars.append(new_hyp_states_char)
105 | new_hyp_states_words.append(new_hyp_states_word)
106 |
107 | # check the finished samples
108 | new_live_k = 0
109 | hyp_samples = []
110 | hyp_scores = []
111 |
112 | for idx in xrange(len(new_hyp_samples)):
113 | if new_hyp_samples[idx][-1] == 0:
114 | sample.append(new_hyp_samples[idx])
115 | sample_score.append(new_hyp_scores[idx])
116 | dead_k += 1
117 | else:
118 | new_live_k += 1
119 | hyp_samples.append(new_hyp_samples[idx])
120 | hyp_scores.append(new_hyp_scores[idx])
121 |
122 | for i in xrange(num_models):
123 | hyp_states_char = []
124 | hyp_states_word = []
125 |
126 | for idx in xrange(len(new_hyp_samples)):
127 | if new_hyp_samples[idx][-1] != 0:
128 | hyp_states_char.append(new_hyp_states_chars[i][idx])
129 | hyp_states_word.append(new_hyp_states_words[i][idx])
130 |
131 | next_state_chars[i] = numpy.array(hyp_states_char)
132 | next_state_words[i] = numpy.array(hyp_states_word)
133 |
134 | hyp_scores = numpy.array(hyp_scores)
135 | live_k = new_live_k
136 |
137 | if new_live_k < 1:
138 | break
139 | if dead_k >= k:
140 | break
141 |
142 | next_w = numpy.array([w[-1] for w in hyp_samples])
143 |
144 | if not stochastic:
145 | # dump every remaining one
146 | if live_k > 0:
147 | for idx in xrange(live_k):
148 | sample.append(hyp_samples[idx])
149 | sample_score.append(hyp_scores[idx])
150 |
151 | return sample, sample_score
152 |
153 |
154 | def translate_model(queue, rqueue, pid, models, options, k, normalize):
155 |
156 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
157 | trng = RandomStreams(1234)
158 |
159 | # allocate model parameters
160 | params = []
161 | for i in xrange(len(models)):
162 | params.append(init_params(options))
163 |
164 | # load model parameters and set theano shared variables
165 | tparams = []
166 | for i in xrange(len(params)):
167 | params[i] = load_params(models[i], params[i])
168 | tparams.append(init_tparams(params[i]))
169 |
170 | # word index
171 | use_noise = theano.shared(numpy.float32(0.))
172 | f_inits = []
173 | f_nexts = []
174 | for i in xrange(len(tparams)):
175 | f_init, f_next = build_sampler(tparams[i], options, trng, use_noise)
176 | f_inits.append(f_init)
177 | f_nexts.append(f_next)
178 |
179 | def _translate(seq):
180 | use_noise.set_value(0.)
181 | # sample given an input sequence and obtain scores
182 | sample, score = gen_sample(tparams, f_inits, f_nexts,
183 | numpy.array(seq).reshape([len(seq), 1]),
184 | options, trng=trng, k=k, maxlen=500,
185 | stochastic=False, argmax=False)
186 |
187 | # normalize scores according to sequence lengths
188 | if normalize:
189 | lengths = numpy.array([len(s) for s in sample])
190 | score = score / lengths
191 | sidx = numpy.argmin(score)
192 | return sample[sidx]
193 |
194 | while True:
195 | req = queue.get()
196 | if req is None:
197 | break
198 |
199 | idx, x = req[0], req[1]
200 | print pid, '-', idx
201 | seq = _translate(x)
202 |
203 | rqueue.put((idx, seq))
204 |
205 | return
206 |
207 |
208 | def main(models, dictionary, dictionary_target, source_file, saveto, k=5,
209 | normalize=False, n_process=5, encoder_chr_level=False,
210 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False):
211 |
212 | # load model model_options
213 | pkl_file = models[0].split('.')[0] + '.pkl'
214 | with open(pkl_file, 'rb') as f:
215 | options = pkl.load(f)
216 |
217 | # load source dictionary and invert
218 | with open(dictionary, 'rb') as f:
219 | word_dict = pkl.load(f)
220 | word_idict = dict()
221 | for kk, vv in word_dict.iteritems():
222 | word_idict[vv] = kk
223 | word_idict[0] = ''
224 | word_idict[1] = 'UNK'
225 |
226 | # load target dictionary and invert
227 | with open(dictionary_target, 'rb') as f:
228 | word_dict_trg = pkl.load(f)
229 | word_idict_trg = dict()
230 | for kk, vv in word_dict_trg.iteritems():
231 | word_idict_trg[vv] = kk
232 | word_idict_trg[0] = ''
233 | word_idict_trg[1] = 'UNK'
234 |
235 | # create input and output queues for processes
236 | queue = Queue()
237 | rqueue = Queue()
238 | processes = [None] * n_process
239 | for midx in xrange(n_process):
240 | processes[midx] = Process(
241 | target=translate_model,
242 | args=(queue, rqueue, midx, models, options, k, normalize))
243 | processes[midx].start()
244 |
245 | # utility function
246 | def _seqs2words(caps):
247 | capsw = []
248 | for cc in caps:
249 | ww = []
250 | for w in cc:
251 | if w == 0:
252 | break
253 | if utf8:
254 | ww.append(word_idict_trg[w].encode('utf-8'))
255 | else:
256 | ww.append(word_idict_trg[w])
257 | if decoder_chr_level:
258 | capsw.append(''.join(ww))
259 | else:
260 | capsw.append(' '.join(ww))
261 | return capsw
262 |
263 | def _send_jobs(fname):
264 | with open(fname, 'r') as f:
265 | for idx, line in enumerate(f):
266 | if encoder_chr_level:
267 | words = list(line.decode('utf-8').strip())
268 | else:
269 | words = line.strip().split()
270 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words)
271 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x)
272 | x += [0]
273 | queue.put((idx, x))
274 | return idx+1
275 |
276 | def _finish_processes():
277 | for midx in xrange(n_process):
278 | queue.put(None)
279 |
280 | def _retrieve_jobs(n_samples):
281 | trans = [None] * n_samples
282 | for idx in xrange(n_samples):
283 | resp = rqueue.get()
284 | trans[resp[0]] = resp[1]
285 | if numpy.mod(idx, 10) == 0:
286 | print 'Sample ', (idx+1), '/', n_samples, ' Done'
287 | return trans
288 |
289 | print 'Translating ', source_file, '...'
290 | n_samples = _send_jobs(source_file)
291 | trans = _seqs2words(_retrieve_jobs(n_samples))
292 | _finish_processes()
293 | with open(saveto, 'w') as f:
294 | if decoder_bpe_to_tok:
295 | print >>f, '\n'.join(trans).replace('@@ ', '')
296 | else:
297 | print >>f, '\n'.join(trans)
298 | print 'Done'
299 |
300 |
301 | if __name__ == "__main__":
302 | parser = argparse.ArgumentParser()
303 | parser.add_argument('-k', type=int, default=5)
304 | parser.add_argument('-p', type=int, default=5)
305 | parser.add_argument('-n', action="store_true", default=False)
306 | parser.add_argument('-bpe', action="store_true", default=False)
307 | parser.add_argument('-enc_c', action="store_true", default=False)
308 | parser.add_argument('-dec_c', action="store_true", default=False)
309 | parser.add_argument('-utf8', action="store_true", default=False)
310 | parser.add_argument('saveto', type=str)
311 |
312 | model_path = '/scratch/jc7382/acl2016/wmt15/fien/bpe2char_two_layer_gru_decoder/0209/'
313 | model1 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en1.205000.npz'
314 | model2 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en2.200000.npz'
315 | model3 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en3.200000.npz'
316 | model4 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en4.200000.npz'
317 | model5 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en5.210000.npz'
318 | model6 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en6.205000.npz'
319 | model7 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en7.200000.npz'
320 | model8 = model_path + 'new_bpe2char_two_layer_gru_decoder_adam.240000.npz'
321 | models = [model1, model2, model3, model4, model5, model6, model7, model8]
322 | dictionary = '/scratch/jc7382/data/wmt15/fien/train/all_fi-en.en.tok.bpe.word.pkl'
323 | dictionary_target = '/scratch/jc7382/data/wmt15/fien/train/all_fi-en.fi.tok.300.pkl'
324 | source = '/scratch/jc7382/data/wmt15/fien/dev/newsdev2015-enfi-src.en.tok.bpe'
325 | #source = '/scratch/jc7382/data/wmt15/fien/test/newstest2015-fien-src.en.tok.bpe'
326 |
327 | args = parser.parse_args()
328 |
329 | main(models, dictionary, dictionary_target, source,
330 | args.saveto, k=args.k, normalize=args.n, n_process=args.p,
331 | encoder_chr_level=args.enc_c,
332 | decoder_chr_level=args.dec_c,
333 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe)
334 |
--------------------------------------------------------------------------------
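
Note: the translate scripts in this repository all follow the same pattern: _send_jobs pushes (index, token-id sequence) pairs onto a multiprocessing queue, worker processes beam-search each input, and _retrieve_jobs writes each result back into the list slot keyed by its original index so the output order matches the source file. A minimal sketch of that order-preserving retrieval step (the helper name retrieve_in_order is illustrative only, not part of the repository; rqueue is assumed to be the workers' result Queue):

    def retrieve_in_order(rqueue, n_samples):
        # worker results arrive out of order; the index that was sent with
        # each job restores the original line order of the source file
        trans = [None] * n_samples
        for _ in range(n_samples):
            idx, hyp = rqueue.get()
            trans[idx] = hyp
        return trans
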
/character_base/wmt15_csen_bpe2char_adam.txt:
--------------------------------------------------------------------------------
1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/csen/bpe2char_two_layer_gru_decoder/0328/
2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/train/
3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/dev/
4 | max_epochs 1000000000000
5 | patience -1
6 | learning_rate 0.0001
7 | batch_size 128
8 | valid_batch_size 128
9 | enc_dim 512
10 | dec_dim 1024
11 | dim_word 512
12 | dim_word_src 512
13 | n_words 302
14 | n_words_src 21907
15 | optimizer adam
16 | decay_c 0
17 | use_dropout 0
18 | clip_c 1
19 | saveFreq 5000
20 | sampleFreq 5000
21 | dispFreq 1000
22 | validFreq 5000
23 | sort_size 20
24 | maxlen 50
25 | maxlen_trg 500
26 | maxlen_sample 500
27 | source_word_level 1
28 | target_word_level 0
29 | source_dataset all_cs-en.en.tok.bpe
30 | target_dataset all_cs-en.cs.tok
31 | valid_source_dataset newstest2013-src.en.tok.bpe
32 | valid_target_dataset newstest2013-ref.cs.tok
33 | source_dictionary all_cs-en.en.tok.bpe.word.pkl
34 | target_dictionary all_cs-en.cs.tok.300.pkl
35 |
--------------------------------------------------------------------------------
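
Note: each wmt15_*_adam.txt file is a plain-text configuration with one space-separated "key value" pair per line; the train_* scripts later in this listing read it into an OrderedDict and hand it to their main(), casting numeric fields with int()/float() at the call site. A minimal parsing sketch under that assumption (the helper name load_config is illustrative only):

    from collections import OrderedDict

    def load_config(path):
        # one "key value" pair per line; values are kept as strings and
        # cast to int/float by the caller, as in the train_* scripts
        params = OrderedDict()
        with open(path) as f:
            for line in f:
                line = line.rstrip('\n')
                if not line:
                    continue
                name, value = line.split(' ', 1)
                params[name] = value
        return params
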
/character_base/wmt15_deen_bpe2char_adam.txt:
--------------------------------------------------------------------------------
1 | save_path /raid/chungjun/acl2016/wmt15/deen/bpe2char_two_layer_gru_decoder/0417/
2 | train_data_path /raid/chungjun/data/wmt15/deen/train/
3 | dev_data_path /raid/chungjun/data/wmt15/deen/dev/
4 | max_epochs 1000000000000
5 | patience -1
6 | learning_rate 0.0001
7 | batch_size 128
8 | valid_batch_size 128
9 | enc_dim 512
10 | dec_dim 1024
11 | dim_word 512
12 | dim_word_src 512
13 | n_words 302
14 | n_words_src 24440
15 | optimizer adam
16 | decay_c 0
17 | use_dropout 0
18 | clip_c 1
19 | saveFreq 5000
20 | sampleFreq 5000
21 | dispFreq 1000
22 | validFreq 5000
23 | sort_size 20
24 | maxlen 50
25 | maxlen_trg 500
26 | maxlen_sample 500
27 | source_word_level 1
28 | target_word_level 0
29 | source_dataset all_de-en.en.tok.bpe.shuf
30 | target_dataset all_de-en.de.tok.shuf
31 | valid_source_dataset newstest2013.en.tok.bpe
32 | valid_target_dataset newstest2013.de.tok
33 | source_dictionary all_de-en.en.tok.bpe.word.pkl
34 | target_dictionary all_de-en.de.tok.300.pkl
35 |
--------------------------------------------------------------------------------
/character_base/wmt15_fien_bpe2char_adam.txt:
--------------------------------------------------------------------------------
1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/fien/bpe2char_two_layer_gru_decoder/0328/
2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/train/
3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/dev/
4 | max_epochs 1000000000000
5 | patience -1
6 | learning_rate 0.0001
7 | batch_size 128
8 | valid_batch_size 128
9 | enc_dim 512
10 | dec_dim 1024
11 | dim_word 512
12 | dim_word_src 512
13 | n_words 292
14 | n_words_src 20174
15 | optimizer adam
16 | decay_c 0
17 | use_dropout 0
18 | clip_c 1
19 | saveFreq 5000
20 | sampleFreq 5000
21 | dispFreq 1000
22 | validFreq 5000
23 | sort_size 20
24 | maxlen 50
25 | maxlen_trg 500
26 | maxlen_sample 500
27 | source_word_level 1
28 | target_word_level 0
29 | source_dataset all_fi-en.en.tok.bpe.shuf
30 | target_dataset all_fi-en.fi.tok.shuf
31 | valid_source_dataset newsdev2015-enfi-src.en.tok.bpe
32 | valid_target_dataset newsdev2015-enfi-ref.fi.tok
33 | source_dictionary all_fi-en.en.tok.bpe.word.pkl
34 | target_dictionary all_fi-en.fi.tok.300.pkl
35 |
--------------------------------------------------------------------------------
/character_base/wmt15_ruen_bpe2char_adam.txt:
--------------------------------------------------------------------------------
1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/ruen/bpe2char_two_layer_gru_decoder/0328/
2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/train/
3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/dev/
4 | max_epochs 1000000000000
5 | patience -1
6 | learning_rate 0.0001
7 | batch_size 128
8 | valid_batch_size 128
9 | enc_dim 512
10 | dec_dim 1024
11 | dim_word 512
12 | dim_word_src 512
13 | n_words 302
14 | n_words_src 22030
15 | optimizer adam
16 | decay_c 0
17 | use_dropout 0
18 | clip_c 1
19 | saveFreq 5000
20 | sampleFreq 5000
21 | dispFreq 1000
22 | validFreq 5000
23 | sort_size 20
24 | maxlen 50
25 | maxlen_trg 500
26 | maxlen_sample 500
27 | source_word_level 1
28 | target_word_level 0
29 | source_dataset all_ru-en.en.tok.bpe
30 | target_dataset all_ru-en.ru.tok
31 | valid_source_dataset newstest2013-src.en.tok.bpe
32 | valid_target_dataset newstest2013-ref.ru.tok
33 | source_dictionary all_ru-en.en.tok.bpe.word.pkl
34 | target_dictionary all_ru-en.ru.tok.300.pkl
35 |
--------------------------------------------------------------------------------
/character_biscale/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nyu-dl/dl4mt-cdec/e738dc7235cb2819ad2b4e8e5837e97b2fb41de2/character_biscale/__init__.py
--------------------------------------------------------------------------------
/character_biscale/train_wmt15_csen_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from char_biscale import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'biscale_decoder': ('param_init_biscale_decoder',
11 | 'biscale_decoder_decoder'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2char_biscale_decoder_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_csen_bpe2char_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/character_biscale/train_wmt15_deen_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from char_biscale import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'biscale_decoder': ('param_init_biscale_decoder',
11 | 'biscale_decoder_decoder'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2char_biscale_decoder_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/character_biscale/train_wmt15_deen_attc_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from char_biscale_attc import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'biscale_decoder_attc': ('param_init_biscale_decoder_attc',
11 | 'biscale_decoder_attc_decoder'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2char_biscale_decoder_attc_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample,
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/character_biscale/train_wmt15_deen_both_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from char_biscale_both import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'biscale_decoder_both': ('param_init_biscale_decoder_both',
11 | 'biscale_decoder_both_decoder'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2char_biscale_decoder_both_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample,
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/character_biscale/train_wmt15_fien_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from char_biscale import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'biscale_decoder': ('param_init_biscale_decoder',
11 | 'biscale_decoder_decoder'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2char_biscale_decoder_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_fien_bpe2char_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/character_biscale/train_wmt15_ruen_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from char_biscale import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'biscale_decoder': ('param_init_biscale_decoder',
11 | 'biscale_decoder_decoder'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2char_biscale_decoder_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_ruen_bpe2char_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/character_biscale/translate.py:
--------------------------------------------------------------------------------
1 | '''
2 | Translates a source file using a translation model.
3 | '''
4 | import argparse
5 |
6 | import numpy
7 | import cPickle as pkl
8 |
9 | from char_biscale import (build_sampler, gen_sample, init_params)
10 | from mixer import *
11 |
12 | from multiprocessing import Process, Queue
13 |
14 |
15 | def translate_model(queue, rqueue, pid, model, options, k, normalize):
16 |
17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
18 | trng = RandomStreams(1234)
19 |
20 | # allocate model parameters
21 | params = init_params(options)
22 |
23 | # load model parameters and set theano shared variables
24 | params = load_params(model, params)
25 | tparams = init_tparams(params)
26 |
27 | # word index
28 | use_noise = theano.shared(numpy.float32(0.))
29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise)
30 |
31 | def _translate(seq):
32 | use_noise.set_value(0.)
33 | # sample given an input sequence and obtain scores
34 | sample, score = gen_sample(tparams, f_init, f_next,
35 | numpy.array(seq).reshape([len(seq), 1]),
36 | options, trng=trng, k=k, maxlen=500,
37 | stochastic=False, argmax=False)
38 |
39 | # normalize scores according to sequence lengths
40 | if normalize:
41 | lengths = numpy.array([len(s) for s in sample])
42 | score = score / lengths
43 | sidx = numpy.argmin(score)
44 | return sample[sidx]
45 |
46 | while True:
47 | req = queue.get()
48 | if req is None:
49 | break
50 |
51 | idx, x = req[0], req[1]
52 | print pid, '-', idx
53 | seq = _translate(x)
54 |
55 | rqueue.put((idx, seq))
56 |
57 | return
58 |
59 |
60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5,
61 | normalize=False, n_process=5, encoder_chr_level=False,
62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False):
63 |
64 | # load model model_options
65 | pkl_file = model.split('.')[0] + '.pkl'
66 | with open(pkl_file, 'rb') as f:
67 | options = pkl.load(f)
68 |
69 | # load source dictionary and invert
70 | with open(dictionary, 'rb') as f:
71 | word_dict = pkl.load(f)
72 | word_idict = dict()
73 | for kk, vv in word_dict.iteritems():
74 | word_idict[vv] = kk
75 | word_idict[0] = ''
76 | word_idict[1] = 'UNK'
77 |
78 | # load target dictionary and invert
79 | with open(dictionary_target, 'rb') as f:
80 | word_dict_trg = pkl.load(f)
81 | word_idict_trg = dict()
82 | for kk, vv in word_dict_trg.iteritems():
83 | word_idict_trg[vv] = kk
84 | word_idict_trg[0] = ''
85 | word_idict_trg[1] = 'UNK'
86 |
87 | # create input and output queues for processes
88 | queue = Queue()
89 | rqueue = Queue()
90 | processes = [None] * n_process
91 | for midx in xrange(n_process):
92 | processes[midx] = Process(
93 | target=translate_model,
94 | args=(queue, rqueue, midx, model, options, k, normalize))
95 | processes[midx].start()
96 |
97 | # utility function
98 | def _seqs2words(caps):
99 | capsw = []
100 | for cc in caps:
101 | ww = []
102 | for w in cc:
103 | if w == 0:
104 | break
105 | if utf8:
106 | ww.append(word_idict_trg[w].encode('utf-8'))
107 | else:
108 | ww.append(word_idict_trg[w])
109 | if decoder_chr_level:
110 | capsw.append(''.join(ww))
111 | else:
112 | capsw.append(' '.join(ww))
113 | return capsw
114 |
115 | def _send_jobs(fname):
116 | with open(fname, 'r') as f:
117 | for idx, line in enumerate(f):
118 | if encoder_chr_level:
119 | words = list(line.decode('utf-8').strip())
120 | else:
121 | words = line.strip().split()
122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words)
123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x)
124 | x += [0]
125 | queue.put((idx, x))
126 | return idx+1
127 |
128 | def _finish_processes():
129 | for midx in xrange(n_process):
130 | queue.put(None)
131 |
132 | def _retrieve_jobs(n_samples):
133 | trans = [None] * n_samples
134 | for idx in xrange(n_samples):
135 | resp = rqueue.get()
136 | trans[resp[0]] = resp[1]
137 | if numpy.mod(idx, 10) == 0:
138 | print 'Sample ', (idx+1), '/', n_samples, ' Done'
139 | return trans
140 |
141 | print 'Translating ', source_file, '...'
142 | n_samples = _send_jobs(source_file)
143 | trans = _seqs2words(_retrieve_jobs(n_samples))
144 | _finish_processes()
145 | with open(saveto, 'w') as f:
146 | if decoder_bpe_to_tok:
147 | print >>f, '\n'.join(trans).replace('@@ ', '')
148 | else:
149 | print >>f, '\n'.join(trans)
150 | print 'Done'
151 |
152 |
153 | if __name__ == "__main__":
154 | parser = argparse.ArgumentParser()
155 | parser.add_argument('-k', type=int, default=5)
156 | parser.add_argument('-p', type=int, default=5)
157 | parser.add_argument('-n', action="store_true", default=False)
158 | parser.add_argument('-bpe', action="store_true", default=False)
159 | parser.add_argument('-enc_c', action="store_true", default=False)
160 | parser.add_argument('-dec_c', action="store_true", default=False)
161 | parser.add_argument('-utf8', action="store_true", default=False)
162 | parser.add_argument('model', type=str)
163 | parser.add_argument('dictionary', type=str)
164 | parser.add_argument('dictionary_target', type=str)
165 | parser.add_argument('source', type=str)
166 | parser.add_argument('saveto', type=str)
167 |
168 | args = parser.parse_args()
169 |
170 | main(args.model, args.dictionary, args.dictionary_target, args.source,
171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p,
172 | encoder_chr_level=args.enc_c,
173 | decoder_chr_level=args.dec_c,
174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe)
175 |
--------------------------------------------------------------------------------
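
Note: with the -n flag, _translate re-ranks the k beam hypotheses by length-normalized score before picking the best one, which removes the bias toward very short outputs. A standalone sketch of that re-ranking step (the helper name pick_best is illustrative only; scores are the cumulative negative log-probabilities returned by gen_sample):

    import numpy

    def pick_best(samples, scores, normalize=True):
        # divide each hypothesis score by its length, then take the
        # lowest (best) normalized score
        scores = numpy.array(scores, dtype='float64')
        if normalize:
            lengths = numpy.array([len(s) for s in samples])
            scores = scores / lengths
        return samples[numpy.argmin(scores)]
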
/character_biscale/translate_attc.py:
--------------------------------------------------------------------------------
1 | '''
2 | Translates a source file using a translation model.
3 | '''
4 | import argparse
5 |
6 | import numpy
7 | import cPickle as pkl
8 |
9 | from char_biscale_attc import (build_sampler, gen_sample, init_params)
10 | from mixer import *
11 |
12 | from multiprocessing import Process, Queue
13 |
14 |
15 | def translate_model(queue, rqueue, pid, model, options, k, normalize):
16 |
17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
18 | trng = RandomStreams(1234)
19 |
20 | # allocate model parameters
21 | params = init_params(options)
22 |
23 | # load model parameters and set theano shared variables
24 | params = load_params(model, params)
25 | tparams = init_tparams(params)
26 |
27 | # word index
28 | use_noise = theano.shared(numpy.float32(0.))
29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise)
30 |
31 | def _translate(seq):
32 | use_noise.set_value(0.)
33 | # sample given an input sequence and obtain scores
34 | sample, score = gen_sample(tparams, f_init, f_next,
35 | numpy.array(seq).reshape([len(seq), 1]),
36 | options, trng=trng, k=k, maxlen=500,
37 | stochastic=False, argmax=False)
38 |
39 | # normalize scores according to sequence lengths
40 | if normalize:
41 | lengths = numpy.array([len(s) for s in sample])
42 | score = score / lengths
43 | sidx = numpy.argmin(score)
44 | return sample[sidx]
45 |
46 | while True:
47 | req = queue.get()
48 | if req is None:
49 | break
50 |
51 | idx, x = req[0], req[1]
52 | print pid, '-', idx
53 | seq = _translate(x)
54 |
55 | rqueue.put((idx, seq))
56 |
57 | return
58 |
59 |
60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5,
61 | normalize=False, n_process=5, encoder_chr_level=False,
62 | decoder_chr_level=False, utf8=False):
63 |
64 | # load model model_options
65 | pkl_file = model.split('.')[0] + '.pkl'
66 | with open(pkl_file, 'rb') as f:
67 | options = pkl.load(f)
68 |
69 | # load source dictionary and invert
70 | with open(dictionary, 'rb') as f:
71 | word_dict = pkl.load(f)
72 | word_idict = dict()
73 | for kk, vv in word_dict.iteritems():
74 | word_idict[vv] = kk
75 | word_idict[0] = ''
76 | word_idict[1] = 'UNK'
77 |
78 | # load target dictionary and invert
79 | with open(dictionary_target, 'rb') as f:
80 | word_dict_trg = pkl.load(f)
81 | word_idict_trg = dict()
82 | for kk, vv in word_dict_trg.iteritems():
83 | word_idict_trg[vv] = kk
84 | word_idict_trg[0] = ''
85 | word_idict_trg[1] = 'UNK'
86 |
87 | # create input and output queues for processes
88 | queue = Queue()
89 | rqueue = Queue()
90 | processes = [None] * n_process
91 | for midx in xrange(n_process):
92 | processes[midx] = Process(
93 | target=translate_model,
94 | args=(queue, rqueue, midx, model, options, k, normalize))
95 | processes[midx].start()
96 |
97 | # utility function
98 | def _seqs2words(caps):
99 | capsw = []
100 | for cc in caps:
101 | ww = []
102 | for w in cc:
103 | if w == 0:
104 | break
105 | if utf8:
106 | ww.append(word_idict_trg[w].encode('utf-8'))
107 | else:
108 | ww.append(word_idict_trg[w])
109 | if decoder_chr_level:
110 | capsw.append(''.join(ww))
111 | else:
112 | capsw.append(' '.join(ww))
113 | return capsw
114 |
115 | def _send_jobs(fname):
116 | with open(fname, 'r') as f:
117 | for idx, line in enumerate(f):
118 | if encoder_chr_level:
119 | words = list(line.decode('utf-8').strip())
120 | else:
121 | words = line.strip().split()
122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words)
123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x)
124 | x += [0]
125 | #print '=============================='
126 | #print line
127 | #print '------------------------------'
128 | #print ' '.join([word_idict[wx] for wx in x])
129 | #print '=============================='
130 | queue.put((idx, x))
131 | return idx+1
132 |
133 | def _finish_processes():
134 | for midx in xrange(n_process):
135 | queue.put(None)
136 |
137 | def _retrieve_jobs(n_samples):
138 | trans = [None] * n_samples
139 | for idx in xrange(n_samples):
140 | resp = rqueue.get()
141 | trans[resp[0]] = resp[1]
142 | if numpy.mod(idx, 10) == 0:
143 | print 'Sample ', (idx+1), '/', n_samples, ' Done'
144 | return trans
145 |
146 | print 'Translating ', source_file, '...'
147 | n_samples = _send_jobs(source_file)
148 | trans = _seqs2words(_retrieve_jobs(n_samples))
149 | _finish_processes()
150 | with open(saveto, 'w') as f:
151 | print >>f, '\n'.join(trans)
152 | print 'Done'
153 |
154 |
155 | if __name__ == "__main__":
156 | parser = argparse.ArgumentParser()
157 | parser.add_argument('-k', type=int, default=5)
158 | parser.add_argument('-p', type=int, default=5)
159 | parser.add_argument('-n', action="store_true", default=False)
160 | parser.add_argument('-enc_c', action="store_true", default=False)
161 | parser.add_argument('-dec_c', action="store_true", default=False)
162 | parser.add_argument('-utf8', action="store_true", default=False)
163 | parser.add_argument('model', type=str)
164 | parser.add_argument('dictionary', type=str)
165 | parser.add_argument('dictionary_target', type=str)
166 | parser.add_argument('source', type=str)
167 | parser.add_argument('saveto', type=str)
168 |
169 | args = parser.parse_args()
170 |
171 | main(args.model, args.dictionary, args.dictionary_target, args.source,
172 | args.saveto, k=args.k, normalize=args.n, n_process=args.p,
173 | encoder_chr_level=args.enc_c,
174 | decoder_chr_level=args.dec_c,
175 | utf8=args.utf8)
176 |
--------------------------------------------------------------------------------
/character_biscale/translate_both.py:
--------------------------------------------------------------------------------
1 | '''
2 | Translates a source file using a translation model.
3 | '''
4 | import argparse
5 |
6 | import numpy
7 | import cPickle as pkl
8 |
9 | from char_biscale_both import (build_sampler, gen_sample, init_params)
10 | from mixer import *
11 |
12 | from multiprocessing import Process, Queue
13 |
14 |
15 | def translate_model(queue, rqueue, pid, model, options, k, normalize):
16 |
17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
18 | trng = RandomStreams(1234)
19 |
20 | # allocate model parameters
21 | params = init_params(options)
22 |
23 | # load model parameters and set theano shared variables
24 | params = load_params(model, params)
25 | tparams = init_tparams(params)
26 |
27 | # word index
28 | use_noise = theano.shared(numpy.float32(0.))
29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise)
30 |
31 | def _translate(seq):
32 | use_noise.set_value(0.)
33 | # sample given an input sequence and obtain scores
34 | sample, score = gen_sample(tparams, f_init, f_next,
35 | numpy.array(seq).reshape([len(seq), 1]),
36 | options, trng=trng, k=k, maxlen=500,
37 | stochastic=False, argmax=False)
38 |
39 | # normalize scores according to sequence lengths
40 | if normalize:
41 | lengths = numpy.array([len(s) for s in sample])
42 | score = score / lengths
43 | sidx = numpy.argmin(score)
44 | return sample[sidx]
45 |
46 | while True:
47 | req = queue.get()
48 | if req is None:
49 | break
50 |
51 | idx, x = req[0], req[1]
52 | print pid, '-', idx
53 | seq = _translate(x)
54 |
55 | rqueue.put((idx, seq))
56 |
57 | return
58 |
59 |
60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5,
61 | normalize=False, n_process=5, encoder_chr_level=False,
62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False):
63 |
64 | # load model model_options
65 | pkl_file = model.split('.')[0] + '.pkl'
66 | with open(pkl_file, 'rb') as f:
67 | options = pkl.load(f)
68 |
69 | # load source dictionary and invert
70 | with open(dictionary, 'rb') as f:
71 | word_dict = pkl.load(f)
72 | word_idict = dict()
73 | for kk, vv in word_dict.iteritems():
74 | word_idict[vv] = kk
75 | word_idict[0] = ''
76 | word_idict[1] = 'UNK'
77 |
78 | # load target dictionary and invert
79 | with open(dictionary_target, 'rb') as f:
80 | word_dict_trg = pkl.load(f)
81 | word_idict_trg = dict()
82 | for kk, vv in word_dict_trg.iteritems():
83 | word_idict_trg[vv] = kk
84 | word_idict_trg[0] = ''
85 | word_idict_trg[1] = 'UNK'
86 |
87 | # create input and output queues for processes
88 | queue = Queue()
89 | rqueue = Queue()
90 | processes = [None] * n_process
91 | for midx in xrange(n_process):
92 | processes[midx] = Process(
93 | target=translate_model,
94 | args=(queue, rqueue, midx, model, options, k, normalize))
95 | processes[midx].start()
96 |
97 | # utility function
98 | def _seqs2words(caps):
99 | capsw = []
100 | for cc in caps:
101 | ww = []
102 | for w in cc:
103 | if w == 0:
104 | break
105 | if utf8:
106 | ww.append(word_idict_trg[w].encode('utf-8'))
107 | else:
108 | ww.append(word_idict_trg[w])
109 | if decoder_chr_level:
110 | capsw.append(''.join(ww))
111 | else:
112 | capsw.append(' '.join(ww))
113 | return capsw
114 |
115 | def _send_jobs(fname):
116 | with open(fname, 'r') as f:
117 | for idx, line in enumerate(f):
118 | if encoder_chr_level:
119 | words = list(line.decode('utf-8').strip())
120 | else:
121 | words = line.strip().split()
122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words)
123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x)
124 | x += [0]
125 | queue.put((idx, x))
126 | return idx+1
127 |
128 | def _finish_processes():
129 | for midx in xrange(n_process):
130 | queue.put(None)
131 |
132 | def _retrieve_jobs(n_samples):
133 | trans = [None] * n_samples
134 | for idx in xrange(n_samples):
135 | resp = rqueue.get()
136 | trans[resp[0]] = resp[1]
137 | if numpy.mod(idx, 10) == 0:
138 | print 'Sample ', (idx+1), '/', n_samples, ' Done'
139 | return trans
140 |
141 | print 'Translating ', source_file, '...'
142 | n_samples = _send_jobs(source_file)
143 | trans = _seqs2words(_retrieve_jobs(n_samples))
144 | _finish_processes()
145 | with open(saveto, 'w') as f:
146 | if decoder_bpe_to_tok:
147 | print >>f, '\n'.join(trans).replace('@@ ', '')
148 | else:
149 | print >>f, '\n'.join(trans)
150 | print 'Done'
151 |
152 |
153 | if __name__ == "__main__":
154 | parser = argparse.ArgumentParser()
155 | parser.add_argument('-k', type=int, default=5)
156 | parser.add_argument('-p', type=int, default=5)
157 | parser.add_argument('-n', action="store_true", default=False)
158 | parser.add_argument('-bpe', action="store_true", default=False)
159 | parser.add_argument('-enc_c', action="store_true", default=False)
160 | parser.add_argument('-dec_c', action="store_true", default=False)
161 | parser.add_argument('-utf8', action="store_true", default=False)
162 | parser.add_argument('model', type=str)
163 | parser.add_argument('dictionary', type=str)
164 | parser.add_argument('dictionary_target', type=str)
165 | parser.add_argument('source', type=str)
166 | parser.add_argument('saveto', type=str)
167 |
168 | args = parser.parse_args()
169 |
170 | main(args.model, args.dictionary, args.dictionary_target, args.source,
171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p,
172 | encoder_chr_level=args.enc_c,
173 | decoder_chr_level=args.dec_c,
174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe)
175 |
--------------------------------------------------------------------------------
/character_biscale/wmt15_csen_bpe2char_adam.txt:
--------------------------------------------------------------------------------
1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/csen/bpe2char_seg_gru_decoder/0209/
2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/train/
3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/dev/
4 | max_epochs 1000000000000
5 | patience -1
6 | learning_rate 0.0001
7 | batch_size 128
8 | valid_batch_size 128
9 | enc_dim 512
10 | dec_dim 1024
11 | dim_word 512
12 | dim_word_src 512
13 | n_words 302
14 | n_words_src 21907
15 | optimizer adam
16 | decay_c 0
17 | use_dropout 0
18 | clip_c 1
19 | saveFreq 5000
20 | sampleFreq 5000
21 | dispFreq 1000
22 | validFreq 5000
23 | sort_size 20
24 | maxlen 50
25 | maxlen_trg 500
26 | maxlen_sample 500
27 | source_word_level 1
28 | target_word_level 0
29 | source_dataset all_cs-en.en.tok.bpe
30 | target_dataset all_cs-en.cs.tok
31 | valid_source_dataset newstest2013-src.en.tok.bpe
32 | valid_target_dataset newstest2013-ref.cs.tok
33 | source_dictionary all_cs-en.en.tok.bpe.word.pkl
34 | target_dictionary all_cs-en.cs.tok.300.pkl
35 |
--------------------------------------------------------------------------------
/character_biscale/wmt15_deen_bpe2char_adam.txt:
--------------------------------------------------------------------------------
1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/deen/bpe2char_seg_gru_decoder/0209/
2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/
3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/dev/
4 | max_epochs 1000000000000
5 | patience -1
6 | learning_rate 0.0001
7 | batch_size 128
8 | valid_batch_size 128
9 | enc_dim 512
10 | dec_dim 1024
11 | dim_word 512
12 | dim_word_src 512
13 | n_words 302
14 | n_words_src 24440
15 | optimizer adam
16 | decay_c 0
17 | use_dropout 0
18 | clip_c 1
19 | saveFreq 5000
20 | sampleFreq 5000
21 | dispFreq 1000
22 | validFreq 5000
23 | sort_size 20
24 | maxlen 50
25 | maxlen_trg 500
26 | maxlen_sample 500
27 | source_word_level 1
28 | target_word_level 0
29 | source_dataset all_de-en.en.tok.bpe.shuf
30 | target_dataset all_de-en.de.tok.shuf
31 | valid_source_dataset newstest2013.en.tok.bpe
32 | valid_target_dataset newstest2013.de.tok
33 | source_dictionary all_de-en.en.tok.bpe.word.pkl
34 | target_dictionary all_de-en.de.tok.300.pkl
35 |
--------------------------------------------------------------------------------
/character_biscale/wmt15_fien_bpe2char_adam.txt:
--------------------------------------------------------------------------------
1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/fien/bpe2char_seg_gru_decoder/0209/
2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/train/
3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/dev/
4 | max_epochs 1000000000000
5 | patience -1
6 | learning_rate 0.0001
7 | batch_size 128
8 | valid_batch_size 128
9 | enc_dim 512
10 | dec_dim 1024
11 | dim_word 512
12 | dim_word_src 512
13 | n_words 292
14 | n_words_src 20174
15 | optimizer adam
16 | decay_c 0
17 | use_dropout 0
18 | clip_c 1
19 | saveFreq 5000
20 | sampleFreq 5000
21 | dispFreq 1000
22 | validFreq 5000
23 | sort_size 20
24 | maxlen 50
25 | maxlen_trg 500
26 | maxlen_sample 500
27 | source_word_level 1
28 | target_word_level 0
29 | source_dataset all_fi-en.en.tok.bpe.shuf
30 | target_dataset all_fi-en.fi.tok.shuf
31 | valid_source_dataset newsdev2015-enfi-src.en.tok.bpe
32 | valid_target_dataset newsdev2015-enfi-ref.fi.tok
33 | source_dictionary all_fi-en.en.tok.bpe.word.pkl
34 | target_dictionary all_fi-en.fi.tok.300.pkl
35 |
--------------------------------------------------------------------------------
/character_biscale/wmt15_ruen_bpe2char_adam.txt:
--------------------------------------------------------------------------------
1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/ruen/bpe2char_seg_gru_decoder/0209/
2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/train/
3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/dev/
4 | max_epochs 1000000000000
5 | patience -1
6 | learning_rate 0.0001
7 | batch_size 128
8 | valid_batch_size 128
9 | enc_dim 512
10 | dec_dim 1024
11 | dim_word 512
12 | dim_word_src 512
13 | n_words 302
14 | n_words_src 22030
15 | optimizer adam
16 | decay_c 0
17 | use_dropout 0
18 | clip_c 1
19 | saveFreq 5000
20 | sampleFreq 5000
21 | dispFreq 1000
22 | validFreq 5000
23 | sort_size 20
24 | maxlen 50
25 | maxlen_trg 500
26 | maxlen_sample 500
27 | source_word_level 1
28 | target_word_level 0
29 | source_dataset all_ru-en.en.tok.bpe
30 | target_dataset all_ru-en.ru.tok
31 | valid_source_dataset newstest2013-src.en.tok.bpe
32 | valid_target_dataset newstest2013-ref.ru.tok
33 | source_dictionary all_ru-en.en.tok.bpe.word.pkl
34 | target_dictionary all_ru-en.ru.tok.300.pkl
35 |
--------------------------------------------------------------------------------
/data_iterator.py:
--------------------------------------------------------------------------------
1 | import nltk
2 | import numpy
3 | import os
4 | import random
5 |
6 | import cPickle
7 | import gzip
8 | import codecs
9 |
10 | from tempfile import mkstemp
11 |
12 |
13 | def fopen(filename, mode='r'):
14 | if filename.endswith('.gz'):
15 | return gzip.open(filename, mode)
16 | return open(filename, mode)
17 |
18 |
19 | class TextIterator:
20 | """Simple Bitext iterator."""
21 | def __init__(self,
22 | source, source_dict,
23 | target=None, target_dict=None,
24 | source_word_level=0,
25 | target_word_level=0,
26 | batch_size=128,
27 | job_id=0,
28 | sort_size=20,
29 | n_words_source=-1,
30 | n_words_target=-1,
31 | shuffle_per_epoch=False):
32 | self.source_file = source
33 | self.target_file = target
34 | self.source = fopen(source, 'r')
35 | with open(source_dict, 'rb') as f:
36 | self.source_dict = cPickle.load(f)
37 | if target is not None:
38 | self.target = fopen(target, 'r')
39 | if target_dict is not None:
40 | with open(target_dict, 'rb') as f:
41 | self.target_dict = cPickle.load(f)
42 | else:
43 | self.target = None
44 |
45 | self.source_word_level = source_word_level
46 | self.target_word_level = target_word_level
47 | self.batch_size = batch_size
48 |
49 | self.n_words_source = n_words_source
50 | self.n_words_target = n_words_target
51 | self.shuffle_per_epoch = shuffle_per_epoch
52 |
53 | self.source_buffer = []
54 | self.target_buffer = []
55 | self.k = batch_size * sort_size
56 |
57 | self.end_of_data = False
58 | self.job_id = job_id
59 |
60 | def __iter__(self):
61 | return self
62 |
63 | def reset(self):
64 | if self.shuffle_per_epoch:
65 | # close current files
66 | self.source.close()
67 | if self.target is None:
68 | self.shuffle([self.source_file])
69 | self.source = fopen(self.source_file + '.reshuf_%d' % self.job_id, 'r')
70 | else:
71 | self.target.close()
72 | # shuffle *original* source files,
73 | self.shuffle([self.source_file, self.target_file])
74 | # open newly 're-shuffled' file as input
75 | self.source = fopen(self.source_file + '.reshuf_%d' % self.job_id, 'r')
76 | self.target = fopen(self.target_file + '.reshuf_%d' % self.job_id, 'r')
77 | else:
78 | self.source.seek(0)
79 | if self.target is not None:
80 | self.target.seek(0)
81 |
82 | @staticmethod
83 | def shuffle(files):
84 | tf_os, tpath = mkstemp()
85 | tf = open(tpath, 'w')
86 | fds = [open(ff) for ff in files]
87 | for l in fds[0]:
88 | lines = [l.strip()] + [ff.readline().strip() for ff in fds[1:]]
89 | print >>tf, "|||".join(lines)
90 | [ff.close() for ff in fds]
91 | tf.close()
92 | tf = open(tpath, 'r')
93 | lines = tf.readlines()
94 | random.shuffle(lines)
95 | fds = [open(ff+'.reshuf','w') for ff in files]
96 | for l in lines:
97 | s = l.strip().split('|||')
98 | for ii, fd in enumerate(fds):
99 | print >>fd, s[ii]
100 | [ff.close() for ff in fds]
101 | os.remove(tpath)
102 | return
103 |
104 | def next(self):
105 | if self.end_of_data:
106 | self.end_of_data = False
107 | self.reset()
108 | raise StopIteration
109 |
110 | source = []
111 | target = []
112 |
113 | # fill buffer, if it's empty
114 | if self.target is not None:
115 | assert len(self.source_buffer) == len(self.target_buffer), 'Buffer size mismatch!'
116 |
117 | if len(self.source_buffer) == 0:
118 | for k_ in xrange(self.k):
119 | ss = self.source.readline()
120 |
121 | if ss == "":
122 | break
123 |
124 | if self.source_word_level:
125 | ss = ss.strip().split()
126 | else:
127 | ss = ss.strip()
128 | ss = list(ss.decode('utf8'))
129 |
130 | self.source_buffer.append(ss)
131 |
132 | if self.target is not None:
133 | tt = self.target.readline()
134 |
135 | if tt == "":
136 | break
137 |
138 | if self.target_word_level:
139 | tt = tt.strip().split()
140 | else:
141 | tt = tt.strip()
142 | tt = list(tt.decode('utf8'))
143 |
144 | self.target_buffer.append(tt)
145 |
146 | if self.target is not None:
147 | # sort by target buffer
148 | tlen = numpy.array([len(t) for t in self.target_buffer])
149 | tidx = tlen.argsort()
150 | _sbuf = [self.source_buffer[i] for i in tidx]
151 | _tbuf = [self.target_buffer[i] for i in tidx]
152 | self.target_buffer = _tbuf
153 | else:
154 | slen = numpy.array([len(s) for s in self.source_buffer])
155 | sidx = slen.argsort()
156 | _sbuf = [self.source_buffer[i] for i in sidx]
157 |
158 | self.source_buffer = _sbuf
159 |
160 | if self.target is not None:
161 | if len(self.source_buffer) == 0 or len(self.target_buffer) == 0:
162 | self.end_of_data = False
163 | self.reset()
164 | raise StopIteration
165 | elif len(self.source_buffer) == 0:
166 | self.end_of_data = False
167 | self.reset()
168 | raise StopIteration
169 |
170 | try:
171 | # actual work here
172 | while True:
173 | # read from source file and map to word index
174 | try:
175 | ss_ = self.source_buffer.pop()
176 | except IndexError:
177 | break
178 | ss = [self.source_dict[w] if w in self.source_dict else 1 for w in ss_]
179 | if self.n_words_source > 0:
180 | ss = [w if w < self.n_words_source else 1 for w in ss]
181 | source.append(ss)
182 | if self.target is not None:
183 | # read from target file and map to word index
184 | tt_ = self.target_buffer.pop()
185 | tt = [self.target_dict[w] if w in self.target_dict else 1 for w in tt_]
186 | if self.n_words_target > 0:
187 | tt = [w if w < self.n_words_target else 1 for w in tt]
188 | target.append(tt)
189 |
190 | if len(source) >= self.batch_size:
191 | break
192 | except IOError:
193 | self.end_of_data = True
194 |
195 | if self.target is not None:
196 | if len(source) <= 0 or len(target) <= 0:
197 | self.end_of_data = False
198 | self.reset()
199 | raise StopIteration
200 | return source, target
201 | else:
202 | if len(source) <= 0:
203 | self.end_of_data = False
204 | self.reset()
205 | raise StopIteration
206 | return source
207 |
--------------------------------------------------------------------------------
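
Note: TextIterator is the bitext reader used by nmt.train; it buffers batch_size * sort_size examples, sorts the buffer by target length, and yields (source, target) batches of index sequences with out-of-vocabulary tokens mapped to 1 (UNK). A minimal usage sketch with the bpe2char settings from the configs above (file names are placeholders taken from those configs, illustrative only):

    from data_iterator import TextIterator

    train = TextIterator(
        source='all_cs-en.en.tok.bpe',               # BPE source side
        source_dict='all_cs-en.en.tok.bpe.word.pkl',
        target='all_cs-en.cs.tok',                   # character-level target side
        target_dict='all_cs-en.cs.tok.300.pkl',
        source_word_level=1, target_word_level=0,
        batch_size=128, sort_size=20)

    for source_batch, target_batch in train:
        pass  # lists of index sequences, ready for minibatch preparation
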
/preprocess/build_dictionary_char.py:
--------------------------------------------------------------------------------
1 | import cPickle as pkl
2 | import fileinput
3 | import numpy
4 | import sys
5 | import codecs
6 |
7 | from collections import OrderedDict
8 |
9 |
10 | short_list = 300
11 |
12 | def main():
13 | for filename in sys.argv[1:]:
14 | print 'Processing', filename
15 | word_freqs = OrderedDict()
16 |
17 | with open(filename, 'r') as f:
18 | for line in f:
19 | words_in = line.strip()
20 | words_in = list(words_in.decode('utf8'))
21 | for w in words_in:
22 | if w not in word_freqs:
23 | word_freqs[w] = 0
24 | word_freqs[w] += 1
25 |
26 | words = word_freqs.keys()
27 | freqs = word_freqs.values()
28 |
29 | sorted_idx = numpy.argsort(freqs)
30 | sorted_words = [words[ii] for ii in sorted_idx[::-1]]
31 |
32 | worddict = OrderedDict()
33 | worddict['eos'] = 0
34 | worddict['UNK'] = 1
35 |
36 | if short_list is not None:
37 | for ii in xrange(min(short_list, len(sorted_words))):
38 | worddict[sorted_words[ii]] = ii + 2
39 | else:
40 | for ii, ww in enumerate(sorted_words):
41 | worddict[ww] = ii + 2
42 |
43 | with open('%s.%d.pkl' % (filename, short_list), 'wb') as f:
44 | pkl.dump(worddict, f)
45 |
46 | f.close()
47 | print 'Done'
48 | print len(worddict)
49 |
50 | if __name__ == '__main__':
51 | main()
52 |
--------------------------------------------------------------------------------
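
Note: build_dictionary_char.py writes <corpus>.<short_list>.pkl (e.g. all_cs-en.cs.tok.300.pkl, matching the target_dictionary entries in the configs) with 'eos' at index 0, 'UNK' at index 1, and the short_list most frequent characters after that. A minimal sketch of loading and inverting such a dictionary for decoding, in the same way the translate scripts do (the path is a placeholder, illustrative only):

    import cPickle as pkl

    with open('all_cs-en.cs.tok.300.pkl', 'rb') as f:
        char_dict = pkl.load(f)

    # invert index -> character; index 0 terminates a hypothesis,
    # index 1 is the UNK symbol
    char_idict = dict((v, k) for k, v in char_dict.iteritems())
    char_idict[0] = ''
    char_idict[1] = 'UNK'
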
/preprocess/build_dictionary_word.py:
--------------------------------------------------------------------------------
1 | import cPickle as pkl
2 | import fileinput
3 | import numpy
4 | import sys
5 | import codecs
6 |
7 | from collections import OrderedDict
8 |
9 |
10 | def main():
11 | for filename in sys.argv[1:]:
12 | print 'Processing', filename
13 | word_freqs = OrderedDict()
14 |
15 | with open(filename, 'r') as f:
16 | for line in f:
17 | words_in = line.strip().split(' ')
18 | for w in words_in:
19 | if w not in word_freqs:
20 | word_freqs[w] = 0
21 | word_freqs[w] += 1
22 |
23 | words = word_freqs.keys()
24 | freqs = word_freqs.values()
25 |
26 | sorted_idx = numpy.argsort(freqs)
27 | sorted_words = [words[ii] for ii in sorted_idx[::-1]]
28 |
29 | worddict = OrderedDict()
30 | worddict['eos'] = 0
31 | worddict['UNK'] = 1
32 |
33 | for ii, ww in enumerate(sorted_words):
34 | worddict[ww] = ii + 2
35 |
36 | with open('%s.word.pkl' % filename, 'wb') as f:
37 | pkl.dump(worddict, f)
38 |
39 | f.close()
40 | print 'Done'
41 | print len(worddict)
42 |
43 | if __name__ == '__main__':
44 | main()
45 |
--------------------------------------------------------------------------------
/preprocess/clean_tags.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import re
3 |
4 | from_file = sys.argv[1]
5 | to_file = sys.argv[2]
6 | to_file_out = open(to_file, "w")
7 |
8 | regex = "<.*>"
9 |
10 | tag_match = re.compile(regex)
11 | matched_lines = []
12 |
13 | with open(from_file) as from_file:
14 | content = from_file.readlines()
15 | for line in content:
16 | if (tag_match.match(line)):
17 | pass
18 | else:
19 | matched_lines.append(line)
20 |
21 | matched_lines = "".join(matched_lines)
22 | to_file_out.write(matched_lines)
23 | to_file_out.close()
24 |
25 |
--------------------------------------------------------------------------------
/preprocess/fix_appo.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # {1} is the directory name
3 |
4 |
5 | for f in ${1}/*.xml
6 | do
7 | cat $f | grep "" | sed "s/’/'/g" | sed "s/“/\"/g" | sed "s/”/\"/g" > ${f}.fixed
8 | done
9 |
10 |
--------------------------------------------------------------------------------
/preprocess/merge.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 |
4 | SRC=$1
5 | TRG=$2
6 |
7 | FSRC=all_${1}-${2}.${1}
8 | FTRG=all_${1}-${2}.${2}
9 |
10 | echo "" > $FSRC
11 | for F in *${1}-${2}.${1}
12 | do
13 | if [ "$F" = "$FSRC" ]; then
14 | echo "pass"
15 | else
16 | cat $F >> $FSRC
17 | fi
18 | done
19 |
20 |
21 | echo "" > $FTRG
22 | for F in *${1}-${2}.${2}
23 | do
24 | if [ "$F" = "$FTRG" ]; then
25 | echo "pass"
26 | else
27 | cat $F >> $FTRG
28 | fi
29 | done
30 |
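Note: merge.sh concatenates every file matching `*<src>-<trg>.<src>` in the current directory into `all_<src>-<trg>.<src>`, and likewise for the target side, skipping the output file itself. For example (language codes chosen only for illustration), `./merge.sh de en` produces `all_de-en.de` and `all_de-en.en`; each output begins with the blank line written by `echo ""`.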
--------------------------------------------------------------------------------
/preprocess/multi-bleu.perl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env perl
2 | #
3 | # This file is part of moses. Its use is licensed under the GNU Lesser General
4 | # Public License version 2.1 or, at your option, any later version.
5 |
6 | # $Id$
7 | use warnings;
8 | use strict;
9 |
10 | my $lowercase = 0;
11 | if ($ARGV[0] eq "-lc") {
12 | $lowercase = 1;
13 | shift;
14 | }
15 |
16 | my $stem = $ARGV[0];
17 | if (!defined $stem) {
18 | print STDERR "usage: multi-bleu.pl [-lc] reference < hypothesis\n";
19 | print STDERR "Reads the references from reference or reference0, reference1, ...\n";
20 | exit(1);
21 | }
22 |
23 | $stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0";
24 |
25 | my @REF;
26 | my $ref=0;
27 | while(-e "$stem$ref") {
28 | &add_to_ref("$stem$ref",\@REF);
29 | $ref++;
30 | }
31 | &add_to_ref($stem,\@REF) if -e $stem;
32 | die("ERROR: could not find reference file $stem") unless scalar @REF;
33 |
34 | sub add_to_ref {
35 | my ($file,$REF) = @_;
36 | my $s=0;
37 | open(REF,$file) or die "Can't read $file";
38 | while(<REF>) {
39 | chop;
40 | push @{$$REF[$s++]}, $_;
41 | }
42 | close(REF);
43 | }
44 |
45 | my(@CORRECT,@TOTAL,$length_translation,$length_reference);
46 | my $s=0;
47 | while(<STDIN>) {
48 | chop;
49 | $_ = lc if $lowercase;
50 | my @WORD = split;
51 | my %REF_NGRAM = ();
52 | my $length_translation_this_sentence = scalar(@WORD);
53 | my ($closest_diff,$closest_length) = (9999,9999);
54 | foreach my $reference (@{$REF[$s]}) {
55 | # print "$s $_ <=> $reference\n";
56 | $reference = lc($reference) if $lowercase;
57 | my @WORD = split(' ',$reference);
58 | my $length = scalar(@WORD);
59 | my $diff = abs($length_translation_this_sentence-$length);
60 | if ($diff < $closest_diff) {
61 | $closest_diff = $diff;
62 | $closest_length = $length;
63 | # print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." = abs($length_translation_this_sentence-$length), setting len: $closest_length\n";
64 | } elsif ($diff == $closest_diff) {
65 | $closest_length = $length if $length < $closest_length;
66 | # from two references with the same closeness to me
67 | # take the *shorter* into account, not the "first" one.
68 | }
69 | for(my $n=1;$n<=4;$n++) {
70 | my %REF_NGRAM_N = ();
71 | for(my $start=0;$start<=$#WORD-($n-1);$start++) {
72 | my $ngram = "$n";
73 | for(my $w=0;$w<$n;$w++) {
74 | $ngram .= " ".$WORD[$start+$w];
75 | }
76 | $REF_NGRAM_N{$ngram}++;
77 | }
78 | foreach my $ngram (keys %REF_NGRAM_N) {
79 | if (!defined($REF_NGRAM{$ngram}) ||
80 | $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) {
81 | $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram};
82 | # print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram}<BR>\n";
83 | }
84 | }
85 | }
86 | }
87 | $length_translation += $length_translation_this_sentence;
88 | $length_reference += $closest_length;
89 | for(my $n=1;$n<=4;$n++) {
90 | my %T_NGRAM = ();
91 | for(my $start=0;$start<=$#WORD-($n-1);$start++) {
92 | my $ngram = "$n";
93 | for(my $w=0;$w<$n;$w++) {
94 | $ngram .= " ".$WORD[$start+$w];
95 | }
96 | $T_NGRAM{$ngram}++;
97 | }
98 | foreach my $ngram (keys %T_NGRAM) {
99 | $ngram =~ /^(\d+) /;
100 | my $n = $1;
101 | # my $corr = 0;
102 | # print "$i e $ngram $T_NGRAM{$ngram}<BR>\n";
103 | $TOTAL[$n] += $T_NGRAM{$ngram};
104 | if (defined($REF_NGRAM{$ngram})) {
105 | if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) {
106 | $CORRECT[$n] += $T_NGRAM{$ngram};
107 | # $corr = $T_NGRAM{$ngram};
108 | # print "$i e correct1 $T_NGRAM{$ngram}<BR>\n";
109 | }
110 | else {
111 | $CORRECT[$n] += $REF_NGRAM{$ngram};
112 | # $corr = $REF_NGRAM{$ngram};
113 | # print "$i e correct2 $REF_NGRAM{$ngram}<BR>\n";
114 | }
115 | }
116 | # $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram};
117 | # print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n"
118 | }
119 | }
120 | $s++;
121 | }
122 | my $brevity_penalty = 1;
123 | my $bleu = 0;
124 |
125 | my @bleu=();
126 |
127 | for(my $n=1;$n<=4;$n++) {
128 | if (defined ($TOTAL[$n]) && defined ($CORRECT[$n]) && $TOTAL[$n] > 0){
129 | $bleu[$n]=($TOTAL[$n]>0)?$CORRECT[$n]/$TOTAL[$n]:0;
130 | # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n";
131 | }else{
132 | $bleu[$n]=0;
133 | }
134 | }
135 |
136 | if ($length_reference==0){
137 | printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n";
138 | exit(1);
139 | }
140 |
141 | if ($length_translation<$length_reference) {
142 | $brevity_penalty = exp(1-$length_reference/$length_translation);
143 | }
144 | $bleu = $brevity_penalty * exp((my_log( $bleu[1] ) +
145 | my_log( $bleu[2] ) +
146 | my_log( $bleu[3] ) +
147 | my_log( $bleu[4] ) ) / 4) ;
148 | printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n",
149 | 100*$bleu,
150 | 100*$bleu[1],
151 | 100*$bleu[2],
152 | 100*$bleu[3],
153 | 100*$bleu[4],
154 | $brevity_penalty,
155 | $length_translation / $length_reference,
156 | $length_translation,
157 | $length_reference;
158 |
159 | sub my_log {
160 | return -9999999999 unless $_[0];
161 | return log($_[0]);
162 | }
163 |
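Note: as its usage message says, the script is invoked as `perl multi-bleu.perl [-lc] reference < hypothesis`, where `reference` may also stand for a family of files `reference0`, `reference1`, ... It reports the standard corpus-level BLEU-4. Writing p_n = CORRECT[n]/TOTAL[n] for the clipped n-gram precisions, c for the total hypothesis length and r for the closest-match reference length, the score printed on the last line is (in LaTeX notation)

\mathrm{BLEU} \;=\; \mathrm{BP}\cdot\exp\!\Big(\tfrac{1}{4}\sum_{n=1}^{4}\log p_n\Big),
\qquad
\mathrm{BP} \;=\;
\begin{cases}
1 & \text{if } c \ge r,\\
\exp(1 - r/c) & \text{if } c < r.
\end{cases}

Since `my_log` returns a very large negative number when a precision is zero, a missing n-gram order drives the score towards 0 instead of raising an error.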
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/README.txt:
--------------------------------------------------------------------------------
1 | The language suffix can be found here:
2 |
3 | http://www.loc.gov/standards/iso639-2/php/code_list.php
4 |
5 | This code includes data from Daniel Naber's Language Tools (czech abbreviations).
6 | This code includes data from czech wiktionary (also czech abbreviations).
7 |
8 |
9 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.ca:
--------------------------------------------------------------------------------
1 | Dr
2 | Dra
3 | pàg
4 | p
5 | c
6 | av
7 | Sr
8 | Sra
9 | adm
10 | esq
11 | Prof
12 | S.A
13 | S.L
14 | p.e
15 | ptes
16 | Sta
17 | St
18 | pl
19 | màx
20 | cast
21 | dir
22 | nre
23 | fra
24 | admdora
25 | Emm
26 | Excma
27 | espf
28 | dc
29 | admdor
30 | tel
31 | angl
32 | aprox
33 | ca
34 | dept
35 | dj
36 | dl
37 | dt
38 | ds
39 | dg
40 | dv
41 | ed
42 | entl
43 | al
44 | i.e
45 | maj
46 | smin
47 | n
48 | núm
49 | pta
50 | A
51 | B
52 | C
53 | D
54 | E
55 | F
56 | G
57 | H
58 | I
59 | J
60 | K
61 | L
62 | M
63 | N
64 | O
65 | P
66 | Q
67 | R
68 | S
69 | T
70 | U
71 | V
72 | W
73 | X
74 | Y
75 | Z
76 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.cs:
--------------------------------------------------------------------------------
1 | Bc
2 | BcA
3 | Ing
4 | Ing.arch
5 | MUDr
6 | MVDr
7 | MgA
8 | Mgr
9 | JUDr
10 | PhDr
11 | RNDr
12 | PharmDr
13 | ThLic
14 | ThDr
15 | Ph.D
16 | Th.D
17 | prof
18 | doc
19 | CSc
20 | DrSc
21 | dr. h. c
22 | PaedDr
23 | Dr
24 | PhMr
25 | DiS
26 | abt
27 | ad
28 | a.i
29 | aj
30 | angl
31 | anon
32 | apod
33 | atd
34 | atp
35 | aut
36 | bd
37 | biogr
38 | b.m
39 | b.p
40 | b.r
41 | cca
42 | cit
43 | cizojaz
44 | c.k
45 | col
46 | čes
47 | čín
48 | čj
49 | ed
50 | facs
51 | fasc
52 | fol
53 | fot
54 | franc
55 | h.c
56 | hist
57 | hl
58 | hrsg
59 | ibid
60 | il
61 | ind
62 | inv.č
63 | jap
64 | jhdt
65 | jv
66 | koed
67 | kol
68 | korej
69 | kl
70 | krit
71 | lat
72 | lit
73 | m.a
74 | maď
75 | mj
76 | mp
77 | násl
78 | např
79 | nepubl
80 | něm
81 | no
82 | nr
83 | n.s
84 | okr
85 | odd
86 | odp
87 | obr
88 | opr
89 | orig
90 | phil
91 | pl
92 | pokrač
93 | pol
94 | port
95 | pozn
96 | př.kr
97 | př.n.l
98 | přel
99 | přeprac
100 | příl
101 | pseud
102 | pt
103 | red
104 | repr
105 | resp
106 | revid
107 | rkp
108 | roč
109 | roz
110 | rozš
111 | samost
112 | sect
113 | sest
114 | seš
115 | sign
116 | sl
117 | srv
118 | stol
119 | sv
120 | šk
121 | šk.ro
122 | špan
123 | tab
124 | t.č
125 | tis
126 | tj
127 | tř
128 | tzv
129 | univ
130 | uspoř
131 | vol
132 | vl.jm
133 | vs
134 | vyd
135 | vyobr
136 | zal
137 | zejm
138 | zkr
139 | zprac
140 | zvl
141 | n.p
142 | např
143 | než
144 | MUDr
145 | abl
146 | absol
147 | adj
148 | adv
149 | ak
150 | ak. sl
151 | akt
152 | alch
153 | amer
154 | anat
155 | angl
156 | anglosas
157 | arab
158 | arch
159 | archit
160 | arg
161 | astr
162 | astrol
163 | att
164 | bás
165 | belg
166 | bibl
167 | biol
168 | boh
169 | bot
170 | bulh
171 | círk
172 | csl
173 | č
174 | čas
175 | čes
176 | dat
177 | děj
178 | dep
179 | dět
180 | dial
181 | dór
182 | dopr
183 | dosl
184 | ekon
185 | epic
186 | etnonym
187 | eufem
188 | f
189 | fam
190 | fem
191 | fil
192 | film
193 | form
194 | fot
195 | fr
196 | fut
197 | fyz
198 | gen
199 | geogr
200 | geol
201 | geom
202 | germ
203 | gram
204 | hebr
205 | herald
206 | hist
207 | hl
208 | hovor
209 | hud
210 | hut
211 | chcsl
212 | chem
213 | ie
214 | imp
215 | impf
216 | ind
217 | indoevr
218 | inf
219 | instr
220 | interj
221 | ión
222 | iron
223 | it
224 | kanad
225 | katalán
226 | klas
227 | kniž
228 | komp
229 | konj
230 |
231 | konkr
232 | kř
233 | kuch
234 | lat
235 | lék
236 | les
237 | lid
238 | lit
239 | liturg
240 | lok
241 | log
242 | m
243 | mat
244 | meteor
245 | metr
246 | mod
247 | ms
248 | mysl
249 | n
250 | náb
251 | námoř
252 | neklas
253 | něm
254 | nesklon
255 | nom
256 | ob
257 | obch
258 | obyč
259 | ojed
260 | opt
261 | part
262 | pas
263 | pejor
264 | pers
265 | pf
266 | pl
267 | plpf
268 |
269 | práv
270 | prep
271 | předl
272 | přivl
273 | r
274 | rcsl
275 | refl
276 | reg
277 | rkp
278 | ř
279 | řec
280 | s
281 | samohl
282 | sg
283 | sl
284 | souhl
285 | spec
286 | srov
287 | stfr
288 | střv
289 | stsl
290 | subj
291 | subst
292 | superl
293 | sv
294 | sz
295 | táz
296 | tech
297 | telev
298 | teol
299 | trans
300 | typogr
301 | var
302 | vedl
303 | verb
304 | vl. jm
305 | voj
306 | vok
307 | vůb
308 | vulg
309 | výtv
310 | vztaž
311 | zahr
312 | zájm
313 | zast
314 | zejm
315 |
316 | zeměd
317 | zkr
318 | zř
319 | mj
320 | dl
321 | atp
322 | sport
323 | Mgr
324 | horn
325 | MVDr
326 | JUDr
327 | RSDr
328 | Bc
329 | PhDr
330 | ThDr
331 | Ing
332 | aj
333 | apod
334 | PharmDr
335 | pomn
336 | ev
337 | slang
338 | nprap
339 | odp
340 | dop
341 | pol
342 | st
343 | stol
344 | p. n. l
345 | před n. l
346 | n. l
347 | př. Kr
348 | po Kr
349 | př. n. l
350 | odd
351 | RNDr
352 | tzv
353 | atd
354 | tzn
355 | resp
356 | tj
357 | p
358 | br
359 | č. j
360 | čj
361 | č. p
362 | čp
363 | a. s
364 | s. r. o
365 | spol. s r. o
366 | p. o
367 | s. p
368 | v. o. s
369 | k. s
370 | o. p. s
371 | o. s
372 | v. r
373 | v z
374 | ml
375 | vč
376 | kr
377 | mld
378 | hod
379 | popř
380 | ap
381 | event
382 | rus
383 | slov
384 | rum
385 | švýc
386 | P. T
387 | zvl
388 | hor
389 | dol
390 | S.O.S
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.de:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
5 | #usually upper case letters are initials in a name
6 | #no german words end in single lower-case letters, so we throw those in too.
7 | A
8 | B
9 | C
10 | D
11 | E
12 | F
13 | G
14 | H
15 | I
16 | J
17 | K
18 | L
19 | M
20 | N
21 | O
22 | P
23 | Q
24 | R
25 | S
26 | T
27 | U
28 | V
29 | W
30 | X
31 | Y
32 | Z
33 | a
34 | b
35 | c
36 | d
37 | e
38 | f
39 | g
40 | h
41 | i
42 | j
43 | k
44 | l
45 | m
46 | n
47 | o
48 | p
49 | q
50 | r
51 | s
52 | t
53 | u
54 | v
55 | w
56 | x
57 | y
58 | z
59 |
60 |
61 | #Roman Numerals. A dot after one of these is not a sentence break in German.
62 | I
63 | II
64 | III
65 | IV
66 | V
67 | VI
68 | VII
69 | VIII
70 | IX
71 | X
72 | XI
73 | XII
74 | XIII
75 | XIV
76 | XV
77 | XVI
78 | XVII
79 | XVIII
80 | XIX
81 | XX
82 | i
83 | ii
84 | iii
85 | iv
86 | v
87 | vi
88 | vii
89 | viii
90 | ix
91 | x
92 | xi
93 | xii
94 | xiii
95 | xiv
96 | xv
97 | xvi
98 | xvii
99 | xviii
100 | xix
101 | xx
102 |
103 | #Titles and Honorifics
104 | Adj
105 | Adm
106 | Adv
107 | Asst
108 | Bart
109 | Bldg
110 | Brig
111 | Bros
112 | Capt
113 | Cmdr
114 | Col
115 | Comdr
116 | Con
117 | Corp
118 | Cpl
119 | DR
120 | Dr
121 | Ens
122 | Gen
123 | Gov
124 | Hon
125 | Hosp
126 | Insp
127 | Lt
128 | MM
129 | MR
130 | MRS
131 | MS
132 | Maj
133 | Messrs
134 | Mlle
135 | Mme
136 | Mr
137 | Mrs
138 | Ms
139 | Msgr
140 | Op
141 | Ord
142 | Pfc
143 | Ph
144 | Prof
145 | Pvt
146 | Rep
147 | Reps
148 | Res
149 | Rev
150 | Rt
151 | Sen
152 | Sens
153 | Sfc
154 | Sgt
155 | Sr
156 | St
157 | Supt
158 | Surg
159 |
160 | #Misc symbols
161 | Mio
162 | Mrd
163 | bzw
164 | v
165 | vs
166 | usw
167 | d.h
168 | z.B
169 | u.a
170 | etc
171 | Mrd
172 | MwSt
173 | ggf
174 | d.J
175 | D.h
176 | m.E
177 | vgl
178 | I.F
179 | z.T
180 | sogen
181 | ff
182 | u.E
183 | g.U
184 | g.g.A
185 | c.-à-d
186 | Buchst
187 | u.s.w
188 | sog
189 | u.ä
190 | Std
191 | evtl
192 | Zt
193 | Chr
194 | u.U
195 | o.ä
196 | Ltd
197 | b.A
198 | z.Zt
199 | spp
200 | sen
201 | SA
202 | k.o
203 | jun
204 | i.H.v
205 | dgl
206 | dergl
207 | Co
208 | zzt
209 | usf
210 | s.p.a
211 | Dkr
212 | Corp
213 | bzgl
214 | BSE
215 |
216 | #Number indicators
217 | # add #NUMERIC_ONLY# after the word if it should ONLY be non-breaking when a 0-9 digit follows it
218 | No
219 | Nos
220 | Art
221 | Nr
222 | pp
223 | ca
224 | Ca
225 |
226 | #Ordinals are done with . in German - "1." = "1st" in English
227 | 1
228 | 2
229 | 3
230 | 4
231 | 5
232 | 6
233 | 7
234 | 8
235 | 9
236 | 10
237 | 11
238 | 12
239 | 13
240 | 14
241 | 15
242 | 16
243 | 17
244 | 18
245 | 19
246 | 20
247 | 21
248 | 22
249 | 23
250 | 24
251 | 25
252 | 26
253 | 27
254 | 28
255 | 29
256 | 30
257 | 31
258 | 32
259 | 33
260 | 34
261 | 35
262 | 36
263 | 37
264 | 38
265 | 39
266 | 40
267 | 41
268 | 42
269 | 43
270 | 44
271 | 45
272 | 46
273 | 47
274 | 48
275 | 49
276 | 50
277 | 51
278 | 52
279 | 53
280 | 54
281 | 55
282 | 56
283 | 57
284 | 58
285 | 59
286 | 60
287 | 61
288 | 62
289 | 63
290 | 64
291 | 65
292 | 66
293 | 67
294 | 68
295 | 69
296 | 70
297 | 71
298 | 72
299 | 73
300 | 74
301 | 75
302 | 76
303 | 77
304 | 78
305 | 79
306 | 80
307 | 81
308 | 82
309 | 83
310 | 84
311 | 85
312 | 86
313 | 87
314 | 88
315 | 89
316 | 90
317 | 91
318 | 92
319 | 93
320 | 94
321 | 95
322 | 96
323 | 97
324 | 98
325 | 99
326 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.en:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
5 | #usually upper case letters are initials in a name
6 | A
7 | B
8 | C
9 | D
10 | E
11 | F
12 | G
13 | H
14 | I
15 | J
16 | K
17 | L
18 | M
19 | N
20 | O
21 | P
22 | Q
23 | R
24 | S
25 | T
26 | U
27 | V
28 | W
29 | X
30 | Y
31 | Z
32 |
33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
34 | Adj
35 | Adm
36 | Adv
37 | Asst
38 | Bart
39 | Bldg
40 | Brig
41 | Bros
42 | Capt
43 | Cmdr
44 | Col
45 | Comdr
46 | Con
47 | Corp
48 | Cpl
49 | DR
50 | Dr
51 | Drs
52 | Ens
53 | Gen
54 | Gov
55 | Hon
56 | Hr
57 | Hosp
58 | Insp
59 | Lt
60 | MM
61 | MR
62 | MRS
63 | MS
64 | Maj
65 | Messrs
66 | Mlle
67 | Mme
68 | Mr
69 | Mrs
70 | Ms
71 | Msgr
72 | Op
73 | Ord
74 | Pfc
75 | Ph
76 | Prof
77 | Pvt
78 | Rep
79 | Reps
80 | Res
81 | Rev
82 | Rt
83 | Sen
84 | Sens
85 | Sfc
86 | Sgt
87 | Sr
88 | St
89 | Supt
90 | Surg
91 |
92 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
93 | v
94 | vs
95 | i.e
96 | rev
97 | e.g
98 |
99 | #Numbers only. These should only induce breaks when followed by a numeric sequence
100 | # add NUMERIC_ONLY after the word for this function
101 | #This case is mostly for the english "No." which can either be a sentence of its own, or
102 | #if followed by a number, a non-breaking prefix
103 | No #NUMERIC_ONLY#
104 | Nos
105 | Art #NUMERIC_ONLY#
106 | Nr
107 | pp #NUMERIC_ONLY#
108 |
109 | #month abbreviations
110 | Jan
111 | Feb
112 | Mar
113 | Apr
114 | #May is a full word
115 | Jun
116 | Jul
117 | Aug
118 | Sep
119 | Oct
120 | Nov
121 | Dec
122 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.es:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender
5 | #usually upper case letters are initials in a name
6 | A
7 | B
8 | C
9 | D
10 | E
11 | F
12 | G
13 | H
14 | I
15 | J
16 | K
17 | L
18 | M
19 | N
20 | O
21 | P
22 | Q
23 | R
24 | S
25 | T
26 | U
27 | V
28 | W
29 | X
30 | Y
31 | Z
32 |
33 | # Period-final abbreviation list from http://www.ctspanish.com/words/abbreviations.htm
34 |
35 | A.C
36 | Apdo
37 | Av
38 | Bco
39 | CC.AA
40 | Da
41 | Dep
42 | Dn
43 | Dr
44 | Dra
45 | EE.UU
46 | Excmo
47 | FF.CC
48 | Fil
49 | Gral
50 | J.C
51 | Let
52 | Lic
53 | N.B
54 | P.D
55 | P.V.P
56 | Prof
57 | Pts
58 | Rte
59 | S.A
60 | S.A.R
61 | S.E
62 | S.L
63 | S.R.C
64 | Sr
65 | Sra
66 | Srta
67 | Sta
68 | Sto
69 | T.V.E
70 | Tel
71 | Ud
72 | Uds
73 | V.B
74 | V.E
75 | Vd
76 | Vds
77 | a/c
78 | adj
79 | admón
80 | afmo
81 | apdo
82 | av
83 | c
84 | c.f
85 | c.g
86 | cap
87 | cm
88 | cta
89 | dcha
90 | doc
91 | ej
92 | entlo
93 | esq
94 | etc
95 | f.c
96 | gr
97 | grs
98 | izq
99 | kg
100 | km
101 | mg
102 | mm
103 | núm
104 | núm
105 | p
106 | p.a
107 | p.ej
108 | ptas
109 | pág
110 | págs
111 | pág
112 | págs
113 | q.e.g.e
114 | q.e.s.m
115 | s
116 | s.s.s
117 | vid
118 | vol
119 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.fi:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT
2 | #indicate an end-of-sentence marker. Special cases are included for prefixes
3 | #that ONLY appear before 0-9 numbers.
4 |
5 | #This list is compiled from omorfi database
6 | #by Tommi A Pirinen.
7 |
8 |
9 | #any single upper case letter followed by a period is not a sentence ender
10 | A
11 | B
12 | C
13 | D
14 | E
15 | F
16 | G
17 | H
18 | I
19 | J
20 | K
21 | L
22 | M
23 | N
24 | O
25 | P
26 | Q
27 | R
28 | S
29 | T
30 | U
31 | V
32 | W
33 | X
34 | Y
35 | Z
36 | Å
37 | Ä
38 | Ö
39 |
40 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
41 | alik
42 | alil
43 | amir
44 | apul
45 | apul.prof
46 | arkkit
47 | ass
48 | assist
49 | dipl
50 | dipl.arkkit
51 | dipl.ekon
52 | dipl.ins
53 | dipl.kielenk
54 | dipl.kirjeenv
55 | dipl.kosm
56 | dipl.urk
57 | dos
58 | erikoiseläinl
59 | erikoishammasl
60 | erikoisl
61 | erikoist
62 | ev.luutn
63 | evp
64 | fil
65 | ft
66 | hallinton
67 | hallintot
68 | hammaslääket
69 | jatk
70 | jääk
71 | kansaned
72 | kapt
73 | kapt.luutn
74 | kenr
75 | kenr.luutn
76 | kenr.maj
77 | kers
78 | kirjeenv
79 | kom
80 | kom.kapt
81 | komm
82 | konst
83 | korpr
84 | luutn
85 | maist
86 | maj
87 | Mr
88 | Mrs
89 | Ms
90 | M.Sc
91 | neuv
92 | nimim
93 | Ph.D
94 | prof
95 | puh.joht
96 | pääll
97 | res
98 | san
99 | siht
100 | suom
101 | sähköp
102 | säv
103 | toht
104 | toim
105 | toim.apul
106 | toim.joht
107 | toim.siht
108 | tuom
109 | ups
110 | vänr
111 | vääp
112 | ye.ups
113 | ylik
114 | ylil
115 | ylim
116 | ylimatr
117 | yliop
118 | yliopp
119 | ylip
120 | yliv
121 |
122 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall
123 | #into this category - it sometimes ends a sentence)
124 | e.g
125 | ent
126 | esim
127 | huom
128 | i.e
129 | ilm
130 | l
131 | mm
132 | myöh
133 | nk
134 | nyk
135 | par
136 | po
137 | t
138 | v
139 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.fr:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 | #
4 | #any single upper case letter followed by a period is not a sentence ender
5 | #usually upper case letters are initials in a name
6 | #no French words end in single lower-case letters, so we throw those in too?
7 | A
8 | B
9 | C
10 | D
11 | E
12 | F
13 | G
14 | H
15 | I
16 | J
17 | K
18 | L
19 | M
20 | N
21 | O
22 | P
23 | Q
24 | R
25 | S
26 | T
27 | U
28 | V
29 | W
30 | X
31 | Y
32 | Z
33 | a
34 | b
35 | c
36 | d
37 | e
38 | f
39 | g
40 | h
41 | i
42 | j
43 | k
44 | l
45 | m
46 | n
47 | o
48 | p
49 | q
50 | r
51 | s
52 | t
53 | u
54 | v
55 | w
56 | x
57 | y
58 | z
59 |
60 | # Period-final abbreviation list for French
61 | A.C.N
62 | A.M
63 | art
64 | ann
65 | apr
66 | av
67 | auj
68 | lib
69 | B.P
70 | boul
71 | ca
72 | c.-à-d
73 | cf
74 | ch.-l
75 | chap
76 | contr
77 | C.P.I
78 | C.Q.F.D
79 | C.N
80 | C.N.S
81 | C.S
82 | dir
83 | éd
84 | e.g
85 | env
86 | al
87 | etc
88 | E.V
89 | ex
90 | fasc
91 | fém
92 | fig
93 | fr
94 | hab
95 | ibid
96 | id
97 | i.e
98 | inf
99 | LL.AA
100 | LL.AA.II
101 | LL.AA.RR
102 | LL.AA.SS
103 | L.D
104 | LL.EE
105 | LL.MM
106 | LL.MM.II.RR
107 | loc.cit
108 | masc
109 | MM
110 | ms
111 | N.B
112 | N.D.A
113 | N.D.L.R
114 | N.D.T
115 | n/réf
116 | NN.SS
117 | N.S
118 | N.D
119 | N.P.A.I
120 | p.c.c
121 | pl
122 | pp
123 | p.ex
124 | p.j
125 | P.S
126 | R.A.S
127 | R.-V
128 | R.P
129 | R.I.P
130 | SS
131 | S.S
132 | S.A
133 | S.A.I
134 | S.A.R
135 | S.A.S
136 | S.E
137 | sec
138 | sect
139 | sing
140 | S.M
141 | S.M.I.R
142 | sq
143 | sqq
144 | suiv
145 | sup
146 | suppl
147 | tél
148 | T.S.V.P
149 | vb
150 | vol
151 | vs
152 | X.O
153 | Z.I
154 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.hu:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
5 | #usually upper case letters are initials in a name
6 | A
7 | B
8 | C
9 | D
10 | E
11 | F
12 | G
13 | H
14 | I
15 | J
16 | K
17 | L
18 | M
19 | N
20 | O
21 | P
22 | Q
23 | R
24 | S
25 | T
26 | U
27 | V
28 | W
29 | X
30 | Y
31 | Z
32 | Á
33 | É
34 | Í
35 | Ó
36 | Ö
37 | Ő
38 | Ú
39 | Ü
40 | Ű
41 |
42 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
43 | Dr
44 | dr
45 | kb
46 | Kb
47 | vö
48 | Vö
49 | pl
50 | Pl
51 | ca
52 | Ca
53 | min
54 | Min
55 | max
56 | Max
57 | ún
58 | Ún
59 | prof
60 | Prof
61 | de
62 | De
63 | du
64 | Du
65 | Szt
66 | St
67 |
68 | #Numbers only. These should only induce breaks when followed by a numeric sequence
69 | # add NUMERIC_ONLY after the word for this function
70 | #This case is mostly for the english "No." which can either be a sentence of its own, or
71 | #if followed by a number, a non-breaking prefix
72 |
73 | # Month name abbreviations
74 | jan #NUMERIC_ONLY#
75 | Jan #NUMERIC_ONLY#
76 | Feb #NUMERIC_ONLY#
77 | feb #NUMERIC_ONLY#
78 | márc #NUMERIC_ONLY#
79 | Márc #NUMERIC_ONLY#
80 | ápr #NUMERIC_ONLY#
81 | Ápr #NUMERIC_ONLY#
82 | máj #NUMERIC_ONLY#
83 | Máj #NUMERIC_ONLY#
84 | jún #NUMERIC_ONLY#
85 | Jún #NUMERIC_ONLY#
86 | Júl #NUMERIC_ONLY#
87 | júl #NUMERIC_ONLY#
88 | aug #NUMERIC_ONLY#
89 | Aug #NUMERIC_ONLY#
90 | Szept #NUMERIC_ONLY#
91 | szept #NUMERIC_ONLY#
92 | okt #NUMERIC_ONLY#
93 | Okt #NUMERIC_ONLY#
94 | nov #NUMERIC_ONLY#
95 | Nov #NUMERIC_ONLY#
96 | dec #NUMERIC_ONLY#
97 | Dec #NUMERIC_ONLY#
98 |
99 | # Other abbreviations
100 | tel #NUMERIC_ONLY#
101 | Tel #NUMERIC_ONLY#
102 | Fax #NUMERIC_ONLY#
103 | fax #NUMERIC_ONLY#
104 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.is:
--------------------------------------------------------------------------------
1 | no #NUMERIC_ONLY#
2 | No #NUMERIC_ONLY#
3 | nr #NUMERIC_ONLY#
4 | Nr #NUMERIC_ONLY#
5 | nR #NUMERIC_ONLY#
6 | NR #NUMERIC_ONLY#
7 | a
8 | b
9 | c
10 | d
11 | e
12 | f
13 | g
14 | h
15 | i
16 | j
17 | k
18 | l
19 | m
20 | n
21 | o
22 | p
23 | q
24 | r
25 | s
26 | t
27 | u
28 | v
29 | w
30 | x
31 | y
32 | z
33 | ^
34 | í
35 | á
36 | ó
37 | æ
38 | A
39 | B
40 | C
41 | D
42 | E
43 | F
44 | G
45 | H
46 | I
47 | J
48 | K
49 | L
50 | M
51 | N
52 | O
53 | P
54 | Q
55 | R
56 | S
57 | T
58 | U
59 | V
60 | W
61 | X
62 | Y
63 | Z
64 | ab.fn
65 | a.fn
66 | afs
67 | al
68 | alm
69 | alg
70 | andh
71 | ath
72 | aths
73 | atr
74 | ao
75 | au
76 | aukaf
77 | áfn
78 | áhrl.s
79 | áhrs
80 | ákv.gr
81 | ákv
82 | bh
83 | bls
84 | dr
85 | e.Kr
86 | et
87 | ef
88 | efn
89 | ennfr
90 | eink
91 | end
92 | e.st
93 | erl
94 | fél
95 | fskj
96 | fh
97 | f.hl
98 | físl
99 | fl
100 | fn
101 | fo
102 | forl
103 | frb
104 | frl
105 | frh
106 | frt
107 | fsl
108 | fsh
109 | fs
110 | fsk
111 | fst
112 | f.Kr
113 | ft
114 | fv
115 | fyrrn
116 | fyrrv
117 | germ
118 | gm
119 | gr
120 | hdl
121 | hdr
122 | hf
123 | hl
124 | hlsk
125 | hljsk
126 | hljv
127 | hljóðv
128 | hr
129 | hv
130 | hvk
131 | holl
132 | Hos
133 | höf
134 | hk
135 | hrl
136 | ísl
137 | kaf
138 | kap
139 | Khöfn
140 | kk
141 | kg
142 | kk
143 | km
144 | kl
145 | klst
146 | kr
147 | kt
148 | kgúrsk
149 | kvk
150 | leturbr
151 | lh
152 | lh.nt
153 | lh.þt
154 | lo
155 | ltr
156 | mlja
157 | mljó
158 | millj
159 | mm
160 | mms
161 | m.fl
162 | miðm
163 | mgr
164 | mst
165 | mín
166 | nf
167 | nh
168 | nhm
169 | nl
170 | nk
171 | nmgr
172 | no
173 | núv
174 | nt
175 | o.áfr
176 | o.m.fl
177 | ohf
178 | o.fl
179 | o.s.frv
180 | ófn
181 | ób
182 | óákv.gr
183 | óákv
184 | pfn
185 | PR
186 | pr
187 | Ritstj
188 | Rvík
189 | Rvk
190 | samb
191 | samhlj
192 | samn
193 | samn
194 | sbr
195 | sek
196 | sérn
197 | sf
198 | sfn
199 | sh
200 | sfn
201 | sh
202 | s.hl
203 | sk
204 | skv
205 | sl
206 | sn
207 | so
208 | ss.us
209 | s.st
210 | samþ
211 | sbr
212 | shlj
213 | sign
214 | skál
215 | st
216 | st.s
217 | stk
218 | sþ
219 | teg
220 | tbl
221 | tfn
222 | tl
223 | tvíhlj
224 | tvt
225 | till
226 | to
227 | umr
228 | uh
229 | us
230 | uppl
231 | útg
232 | vb
233 | Vf
234 | vh
235 | vkf
236 | Vl
237 | vl
238 | vlf
239 | vmf
240 | 8vo
241 | vsk
242 | vth
243 | þt
244 | þf
245 | þjs
246 | þgf
247 | þlt
248 | þolm
249 | þm
250 | þml
251 | þýð
252 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.it:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
5 | #usually upper case letters are initials in a name
6 | A
7 | B
8 | C
9 | D
10 | E
11 | F
12 | G
13 | H
14 | I
15 | J
16 | K
17 | L
18 | M
19 | N
20 | O
21 | P
22 | Q
23 | R
24 | S
25 | T
26 | U
27 | V
28 | W
29 | X
30 | Y
31 | Z
32 |
33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
34 | Adj
35 | Adm
36 | Adv
37 | Amn
38 | Arch
39 | Asst
40 | Avv
41 | Bart
42 | Bcc
43 | Bldg
44 | Brig
45 | Bros
46 | C.A.P
47 | C.P
48 | Capt
49 | Cc
50 | Cmdr
51 | Co
52 | Col
53 | Comdr
54 | Con
55 | Corp
56 | Cpl
57 | DR
58 | Dott
59 | Dr
60 | Drs
61 | Egr
62 | Ens
63 | Gen
64 | Geom
65 | Gov
66 | Hon
67 | Hosp
68 | Hr
69 | Id
70 | Ing
71 | Insp
72 | Lt
73 | MM
74 | MR
75 | MRS
76 | MS
77 | Maj
78 | Messrs
79 | Mlle
80 | Mme
81 | Mo
82 | Mons
83 | Mr
84 | Mrs
85 | Ms
86 | Msgr
87 | N.B
88 | Op
89 | Ord
90 | P.S
91 | P.T
92 | Pfc
93 | Ph
94 | Prof
95 | Pvt
96 | RP
97 | RSVP
98 | Rag
99 | Rep
100 | Reps
101 | Res
102 | Rev
103 | Rif
104 | Rt
105 | S.A
106 | S.B.F
107 | S.P.M
108 | S.p.A
109 | S.r.l
110 | Sen
111 | Sens
112 | Sfc
113 | Sgt
114 | Sig
115 | Sigg
116 | Soc
117 | Spett
118 | Sr
119 | St
120 | Supt
121 | Surg
122 | V.P
123 |
124 | # other
125 | a.c
126 | acc
127 | all
128 | banc
129 | c.a
130 | c.c.p
131 | c.m
132 | c.p
133 | c.s
134 | c.v
135 | corr
136 | dott
137 | e.p.c
138 | ecc
139 | es
140 | fatt
141 | gg
142 | int
143 | lett
144 | ogg
145 | on
146 | p.c
147 | p.c.c
148 | p.es
149 | p.f
150 | p.r
151 | p.v
152 | post
153 | pp
154 | racc
155 | ric
156 | s.n.c
157 | seg
158 | sgg
159 | ss
160 | tel
161 | u.s
162 | v.r
163 | v.s
164 |
165 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
166 | v
167 | vs
168 | i.e
169 | rev
170 | e.g
171 |
172 | #Numbers only. These should only induce breaks when followed by a numeric sequence
173 | # add NUMERIC_ONLY after the word for this function
174 | #This case is mostly for the english "No." which can either be a sentence of its own, or
175 | #if followed by a number, a non-breaking prefix
176 | No #NUMERIC_ONLY#
177 | Nos
178 | Art #NUMERIC_ONLY#
179 | Nr
180 | pp #NUMERIC_ONLY#
181 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.lv:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
5 | #usually upper case letters are initials in a name
6 | A
7 | Ā
8 | B
9 | C
10 | Č
11 | D
12 | E
13 | Ē
14 | F
15 | G
16 | Ģ
17 | H
18 | I
19 | Ī
20 | J
21 | K
22 | Ķ
23 | L
24 | Ļ
25 | M
26 | N
27 | Ņ
28 | O
29 | P
30 | Q
31 | R
32 | S
33 | Š
34 | T
35 | U
36 | Ū
37 | V
38 | W
39 | X
40 | Y
41 | Z
42 | Ž
43 |
44 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
45 | dr
46 | Dr
47 | med
48 | prof
49 | Prof
50 | inž
51 | Inž
52 | ist.loc
53 | Ist.loc
54 | kor.loc
55 | Kor.loc
56 | v.i
57 | vietn
58 | Vietn
59 |
60 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
61 | a.l
62 | t.p
63 | pārb
64 | Pārb
65 | vec
66 | Vec
67 | inv
68 | Inv
69 | sk
70 | Sk
71 | spec
72 | Spec
73 | vienk
74 | Vienk
75 | virz
76 | Virz
77 | māksl
78 | Māksl
79 | mūz
80 | Mūz
81 | akad
82 | Akad
83 | soc
84 | Soc
85 | galv
86 | Galv
87 | vad
88 | Vad
89 | sertif
90 | Sertif
91 | folkl
92 | Folkl
93 | hum
94 | Hum
95 |
96 | #Numbers only. These should only induce breaks when followed by a numeric sequence
97 | # add NUMERIC_ONLY after the word for this function
98 | #This case is mostly for the english "No." which can either be a sentence of its own, or
99 | #if followed by a number, a non-breaking prefix
100 | Nr #NUMERIC_ONLY#
101 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.nl:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 | #Sources: http://nl.wikipedia.org/wiki/Lijst_van_afkortingen
4 | # http://nl.wikipedia.org/wiki/Aanspreekvorm
5 | # http://nl.wikipedia.org/wiki/Titulatuur_in_het_Nederlands_hoger_onderwijs
6 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
7 | #usually upper case letters are initials in a name
8 | A
9 | B
10 | C
11 | D
12 | E
13 | F
14 | G
15 | H
16 | I
17 | J
18 | K
19 | L
20 | M
21 | N
22 | O
23 | P
24 | Q
25 | R
26 | S
27 | T
28 | U
29 | V
30 | W
31 | X
32 | Y
33 | Z
34 |
35 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
36 | bacc
37 | bc
38 | bgen
39 | c.i
40 | dhr
41 | dr
42 | dr.h.c
43 | drs
44 | drs
45 | ds
46 | eint
47 | fa
48 | Fa
49 | fam
50 | gen
51 | genm
52 | ing
53 | ir
54 | jhr
55 | jkvr
56 | jr
57 | kand
58 | kol
59 | lgen
60 | lkol
61 | Lt
62 | maj
63 | Mej
64 | mevr
65 | Mme
66 | mr
67 | mr
68 | Mw
69 | o.b.s
70 | plv
71 | prof
72 | ritm
73 | tint
74 | Vz
75 | Z.D
76 | Z.D.H
77 | Z.E
78 | Z.Em
79 | Z.H
80 | Z.K.H
81 | Z.K.M
82 | Z.M
83 | z.v
84 |
85 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
86 | #we seem to have a lot of these in dutch i.e.: i.p.v - in plaats van (in stead of) never ends a sentence
87 | a.g.v
88 | bijv
89 | bijz
90 | bv
91 | d.w.z
92 | e.c
93 | e.g
94 | e.k
95 | ev
96 | i.p.v
97 | i.s.m
98 | i.t.t
99 | i.v.m
100 | m.a.w
101 | m.b.t
102 | m.b.v
103 | m.h.o
104 | m.i
105 | m.i.v
106 | v.w.t
107 |
108 | #Numbers only. These should only induce breaks when followed by a numeric sequence
109 | # add NUMERIC_ONLY after the word for this function
110 | #This case is mostly for the english "No." which can either be a sentence of its own, or
111 | #if followed by a number, a non-breaking prefix
112 | Nr #NUMERIC_ONLY#
113 | Nrs
114 | nrs
115 | nr #NUMERIC_ONLY#
116 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.pl:
--------------------------------------------------------------------------------
1 | adw
2 | afr
3 | akad
4 | al
5 | Al
6 | am
7 | amer
8 | arch
9 | art
10 | Art
11 | artyst
12 | astr
13 | austr
14 | bałt
15 | bdb
16 | bł
17 | bm
18 | br
19 | bryg
20 | bryt
21 | centr
22 | ces
23 | chem
24 | chiń
25 | chir
26 | c.k
27 | c.o
28 | cyg
29 | cyw
30 | cyt
31 | czes
32 | czw
33 | cd
34 | Cd
35 | czyt
36 | ćw
37 | ćwicz
38 | daw
39 | dcn
40 | dekl
41 | demokr
42 | det
43 | diec
44 | dł
45 | dn
46 | dot
47 | dol
48 | dop
49 | dost
50 | dosł
51 | h.c
52 | ds
53 | dst
54 | duszp
55 | dypl
56 | egz
57 | ekol
58 | ekon
59 | elektr
60 | em
61 | ew
62 | fab
63 | farm
64 | fot
65 | fr
66 | gat
67 | gastr
68 | geogr
69 | geol
70 | gimn
71 | głęb
72 | gm
73 | godz
74 | górn
75 | gosp
76 | gr
77 | gram
78 | hist
79 | hiszp
80 | hr
81 | Hr
82 | hot
83 | id
84 | in
85 | im
86 | iron
87 | jn
88 | kard
89 | kat
90 | katol
91 | k.k
92 | kk
93 | kol
94 | kl
95 | k.p.a
96 | kpc
97 | k.p.c
98 | kpt
99 | kr
100 | k.r
101 | krak
102 | k.r.o
103 | kryt
104 | kult
105 | laic
106 | łac
107 | niem
108 | woj
109 | nb
110 | np
111 | Nb
112 | Np
113 | pol
114 | pow
115 | m.in
116 | pt
117 | ps
118 | Pt
119 | Ps
120 | cdn
121 | jw
122 | ryc
123 | rys
124 | Ryc
125 | Rys
126 | tj
127 | tzw
128 | Tzw
129 | tzn
130 | zob
131 | ang
132 | ub
133 | ul
134 | pw
135 | pn
136 | pl
137 | al
138 | k
139 | n
140 | nr #NUMERIC_ONLY#
141 | Nr #NUMERIC_ONLY#
142 | ww
143 | wł
144 | ur
145 | zm
146 | żyd
147 | żarg
148 | żyw
149 | wył
150 | bp
151 | bp
152 | wyst
153 | tow
154 | Tow
155 | o
156 | sp
157 | Sp
158 | st
159 | spółdz
160 | Spółdz
161 | społ
162 | spółgł
163 | stoł
164 | stow
165 | Stoł
166 | Stow
167 | zn
168 | zew
169 | zewn
170 | zdr
171 | zazw
172 | zast
173 | zaw
174 | zał
175 | zal
176 | zam
177 | zak
178 | zakł
179 | zagr
180 | zach
181 | adw
182 | Adw
183 | lek
184 | Lek
185 | med
186 | mec
187 | Mec
188 | doc
189 | Doc
190 | dyw
191 | dyr
192 | Dyw
193 | Dyr
194 | inż
195 | Inż
196 | mgr
197 | Mgr
198 | dh
199 | dr
200 | Dh
201 | Dr
202 | p
203 | P
204 | red
205 | Red
206 | prof
207 | prok
208 | Prof
209 | Prok
210 | hab
211 | płk
212 | Płk
213 | nadkom
214 | Nadkom
215 | podkom
216 | Podkom
217 | ks
218 | Ks
219 | gen
220 | Gen
221 | por
222 | Por
223 | reż
224 | Reż
225 | przyp
226 | Przyp
227 | śp
228 | św
229 | śW
230 | Śp
231 | Św
232 | ŚW
233 | szer
234 | Szer
235 | pkt #NUMERIC_ONLY#
236 | str #NUMERIC_ONLY#
237 | tab #NUMERIC_ONLY#
238 | Tab #NUMERIC_ONLY#
239 | tel
240 | ust #NUMERIC_ONLY#
241 | par #NUMERIC_ONLY#
242 | poz
243 | pok
244 | oo
245 | oO
246 | Oo
247 | OO
248 | r #NUMERIC_ONLY#
249 | l #NUMERIC_ONLY#
250 | s #NUMERIC_ONLY#
251 | najśw
252 | Najśw
253 | A
254 | B
255 | C
256 | D
257 | E
258 | F
259 | G
260 | H
261 | I
262 | J
263 | K
264 | L
265 | M
266 | N
267 | O
268 | P
269 | Q
270 | R
271 | S
272 | T
273 | U
274 | V
275 | W
276 | X
277 | Y
278 | Z
279 | Ś
280 | Ć
281 | Ż
282 | Ź
283 | Dz
284 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.pt:
--------------------------------------------------------------------------------
1 | #File adapted for PT by H. Leal Fontes from the EN & DE versions published with moses-2009-04-13. Last update: 10.11.2009.
2 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
3 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
4 |
5 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
6 | #usually upper case letters are initials in a name
7 | A
8 | B
9 | C
10 | D
11 | E
12 | F
13 | G
14 | H
15 | I
16 | J
17 | K
18 | L
19 | M
20 | N
21 | O
22 | P
23 | Q
24 | R
25 | S
26 | T
27 | U
28 | V
29 | W
30 | X
31 | Y
32 | Z
33 | a
34 | b
35 | c
36 | d
37 | e
38 | f
39 | g
40 | h
41 | i
42 | j
43 | k
44 | l
45 | m
46 | n
47 | o
48 | p
49 | q
50 | r
51 | s
52 | t
53 | u
54 | v
55 | w
56 | x
57 | y
58 | z
59 |
60 |
61 | #Roman Numerals. A dot after one of these is not a sentence break in Portuguese.
62 | I
63 | II
64 | III
65 | IV
66 | V
67 | VI
68 | VII
69 | VIII
70 | IX
71 | X
72 | XI
73 | XII
74 | XIII
75 | XIV
76 | XV
77 | XVI
78 | XVII
79 | XVIII
80 | XIX
81 | XX
82 | i
83 | ii
84 | iii
85 | iv
86 | v
87 | vi
88 | vii
89 | viii
90 | ix
91 | x
92 | xi
93 | xii
94 | xiii
95 | xiv
96 | xv
97 | xvi
98 | xvii
99 | xviii
100 | xix
101 | xx
102 |
103 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
104 | Adj
105 | Adm
106 | Adv
107 | Art
108 | Ca
109 | Capt
110 | Cmdr
111 | Col
112 | Comdr
113 | Con
114 | Corp
115 | Cpl
116 | DR
117 | DRA
118 | Dr
119 | Dra
120 | Dras
121 | Drs
122 | Eng
123 | Enga
124 | Engas
125 | Engos
126 | Ex
127 | Exo
128 | Exmo
129 | Fig
130 | Gen
131 | Hosp
132 | Insp
133 | Lda
134 | MM
135 | MR
136 | MRS
137 | MS
138 | Maj
139 | Mrs
140 | Ms
141 | Msgr
142 | Op
143 | Ord
144 | Pfc
145 | Ph
146 | Prof
147 | Pvt
148 | Rep
149 | Reps
150 | Res
151 | Rev
152 | Rt
153 | Sen
154 | Sens
155 | Sfc
156 | Sgt
157 | Sr
158 | Sra
159 | Sras
160 | Srs
161 | Sto
162 | Supt
163 | Surg
164 | adj
165 | adm
166 | adv
167 | art
168 | cit
169 | col
170 | con
171 | corp
172 | cpl
173 | dr
174 | dra
175 | dras
176 | drs
177 | eng
178 | enga
179 | engas
180 | engos
181 | ex
182 | exo
183 | exmo
184 | fig
185 | op
186 | prof
187 | sr
188 | sra
189 | sras
190 | srs
191 | sto
192 |
193 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
194 | v
195 | vs
196 | i.e
197 | rev
198 | e.g
199 |
200 | #Numbers only. These should only induce breaks when followed by a numeric sequence
201 | # add NUMERIC_ONLY after the word for this function
202 | #This case is mostly for the english "No." which can either be a sentence of its own, or
203 | #if followed by a number, a non-breaking prefix
204 | No #NUMERIC_ONLY#
205 | Nos
206 | Art #NUMERIC_ONLY#
207 | Nr
208 | p #NUMERIC_ONLY#
209 | pp #NUMERIC_ONLY#
210 |
211 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.ro:
--------------------------------------------------------------------------------
1 | A
2 | B
3 | C
4 | D
5 | E
6 | F
7 | G
8 | H
9 | I
10 | J
11 | K
12 | L
13 | M
14 | N
15 | O
16 | P
17 | Q
18 | R
19 | S
20 | T
21 | U
22 | V
23 | W
24 | X
25 | Y
26 | Z
27 | dpdv
28 | etc
29 | șamd
30 | M.Ap.N
31 | dl
32 | Dl
33 | d-na
34 | D-na
35 | dvs
36 | Dvs
37 | pt
38 | Pt
39 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.ru:
--------------------------------------------------------------------------------
1 | # added Cyrillic uppercase letters [А-Я]
2 | # removed 000D carriage return (this is not removed by chomp in tokenizer.perl, and prevents recognition of the prefixes)
3 | # edited by Kate Young (nspaceanalysis@earthlink.net) 21 May 2013
4 | А
5 | Б
6 | В
7 | Г
8 | Д
9 | Е
10 | Ж
11 | З
12 | И
13 | Й
14 | К
15 | Л
16 | М
17 | Н
18 | О
19 | П
20 | Р
21 | С
22 | Т
23 | У
24 | Ф
25 | Х
26 | Ц
27 | Ч
28 | Ш
29 | Щ
30 | Ъ
31 | Ы
32 | Ь
33 | Э
34 | Ю
35 | Я
36 | A
37 | B
38 | C
39 | D
40 | E
41 | F
42 | G
43 | H
44 | I
45 | J
46 | K
47 | L
48 | M
49 | N
50 | O
51 | P
52 | Q
53 | R
54 | S
55 | T
56 | U
57 | V
58 | W
59 | X
60 | Y
61 | Z
62 | 0гг
63 | 1гг
64 | 2гг
65 | 3гг
66 | 4гг
67 | 5гг
68 | 6гг
69 | 7гг
70 | 8гг
71 | 9гг
72 | 0г
73 | 1г
74 | 2г
75 | 3г
76 | 4г
77 | 5г
78 | 6г
79 | 7г
80 | 8г
81 | 9г
82 | Xвв
83 | Vвв
84 | Iвв
85 | Lвв
86 | Mвв
87 | Cвв
88 | Xв
89 | Vв
90 | Iв
91 | Lв
92 | Mв
93 | Cв
94 | 0м
95 | 1м
96 | 2м
97 | 3м
98 | 4м
99 | 5м
100 | 6м
101 | 7м
102 | 8м
103 | 9м
104 | 0мм
105 | 1мм
106 | 2мм
107 | 3мм
108 | 4мм
109 | 5мм
110 | 6мм
111 | 7мм
112 | 8мм
113 | 9мм
114 | 0см
115 | 1см
116 | 2см
117 | 3см
118 | 4см
119 | 5см
120 | 6см
121 | 7см
122 | 8см
123 | 9см
124 | 0дм
125 | 1дм
126 | 2дм
127 | 3дм
128 | 4дм
129 | 5дм
130 | 6дм
131 | 7дм
132 | 8дм
133 | 9дм
134 | 0л
135 | 1л
136 | 2л
137 | 3л
138 | 4л
139 | 5л
140 | 6л
141 | 7л
142 | 8л
143 | 9л
144 | 0км
145 | 1км
146 | 2км
147 | 3км
148 | 4км
149 | 5км
150 | 6км
151 | 7км
152 | 8км
153 | 9км
154 | 0га
155 | 1га
156 | 2га
157 | 3га
158 | 4га
159 | 5га
160 | 6га
161 | 7га
162 | 8га
163 | 9га
164 | 0кг
165 | 1кг
166 | 2кг
167 | 3кг
168 | 4кг
169 | 5кг
170 | 6кг
171 | 7кг
172 | 8кг
173 | 9кг
174 | 0т
175 | 1т
176 | 2т
177 | 3т
178 | 4т
179 | 5т
180 | 6т
181 | 7т
182 | 8т
183 | 9т
184 | 0г
185 | 1г
186 | 2г
187 | 3г
188 | 4г
189 | 5г
190 | 6г
191 | 7г
192 | 8г
193 | 9г
194 | 0мг
195 | 1мг
196 | 2мг
197 | 3мг
198 | 4мг
199 | 5мг
200 | 6мг
201 | 7мг
202 | 8мг
203 | 9мг
204 | бульв
205 | в
206 | вв
207 | г
208 | га
209 | гг
210 | гл
211 | гос
212 | д
213 | дм
214 | доп
215 | др
216 | е
217 | ед
218 | ед
219 | зам
220 | и
221 | инд
222 | исп
223 | Исп
224 | к
225 | кап
226 | кг
227 | кв
228 | кл
229 | км
230 | кол
231 | комн
232 | коп
233 | куб
234 | л
235 | лиц
236 | лл
237 | м
238 | макс
239 | мг
240 | мин
241 | мл
242 | млн
243 | млрд
244 | мм
245 | н
246 | наб
247 | нач
248 | неуд
249 | ном
250 | о
251 | обл
252 | обр
253 | общ
254 | ок
255 | ост
256 | отл
257 | п
258 | пер
259 | перераб
260 | пл
261 | пос
262 | пр
263 | просп
264 | проф
265 | р
266 | ред
267 | руб
268 | с
269 | сб
270 | св
271 | см
272 | соч
273 | ср
274 | ст
275 | стр
276 | т
277 | тел
278 | Тел
279 | тех
280 | тт
281 | туп
282 | тыс
283 | уд
284 | ул
285 | уч
286 | физ
287 | х
288 | хор
289 | ч
290 | чел
291 | шт
292 | экз
293 | э
294 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.sk:
--------------------------------------------------------------------------------
1 | Bc
2 | Mgr
3 | RNDr
4 | PharmDr
5 | PhDr
6 | JUDr
7 | PaedDr
8 | ThDr
9 | Ing
10 | MUDr
11 | MDDr
12 | MVDr
13 | Dr
14 | ThLic
15 | PhD
16 | ArtD
17 | ThDr
18 | Dr
19 | DrSc
20 | CSs
21 | prof
22 | obr
23 | Obr
24 | Č
25 | č
26 | absol
27 | adj
28 | admin
29 | adr
30 | Adr
31 | adv
32 | advok
33 | afr
34 | ak
35 | akad
36 | akc
37 | akuz
38 | et
39 | al
40 | alch
41 | amer
42 | anat
43 | angl
44 | Angl
45 | anglosas
46 | anorg
47 | ap
48 | apod
49 | arch
50 | archeol
51 | archit
52 | arg
53 | art
54 | astr
55 | astrol
56 | astron
57 | atp
58 | atď
59 | austr
60 | Austr
61 | aut
62 | belg
63 | Belg
64 | bibl
65 | Bibl
66 | biol
67 | bot
68 | bud
69 | bás
70 | býv
71 | cest
72 | chem
73 | cirk
74 | csl
75 | čs
76 | Čs
77 | dat
78 | dep
79 | det
80 | dial
81 | diaľ
82 | dipl
83 | distrib
84 | dokl
85 | dosl
86 | dopr
87 | dram
88 | duš
89 | dv
90 | dvojčl
91 | dór
92 | ekol
93 | ekon
94 | el
95 | elektr
96 | elektrotech
97 | energet
98 | epic
99 | est
100 | etc
101 | etonym
102 | eufem
103 | európ
104 | Európ
105 | ev
106 | evid
107 | expr
108 | fa
109 | fam
110 | farm
111 | fem
112 | feud
113 | fil
114 | filat
115 | filoz
116 | fi
117 | fon
118 | form
119 | fot
120 | fr
121 | Fr
122 | franc
123 | Franc
124 | fraz
125 | fut
126 | fyz
127 | fyziol
128 | garb
129 | gen
130 | genet
131 | genpor
132 | geod
133 | geogr
134 | geol
135 | geom
136 | germ
137 | gr
138 | Gr
139 | gréc
140 | Gréc
141 | gréckokat
142 | hebr
143 | herald
144 | hist
145 | hlav
146 | hosp
147 | hromad
148 | hud
149 | hypok
150 | ident
151 | i.e
152 | ident
153 | imp
154 | impf
155 | indoeur
156 | inf
157 | inform
158 | instr
159 | int
160 | interj
161 | inšt
162 | inštr
163 | iron
164 | jap
165 | Jap
166 | jaz
167 | jedn
168 | juhoamer
169 | juhových
170 | juhozáp
171 | juž
172 | kanad
173 | Kanad
174 | kanc
175 | kapit
176 | kpt
177 | kart
178 | katastr
179 | knih
180 | kniž
181 | komp
182 | konj
183 | konkr
184 | kozmet
185 | krajč
186 | kresť
187 | kt
188 | kuch
189 | lat
190 | latinskoamer
191 | lek
192 | lex
193 | lingv
194 | lit
195 | litur
196 | log
197 | lok
198 | max
199 | Max
200 | maď
201 | Maď
202 | medzinár
203 | mest
204 | metr
205 | mil
206 | Mil
207 | min
208 | Min
209 | miner
210 | ml
211 | mld
212 | mn
213 | mod
214 | mytol
215 | napr
216 | nar
217 | Nar
218 | nasl
219 | nedok
220 | neg
221 | negat
222 | neklas
223 | nem
224 | Nem
225 | neodb
226 | neos
227 | neskl
228 | nesklon
229 | nespis
230 | nespráv
231 | neved
232 | než
233 | niekt
234 | niž
235 | nom
236 | náb
237 | nákl
238 | námor
239 | nár
240 | obch
241 | obj
242 | obv
243 | obyč
244 | obč
245 | občian
246 | odb
247 | odd
248 | ods
249 | ojed
250 | okr
251 | Okr
252 | opt
253 | opyt
254 | org
255 | os
256 | osob
257 | ot
258 | ovoc
259 | par
260 | part
261 | pejor
262 | pers
263 | pf
264 | Pf
265 | P.f
266 | p.f
267 | pl
268 | Plk
269 | pod
270 | podst
271 | pokl
272 | polit
273 | politol
274 | polygr
275 | pomn
276 | popl
277 | por
278 | porad
279 | porov
280 | posch
281 | potrav
282 | použ
283 | poz
284 | pozit
285 | poľ
286 | poľno
287 | poľnohosp
288 | poľov
289 | pošt
290 | pož
291 | prac
292 | predl
293 | pren
294 | prep
295 | preuk
296 | priezv
297 | Priezv
298 | privl
299 | prof
300 | práv
301 | príd
302 | príj
303 | prík
304 | príp
305 | prír
306 | prísl
307 | príslov
308 | príč
309 | psych
310 | publ
311 | pís
312 | písm
313 | pôv
314 | refl
315 | reg
316 | rep
317 | resp
318 | rozk
319 | rozlič
320 | rozpráv
321 | roč
322 | Roč
323 | ryb
324 | rádiotech
325 | rím
326 | samohl
327 | semest
328 | sev
329 | severoamer
330 | severových
331 | severozáp
332 | sg
333 | skr
334 | skup
335 | sl
336 | Sloven
337 | soc
338 | soch
339 | sociol
340 | sp
341 | spol
342 | Spol
343 | spoloč
344 | spoluhl
345 | správ
346 | spôs
347 | st
348 | star
349 | starogréc
350 | starorím
351 | s.r.o
352 | stol
353 | stor
354 | str
355 | stredoamer
356 | stredoškol
357 | subj
358 | subst
359 | superl
360 | sv
361 | sz
362 | súkr
363 | súp
364 | súvzť
365 | tal
366 | Tal
367 | tech
368 | tel
369 | Tel
370 | telef
371 | teles
372 | telev
373 | teol
374 | trans
375 | turist
376 | tuzem
377 | typogr
378 | tzn
379 | tzv
380 | ukaz
381 | ul
382 | Ul
383 | umel
384 | univ
385 | ust
386 | ved
387 | vedľ
388 | verb
389 | veter
390 | vin
391 | viď
392 | vl
393 | vod
394 | vodohosp
395 | pnl
396 | vulg
397 | vyj
398 | vys
399 | vysokoškol
400 | vzťaž
401 | vôb
402 | vých
403 | výd
404 | výrob
405 | výsk
406 | výsl
407 | výtv
408 | výtvar
409 | význ
410 | včel
411 | vš
412 | všeob
413 | zahr
414 | zar
415 | zariad
416 | zast
417 | zastar
418 | zastaráv
419 | zb
420 | zdravot
421 | združ
422 | zjemn
423 | zlat
424 | zn
425 | Zn
426 | zool
427 | zr
428 | zried
429 | zv
430 | záhr
431 | zák
432 | zákl
433 | zám
434 | záp
435 | západoeur
436 | zázn
437 | územ
438 | účt
439 | čast
440 | čes
441 | Čes
442 | čl
443 | čísl
444 | živ
445 | pr
446 | fak
447 | Kr
448 | p.n.l
449 | A
450 | B
451 | C
452 | D
453 | E
454 | F
455 | G
456 | H
457 | I
458 | J
459 | K
460 | L
461 | M
462 | N
463 | O
464 | P
465 | Q
466 | R
467 | S
468 | T
469 | U
470 | V
471 | W
472 | X
473 | Y
474 | Z
475 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.sl:
--------------------------------------------------------------------------------
1 | dr
2 | Dr
3 | itd
4 | itn
5 | št #NUMERIC_ONLY#
6 | Št #NUMERIC_ONLY#
7 | d
8 | jan
9 | Jan
10 | feb
11 | Feb
12 | mar
13 | Mar
14 | apr
15 | Apr
16 | jun
17 | Jun
18 | jul
19 | Jul
20 | avg
21 | Avg
22 | sept
23 | Sept
24 | sep
25 | Sep
26 | okt
27 | Okt
28 | nov
29 | Nov
30 | dec
31 | Dec
32 | tj
33 | Tj
34 | npr
35 | Npr
36 | sl
37 | Sl
38 | op
39 | Op
40 | gl
41 | Gl
42 | oz
43 | Oz
44 | prev
45 | dipl
46 | ing
47 | prim
48 | Prim
49 | cf
50 | Cf
51 | gl
52 | Gl
53 | A
54 | B
55 | C
56 | D
57 | E
58 | F
59 | G
60 | H
61 | I
62 | J
63 | K
64 | L
65 | M
66 | N
67 | O
68 | P
69 | Q
70 | R
71 | S
72 | T
73 | U
74 | V
75 | W
76 | X
77 | Y
78 | Z
79 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.sv:
--------------------------------------------------------------------------------
1 | #single upper case letter are usually initials
2 | A
3 | B
4 | C
5 | D
6 | E
7 | F
8 | G
9 | H
10 | I
11 | J
12 | K
13 | L
14 | M
15 | N
16 | O
17 | P
18 | Q
19 | R
20 | S
21 | T
22 | U
23 | V
24 | W
25 | X
26 | Y
27 | Z
28 | #misc abbreviations
29 | AB
30 | G
31 | VG
32 | dvs
33 | etc
34 | from
35 | iaf
36 | jfr
37 | kl
38 | kr
39 | mao
40 | mfl
41 | mm
42 | osv
43 | pga
44 | tex
45 | tom
46 | vs
47 |
--------------------------------------------------------------------------------
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.ta:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
5 | #usually upper case letters are initials in a name
6 | அ
7 | ஆ
8 | இ
9 | ஈ
10 | உ
11 | ஊ
12 | எ
13 | ஏ
14 | ஐ
15 | ஒ
16 | ஓ
17 | ஔ
18 | ஃ
19 | க
20 | கா
21 | கி
22 | கீ
23 | கு
24 | கூ
25 | கெ
26 | கே
27 | கை
28 | கொ
29 | கோ
30 | கௌ
31 | க்
32 | ச
33 | சா
34 | சி
35 | சீ
36 | சு
37 | சூ
38 | செ
39 | சே
40 | சை
41 | சொ
42 | சோ
43 | சௌ
44 | ச்
45 | ட
46 | டா
47 | டி
48 | டீ
49 | டு
50 | டூ
51 | டெ
52 | டே
53 | டை
54 | டொ
55 | டோ
56 | டௌ
57 | ட்
58 | த
59 | தா
60 | தி
61 | தீ
62 | து
63 | தூ
64 | தெ
65 | தே
66 | தை
67 | தொ
68 | தோ
69 | தௌ
70 | த்
71 | ப
72 | பா
73 | பி
74 | பீ
75 | பு
76 | பூ
77 | பெ
78 | பே
79 | பை
80 | பொ
81 | போ
82 | பௌ
83 | ப்
84 | ற
85 | றா
86 | றி
87 | றீ
88 | று
89 | றூ
90 | றெ
91 | றே
92 | றை
93 | றொ
94 | றோ
95 | றௌ
96 | ற்
97 | ய
98 | யா
99 | யி
100 | யீ
101 | யு
102 | யூ
103 | யெ
104 | யே
105 | யை
106 | யொ
107 | யோ
108 | யௌ
109 | ய்
110 | ர
111 | ரா
112 | ரி
113 | ரீ
114 | ரு
115 | ரூ
116 | ரெ
117 | ரே
118 | ரை
119 | ரொ
120 | ரோ
121 | ரௌ
122 | ர்
123 | ல
124 | லா
125 | லி
126 | லீ
127 | லு
128 | லூ
129 | லெ
130 | லே
131 | லை
132 | லொ
133 | லோ
134 | லௌ
135 | ல்
136 | வ
137 | வா
138 | வி
139 | வீ
140 | வு
141 | வூ
142 | வெ
143 | வே
144 | வை
145 | வொ
146 | வோ
147 | வௌ
148 | வ்
149 | ள
150 | ளா
151 | ளி
152 | ளீ
153 | ளு
154 | ளூ
155 | ளெ
156 | ளே
157 | ளை
158 | ளொ
159 | ளோ
160 | ளௌ
161 | ள்
162 | ழ
163 | ழா
164 | ழி
165 | ழீ
166 | ழு
167 | ழூ
168 | ழெ
169 | ழே
170 | ழை
171 | ழொ
172 | ழோ
173 | ழௌ
174 | ழ்
175 | ங
176 | ஙா
177 | ஙி
178 | ஙீ
179 | ஙு
180 | ஙூ
181 | ஙெ
182 | ஙே
183 | ஙை
184 | ஙொ
185 | ஙோ
186 | ஙௌ
187 | ங்
188 | ஞ
189 | ஞா
190 | ஞி
191 | ஞீ
192 | ஞு
193 | ஞூ
194 | ஞெ
195 | ஞே
196 | ஞை
197 | ஞொ
198 | ஞோ
199 | ஞௌ
200 | ஞ்
201 | ண
202 | ணா
203 | ணி
204 | ணீ
205 | ணு
206 | ணூ
207 | ணெ
208 | ணே
209 | ணை
210 | ணொ
211 | ணோ
212 | ணௌ
213 | ண்
214 | ந
215 | நா
216 | நி
217 | நீ
218 | நு
219 | நூ
220 | நெ
221 | நே
222 | நை
223 | நொ
224 | நோ
225 | நௌ
226 | ந்
227 | ம
228 | மா
229 | மி
230 | மீ
231 | மு
232 | மூ
233 | மெ
234 | மே
235 | மை
236 | மொ
237 | மோ
238 | மௌ
239 | ம்
240 | ன
241 | னா
242 | னி
243 | னீ
244 | னு
245 | னூ
246 | னெ
247 | னே
248 | னை
249 | னொ
250 | னோ
251 | னௌ
252 | ன்
253 |
254 |
255 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
256 | திரு
257 | திருமதி
258 | வண
259 | கௌரவ
260 |
261 |
262 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
263 | உ.ம்
264 | #கா.ம்
265 | #எ.ம்
266 |
267 |
268 | #Numbers only. These should only induce breaks when followed by a numeric sequence
269 | # add NUMERIC_ONLY after the word for this function
270 | #This case is mostly for the english "No." which can either be a sentence of its own, or
271 | #if followed by a number, a non-breaking prefix
272 | No #NUMERIC_ONLY#
273 | Nos
274 | Art #NUMERIC_ONLY#
275 | Nr
276 | pp #NUMERIC_ONLY#
277 |
--------------------------------------------------------------------------------
/preprocess/normalize-punctuation.perl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/perl -w
2 |
3 | use strict;
4 |
5 | my ($language) = @ARGV;
6 |
7 | while(<STDIN>) {
8 | s/\r//g;
9 | # remove extra spaces
10 | s/\(/ \(/g;
11 | s/\)/\) /g; s/ +/ /g;
12 | s/\) ([\.\!\:\?\;\,])/\)$1/g;
13 | s/\( /\(/g;
14 | s/ \)/\)/g;
15 | s/(\d) \%/$1\%/g;
16 | s/ :/:/g;
17 | s/ ;/;/g;
18 | # normalize unicode punctuation
19 | s/„/\"/g;
20 | s/“/\"/g;
21 | s/”/\"/g;
22 | s/–/-/g;
23 | s/—/ - /g; s/ +/ /g;
24 | s/´/\'/g;
25 | s/([a-z])‘([a-z])/$1\'$2/gi;
26 | s/([a-z])’([a-z])/$1\'$2/gi;
27 | s/‘/\"/g;
28 | s/‚/\"/g;
29 | s/’/\"/g;
30 | s/''/\"/g;
31 | s/´´/\"/g;
32 | s/…/.../g;
33 | # French quotes
34 | s/ « / \"/g;
35 | s/« /\"/g;
36 | s/«/\"/g;
37 | s/ » /\" /g;
38 | s/ »/\"/g;
39 | s/»/\"/g;
40 | # handle pseudo-spaces
41 | s/ \%/\%/g;
42 | s/nº /nº /g;
43 | s/ :/:/g;
44 | s/ ºC/ ºC/g;
45 | s/ cm/ cm/g;
46 | s/ \?/\?/g;
47 | s/ \!/\!/g;
48 | s/ ;/;/g;
49 | s/, /, /g; s/ +/ /g;
50 |
51 | # English "quotation," followed by comma, style
52 | if ($language eq "en") {
53 | s/\"([,\.]+)/$1\"/g;
54 | }
55 | # Czech is confused
56 | elsif ($language eq "cs" || $language eq "cz") {
57 | }
58 | # German/Spanish/French "quotation", followed by comma, style
59 | else {
60 | s/,\"/\",/g;
61 | s/(\.+)\"(\s*[^<])/\"$1$2/g; # don't fix period at end of sentence
62 | }
63 |
64 | print STDERR $_ if //;
65 |
66 | if ($language eq "de" || $language eq "es" || $language eq "cz" || $language eq "cs" || $language eq "fr") {
67 | s/(\d) (\d)/$1,$2/g;
68 | }
69 | else {
70 | s/(\d) (\d)/$1.$2/g;
71 | }
72 | print $_;
73 | }
74 |
--------------------------------------------------------------------------------
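Note: normalize-punctuation.perl reads the text to normalize from STDIN and takes the language code as its first positional argument (`my ($language) = @ARGV;`); it has no -l option. An illustrative invocation, with placeholder file names:

perl normalize-punctuation.perl de < corpus.de > corpus.de.norm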
/preprocess/preprocess.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # source language (example: fr)
4 | S=$1
5 | # target language (example: en)
6 | T=$2
7 |
8 | # path to dl4mt-cdec/preprocess (the directory containing these scripts)
9 | P1=$3
10 |
11 | # path to subword NMT scripts (can be downloaded from https://github.com/rsennrich/subword-nmt)
12 | P2=$4
13 |
14 | ## merge all parallel corpora
15 | #./merge.sh $1 $2
16 |
17 | perl $P1/normalize-punctuation.perl ${S} < all_${S}-${T}.${S} > all_${S}-${T}.${S}.norm # do this for validation and test
18 | perl $P1/normalize-punctuation.perl ${T} < all_${S}-${T}.${T} > all_${S}-${T}.${T}.norm # do this for validation and test
19 |
20 | # tokenize
21 | perl $P1/tokenizer_apos.perl -threads 5 -l $S < all_${S}-${T}.${S}.norm > all_${S}-${T}.${S}.tok # do this for validation and test
22 | perl $P1/tokenizer_apos.perl -threads 5 -l $T < all_${S}-${T}.${T}.norm > all_${S}-${T}.${T}.tok # do this for validation and test
23 |
24 | # BPE
25 | if [ ! -f "../${S}.bpe" ]; then
26 | python $P2/learn_bpe.py -s 20000 < all_${S}-${T}.${S}.tok > ../${S}.bpe
27 | fi
28 | if [ ! -f "../${T}.bpe" ]; then
29 | python $P2/learn_bpe.py -s 20000 < all_${S}-${T}.${T}.tok > ../${T}.bpe
30 | fi
31 |
32 | python $P2/apply_bpe.py -c ../${S}.bpe < all_${S}-${T}.${S}.tok > all_${S}-${T}.${S}.tok.bpe # do this for validation and test
33 | python $P2/apply_bpe.py -c ../${T}.bpe < all_${S}-${T}.${T}.tok > all_${S}-${T}.${T}.tok.bpe # do this for validation and test
34 |
35 | # shuffle
36 | python $P1/shuffle.py all_${S}-${T}.${S}.tok.bpe all_${S}-${T}.${T}.tok.bpe all_${S}-${T}.${S}.tok all_${S}-${T}.${T}.tok
37 |
38 | # build dictionary
39 | #python $P1/build_dictionary.py all_${S}-${T}.${S}.tok &
40 | #python $P1/build_dictionary.py all_${S}-${T}.${T}.tok &
41 | #python $P1/build_dictionary_word.py all_${S}-${T}.${S}.tok.bpe &
42 | #python $P1/build_dictionary_word.py all_${S}-${T}.${T}.tok.bpe &
43 |
--------------------------------------------------------------------------------
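A sketch of how preprocess.sh might be invoked, assuming the merged parallel corpora all_de-en.de and all_de-en.en already sit in the working directory and that subword-nmt has been cloned separately (both paths below are placeholders):

bash preprocess.sh de en /path/to/dl4mt-cdec/preprocess /path/to/subword-nmt

The script writes .norm, .tok, and .tok.bpe files next to the inputs, learns the BPE codes ../de.bpe and ../en.bpe if they do not already exist, and finally shuffles the BPE-level and token-level files in parallel.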
/preprocess/shuffle.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import random
4 |
5 | from tempfile import mkstemp
6 | from subprocess import call
7 |
8 |
9 |
10 | def main(files):
11 |
12 | tf_os, tpath = mkstemp()
13 | tf = open(tpath, 'w')
14 |
15 | fds = [open(ff) for ff in files]
16 |
17 | for l in fds[0]:
18 | lines = [l.strip()] + [ff.readline().strip() for ff in fds[1:]]
19 | print >>tf, "|||".join(lines)
20 |
21 | [ff.close() for ff in fds]
22 | tf.close()
23 |
24 | tf = open(tpath, 'r')
25 | lines = tf.readlines()
26 | random.shuffle(lines)
27 |
28 | fds = [open(ff+'.shuf','w') for ff in files]
29 |
30 | for l in lines:
31 | s = l.strip().split('|||')
32 | for ii, fd in enumerate(fds):
33 | print >>fd, s[ii]
34 |
35 | [ff.close() for ff in fds]
36 |
37 | os.remove(tpath)
38 |
39 | if __name__ == '__main__':
40 | main(sys.argv[1:])
41 |
42 |
43 |
44 |
45 |
--------------------------------------------------------------------------------
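shuffle.py applies one random permutation to every file given on the command line, so parallel corpora stay aligned line by line; each input file gets a shuffled copy with a .shuf suffix. Like the rest of the repository, it uses Python 2 syntax. An illustrative call with placeholder file names:

python shuffle.py all_de-en.en.tok.bpe all_de-en.de.tok.bpe

This writes all_de-en.en.tok.bpe.shuf and all_de-en.de.tok.bpe.shuf.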
/presentation/appendix.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nyu-dl/dl4mt-cdec/e738dc7235cb2819ad2b4e8e5837e97b2fb41de2/presentation/appendix.pdf
--------------------------------------------------------------------------------
/subword_base/train_wmt15_csen_bpe2bpe_both_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from subword_base_both import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both',
11 | 'two_layer_gru_decoder_both'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_both_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_csen_bpe2bpe_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
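The training scripts read their hyperparameters from a plain-text configuration file whose name can be passed as the last command-line argument; each line is a space-separated "name value" pair (see the wmt15_*_adam.txt files below). An illustrative launch on a GPU follows; the THEANO_FLAGS setting is an assumption about the local Theano setup, not something fixed by this repository:

THEANO_FLAGS=device=gpu,floatX=float32 python train_wmt15_csen_bpe2bpe_both_adam.py wmt15_csen_bpe2bpe_adam.txt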
/subword_base/train_wmt15_deen_bpe2bpe_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from subword_base import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder',
11 | 'two_layer_gru_decoder'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_deen_bpe2bpe_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/subword_base/train_wmt15_deen_bpe2bpe_both_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from subword_base_both import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both',
11 | 'two_layer_gru_decoder_both'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_both_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_deen_bpe2bpe_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/subword_base/train_wmt15_fien_bpe2bpe_both_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from subword_base_both import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both',
11 | 'two_layer_gru_decoder_both'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_both_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_fien_bpe2bpe_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/subword_base/train_wmt15_ruen_bpe2bpe_both_adam.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | from collections import OrderedDict
4 | from nmt import train
5 | from subword_base_both import *
6 |
7 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
8 | 'fff': ('param_init_ffflayer', 'ffflayer'),
9 | 'gru': ('param_init_gru', 'gru_layer'),
10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both',
11 | 'two_layer_gru_decoder_both'),
12 | }
13 |
14 |
15 | def main(job_id, params):
16 | re_load = False
17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_both_adam'
18 | source_dataset = params['train_data_path'] + params['source_dataset']
19 | target_dataset = params['train_data_path'] + params['target_dataset']
20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset']
21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset']
22 | source_dictionary = params['train_data_path'] + params['source_dictionary']
23 | target_dictionary = params['train_data_path'] + params['target_dictionary']
24 |
25 | print params, params['save_path'], save_file_name
26 | validerr = train(
27 | max_epochs=int(params['max_epochs']),
28 | patience=int(params['patience']),
29 | dim_word=int(params['dim_word']),
30 | dim_word_src=int(params['dim_word_src']),
31 | save_path=params['save_path'],
32 | save_file_name=save_file_name,
33 | re_load=re_load,
34 | enc_dim=int(params['enc_dim']),
35 | dec_dim=int(params['dec_dim']),
36 | n_words=int(params['n_words']),
37 | n_words_src=int(params['n_words_src']),
38 | decay_c=float(params['decay_c']),
39 | lrate=float(params['learning_rate']),
40 | optimizer=params['optimizer'],
41 | maxlen=int(params['maxlen']),
42 | maxlen_trg=int(params['maxlen_trg']),
43 | maxlen_sample=int(params['maxlen_sample']),
44 | batch_size=int(params['batch_size']),
45 | valid_batch_size=int(params['valid_batch_size']),
46 | sort_size=int(params['sort_size']),
47 | validFreq=int(params['validFreq']),
48 | dispFreq=int(params['dispFreq']),
49 | saveFreq=int(params['saveFreq']),
50 | sampleFreq=int(params['sampleFreq']),
51 | clip_c=int(params['clip_c']),
52 | datasets=[source_dataset, target_dataset],
53 | valid_datasets=[valid_source_dataset, valid_target_dataset],
54 | dictionaries=[source_dictionary, target_dictionary],
55 | use_dropout=int(params['use_dropout']),
56 | source_word_level=int(params['source_word_level']),
57 | target_word_level=int(params['target_word_level']),
58 | layers=layers,
59 | save_every_saveFreq=1,
60 | use_bpe=1,
61 | init_params=init_params,
62 | build_model=build_model,
63 | build_sampler=build_sampler,
64 | gen_sample=gen_sample
65 | )
66 | return validerr
67 |
68 | if __name__ == '__main__':
69 |
70 | import sys, time
71 | if len(sys.argv) > 1:
72 | config_file_name = sys.argv[-1]
73 | else:
74 | config_file_name = 'wmt15_ruen_bpe2bpe_adam.txt'
75 |
76 | f = open(config_file_name, 'r')
77 | lines = f.readlines()
78 | params = OrderedDict()
79 |
80 | for line in lines:
81 | line = line.split('\n')[0]
82 | param_list = line.split(' ')
83 | param_name = param_list[0]
84 | param_value = param_list[1]
85 | params[param_name] = param_value
86 |
87 | main(0, params)
88 |
--------------------------------------------------------------------------------
/subword_base/translate.py:
--------------------------------------------------------------------------------
1 | '''
2 | Translates a source file using a translation model.
3 | '''
4 | import argparse
5 |
6 | import numpy
7 | import cPickle as pkl
8 |
9 | from subword_base import (build_sampler, gen_sample, init_params)
10 | from mixer import *
11 |
12 | from multiprocessing import Process, Queue
13 |
14 |
15 | def translate_model(queue, rqueue, pid, model, options, k, normalize):
16 |
17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
18 | trng = RandomStreams(1234)
19 |
20 | # allocate model parameters
21 | params = init_params(options)
22 |
23 | # load model parameters and set theano shared variables
24 | params = load_params(model, params)
25 | tparams = init_tparams(params)
26 |
27 | # word index
28 | use_noise = theano.shared(numpy.float32(0.))
29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise)
30 |
31 | def _translate(seq):
32 | use_noise.set_value(0.)
33 | # sample given an input sequence and obtain scores
34 | sample, score = gen_sample(tparams, f_init, f_next,
35 | numpy.array(seq).reshape([len(seq), 1]),
36 | options, trng=trng, k=k, maxlen=500,
37 | stochastic=False, argmax=False)
38 |
39 | # normalize scores according to sequence lengths
40 | if normalize:
41 | lengths = numpy.array([len(s) for s in sample])
42 | score = score / lengths
43 | sidx = numpy.argmin(score)
44 | return sample[sidx]
45 |
46 | while True:
47 | req = queue.get()
48 | if req is None:
49 | break
50 |
51 | idx, x = req[0], req[1]
52 | print pid, '-', idx
53 | seq = _translate(x)
54 |
55 | rqueue.put((idx, seq))
56 |
57 | return
58 |
59 |
60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5,
61 | normalize=False, n_process=5, encoder_chr_level=False,
62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False):
63 |
64 | # load model model_options
65 | pkl_file = model.split('.')[0] + '.pkl'
66 | with open(pkl_file, 'rb') as f:
67 | options = pkl.load(f)
68 |
69 | # load source dictionary and invert
70 | with open(dictionary, 'rb') as f:
71 | word_dict = pkl.load(f)
72 | word_idict = dict()
73 | for kk, vv in word_dict.iteritems():
74 | word_idict[vv] = kk
75 | word_idict[0] = ''
76 | word_idict[1] = 'UNK'
77 |
78 | # load target dictionary and invert
79 | with open(dictionary_target, 'rb') as f:
80 | word_dict_trg = pkl.load(f)
81 | word_idict_trg = dict()
82 | for kk, vv in word_dict_trg.iteritems():
83 | word_idict_trg[vv] = kk
84 | word_idict_trg[0] = ''
85 | word_idict_trg[1] = 'UNK'
86 |
87 | # create input and output queues for processes
88 | queue = Queue()
89 | rqueue = Queue()
90 | processes = [None] * n_process
91 | for midx in xrange(n_process):
92 | processes[midx] = Process(
93 | target=translate_model,
94 | args=(queue, rqueue, midx, model, options, k, normalize))
95 | processes[midx].start()
96 |
97 | # utility function
98 | def _seqs2words(caps):
99 | capsw = []
100 | for cc in caps:
101 | ww = []
102 | for w in cc:
103 | if w == 0:
104 | break
105 | if utf8:
106 | ww.append(word_idict_trg[w].encode('utf-8'))
107 | else:
108 | ww.append(word_idict_trg[w])
109 | if decoder_chr_level:
110 | capsw.append(''.join(ww))
111 | else:
112 | capsw.append(' '.join(ww))
113 | return capsw
114 |
115 | def _send_jobs(fname):
116 | with open(fname, 'r') as f:
117 | for idx, line in enumerate(f):
118 | if encoder_chr_level:
119 | words = list(line.decode('utf-8').strip())
120 | else:
121 | words = line.strip().split()
122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words)
123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x)
124 | x += [0]
125 | queue.put((idx, x))
126 | return idx+1
127 |
128 | def _finish_processes():
129 | for midx in xrange(n_process):
130 | queue.put(None)
131 |
132 | def _retrieve_jobs(n_samples):
133 | trans = [None] * n_samples
134 | for idx in xrange(n_samples):
135 | resp = rqueue.get()
136 | trans[resp[0]] = resp[1]
137 | if numpy.mod(idx, 10) == 0:
138 | print 'Sample ', (idx+1), '/', n_samples, ' Done'
139 | return trans
140 |
141 | print 'Translating ', source_file, '...'
142 | n_samples = _send_jobs(source_file)
143 | trans = _seqs2words(_retrieve_jobs(n_samples))
144 | _finish_processes()
145 | with open(saveto, 'w') as f:
146 | if decoder_bpe_to_tok:
147 | print >>f, '\n'.join(trans).replace('@@ ', '')
148 | else:
149 | print >>f, '\n'.join(trans)
150 | print 'Done'
151 |
152 |
153 | if __name__ == "__main__":
154 | parser = argparse.ArgumentParser()
155 | parser.add_argument('-k', type=int, default=5)
156 | parser.add_argument('-p', type=int, default=5)
157 | parser.add_argument('-n', action="store_true", default=False)
158 | parser.add_argument('-bpe', action="store_true", default=False)
159 | parser.add_argument('-enc_c', action="store_true", default=False)
160 | parser.add_argument('-dec_c', action="store_true", default=False)
161 | parser.add_argument('-utf8', action="store_true", default=False)
162 | parser.add_argument('model', type=str)
163 | parser.add_argument('dictionary', type=str)
164 | parser.add_argument('dictionary_target', type=str)
165 | parser.add_argument('source', type=str)
166 | parser.add_argument('saveto', type=str)
167 |
168 | args = parser.parse_args()
169 |
170 | main(args.model, args.dictionary, args.dictionary_target, args.source,
171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p,
172 | encoder_chr_level=args.enc_c,
173 | decoder_chr_level=args.dec_c,
174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe)
175 |
--------------------------------------------------------------------------------
/subword_base/translate_both.py:
--------------------------------------------------------------------------------
1 | '''
2 | Translates a source file using a translation model.
3 | '''
4 | import argparse
5 |
6 | import numpy
7 | import cPickle as pkl
8 |
9 | from subword_base_both import (build_sampler, gen_sample, init_params)
10 | from mixer import *
11 |
12 | from multiprocessing import Process, Queue
13 |
14 |
15 | def translate_model(queue, rqueue, pid, model, options, k, normalize):
16 |
17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
18 | trng = RandomStreams(1234)
19 |
20 | # allocate model parameters
21 | params = init_params(options)
22 |
23 | # load model parameters and set theano shared variables
24 | params = load_params(model, params)
25 | tparams = init_tparams(params)
26 |
27 | # word index
28 | use_noise = theano.shared(numpy.float32(0.))
29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise)
30 |
31 | def _translate(seq):
32 | use_noise.set_value(0.)
33 | # sample given an input sequence and obtain scores
34 | sample, score = gen_sample(tparams, f_init, f_next,
35 | numpy.array(seq).reshape([len(seq), 1]),
36 | options, trng=trng, k=k, maxlen=500,
37 | stochastic=False, argmax=False)
38 |
39 | # normalize scores according to sequence lengths
40 | if normalize:
41 | lengths = numpy.array([len(s) for s in sample])
42 | score = score / lengths
43 | sidx = numpy.argmin(score)
44 | return sample[sidx]
45 |
46 | while True:
47 | req = queue.get()
48 | if req is None:
49 | break
50 |
51 | idx, x = req[0], req[1]
52 | print pid, '-', idx
53 | seq = _translate(x)
54 |
55 | rqueue.put((idx, seq))
56 |
57 | return
58 |
59 |
60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5,
61 | normalize=False, n_process=5, encoder_chr_level=False,
62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False):
63 |
64 | # load model model_options
65 | pkl_file = model.split('.')[0] + '.pkl'
66 | with open(pkl_file, 'rb') as f:
67 | options = pkl.load(f)
68 |
69 | # load source dictionary and invert
70 | with open(dictionary, 'rb') as f:
71 | word_dict = pkl.load(f)
72 | word_idict = dict()
73 | for kk, vv in word_dict.iteritems():
74 | word_idict[vv] = kk
75 | word_idict[0] = ''
76 | word_idict[1] = 'UNK'
77 |
78 | # load target dictionary and invert
79 | with open(dictionary_target, 'rb') as f:
80 | word_dict_trg = pkl.load(f)
81 | word_idict_trg = dict()
82 | for kk, vv in word_dict_trg.iteritems():
83 | word_idict_trg[vv] = kk
84 | word_idict_trg[0] = ''
85 | word_idict_trg[1] = 'UNK'
86 |
87 | # create input and output queues for processes
88 | queue = Queue()
89 | rqueue = Queue()
90 | processes = [None] * n_process
91 | for midx in xrange(n_process):
92 | processes[midx] = Process(
93 | target=translate_model,
94 | args=(queue, rqueue, midx, model, options, k, normalize))
95 | processes[midx].start()
96 |
97 | # utility function
98 | def _seqs2words(caps):
99 | capsw = []
100 | for cc in caps:
101 | ww = []
102 | for w in cc:
103 | if w == 0:
104 | break
105 | if utf8:
106 | ww.append(word_idict_trg[w].encode('utf-8'))
107 | else:
108 | ww.append(word_idict_trg[w])
109 | if decoder_chr_level:
110 | capsw.append(''.join(ww))
111 | else:
112 | capsw.append(' '.join(ww))
113 | return capsw
114 |
115 | def _send_jobs(fname):
116 | with open(fname, 'r') as f:
117 | for idx, line in enumerate(f):
118 | if encoder_chr_level:
119 | words = list(line.decode('utf-8').strip())
120 | else:
121 | words = line.strip().split()
122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words)
123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x)
124 | x += [0]
125 | queue.put((idx, x))
126 | return idx+1
127 |
128 | def _finish_processes():
129 | for midx in xrange(n_process):
130 | queue.put(None)
131 |
132 | def _retrieve_jobs(n_samples):
133 | trans = [None] * n_samples
134 | for idx in xrange(n_samples):
135 | resp = rqueue.get()
136 | trans[resp[0]] = resp[1]
137 | if numpy.mod(idx, 10) == 0:
138 | print 'Sample ', (idx+1), '/', n_samples, ' Done'
139 | return trans
140 |
141 | print 'Translating ', source_file, '...'
142 | n_samples = _send_jobs(source_file)
143 | trans = _seqs2words(_retrieve_jobs(n_samples))
144 | _finish_processes()
145 | with open(saveto, 'w') as f:
146 | if decoder_bpe_to_tok:
147 | print >>f, '\n'.join(trans).replace('@@ ', '')
148 | else:
149 | print >>f, '\n'.join(trans)
150 | print 'Done'
151 |
152 |
153 | if __name__ == "__main__":
154 | parser = argparse.ArgumentParser()
155 | parser.add_argument('-k', type=int, default=5)
156 | parser.add_argument('-p', type=int, default=5)
157 | parser.add_argument('-n', action="store_true", default=False)
158 | parser.add_argument('-bpe', action="store_true", default=False)
159 | parser.add_argument('-enc_c', action="store_true", default=False)
160 | parser.add_argument('-dec_c', action="store_true", default=False)
161 | parser.add_argument('-utf8', action="store_true", default=False)
162 | parser.add_argument('model', type=str)
163 | parser.add_argument('dictionary', type=str)
164 | parser.add_argument('dictionary_target', type=str)
165 | parser.add_argument('source', type=str)
166 | parser.add_argument('saveto', type=str)
167 |
168 | args = parser.parse_args()
169 |
170 | main(args.model, args.dictionary, args.dictionary_target, args.source,
171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p,
172 | encoder_chr_level=args.enc_c,
173 | decoder_chr_level=args.dec_c,
174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe)
175 |
--------------------------------------------------------------------------------
/subword_base/translate_both_bpe2bpe_ensemble_deen.py:
--------------------------------------------------------------------------------
1 | '''
2 | Translates a source file using a translation model.
3 | '''
4 | import argparse
5 |
6 | import numpy
7 | import cPickle as pkl
8 | import ipdb
9 |
10 | from subword_base_both import (build_sampler, init_params)
11 | from mixer import *
12 |
13 | from multiprocessing import Process, Queue
14 |
15 |
16 | def gen_sample(tparams, f_inits, f_nexts, x, options, trng=None,
17 | k=1, maxlen=500, stochastic=True, argmax=False):
18 |
19 | # k is the beam size we have
20 | if k > 1:
21 | assert not stochastic, \
22 | 'Beam search does not support stochastic sampling'
23 |
24 | sample = []
25 | sample_score = []
26 | if stochastic:
27 | sample_score = 0
28 |
29 | live_k = 1
30 | dead_k = 0
31 |
32 | hyp_samples = [[]] * live_k
33 | hyp_scores = numpy.zeros(live_k).astype('float32')
34 | hyp_states = []
35 |
36 | # get initial state of decoder rnn and encoder context
37 | rets = []
38 | next_state_chars = []
39 | next_state_words = []
40 | ctx0s = []
41 |
42 | for i in xrange(len(f_inits)):
43 | ret = f_inits[i](x)
44 | next_state_chars.append(ret[0])
45 | next_state_words.append(ret[1])
46 | ctx0s.append(ret[2])
47 | next_w = -1 * numpy.ones((1,)).astype('int64') # bos indicator
48 |
49 | num_models = len(f_inits)
50 |
51 | for ii in xrange(maxlen):
52 |
53 | temp_next_p = []
54 | temp_next_state_char = []
55 | temp_next_state_word = []
56 |
57 | for i in xrange(num_models):
58 |
59 | ctx = numpy.tile(ctx0s[i], [live_k, 1])
60 | inps = [next_w, ctx, next_state_chars[i], next_state_words[i]]
61 | ret = f_nexts[i](*inps)
62 | next_p, _, next_state_char, next_state_word = ret[0], ret[1], ret[2], ret[3]
63 | temp_next_p.append(next_p)
64 | temp_next_state_char.append(next_state_char)
65 | temp_next_state_word.append(next_state_word)
66 | #next_p = numpy.log(numpy.array(temp_next_p)).sum(axis=0) / num_models
67 | next_p = numpy.log(numpy.array(temp_next_p).mean(axis=0))
68 |
69 | if stochastic:
70 | if argmax:
71 | nw = next_p[0].argmax()
72 | else:
73 | nw = next_w[0]
74 | sample.append(nw)
75 | sample_score += next_p[0, nw]
76 | if nw == 0:
77 | break
78 | else:
79 | cand_scores = hyp_scores[:, None] - next_p
80 | cand_flat = cand_scores.flatten()
81 | ranks_flat = cand_flat.argsort()[:(k - dead_k)]
82 |
83 | voc_size = next_p.shape[1]
84 | trans_indices = ranks_flat / voc_size
85 | word_indices = ranks_flat % voc_size
86 | costs = cand_flat[ranks_flat]
87 |
88 | new_hyp_samples = []
89 | new_hyp_scores = numpy.zeros(k - dead_k).astype('float32')
90 | new_hyp_states_chars = []
91 | new_hyp_states_words = []
92 |
93 | for idx, [ti, wi] in enumerate(zip(trans_indices, word_indices)):
94 | new_hyp_samples.append(hyp_samples[ti] + [wi])
95 | new_hyp_scores[idx] = copy.copy(costs[idx])
96 |
97 | for i in xrange(num_models):
98 | new_hyp_states_char = []
99 | new_hyp_states_word = []
100 |
101 | for ti in trans_indices:
102 | new_hyp_states_char.append(copy.copy(temp_next_state_char[i][ti]))
103 | new_hyp_states_word.append(copy.copy(temp_next_state_word[i][ti]))
104 |
105 | new_hyp_states_chars.append(new_hyp_states_char)
106 | new_hyp_states_words.append(new_hyp_states_word)
107 |
108 | # check the finished samples
109 | new_live_k = 0
110 | hyp_samples = []
111 | hyp_scores = []
112 |
113 | for idx in xrange(len(new_hyp_samples)):
114 | if new_hyp_samples[idx][-1] == 0:
115 | sample.append(new_hyp_samples[idx])
116 | sample_score.append(new_hyp_scores[idx])
117 | dead_k += 1
118 | else:
119 | new_live_k += 1
120 | hyp_samples.append(new_hyp_samples[idx])
121 | hyp_scores.append(new_hyp_scores[idx])
122 |
123 | for i in xrange(num_models):
124 | hyp_states_char = []
125 | hyp_states_word = []
126 |
127 | for idx in xrange(len(new_hyp_samples)):
128 | if new_hyp_samples[idx][-1] != 0:
129 | hyp_states_char.append(new_hyp_states_chars[i][idx])
130 | hyp_states_word.append(new_hyp_states_words[i][idx])
131 |
132 | next_state_chars[i] = numpy.array(hyp_states_char)
133 | next_state_words[i] = numpy.array(hyp_states_word)
134 |
135 | hyp_scores = numpy.array(hyp_scores)
136 | live_k = new_live_k
137 |
138 | if new_live_k < 1:
139 | break
140 | if dead_k >= k:
141 | break
142 |
143 | next_w = numpy.array([w[-1] for w in hyp_samples])
144 |
145 | if not stochastic:
146 | # dump every remaining one
147 | if live_k > 0:
148 | for idx in xrange(live_k):
149 | sample.append(hyp_samples[idx])
150 | sample_score.append(hyp_scores[idx])
151 |
152 | return sample, sample_score
153 |
154 |
155 | def translate_model(queue, rqueue, pid, models, options, k, normalize):
156 |
157 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
158 | trng = RandomStreams(1234)
159 |
160 | # allocate model parameters
161 | params = []
162 | for i in xrange(len(models)):
163 | params.append(init_params(options))
164 |
165 | # load model parameters and set theano shared variables
166 | tparams = []
167 | for i in xrange(len(params)):
168 | params[i] = load_params(models[i], params[i])
169 | tparams.append(init_tparams(params[i]))
170 |
171 | # word index
172 | use_noise = theano.shared(numpy.float32(0.))
173 | f_inits = []
174 | f_nexts = []
175 | for i in xrange(len(tparams)):
176 | f_init, f_next = build_sampler(tparams[i], options, trng, use_noise)
177 | f_inits.append(f_init)
178 | f_nexts.append(f_next)
179 |
180 | def _translate(seq):
181 | use_noise.set_value(0.)
182 | # sample given an input sequence and obtain scores
183 | sample, score = gen_sample(tparams, f_inits, f_nexts,
184 | numpy.array(seq).reshape([len(seq), 1]),
185 | options, trng=trng, k=k, maxlen=500,
186 | stochastic=False, argmax=False)
187 |
188 | # normalize scores according to sequence lengths
189 | if normalize:
190 | lengths = numpy.array([len(s) for s in sample])
191 | score = score / lengths
192 | sidx = numpy.argmin(score)
193 | return sample[sidx]
194 |
195 | while True:
196 | req = queue.get()
197 | if req is None:
198 | break
199 |
200 | idx, x = req[0], req[1]
201 | print pid, '-', idx
202 | seq = _translate(x)
203 |
204 | rqueue.put((idx, seq))
205 |
206 | return
207 |
208 |
209 | def main(models, dictionary, dictionary_target, source_file, saveto, k=5,
210 | normalize=False, n_process=5, encoder_chr_level=False,
211 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False):
212 |
213 | # load model model_options
214 | pkl_file = models[0].split('.')[0] + '.pkl'
215 | with open(pkl_file, 'rb') as f:
216 | options = pkl.load(f)
217 |
218 | # load source dictionary and invert
219 | with open(dictionary, 'rb') as f:
220 | word_dict = pkl.load(f)
221 | word_idict = dict()
222 | for kk, vv in word_dict.iteritems():
223 | word_idict[vv] = kk
224 | word_idict[0] = ''
225 | word_idict[1] = 'UNK'
226 |
227 | # load target dictionary and invert
228 | with open(dictionary_target, 'rb') as f:
229 | word_dict_trg = pkl.load(f)
230 | word_idict_trg = dict()
231 | for kk, vv in word_dict_trg.iteritems():
232 | word_idict_trg[vv] = kk
233 | word_idict_trg[0] = ''
234 | word_idict_trg[1] = 'UNK'
235 |
236 | # create input and output queues for processes
237 | queue = Queue()
238 | rqueue = Queue()
239 | processes = [None] * n_process
240 | for midx in xrange(n_process):
241 | processes[midx] = Process(
242 | target=translate_model,
243 | args=(queue, rqueue, midx, models, options, k, normalize))
244 | processes[midx].start()
245 |
246 | # utility function
247 | def _seqs2words(caps):
248 | capsw = []
249 | for cc in caps:
250 | ww = []
251 | for w in cc:
252 | if w == 0:
253 | break
254 | if utf8:
255 | ww.append(word_idict_trg[w].encode('utf-8'))
256 | else:
257 | ww.append(word_idict_trg[w])
258 | if decoder_chr_level:
259 | capsw.append(''.join(ww))
260 | else:
261 | capsw.append(' '.join(ww))
262 | return capsw
263 |
264 | def _send_jobs(fname):
265 | with open(fname, 'r') as f:
266 | for idx, line in enumerate(f):
267 | if encoder_chr_level:
268 | words = list(line.decode('utf-8').strip())
269 | else:
270 | words = line.strip().split()
271 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words)
272 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x)
273 | x += [0]
274 | queue.put((idx, x))
275 | return idx+1
276 |
277 | def _finish_processes():
278 | for midx in xrange(n_process):
279 | queue.put(None)
280 |
281 | def _retrieve_jobs(n_samples):
282 | trans = [None] * n_samples
283 | for idx in xrange(n_samples):
284 | resp = rqueue.get()
285 | trans[resp[0]] = resp[1]
286 | if numpy.mod(idx, 10) == 0:
287 | print 'Sample ', (idx+1), '/', n_samples, ' Done'
288 | return trans
289 |
290 | print 'Translating ', source_file, '...'
291 | n_samples = _send_jobs(source_file)
292 | trans = _seqs2words(_retrieve_jobs(n_samples))
293 | _finish_processes()
294 | with open(saveto, 'w') as f:
295 | if decoder_bpe_to_tok:
296 | print >>f, '\n'.join(trans).replace('@@ ', '')
297 | else:
298 | print >>f, '\n'.join(trans)
299 | print 'Done'
300 |
301 |
302 | if __name__ == "__main__":
303 | parser = argparse.ArgumentParser()
304 | parser.add_argument('-k', type=int, default=5)
305 | parser.add_argument('-p', type=int, default=5)
306 | parser.add_argument('-n', action="store_true", default=False)
307 | parser.add_argument('-bpe', action="store_true", default=False)
308 | parser.add_argument('-enc_c', action="store_true", default=False)
309 | parser.add_argument('-dec_c', action="store_true", default=False)
310 | parser.add_argument('-utf8', action="store_true", default=False)
311 | parser.add_argument('saveto', type=str)
312 |
313 | model_path = '/misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/deen/bpe2bpe_two_layer_gru_decoder/0209/'
314 | model1 = model_path + 'bpe2bpe_two_layer_gru_decoder_both_adam_en1.290000.npz'
315 | model2 = model_path + 'bpe2bpe_two_layer_gru_decoder_both_adam_en2.260000.npz'
316 | model3 = model_path + 'bpe2bpe_two_layer_gru_decoder_both_adam_en3.290000.npz'
317 | model4 = model_path + 'bpe2bpe_two_layer_gru_decoder_both_adam.335000.npz'
318 | models = [model1, model2, model3, model4]
319 | dictionary = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/all_de-en.en.tok.bpe.word.pkl'
320 | dictionary_target = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/all_de-en.de.tok.bpe.word.pkl'
321 | source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/dev/newstest2013.en.tok.bpe'
322 | #source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/test/newstest2014-deen-src.en.tok.bpe'
323 | #source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/test/newstest2015-deen-src.en.tok.bpe'
324 |
325 | args = parser.parse_args()
326 |
327 | main(models, dictionary, dictionary_target, source,
328 | args.saveto, k=args.k, normalize=args.n, n_process=args.p,
329 | encoder_chr_level=args.enc_c,
330 | decoder_chr_level=args.dec_c,
331 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe)
332 |
--------------------------------------------------------------------------------
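Unlike translate.py, the ensemble script hard-codes its model, dictionary, and source-file paths near the bottom of the file, so only the output path (plus the usual flags) is given on the command line. An illustrative call; the beam width, process count, and output name are placeholders:

python translate_both_bpe2bpe_ensemble_deen.py -k 12 -p 8 -n -bpe newstest2013.trans.de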
/subword_base/wmt15_csen_bpe2bpe_adam.txt:
--------------------------------------------------------------------------------
1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/csen/bpe2bpe_two_layer_gru_decoder/0209/
2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/train/
3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/dev/
4 | max_epochs 1000000000000
5 | patience -1
6 | learning_rate 0.0001
7 | batch_size 128
8 | valid_batch_size 128
9 | enc_dim 512
10 | dec_dim 1024
11 | dim_word 512
12 | dim_word_src 512
13 | n_words 21816
14 | n_words_src 21907
15 | optimizer adam
16 | decay_c 0
17 | use_dropout 0
18 | clip_c 1
19 | saveFreq 5000
20 | sampleFreq 5000
21 | dispFreq 1000
22 | validFreq 5000
23 | sort_size 20
24 | maxlen 50
25 | maxlen_trg 100
26 | maxlen_sample 100
27 | source_word_level 1
28 | target_word_level 1
29 | source_dataset all_cs-en.en.tok.bpe
30 | target_dataset all_cs-en.cs.tok.bpe
31 | valid_source_dataset newstest2013-src.en.tok.bpe
32 | valid_target_dataset newstest2013-ref.cs.tok.bpe
33 | source_dictionary all_cs-en.en.tok.bpe.word.pkl
34 | target_dictionary all_cs-en.cs.tok.bpe.word.pkl
35 |
--------------------------------------------------------------------------------
/subword_base/wmt15_deen_bpe2bpe_adam.txt:
--------------------------------------------------------------------------------
1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/deen/bpe2bpe_two_layer_gru_decoder/0209/
2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/
3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/dev/
4 | max_epochs 1000000000000
5 | patience -1
6 | learning_rate 0.0001
7 | batch_size 128
8 | valid_batch_size 128
9 | enc_dim 512
10 | dec_dim 1024
11 | dim_word 512
12 | dim_word_src 512
13 | n_words 24254
14 | n_words_src 24440
15 | optimizer adam
16 | decay_c 0
17 | use_dropout 0
18 | clip_c 1
19 | saveFreq 5000
20 | sampleFreq 5000
21 | dispFreq 1000
22 | validFreq 5000
23 | sort_size 20
24 | maxlen 50
25 | maxlen_trg 100
26 | maxlen_sample 100
27 | source_word_level 1
28 | target_word_level 1
29 | source_dataset all_de-en.en.tok.bpe.shuf
30 | target_dataset all_de-en.de.tok.bpe.shuf
31 | valid_source_dataset newstest2013.en.tok.bpe
32 | valid_target_dataset newstest2013.de.tok.bpe
33 | source_dictionary all_de-en.en.tok.bpe.word.pkl
34 | target_dictionary all_de-en.de.tok.bpe.word.pkl
35 |
--------------------------------------------------------------------------------
/subword_base/wmt15_fien_bpe2bpe_adam.txt:
--------------------------------------------------------------------------------
1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/fien/bpe2bpe_two_layer_gru_decoder/0209/
2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/train/
3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/dev/
4 | max_epochs 1000000000000
5 | patience -1
6 | learning_rate 0.0001
7 | batch_size 128
8 | valid_batch_size 128
9 | enc_dim 512
10 | dec_dim 1024
11 | dim_word 512
12 | dim_word_src 512
13 | n_words 20783
14 | n_words_src 20174
15 | optimizer adam
16 | decay_c 0
17 | use_dropout 0
18 | clip_c 1
19 | saveFreq 5000
20 | sampleFreq 5000
21 | dispFreq 1000
22 | validFreq 5000
23 | sort_size 20
24 | maxlen 50
25 | maxlen_trg 100
26 | maxlen_sample 100
27 | source_word_level 1
28 | target_word_level 1
29 | source_dataset all_fi-en.en.tok.bpe.shuf
30 | target_dataset all_fi-en.fi.tok.bpe.shuf
31 | valid_source_dataset newsdev2015-enfi-src.en.tok.bpe
32 | valid_target_dataset newsdev2015-enfi-ref.fi.tok.bpe
33 | source_dictionary all_fi-en.en.tok.bpe.word.pkl
34 | target_dictionary all_fi-en.fi.tok.bpe.word.pkl
35 |
--------------------------------------------------------------------------------
/subword_base/wmt15_ruen_bpe2bpe_adam.txt:
--------------------------------------------------------------------------------
1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/ruen/bpe2bpe_two_layer_gru_decoder/0209/
2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/train/
3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/dev/
4 | max_epochs 1000000000000
5 | patience -1
6 | learning_rate 0.0001
7 | batch_size 128
8 | valid_batch_size 128
9 | enc_dim 512
10 | dec_dim 1024
11 | dim_word 512
12 | dim_word_src 512
13 | n_words 22106
14 | n_words_src 22030
15 | optimizer adam
16 | decay_c 0
17 | use_dropout 0
18 | clip_c 1
19 | saveFreq 5000
20 | sampleFreq 5000
21 | dispFreq 1000
22 | validFreq 5000
23 | sort_size 20
24 | maxlen 50
25 | maxlen_trg 100
26 | maxlen_sample 100
27 | source_word_level 1
28 | target_word_level 1
29 | source_dataset all_ru-en.en.tok.bpe
30 | target_dataset all_ru-en.ru.tok.bpe
31 | valid_source_dataset newstest2013-src.en.tok.bpe
32 | valid_target_dataset newstest2013-ref.ru.tok.bpe
33 | source_dictionary all_ru-en.en.tok.bpe.word.pkl
34 | target_dictionary all_ru-en.ru.tok.bpe.word.pkl
35 |
--------------------------------------------------------------------------------
/translate_readme.txt:
--------------------------------------------------------------------------------
1 | Command for using translate.py BPE-case:
2 | python translate.py -k {beam_width} -p {number_of_processes} -n -bpe {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name}
3 |
4 | Command for using translate.py Char-case:
5 | python translate.py -k {beam_width} -p {number_of_processes} -n -dec_c -utf8 {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name}
6 |
7 | Command for using translate_both.py BPE-case:
8 | python translate_both.py -k {beam_width} -p {number_of_processes} -n -bpe {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name}
9 |
10 | Command for using translate_both.py Char-case:
11 | python translate_both.py -k {beam_width} -p {number_of_processes} -n -dec_c -utf8 {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name}
12 |
13 | Command for using translate_attc.py Char-case:
14 | python translate_attc.py -k {beam_width} -p {number_of_processes} -n -dec_c -utf8 {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name}
15 |
16 | Command for using multi-bleu.perl:
17 | perl multi-bleu.perl {reference.txt} < {translated.txt}
18 |
--------------------------------------------------------------------------------
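A concrete BPE-case example of the translate.py command above, with illustrative model and data paths (the dictionary and test-set names follow the wmt15_deen_bpe2bpe_adam.txt configuration); translate.py also loads the model options from a .pkl file whose name it derives from the model path, so that file must sit next to the .npz:

python translate.py -k 12 -p 8 -n -bpe models/bpe2bpe_two_layer_gru_decoder_adam.npz all_de-en.en.tok.bpe.word.pkl all_de-en.de.tok.bpe.word.pkl newstest2013.en.tok.bpe newstest2013.trans.de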