├── LICENSE ├── README.md ├── __init__.py ├── character_base ├── __init__.py ├── char_base.py ├── char_base_both.py ├── train_wmt15_csen_bpe2char_adam.py ├── train_wmt15_deen_bpe2char_adam.py ├── train_wmt15_deen_bpe2char_both_adam.py ├── train_wmt15_fien_bpe2char_adam.py ├── train_wmt15_ruen_bpe2char_adam.py ├── translate.py ├── translate_both.py ├── translate_bpe2char_ensemble_csen.py ├── translate_bpe2char_ensemble_deen.py ├── translate_bpe2char_ensemble_fien.py ├── translate_bpe2char_ensemble_ruen.py ├── wmt15_csen_bpe2char_adam.txt ├── wmt15_deen_bpe2char_adam.txt ├── wmt15_fien_bpe2char_adam.txt └── wmt15_ruen_bpe2char_adam.txt ├── character_biscale ├── __init__.py ├── char_biscale.py ├── char_biscale_attc.py ├── char_biscale_both.py ├── train_wmt15_csen_adam.py ├── train_wmt15_deen_adam.py ├── train_wmt15_deen_attc_adam.py ├── train_wmt15_deen_both_adam.py ├── train_wmt15_fien_adam.py ├── train_wmt15_ruen_adam.py ├── translate.py ├── translate_attc.py ├── translate_both.py ├── translate_ensemble_csen.py ├── translate_ensemble_deen.py ├── translate_ensemble_fien.py ├── translate_ensemble_ruen.py ├── wmt15_csen_bpe2char_adam.txt ├── wmt15_deen_bpe2char_adam.txt ├── wmt15_fien_bpe2char_adam.txt └── wmt15_ruen_bpe2char_adam.txt ├── data_iterator.py ├── mixer.py ├── nmt.py ├── preprocess ├── build_dictionary_char.py ├── build_dictionary_word.py ├── clean_tags.py ├── fix_appo.sh ├── merge.sh ├── multi-bleu.perl ├── nonbreaking_prefixes │ ├── README.txt │ ├── nonbreaking_prefix.ca │ ├── nonbreaking_prefix.cs │ ├── nonbreaking_prefix.de │ ├── nonbreaking_prefix.el │ ├── nonbreaking_prefix.en │ ├── nonbreaking_prefix.es │ ├── nonbreaking_prefix.fi │ ├── nonbreaking_prefix.fr │ ├── nonbreaking_prefix.hu │ ├── nonbreaking_prefix.is │ ├── nonbreaking_prefix.it │ ├── nonbreaking_prefix.lv │ ├── nonbreaking_prefix.nl │ ├── nonbreaking_prefix.pl │ ├── nonbreaking_prefix.pt │ ├── nonbreaking_prefix.ro │ ├── nonbreaking_prefix.ru │ ├── nonbreaking_prefix.sk │ ├── nonbreaking_prefix.sl │ ├── nonbreaking_prefix.sv │ └── nonbreaking_prefix.ta ├── normalize-punctuation.perl ├── preprocess.sh ├── shuffle.py ├── tokenizer.perl └── tokenizer_apos.perl ├── presentation └── appendix.pdf ├── subword_base ├── subword_base.py ├── subword_base_both.py ├── train_wmt15_csen_bpe2bpe_both_adam.py ├── train_wmt15_deen_bpe2bpe_adam.py ├── train_wmt15_deen_bpe2bpe_both_adam.py ├── train_wmt15_fien_bpe2bpe_both_adam.py ├── train_wmt15_ruen_bpe2bpe_both_adam.py ├── translate.py ├── translate_both.py ├── translate_both_bpe2bpe_ensemble_csen.py ├── translate_both_bpe2bpe_ensemble_deen.py ├── translate_both_bpe2bpe_ensemble_fien.py ├── translate_both_bpe2bpe_ensemble_ruen.py ├── wmt15_csen_bpe2bpe_adam.txt ├── wmt15_deen_bpe2bpe_adam.txt ├── wmt15_fien_bpe2bpe_adam.txt └── wmt15_ruen_bpe2bpe_adam.txt └── translate_readme.txt /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2016, Junyoung Chung, Kyunghyun Cho 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | * Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 10 | * Redistributions in binary form must reproduce the above copyright notice, 11 | this list of conditions and the following disclaimer in the documentation 12 | and/or other materials provided with the distribution. 
13 | 14 | * Neither the name of dl4mt-cdec nor the names of its 15 | contributors may be used to endorse or promote products derived from 16 | this software without specific prior written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 22 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 23 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 24 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 25 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 26 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 27 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 28 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Character-Level Neural Machine Translation This is an implementation of the models described in the paper "A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation". http://arxiv.org/abs/1603.06147 Dependencies: ------------- The majority of the script files are written in pure Theano.
The preprocessing pipeline has the following dependencies:
Python Libraries: NLTK
MOSES: https://github.com/moses-smt/mosesdecoder
Subword-NMT (http://arxiv.org/abs/1508.07909): https://github.com/rsennrich/subword-nmt
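For convenience, one way to fetch these preprocessing dependencies is sketched below; the clone destinations are placeholders, and only the Moses scripts (tokenizer, nonbreaking prefixes, multi-bleu) are needed here rather than a full Moses build.

```bash
# Sketch: fetch the preprocessing dependencies (destination directories are placeholders).
pip install nltk                                         # Python tokenization helpers
git clone https://github.com/moses-smt/mosesdecoder.git  # tokenizer.perl, nonbreaking prefixes, multi-bleu.perl
git clone https://github.com/rsennrich/subword-nmt.git   # BPE learning/application scripts
```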
This code is based on the dl4mt library.
link: https://github.com/nyu-dl/dl4mt-tutorial Be sure to include the path to this library in your PYTHONPATH. We recommend using the latest version of Theano.
However, if you want an exact reproduction, please use the following version of Theano.
commit hash: fdfbab37146ee475b3fd17d8d104fb09bf3a8d5c Preparing Text Corpora: ----------------------- The original text corpora can be downloaded from http://www.statmt.org/wmt15/translation-task.html
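Before preparing the data, the dependencies above can be set up roughly as follows. This is only a sketch: the clone locations are placeholders, and checking out the specific Theano commit is needed only for exact reproduction.

```bash
# Sketch: make the dl4mt-tutorial library importable and (optionally) pin Theano.
git clone https://github.com/nyu-dl/dl4mt-tutorial.git
export PYTHONPATH=$PYTHONPATH:$PWD/dl4mt-tutorial

# Optional: install Theano from source at the commit mentioned above.
git clone https://github.com/Theano/Theano.git
cd Theano
git checkout fdfbab37146ee475b3fd17d8d104fb09bf3a8d5c
pip install -e .
cd ..
```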
Once the download is finished, use 'preprocess.sh' in the 'preprocess' directory to preprocess the text files. For the character-level decoders, preprocessing is not strictly necessary; however, in order to compare the results with subword-level decoders and other word-level approaches, we apply the same preprocessing to all of the target corpora. Finally, use 'build_dictionary_char.py' for the character case and 'build_dictionary_word.py' for the subword case to build the vocabularies.
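As a rough usage sketch: the training scripts read the plain-text config files shipped next to them, and the translation scripts take the model, dictionaries, source file and output file on the command line. The data paths inside the config files and all /path/to/... entries below are placeholders that must be edited, and it is assumed that the repository root and the relevant subdirectory are both on PYTHONPATH so that nmt.py, mixer.py and char_base.py can be imported.

```bash
# Sketch: train a bpe2char model and translate with it (all /path/to/... entries are placeholders).
export PYTHONPATH=$PYTHONPATH:/path/to/dl4mt-cdec:/path/to/dl4mt-cdec/character_base
cd /path/to/dl4mt-cdec/character_base

# Training: the last command-line argument is the config file
# (this script defaults to 'wmt15_deen_bpe2char_adam.txt' if none is given).
python train_wmt15_deen_bpe2char_adam.py wmt15_deen_bpe2char_adam.txt

# Translation: beam search with length normalization (-n), beam size 20 (-k 20),
# character-level decoder output (-dec_c) and UTF-8 output (-utf8).
# A matching '.pkl' options file is expected next to the '.npz' model file.
python translate.py -n -k 20 -dec_c -utf8 \
    /path/to/model.npz \
    /path/to/source_dictionary.pkl /path/to/target_dictionary.pkl \
    /path/to/newstest2013.en.tok.bpe /path/to/output.txt
```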
Updating... -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nyu-dl/dl4mt-cdec/e738dc7235cb2819ad2b4e8e5837e97b2fb41de2/__init__.py -------------------------------------------------------------------------------- /character_base/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nyu-dl/dl4mt-cdec/e738dc7235cb2819ad2b4e8e5837e97b2fb41de2/character_base/__init__.py -------------------------------------------------------------------------------- /character_base/train_wmt15_csen_bpe2char_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_base import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder', 11 | 'two_layer_gru_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_two_layer_gru_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 
74 | config_file_name = 'wmt15_csen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_base/train_wmt15_deen_bpe2char_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_base import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder', 11 | 'two_layer_gru_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_two_layer_gru_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | 
param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_base/train_wmt15_deen_bpe2char_both_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from char_base_both import train 5 | from nmt_both import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both', 11 | 'two_layer_gru_decoder_both'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_two_layer_gru_decoder_both_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | 
-------------------------------------------------------------------------------- /character_base/train_wmt15_fien_bpe2char_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_base import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder', 11 | 'two_layer_gru_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_two_layer_gru_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_fien_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_base/train_wmt15_ruen_bpe2char_adam.py: -------------------------------------------------------------------------------- 1 | import 
os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_base import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder', 11 | 'two_layer_gru_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = True 17 | save_file_name = 'bpe2char_two_layer_gru_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_ruen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_base/translate.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 
3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from char_base import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | queue.put((idx, x)) 126 | 
return idx+1 127 | 128 | def _finish_processes(): 129 | for midx in xrange(n_process): 130 | queue.put(None) 131 | 132 | def _retrieve_jobs(n_samples): 133 | trans = [None] * n_samples 134 | for idx in xrange(n_samples): 135 | resp = rqueue.get() 136 | trans[resp[0]] = resp[1] 137 | if numpy.mod(idx, 10) == 0: 138 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 139 | return trans 140 | 141 | print 'Translating ', source_file, '...' 142 | n_samples = _send_jobs(source_file) 143 | trans = _seqs2words(_retrieve_jobs(n_samples)) 144 | _finish_processes() 145 | with open(saveto, 'w') as f: 146 | if decoder_bpe_to_tok: 147 | print >>f, '\n'.join(trans).replace('@@ ', '') 148 | else: 149 | print >>f, '\n'.join(trans) 150 | print 'Done' 151 | 152 | 153 | if __name__ == "__main__": 154 | parser = argparse.ArgumentParser() 155 | parser.add_argument('-k', type=int, default=5) 156 | parser.add_argument('-p', type=int, default=5) 157 | parser.add_argument('-n', action="store_true", default=False) 158 | parser.add_argument('-bpe', action="store_true", default=False) 159 | parser.add_argument('-enc_c', action="store_true", default=False) 160 | parser.add_argument('-dec_c', action="store_true", default=False) 161 | parser.add_argument('-utf8', action="store_true", default=False) 162 | parser.add_argument('model', type=str) 163 | parser.add_argument('dictionary', type=str) 164 | parser.add_argument('dictionary_target', type=str) 165 | parser.add_argument('source', type=str) 166 | parser.add_argument('saveto', type=str) 167 | 168 | args = parser.parse_args() 169 | 170 | main(args.model, args.dictionary, args.dictionary_target, args.source, 171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 172 | encoder_chr_level=args.enc_c, 173 | decoder_chr_level=args.dec_c, 174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 175 | -------------------------------------------------------------------------------- /character_base/translate_both.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from char_base_both import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 
33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | queue.put((idx, x)) 126 | return idx+1 127 | 128 | def _finish_processes(): 129 | for midx in xrange(n_process): 130 | queue.put(None) 131 | 132 | def _retrieve_jobs(n_samples): 133 | trans = [None] * n_samples 134 | for idx in xrange(n_samples): 135 | resp = rqueue.get() 136 | trans[resp[0]] = resp[1] 137 | if numpy.mod(idx, 10) == 0: 138 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 139 | return trans 140 | 141 | print 'Translating ', source_file, '...' 
142 | n_samples = _send_jobs(source_file) 143 | trans = _seqs2words(_retrieve_jobs(n_samples)) 144 | _finish_processes() 145 | with open(saveto, 'w') as f: 146 | if decoder_bpe_to_tok: 147 | print >>f, '\n'.join(trans).replace('@@ ', '') 148 | else: 149 | print >>f, '\n'.join(trans) 150 | print 'Done' 151 | 152 | 153 | if __name__ == "__main__": 154 | parser = argparse.ArgumentParser() 155 | parser.add_argument('-k', type=int, default=5) 156 | parser.add_argument('-p', type=int, default=5) 157 | parser.add_argument('-n', action="store_true", default=False) 158 | parser.add_argument('-bpe', action="store_true", default=False) 159 | parser.add_argument('-enc_c', action="store_true", default=False) 160 | parser.add_argument('-dec_c', action="store_true", default=False) 161 | parser.add_argument('-utf8', action="store_true", default=False) 162 | parser.add_argument('model', type=str) 163 | parser.add_argument('dictionary', type=str) 164 | parser.add_argument('dictionary_target', type=str) 165 | parser.add_argument('source', type=str) 166 | parser.add_argument('saveto', type=str) 167 | 168 | args = parser.parse_args() 169 | 170 | main(args.model, args.dictionary, args.dictionary_target, args.source, 171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 172 | encoder_chr_level=args.enc_c, 173 | decoder_chr_level=args.dec_c, 174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 175 | -------------------------------------------------------------------------------- /character_base/translate_bpe2char_ensemble_deen.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from nmt import (build_sampler, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def gen_sample(tparams, f_inits, f_nexts, x, options, trng=None, 16 | k=1, maxlen=500, stochastic=True, argmax=False): 17 | 18 | # k is the beam size we have 19 | if k > 1: 20 | assert not stochastic, \ 21 | 'Beam search does not support stochastic sampling' 22 | 23 | sample = [] 24 | sample_score = [] 25 | if stochastic: 26 | sample_score = 0 27 | 28 | live_k = 1 29 | dead_k = 0 30 | 31 | hyp_samples = [[]] * live_k 32 | hyp_scores = numpy.zeros(live_k).astype('float32') 33 | hyp_states = [] 34 | 35 | # get initial state of decoder rnn and encoder context 36 | rets = [] 37 | next_state_chars = [] 38 | next_state_words = [] 39 | ctx0s = [] 40 | 41 | for i in xrange(len(f_inits)): 42 | ret = f_inits[i](x) 43 | next_state_chars.append(ret[0]) 44 | next_state_words.append(ret[1]) 45 | ctx0s.append(ret[2]) 46 | next_w = -1 * numpy.ones((1,)).astype('int64') # bos indicator 47 | 48 | num_models = len(f_inits) 49 | 50 | for ii in xrange(maxlen): 51 | 52 | temp_next_p = [] 53 | temp_next_state_char = [] 54 | temp_next_state_word = [] 55 | 56 | for i in xrange(num_models): 57 | 58 | ctx = numpy.tile(ctx0s[i], [live_k, 1]) 59 | inps = [next_w, ctx, next_state_chars[i], next_state_words[i]] 60 | ret = f_nexts[i](*inps) 61 | next_p, _, next_state_char, next_state_word = ret[0], ret[1], ret[2], ret[3] 62 | temp_next_p.append(next_p) 63 | temp_next_state_char.append(next_state_char) 64 | temp_next_state_word.append(next_state_word) 65 | #next_p = numpy.log(numpy.array(temp_next_p)).sum(axis=0) / num_models 66 | next_p = numpy.log(numpy.array(temp_next_p).mean(axis=0)) 67 | 68 | if stochastic: 69 | if argmax: 70 | nw = 
next_p[0].argmax() 71 | else: 72 | nw = next_w[0] 73 | sample.append(nw) 74 | sample_score += next_p[0, nw] 75 | if nw == 0: 76 | break 77 | else: 78 | cand_scores = hyp_scores[:, None] - next_p 79 | cand_flat = cand_scores.flatten() 80 | ranks_flat = cand_flat.argsort()[:(k - dead_k)] 81 | 82 | voc_size = next_p.shape[1] 83 | trans_indices = ranks_flat / voc_size 84 | word_indices = ranks_flat % voc_size 85 | costs = cand_flat[ranks_flat] 86 | 87 | new_hyp_samples = [] 88 | new_hyp_scores = numpy.zeros(k - dead_k).astype('float32') 89 | new_hyp_states_chars = [] 90 | new_hyp_states_words = [] 91 | 92 | for idx, [ti, wi] in enumerate(zip(trans_indices, word_indices)): 93 | new_hyp_samples.append(hyp_samples[ti] + [wi]) 94 | new_hyp_scores[idx] = copy.copy(costs[idx]) 95 | 96 | for i in xrange(num_models): 97 | new_hyp_states_char = [] 98 | new_hyp_states_word = [] 99 | 100 | for ti in trans_indices: 101 | new_hyp_states_char.append(copy.copy(temp_next_state_char[i][ti])) 102 | new_hyp_states_word.append(copy.copy(temp_next_state_word[i][ti])) 103 | 104 | new_hyp_states_chars.append(new_hyp_states_char) 105 | new_hyp_states_words.append(new_hyp_states_word) 106 | 107 | # check the finished samples 108 | new_live_k = 0 109 | hyp_samples = [] 110 | hyp_scores = [] 111 | 112 | for idx in xrange(len(new_hyp_samples)): 113 | if new_hyp_samples[idx][-1] == 0: 114 | sample.append(new_hyp_samples[idx]) 115 | sample_score.append(new_hyp_scores[idx]) 116 | dead_k += 1 117 | else: 118 | new_live_k += 1 119 | hyp_samples.append(new_hyp_samples[idx]) 120 | hyp_scores.append(new_hyp_scores[idx]) 121 | 122 | for i in xrange(num_models): 123 | hyp_states_char = [] 124 | hyp_states_word = [] 125 | 126 | for idx in xrange(len(new_hyp_samples)): 127 | if new_hyp_samples[idx][-1] != 0: 128 | hyp_states_char.append(new_hyp_states_chars[i][idx]) 129 | hyp_states_word.append(new_hyp_states_words[i][idx]) 130 | 131 | next_state_chars[i] = numpy.array(hyp_states_char) 132 | next_state_words[i] = numpy.array(hyp_states_word) 133 | 134 | hyp_scores = numpy.array(hyp_scores) 135 | live_k = new_live_k 136 | 137 | if new_live_k < 1: 138 | break 139 | if dead_k >= k: 140 | break 141 | 142 | next_w = numpy.array([w[-1] for w in hyp_samples]) 143 | 144 | if not stochastic: 145 | # dump every remaining one 146 | if live_k > 0: 147 | for idx in xrange(live_k): 148 | sample.append(hyp_samples[idx]) 149 | sample_score.append(hyp_scores[idx]) 150 | 151 | return sample, sample_score 152 | 153 | 154 | def translate_model(queue, rqueue, pid, models, options, k, normalize): 155 | 156 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 157 | trng = RandomStreams(1234) 158 | 159 | # allocate model parameters 160 | params = [] 161 | for i in xrange(len(models)): 162 | params.append(init_params(options)) 163 | 164 | # load model parameters and set theano shared variables 165 | tparams = [] 166 | for i in xrange(len(params)): 167 | params[i] = load_params(models[i], params[i]) 168 | tparams.append(init_tparams(params[i])) 169 | 170 | # word index 171 | use_noise = theano.shared(numpy.float32(0.)) 172 | f_inits = [] 173 | f_nexts = [] 174 | for i in xrange(len(tparams)): 175 | f_init, f_next = build_sampler(tparams[i], options, trng, use_noise) 176 | f_inits.append(f_init) 177 | f_nexts.append(f_next) 178 | 179 | def _translate(seq): 180 | use_noise.set_value(0.) 
181 | # sample given an input sequence and obtain scores 182 | sample, score = gen_sample(tparams, f_inits, f_nexts, 183 | numpy.array(seq).reshape([len(seq), 1]), 184 | options, trng=trng, k=k, maxlen=500, 185 | stochastic=False, argmax=False) 186 | 187 | # normalize scores according to sequence lengths 188 | if normalize: 189 | lengths = numpy.array([len(s) for s in sample]) 190 | score = score / lengths 191 | sidx = numpy.argmin(score) 192 | return sample[sidx] 193 | 194 | while True: 195 | req = queue.get() 196 | if req is None: 197 | break 198 | 199 | idx, x = req[0], req[1] 200 | print pid, '-', idx 201 | seq = _translate(x) 202 | 203 | rqueue.put((idx, seq)) 204 | 205 | return 206 | 207 | 208 | def main(models, dictionary, dictionary_target, source_file, saveto, k=5, 209 | normalize=False, n_process=5, encoder_chr_level=False, 210 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 211 | 212 | # load model model_options 213 | pkl_file = models[0].split('.')[0] + '.pkl' 214 | with open(pkl_file, 'rb') as f: 215 | options = pkl.load(f) 216 | 217 | # load source dictionary and invert 218 | with open(dictionary, 'rb') as f: 219 | word_dict = pkl.load(f) 220 | word_idict = dict() 221 | for kk, vv in word_dict.iteritems(): 222 | word_idict[vv] = kk 223 | word_idict[0] = '' 224 | word_idict[1] = 'UNK' 225 | 226 | # load target dictionary and invert 227 | with open(dictionary_target, 'rb') as f: 228 | word_dict_trg = pkl.load(f) 229 | word_idict_trg = dict() 230 | for kk, vv in word_dict_trg.iteritems(): 231 | word_idict_trg[vv] = kk 232 | word_idict_trg[0] = '' 233 | word_idict_trg[1] = 'UNK' 234 | 235 | # create input and output queues for processes 236 | queue = Queue() 237 | rqueue = Queue() 238 | processes = [None] * n_process 239 | for midx in xrange(n_process): 240 | processes[midx] = Process( 241 | target=translate_model, 242 | args=(queue, rqueue, midx, models, options, k, normalize)) 243 | processes[midx].start() 244 | 245 | # utility function 246 | def _seqs2words(caps): 247 | capsw = [] 248 | for cc in caps: 249 | ww = [] 250 | for w in cc: 251 | if w == 0: 252 | break 253 | if utf8: 254 | ww.append(word_idict_trg[w].encode('utf-8')) 255 | else: 256 | ww.append(word_idict_trg[w]) 257 | if decoder_chr_level: 258 | capsw.append(''.join(ww)) 259 | else: 260 | capsw.append(' '.join(ww)) 261 | return capsw 262 | 263 | def _send_jobs(fname): 264 | with open(fname, 'r') as f: 265 | for idx, line in enumerate(f): 266 | if encoder_chr_level: 267 | words = list(line.decode('utf-8').strip()) 268 | else: 269 | words = line.strip().split() 270 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 271 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 272 | x += [0] 273 | queue.put((idx, x)) 274 | return idx+1 275 | 276 | def _finish_processes(): 277 | for midx in xrange(n_process): 278 | queue.put(None) 279 | 280 | def _retrieve_jobs(n_samples): 281 | trans = [None] * n_samples 282 | for idx in xrange(n_samples): 283 | resp = rqueue.get() 284 | trans[resp[0]] = resp[1] 285 | if numpy.mod(idx, 10) == 0: 286 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 287 | return trans 288 | 289 | print 'Translating ', source_file, '...' 
290 | n_samples = _send_jobs(source_file) 291 | trans = _seqs2words(_retrieve_jobs(n_samples)) 292 | _finish_processes() 293 | with open(saveto, 'w') as f: 294 | if decoder_bpe_to_tok: 295 | print >>f, '\n'.join(trans).replace('@@ ', '') 296 | else: 297 | print >>f, '\n'.join(trans) 298 | print 'Done' 299 | 300 | 301 | if __name__ == "__main__": 302 | parser = argparse.ArgumentParser() 303 | parser.add_argument('-k', type=int, default=5) 304 | parser.add_argument('-p', type=int, default=5) 305 | parser.add_argument('-n', action="store_true", default=False) 306 | parser.add_argument('-bpe', action="store_true", default=False) 307 | parser.add_argument('-enc_c', action="store_true", default=False) 308 | parser.add_argument('-dec_c', action="store_true", default=False) 309 | parser.add_argument('-utf8', action="store_true", default=False) 310 | parser.add_argument('saveto', type=str) 311 | 312 | model_path = '/misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/deen/bpe2char_two_layer_gru_decoder/0209/' 313 | model1 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en1.380000.npz' 314 | model2 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en2.425000.npz' 315 | model3 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en3.400000.npz' 316 | model4 = model_path + 'bpe2char_two_layer_gru_decoder_adam.365000.npz' 317 | models = [model1, model2, model3, model4] 318 | dictionary = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/all_de-en.en.tok.bpe.word.pkl' 319 | dictionary_target = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/all_de-en.de.tok.300.pkl' 320 | source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/dev/newstest2013.en.tok.bpe' 321 | #source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/test/newstest2014-deen-src.en.tok.bpe' 322 | #source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/test/newstest2015-deen-src.en.tok.bpe' 323 | 324 | args = parser.parse_args() 325 | 326 | main(models, dictionary, dictionary_target, source, 327 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 328 | encoder_chr_level=args.enc_c, 329 | decoder_chr_level=args.dec_c, 330 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 331 | -------------------------------------------------------------------------------- /character_base/translate_bpe2char_ensemble_fien.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 
3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from nmt import (build_sampler, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def gen_sample(tparams, f_inits, f_nexts, x, options, trng=None, 16 | k=1, maxlen=500, stochastic=True, argmax=False): 17 | 18 | # k is the beam size we have 19 | if k > 1: 20 | assert not stochastic, \ 21 | 'Beam search does not support stochastic sampling' 22 | 23 | sample = [] 24 | sample_score = [] 25 | if stochastic: 26 | sample_score = 0 27 | 28 | live_k = 1 29 | dead_k = 0 30 | 31 | hyp_samples = [[]] * live_k 32 | hyp_scores = numpy.zeros(live_k).astype('float32') 33 | hyp_states = [] 34 | 35 | # get initial state of decoder rnn and encoder context 36 | rets = [] 37 | next_state_chars = [] 38 | next_state_words = [] 39 | ctx0s = [] 40 | 41 | for i in xrange(len(f_inits)): 42 | ret = f_inits[i](x) 43 | next_state_chars.append(ret[0]) 44 | next_state_words.append(ret[1]) 45 | ctx0s.append(ret[2]) 46 | next_w = -1 * numpy.ones((1,)).astype('int64') # bos indicator 47 | 48 | num_models = len(f_inits) 49 | 50 | for ii in xrange(maxlen): 51 | 52 | temp_next_p = [] 53 | temp_next_state_char = [] 54 | temp_next_state_word = [] 55 | 56 | for i in xrange(num_models): 57 | 58 | ctx = numpy.tile(ctx0s[i], [live_k, 1]) 59 | inps = [next_w, ctx, next_state_chars[i], next_state_words[i]] 60 | ret = f_nexts[i](*inps) 61 | next_p, _, next_state_char, next_state_word = ret[0], ret[1], ret[2], ret[3] 62 | temp_next_p.append(next_p) 63 | temp_next_state_char.append(next_state_char) 64 | temp_next_state_word.append(next_state_word) 65 | #next_p = numpy.log(numpy.array(temp_next_p)).sum(axis=0) / num_models 66 | next_p = numpy.log(numpy.array(temp_next_p).mean(axis=0)) 67 | 68 | if stochastic: 69 | if argmax: 70 | nw = next_p[0].argmax() 71 | else: 72 | nw = next_w[0] 73 | sample.append(nw) 74 | sample_score += next_p[0, nw] 75 | if nw == 0: 76 | break 77 | else: 78 | cand_scores = hyp_scores[:, None] - next_p 79 | cand_flat = cand_scores.flatten() 80 | ranks_flat = cand_flat.argsort()[:(k - dead_k)] 81 | 82 | voc_size = next_p.shape[1] 83 | trans_indices = ranks_flat / voc_size 84 | word_indices = ranks_flat % voc_size 85 | costs = cand_flat[ranks_flat] 86 | 87 | new_hyp_samples = [] 88 | new_hyp_scores = numpy.zeros(k - dead_k).astype('float32') 89 | new_hyp_states_chars = [] 90 | new_hyp_states_words = [] 91 | 92 | for idx, [ti, wi] in enumerate(zip(trans_indices, word_indices)): 93 | new_hyp_samples.append(hyp_samples[ti] + [wi]) 94 | new_hyp_scores[idx] = copy.copy(costs[idx]) 95 | 96 | for i in xrange(num_models): 97 | new_hyp_states_char = [] 98 | new_hyp_states_word = [] 99 | 100 | for ti in trans_indices: 101 | new_hyp_states_char.append(copy.copy(temp_next_state_char[i][ti])) 102 | new_hyp_states_word.append(copy.copy(temp_next_state_word[i][ti])) 103 | 104 | new_hyp_states_chars.append(new_hyp_states_char) 105 | new_hyp_states_words.append(new_hyp_states_word) 106 | 107 | # check the finished samples 108 | new_live_k = 0 109 | hyp_samples = [] 110 | hyp_scores = [] 111 | 112 | for idx in xrange(len(new_hyp_samples)): 113 | if new_hyp_samples[idx][-1] == 0: 114 | sample.append(new_hyp_samples[idx]) 115 | sample_score.append(new_hyp_scores[idx]) 116 | dead_k += 1 117 | else: 118 | new_live_k += 1 119 | hyp_samples.append(new_hyp_samples[idx]) 120 | hyp_scores.append(new_hyp_scores[idx]) 121 | 122 | for i in xrange(num_models): 123 | hyp_states_char = [] 124 | hyp_states_word = 
[] 125 | 126 | for idx in xrange(len(new_hyp_samples)): 127 | if new_hyp_samples[idx][-1] != 0: 128 | hyp_states_char.append(new_hyp_states_chars[i][idx]) 129 | hyp_states_word.append(new_hyp_states_words[i][idx]) 130 | 131 | next_state_chars[i] = numpy.array(hyp_states_char) 132 | next_state_words[i] = numpy.array(hyp_states_word) 133 | 134 | hyp_scores = numpy.array(hyp_scores) 135 | live_k = new_live_k 136 | 137 | if new_live_k < 1: 138 | break 139 | if dead_k >= k: 140 | break 141 | 142 | next_w = numpy.array([w[-1] for w in hyp_samples]) 143 | 144 | if not stochastic: 145 | # dump every remaining one 146 | if live_k > 0: 147 | for idx in xrange(live_k): 148 | sample.append(hyp_samples[idx]) 149 | sample_score.append(hyp_scores[idx]) 150 | 151 | return sample, sample_score 152 | 153 | 154 | def translate_model(queue, rqueue, pid, models, options, k, normalize): 155 | 156 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 157 | trng = RandomStreams(1234) 158 | 159 | # allocate model parameters 160 | params = [] 161 | for i in xrange(len(models)): 162 | params.append(init_params(options)) 163 | 164 | # load model parameters and set theano shared variables 165 | tparams = [] 166 | for i in xrange(len(params)): 167 | params[i] = load_params(models[i], params[i]) 168 | tparams.append(init_tparams(params[i])) 169 | 170 | # word index 171 | use_noise = theano.shared(numpy.float32(0.)) 172 | f_inits = [] 173 | f_nexts = [] 174 | for i in xrange(len(tparams)): 175 | f_init, f_next = build_sampler(tparams[i], options, trng, use_noise) 176 | f_inits.append(f_init) 177 | f_nexts.append(f_next) 178 | 179 | def _translate(seq): 180 | use_noise.set_value(0.) 181 | # sample given an input sequence and obtain scores 182 | sample, score = gen_sample(tparams, f_inits, f_nexts, 183 | numpy.array(seq).reshape([len(seq), 1]), 184 | options, trng=trng, k=k, maxlen=500, 185 | stochastic=False, argmax=False) 186 | 187 | # normalize scores according to sequence lengths 188 | if normalize: 189 | lengths = numpy.array([len(s) for s in sample]) 190 | score = score / lengths 191 | sidx = numpy.argmin(score) 192 | return sample[sidx] 193 | 194 | while True: 195 | req = queue.get() 196 | if req is None: 197 | break 198 | 199 | idx, x = req[0], req[1] 200 | print pid, '-', idx 201 | seq = _translate(x) 202 | 203 | rqueue.put((idx, seq)) 204 | 205 | return 206 | 207 | 208 | def main(models, dictionary, dictionary_target, source_file, saveto, k=5, 209 | normalize=False, n_process=5, encoder_chr_level=False, 210 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 211 | 212 | # load model model_options 213 | pkl_file = models[0].split('.')[0] + '.pkl' 214 | with open(pkl_file, 'rb') as f: 215 | options = pkl.load(f) 216 | 217 | # load source dictionary and invert 218 | with open(dictionary, 'rb') as f: 219 | word_dict = pkl.load(f) 220 | word_idict = dict() 221 | for kk, vv in word_dict.iteritems(): 222 | word_idict[vv] = kk 223 | word_idict[0] = '' 224 | word_idict[1] = 'UNK' 225 | 226 | # load target dictionary and invert 227 | with open(dictionary_target, 'rb') as f: 228 | word_dict_trg = pkl.load(f) 229 | word_idict_trg = dict() 230 | for kk, vv in word_dict_trg.iteritems(): 231 | word_idict_trg[vv] = kk 232 | word_idict_trg[0] = '' 233 | word_idict_trg[1] = 'UNK' 234 | 235 | # create input and output queues for processes 236 | queue = Queue() 237 | rqueue = Queue() 238 | processes = [None] * n_process 239 | for midx in xrange(n_process): 240 | processes[midx] = Process( 241 | 
target=translate_model, 242 | args=(queue, rqueue, midx, models, options, k, normalize)) 243 | processes[midx].start() 244 | 245 | # utility function 246 | def _seqs2words(caps): 247 | capsw = [] 248 | for cc in caps: 249 | ww = [] 250 | for w in cc: 251 | if w == 0: 252 | break 253 | if utf8: 254 | ww.append(word_idict_trg[w].encode('utf-8')) 255 | else: 256 | ww.append(word_idict_trg[w]) 257 | if decoder_chr_level: 258 | capsw.append(''.join(ww)) 259 | else: 260 | capsw.append(' '.join(ww)) 261 | return capsw 262 | 263 | def _send_jobs(fname): 264 | with open(fname, 'r') as f: 265 | for idx, line in enumerate(f): 266 | if encoder_chr_level: 267 | words = list(line.decode('utf-8').strip()) 268 | else: 269 | words = line.strip().split() 270 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 271 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 272 | x += [0] 273 | queue.put((idx, x)) 274 | return idx+1 275 | 276 | def _finish_processes(): 277 | for midx in xrange(n_process): 278 | queue.put(None) 279 | 280 | def _retrieve_jobs(n_samples): 281 | trans = [None] * n_samples 282 | for idx in xrange(n_samples): 283 | resp = rqueue.get() 284 | trans[resp[0]] = resp[1] 285 | if numpy.mod(idx, 10) == 0: 286 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 287 | return trans 288 | 289 | print 'Translating ', source_file, '...' 290 | n_samples = _send_jobs(source_file) 291 | trans = _seqs2words(_retrieve_jobs(n_samples)) 292 | _finish_processes() 293 | with open(saveto, 'w') as f: 294 | if decoder_bpe_to_tok: 295 | print >>f, '\n'.join(trans).replace('@@ ', '') 296 | else: 297 | print >>f, '\n'.join(trans) 298 | print 'Done' 299 | 300 | 301 | if __name__ == "__main__": 302 | parser = argparse.ArgumentParser() 303 | parser.add_argument('-k', type=int, default=5) 304 | parser.add_argument('-p', type=int, default=5) 305 | parser.add_argument('-n', action="store_true", default=False) 306 | parser.add_argument('-bpe', action="store_true", default=False) 307 | parser.add_argument('-enc_c', action="store_true", default=False) 308 | parser.add_argument('-dec_c', action="store_true", default=False) 309 | parser.add_argument('-utf8', action="store_true", default=False) 310 | parser.add_argument('saveto', type=str) 311 | 312 | model_path = '/scratch/jc7382/acl2016/wmt15/fien/bpe2char_two_layer_gru_decoder/0209/' 313 | model1 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en1.205000.npz' 314 | model2 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en2.200000.npz' 315 | model3 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en3.200000.npz' 316 | model4 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en4.200000.npz' 317 | model5 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en5.210000.npz' 318 | model6 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en6.205000.npz' 319 | model7 = model_path + 'bpe2char_two_layer_gru_decoder_adam_en7.200000.npz' 320 | model8 = model_path + 'new_bpe2char_two_layer_gru_decoder_adam.240000.npz' 321 | models = [model1, model2, model3, model4, model5, model6, model7, model8] 322 | dictionary = '/scratch/jc7382/data/wmt15/fien/train/all_fi-en.en.tok.bpe.word.pkl' 323 | dictionary_target = '/scratch/jc7382/data/wmt15/fien/train/all_fi-en.fi.tok.300.pkl' 324 | source = '/scratch/jc7382/data/wmt15/fien/dev/newsdev2015-enfi-src.en.tok.bpe' 325 | #source = '/scratch/jc7382/data/wmt15/fien/test/newstest2015-fien-src.en.tok.bpe' 326 | 327 | args = parser.parse_args() 328 | 329 | main(models, dictionary, dictionary_target, source, 330 
| args.saveto, k=args.k, normalize=args.n, n_process=args.p, 331 | encoder_chr_level=args.enc_c, 332 | decoder_chr_level=args.dec_c, 333 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 334 | -------------------------------------------------------------------------------- /character_base/wmt15_csen_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/csen/bpe2char_two_layer_gru_decoder/0328/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 302 14 | n_words_src 21907 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_cs-en.en.tok.bpe 30 | target_dataset all_cs-en.cs.tok 31 | valid_source_dataset newstest2013-src.en.tok.bpe 32 | valid_target_dataset newstest2013-ref.cs.tok 33 | source_dictionary all_cs-en.en.tok.bpe.word.pkl 34 | target_dictionary all_cs-en.cs.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_base/wmt15_deen_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /raid/chungjun/acl2016/wmt15/deen/bpe2char_two_layer_gru_decoder/0417/ 2 | train_data_path /raid/chungjun/data/wmt15/deen/train/ 3 | dev_data_path /raid/chungjun/data/wmt15/deen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 302 14 | n_words_src 24440 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_de-en.en.tok.bpe.shuf 30 | target_dataset all_de-en.de.tok.shuf 31 | valid_source_dataset newstest2013.en.tok.bpe 32 | valid_target_dataset newstest2013.de.tok 33 | source_dictionary all_de-en.en.tok.bpe.word.pkl 34 | target_dictionary all_de-en.de.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_base/wmt15_fien_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/fien/bpe2char_two_layer_gru_decoder/0328/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 292 14 | n_words_src 20174 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | 
maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_fi-en.en.tok.bpe.shuf 30 | target_dataset all_fi-en.fi.tok.shuf 31 | valid_source_dataset newsdev2015-enfi-src.en.tok.bpe 32 | valid_target_dataset newsdev2015-enfi-ref.fi.tok 33 | source_dictionary all_fi-en.en.tok.bpe.word.pkl 34 | target_dictionary all_fi-en.fi.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_base/wmt15_ruen_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/ruen/bpe2char_two_layer_gru_decoder/0328/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 302 14 | n_words_src 22030 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_ru-en.en.tok.bpe 30 | target_dataset all_ru-en.ru.tok 31 | valid_source_dataset newstest2013-src.en.tok.bpe 32 | valid_target_dataset newstest2013-ref.ru.tok 33 | source_dictionary all_ru-en.en.tok.bpe.word.pkl 34 | target_dictionary all_ru-en.ru.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_biscale/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nyu-dl/dl4mt-cdec/e738dc7235cb2819ad2b4e8e5837e97b2fb41de2/character_biscale/__init__.py -------------------------------------------------------------------------------- /character_biscale/train_wmt15_csen_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_biscale import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'biscale_decoder': ('param_init_biscale_decoder', 11 | 'biscale_decoder_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_biscale_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | 
n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_csen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_biscale/train_wmt15_deen_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_biscale import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'biscale_decoder': ('param_init_biscale_decoder', 11 | 'biscale_decoder_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_biscale_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | 
maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_biscale/train_wmt15_deen_attc_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_biscale_attc import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'biscale_decoder_attc': ('param_init_biscale_decoder_attc', 11 | 'biscale_decoder_attc_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_biscale_decoder_attc_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | 
sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample, 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_biscale/train_wmt15_deen_both_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_biscale_both import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'biscale_decoder_both': ('param_init_biscale_decoder_both', 11 | 'biscale_decoder_both_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_biscale_decoder_both_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | 
clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample, 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_biscale/train_wmt15_fien_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_biscale import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'biscale_decoder': ('param_init_biscale_decoder', 11 | 'biscale_decoder_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_biscale_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | 
use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_fien_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_biscale/train_wmt15_ruen_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from char_biscale import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'biscale_decoder': ('param_init_biscale_decoder', 11 | 'biscale_decoder_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2char_biscale_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | 
init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_ruen_bpe2char_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /character_biscale/translate.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from char_biscale import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | 
target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | queue.put((idx, x)) 126 | return idx+1 127 | 128 | def _finish_processes(): 129 | for midx in xrange(n_process): 130 | queue.put(None) 131 | 132 | def _retrieve_jobs(n_samples): 133 | trans = [None] * n_samples 134 | for idx in xrange(n_samples): 135 | resp = rqueue.get() 136 | trans[resp[0]] = resp[1] 137 | if numpy.mod(idx, 10) == 0: 138 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 139 | return trans 140 | 141 | print 'Translating ', source_file, '...' 142 | n_samples = _send_jobs(source_file) 143 | trans = _seqs2words(_retrieve_jobs(n_samples)) 144 | _finish_processes() 145 | with open(saveto, 'w') as f: 146 | if decoder_bpe_to_tok: 147 | print >>f, '\n'.join(trans).replace('@@ ', '') 148 | else: 149 | print >>f, '\n'.join(trans) 150 | print 'Done' 151 | 152 | 153 | if __name__ == "__main__": 154 | parser = argparse.ArgumentParser() 155 | parser.add_argument('-k', type=int, default=5) 156 | parser.add_argument('-p', type=int, default=5) 157 | parser.add_argument('-n', action="store_true", default=False) 158 | parser.add_argument('-bpe', action="store_true", default=False) 159 | parser.add_argument('-enc_c', action="store_true", default=False) 160 | parser.add_argument('-dec_c', action="store_true", default=False) 161 | parser.add_argument('-utf8', action="store_true", default=False) 162 | parser.add_argument('model', type=str) 163 | parser.add_argument('dictionary', type=str) 164 | parser.add_argument('dictionary_target', type=str) 165 | parser.add_argument('source', type=str) 166 | parser.add_argument('saveto', type=str) 167 | 168 | args = parser.parse_args() 169 | 170 | main(args.model, args.dictionary, args.dictionary_target, args.source, 171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 172 | encoder_chr_level=args.enc_c, 173 | decoder_chr_level=args.dec_c, 174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 175 | -------------------------------------------------------------------------------- /character_biscale/translate_attc.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 
3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from char_biscale_attc import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | #print '==============================' 126 | 
#print line 127 | #print '------------------------------' 128 | #print ' '.join([word_idict[wx] for wx in x]) 129 | #print '==============================' 130 | queue.put((idx, x)) 131 | return idx+1 132 | 133 | def _finish_processes(): 134 | for midx in xrange(n_process): 135 | queue.put(None) 136 | 137 | def _retrieve_jobs(n_samples): 138 | trans = [None] * n_samples 139 | for idx in xrange(n_samples): 140 | resp = rqueue.get() 141 | trans[resp[0]] = resp[1] 142 | if numpy.mod(idx, 10) == 0: 143 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 144 | return trans 145 | 146 | print 'Translating ', source_file, '...' 147 | n_samples = _send_jobs(source_file) 148 | trans = _seqs2words(_retrieve_jobs(n_samples)) 149 | _finish_processes() 150 | with open(saveto, 'w') as f: 151 | print >>f, '\n'.join(trans) 152 | print 'Done' 153 | 154 | 155 | if __name__ == "__main__": 156 | parser = argparse.ArgumentParser() 157 | parser.add_argument('-k', type=int, default=5) 158 | parser.add_argument('-p', type=int, default=5) 159 | parser.add_argument('-n', action="store_true", default=False) 160 | parser.add_argument('-enc_c', action="store_true", default=False) 161 | parser.add_argument('-dec_c', action="store_true", default=False) 162 | parser.add_argument('-utf8', action="store_true", default=False) 163 | parser.add_argument('model', type=str) 164 | parser.add_argument('dictionary', type=str) 165 | parser.add_argument('dictionary_target', type=str) 166 | parser.add_argument('source', type=str) 167 | parser.add_argument('saveto', type=str) 168 | 169 | args = parser.parse_args() 170 | 171 | main(args.model, args.dictionary, args.dictionary_target, args.source, 172 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 173 | encoder_chr_level=args.enc_c, 174 | decoder_chr_level=args.dec_c, 175 | utf8=args.utf8) 176 | -------------------------------------------------------------------------------- /character_biscale/translate_both.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from char_biscale_both import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 
33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | queue.put((idx, x)) 126 | return idx+1 127 | 128 | def _finish_processes(): 129 | for midx in xrange(n_process): 130 | queue.put(None) 131 | 132 | def _retrieve_jobs(n_samples): 133 | trans = [None] * n_samples 134 | for idx in xrange(n_samples): 135 | resp = rqueue.get() 136 | trans[resp[0]] = resp[1] 137 | if numpy.mod(idx, 10) == 0: 138 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 139 | return trans 140 | 141 | print 'Translating ', source_file, '...' 
142 | n_samples = _send_jobs(source_file) 143 | trans = _seqs2words(_retrieve_jobs(n_samples)) 144 | _finish_processes() 145 | with open(saveto, 'w') as f: 146 | if decoder_bpe_to_tok: 147 | print >>f, '\n'.join(trans).replace('@@ ', '') 148 | else: 149 | print >>f, '\n'.join(trans) 150 | print 'Done' 151 | 152 | 153 | if __name__ == "__main__": 154 | parser = argparse.ArgumentParser() 155 | parser.add_argument('-k', type=int, default=5) 156 | parser.add_argument('-p', type=int, default=5) 157 | parser.add_argument('-n', action="store_true", default=False) 158 | parser.add_argument('-bpe', action="store_true", default=False) 159 | parser.add_argument('-enc_c', action="store_true", default=False) 160 | parser.add_argument('-dec_c', action="store_true", default=False) 161 | parser.add_argument('-utf8', action="store_true", default=False) 162 | parser.add_argument('model', type=str) 163 | parser.add_argument('dictionary', type=str) 164 | parser.add_argument('dictionary_target', type=str) 165 | parser.add_argument('source', type=str) 166 | parser.add_argument('saveto', type=str) 167 | 168 | args = parser.parse_args() 169 | 170 | main(args.model, args.dictionary, args.dictionary_target, args.source, 171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 172 | encoder_chr_level=args.enc_c, 173 | decoder_chr_level=args.dec_c, 174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 175 | -------------------------------------------------------------------------------- /character_biscale/wmt15_csen_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/csen/bpe2char_seg_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 302 14 | n_words_src 21907 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_cs-en.en.tok.bpe 30 | target_dataset all_cs-en.cs.tok 31 | valid_source_dataset newstest2013-src.en.tok.bpe 32 | valid_target_dataset newstest2013-ref.cs.tok 33 | source_dictionary all_cs-en.en.tok.bpe.word.pkl 34 | target_dictionary all_cs-en.cs.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_biscale/wmt15_deen_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/deen/bpe2char_seg_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 302 14 | n_words_src 24440 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 
500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_de-en.en.tok.bpe.shuf 30 | target_dataset all_de-en.de.tok.shuf 31 | valid_source_dataset newstest2013.en.tok.bpe 32 | valid_target_dataset newstest2013.de.tok 33 | source_dictionary all_de-en.en.tok.bpe.word.pkl 34 | target_dictionary all_de-en.de.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_biscale/wmt15_fien_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/fien/bpe2char_seg_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 292 14 | n_words_src 20174 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_fi-en.en.tok.bpe.shuf 30 | target_dataset all_fi-en.fi.tok.shuf 31 | valid_source_dataset newsdev2015-enfi-src.en.tok.bpe 32 | valid_target_dataset newsdev2015-enfi-ref.fi.tok 33 | source_dictionary all_fi-en.en.tok.bpe.word.pkl 34 | target_dictionary all_fi-en.fi.tok.300.pkl 35 | -------------------------------------------------------------------------------- /character_biscale/wmt15_ruen_bpe2char_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/ruen/bpe2char_seg_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 302 14 | n_words_src 22030 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 500 26 | maxlen_sample 500 27 | source_word_level 1 28 | target_word_level 0 29 | source_dataset all_ru-en.en.tok.bpe 30 | target_dataset all_ru-en.ru.tok 31 | valid_source_dataset newstest2013-src.en.tok.bpe 32 | valid_target_dataset newstest2013-ref.ru.tok 33 | source_dictionary all_ru-en.en.tok.bpe.word.pkl 34 | target_dictionary all_ru-en.ru.tok.300.pkl 35 | -------------------------------------------------------------------------------- /data_iterator.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | import numpy 3 | import os 4 | import random 5 | 6 | import cPickle 7 | import gzip 8 | import codecs 9 | 10 | from tempfile import mkstemp 11 | 12 | 13 | def fopen(filename, mode='r'): 14 | if filename.endswith('.gz'): 15 | return gzip.open(filename, mode) 16 | return open(filename, mode) 17 | 18 | 19 | class TextIterator: 20 | """Simple Bitext iterator.""" 21 | def __init__(self, 22 | source, source_dict, 23 | target=None, target_dict=None, 24 | 
source_word_level=0, 25 | target_word_level=0, 26 | batch_size=128, 27 | job_id=0, 28 | sort_size=20, 29 | n_words_source=-1, 30 | n_words_target=-1, 31 | shuffle_per_epoch=False): 32 | self.source_file = source 33 | self.target_file = target 34 | self.source = fopen(source, 'r') 35 | with open(source_dict, 'rb') as f: 36 | self.source_dict = cPickle.load(f) 37 | if target is not None: 38 | self.target = fopen(target, 'r') 39 | if target_dict is not None: 40 | with open(target_dict, 'rb') as f: 41 | self.target_dict = cPickle.load(f) 42 | else: 43 | self.target = None 44 | 45 | self.source_word_level = source_word_level 46 | self.target_word_level = target_word_level 47 | self.batch_size = batch_size 48 | 49 | self.n_words_source = n_words_source 50 | self.n_words_target = n_words_target 51 | self.shuffle_per_epoch = shuffle_per_epoch 52 | 53 | self.source_buffer = [] 54 | self.target_buffer = [] 55 | self.k = batch_size * sort_size 56 | 57 | self.end_of_data = False 58 | self.job_id = job_id 59 | 60 | def __iter__(self): 61 | return self 62 | 63 | def reset(self): 64 | if self.shuffle_per_epoch: 65 | # close current files 66 | self.source.close() 67 | if self.target is None: 68 | self.shuffle([self.source_file]) 69 | self.source = fopen(self.source_file + '.reshuf_%d' % self.job_id, 'r') 70 | else: 71 | self.target.close() 72 | # shuffle *original* source files, 73 | self.shuffle([self.source_file, self.target_file]) 74 | # open newly 're-shuffled' file as input 75 | self.source = fopen(self.source_file + '.reshuf_%d' % self.job_id, 'r') 76 | self.target = fopen(self.target_file + '.reshuf_%d' % self.job_id, 'r') 77 | else: 78 | self.source.seek(0) 79 | if self.target is not None: 80 | self.target.seek(0) 81 | 82 | @staticmethod 83 | def shuffle(files): 84 | tf_os, tpath = mkstemp() 85 | tf = open(tpath, 'w') 86 | fds = [open(ff) for ff in files] 87 | for l in fds[0]: 88 | lines = [l.strip()] + [ff.readline().strip() for ff in fds[1:]] 89 | print >>tf, "|||".join(lines) 90 | [ff.close() for ff in fds] 91 | tf.close() 92 | tf = open(tpath, 'r') 93 | lines = tf.readlines() 94 | random.shuffle(lines) 95 | fds = [open(ff+'.reshuf','w') for ff in files] 96 | for l in lines: 97 | s = l.strip().split('|||') 98 | for ii, fd in enumerate(fds): 99 | print >>fd, s[ii] 100 | [ff.close() for ff in fds] 101 | os.remove(tpath) 102 | return 103 | 104 | def next(self): 105 | if self.end_of_data: 106 | self.end_of_data = False 107 | self.reset() 108 | raise StopIteration 109 | 110 | source = [] 111 | target = [] 112 | 113 | # fill buffer, if it's empty 114 | if self.target is not None: 115 | assert len(self.source_buffer) == len(self.target_buffer), 'Buffer size mismatch!' 
116 | 117 | if len(self.source_buffer) == 0: 118 | for k_ in xrange(self.k): 119 | ss = self.source.readline() 120 | 121 | if ss == "": 122 | break 123 | 124 | if self.source_word_level: 125 | ss = ss.strip().split() 126 | else: 127 | ss = ss.strip() 128 | ss = list(ss.decode('utf8')) 129 | 130 | self.source_buffer.append(ss) 131 | 132 | if self.target is not None: 133 | tt = self.target.readline() 134 | 135 | if tt == "": 136 | break 137 | 138 | if self.target_word_level: 139 | tt = tt.strip().split() 140 | else: 141 | tt = tt.strip() 142 | tt = list(tt.decode('utf8')) 143 | 144 | self.target_buffer.append(tt) 145 | 146 | if self.target is not None: 147 | # sort by target buffer 148 | tlen = numpy.array([len(t) for t in self.target_buffer]) 149 | tidx = tlen.argsort() 150 | _sbuf = [self.source_buffer[i] for i in tidx] 151 | _tbuf = [self.target_buffer[i] for i in tidx] 152 | self.target_buffer = _tbuf 153 | else: 154 | slen = numpy.array([len(s) for s in self.source_buffer]) 155 | sidx = slen.argsort() 156 | _sbuf = [self.source_buffer[i] for i in sidx] 157 | 158 | self.source_buffer = _sbuf 159 | 160 | if self.target is not None: 161 | if len(self.source_buffer) == 0 or len(self.target_buffer) == 0: 162 | self.end_of_data = False 163 | self.reset() 164 | raise StopIteration 165 | elif len(self.source_buffer) == 0: 166 | self.end_of_data = False 167 | self.reset() 168 | raise StopIteration 169 | 170 | try: 171 | # actual work here 172 | while True: 173 | # read from source file and map to word index 174 | try: 175 | ss_ = self.source_buffer.pop() 176 | except IndexError: 177 | break 178 | ss = [self.source_dict[w] if w in self.source_dict else 1 for w in ss_] 179 | if self.n_words_source > 0: 180 | ss = [w if w < self.n_words_source else 1 for w in ss] 181 | source.append(ss) 182 | if self.target is not None: 183 | # read from target file and map to word index 184 | tt_ = self.target_buffer.pop() 185 | tt = [self.target_dict[w] if w in self.target_dict else 1 for w in tt_] 186 | if self.n_words_target > 0: 187 | tt = [w if w < self.n_words_target else 1 for w in tt] 188 | target.append(tt) 189 | 190 | if len(source) >= self.batch_size: 191 | break 192 | except IOError: 193 | self.end_of_data = True 194 | 195 | if self.target is not None: 196 | if len(source) <= 0 or len(target) <= 0: 197 | self.end_of_data = False 198 | self.reset() 199 | raise StopIteration 200 | return source, target 201 | else: 202 | if len(source) <= 0: 203 | self.end_of_data = False 204 | self.reset() 205 | raise StopIteration 206 | return source 207 | -------------------------------------------------------------------------------- /preprocess/build_dictionary_char.py: -------------------------------------------------------------------------------- 1 | import cPickle as pkl 2 | import fileinput 3 | import numpy 4 | import sys 5 | import codecs 6 | 7 | from collections import OrderedDict 8 | 9 | 10 | short_list = 300 11 | 12 | def main(): 13 | for filename in sys.argv[1:]: 14 | print 'Processing', filename 15 | word_freqs = OrderedDict() 16 | 17 | with open(filename, 'r') as f: 18 | for line in f: 19 | words_in = line.strip() 20 | words_in = list(words_in.decode('utf8')) 21 | for w in words_in: 22 | if w not in word_freqs: 23 | word_freqs[w] = 0 24 | word_freqs[w] += 1 25 | 26 | words = word_freqs.keys() 27 | freqs = word_freqs.values() 28 | 29 | sorted_idx = numpy.argsort(freqs) 30 | sorted_words = [words[ii] for ii in sorted_idx[::-1]] 31 | 32 | worddict = OrderedDict() 33 | worddict['eos'] = 0 34 | worddict['UNK'] 
= 1 35 | 36 | if short_list is not None: 37 | for ii in xrange(min(short_list, len(sorted_words))): 38 | worddict[sorted_words[ii]] = ii + 2 39 | else: 40 | for ii, ww in enumerate(sorted_words): 41 | worddict[ww] = ii + 2 42 | 43 | with open('%s.%d.pkl' % (filename, short_list), 'wb') as f: 44 | pkl.dump(worddict, f) 45 | 46 | f.close() 47 | print 'Done' 48 | print len(worddict) 49 | 50 | if __name__ == '__main__': 51 | main() 52 | -------------------------------------------------------------------------------- /preprocess/build_dictionary_word.py: -------------------------------------------------------------------------------- 1 | import cPickle as pkl 2 | import fileinput 3 | import numpy 4 | import sys 5 | import codecs 6 | 7 | from collections import OrderedDict 8 | 9 | 10 | def main(): 11 | for filename in sys.argv[1:]: 12 | print 'Processing', filename 13 | word_freqs = OrderedDict() 14 | 15 | with open(filename, 'r') as f: 16 | for line in f: 17 | words_in = line.strip().split(' ') 18 | for w in words_in: 19 | if w not in word_freqs: 20 | word_freqs[w] = 0 21 | word_freqs[w] += 1 22 | 23 | words = word_freqs.keys() 24 | freqs = word_freqs.values() 25 | 26 | sorted_idx = numpy.argsort(freqs) 27 | sorted_words = [words[ii] for ii in sorted_idx[::-1]] 28 | 29 | worddict = OrderedDict() 30 | worddict['eos'] = 0 31 | worddict['UNK'] = 1 32 | 33 | for ii, ww in enumerate(sorted_words): 34 | worddict[ww] = ii + 2 35 | 36 | with open('%s.word.pkl' % filename, 'wb') as f: 37 | pkl.dump(worddict, f) 38 | 39 | f.close() 40 | print 'Done' 41 | print len(worddict) 42 | 43 | if __name__ == '__main__': 44 | main() 45 | -------------------------------------------------------------------------------- /preprocess/clean_tags.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import re 3 | 4 | from_file = sys.argv[1] 5 | to_file = sys.argv[2] 6 | to_file_out = open(to_file, "w") 7 | 8 | regex = "<.*>" 9 | 10 | tag_match = re.compile(regex) 11 | matched_lines = [] 12 | 13 | with open(from_file) as from_file: 14 | content = from_file.readlines() 15 | for line in content: 16 | if (tag_match.match(line)): 17 | pass 18 | else: 19 | matched_lines.append(line) 20 | 21 | matched_lines = "".join(matched_lines) 22 | to_file_out.write(matched_lines) 23 | to_file_out.close() 24 | 25 | -------------------------------------------------------------------------------- /preprocess/fix_appo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # {1} is the directory name 3 | 4 | 5 | for f in ${1}/*.xml 6 | do 7 | cat $f | grep "" | sed "s/’/'/g" | sed "s/“/\"/g" | sed "s/”/\"/g" > ${f}.fixed 8 | done 9 | 10 | -------------------------------------------------------------------------------- /preprocess/merge.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | SRC=$1 5 | TRG=$2 6 | 7 | FSRC=all_${1}-${2}.${1} 8 | FTRG=all_${1}-${2}.${2} 9 | 10 | echo "" > $FSRC 11 | for F in *${1}-${2}.${1} 12 | do 13 | if [ "$F" = "$FSRC" ]; then 14 | echo "pass" 15 | else 16 | cat $F >> $FSRC 17 | fi 18 | done 19 | 20 | 21 | echo "" > $FTRG 22 | for F in *${1}-${2}.${2} 23 | do 24 | if [ "$F" = "$FTRG" ]; then 25 | echo "pass" 26 | else 27 | cat $F >> $FTRG 28 | fi 29 | done 30 | -------------------------------------------------------------------------------- /preprocess/multi-bleu.perl: -------------------------------------------------------------------------------- 1 | 
#!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 5 | 6 | # $Id$ 7 | use warnings; 8 | use strict; 9 | 10 | my $lowercase = 0; 11 | if ($ARGV[0] eq "-lc") { 12 | $lowercase = 1; 13 | shift; 14 | } 15 | 16 | my $stem = $ARGV[0]; 17 | if (!defined $stem) { 18 | print STDERR "usage: multi-bleu.pl [-lc] reference < hypothesis\n"; 19 | print STDERR "Reads the references from reference or reference0, reference1, ...\n"; 20 | exit(1); 21 | } 22 | 23 | $stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0"; 24 | 25 | my @REF; 26 | my $ref=0; 27 | while(-e "$stem$ref") { 28 | &add_to_ref("$stem$ref",\@REF); 29 | $ref++; 30 | } 31 | &add_to_ref($stem,\@REF) if -e $stem; 32 | die("ERROR: could not find reference file $stem") unless scalar @REF; 33 | 34 | sub add_to_ref { 35 | my ($file,$REF) = @_; 36 | my $s=0; 37 | open(REF,$file) or die "Can't read $file"; 38 | while(<REF>) { 39 | chop; 40 | push @{$$REF[$s++]}, $_; 41 | } 42 | close(REF); 43 | } 44 | 45 | my(@CORRECT,@TOTAL,$length_translation,$length_reference); 46 | my $s=0; 47 | while(<STDIN>) { 48 | chop; 49 | $_ = lc if $lowercase; 50 | my @WORD = split; 51 | my %REF_NGRAM = (); 52 | my $length_translation_this_sentence = scalar(@WORD); 53 | my ($closest_diff,$closest_length) = (9999,9999); 54 | foreach my $reference (@{$REF[$s]}) { 55 | # print "$s $_ <=> $reference\n"; 56 | $reference = lc($reference) if $lowercase; 57 | my @WORD = split(' ',$reference); 58 | my $length = scalar(@WORD); 59 | my $diff = abs($length_translation_this_sentence-$length); 60 | if ($diff < $closest_diff) { 61 | $closest_diff = $diff; 62 | $closest_length = $length; 63 | # print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." = abs($length_translation_this_sentence-$length), setting len: $closest_length\n"; 64 | } elsif ($diff == $closest_diff) { 65 | $closest_length = $length if $length < $closest_length; 66 | # from two references with the same closeness to me 67 | # take the *shorter* into account, not the "first" one. 68 | } 69 | for(my $n=1;$n<=4;$n++) { 70 | my %REF_NGRAM_N = (); 71 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 72 | my $ngram = "$n"; 73 | for(my $w=0;$w<$n;$w++) { 74 | $ngram .= " ".$WORD[$start+$w]; 75 | } 76 | $REF_NGRAM_N{$ngram}++; 77 | } 78 | foreach my $ngram (keys %REF_NGRAM_N) { 79 | if (!defined($REF_NGRAM{$ngram}) || 80 | $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) { 81 | $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram}; 82 | # print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram}
\n"; 83 | } 84 | } 85 | } 86 | } 87 | $length_translation += $length_translation_this_sentence; 88 | $length_reference += $closest_length; 89 | for(my $n=1;$n<=4;$n++) { 90 | my %T_NGRAM = (); 91 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 92 | my $ngram = "$n"; 93 | for(my $w=0;$w<$n;$w++) { 94 | $ngram .= " ".$WORD[$start+$w]; 95 | } 96 | $T_NGRAM{$ngram}++; 97 | } 98 | foreach my $ngram (keys %T_NGRAM) { 99 | $ngram =~ /^(\d+) /; 100 | my $n = $1; 101 | # my $corr = 0; 102 | # print "$i e $ngram $T_NGRAM{$ngram}
\n"; 103 | $TOTAL[$n] += $T_NGRAM{$ngram}; 104 | if (defined($REF_NGRAM{$ngram})) { 105 | if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) { 106 | $CORRECT[$n] += $T_NGRAM{$ngram}; 107 | # $corr = $T_NGRAM{$ngram}; 108 | # print "$i e correct1 $T_NGRAM{$ngram}
\n"; 109 | } 110 | else { 111 | $CORRECT[$n] += $REF_NGRAM{$ngram}; 112 | # $corr = $REF_NGRAM{$ngram}; 113 | # print "$i e correct2 $REF_NGRAM{$ngram}
\n"; 114 | } 115 | } 116 | # $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram}; 117 | # print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n" 118 | } 119 | } 120 | $s++; 121 | } 122 | my $brevity_penalty = 1; 123 | my $bleu = 0; 124 | 125 | my @bleu=(); 126 | 127 | for(my $n=1;$n<=4;$n++) { 128 | if (defined ($TOTAL[$n]) && defined ($CORRECT[$n]) && $TOTAL[$n] > 0){ 129 | $bleu[$n]=($TOTAL[$n]>0)?$CORRECT[$n]/$TOTAL[$n]:0; 130 | # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n"; 131 | }else{ 132 | $bleu[$n]=0; 133 | } 134 | } 135 | 136 | if ($length_reference==0){ 137 | printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n"; 138 | exit(1); 139 | } 140 | 141 | if ($length_translation<$length_reference) { 142 | $brevity_penalty = exp(1-$length_reference/$length_translation); 143 | } 144 | $bleu = $brevity_penalty * exp((my_log( $bleu[1] ) + 145 | my_log( $bleu[2] ) + 146 | my_log( $bleu[3] ) + 147 | my_log( $bleu[4] ) ) / 4) ; 148 | printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n", 149 | 100*$bleu, 150 | 100*$bleu[1], 151 | 100*$bleu[2], 152 | 100*$bleu[3], 153 | 100*$bleu[4], 154 | $brevity_penalty, 155 | $length_translation / $length_reference, 156 | $length_translation, 157 | $length_reference; 158 | 159 | sub my_log { 160 | return -9999999999 unless $_[0]; 161 | return log($_[0]); 162 | } 163 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/README.txt: -------------------------------------------------------------------------------- 1 | The language suffix can be found here: 2 | 3 | http://www.loc.gov/standards/iso639-2/php/code_list.php 4 | 5 | This code includes data from Daniel Naber's Language Tools (czech abbreviations). 6 | This code includes data from czech wiktionary (also czech abbreviations). 7 | 8 | 9 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.ca: -------------------------------------------------------------------------------- 1 | Dr 2 | Dra 3 | pàg 4 | p 5 | c 6 | av 7 | Sr 8 | Sra 9 | adm 10 | esq 11 | Prof 12 | S.A 13 | S.L 14 | p.e 15 | ptes 16 | Sta 17 | St 18 | pl 19 | màx 20 | cast 21 | dir 22 | nre 23 | fra 24 | admdora 25 | Emm 26 | Excma 27 | espf 28 | dc 29 | admdor 30 | tel 31 | angl 32 | aprox 33 | ca 34 | dept 35 | dj 36 | dl 37 | dt 38 | ds 39 | dg 40 | dv 41 | ed 42 | entl 43 | al 44 | i.e 45 | maj 46 | smin 47 | n 48 | núm 49 | pta 50 | A 51 | B 52 | C 53 | D 54 | E 55 | F 56 | G 57 | H 58 | I 59 | J 60 | K 61 | L 62 | M 63 | N 64 | O 65 | P 66 | Q 67 | R 68 | S 69 | T 70 | U 71 | V 72 | W 73 | X 74 | Y 75 | Z 76 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.cs: -------------------------------------------------------------------------------- 1 | Bc 2 | BcA 3 | Ing 4 | Ing.arch 5 | MUDr 6 | MVDr 7 | MgA 8 | Mgr 9 | JUDr 10 | PhDr 11 | RNDr 12 | PharmDr 13 | ThLic 14 | ThDr 15 | Ph.D 16 | Th.D 17 | prof 18 | doc 19 | CSc 20 | DrSc 21 | dr. h. 
c 22 | PaedDr 23 | Dr 24 | PhMr 25 | DiS 26 | abt 27 | ad 28 | a.i 29 | aj 30 | angl 31 | anon 32 | apod 33 | atd 34 | atp 35 | aut 36 | bd 37 | biogr 38 | b.m 39 | b.p 40 | b.r 41 | cca 42 | cit 43 | cizojaz 44 | c.k 45 | col 46 | čes 47 | čín 48 | čj 49 | ed 50 | facs 51 | fasc 52 | fol 53 | fot 54 | franc 55 | h.c 56 | hist 57 | hl 58 | hrsg 59 | ibid 60 | il 61 | ind 62 | inv.č 63 | jap 64 | jhdt 65 | jv 66 | koed 67 | kol 68 | korej 69 | kl 70 | krit 71 | lat 72 | lit 73 | m.a 74 | maď 75 | mj 76 | mp 77 | násl 78 | např 79 | nepubl 80 | něm 81 | no 82 | nr 83 | n.s 84 | okr 85 | odd 86 | odp 87 | obr 88 | opr 89 | orig 90 | phil 91 | pl 92 | pokrač 93 | pol 94 | port 95 | pozn 96 | př.kr 97 | př.n.l 98 | přel 99 | přeprac 100 | příl 101 | pseud 102 | pt 103 | red 104 | repr 105 | resp 106 | revid 107 | rkp 108 | roč 109 | roz 110 | rozš 111 | samost 112 | sect 113 | sest 114 | seš 115 | sign 116 | sl 117 | srv 118 | stol 119 | sv 120 | šk 121 | šk.ro 122 | špan 123 | tab 124 | t.č 125 | tis 126 | tj 127 | tř 128 | tzv 129 | univ 130 | uspoř 131 | vol 132 | vl.jm 133 | vs 134 | vyd 135 | vyobr 136 | zal 137 | zejm 138 | zkr 139 | zprac 140 | zvl 141 | n.p 142 | např 143 | než 144 | MUDr 145 | abl 146 | absol 147 | adj 148 | adv 149 | ak 150 | ak. sl 151 | akt 152 | alch 153 | amer 154 | anat 155 | angl 156 | anglosas 157 | arab 158 | arch 159 | archit 160 | arg 161 | astr 162 | astrol 163 | att 164 | bás 165 | belg 166 | bibl 167 | biol 168 | boh 169 | bot 170 | bulh 171 | círk 172 | csl 173 | č 174 | čas 175 | čes 176 | dat 177 | děj 178 | dep 179 | dět 180 | dial 181 | dór 182 | dopr 183 | dosl 184 | ekon 185 | epic 186 | etnonym 187 | eufem 188 | f 189 | fam 190 | fem 191 | fil 192 | film 193 | form 194 | fot 195 | fr 196 | fut 197 | fyz 198 | gen 199 | geogr 200 | geol 201 | geom 202 | germ 203 | gram 204 | hebr 205 | herald 206 | hist 207 | hl 208 | hovor 209 | hud 210 | hut 211 | chcsl 212 | chem 213 | ie 214 | imp 215 | impf 216 | ind 217 | indoevr 218 | inf 219 | instr 220 | interj 221 | ión 222 | iron 223 | it 224 | kanad 225 | katalán 226 | klas 227 | kniž 228 | komp 229 | konj 230 | 231 | konkr 232 | kř 233 | kuch 234 | lat 235 | lék 236 | les 237 | lid 238 | lit 239 | liturg 240 | lok 241 | log 242 | m 243 | mat 244 | meteor 245 | metr 246 | mod 247 | ms 248 | mysl 249 | n 250 | náb 251 | námoř 252 | neklas 253 | něm 254 | nesklon 255 | nom 256 | ob 257 | obch 258 | obyč 259 | ojed 260 | opt 261 | part 262 | pas 263 | pejor 264 | pers 265 | pf 266 | pl 267 | plpf 268 | 269 | práv 270 | prep 271 | předl 272 | přivl 273 | r 274 | rcsl 275 | refl 276 | reg 277 | rkp 278 | ř 279 | řec 280 | s 281 | samohl 282 | sg 283 | sl 284 | souhl 285 | spec 286 | srov 287 | stfr 288 | střv 289 | stsl 290 | subj 291 | subst 292 | superl 293 | sv 294 | sz 295 | táz 296 | tech 297 | telev 298 | teol 299 | trans 300 | typogr 301 | var 302 | vedl 303 | verb 304 | vl. jm 305 | voj 306 | vok 307 | vůb 308 | vulg 309 | výtv 310 | vztaž 311 | zahr 312 | zájm 313 | zast 314 | zejm 315 | 316 | zeměd 317 | zkr 318 | zř 319 | mj 320 | dl 321 | atp 322 | sport 323 | Mgr 324 | horn 325 | MVDr 326 | JUDr 327 | RSDr 328 | Bc 329 | PhDr 330 | ThDr 331 | Ing 332 | aj 333 | apod 334 | PharmDr 335 | pomn 336 | ev 337 | slang 338 | nprap 339 | odp 340 | dop 341 | pol 342 | st 343 | stol 344 | p. n. l 345 | před n. l 346 | n. l 347 | př. Kr 348 | po Kr 349 | př. n. l 350 | odd 351 | RNDr 352 | tzv 353 | atd 354 | tzn 355 | resp 356 | tj 357 | p 358 | br 359 | č. j 360 | čj 361 | č. p 362 | čp 363 | a. 
s 364 | s. r. o 365 | spol. s r. o 366 | p. o 367 | s. p 368 | v. o. s 369 | k. s 370 | o. p. s 371 | o. s 372 | v. r 373 | v z 374 | ml 375 | vč 376 | kr 377 | mld 378 | hod 379 | popř 380 | ap 381 | event 382 | rus 383 | slov 384 | rum 385 | švýc 386 | P. T 387 | zvl 388 | hor 389 | dol 390 | S.O.S -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.de: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | #no german words end in single lower-case letters, so we throw those in too. 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | 61 | #Roman Numerals. A dot after one of these is not a sentence break in German. 62 | I 63 | II 64 | III 65 | IV 66 | V 67 | VI 68 | VII 69 | VIII 70 | IX 71 | X 72 | XI 73 | XII 74 | XIII 75 | XIV 76 | XV 77 | XVI 78 | XVII 79 | XVIII 80 | XIX 81 | XX 82 | i 83 | ii 84 | iii 85 | iv 86 | v 87 | vi 88 | vii 89 | viii 90 | ix 91 | x 92 | xi 93 | xii 94 | xiii 95 | xiv 96 | xv 97 | xvi 98 | xvii 99 | xviii 100 | xix 101 | xx 102 | 103 | #Titles and Honorifics 104 | Adj 105 | Adm 106 | Adv 107 | Asst 108 | Bart 109 | Bldg 110 | Brig 111 | Bros 112 | Capt 113 | Cmdr 114 | Col 115 | Comdr 116 | Con 117 | Corp 118 | Cpl 119 | DR 120 | Dr 121 | Ens 122 | Gen 123 | Gov 124 | Hon 125 | Hosp 126 | Insp 127 | Lt 128 | MM 129 | MR 130 | MRS 131 | MS 132 | Maj 133 | Messrs 134 | Mlle 135 | Mme 136 | Mr 137 | Mrs 138 | Ms 139 | Msgr 140 | Op 141 | Ord 142 | Pfc 143 | Ph 144 | Prof 145 | Pvt 146 | Rep 147 | Reps 148 | Res 149 | Rev 150 | Rt 151 | Sen 152 | Sens 153 | Sfc 154 | Sgt 155 | Sr 156 | St 157 | Supt 158 | Surg 159 | 160 | #Misc symbols 161 | Mio 162 | Mrd 163 | bzw 164 | v 165 | vs 166 | usw 167 | d.h 168 | z.B 169 | u.a 170 | etc 171 | Mrd 172 | MwSt 173 | ggf 174 | d.J 175 | D.h 176 | m.E 177 | vgl 178 | I.F 179 | z.T 180 | sogen 181 | ff 182 | u.E 183 | g.U 184 | g.g.A 185 | c.-à-d 186 | Buchst 187 | u.s.w 188 | sog 189 | u.ä 190 | Std 191 | evtl 192 | Zt 193 | Chr 194 | u.U 195 | o.ä 196 | Ltd 197 | b.A 198 | z.Zt 199 | spp 200 | sen 201 | SA 202 | k.o 203 | jun 204 | i.H.v 205 | dgl 206 | dergl 207 | Co 208 | zzt 209 | usf 210 | s.p.a 211 | Dkr 212 | Corp 213 | bzgl 214 | BSE 215 | 216 | #Number indicators 217 | # add #NUMERIC_ONLY# after the word if it should ONLY be non-breaking when a 0-9 digit follows it 218 | No 219 | Nos 220 | Art 221 | Nr 222 | pp 223 | ca 224 | Ca 225 | 226 | #Ordinals are done with . in German - "1." 
= "1st" in English 227 | 1 228 | 2 229 | 3 230 | 4 231 | 5 232 | 6 233 | 7 234 | 8 235 | 9 236 | 10 237 | 11 238 | 12 239 | 13 240 | 14 241 | 15 242 | 16 243 | 17 244 | 18 245 | 19 246 | 20 247 | 21 248 | 22 249 | 23 250 | 24 251 | 25 252 | 26 253 | 27 254 | 28 255 | 29 256 | 30 257 | 31 258 | 32 259 | 33 260 | 34 261 | 35 262 | 36 263 | 37 264 | 38 265 | 39 266 | 40 267 | 41 268 | 42 269 | 43 270 | 44 271 | 45 272 | 46 273 | 47 274 | 48 275 | 49 276 | 50 277 | 51 278 | 52 279 | 53 280 | 54 281 | 55 282 | 56 283 | 57 284 | 58 285 | 59 286 | 60 287 | 61 288 | 62 289 | 63 290 | 64 291 | 65 292 | 66 293 | 67 294 | 68 295 | 69 296 | 70 297 | 71 298 | 72 299 | 73 300 | 74 301 | 75 302 | 76 303 | 77 304 | 78 305 | 79 306 | 80 307 | 81 308 | 82 309 | 83 310 | 84 311 | 85 312 | 86 313 | 87 314 | 88 315 | 89 316 | 90 317 | 91 318 | 92 319 | 93 320 | 94 321 | 95 322 | 96 323 | 97 324 | 98 325 | 99 326 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.en: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 34 | Adj 35 | Adm 36 | Adv 37 | Asst 38 | Bart 39 | Bldg 40 | Brig 41 | Bros 42 | Capt 43 | Cmdr 44 | Col 45 | Comdr 46 | Con 47 | Corp 48 | Cpl 49 | DR 50 | Dr 51 | Drs 52 | Ens 53 | Gen 54 | Gov 55 | Hon 56 | Hr 57 | Hosp 58 | Insp 59 | Lt 60 | MM 61 | MR 62 | MRS 63 | MS 64 | Maj 65 | Messrs 66 | Mlle 67 | Mme 68 | Mr 69 | Mrs 70 | Ms 71 | Msgr 72 | Op 73 | Ord 74 | Pfc 75 | Ph 76 | Prof 77 | Pvt 78 | Rep 79 | Reps 80 | Res 81 | Rev 82 | Rt 83 | Sen 84 | Sens 85 | Sfc 86 | Sgt 87 | Sr 88 | St 89 | Supt 90 | Surg 91 | 92 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 93 | v 94 | vs 95 | i.e 96 | rev 97 | e.g 98 | 99 | #Numbers only. These should only induce breaks when followed by a numeric sequence 100 | # add NUMERIC_ONLY after the word for this function 101 | #This case is mostly for the english "No." which can either be a sentence of its own, or 102 | #if followed by a number, a non-breaking prefix 103 | No #NUMERIC_ONLY# 104 | Nos 105 | Art #NUMERIC_ONLY# 106 | Nr 107 | pp #NUMERIC_ONLY# 108 | 109 | #month abbreviations 110 | Jan 111 | Feb 112 | Mar 113 | Apr 114 | #May is a full word 115 | Jun 116 | Jul 117 | Aug 118 | Sep 119 | Oct 120 | Nov 121 | Dec 122 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.es: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 
3 | 4 | #any single upper case letter followed by a period is not a sentence ender 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | # Period-final abbreviation list from http://www.ctspanish.com/words/abbreviations.htm 34 | 35 | A.C 36 | Apdo 37 | Av 38 | Bco 39 | CC.AA 40 | Da 41 | Dep 42 | Dn 43 | Dr 44 | Dra 45 | EE.UU 46 | Excmo 47 | FF.CC 48 | Fil 49 | Gral 50 | J.C 51 | Let 52 | Lic 53 | N.B 54 | P.D 55 | P.V.P 56 | Prof 57 | Pts 58 | Rte 59 | S.A 60 | S.A.R 61 | S.E 62 | S.L 63 | S.R.C 64 | Sr 65 | Sra 66 | Srta 67 | Sta 68 | Sto 69 | T.V.E 70 | Tel 71 | Ud 72 | Uds 73 | V.B 74 | V.E 75 | Vd 76 | Vds 77 | a/c 78 | adj 79 | admón 80 | afmo 81 | apdo 82 | av 83 | c 84 | c.f 85 | c.g 86 | cap 87 | cm 88 | cta 89 | dcha 90 | doc 91 | ej 92 | entlo 93 | esq 94 | etc 95 | f.c 96 | gr 97 | grs 98 | izq 99 | kg 100 | km 101 | mg 102 | mm 103 | núm 104 | núm 105 | p 106 | p.a 107 | p.ej 108 | ptas 109 | pág 110 | págs 111 | pág 112 | págs 113 | q.e.g.e 114 | q.e.s.m 115 | s 116 | s.s.s 117 | vid 118 | vol 119 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.fi: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT 2 | #indicate an end-of-sentence marker. Special cases are included for prefixes 3 | #that ONLY appear before 0-9 numbers. 4 | 5 | #This list is compiled from omorfi database 6 | #by Tommi A Pirinen. 7 | 8 | 9 | #any single upper case letter followed by a period is not a sentence ender 10 | A 11 | B 12 | C 13 | D 14 | E 15 | F 16 | G 17 | H 18 | I 19 | J 20 | K 21 | L 22 | M 23 | N 24 | O 25 | P 26 | Q 27 | R 28 | S 29 | T 30 | U 31 | V 32 | W 33 | X 34 | Y 35 | Z 36 | Å 37 | Ä 38 | Ö 39 | 40 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 41 | alik 42 | alil 43 | amir 44 | apul 45 | apul.prof 46 | arkkit 47 | ass 48 | assist 49 | dipl 50 | dipl.arkkit 51 | dipl.ekon 52 | dipl.ins 53 | dipl.kielenk 54 | dipl.kirjeenv 55 | dipl.kosm 56 | dipl.urk 57 | dos 58 | erikoiseläinl 59 | erikoishammasl 60 | erikoisl 61 | erikoist 62 | ev.luutn 63 | evp 64 | fil 65 | ft 66 | hallinton 67 | hallintot 68 | hammaslääket 69 | jatk 70 | jääk 71 | kansaned 72 | kapt 73 | kapt.luutn 74 | kenr 75 | kenr.luutn 76 | kenr.maj 77 | kers 78 | kirjeenv 79 | kom 80 | kom.kapt 81 | komm 82 | konst 83 | korpr 84 | luutn 85 | maist 86 | maj 87 | Mr 88 | Mrs 89 | Ms 90 | M.Sc 91 | neuv 92 | nimim 93 | Ph.D 94 | prof 95 | puh.joht 96 | pääll 97 | res 98 | san 99 | siht 100 | suom 101 | sähköp 102 | säv 103 | toht 104 | toim 105 | toim.apul 106 | toim.joht 107 | toim.siht 108 | tuom 109 | ups 110 | vänr 111 | vääp 112 | ye.ups 113 | ylik 114 | ylil 115 | ylim 116 | ylimatr 117 | yliop 118 | yliopp 119 | ylip 120 | yliv 121 | 122 | #misc - odd period-ending items that NEVER indicate breaks (p.m. 
does NOT fall 123 | #into this category - it sometimes ends a sentence) 124 | e.g 125 | ent 126 | esim 127 | huom 128 | i.e 129 | ilm 130 | l 131 | mm 132 | myöh 133 | nk 134 | nyk 135 | par 136 | po 137 | t 138 | v 139 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.fr: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | # 4 | #any single upper case letter followed by a period is not a sentence ender 5 | #usually upper case letters are initials in a name 6 | #no French words end in single lower-case letters, so we throw those in too? 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | # Period-final abbreviation list for French 61 | A.C.N 62 | A.M 63 | art 64 | ann 65 | apr 66 | av 67 | auj 68 | lib 69 | B.P 70 | boul 71 | ca 72 | c.-à-d 73 | cf 74 | ch.-l 75 | chap 76 | contr 77 | C.P.I 78 | C.Q.F.D 79 | C.N 80 | C.N.S 81 | C.S 82 | dir 83 | éd 84 | e.g 85 | env 86 | al 87 | etc 88 | E.V 89 | ex 90 | fasc 91 | fém 92 | fig 93 | fr 94 | hab 95 | ibid 96 | id 97 | i.e 98 | inf 99 | LL.AA 100 | LL.AA.II 101 | LL.AA.RR 102 | LL.AA.SS 103 | L.D 104 | LL.EE 105 | LL.MM 106 | LL.MM.II.RR 107 | loc.cit 108 | masc 109 | MM 110 | ms 111 | N.B 112 | N.D.A 113 | N.D.L.R 114 | N.D.T 115 | n/réf 116 | NN.SS 117 | N.S 118 | N.D 119 | N.P.A.I 120 | p.c.c 121 | pl 122 | pp 123 | p.ex 124 | p.j 125 | P.S 126 | R.A.S 127 | R.-V 128 | R.P 129 | R.I.P 130 | SS 131 | S.S 132 | S.A 133 | S.A.I 134 | S.A.R 135 | S.A.S 136 | S.E 137 | sec 138 | sect 139 | sing 140 | S.M 141 | S.M.I.R 142 | sq 143 | sqq 144 | suiv 145 | sup 146 | suppl 147 | tél 148 | T.S.V.P 149 | vb 150 | vol 151 | vs 152 | X.O 153 | Z.I 154 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.hu: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | Á 33 | É 34 | Í 35 | Ó 36 | Ö 37 | Ő 38 | Ú 39 | Ü 40 | Ű 41 | 42 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 43 | Dr 44 | dr 45 | kb 46 | Kb 47 | vö 48 | Vö 49 | pl 50 | Pl 51 | ca 52 | Ca 53 | min 54 | Min 55 | max 56 | Max 57 | ún 58 | Ún 59 | prof 60 | Prof 61 | de 62 | De 63 | du 64 | Du 65 | Szt 66 | St 67 | 68 | #Numbers only. 
These should only induce breaks when followed by a numeric sequence 69 | # add NUMERIC_ONLY after the word for this function 70 | #This case is mostly for the english "No." which can either be a sentence of its own, or 71 | #if followed by a number, a non-breaking prefix 72 | 73 | # Month name abbreviations 74 | jan #NUMERIC_ONLY# 75 | Jan #NUMERIC_ONLY# 76 | Feb #NUMERIC_ONLY# 77 | feb #NUMERIC_ONLY# 78 | márc #NUMERIC_ONLY# 79 | Márc #NUMERIC_ONLY# 80 | ápr #NUMERIC_ONLY# 81 | Ápr #NUMERIC_ONLY# 82 | máj #NUMERIC_ONLY# 83 | Máj #NUMERIC_ONLY# 84 | jún #NUMERIC_ONLY# 85 | Jún #NUMERIC_ONLY# 86 | Júl #NUMERIC_ONLY# 87 | júl #NUMERIC_ONLY# 88 | aug #NUMERIC_ONLY# 89 | Aug #NUMERIC_ONLY# 90 | Szept #NUMERIC_ONLY# 91 | szept #NUMERIC_ONLY# 92 | okt #NUMERIC_ONLY# 93 | Okt #NUMERIC_ONLY# 94 | nov #NUMERIC_ONLY# 95 | Nov #NUMERIC_ONLY# 96 | dec #NUMERIC_ONLY# 97 | Dec #NUMERIC_ONLY# 98 | 99 | # Other abbreviations 100 | tel #NUMERIC_ONLY# 101 | Tel #NUMERIC_ONLY# 102 | Fax #NUMERIC_ONLY# 103 | fax #NUMERIC_ONLY# 104 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.is: -------------------------------------------------------------------------------- 1 | no #NUMERIC_ONLY# 2 | No #NUMERIC_ONLY# 3 | nr #NUMERIC_ONLY# 4 | Nr #NUMERIC_ONLY# 5 | nR #NUMERIC_ONLY# 6 | NR #NUMERIC_ONLY# 7 | a 8 | b 9 | c 10 | d 11 | e 12 | f 13 | g 14 | h 15 | i 16 | j 17 | k 18 | l 19 | m 20 | n 21 | o 22 | p 23 | q 24 | r 25 | s 26 | t 27 | u 28 | v 29 | w 30 | x 31 | y 32 | z 33 | ^ 34 | í 35 | á 36 | ó 37 | æ 38 | A 39 | B 40 | C 41 | D 42 | E 43 | F 44 | G 45 | H 46 | I 47 | J 48 | K 49 | L 50 | M 51 | N 52 | O 53 | P 54 | Q 55 | R 56 | S 57 | T 58 | U 59 | V 60 | W 61 | X 62 | Y 63 | Z 64 | ab.fn 65 | a.fn 66 | afs 67 | al 68 | alm 69 | alg 70 | andh 71 | ath 72 | aths 73 | atr 74 | ao 75 | au 76 | aukaf 77 | áfn 78 | áhrl.s 79 | áhrs 80 | ákv.gr 81 | ákv 82 | bh 83 | bls 84 | dr 85 | e.Kr 86 | et 87 | ef 88 | efn 89 | ennfr 90 | eink 91 | end 92 | e.st 93 | erl 94 | fél 95 | fskj 96 | fh 97 | f.hl 98 | físl 99 | fl 100 | fn 101 | fo 102 | forl 103 | frb 104 | frl 105 | frh 106 | frt 107 | fsl 108 | fsh 109 | fs 110 | fsk 111 | fst 112 | f.Kr 113 | ft 114 | fv 115 | fyrrn 116 | fyrrv 117 | germ 118 | gm 119 | gr 120 | hdl 121 | hdr 122 | hf 123 | hl 124 | hlsk 125 | hljsk 126 | hljv 127 | hljóðv 128 | hr 129 | hv 130 | hvk 131 | holl 132 | Hos 133 | höf 134 | hk 135 | hrl 136 | ísl 137 | kaf 138 | kap 139 | Khöfn 140 | kk 141 | kg 142 | kk 143 | km 144 | kl 145 | klst 146 | kr 147 | kt 148 | kgúrsk 149 | kvk 150 | leturbr 151 | lh 152 | lh.nt 153 | lh.þt 154 | lo 155 | ltr 156 | mlja 157 | mljó 158 | millj 159 | mm 160 | mms 161 | m.fl 162 | miðm 163 | mgr 164 | mst 165 | mín 166 | nf 167 | nh 168 | nhm 169 | nl 170 | nk 171 | nmgr 172 | no 173 | núv 174 | nt 175 | o.áfr 176 | o.m.fl 177 | ohf 178 | o.fl 179 | o.s.frv 180 | ófn 181 | ób 182 | óákv.gr 183 | óákv 184 | pfn 185 | PR 186 | pr 187 | Ritstj 188 | Rvík 189 | Rvk 190 | samb 191 | samhlj 192 | samn 193 | samn 194 | sbr 195 | sek 196 | sérn 197 | sf 198 | sfn 199 | sh 200 | sfn 201 | sh 202 | s.hl 203 | sk 204 | skv 205 | sl 206 | sn 207 | so 208 | ss.us 209 | s.st 210 | samþ 211 | sbr 212 | shlj 213 | sign 214 | skál 215 | st 216 | st.s 217 | stk 218 | sþ 219 | teg 220 | tbl 221 | tfn 222 | tl 223 | tvíhlj 224 | tvt 225 | till 226 | to 227 | umr 228 | uh 229 | us 230 | uppl 231 | útg 232 | vb 233 | Vf 234 | vh 235 | vkf 236 | Vl 237 | vl 238 | vlf 239 | vmf 240 | 8vo 241 | 
vsk 242 | vth 243 | þt 244 | þf 245 | þjs 246 | þgf 247 | þlt 248 | þolm 249 | þm 250 | þml 251 | þýð 252 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.it: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 34 | Adj 35 | Adm 36 | Adv 37 | Amn 38 | Arch 39 | Asst 40 | Avv 41 | Bart 42 | Bcc 43 | Bldg 44 | Brig 45 | Bros 46 | C.A.P 47 | C.P 48 | Capt 49 | Cc 50 | Cmdr 51 | Co 52 | Col 53 | Comdr 54 | Con 55 | Corp 56 | Cpl 57 | DR 58 | Dott 59 | Dr 60 | Drs 61 | Egr 62 | Ens 63 | Gen 64 | Geom 65 | Gov 66 | Hon 67 | Hosp 68 | Hr 69 | Id 70 | Ing 71 | Insp 72 | Lt 73 | MM 74 | MR 75 | MRS 76 | MS 77 | Maj 78 | Messrs 79 | Mlle 80 | Mme 81 | Mo 82 | Mons 83 | Mr 84 | Mrs 85 | Ms 86 | Msgr 87 | N.B 88 | Op 89 | Ord 90 | P.S 91 | P.T 92 | Pfc 93 | Ph 94 | Prof 95 | Pvt 96 | RP 97 | RSVP 98 | Rag 99 | Rep 100 | Reps 101 | Res 102 | Rev 103 | Rif 104 | Rt 105 | S.A 106 | S.B.F 107 | S.P.M 108 | S.p.A 109 | S.r.l 110 | Sen 111 | Sens 112 | Sfc 113 | Sgt 114 | Sig 115 | Sigg 116 | Soc 117 | Spett 118 | Sr 119 | St 120 | Supt 121 | Surg 122 | V.P 123 | 124 | # other 125 | a.c 126 | acc 127 | all 128 | banc 129 | c.a 130 | c.c.p 131 | c.m 132 | c.p 133 | c.s 134 | c.v 135 | corr 136 | dott 137 | e.p.c 138 | ecc 139 | es 140 | fatt 141 | gg 142 | int 143 | lett 144 | ogg 145 | on 146 | p.c 147 | p.c.c 148 | p.es 149 | p.f 150 | p.r 151 | p.v 152 | post 153 | pp 154 | racc 155 | ric 156 | s.n.c 157 | seg 158 | sgg 159 | ss 160 | tel 161 | u.s 162 | v.r 163 | v.s 164 | 165 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 166 | v 167 | vs 168 | i.e 169 | rev 170 | e.g 171 | 172 | #Numbers only. These should only induce breaks when followed by a numeric sequence 173 | # add NUMERIC_ONLY after the word for this function 174 | #This case is mostly for the english "No." which can either be a sentence of its own, or 175 | #if followed by a number, a non-breaking prefix 176 | No #NUMERIC_ONLY# 177 | Nos 178 | Art #NUMERIC_ONLY# 179 | Nr 180 | pp #NUMERIC_ONLY# 181 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.lv: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 
3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | Ā 8 | B 9 | C 10 | Č 11 | D 12 | E 13 | Ē 14 | F 15 | G 16 | Ģ 17 | H 18 | I 19 | Ī 20 | J 21 | K 22 | Ķ 23 | L 24 | Ļ 25 | M 26 | N 27 | Ņ 28 | O 29 | P 30 | Q 31 | R 32 | S 33 | Š 34 | T 35 | U 36 | Ū 37 | V 38 | W 39 | X 40 | Y 41 | Z 42 | Ž 43 | 44 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 45 | dr 46 | Dr 47 | med 48 | prof 49 | Prof 50 | inž 51 | Inž 52 | ist.loc 53 | Ist.loc 54 | kor.loc 55 | Kor.loc 56 | v.i 57 | vietn 58 | Vietn 59 | 60 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 61 | a.l 62 | t.p 63 | pārb 64 | Pārb 65 | vec 66 | Vec 67 | inv 68 | Inv 69 | sk 70 | Sk 71 | spec 72 | Spec 73 | vienk 74 | Vienk 75 | virz 76 | Virz 77 | māksl 78 | Māksl 79 | mūz 80 | Mūz 81 | akad 82 | Akad 83 | soc 84 | Soc 85 | galv 86 | Galv 87 | vad 88 | Vad 89 | sertif 90 | Sertif 91 | folkl 92 | Folkl 93 | hum 94 | Hum 95 | 96 | #Numbers only. These should only induce breaks when followed by a numeric sequence 97 | # add NUMERIC_ONLY after the word for this function 98 | #This case is mostly for the english "No." which can either be a sentence of its own, or 99 | #if followed by a number, a non-breaking prefix 100 | Nr #NUMERIC_ONLY# 101 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.nl: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | #Sources: http://nl.wikipedia.org/wiki/Lijst_van_afkortingen 4 | # http://nl.wikipedia.org/wiki/Aanspreekvorm 5 | # http://nl.wikipedia.org/wiki/Titulatuur_in_het_Nederlands_hoger_onderwijs 6 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 7 | #usually upper case letters are initials in a name 8 | A 9 | B 10 | C 11 | D 12 | E 13 | F 14 | G 15 | H 16 | I 17 | J 18 | K 19 | L 20 | M 21 | N 22 | O 23 | P 24 | Q 25 | R 26 | S 27 | T 28 | U 29 | V 30 | W 31 | X 32 | Y 33 | Z 34 | 35 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 36 | bacc 37 | bc 38 | bgen 39 | c.i 40 | dhr 41 | dr 42 | dr.h.c 43 | drs 44 | drs 45 | ds 46 | eint 47 | fa 48 | Fa 49 | fam 50 | gen 51 | genm 52 | ing 53 | ir 54 | jhr 55 | jkvr 56 | jr 57 | kand 58 | kol 59 | lgen 60 | lkol 61 | Lt 62 | maj 63 | Mej 64 | mevr 65 | Mme 66 | mr 67 | mr 68 | Mw 69 | o.b.s 70 | plv 71 | prof 72 | ritm 73 | tint 74 | Vz 75 | Z.D 76 | Z.D.H 77 | Z.E 78 | Z.Em 79 | Z.H 80 | Z.K.H 81 | Z.K.M 82 | Z.M 83 | z.v 84 | 85 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 86 | #we seem to have a lot of these in dutch i.e.: i.p.v - in plaats van (in stead of) never ends a sentence 87 | a.g.v 88 | bijv 89 | bijz 90 | bv 91 | d.w.z 92 | e.c 93 | e.g 94 | e.k 95 | ev 96 | i.p.v 97 | i.s.m 98 | i.t.t 99 | i.v.m 100 | m.a.w 101 | m.b.t 102 | m.b.v 103 | m.h.o 104 | m.i 105 | m.i.v 106 | v.w.t 107 | 108 | #Numbers only. 
These should only induce breaks when followed by a numeric sequence 109 | # add NUMERIC_ONLY after the word for this function 110 | #This case is mostly for the english "No." which can either be a sentence of its own, or 111 | #if followed by a number, a non-breaking prefix 112 | Nr #NUMERIC_ONLY# 113 | Nrs 114 | nrs 115 | nr #NUMERIC_ONLY# 116 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.pl: -------------------------------------------------------------------------------- 1 | adw 2 | afr 3 | akad 4 | al 5 | Al 6 | am 7 | amer 8 | arch 9 | art 10 | Art 11 | artyst 12 | astr 13 | austr 14 | bałt 15 | bdb 16 | bł 17 | bm 18 | br 19 | bryg 20 | bryt 21 | centr 22 | ces 23 | chem 24 | chiń 25 | chir 26 | c.k 27 | c.o 28 | cyg 29 | cyw 30 | cyt 31 | czes 32 | czw 33 | cd 34 | Cd 35 | czyt 36 | ćw 37 | ćwicz 38 | daw 39 | dcn 40 | dekl 41 | demokr 42 | det 43 | diec 44 | dł 45 | dn 46 | dot 47 | dol 48 | dop 49 | dost 50 | dosł 51 | h.c 52 | ds 53 | dst 54 | duszp 55 | dypl 56 | egz 57 | ekol 58 | ekon 59 | elektr 60 | em 61 | ew 62 | fab 63 | farm 64 | fot 65 | fr 66 | gat 67 | gastr 68 | geogr 69 | geol 70 | gimn 71 | głęb 72 | gm 73 | godz 74 | górn 75 | gosp 76 | gr 77 | gram 78 | hist 79 | hiszp 80 | hr 81 | Hr 82 | hot 83 | id 84 | in 85 | im 86 | iron 87 | jn 88 | kard 89 | kat 90 | katol 91 | k.k 92 | kk 93 | kol 94 | kl 95 | k.p.a 96 | kpc 97 | k.p.c 98 | kpt 99 | kr 100 | k.r 101 | krak 102 | k.r.o 103 | kryt 104 | kult 105 | laic 106 | łac 107 | niem 108 | woj 109 | nb 110 | np 111 | Nb 112 | Np 113 | pol 114 | pow 115 | m.in 116 | pt 117 | ps 118 | Pt 119 | Ps 120 | cdn 121 | jw 122 | ryc 123 | rys 124 | Ryc 125 | Rys 126 | tj 127 | tzw 128 | Tzw 129 | tzn 130 | zob 131 | ang 132 | ub 133 | ul 134 | pw 135 | pn 136 | pl 137 | al 138 | k 139 | n 140 | nr #NUMERIC_ONLY# 141 | Nr #NUMERIC_ONLY# 142 | ww 143 | wł 144 | ur 145 | zm 146 | żyd 147 | żarg 148 | żyw 149 | wył 150 | bp 151 | bp 152 | wyst 153 | tow 154 | Tow 155 | o 156 | sp 157 | Sp 158 | st 159 | spółdz 160 | Spółdz 161 | społ 162 | spółgł 163 | stoł 164 | stow 165 | Stoł 166 | Stow 167 | zn 168 | zew 169 | zewn 170 | zdr 171 | zazw 172 | zast 173 | zaw 174 | zał 175 | zal 176 | zam 177 | zak 178 | zakł 179 | zagr 180 | zach 181 | adw 182 | Adw 183 | lek 184 | Lek 185 | med 186 | mec 187 | Mec 188 | doc 189 | Doc 190 | dyw 191 | dyr 192 | Dyw 193 | Dyr 194 | inż 195 | Inż 196 | mgr 197 | Mgr 198 | dh 199 | dr 200 | Dh 201 | Dr 202 | p 203 | P 204 | red 205 | Red 206 | prof 207 | prok 208 | Prof 209 | Prok 210 | hab 211 | płk 212 | Płk 213 | nadkom 214 | Nadkom 215 | podkom 216 | Podkom 217 | ks 218 | Ks 219 | gen 220 | Gen 221 | por 222 | Por 223 | reż 224 | Reż 225 | przyp 226 | Przyp 227 | śp 228 | św 229 | śW 230 | Śp 231 | Św 232 | ŚW 233 | szer 234 | Szer 235 | pkt #NUMERIC_ONLY# 236 | str #NUMERIC_ONLY# 237 | tab #NUMERIC_ONLY# 238 | Tab #NUMERIC_ONLY# 239 | tel 240 | ust #NUMERIC_ONLY# 241 | par #NUMERIC_ONLY# 242 | poz 243 | pok 244 | oo 245 | oO 246 | Oo 247 | OO 248 | r #NUMERIC_ONLY# 249 | l #NUMERIC_ONLY# 250 | s #NUMERIC_ONLY# 251 | najśw 252 | Najśw 253 | A 254 | B 255 | C 256 | D 257 | E 258 | F 259 | G 260 | H 261 | I 262 | J 263 | K 264 | L 265 | M 266 | N 267 | O 268 | P 269 | Q 270 | R 271 | S 272 | T 273 | U 274 | V 275 | W 276 | X 277 | Y 278 | Z 279 | Ś 280 | Ć 281 | Ż 282 | Ź 283 | Dz 284 | -------------------------------------------------------------------------------- 
/preprocess/nonbreaking_prefixes/nonbreaking_prefix.pt: -------------------------------------------------------------------------------- 1 | #File adapted for PT by H. Leal Fontes from the EN & DE versions published with moses-2009-04-13. Last update: 10.11.2009. 2 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 3 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 4 | 5 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 6 | #usually upper case letters are initials in a name 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | 61 | #Roman Numerals. A dot after one of these is not a sentence break in Portuguese. 62 | I 63 | II 64 | III 65 | IV 66 | V 67 | VI 68 | VII 69 | VIII 70 | IX 71 | X 72 | XI 73 | XII 74 | XIII 75 | XIV 76 | XV 77 | XVI 78 | XVII 79 | XVIII 80 | XIX 81 | XX 82 | i 83 | ii 84 | iii 85 | iv 86 | v 87 | vi 88 | vii 89 | viii 90 | ix 91 | x 92 | xi 93 | xii 94 | xiii 95 | xiv 96 | xv 97 | xvi 98 | xvii 99 | xviii 100 | xix 101 | xx 102 | 103 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 104 | Adj 105 | Adm 106 | Adv 107 | Art 108 | Ca 109 | Capt 110 | Cmdr 111 | Col 112 | Comdr 113 | Con 114 | Corp 115 | Cpl 116 | DR 117 | DRA 118 | Dr 119 | Dra 120 | Dras 121 | Drs 122 | Eng 123 | Enga 124 | Engas 125 | Engos 126 | Ex 127 | Exo 128 | Exmo 129 | Fig 130 | Gen 131 | Hosp 132 | Insp 133 | Lda 134 | MM 135 | MR 136 | MRS 137 | MS 138 | Maj 139 | Mrs 140 | Ms 141 | Msgr 142 | Op 143 | Ord 144 | Pfc 145 | Ph 146 | Prof 147 | Pvt 148 | Rep 149 | Reps 150 | Res 151 | Rev 152 | Rt 153 | Sen 154 | Sens 155 | Sfc 156 | Sgt 157 | Sr 158 | Sra 159 | Sras 160 | Srs 161 | Sto 162 | Supt 163 | Surg 164 | adj 165 | adm 166 | adv 167 | art 168 | cit 169 | col 170 | con 171 | corp 172 | cpl 173 | dr 174 | dra 175 | dras 176 | drs 177 | eng 178 | enga 179 | engas 180 | engos 181 | ex 182 | exo 183 | exmo 184 | fig 185 | op 186 | prof 187 | sr 188 | sra 189 | sras 190 | srs 191 | sto 192 | 193 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 194 | v 195 | vs 196 | i.e 197 | rev 198 | e.g 199 | 200 | #Numbers only. These should only induce breaks when followed by a numeric sequence 201 | # add NUMERIC_ONLY after the word for this function 202 | #This case is mostly for the english "No." 
which can either be a sentence of its own, or 203 | #if followed by a number, a non-breaking prefix 204 | No #NUMERIC_ONLY# 205 | Nos 206 | Art #NUMERIC_ONLY# 207 | Nr 208 | p #NUMERIC_ONLY# 209 | pp #NUMERIC_ONLY# 210 | 211 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.ro: -------------------------------------------------------------------------------- 1 | A 2 | B 3 | C 4 | D 5 | E 6 | F 7 | G 8 | H 9 | I 10 | J 11 | K 12 | L 13 | M 14 | N 15 | O 16 | P 17 | Q 18 | R 19 | S 20 | T 21 | U 22 | V 23 | W 24 | X 25 | Y 26 | Z 27 | dpdv 28 | etc 29 | șamd 30 | M.Ap.N 31 | dl 32 | Dl 33 | d-na 34 | D-na 35 | dvs 36 | Dvs 37 | pt 38 | Pt 39 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.ru: -------------------------------------------------------------------------------- 1 | # added Cyrillic uppercase letters [А-Я] 2 | # removed 000D carriage return (this is not removed by chomp in tokenizer.perl, and prevents recognition of the prefixes) 3 | # edited by Kate Young (nspaceanalysis@earthlink.net) 21 May 2013 4 | А 5 | Б 6 | В 7 | Г 8 | Д 9 | Е 10 | Ж 11 | З 12 | И 13 | Й 14 | К 15 | Л 16 | М 17 | Н 18 | О 19 | П 20 | Р 21 | С 22 | Т 23 | У 24 | Ф 25 | Х 26 | Ц 27 | Ч 28 | Ш 29 | Щ 30 | Ъ 31 | Ы 32 | Ь 33 | Э 34 | Ю 35 | Я 36 | A 37 | B 38 | C 39 | D 40 | E 41 | F 42 | G 43 | H 44 | I 45 | J 46 | K 47 | L 48 | M 49 | N 50 | O 51 | P 52 | Q 53 | R 54 | S 55 | T 56 | U 57 | V 58 | W 59 | X 60 | Y 61 | Z 62 | 0гг 63 | 1гг 64 | 2гг 65 | 3гг 66 | 4гг 67 | 5гг 68 | 6гг 69 | 7гг 70 | 8гг 71 | 9гг 72 | 0г 73 | 1г 74 | 2г 75 | 3г 76 | 4г 77 | 5г 78 | 6г 79 | 7г 80 | 8г 81 | 9г 82 | Xвв 83 | Vвв 84 | Iвв 85 | Lвв 86 | Mвв 87 | Cвв 88 | Xв 89 | Vв 90 | Iв 91 | Lв 92 | Mв 93 | Cв 94 | 0м 95 | 1м 96 | 2м 97 | 3м 98 | 4м 99 | 5м 100 | 6м 101 | 7м 102 | 8м 103 | 9м 104 | 0мм 105 | 1мм 106 | 2мм 107 | 3мм 108 | 4мм 109 | 5мм 110 | 6мм 111 | 7мм 112 | 8мм 113 | 9мм 114 | 0см 115 | 1см 116 | 2см 117 | 3см 118 | 4см 119 | 5см 120 | 6см 121 | 7см 122 | 8см 123 | 9см 124 | 0дм 125 | 1дм 126 | 2дм 127 | 3дм 128 | 4дм 129 | 5дм 130 | 6дм 131 | 7дм 132 | 8дм 133 | 9дм 134 | 0л 135 | 1л 136 | 2л 137 | 3л 138 | 4л 139 | 5л 140 | 6л 141 | 7л 142 | 8л 143 | 9л 144 | 0км 145 | 1км 146 | 2км 147 | 3км 148 | 4км 149 | 5км 150 | 6км 151 | 7км 152 | 8км 153 | 9км 154 | 0га 155 | 1га 156 | 2га 157 | 3га 158 | 4га 159 | 5га 160 | 6га 161 | 7га 162 | 8га 163 | 9га 164 | 0кг 165 | 1кг 166 | 2кг 167 | 3кг 168 | 4кг 169 | 5кг 170 | 6кг 171 | 7кг 172 | 8кг 173 | 9кг 174 | 0т 175 | 1т 176 | 2т 177 | 3т 178 | 4т 179 | 5т 180 | 6т 181 | 7т 182 | 8т 183 | 9т 184 | 0г 185 | 1г 186 | 2г 187 | 3г 188 | 4г 189 | 5г 190 | 6г 191 | 7г 192 | 8г 193 | 9г 194 | 0мг 195 | 1мг 196 | 2мг 197 | 3мг 198 | 4мг 199 | 5мг 200 | 6мг 201 | 7мг 202 | 8мг 203 | 9мг 204 | бульв 205 | в 206 | вв 207 | г 208 | га 209 | гг 210 | гл 211 | гос 212 | д 213 | дм 214 | доп 215 | др 216 | е 217 | ед 218 | ед 219 | зам 220 | и 221 | инд 222 | исп 223 | Исп 224 | к 225 | кап 226 | кг 227 | кв 228 | кл 229 | км 230 | кол 231 | комн 232 | коп 233 | куб 234 | л 235 | лиц 236 | лл 237 | м 238 | макс 239 | мг 240 | мин 241 | мл 242 | млн 243 | млрд 244 | мм 245 | н 246 | наб 247 | нач 248 | неуд 249 | ном 250 | о 251 | обл 252 | обр 253 | общ 254 | ок 255 | ост 256 | отл 257 | п 258 | пер 259 | перераб 260 | пл 261 | пос 262 | пр 263 | просп 264 | проф 265 | р 266 | ред 267 | руб 268 | с 269 | сб 270 | св 
271 | см 272 | соч 273 | ср 274 | ст 275 | стр 276 | т 277 | тел 278 | Тел 279 | тех 280 | тт 281 | туп 282 | тыс 283 | уд 284 | ул 285 | уч 286 | физ 287 | х 288 | хор 289 | ч 290 | чел 291 | шт 292 | экз 293 | э 294 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.sk: -------------------------------------------------------------------------------- 1 | Bc 2 | Mgr 3 | RNDr 4 | PharmDr 5 | PhDr 6 | JUDr 7 | PaedDr 8 | ThDr 9 | Ing 10 | MUDr 11 | MDDr 12 | MVDr 13 | Dr 14 | ThLic 15 | PhD 16 | ArtD 17 | ThDr 18 | Dr 19 | DrSc 20 | CSs 21 | prof 22 | obr 23 | Obr 24 | Č 25 | č 26 | absol 27 | adj 28 | admin 29 | adr 30 | Adr 31 | adv 32 | advok 33 | afr 34 | ak 35 | akad 36 | akc 37 | akuz 38 | et 39 | al 40 | alch 41 | amer 42 | anat 43 | angl 44 | Angl 45 | anglosas 46 | anorg 47 | ap 48 | apod 49 | arch 50 | archeol 51 | archit 52 | arg 53 | art 54 | astr 55 | astrol 56 | astron 57 | atp 58 | atď 59 | austr 60 | Austr 61 | aut 62 | belg 63 | Belg 64 | bibl 65 | Bibl 66 | biol 67 | bot 68 | bud 69 | bás 70 | býv 71 | cest 72 | chem 73 | cirk 74 | csl 75 | čs 76 | Čs 77 | dat 78 | dep 79 | det 80 | dial 81 | diaľ 82 | dipl 83 | distrib 84 | dokl 85 | dosl 86 | dopr 87 | dram 88 | duš 89 | dv 90 | dvojčl 91 | dór 92 | ekol 93 | ekon 94 | el 95 | elektr 96 | elektrotech 97 | energet 98 | epic 99 | est 100 | etc 101 | etonym 102 | eufem 103 | európ 104 | Európ 105 | ev 106 | evid 107 | expr 108 | fa 109 | fam 110 | farm 111 | fem 112 | feud 113 | fil 114 | filat 115 | filoz 116 | fi 117 | fon 118 | form 119 | fot 120 | fr 121 | Fr 122 | franc 123 | Franc 124 | fraz 125 | fut 126 | fyz 127 | fyziol 128 | garb 129 | gen 130 | genet 131 | genpor 132 | geod 133 | geogr 134 | geol 135 | geom 136 | germ 137 | gr 138 | Gr 139 | gréc 140 | Gréc 141 | gréckokat 142 | hebr 143 | herald 144 | hist 145 | hlav 146 | hosp 147 | hromad 148 | hud 149 | hypok 150 | ident 151 | i.e 152 | ident 153 | imp 154 | impf 155 | indoeur 156 | inf 157 | inform 158 | instr 159 | int 160 | interj 161 | inšt 162 | inštr 163 | iron 164 | jap 165 | Jap 166 | jaz 167 | jedn 168 | juhoamer 169 | juhových 170 | juhozáp 171 | juž 172 | kanad 173 | Kanad 174 | kanc 175 | kapit 176 | kpt 177 | kart 178 | katastr 179 | knih 180 | kniž 181 | komp 182 | konj 183 | konkr 184 | kozmet 185 | krajč 186 | kresť 187 | kt 188 | kuch 189 | lat 190 | latinskoamer 191 | lek 192 | lex 193 | lingv 194 | lit 195 | litur 196 | log 197 | lok 198 | max 199 | Max 200 | maď 201 | Maď 202 | medzinár 203 | mest 204 | metr 205 | mil 206 | Mil 207 | min 208 | Min 209 | miner 210 | ml 211 | mld 212 | mn 213 | mod 214 | mytol 215 | napr 216 | nar 217 | Nar 218 | nasl 219 | nedok 220 | neg 221 | negat 222 | neklas 223 | nem 224 | Nem 225 | neodb 226 | neos 227 | neskl 228 | nesklon 229 | nespis 230 | nespráv 231 | neved 232 | než 233 | niekt 234 | niž 235 | nom 236 | náb 237 | nákl 238 | námor 239 | nár 240 | obch 241 | obj 242 | obv 243 | obyč 244 | obč 245 | občian 246 | odb 247 | odd 248 | ods 249 | ojed 250 | okr 251 | Okr 252 | opt 253 | opyt 254 | org 255 | os 256 | osob 257 | ot 258 | ovoc 259 | par 260 | part 261 | pejor 262 | pers 263 | pf 264 | Pf 265 | P.f 266 | p.f 267 | pl 268 | Plk 269 | pod 270 | podst 271 | pokl 272 | polit 273 | politol 274 | polygr 275 | pomn 276 | popl 277 | por 278 | porad 279 | porov 280 | posch 281 | potrav 282 | použ 283 | poz 284 | pozit 285 | poľ 286 | poľno 287 | poľnohosp 288 | poľov 289 | pošt 290 | pož 291 | prac 292 | predl 293 | 
pren 294 | prep 295 | preuk 296 | priezv 297 | Priezv 298 | privl 299 | prof 300 | práv 301 | príd 302 | príj 303 | prík 304 | príp 305 | prír 306 | prísl 307 | príslov 308 | príč 309 | psych 310 | publ 311 | pís 312 | písm 313 | pôv 314 | refl 315 | reg 316 | rep 317 | resp 318 | rozk 319 | rozlič 320 | rozpráv 321 | roč 322 | Roč 323 | ryb 324 | rádiotech 325 | rím 326 | samohl 327 | semest 328 | sev 329 | severoamer 330 | severových 331 | severozáp 332 | sg 333 | skr 334 | skup 335 | sl 336 | Sloven 337 | soc 338 | soch 339 | sociol 340 | sp 341 | spol 342 | Spol 343 | spoloč 344 | spoluhl 345 | správ 346 | spôs 347 | st 348 | star 349 | starogréc 350 | starorím 351 | s.r.o 352 | stol 353 | stor 354 | str 355 | stredoamer 356 | stredoškol 357 | subj 358 | subst 359 | superl 360 | sv 361 | sz 362 | súkr 363 | súp 364 | súvzť 365 | tal 366 | Tal 367 | tech 368 | tel 369 | Tel 370 | telef 371 | teles 372 | telev 373 | teol 374 | trans 375 | turist 376 | tuzem 377 | typogr 378 | tzn 379 | tzv 380 | ukaz 381 | ul 382 | Ul 383 | umel 384 | univ 385 | ust 386 | ved 387 | vedľ 388 | verb 389 | veter 390 | vin 391 | viď 392 | vl 393 | vod 394 | vodohosp 395 | pnl 396 | vulg 397 | vyj 398 | vys 399 | vysokoškol 400 | vzťaž 401 | vôb 402 | vých 403 | výd 404 | výrob 405 | výsk 406 | výsl 407 | výtv 408 | výtvar 409 | význ 410 | včel 411 | vš 412 | všeob 413 | zahr 414 | zar 415 | zariad 416 | zast 417 | zastar 418 | zastaráv 419 | zb 420 | zdravot 421 | združ 422 | zjemn 423 | zlat 424 | zn 425 | Zn 426 | zool 427 | zr 428 | zried 429 | zv 430 | záhr 431 | zák 432 | zákl 433 | zám 434 | záp 435 | západoeur 436 | zázn 437 | územ 438 | účt 439 | čast 440 | čes 441 | Čes 442 | čl 443 | čísl 444 | živ 445 | pr 446 | fak 447 | Kr 448 | p.n.l 449 | A 450 | B 451 | C 452 | D 453 | E 454 | F 455 | G 456 | H 457 | I 458 | J 459 | K 460 | L 461 | M 462 | N 463 | O 464 | P 465 | Q 466 | R 467 | S 468 | T 469 | U 470 | V 471 | W 472 | X 473 | Y 474 | Z 475 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.sl: -------------------------------------------------------------------------------- 1 | dr 2 | Dr 3 | itd 4 | itn 5 | št #NUMERIC_ONLY# 6 | Št #NUMERIC_ONLY# 7 | d 8 | jan 9 | Jan 10 | feb 11 | Feb 12 | mar 13 | Mar 14 | apr 15 | Apr 16 | jun 17 | Jun 18 | jul 19 | Jul 20 | avg 21 | Avg 22 | sept 23 | Sept 24 | sep 25 | Sep 26 | okt 27 | Okt 28 | nov 29 | Nov 30 | dec 31 | Dec 32 | tj 33 | Tj 34 | npr 35 | Npr 36 | sl 37 | Sl 38 | op 39 | Op 40 | gl 41 | Gl 42 | oz 43 | Oz 44 | prev 45 | dipl 46 | ing 47 | prim 48 | Prim 49 | cf 50 | Cf 51 | gl 52 | Gl 53 | A 54 | B 55 | C 56 | D 57 | E 58 | F 59 | G 60 | H 61 | I 62 | J 63 | K 64 | L 65 | M 66 | N 67 | O 68 | P 69 | Q 70 | R 71 | S 72 | T 73 | U 74 | V 75 | W 76 | X 77 | Y 78 | Z 79 | -------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.sv: -------------------------------------------------------------------------------- 1 | #single upper case letter are usually initials 2 | A 3 | B 4 | C 5 | D 6 | E 7 | F 8 | G 9 | H 10 | I 11 | J 12 | K 13 | L 14 | M 15 | N 16 | O 17 | P 18 | Q 19 | R 20 | S 21 | T 22 | U 23 | V 24 | W 25 | X 26 | Y 27 | Z 28 | #misc abbreviations 29 | AB 30 | G 31 | VG 32 | dvs 33 | etc 34 | from 35 | iaf 36 | jfr 37 | kl 38 | kr 39 | mao 40 | mfl 41 | mm 42 | osv 43 | pga 44 | tex 45 | tom 46 | vs 47 | 
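How these prefix lists are consumed: the Moses-style tokenizer scripts under preprocess/ (tokenizer.perl, tokenizer_apos.perl) load the nonbreaking_prefix file for the chosen language; broadly, a word listed here that is followed by a period does not signal an end of sentence and keeps its period attached, while entries tagged #NUMERIC_ONLY# are only protected when the next token starts with a digit. The Python sketch below is an illustrative approximation only, not part of this repository; load_prefixes and is_protected_period are hypothetical names, and the real Perl tokenizer applies additional context checks (for example, the casing of the following word).

import codecs

def load_prefixes(path):
    # One prefix per line; lines starting with '#' are comments.
    # '#NUMERIC_ONLY#' marks prefixes that only protect the period
    # when the next token starts with a digit.
    prefixes = {}
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            if '#NUMERIC_ONLY#' in line:
                prefixes[line.split()[0]] = 'numeric_only'
            else:
                prefixes[line] = 'always'
    return prefixes

def is_protected_period(token, next_token, prefixes):
    # True if the trailing period of `token` should be kept attached
    # (i.e. not treated as a sentence end), according to the list.
    if not token.endswith('.'):
        return False
    rule = prefixes.get(token[:-1])
    if rule == 'always':
        return True
    if rule == 'numeric_only' and next_token[:1].isdigit():
        return True
    return False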
-------------------------------------------------------------------------------- /preprocess/nonbreaking_prefixes/nonbreaking_prefix.ta: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | அ 7 | ஆ 8 | இ 9 | ஈ 10 | உ 11 | ஊ 12 | எ 13 | ஏ 14 | ஐ 15 | ஒ 16 | ஓ 17 | ஔ 18 | ஃ 19 | க 20 | கா 21 | கி 22 | கீ 23 | கு 24 | கூ 25 | கெ 26 | கே 27 | கை 28 | கொ 29 | கோ 30 | கௌ 31 | க் 32 | ச 33 | சா 34 | சி 35 | சீ 36 | சு 37 | சூ 38 | செ 39 | சே 40 | சை 41 | சொ 42 | சோ 43 | சௌ 44 | ச் 45 | ட 46 | டா 47 | டி 48 | டீ 49 | டு 50 | டூ 51 | டெ 52 | டே 53 | டை 54 | டொ 55 | டோ 56 | டௌ 57 | ட் 58 | த 59 | தா 60 | தி 61 | தீ 62 | து 63 | தூ 64 | தெ 65 | தே 66 | தை 67 | தொ 68 | தோ 69 | தௌ 70 | த் 71 | ப 72 | பா 73 | பி 74 | பீ 75 | பு 76 | பூ 77 | பெ 78 | பே 79 | பை 80 | பொ 81 | போ 82 | பௌ 83 | ப் 84 | ற 85 | றா 86 | றி 87 | றீ 88 | று 89 | றூ 90 | றெ 91 | றே 92 | றை 93 | றொ 94 | றோ 95 | றௌ 96 | ற் 97 | ய 98 | யா 99 | யி 100 | யீ 101 | யு 102 | யூ 103 | யெ 104 | யே 105 | யை 106 | யொ 107 | யோ 108 | யௌ 109 | ய் 110 | ர 111 | ரா 112 | ரி 113 | ரீ 114 | ரு 115 | ரூ 116 | ரெ 117 | ரே 118 | ரை 119 | ரொ 120 | ரோ 121 | ரௌ 122 | ர் 123 | ல 124 | லா 125 | லி 126 | லீ 127 | லு 128 | லூ 129 | லெ 130 | லே 131 | லை 132 | லொ 133 | லோ 134 | லௌ 135 | ல் 136 | வ 137 | வா 138 | வி 139 | வீ 140 | வு 141 | வூ 142 | வெ 143 | வே 144 | வை 145 | வொ 146 | வோ 147 | வௌ 148 | வ் 149 | ள 150 | ளா 151 | ளி 152 | ளீ 153 | ளு 154 | ளூ 155 | ளெ 156 | ளே 157 | ளை 158 | ளொ 159 | ளோ 160 | ளௌ 161 | ள் 162 | ழ 163 | ழா 164 | ழி 165 | ழீ 166 | ழு 167 | ழூ 168 | ழெ 169 | ழே 170 | ழை 171 | ழொ 172 | ழோ 173 | ழௌ 174 | ழ் 175 | ங 176 | ஙா 177 | ஙி 178 | ஙீ 179 | ஙு 180 | ஙூ 181 | ஙெ 182 | ஙே 183 | ஙை 184 | ஙொ 185 | ஙோ 186 | ஙௌ 187 | ங் 188 | ஞ 189 | ஞா 190 | ஞி 191 | ஞீ 192 | ஞு 193 | ஞூ 194 | ஞெ 195 | ஞே 196 | ஞை 197 | ஞொ 198 | ஞோ 199 | ஞௌ 200 | ஞ் 201 | ண 202 | ணா 203 | ணி 204 | ணீ 205 | ணு 206 | ணூ 207 | ணெ 208 | ணே 209 | ணை 210 | ணொ 211 | ணோ 212 | ணௌ 213 | ண் 214 | ந 215 | நா 216 | நி 217 | நீ 218 | நு 219 | நூ 220 | நெ 221 | நே 222 | நை 223 | நொ 224 | நோ 225 | நௌ 226 | ந் 227 | ம 228 | மா 229 | மி 230 | மீ 231 | மு 232 | மூ 233 | மெ 234 | மே 235 | மை 236 | மொ 237 | மோ 238 | மௌ 239 | ம் 240 | ன 241 | னா 242 | னி 243 | னீ 244 | னு 245 | னூ 246 | னெ 247 | னே 248 | னை 249 | னொ 250 | னோ 251 | னௌ 252 | ன் 253 | 254 | 255 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 256 | திரு 257 | திருமதி 258 | வண 259 | கௌரவ 260 | 261 | 262 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 263 | உ.ம் 264 | #கா.ம் 265 | #எ.ம் 266 | 267 | 268 | #Numbers only. These should only induce breaks when followed by a numeric sequence 269 | # add NUMERIC_ONLY after the word for this function 270 | #This case is mostly for the english "No." 
which can either be a sentence of its own, or 271 | #if followed by a number, a non-breaking prefix 272 | No #NUMERIC_ONLY# 273 | Nos 274 | Art #NUMERIC_ONLY# 275 | Nr 276 | pp #NUMERIC_ONLY# 277 | -------------------------------------------------------------------------------- /preprocess/normalize-punctuation.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl -w 2 | 3 | use strict; 4 | 5 | my ($language) = @ARGV; 6 | 7 | while(<STDIN>) { 8 | s/\r//g; 9 | # remove extra spaces 10 | s/\(/ \(/g; 11 | s/\)/\) /g; s/ +/ /g; 12 | s/\) ([\.\!\:\?\;\,])/\)$1/g; 13 | s/\( /\(/g; 14 | s/ \)/\)/g; 15 | s/(\d) \%/$1\%/g; 16 | s/ :/:/g; 17 | s/ ;/;/g; 18 | # normalize unicode punctuation 19 | s/„/\"/g; 20 | s/“/\"/g; 21 | s/”/\"/g; 22 | s/–/-/g; 23 | s/—/ - /g; s/ +/ /g; 24 | s/´/\'/g; 25 | s/([a-z])‘([a-z])/$1\'$2/gi; 26 | s/([a-z])’([a-z])/$1\'$2/gi; 27 | s/‘/\"/g; 28 | s/‚/\"/g; 29 | s/’/\"/g; 30 | s/''/\"/g; 31 | s/´´/\"/g; 32 | s/…/.../g; 33 | # French quotes 34 | s/ « / \"/g; 35 | s/« /\"/g; 36 | s/«/\"/g; 37 | s/ » /\" /g; 38 | s/ »/\"/g; 39 | s/»/\"/g; 40 | # handle pseudo-spaces 41 | s/ \%/\%/g; 42 | s/nº /nº /g; 43 | s/ :/:/g; 44 | s/ ºC/ ºC/g; 45 | s/ cm/ cm/g; 46 | s/ \?/\?/g; 47 | s/ \!/\!/g; 48 | s/ ;/;/g; 49 | s/, /, /g; s/ +/ /g; 50 | 51 | # English "quotation," followed by comma, style 52 | if ($language eq "en") { 53 | s/\"([,\.]+)/$1\"/g; 54 | } 55 | # Czech is confused 56 | elsif ($language eq "cs" || $language eq "cz") { 57 | } 58 | # German/Spanish/French "quotation", followed by comma, style 59 | else { 60 | s/,\"/\",/g; 61 | s/(\.+)\"(\s*[^<])/\"$1$2/g; # don't fix period at end of sentence 62 | } 63 | 64 | print STDERR $_ if //; 65 | 66 | if ($language eq "de" || $language eq "es" || $language eq "cz" || $language eq "cs" || $language eq "fr") { 67 | s/(\d) (\d)/$1,$2/g; 68 | } 69 | else { 70 | s/(\d) (\d)/$1.$2/g; 71 | } 72 | print $_; 73 | } 74 | -------------------------------------------------------------------------------- /preprocess/preprocess.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # source language (example: fr) 4 | S=$1 5 | # target language (example: en) 6 | T=$2 7 | 8 | # path to dl4mt/data 9 | P1=$3 10 | 11 | # path to subword NMT scripts (can be downloaded from https://github.com/rsennrich/subword-nmt) 12 | P2=$4 13 | 14 | ## merge all parallel corpora 15 | #./merge.sh $1 $2 16 | 17 | perl $P1/normalize-punctuation.perl -l ${S} < all_${S}-${T}.${S} > all_${S}-${T}.${S}.norm # do this for validation and test 18 | perl $P1/normalize-punctuation.perl -l ${T} < all_${S}-${T}.${T} > all_${S}-${T}.${T}.norm # do this for validation and test 19 | 20 | # tokenize 21 | perl $P1/tokenizer_apos.perl -threads 5 -l $S < all_${S}-${T}.${S}.norm > all_${S}-${T}.${S}.tok # do this for validation and test 22 | perl $P1/tokenizer_apos.perl -threads 5 -l $T < all_${S}-${T}.${T}.norm > all_${S}-${T}.${T}.tok # do this for validation and test 23 | 24 | # BPE 25 | if [ ! -f "../${S}.bpe" ]; then 26 | python $P2/learn_bpe.py -s 20000 < all_${S}-${T}.${S}.tok > ../${S}.bpe 27 | fi 28 | if [ !
-f "../${T}.bpe" ]; then 29 | python $P2/learn_bpe.py -s 20000 < all_${S}-${T}.${T}.tok > ../${T}.bpe 30 | fi 31 | 32 | python $P2/apply_bpe.py -c ../${S}.bpe < all_${S}-${T}.${S}.tok > all_${S}-${T}.${S}.tok.bpe # do this for validation and test 33 | python $P2/apply_bpe.py -c ../${T}.bpe < all_${S}-${T}.${T}.tok > all_${S}-${T}.${T}.tok.bpe # do this for validation and test 34 | 35 | # shuffle 36 | python $P1/shuffle.py all_${S}-${T}.${S}.tok.bpe all_${S}-${T}.${T}.tok.bpe all_${S}-${T}.${S}.tok all_${S}-${T}.${T}.tok 37 | 38 | # build dictionary 39 | #python $P1/build_dictionary.py all_${S}-${T}.${S}.tok & 40 | #python $P1/build_dictionary.py all_${S}-${T}.${T}.tok & 41 | #python $P1/build_dictionary_word.py all_${S}-${T}.${S}.tok.bpe & 42 | #python $P1/build_dictionary_word.py all_${S}-${T}.${T}.tok.bpe & 43 | -------------------------------------------------------------------------------- /preprocess/shuffle.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import random 4 | 5 | from tempfile import mkstemp 6 | from subprocess import call 7 | 8 | 9 | 10 | def main(files): 11 | 12 | tf_os, tpath = mkstemp() 13 | tf = open(tpath, 'w') 14 | 15 | fds = [open(ff) for ff in files] 16 | 17 | for l in fds[0]: 18 | lines = [l.strip()] + [ff.readline().strip() for ff in fds[1:]] 19 | print >>tf, "|||".join(lines) 20 | 21 | [ff.close() for ff in fds] 22 | tf.close() 23 | 24 | tf = open(tpath, 'r') 25 | lines = tf.readlines() 26 | random.shuffle(lines) 27 | 28 | fds = [open(ff+'.shuf','w') for ff in files] 29 | 30 | for l in lines: 31 | s = l.strip().split('|||') 32 | for ii, fd in enumerate(fds): 33 | print >>fd, s[ii] 34 | 35 | [ff.close() for ff in fds] 36 | 37 | os.remove(tpath) 38 | 39 | if __name__ == '__main__': 40 | main(sys.argv[1:]) 41 | 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /presentation/appendix.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nyu-dl/dl4mt-cdec/e738dc7235cb2819ad2b4e8e5837e97b2fb41de2/presentation/appendix.pdf -------------------------------------------------------------------------------- /subword_base/train_wmt15_csen_bpe2bpe_both_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from subword_base_both import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both', 11 | 'two_layer_gru_decoder_both'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_both_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | 
dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_csen_bpe2bpe_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /subword_base/train_wmt15_deen_bpe2bpe_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from subword_base import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder': ('param_init_two_layer_gru_decoder', 11 | 'two_layer_gru_decoder'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | 
dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2bpe_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /subword_base/train_wmt15_deen_bpe2bpe_both_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from subword_base_both import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both', 11 | 'two_layer_gru_decoder_both'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_both_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 
40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_deen_bpe2bpe_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /subword_base/train_wmt15_fien_bpe2bpe_both_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from subword_base_both import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both', 11 | 'two_layer_gru_decoder_both'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_both_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | 
batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_fien_bpe2bpe_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /subword_base/train_wmt15_ruen_bpe2bpe_both_adam.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from collections import OrderedDict 4 | from nmt import train 5 | from subword_base_both import * 6 | 7 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 8 | 'fff': ('param_init_ffflayer', 'ffflayer'), 9 | 'gru': ('param_init_gru', 'gru_layer'), 10 | 'two_layer_gru_decoder_both': ('param_init_two_layer_gru_decoder_both', 11 | 'two_layer_gru_decoder_both'), 12 | } 13 | 14 | 15 | def main(job_id, params): 16 | re_load = False 17 | save_file_name = 'bpe2bpe_two_layer_gru_decoder_both_adam' 18 | source_dataset = params['train_data_path'] + params['source_dataset'] 19 | target_dataset = params['train_data_path'] + params['target_dataset'] 20 | valid_source_dataset = params['dev_data_path'] + params['valid_source_dataset'] 21 | valid_target_dataset = params['dev_data_path'] + params['valid_target_dataset'] 22 | source_dictionary = params['train_data_path'] + params['source_dictionary'] 23 | target_dictionary = params['train_data_path'] + params['target_dictionary'] 24 | 25 | print params, params['save_path'], save_file_name 26 | validerr = train( 27 | max_epochs=int(params['max_epochs']), 28 | patience=int(params['patience']), 29 | dim_word=int(params['dim_word']), 30 | dim_word_src=int(params['dim_word_src']), 31 | save_path=params['save_path'], 32 | save_file_name=save_file_name, 33 | re_load=re_load, 34 | enc_dim=int(params['enc_dim']), 35 | dec_dim=int(params['dec_dim']), 36 | n_words=int(params['n_words']), 37 | n_words_src=int(params['n_words_src']), 38 | decay_c=float(params['decay_c']), 39 | lrate=float(params['learning_rate']), 40 | optimizer=params['optimizer'], 41 | maxlen=int(params['maxlen']), 42 | maxlen_trg=int(params['maxlen_trg']), 43 | maxlen_sample=int(params['maxlen_sample']), 44 | batch_size=int(params['batch_size']), 45 | valid_batch_size=int(params['valid_batch_size']), 46 | sort_size=int(params['sort_size']), 47 | validFreq=int(params['validFreq']), 48 | 
dispFreq=int(params['dispFreq']), 49 | saveFreq=int(params['saveFreq']), 50 | sampleFreq=int(params['sampleFreq']), 51 | clip_c=int(params['clip_c']), 52 | datasets=[source_dataset, target_dataset], 53 | valid_datasets=[valid_source_dataset, valid_target_dataset], 54 | dictionaries=[source_dictionary, target_dictionary], 55 | use_dropout=int(params['use_dropout']), 56 | source_word_level=int(params['source_word_level']), 57 | target_word_level=int(params['target_word_level']), 58 | layers=layers, 59 | save_every_saveFreq=1, 60 | use_bpe=1, 61 | init_params=init_params, 62 | build_model=build_model, 63 | build_sampler=build_sampler, 64 | gen_sample=gen_sample 65 | ) 66 | return validerr 67 | 68 | if __name__ == '__main__': 69 | 70 | import sys, time 71 | if len(sys.argv) > 1: 72 | config_file_name = sys.argv[-1] 73 | else: 74 | config_file_name = 'wmt15_ruen_bpe2bpe_adam.txt' 75 | 76 | f = open(config_file_name, 'r') 77 | lines = f.readlines() 78 | params = OrderedDict() 79 | 80 | for line in lines: 81 | line = line.split('\n')[0] 82 | param_list = line.split(' ') 83 | param_name = param_list[0] 84 | param_value = param_list[1] 85 | params[param_name] = param_value 86 | 87 | main(0, params) 88 | -------------------------------------------------------------------------------- /subword_base/translate.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from subword_base import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 
33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | queue.put((idx, x)) 126 | return idx+1 127 | 128 | def _finish_processes(): 129 | for midx in xrange(n_process): 130 | queue.put(None) 131 | 132 | def _retrieve_jobs(n_samples): 133 | trans = [None] * n_samples 134 | for idx in xrange(n_samples): 135 | resp = rqueue.get() 136 | trans[resp[0]] = resp[1] 137 | if numpy.mod(idx, 10) == 0: 138 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 139 | return trans 140 | 141 | print 'Translating ', source_file, '...' 
142 | n_samples = _send_jobs(source_file) 143 | trans = _seqs2words(_retrieve_jobs(n_samples)) 144 | _finish_processes() 145 | with open(saveto, 'w') as f: 146 | if decoder_bpe_to_tok: 147 | print >>f, '\n'.join(trans).replace('@@ ', '') 148 | else: 149 | print >>f, '\n'.join(trans) 150 | print 'Done' 151 | 152 | 153 | if __name__ == "__main__": 154 | parser = argparse.ArgumentParser() 155 | parser.add_argument('-k', type=int, default=5) 156 | parser.add_argument('-p', type=int, default=5) 157 | parser.add_argument('-n', action="store_true", default=False) 158 | parser.add_argument('-bpe', action="store_true", default=False) 159 | parser.add_argument('-enc_c', action="store_true", default=False) 160 | parser.add_argument('-dec_c', action="store_true", default=False) 161 | parser.add_argument('-utf8', action="store_true", default=False) 162 | parser.add_argument('model', type=str) 163 | parser.add_argument('dictionary', type=str) 164 | parser.add_argument('dictionary_target', type=str) 165 | parser.add_argument('source', type=str) 166 | parser.add_argument('saveto', type=str) 167 | 168 | args = parser.parse_args() 169 | 170 | main(args.model, args.dictionary, args.dictionary_target, args.source, 171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 172 | encoder_chr_level=args.enc_c, 173 | decoder_chr_level=args.dec_c, 174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 175 | -------------------------------------------------------------------------------- /subword_base/translate_both.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | 9 | from subword_base_both import (build_sampler, gen_sample, init_params) 10 | from mixer import * 11 | 12 | from multiprocessing import Process, Queue 13 | 14 | 15 | def translate_model(queue, rqueue, pid, model, options, k, normalize): 16 | 17 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 18 | trng = RandomStreams(1234) 19 | 20 | # allocate model parameters 21 | params = init_params(options) 22 | 23 | # load model parameters and set theano shared variables 24 | params = load_params(model, params) 25 | tparams = init_tparams(params) 26 | 27 | # word index 28 | use_noise = theano.shared(numpy.float32(0.)) 29 | f_init, f_next = build_sampler(tparams, options, trng, use_noise) 30 | 31 | def _translate(seq): 32 | use_noise.set_value(0.) 
33 | # sample given an input sequence and obtain scores 34 | sample, score = gen_sample(tparams, f_init, f_next, 35 | numpy.array(seq).reshape([len(seq), 1]), 36 | options, trng=trng, k=k, maxlen=500, 37 | stochastic=False, argmax=False) 38 | 39 | # normalize scores according to sequence lengths 40 | if normalize: 41 | lengths = numpy.array([len(s) for s in sample]) 42 | score = score / lengths 43 | sidx = numpy.argmin(score) 44 | return sample[sidx] 45 | 46 | while True: 47 | req = queue.get() 48 | if req is None: 49 | break 50 | 51 | idx, x = req[0], req[1] 52 | print pid, '-', idx 53 | seq = _translate(x) 54 | 55 | rqueue.put((idx, seq)) 56 | 57 | return 58 | 59 | 60 | def main(model, dictionary, dictionary_target, source_file, saveto, k=5, 61 | normalize=False, n_process=5, encoder_chr_level=False, 62 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 63 | 64 | # load model model_options 65 | pkl_file = model.split('.')[0] + '.pkl' 66 | with open(pkl_file, 'rb') as f: 67 | options = pkl.load(f) 68 | 69 | # load source dictionary and invert 70 | with open(dictionary, 'rb') as f: 71 | word_dict = pkl.load(f) 72 | word_idict = dict() 73 | for kk, vv in word_dict.iteritems(): 74 | word_idict[vv] = kk 75 | word_idict[0] = '' 76 | word_idict[1] = 'UNK' 77 | 78 | # load target dictionary and invert 79 | with open(dictionary_target, 'rb') as f: 80 | word_dict_trg = pkl.load(f) 81 | word_idict_trg = dict() 82 | for kk, vv in word_dict_trg.iteritems(): 83 | word_idict_trg[vv] = kk 84 | word_idict_trg[0] = '' 85 | word_idict_trg[1] = 'UNK' 86 | 87 | # create input and output queues for processes 88 | queue = Queue() 89 | rqueue = Queue() 90 | processes = [None] * n_process 91 | for midx in xrange(n_process): 92 | processes[midx] = Process( 93 | target=translate_model, 94 | args=(queue, rqueue, midx, model, options, k, normalize)) 95 | processes[midx].start() 96 | 97 | # utility function 98 | def _seqs2words(caps): 99 | capsw = [] 100 | for cc in caps: 101 | ww = [] 102 | for w in cc: 103 | if w == 0: 104 | break 105 | if utf8: 106 | ww.append(word_idict_trg[w].encode('utf-8')) 107 | else: 108 | ww.append(word_idict_trg[w]) 109 | if decoder_chr_level: 110 | capsw.append(''.join(ww)) 111 | else: 112 | capsw.append(' '.join(ww)) 113 | return capsw 114 | 115 | def _send_jobs(fname): 116 | with open(fname, 'r') as f: 117 | for idx, line in enumerate(f): 118 | if encoder_chr_level: 119 | words = list(line.decode('utf-8').strip()) 120 | else: 121 | words = line.strip().split() 122 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 123 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 124 | x += [0] 125 | queue.put((idx, x)) 126 | return idx+1 127 | 128 | def _finish_processes(): 129 | for midx in xrange(n_process): 130 | queue.put(None) 131 | 132 | def _retrieve_jobs(n_samples): 133 | trans = [None] * n_samples 134 | for idx in xrange(n_samples): 135 | resp = rqueue.get() 136 | trans[resp[0]] = resp[1] 137 | if numpy.mod(idx, 10) == 0: 138 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 139 | return trans 140 | 141 | print 'Translating ', source_file, '...' 
142 | n_samples = _send_jobs(source_file) 143 | trans = _seqs2words(_retrieve_jobs(n_samples)) 144 | _finish_processes() 145 | with open(saveto, 'w') as f: 146 | if decoder_bpe_to_tok: 147 | print >>f, '\n'.join(trans).replace('@@ ', '') 148 | else: 149 | print >>f, '\n'.join(trans) 150 | print 'Done' 151 | 152 | 153 | if __name__ == "__main__": 154 | parser = argparse.ArgumentParser() 155 | parser.add_argument('-k', type=int, default=5) 156 | parser.add_argument('-p', type=int, default=5) 157 | parser.add_argument('-n', action="store_true", default=False) 158 | parser.add_argument('-bpe', action="store_true", default=False) 159 | parser.add_argument('-enc_c', action="store_true", default=False) 160 | parser.add_argument('-dec_c', action="store_true", default=False) 161 | parser.add_argument('-utf8', action="store_true", default=False) 162 | parser.add_argument('model', type=str) 163 | parser.add_argument('dictionary', type=str) 164 | parser.add_argument('dictionary_target', type=str) 165 | parser.add_argument('source', type=str) 166 | parser.add_argument('saveto', type=str) 167 | 168 | args = parser.parse_args() 169 | 170 | main(args.model, args.dictionary, args.dictionary_target, args.source, 171 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 172 | encoder_chr_level=args.enc_c, 173 | decoder_chr_level=args.dec_c, 174 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 175 | -------------------------------------------------------------------------------- /subword_base/translate_both_bpe2bpe_ensemble_deen.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Translates a source file using a translation model. 3 | ''' 4 | import argparse 5 | 6 | import numpy 7 | import cPickle as pkl 8 | import ipdb 9 | 10 | from nmt_both import (build_sampler, init_params) 11 | from mixer import * 12 | 13 | from multiprocessing import Process, Queue 14 | 15 | 16 | def gen_sample(tparams, f_inits, f_nexts, x, options, trng=None, 17 | k=1, maxlen=500, stochastic=True, argmax=False): 18 | 19 | # k is the beam size we have 20 | if k > 1: 21 | assert not stochastic, \ 22 | 'Beam search does not support stochastic sampling' 23 | 24 | sample = [] 25 | sample_score = [] 26 | if stochastic: 27 | sample_score = 0 28 | 29 | live_k = 1 30 | dead_k = 0 31 | 32 | hyp_samples = [[]] * live_k 33 | hyp_scores = numpy.zeros(live_k).astype('float32') 34 | hyp_states = [] 35 | 36 | # get initial state of decoder rnn and encoder context 37 | rets = [] 38 | next_state_chars = [] 39 | next_state_words = [] 40 | ctx0s = [] 41 | 42 | for i in xrange(len(f_inits)): 43 | ret = f_inits[i](x) 44 | next_state_chars.append(ret[0]) 45 | next_state_words.append(ret[1]) 46 | ctx0s.append(ret[2]) 47 | next_w = -1 * numpy.ones((1,)).astype('int64') # bos indicator 48 | 49 | num_models = len(f_inits) 50 | 51 | for ii in xrange(maxlen): 52 | 53 | temp_next_p = [] 54 | temp_next_state_char = [] 55 | temp_next_state_word = [] 56 | 57 | for i in xrange(num_models): 58 | 59 | ctx = numpy.tile(ctx0s[i], [live_k, 1]) 60 | inps = [next_w, ctx, next_state_chars[i], next_state_words[i]] 61 | ret = f_nexts[i](*inps) 62 | next_p, _, next_state_char, next_state_word = ret[0], ret[1], ret[2], ret[3] 63 | temp_next_p.append(next_p) 64 | temp_next_state_char.append(next_state_char) 65 | temp_next_state_word.append(next_state_word) 66 | #next_p = numpy.log(numpy.array(temp_next_p)).sum(axis=0) / num_models 67 | next_p = numpy.log(numpy.array(temp_next_p).mean(axis=0)) 68 | 69 | if stochastic: 70 | if argmax: 
71 | nw = next_p[0].argmax() 72 | else: 73 | nw = next_w[0] 74 | sample.append(nw) 75 | sample_score += next_p[0, nw] 76 | if nw == 0: 77 | break 78 | else: 79 | cand_scores = hyp_scores[:, None] - next_p 80 | cand_flat = cand_scores.flatten() 81 | ranks_flat = cand_flat.argsort()[:(k - dead_k)] 82 | 83 | voc_size = next_p.shape[1] 84 | trans_indices = ranks_flat / voc_size 85 | word_indices = ranks_flat % voc_size 86 | costs = cand_flat[ranks_flat] 87 | 88 | new_hyp_samples = [] 89 | new_hyp_scores = numpy.zeros(k - dead_k).astype('float32') 90 | new_hyp_states_chars = [] 91 | new_hyp_states_words = [] 92 | 93 | for idx, [ti, wi] in enumerate(zip(trans_indices, word_indices)): 94 | new_hyp_samples.append(hyp_samples[ti] + [wi]) 95 | new_hyp_scores[idx] = copy.copy(costs[idx]) 96 | 97 | for i in xrange(num_models): 98 | new_hyp_states_char = [] 99 | new_hyp_states_word = [] 100 | 101 | for ti in trans_indices: 102 | new_hyp_states_char.append(copy.copy(temp_next_state_char[i][ti])) 103 | new_hyp_states_word.append(copy.copy(temp_next_state_word[i][ti])) 104 | 105 | new_hyp_states_chars.append(new_hyp_states_char) 106 | new_hyp_states_words.append(new_hyp_states_word) 107 | 108 | # check the finished samples 109 | new_live_k = 0 110 | hyp_samples = [] 111 | hyp_scores = [] 112 | 113 | for idx in xrange(len(new_hyp_samples)): 114 | if new_hyp_samples[idx][-1] == 0: 115 | sample.append(new_hyp_samples[idx]) 116 | sample_score.append(new_hyp_scores[idx]) 117 | dead_k += 1 118 | else: 119 | new_live_k += 1 120 | hyp_samples.append(new_hyp_samples[idx]) 121 | hyp_scores.append(new_hyp_scores[idx]) 122 | 123 | for i in xrange(num_models): 124 | hyp_states_char = [] 125 | hyp_states_word = [] 126 | 127 | for idx in xrange(len(new_hyp_samples)): 128 | if new_hyp_samples[idx][-1] != 0: 129 | hyp_states_char.append(new_hyp_states_chars[i][idx]) 130 | hyp_states_word.append(new_hyp_states_words[i][idx]) 131 | 132 | next_state_chars[i] = numpy.array(hyp_states_char) 133 | next_state_words[i] = numpy.array(hyp_states_word) 134 | 135 | hyp_scores = numpy.array(hyp_scores) 136 | live_k = new_live_k 137 | 138 | if new_live_k < 1: 139 | break 140 | if dead_k >= k: 141 | break 142 | 143 | next_w = numpy.array([w[-1] for w in hyp_samples]) 144 | 145 | if not stochastic: 146 | # dump every remaining one 147 | if live_k > 0: 148 | for idx in xrange(live_k): 149 | sample.append(hyp_samples[idx]) 150 | sample_score.append(hyp_scores[idx]) 151 | 152 | return sample, sample_score 153 | 154 | 155 | def translate_model(queue, rqueue, pid, models, options, k, normalize): 156 | 157 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 158 | trng = RandomStreams(1234) 159 | 160 | # allocate model parameters 161 | params = [] 162 | for i in xrange(len(models)): 163 | params.append(init_params(options)) 164 | 165 | # load model parameters and set theano shared variables 166 | tparams = [] 167 | for i in xrange(len(params)): 168 | params[i] = load_params(models[i], params[i]) 169 | tparams.append(init_tparams(params[i])) 170 | 171 | # word index 172 | use_noise = theano.shared(numpy.float32(0.)) 173 | f_inits = [] 174 | f_nexts = [] 175 | for i in xrange(len(tparams)): 176 | f_init, f_next = build_sampler(tparams[i], options, trng, use_noise) 177 | f_inits.append(f_init) 178 | f_nexts.append(f_next) 179 | 180 | def _translate(seq): 181 | use_noise.set_value(0.) 
182 | # sample given an input sequence and obtain scores 183 | sample, score = gen_sample(tparams, f_inits, f_nexts, 184 | numpy.array(seq).reshape([len(seq), 1]), 185 | options, trng=trng, k=k, maxlen=500, 186 | stochastic=False, argmax=False) 187 | 188 | # normalize scores according to sequence lengths 189 | if normalize: 190 | lengths = numpy.array([len(s) for s in sample]) 191 | score = score / lengths 192 | sidx = numpy.argmin(score) 193 | return sample[sidx] 194 | 195 | while True: 196 | req = queue.get() 197 | if req is None: 198 | break 199 | 200 | idx, x = req[0], req[1] 201 | print pid, '-', idx 202 | seq = _translate(x) 203 | 204 | rqueue.put((idx, seq)) 205 | 206 | return 207 | 208 | 209 | def main(models, dictionary, dictionary_target, source_file, saveto, k=5, 210 | normalize=False, n_process=5, encoder_chr_level=False, 211 | decoder_chr_level=False, utf8=False, decoder_bpe_to_tok=False): 212 | 213 | # load model model_options 214 | pkl_file = models[0].split('.')[0] + '.pkl' 215 | with open(pkl_file, 'rb') as f: 216 | options = pkl.load(f) 217 | 218 | # load source dictionary and invert 219 | with open(dictionary, 'rb') as f: 220 | word_dict = pkl.load(f) 221 | word_idict = dict() 222 | for kk, vv in word_dict.iteritems(): 223 | word_idict[vv] = kk 224 | word_idict[0] = '' 225 | word_idict[1] = 'UNK' 226 | 227 | # load target dictionary and invert 228 | with open(dictionary_target, 'rb') as f: 229 | word_dict_trg = pkl.load(f) 230 | word_idict_trg = dict() 231 | for kk, vv in word_dict_trg.iteritems(): 232 | word_idict_trg[vv] = kk 233 | word_idict_trg[0] = '' 234 | word_idict_trg[1] = 'UNK' 235 | 236 | # create input and output queues for processes 237 | queue = Queue() 238 | rqueue = Queue() 239 | processes = [None] * n_process 240 | for midx in xrange(n_process): 241 | processes[midx] = Process( 242 | target=translate_model, 243 | args=(queue, rqueue, midx, models, options, k, normalize)) 244 | processes[midx].start() 245 | 246 | # utility function 247 | def _seqs2words(caps): 248 | capsw = [] 249 | for cc in caps: 250 | ww = [] 251 | for w in cc: 252 | if w == 0: 253 | break 254 | if utf8: 255 | ww.append(word_idict_trg[w].encode('utf-8')) 256 | else: 257 | ww.append(word_idict_trg[w]) 258 | if decoder_chr_level: 259 | capsw.append(''.join(ww)) 260 | else: 261 | capsw.append(' '.join(ww)) 262 | return capsw 263 | 264 | def _send_jobs(fname): 265 | with open(fname, 'r') as f: 266 | for idx, line in enumerate(f): 267 | if encoder_chr_level: 268 | words = list(line.decode('utf-8').strip()) 269 | else: 270 | words = line.strip().split() 271 | x = map(lambda w: word_dict[w] if w in word_dict else 1, words) 272 | x = map(lambda ii: ii if ii < options['n_words_src'] else 1, x) 273 | x += [0] 274 | queue.put((idx, x)) 275 | return idx+1 276 | 277 | def _finish_processes(): 278 | for midx in xrange(n_process): 279 | queue.put(None) 280 | 281 | def _retrieve_jobs(n_samples): 282 | trans = [None] * n_samples 283 | for idx in xrange(n_samples): 284 | resp = rqueue.get() 285 | trans[resp[0]] = resp[1] 286 | if numpy.mod(idx, 10) == 0: 287 | print 'Sample ', (idx+1), '/', n_samples, ' Done' 288 | return trans 289 | 290 | print 'Translating ', source_file, '...' 
291 | n_samples = _send_jobs(source_file) 292 | trans = _seqs2words(_retrieve_jobs(n_samples)) 293 | _finish_processes() 294 | with open(saveto, 'w') as f: 295 | if decoder_bpe_to_tok: 296 | print >>f, '\n'.join(trans).replace('@@ ', '') 297 | else: 298 | print >>f, '\n'.join(trans) 299 | print 'Done' 300 | 301 | 302 | if __name__ == "__main__": 303 | parser = argparse.ArgumentParser() 304 | parser.add_argument('-k', type=int, default=5) 305 | parser.add_argument('-p', type=int, default=5) 306 | parser.add_argument('-n', action="store_true", default=False) 307 | parser.add_argument('-bpe', action="store_true", default=False) 308 | parser.add_argument('-enc_c', action="store_true", default=False) 309 | parser.add_argument('-dec_c', action="store_true", default=False) 310 | parser.add_argument('-utf8', action="store_true", default=False) 311 | parser.add_argument('saveto', type=str) 312 | 313 | model_path = '/misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/deen/bpe2bpe_two_layer_gru_decoder/0209/' 314 | model1 = model_path + 'bpe2bpe_two_layer_gru_decoder_both_adam_en1.290000.npz' 315 | model2 = model_path + 'bpe2bpe_two_layer_gru_decoder_both_adam_en2.260000.npz' 316 | model3 = model_path + 'bpe2bpe_two_layer_gru_decoder_both_adam_en3.290000.npz' 317 | model4 = model_path + 'bpe2bpe_two_layer_gru_decoder_both_adam.335000.npz' 318 | models = [model1, model2, model3, model4] 319 | dictionary = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/all_de-en.en.tok.bpe.word.pkl' 320 | dictionary_target = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/all_de-en.de.tok.bpe.word.pkl' 321 | source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/dev/newstest2013.en.tok.bpe' 322 | #source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/test/newstest2014-deen-src.en.tok.bpe' 323 | #source = '/misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/test/newstest2015-deen-src.en.tok.bpe' 324 | 325 | args = parser.parse_args() 326 | 327 | main(models, dictionary, dictionary_target, source, 328 | args.saveto, k=args.k, normalize=args.n, n_process=args.p, 329 | encoder_chr_level=args.enc_c, 330 | decoder_chr_level=args.dec_c, 331 | utf8=args.utf8, decoder_bpe_to_tok=args.bpe) 332 | -------------------------------------------------------------------------------- /subword_base/wmt15_csen_bpe2bpe_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/csen/bpe2bpe_two_layer_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/csen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 21816 14 | n_words_src 21907 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 100 26 | maxlen_sample 100 27 | source_word_level 1 28 | target_word_level 1 29 | source_dataset all_cs-en.en.tok.bpe 30 | target_dataset all_cs-en.cs.tok.bpe 31 | valid_source_dataset newstest2013-src.en.tok.bpe 32 | valid_target_dataset newstest2013-ref.cs.tok.bpe 33 | source_dictionary all_cs-en.en.tok.bpe.word.pkl 34 | target_dictionary all_cs-en.cs.tok.bpe.word.pkl 35 | 
-------------------------------------------------------------------------------- /subword_base/wmt15_deen_bpe2bpe_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/deen/bpe2bpe_two_layer_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/deen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 24254 14 | n_words_src 24440 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 100 26 | maxlen_sample 100 27 | source_word_level 1 28 | target_word_level 1 29 | source_dataset all_de-en.en.tok.bpe.shuf 30 | target_dataset all_de-en.de.tok.bpe.shuf 31 | valid_source_dataset newstest2013.en.tok.bpe 32 | valid_target_dataset newstest2013.de.tok.bpe 33 | source_dictionary all_de-en.en.tok.bpe.word.pkl 34 | target_dictionary all_de-en.de.tok.bpe.word.pkl 35 | -------------------------------------------------------------------------------- /subword_base/wmt15_fien_bpe2bpe_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/fien/bpe2bpe_two_layer_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/fien/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 20783 14 | n_words_src 20174 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 100 26 | maxlen_sample 100 27 | source_word_level 1 28 | target_word_level 1 29 | source_dataset all_fi-en.en.tok.bpe.shuf 30 | target_dataset all_fi-en.fi.tok.bpe.shuf 31 | valid_source_dataset newsdev2015-enfi-src.en.tok.bpe 32 | valid_target_dataset newsdev2015-enfi-ref.fi.tok.bpe 33 | source_dictionary all_fi-en.en.tok.bpe.word.pkl 34 | target_dictionary all_fi-en.fi.tok.bpe.word.pkl 35 | -------------------------------------------------------------------------------- /subword_base/wmt15_ruen_bpe2bpe_adam.txt: -------------------------------------------------------------------------------- 1 | save_path /misc/kcgscratch1/ChoGroup/junyoung_exp/acl2016/wmt15/ruen/bpe2bpe_two_layer_gru_decoder/0209/ 2 | train_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/train/ 3 | dev_data_path /misc/kcgscratch1/ChoGroup/junyoung_exp/wmt15/ruen/dev/ 4 | max_epochs 1000000000000 5 | patience -1 6 | learning_rate 0.0001 7 | batch_size 128 8 | valid_batch_size 128 9 | enc_dim 512 10 | dec_dim 1024 11 | dim_word 512 12 | dim_word_src 512 13 | n_words 22106 14 | n_words_src 22030 15 | optimizer adam 16 | decay_c 0 17 | use_dropout 0 18 | clip_c 1 19 | saveFreq 5000 20 | sampleFreq 5000 21 | dispFreq 1000 22 | validFreq 5000 23 | sort_size 20 24 | maxlen 50 25 | maxlen_trg 100 26 | maxlen_sample 100 27 | source_word_level 1 28 | target_word_level 1 29 | source_dataset 
all_ru-en.en.tok.bpe 30 | target_dataset all_ru-en.ru.tok.bpe 31 | valid_source_dataset newstest2013-src.en.tok.bpe 32 | valid_target_dataset newstest2013-ref.ru.tok.bpe 33 | source_dictionary all_ru-en.en.tok.bpe.word.pkl 34 | target_dictionary all_ru-en.ru.tok.bpe.word.pkl 35 | -------------------------------------------------------------------------------- /translate_readme.txt: -------------------------------------------------------------------------------- 1 | Command for using translate.py BPE-case: 2 | python translate.py -k {beam_width} -p {number_of_processors} -n -bpe {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name} 3 | 4 | Command for using translate.py Char-case: 5 | python translate.py -k {beam_width} -p {number_of_processors} -n -dec_c -utf8 {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name} 6 | 7 | Command for using translate_both.py BPE-case: 8 | python translate_both.py -k {beam_width} -p {number_of_processors} -n -bpe {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name} 9 | 10 | Command for using translate_both.py Char-case: 11 | python translate_both.py -k {beam_width} -p {number_of_processors} -n -dec_c -utf8 {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name} 12 | 13 | Command for using translate_attc.py Char-case: 14 | python translate_attc.py -k {beam_width} -p {number_of_processors} -n -dec_c -utf8 {path/model.npz} {path/source_dict} {path/target_dict} {path/valid.txt or test.txt} {save_path/save_file_name} 15 | 16 | Command for using `multi-bleu.perl': 17 | perl multi-bleu.perl {reference.txt} < {translated.txt} 18 | --------------------------------------------------------------------------------
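Worked example for the BPE case (illustrative only): the braces in the commands above are placeholders, and every file name below is hypothetical, modelled on the entries in subword_base/wmt15_deen_bpe2bpe_adam.txt and the save_file_name used by the training scripts rather than on files shipped with this repository. With a beam width of 12 and 5 decoding processes, a decoding run followed by BLEU scoring might look like:

python translate.py -k 12 -p 5 -n -bpe bpe2bpe_two_layer_gru_decoder_adam.npz all_de-en.en.tok.bpe.word.pkl all_de-en.de.tok.bpe.word.pkl newstest2013.en.tok.bpe newstest2013.trans.de

perl multi-bleu.perl newstest2013.de.tok < newstest2013.trans.de

The -n flag length-normalizes beam scores before picking the best hypothesis, and -bpe strips the "@@ " BPE separators from the output so it can be scored against the tokenized (non-BPE) reference.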