├── .gitignore
├── README.md
├── data
│   ├── __init__.py
│   ├── build_dictionary.py
│   ├── clean-corpus-n.perl
│   ├── data_iterator.py
│   ├── data_statistics.py
│   ├── data_utils.py
│   ├── merge.sh
│   ├── multi-bleu.perl
│   ├── nonbreaking_prefixes
│   │   ├── README.txt
│   │   ├── nonbreaking_prefix.ca
│   │   ├── nonbreaking_prefix.cs
│   │   ├── nonbreaking_prefix.de
│   │   ├── nonbreaking_prefix.el
│   │   ├── nonbreaking_prefix.en
│   │   ├── nonbreaking_prefix.es
│   │   ├── nonbreaking_prefix.fi
│   │   ├── nonbreaking_prefix.fr
│   │   ├── nonbreaking_prefix.hu
│   │   ├── nonbreaking_prefix.is
│   │   ├── nonbreaking_prefix.it
│   │   ├── nonbreaking_prefix.lv
│   │   ├── nonbreaking_prefix.nl
│   │   ├── nonbreaking_prefix.pl
│   │   ├── nonbreaking_prefix.pt
│   │   ├── nonbreaking_prefix.ro
│   │   ├── nonbreaking_prefix.ru
│   │   ├── nonbreaking_prefix.sk
│   │   ├── nonbreaking_prefix.sl
│   │   ├── nonbreaking_prefix.sv
│   │   └── nonbreaking_prefix.ta
│   ├── normalize-punctuation.perl
│   ├── postprocess.sh
│   ├── preprocess.sh
│   ├── sample.en
│   ├── sample.fr
│   ├── shuffle.py
│   ├── strip_sgml.py
│   ├── subword_nmt
│   │   ├── README.md
│   │   ├── apply_bpe.py
│   │   ├── chrF.py
│   │   ├── learn_bpe.py
│   │   └── segment-char-ngrams.py
│   ├── tokenizer.perl
│   └── util.py
├── decode.ipynb
├── decode.py
├── seq2seq_model.py
├── train.ipynb
└── train.py
/.gitignore:
--------------------------------------------------------------------------------
**/*.pyc
**/*.swp
model/
europarl-v7.*
newstest*
.ipynb*
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# TF-seq2seq
## **Sequence-to-sequence (seq2seq) learning using TensorFlow**

The core building blocks are RNN encoder-decoder architectures and attention mechanisms.

The package is built largely on the tf.contrib.seq2seq modules introduced in TensorFlow 1.2:
- AttentionWrapper
- Decoder
- BasicDecoder
- BeamSearchDecoder

**The package supports**
- Multi-layer GRU/LSTM
- Residual connections
- Dropout
- Attention and input feeding
- Beam-search decoding
- Writing n-best lists

# Dependencies
- NumPy >= 1.11.1
- TensorFlow >= 1.2


# History
- June 5, 2017: Major update
- June 6, 2017: Added batch beam-search decoding
- June 11, 2017: Separated training / decoding
- June 22, 2017: Support for TensorFlow 1.2 (contrib.rnn -> python.ops.rnn_cell)


# Usage Instructions
## **Data Preparation**

To preprocess raw parallel data stored as sample_data.src and sample_data.trg, simply run
```bash
cd data/
./preprocess.sh src trg sample_data ${max_seq_len}
```

Running the above script performs the widely used preprocessing steps for machine translation (MT):

- Normalizing punctuation
- Tokenizing
- Byte-pair encoding (30,000 merge operations) (Sennrich et al., 2016)
- Removing sequences longer than ${max_seq_len}
- Shuffling
- Building dictionaries (token-to-id JSON files; see the sketch below)

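The dictionaries built in the last step are plain JSON files written by `data/build_dictionary.py`: the special tokens defined in `data/data_utils.py` (`_GO`, `_EOS`, `_UNK`) take ids 0-2, and the remaining tokens are ranked by corpus frequency. Below is a minimal sketch of inspecting and applying one of these files; the file name is a placeholder, not the exact name `preprocess.sh` produces.

```python
import json

# Hypothetical path; substitute the .json vocabulary that
# build_dictionary.py wrote next to your preprocessed corpus.
VOCAB_PATH = 'sample_data.src.json'

with open(VOCAB_PATH, 'r') as f:
    word2idx = json.load(f)

# build_dictionary.py reserves the lowest ids for the special tokens:
# _GO -> 0, _EOS -> 1 (also used as PAD), _UNK -> 2.
print(word2idx['_GO'], word2idx['_EOS'], word2idx['_UNK'])

def words_to_ids(sentence, unk_id=2):
    """Map a whitespace-tokenized sentence to ids, as data_iterator.py does."""
    return [word2idx.get(w, unk_id) for w in sentence.strip().split()]

print(words_to_ids('a tokenized source sentence'))
```

`data/data_iterator.py` performs the same lookup when it streams minibatches during training and decoding.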
## **Training**
To train a seq2seq model,
```bash
$ python train.py --cell_type 'lstm' \
                  --attention_type 'luong' \
                  --hidden_units 1024 \
                  --depth 2 \
                  --embedding_size 500 \
                  --num_encoder_symbols 30000 \
                  --num_decoder_symbols 30000 ...
```

## **Decoding**
To run the trained model for decoding,
```bash
$ python decode.py --beam_width 5 \
                   --decode_batch_size 30 \
                   --model_path $PATH_TO_A_MODEL_CHECKPOINT (e.g. model/translate.ckpt-100) \
                   --max_decode_step 300 \
                   --write_n_best False \
                   --decode_input $PATH_TO_DECODE_INPUT \
                   --decode_output $PATH_TO_DECODE_OUTPUT
```
If --beam_width=1, greedy decoding is performed at each time step.

## **Arguments**

**Data params**
- --source_vocabulary : Path to source vocabulary
- --target_vocabulary : Path to target vocabulary
- --source_train_data : Path to source training data
- --target_train_data : Path to target training data
- --source_valid_data : Path to source validation data
- --target_valid_data : Path to target validation data

**Network params** (a minimal wiring sketch for these options follows the argument lists)
- --cell_type : RNN cell to use for encoder and decoder (default: lstm)
- --attention_type : Attention mechanism (bahdanau, luong) (default: bahdanau)
- --depth : Number of stacked RNN layers in the encoder and decoder (default: 2)
- --embedding_size : Embedding dimension of encoder and decoder inputs (default: 500)
- --num_encoder_symbols : Source vocabulary size to use (default: 30000)
- --num_decoder_symbols : Target vocabulary size to use (default: 30000)
- --use_residual : Use residual connections between layers (default: True)
- --attn_input_feeding : Use the input-feeding method in the attentional decoder (Luong et al., 2015) (default: True)
- --use_dropout : Use dropout on RNN cell outputs (default: True)
- --dropout_rate : Dropout probability for cell outputs (0.0: no dropout) (default: 0.3)

**Training params**
- --learning_rate : Initial learning rate (default: 0.0002)
- --max_gradient_norm : Clip gradients to this norm (default: 1.0)
- --batch_size : Batch size
- --max_epochs : Maximum number of training epochs
- --max_load_batches : Maximum number of batches to prefetch at one time
- --max_seq_length : Maximum sequence length
- --display_freq : Display training status every this many iterations
- --save_freq : Save a model checkpoint every this many iterations
- --valid_freq : Evaluate the model every this many iterations (validation data required)
- --optimizer : Optimizer for training (adadelta, adam, rmsprop) (default: adam)
- --model_dir : Path to save model checkpoints
- --model_name : File name used for model checkpoints
- --shuffle_each_epoch : Shuffle the training dataset each epoch (default: True)
- --sort_by_length : Sort prefetched minibatches by their target sequence lengths (default: True)

**Decoding params**
- --beam_width : Beam width used in beam search (default: 1)
- --decode_batch_size : Batch size used in decoding
- --max_decode_step : Maximum time-step limit in decoding (default: 500)
- --write_n_best : Write the beam-search n-best list (n = beam_width) (default: False)
- --decode_input : Input file path to decode
- --decode_output : Output file path of the decoding output

**Runtime params**
- --allow_soft_placement : Allow device soft placement
- --log_device_placement : Log placement of ops on devices

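For orientation, the sketch below shows roughly how the network params above map onto the tf.contrib.seq2seq modules listed in the overview (attention mechanism, AttentionWrapper, BasicDecoder, dynamic_decode). It is a self-contained illustration under assumed shapes and hyperparameter values, not the repository's `seq2seq_model.py`.

```python
import tensorflow as tf
from tensorflow.python.layers.core import Dense

hidden_units, depth = 1024, 2       # --hidden_units, --depth
emb_size = 500                      # --embedding_size
src_vocab = tgt_vocab = 30000       # --num_encoder_symbols / --num_decoder_symbols

# Toy placeholders standing in for the real data pipeline.
src_ids = tf.placeholder(tf.int32, [None, None])
src_len = tf.placeholder(tf.int32, [None])
tgt_in_ids = tf.placeholder(tf.int32, [None, None])
tgt_len = tf.placeholder(tf.int32, [None])

src_emb = tf.get_variable('src_emb', [src_vocab, emb_size])
tgt_emb = tf.get_variable('tgt_emb', [tgt_vocab, emb_size])

# Encoder: stacked LSTM over the embedded source (--cell_type, --depth).
enc_cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.LSTMCell(hidden_units) for _ in range(depth)])
enc_out, enc_state = tf.nn.dynamic_rnn(
    enc_cell, tf.nn.embedding_lookup(src_emb, src_ids),
    sequence_length=src_len, dtype=tf.float32)

# Attention over encoder outputs (--attention_type), wrapped around the decoder cell.
attention = tf.contrib.seq2seq.LuongAttention(      # or BahdanauAttention
    hidden_units, memory=enc_out, memory_sequence_length=src_len)
dec_cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.LSTMCell(hidden_units) for _ in range(depth)])
dec_cell = tf.contrib.seq2seq.AttentionWrapper(
    dec_cell, attention, attention_layer_size=hidden_units)

# Training-time decoder: feed the gold previous token at each step.
batch = tf.shape(src_ids)[0]
init_state = dec_cell.zero_state(batch, tf.float32).clone(cell_state=enc_state)
helper = tf.contrib.seq2seq.TrainingHelper(
    tf.nn.embedding_lookup(tgt_emb, tgt_in_ids), tgt_len)
decoder = tf.contrib.seq2seq.BasicDecoder(
    dec_cell, helper, initial_state=init_state, output_layer=Dense(tgt_vocab))

outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
logits = outputs.rnn_output   # [batch, time, tgt_vocab] -> sequence loss
```

At decode time the same wrapped cell is reused with a GreedyEmbeddingHelper or a tf.contrib.seq2seq.BeamSearchDecoder, which is presumably what --beam_width switches between.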

## Acknowledgements

The implementation is based on the following projects:
- [nematus](https://github.com/rsennrich/nematus/): Theano implementation of Neural Machine Translation; the major reference for this project
- [subword-nmt](https://github.com/rsennrich/subword-nmt/): Included subword-unit scripts to preprocess input data
- [moses](https://github.com/moses-smt/mosesdecoder): Included preprocessing scripts to preprocess input data
- [tf.seq2seq_legacy](https://github.com/tensorflow/models/tree/master/tutorials/rnn/translate): Legacy TensorFlow seq2seq tutorial
- [tf_tutorial_plus](https://github.com/j-min/tf_tutorial_plus): Nice tutorials for the tf.contrib.seq2seq API

For any comments or feedback, please email me at pjh0308@gmail.com or open an issue here.
--------------------------------------------------------------------------------
/data/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jayparks/tf-seq2seq/e55d88ec21090c127d24da16b9e2b6b9aa894821/data/__init__.py
--------------------------------------------------------------------------------
/data/build_dictionary.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# Build a token-to-id vocabulary (JSON) for each input corpus:
# special tokens first, then words sorted by descending frequency.

import numpy
import json

import sys
import fileinput

from collections import OrderedDict
from data_utils import extra_tokens

def main():
    for filename in sys.argv[1:]:
        print 'Processing', filename
        word_freqs = OrderedDict()
        with open(filename, 'r') as f:
            for line in f:
                words_in = line.strip().split(' ')
                for w in words_in:
                    if w not in word_freqs:
                        word_freqs[w] = 0
                    word_freqs[w] += 1
        words = word_freqs.keys()
        freqs = word_freqs.values()

        sorted_idx = numpy.argsort(freqs)
        sorted_words = [words[ii] for ii in sorted_idx[::-1]]

        worddict = OrderedDict()
        for ii, ww in enumerate(extra_tokens):
            worddict[ww] = ii
        for ii, ww in enumerate(sorted_words):
            worddict[ww] = ii + len(extra_tokens)

        with open('%s.json' % filename, 'wb') as f:
            json.dump(worddict, f, indent=2, ensure_ascii=False)

        print 'Done'

if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/data/clean-corpus-n.perl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env perl
2 | #
3 | # This file is part of moses. Its use is licensed under the GNU Lesser General
4 | # Public License version 2.1 or, at your option, any later version.
5 |
6 | # $Id: clean-corpus-n.perl 3633 2010-10-21 09:49:27Z phkoehn $
7 | use warnings;
8 | use strict;
9 | use Getopt::Long;
10 | my $help;
11 | my $lc = 0; # lowercase the corpus?
12 | my $ignore_ratio = 0; 13 | my $ignore_xml = 0; 14 | my $enc = "utf8"; # encoding of the input and output files 15 | # set to anything else you wish, but I have not tested it yet 16 | my $max_word_length = 1000; # any segment with a word (or factor) exceeding this length in chars 17 | # is discarded; motivated by symal.cpp, which has its own such parameter (hardcoded to 1000) 18 | # and crashes if it encounters a word that exceeds it 19 | my $ratio = 9; 20 | 21 | GetOptions( 22 | "help" => \$help, 23 | "lowercase|lc" => \$lc, 24 | "encoding=s" => \$enc, 25 | "ratio=f" => \$ratio, 26 | "ignore-ratio" => \$ignore_ratio, 27 | "ignore-xml" => \$ignore_xml, 28 | "max-word-length|mwl=s" => \$max_word_length 29 | ) or exit(1); 30 | 31 | if (scalar(@ARGV) < 6 || $help) { 32 | print "syntax: clean-corpus-n.perl [-ratio n] corpus l1 l2 clean-corpus min max [lines retained file]\n"; 33 | exit; 34 | } 35 | 36 | my $corpus = $ARGV[0]; 37 | my $l1 = $ARGV[1]; 38 | my $l2 = $ARGV[2]; 39 | my $out = $ARGV[3]; 40 | my $min = $ARGV[4]; 41 | my $max = $ARGV[5]; 42 | 43 | my $linesRetainedFile = ""; 44 | if (scalar(@ARGV) > 6) { 45 | $linesRetainedFile = $ARGV[6]; 46 | open(LINES_RETAINED,">$linesRetainedFile") or die "Can't write $linesRetainedFile"; 47 | } 48 | 49 | print STDERR "clean-corpus.perl: processing $corpus.$l1 & .$l2 to $out, cutoff $min-$max, ratio $ratio\n"; 50 | 51 | my $opn = undef; 52 | my $l1input = "$corpus.$l1"; 53 | if (-e $l1input) { 54 | $opn = $l1input; 55 | } elsif (-e $l1input.".gz") { 56 | $opn = "gunzip -c $l1input.gz |"; 57 | } else { 58 | die "Error: $l1input does not exist"; 59 | } 60 | open(F,$opn) or die "Can't open '$opn'"; 61 | $opn = undef; 62 | my $l2input = "$corpus.$l2"; 63 | if (-e $l2input) { 64 | $opn = $l2input; 65 | } elsif (-e $l2input.".gz") { 66 | $opn = "gunzip -c $l2input.gz |"; 67 | } else { 68 | die "Error: $l2input does not exist"; 69 | } 70 | 71 | open(E,$opn) or die "Can't open '$opn'"; 72 | 73 | open(FO,">$out.$l1") or die "Can't write $out.$l1"; 74 | open(EO,">$out.$l2") or die "Can't write $out.$l2"; 75 | 76 | # necessary for proper lowercasing 77 | my $binmode; 78 | if ($enc eq "utf8") { 79 | $binmode = ":utf8"; 80 | } else { 81 | $binmode = ":encoding($enc)"; 82 | } 83 | binmode(F, $binmode); 84 | binmode(E, $binmode); 85 | binmode(FO, $binmode); 86 | binmode(EO, $binmode); 87 | 88 | my $innr = 0; 89 | my $outnr = 0; 90 | my $factored_flag; 91 | while(my $f = ) { 92 | $innr++; 93 | print STDERR "." if $innr % 10000 == 0; 94 | print STDERR "($innr)" if $innr % 100000 == 0; 95 | my $e = ; 96 | die "$corpus.$l2 is too short!" 
if !defined $e; 97 | chomp($e); 98 | chomp($f); 99 | if ($innr == 1) { 100 | $factored_flag = ($e =~ /\|/ || $f =~ /\|/); 101 | } 102 | 103 | #if lowercasing, lowercase 104 | if ($lc) { 105 | $e = lc($e); 106 | $f = lc($f); 107 | } 108 | 109 | $e =~ s/\|//g unless $factored_flag; 110 | $e =~ s/\s+/ /g; 111 | $e =~ s/^ //; 112 | $e =~ s/ $//; 113 | $f =~ s/\|//g unless $factored_flag; 114 | $f =~ s/\s+/ /g; 115 | $f =~ s/^ //; 116 | $f =~ s/ $//; 117 | next if $f eq ''; 118 | next if $e eq ''; 119 | 120 | my $ec = &word_count($e); 121 | my $fc = &word_count($f); 122 | next if $ec > $max; 123 | next if $fc > $max; 124 | next if $ec < $min; 125 | next if $fc < $min; 126 | next if !$ignore_ratio && $ec/$fc > $ratio; 127 | next if !$ignore_ratio && $fc/$ec > $ratio; 128 | # Skip this segment if any factor is longer than $max_word_length 129 | my $max_word_length_plus_one = $max_word_length + 1; 130 | next if $e =~ /[^\s\|]{$max_word_length_plus_one}/; 131 | next if $f =~ /[^\s\|]{$max_word_length_plus_one}/; 132 | 133 | # An extra check: none of the factors can be blank! 134 | die "There is a blank factor in $corpus.$l1 on line $innr: $f" 135 | if $f =~ /[ \|]\|/; 136 | die "There is a blank factor in $corpus.$l2 on line $innr: $e" 137 | if $e =~ /[ \|]\|/; 138 | 139 | $outnr++; 140 | print FO $f."\n"; 141 | print EO $e."\n"; 142 | 143 | if ($linesRetainedFile ne "") { 144 | print LINES_RETAINED $innr."\n"; 145 | } 146 | } 147 | 148 | if ($linesRetainedFile ne "") { 149 | close LINES_RETAINED; 150 | } 151 | 152 | print STDERR "\n"; 153 | my $e = ; 154 | die "$corpus.$l2 is too long!" if defined $e; 155 | 156 | print STDERR "Input sentences: $innr Output sentences: $outnr\n"; 157 | 158 | sub word_count { 159 | my ($line) = @_; 160 | if ($ignore_xml) { 161 | $line =~ s/<\S[^>]*\S>/ /g; 162 | $line =~ s/\s+/ /g; 163 | $line =~ s/^ //g; 164 | $line =~ s/ $//g; 165 | } 166 | my @w = split(/ /,$line); 167 | return scalar @w; 168 | } 169 | -------------------------------------------------------------------------------- /data/data_iterator.py: -------------------------------------------------------------------------------- 1 | 2 | import numpy as np 3 | import shuffle 4 | from util import load_dict 5 | 6 | import data_utils 7 | 8 | ''' 9 | Much of this code is based on the data_iterator.py of 10 | nematus project (https://github.com/rsennrich/nematus) 11 | ''' 12 | 13 | class TextIterator: 14 | """Simple Text iterator.""" 15 | def __init__(self, source, source_dict, 16 | batch_size=128, maxlen=None, 17 | n_words_source=-1, 18 | skip_empty=False, 19 | shuffle_each_epoch=False, 20 | sort_by_length=False, 21 | maxibatch_size=20, 22 | ): 23 | 24 | if shuffle_each_epoch: 25 | self.source_orig = source 26 | self.source = shuffle.main([self.source_orig], temporary=True) 27 | else: 28 | self.source = data_utils.fopen(source, 'r') 29 | 30 | self.source_dict = load_dict(source_dict) 31 | self.batch_size = batch_size 32 | self.maxlen = maxlen 33 | self.skip_empty = skip_empty 34 | 35 | self.n_words_source = n_words_source 36 | 37 | if self.n_words_source > 0: 38 | for key, idx in self.source_dict.items(): 39 | if idx >= self.n_words_source: 40 | del self.source_dict[key] 41 | 42 | self.shuffle = shuffle_each_epoch 43 | self.sort_by_length = sort_by_length 44 | 45 | self.shuffle = shuffle_each_epoch 46 | self.sort_by_length = sort_by_length 47 | 48 | self.source_buffer = [] 49 | self.k = batch_size * maxibatch_size 50 | 51 | self.end_of_data = False 52 | 53 | def __iter__(self): 54 | return self 55 | 56 | def 
__len__(self): 57 | return sum([1 for _ in self]) 58 | 59 | def reset(self): 60 | if self.shuffle: 61 | self.source = shuffle.main([self.source_orig], temporary=True) 62 | else: 63 | self.source.seek(0) 64 | 65 | def next(self): 66 | if self.end_of_data: 67 | self.end_of_data = False 68 | self.reset() 69 | raise StopIteration 70 | 71 | source = [] 72 | 73 | # fill buffer, if it's empty 74 | if len(self.source_buffer) == 0: 75 | for k_ in xrange(self.k): 76 | ss = self.source.readline() 77 | if ss == "": 78 | break 79 | self.source_buffer.append(ss.strip().split()) 80 | 81 | # sort by buffer 82 | if self.sort_by_length: 83 | slen = np.array([len(s) for s in self.source_buffer]) 84 | sidx = slen.argsort() 85 | 86 | _sbuf = [self.source_buffer[i] for i in sidx] 87 | 88 | self.source_buffer = _sbuf 89 | else: 90 | self.source_buffer.reverse() 91 | 92 | if len(self.source_buffer) == 0: 93 | self.end_of_data = False 94 | self.reset() 95 | raise StopIteration 96 | 97 | try: 98 | # actual work here 99 | while True: 100 | # read from source file and map to word index 101 | try: 102 | ss = self.source_buffer.pop() 103 | except IndexError: 104 | break 105 | ss = [self.source_dict[w] if w in self.source_dict 106 | else data_utils.unk_token for w in ss] 107 | 108 | if self.maxlen and len(ss) > self.maxlen: 109 | continue 110 | if self.skip_empty and (not ss): 111 | continue 112 | source.append(ss) 113 | 114 | if len(source) >= self.batch_size: 115 | break 116 | except IOError: 117 | self.end_of_data = True 118 | 119 | # all sentence pairs in maxibatch filtered out because of length 120 | if len(source) == 0: 121 | source = self.next() 122 | 123 | return source 124 | 125 | class BiTextIterator: 126 | """Simple Bitext iterator.""" 127 | def __init__(self, source, target, 128 | source_dict, target_dict, 129 | batch_size=128, 130 | maxlen=100, 131 | n_words_source=-1, 132 | n_words_target=-1, 133 | skip_empty=False, 134 | shuffle_each_epoch=False, 135 | sort_by_length=True, 136 | maxibatch_size=20): 137 | if shuffle_each_epoch: 138 | self.source_orig = source 139 | self.target_orig = target 140 | self.source, self.target = shuffle.main([self.source_orig, self.target_orig], temporary=True) 141 | else: 142 | self.source = data_utils.fopen(source, 'r') 143 | self.target = data_utils.fopen(target, 'r') 144 | 145 | self.source_dict = load_dict(source_dict) 146 | self.target_dict = load_dict(target_dict) 147 | 148 | self.batch_size = batch_size 149 | self.maxlen = maxlen 150 | self.skip_empty = skip_empty 151 | 152 | self.n_words_source = n_words_source 153 | self.n_words_target = n_words_target 154 | 155 | if self.n_words_source > 0: 156 | for key, idx in self.source_dict.items(): 157 | if idx >= self.n_words_source: 158 | del self.source_dict[key] 159 | 160 | if self.n_words_target > 0: 161 | for key, idx in self.target_dict.items(): 162 | if idx >= self.n_words_target: 163 | del self.target_dict[key] 164 | 165 | self.shuffle = shuffle_each_epoch 166 | self.sort_by_length = sort_by_length 167 | 168 | self.source_buffer = [] 169 | self.target_buffer = [] 170 | self.k = batch_size * maxibatch_size 171 | 172 | self.end_of_data = False 173 | 174 | def __iter__(self): 175 | return self 176 | 177 | def __len__(self): 178 | return sum([1 for _ in self]) 179 | 180 | def reset(self): 181 | if self.shuffle: 182 | self.source, self.target = shuffle.main([self.source_orig, self.target_orig], temporary=True) 183 | else: 184 | self.source.seek(0) 185 | self.target.seek(0) 186 | 187 | def next(self): 188 | if 
self.end_of_data: 189 | self.end_of_data = False 190 | self.reset() 191 | raise StopIteration 192 | 193 | source = [] 194 | target = [] 195 | 196 | # fill buffer, if it's empty 197 | assert len(self.source_buffer) == len(self.target_buffer), 'Buffer size mismatch!' 198 | 199 | if len(self.source_buffer) == 0: 200 | for k_ in xrange(self.k): 201 | ss = self.source.readline() 202 | if ss == "": 203 | break 204 | tt = self.target.readline() 205 | if tt == "": 206 | break 207 | self.source_buffer.append(ss.strip().split()) 208 | self.target_buffer.append(tt.strip().split()) 209 | 210 | # sort by target buffer 211 | if self.sort_by_length: 212 | tlen = np.array([len(t) for t in self.target_buffer]) 213 | tidx = tlen.argsort() 214 | 215 | _sbuf = [self.source_buffer[i] for i in tidx] 216 | _tbuf = [self.target_buffer[i] for i in tidx] 217 | 218 | self.source_buffer = _sbuf 219 | self.target_buffer = _tbuf 220 | 221 | else: 222 | self.source_buffer.reverse() 223 | self.target_buffer.reverse() 224 | 225 | if len(self.source_buffer) == 0 or len(self.target_buffer) == 0: 226 | self.end_of_data = False 227 | self.reset() 228 | raise StopIteration 229 | 230 | try: 231 | 232 | # actual work here 233 | while True: 234 | 235 | # read from source file and map to word index 236 | try: 237 | ss = self.source_buffer.pop() 238 | except IndexError: 239 | break 240 | ss = [self.source_dict[w] if w in self.source_dict 241 | else data_utils.unk_token for w in ss] 242 | 243 | # read from source file and map to word index 244 | tt = self.target_buffer.pop() 245 | tt = [self.target_dict[w] if w in self.target_dict 246 | else data_utils.unk_token for w in tt] 247 | if self.n_words_target > 0: 248 | tt = [w if w < self.n_words_target 249 | else data_utils.unk_token for w in tt] 250 | 251 | if self.maxlen: 252 | if len(ss) > self.maxlen and len(tt) > self.maxlen: 253 | continue 254 | if self.skip_empty and (not ss or not tt): 255 | continue 256 | 257 | source.append(ss) 258 | target.append(tt) 259 | 260 | if len(source) >= self.batch_size or \ 261 | len(target) >= self.batch_size: 262 | break 263 | except IOError: 264 | self.end_of_data = True 265 | 266 | # all sentence pairs in maxibatch filtered out because of length 267 | if len(source) == 0 or len(target) == 0: 268 | source, target = self.next() 269 | 270 | return source, target 271 | -------------------------------------------------------------------------------- /data/data_statistics.py: -------------------------------------------------------------------------------- 1 | 2 | import sys 3 | import numpy as np 4 | 5 | def main(argv): 6 | for input_file in argv: 7 | lengths = [] 8 | with open(input_file, 'r') as corpus: 9 | for line in corpus: 10 | lengths.append(len(line.split())) 11 | print("%s: size=%d, avg_length=%.2f, std=%.2f, min=%d, max=%d" 12 | % (input_file, len(lengths), np.mean(lengths), np.std(lengths), np.min(lengths), np.max(lengths))) 13 | 14 | 15 | if __name__ == "__main__": 16 | main(sys.argv[1:]) 17 | -------------------------------------------------------------------------------- /data/data_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | import gzip 4 | from util import load_dict 5 | 6 | # Extra vocabulary symbols 7 | _GO = '_GO' 8 | EOS = '_EOS' # also function as PAD 9 | UNK = '_UNK' 10 | 11 | extra_tokens = [_GO, EOS, UNK] 12 | 13 | start_token = extra_tokens.index(_GO) # start_token = 0 14 | end_token = extra_tokens.index(EOS) # end_token = 1 15 | unk_token = 
extra_tokens.index(UNK) 16 | 17 | 18 | def fopen(filename, mode='r'): 19 | if filename.endswith('.gz'): 20 | return gzip.open(filename, mode) 21 | return open(filename, mode) 22 | 23 | 24 | def load_inverse_dict(dict_path): 25 | orig_dict = load_dict(dict_path) 26 | idict = {} 27 | for words, idx in orig_dict.iteritems(): 28 | idict[idx] = words 29 | return idict 30 | 31 | 32 | def seq2words(seq, inverse_target_dictionary): 33 | words = [] 34 | for w in seq: 35 | if w == end_token: 36 | break 37 | if w in inverse_target_dictionary: 38 | words.append(inverse_target_dictionary[w]) 39 | else: 40 | words.append(UNK) 41 | return ' '.join(words) 42 | 43 | 44 | # batch preparation of a given sequence 45 | def prepare_batch(seqs_x, maxlen=None): 46 | # seqs_x: a list of sentences 47 | lengths_x = [len(s) for s in seqs_x] 48 | 49 | if maxlen is not None: 50 | new_seqs_x = [] 51 | new_lengths_x = [] 52 | for l_x, s_x in zip(lengths_x, seqs_x): 53 | if l_x <= maxlen: 54 | new_seqs_x.append(s_x) 55 | new_lengths_x.append(l_x) 56 | lengths_x = new_lengths_x 57 | seqs_x = new_seqs_x 58 | 59 | if len(lengths_x) < 1: 60 | return None, None 61 | 62 | batch_size = len(seqs_x) 63 | 64 | x_lengths = np.array(lengths_x) 65 | maxlen_x = np.max(x_lengths) 66 | 67 | x = np.ones((batch_size, maxlen_x)).astype('int32') * end_token 68 | 69 | for idx, s_x in enumerate(seqs_x): 70 | x[idx, :lengths_x[idx]] = s_x 71 | return x, x_lengths 72 | 73 | 74 | # batch preparation of a given sequence pair for training 75 | def prepare_train_batch(seqs_x, seqs_y, maxlen=None): 76 | # seqs_x, seqs_y: a list of sentences 77 | lengths_x = [len(s) for s in seqs_x] 78 | lengths_y = [len(s) for s in seqs_y] 79 | 80 | if maxlen is not None: 81 | new_seqs_x = [] 82 | new_seqs_y = [] 83 | new_lengths_x = [] 84 | new_lengths_y = [] 85 | for l_x, s_x, l_y, s_y in zip(lengths_x, seqs_x, lengths_y, seqs_y): 86 | if l_x <= maxlen and l_y <= maxlen: 87 | new_seqs_x.append(s_x) 88 | new_lengths_x.append(l_x) 89 | new_seqs_y.append(s_y) 90 | new_lengths_y.append(l_y) 91 | lengths_x = new_lengths_x 92 | seqs_x = new_seqs_x 93 | lengths_y = new_lengths_y 94 | seqs_y = new_seqs_y 95 | 96 | if len(lengths_x) < 1 or len(lengths_y) < 1: 97 | return None, None, None, None 98 | 99 | batch_size = len(seqs_x) 100 | 101 | x_lengths = np.array(lengths_x) 102 | y_lengths = np.array(lengths_y) 103 | 104 | maxlen_x = np.max(x_lengths) 105 | maxlen_y = np.max(y_lengths) 106 | 107 | x = np.ones((batch_size, maxlen_x)).astype('int32') * end_token 108 | y = np.ones((batch_size, maxlen_y)).astype('int32') * end_token 109 | 110 | for idx, [s_x, s_y] in enumerate(zip(seqs_x, seqs_y)): 111 | x[idx, :lengths_x[idx]] = s_x 112 | y[idx, :lengths_y[idx]] = s_y 113 | return x, x_lengths, y, y_lengths 114 | -------------------------------------------------------------------------------- /data/merge.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | SRC=$1 5 | TRG=$2 6 | 7 | FSRC=all_${1}-${2}.${1} 8 | FTRG=all_${1}-${2}.${2} 9 | 10 | echo "" > $FSRC 11 | for F in *${1}-${2}.${1} 12 | do 13 | if [ "$F" = "$FSRC" ]; then 14 | echo "pass" 15 | else 16 | cat $F >> $FSRC 17 | fi 18 | done 19 | 20 | 21 | echo "" > $FTRG 22 | for F in *${1}-${2}.${2} 23 | do 24 | if [ "$F" = "$FTRG" ]; then 25 | echo "pass" 26 | else 27 | cat $F >> $FTRG 28 | fi 29 | done 30 | -------------------------------------------------------------------------------- /data/multi-bleu.perl: 
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 5 | 6 | # $Id$ 7 | use warnings; 8 | use strict; 9 | 10 | my $lowercase = 0; 11 | if ($ARGV[0] eq "-lc") { 12 | $lowercase = 1; 13 | shift; 14 | } 15 | 16 | my $stem = $ARGV[0]; 17 | if (!defined $stem) { 18 | print STDERR "usage: multi-bleu.pl [-lc] reference < hypothesis\n"; 19 | print STDERR "Reads the references from reference or reference0, reference1, ...\n"; 20 | exit(1); 21 | } 22 | 23 | $stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0"; 24 | 25 | my @REF; 26 | my $ref=0; 27 | while(-e "$stem$ref") { 28 | &add_to_ref("$stem$ref",\@REF); 29 | $ref++; 30 | } 31 | &add_to_ref($stem,\@REF) if -e $stem; 32 | die("ERROR: could not find reference file $stem") unless scalar @REF; 33 | 34 | sub add_to_ref { 35 | my ($file,$REF) = @_; 36 | my $s=0; 37 | open(REF,$file) or die "Can't read $file"; 38 | while() { 39 | chop; 40 | push @{$$REF[$s++]}, $_; 41 | } 42 | close(REF); 43 | } 44 | 45 | my(@CORRECT,@TOTAL,$length_translation,$length_reference); 46 | my $s=0; 47 | while() { 48 | chop; 49 | $_ = lc if $lowercase; 50 | my @WORD = split; 51 | my %REF_NGRAM = (); 52 | my $length_translation_this_sentence = scalar(@WORD); 53 | my ($closest_diff,$closest_length) = (9999,9999); 54 | foreach my $reference (@{$REF[$s]}) { 55 | # print "$s $_ <=> $reference\n"; 56 | $reference = lc($reference) if $lowercase; 57 | my @WORD = split(' ',$reference); 58 | my $length = scalar(@WORD); 59 | my $diff = abs($length_translation_this_sentence-$length); 60 | if ($diff < $closest_diff) { 61 | $closest_diff = $diff; 62 | $closest_length = $length; 63 | # print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." = abs($length_translation_this_sentence-$length), setting len: $closest_length\n"; 64 | } elsif ($diff == $closest_diff) { 65 | $closest_length = $length if $length < $closest_length; 66 | # from two references with the same closeness to me 67 | # take the *shorter* into account, not the "first" one. 68 | } 69 | for(my $n=1;$n<=4;$n++) { 70 | my %REF_NGRAM_N = (); 71 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 72 | my $ngram = "$n"; 73 | for(my $w=0;$w<$n;$w++) { 74 | $ngram .= " ".$WORD[$start+$w]; 75 | } 76 | $REF_NGRAM_N{$ngram}++; 77 | } 78 | foreach my $ngram (keys %REF_NGRAM_N) { 79 | if (!defined($REF_NGRAM{$ngram}) || 80 | $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) { 81 | $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram}; 82 | # print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram}
\n"; 83 | } 84 | } 85 | } 86 | } 87 | $length_translation += $length_translation_this_sentence; 88 | $length_reference += $closest_length; 89 | for(my $n=1;$n<=4;$n++) { 90 | my %T_NGRAM = (); 91 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 92 | my $ngram = "$n"; 93 | for(my $w=0;$w<$n;$w++) { 94 | $ngram .= " ".$WORD[$start+$w]; 95 | } 96 | $T_NGRAM{$ngram}++; 97 | } 98 | foreach my $ngram (keys %T_NGRAM) { 99 | $ngram =~ /^(\d+) /; 100 | my $n = $1; 101 | # my $corr = 0; 102 | # print "$i e $ngram $T_NGRAM{$ngram}
\n"; 103 | $TOTAL[$n] += $T_NGRAM{$ngram}; 104 | if (defined($REF_NGRAM{$ngram})) { 105 | if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) { 106 | $CORRECT[$n] += $T_NGRAM{$ngram}; 107 | # $corr = $T_NGRAM{$ngram}; 108 | # print "$i e correct1 $T_NGRAM{$ngram}
\n"; 109 | } 110 | else { 111 | $CORRECT[$n] += $REF_NGRAM{$ngram}; 112 | # $corr = $REF_NGRAM{$ngram}; 113 | # print "$i e correct2 $REF_NGRAM{$ngram}
\n"; 114 | } 115 | } 116 | # $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram}; 117 | # print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n" 118 | } 119 | } 120 | $s++; 121 | } 122 | my $brevity_penalty = 1; 123 | my $bleu = 0; 124 | 125 | my @bleu=(); 126 | 127 | for(my $n=1;$n<=4;$n++) { 128 | if (defined ($TOTAL[$n])){ 129 | $bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0; 130 | # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n"; 131 | }else{ 132 | $bleu[$n]=0; 133 | } 134 | } 135 | 136 | if ($length_reference==0){ 137 | printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n"; 138 | exit(1); 139 | } 140 | 141 | if ($length_translation<$length_reference) { 142 | $brevity_penalty = exp(1-$length_reference/$length_translation); 143 | } 144 | $bleu = $brevity_penalty * exp((my_log( $bleu[1] ) + 145 | my_log( $bleu[2] ) + 146 | my_log( $bleu[3] ) + 147 | my_log( $bleu[4] ) ) / 4) ; 148 | printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n", 149 | 100*$bleu, 150 | 100*$bleu[1], 151 | 100*$bleu[2], 152 | 100*$bleu[3], 153 | 100*$bleu[4], 154 | $brevity_penalty, 155 | $length_translation / $length_reference, 156 | $length_translation, 157 | $length_reference; 158 | 159 | sub my_log { 160 | return -9999999999 unless $_[0]; 161 | return log($_[0]); 162 | } 163 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/README.txt: -------------------------------------------------------------------------------- 1 | The language suffix can be found here: 2 | 3 | http://www.loc.gov/standards/iso639-2/php/code_list.php 4 | 5 | This code includes data from Daniel Naber's Language Tools (czech abbreviations). 6 | This code includes data from czech wiktionary (also czech abbreviations). 7 | 8 | 9 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.ca: -------------------------------------------------------------------------------- 1 | Dr 2 | Dra 3 | pàg 4 | p 5 | c 6 | av 7 | Sr 8 | Sra 9 | adm 10 | esq 11 | Prof 12 | S.A 13 | S.L 14 | p.e 15 | ptes 16 | Sta 17 | St 18 | pl 19 | màx 20 | cast 21 | dir 22 | nre 23 | fra 24 | admdora 25 | Emm 26 | Excma 27 | espf 28 | dc 29 | admdor 30 | tel 31 | angl 32 | aprox 33 | ca 34 | dept 35 | dj 36 | dl 37 | dt 38 | ds 39 | dg 40 | dv 41 | ed 42 | entl 43 | al 44 | i.e 45 | maj 46 | smin 47 | n 48 | núm 49 | pta 50 | A 51 | B 52 | C 53 | D 54 | E 55 | F 56 | G 57 | H 58 | I 59 | J 60 | K 61 | L 62 | M 63 | N 64 | O 65 | P 66 | Q 67 | R 68 | S 69 | T 70 | U 71 | V 72 | W 73 | X 74 | Y 75 | Z 76 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.cs: -------------------------------------------------------------------------------- 1 | Bc 2 | BcA 3 | Ing 4 | Ing.arch 5 | MUDr 6 | MVDr 7 | MgA 8 | Mgr 9 | JUDr 10 | PhDr 11 | RNDr 12 | PharmDr 13 | ThLic 14 | ThDr 15 | Ph.D 16 | Th.D 17 | prof 18 | doc 19 | CSc 20 | DrSc 21 | dr. h. 
c 22 | PaedDr 23 | Dr 24 | PhMr 25 | DiS 26 | abt 27 | ad 28 | a.i 29 | aj 30 | angl 31 | anon 32 | apod 33 | atd 34 | atp 35 | aut 36 | bd 37 | biogr 38 | b.m 39 | b.p 40 | b.r 41 | cca 42 | cit 43 | cizojaz 44 | c.k 45 | col 46 | čes 47 | čín 48 | čj 49 | ed 50 | facs 51 | fasc 52 | fol 53 | fot 54 | franc 55 | h.c 56 | hist 57 | hl 58 | hrsg 59 | ibid 60 | il 61 | ind 62 | inv.č 63 | jap 64 | jhdt 65 | jv 66 | koed 67 | kol 68 | korej 69 | kl 70 | krit 71 | lat 72 | lit 73 | m.a 74 | maď 75 | mj 76 | mp 77 | násl 78 | např 79 | nepubl 80 | něm 81 | no 82 | nr 83 | n.s 84 | okr 85 | odd 86 | odp 87 | obr 88 | opr 89 | orig 90 | phil 91 | pl 92 | pokrač 93 | pol 94 | port 95 | pozn 96 | př.kr 97 | př.n.l 98 | přel 99 | přeprac 100 | příl 101 | pseud 102 | pt 103 | red 104 | repr 105 | resp 106 | revid 107 | rkp 108 | roč 109 | roz 110 | rozš 111 | samost 112 | sect 113 | sest 114 | seš 115 | sign 116 | sl 117 | srv 118 | stol 119 | sv 120 | šk 121 | šk.ro 122 | špan 123 | tab 124 | t.č 125 | tis 126 | tj 127 | tř 128 | tzv 129 | univ 130 | uspoř 131 | vol 132 | vl.jm 133 | vs 134 | vyd 135 | vyobr 136 | zal 137 | zejm 138 | zkr 139 | zprac 140 | zvl 141 | n.p 142 | např 143 | než 144 | MUDr 145 | abl 146 | absol 147 | adj 148 | adv 149 | ak 150 | ak. sl 151 | akt 152 | alch 153 | amer 154 | anat 155 | angl 156 | anglosas 157 | arab 158 | arch 159 | archit 160 | arg 161 | astr 162 | astrol 163 | att 164 | bás 165 | belg 166 | bibl 167 | biol 168 | boh 169 | bot 170 | bulh 171 | círk 172 | csl 173 | č 174 | čas 175 | čes 176 | dat 177 | děj 178 | dep 179 | dět 180 | dial 181 | dór 182 | dopr 183 | dosl 184 | ekon 185 | epic 186 | etnonym 187 | eufem 188 | f 189 | fam 190 | fem 191 | fil 192 | film 193 | form 194 | fot 195 | fr 196 | fut 197 | fyz 198 | gen 199 | geogr 200 | geol 201 | geom 202 | germ 203 | gram 204 | hebr 205 | herald 206 | hist 207 | hl 208 | hovor 209 | hud 210 | hut 211 | chcsl 212 | chem 213 | ie 214 | imp 215 | impf 216 | ind 217 | indoevr 218 | inf 219 | instr 220 | interj 221 | ión 222 | iron 223 | it 224 | kanad 225 | katalán 226 | klas 227 | kniž 228 | komp 229 | konj 230 | 231 | konkr 232 | kř 233 | kuch 234 | lat 235 | lék 236 | les 237 | lid 238 | lit 239 | liturg 240 | lok 241 | log 242 | m 243 | mat 244 | meteor 245 | metr 246 | mod 247 | ms 248 | mysl 249 | n 250 | náb 251 | námoř 252 | neklas 253 | něm 254 | nesklon 255 | nom 256 | ob 257 | obch 258 | obyč 259 | ojed 260 | opt 261 | part 262 | pas 263 | pejor 264 | pers 265 | pf 266 | pl 267 | plpf 268 | 269 | práv 270 | prep 271 | předl 272 | přivl 273 | r 274 | rcsl 275 | refl 276 | reg 277 | rkp 278 | ř 279 | řec 280 | s 281 | samohl 282 | sg 283 | sl 284 | souhl 285 | spec 286 | srov 287 | stfr 288 | střv 289 | stsl 290 | subj 291 | subst 292 | superl 293 | sv 294 | sz 295 | táz 296 | tech 297 | telev 298 | teol 299 | trans 300 | typogr 301 | var 302 | vedl 303 | verb 304 | vl. jm 305 | voj 306 | vok 307 | vůb 308 | vulg 309 | výtv 310 | vztaž 311 | zahr 312 | zájm 313 | zast 314 | zejm 315 | 316 | zeměd 317 | zkr 318 | zř 319 | mj 320 | dl 321 | atp 322 | sport 323 | Mgr 324 | horn 325 | MVDr 326 | JUDr 327 | RSDr 328 | Bc 329 | PhDr 330 | ThDr 331 | Ing 332 | aj 333 | apod 334 | PharmDr 335 | pomn 336 | ev 337 | slang 338 | nprap 339 | odp 340 | dop 341 | pol 342 | st 343 | stol 344 | p. n. l 345 | před n. l 346 | n. l 347 | př. Kr 348 | po Kr 349 | př. n. l 350 | odd 351 | RNDr 352 | tzv 353 | atd 354 | tzn 355 | resp 356 | tj 357 | p 358 | br 359 | č. j 360 | čj 361 | č. p 362 | čp 363 | a. 
s 364 | s. r. o 365 | spol. s r. o 366 | p. o 367 | s. p 368 | v. o. s 369 | k. s 370 | o. p. s 371 | o. s 372 | v. r 373 | v z 374 | ml 375 | vč 376 | kr 377 | mld 378 | hod 379 | popř 380 | ap 381 | event 382 | rus 383 | slov 384 | rum 385 | švýc 386 | P. T 387 | zvl 388 | hor 389 | dol 390 | S.O.S -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.de: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | #no german words end in single lower-case letters, so we throw those in too. 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | 61 | #Roman Numerals. A dot after one of these is not a sentence break in German. 62 | I 63 | II 64 | III 65 | IV 66 | V 67 | VI 68 | VII 69 | VIII 70 | IX 71 | X 72 | XI 73 | XII 74 | XIII 75 | XIV 76 | XV 77 | XVI 78 | XVII 79 | XVIII 80 | XIX 81 | XX 82 | i 83 | ii 84 | iii 85 | iv 86 | v 87 | vi 88 | vii 89 | viii 90 | ix 91 | x 92 | xi 93 | xii 94 | xiii 95 | xiv 96 | xv 97 | xvi 98 | xvii 99 | xviii 100 | xix 101 | xx 102 | 103 | #Titles and Honorifics 104 | Adj 105 | Adm 106 | Adv 107 | Asst 108 | Bart 109 | Bldg 110 | Brig 111 | Bros 112 | Capt 113 | Cmdr 114 | Col 115 | Comdr 116 | Con 117 | Corp 118 | Cpl 119 | DR 120 | Dr 121 | Ens 122 | Gen 123 | Gov 124 | Hon 125 | Hosp 126 | Insp 127 | Lt 128 | MM 129 | MR 130 | MRS 131 | MS 132 | Maj 133 | Messrs 134 | Mlle 135 | Mme 136 | Mr 137 | Mrs 138 | Ms 139 | Msgr 140 | Op 141 | Ord 142 | Pfc 143 | Ph 144 | Prof 145 | Pvt 146 | Rep 147 | Reps 148 | Res 149 | Rev 150 | Rt 151 | Sen 152 | Sens 153 | Sfc 154 | Sgt 155 | Sr 156 | St 157 | Supt 158 | Surg 159 | 160 | #Misc symbols 161 | Mio 162 | Mrd 163 | bzw 164 | v 165 | vs 166 | usw 167 | d.h 168 | z.B 169 | u.a 170 | etc 171 | Mrd 172 | MwSt 173 | ggf 174 | d.J 175 | D.h 176 | m.E 177 | vgl 178 | I.F 179 | z.T 180 | sogen 181 | ff 182 | u.E 183 | g.U 184 | g.g.A 185 | c.-à-d 186 | Buchst 187 | u.s.w 188 | sog 189 | u.ä 190 | Std 191 | evtl 192 | Zt 193 | Chr 194 | u.U 195 | o.ä 196 | Ltd 197 | b.A 198 | z.Zt 199 | spp 200 | sen 201 | SA 202 | k.o 203 | jun 204 | i.H.v 205 | dgl 206 | dergl 207 | Co 208 | zzt 209 | usf 210 | s.p.a 211 | Dkr 212 | Corp 213 | bzgl 214 | BSE 215 | 216 | #Number indicators 217 | # add #NUMERIC_ONLY# after the word if it should ONLY be non-breaking when a 0-9 digit follows it 218 | No 219 | Nos 220 | Art 221 | Nr 222 | pp 223 | ca 224 | Ca 225 | 226 | #Ordinals are done with . in German - "1." 
= "1st" in English 227 | 1 228 | 2 229 | 3 230 | 4 231 | 5 232 | 6 233 | 7 234 | 8 235 | 9 236 | 10 237 | 11 238 | 12 239 | 13 240 | 14 241 | 15 242 | 16 243 | 17 244 | 18 245 | 19 246 | 20 247 | 21 248 | 22 249 | 23 250 | 24 251 | 25 252 | 26 253 | 27 254 | 28 255 | 29 256 | 30 257 | 31 258 | 32 259 | 33 260 | 34 261 | 35 262 | 36 263 | 37 264 | 38 265 | 39 266 | 40 267 | 41 268 | 42 269 | 43 270 | 44 271 | 45 272 | 46 273 | 47 274 | 48 275 | 49 276 | 50 277 | 51 278 | 52 279 | 53 280 | 54 281 | 55 282 | 56 283 | 57 284 | 58 285 | 59 286 | 60 287 | 61 288 | 62 289 | 63 290 | 64 291 | 65 292 | 66 293 | 67 294 | 68 295 | 69 296 | 70 297 | 71 298 | 72 299 | 73 300 | 74 301 | 75 302 | 76 303 | 77 304 | 78 305 | 79 306 | 80 307 | 81 308 | 82 309 | 83 310 | 84 311 | 85 312 | 86 313 | 87 314 | 88 315 | 89 316 | 90 317 | 91 318 | 92 319 | 93 320 | 94 321 | 95 322 | 96 323 | 97 324 | 98 325 | 99 326 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.el: -------------------------------------------------------------------------------- 1 | # Sigle letters in upper-case are usually abbreviations of names 2 | Α 3 | Β 4 | Γ 5 | Δ 6 | Ε 7 | Ζ 8 | Η 9 | Θ 10 | Ι 11 | Κ 12 | Λ 13 | Μ 14 | Ν 15 | Ξ 16 | Ο 17 | Π 18 | Ρ 19 | Σ 20 | Τ 21 | Υ 22 | Φ 23 | Χ 24 | Ψ 25 | Ω 26 | 27 | # Includes abbreviations for the Greek language compiled from various sources (Greek grammar books, Greek language related web content). 28 | Άθαν 29 | Έγχρ 30 | Έκθ 31 | Έσδ 32 | Έφ 33 | Όμ 34 | Α΄Έσδρ 35 | Α΄Έσδ 36 | Α΄Βασ 37 | Α΄Θεσ 38 | Α΄Ιω 39 | Α΄Κορινθ 40 | Α΄Κορ 41 | Α΄Μακκ 42 | Α΄Μακ 43 | Α΄Πέτρ 44 | Α΄Πέτ 45 | Α΄Παραλ 46 | Α΄Πε 47 | Α΄Σαμ 48 | Α΄Τιμ 49 | Α΄Χρον 50 | Α΄Χρ 51 | Α.Β.Α 52 | Α.Β 53 | Α.Ε 54 | Α.Κ.Τ.Ο 55 | Αέθλ 56 | Αέτ 57 | Αίλ.Δ 58 | Αίλ.Τακτ 59 | Αίσ 60 | Αββακ 61 | Αβυδ 62 | Αβ 63 | Αγάκλ 64 | Αγάπ 65 | Αγάπ.Αμαρτ.Σ 66 | Αγάπ.Γεωπ 67 | Αγαθάγγ 68 | Αγαθήμ 69 | Αγαθιν 70 | Αγαθοκλ 71 | Αγαθρχ 72 | Αγαθ 73 | Αγαθ.Ιστ 74 | Αγαλλ 75 | Αγαπητ 76 | Αγγ 77 | Αγησ 78 | Αγλ 79 | Αγορ.Κ 80 | Αγρο.Κωδ 81 | Αγρ.Εξ 82 | Αγρ.Κ 83 | Αγ.Γρ 84 | Αδριαν 85 | Αδρ 86 | Αετ 87 | Αθάν 88 | Αθήν 89 | Αθήν.Επιγρ 90 | Αθήν.Επιτ 91 | Αθήν.Ιατρ 92 | Αθήν.Μηχ 93 | Αθανάσ 94 | Αθαν 95 | Αθηνί 96 | Αθηναγ 97 | Αθηνόδ 98 | Αθ 99 | Αθ.Αρχ 100 | Αιλ 101 | Αιλ.Επιστ 102 | Αιλ.ΖΙ 103 | Αιλ.ΠΙ 104 | Αιλ.απ 105 | Αιμιλ 106 | Αιν.Γαζ 107 | Αιν.Τακτ 108 | Αισχίν 109 | Αισχίν.Επιστ 110 | Αισχ 111 | Αισχ.Αγαμ 112 | Αισχ.Αγ 113 | Αισχ.Αλ 114 | Αισχ.Ελεγ 115 | Αισχ.Επτ.Θ 116 | Αισχ.Ευμ 117 | Αισχ.Ικέτ 118 | Αισχ.Ικ 119 | Αισχ.Περσ 120 | Αισχ.Προμ.Δεσμ 121 | Αισχ.Πρ 122 | Αισχ.Χοηφ 123 | Αισχ.Χο 124 | Αισχ.απ 125 | ΑιτΕ 126 | Αιτ 127 | Αλκ 128 | Αλχιας 129 | Αμ.Π.Ο 130 | Αμβ 131 | Αμμών 132 | Αμ. 
133 | Αν.Πειθ.Συμβ.Δικ 134 | Ανακρ 135 | Ανακ 136 | Αναμν.Τόμ 137 | Αναπλ 138 | Ανδ 139 | Ανθλγος 140 | Ανθστης 141 | Αντισθ 142 | Ανχης 143 | Αν 144 | Αποκ 145 | Απρ 146 | Απόδ 147 | Απόφ 148 | Απόφ.Νομ 149 | Απ 150 | Απ.Δαπ 151 | Απ.Διατ 152 | Απ.Επιστ 153 | Αριθ 154 | Αριστοτ 155 | Αριστοφ 156 | Αριστοφ.Όρν 157 | Αριστοφ.Αχ 158 | Αριστοφ.Βάτρ 159 | Αριστοφ.Ειρ 160 | Αριστοφ.Εκκλ 161 | Αριστοφ.Θεσμ 162 | Αριστοφ.Ιππ 163 | Αριστοφ.Λυσ 164 | Αριστοφ.Νεφ 165 | Αριστοφ.Πλ 166 | Αριστοφ.Σφ 167 | Αριστ 168 | Αριστ.Αθ.Πολ 169 | Αριστ.Αισθ 170 | Αριστ.Αν.Πρ 171 | Αριστ.Ζ.Ι 172 | Αριστ.Ηθ.Ευδ 173 | Αριστ.Ηθ.Νικ 174 | Αριστ.Κατ 175 | Αριστ.Μετ 176 | Αριστ.Πολ 177 | Αριστ.Φυσιογν 178 | Αριστ.Φυσ 179 | Αριστ.Ψυχ 180 | Αριστ.Ρητ 181 | Αρμεν 182 | Αρμ 183 | Αρχ.Εκ.Καν.Δ 184 | Αρχ.Ευβ.Μελ 185 | Αρχ.Ιδ.Δ 186 | Αρχ.Νομ 187 | Αρχ.Ν 188 | Αρχ.Π.Ε 189 | Αρ 190 | Αρ.Φορ.Μητρ 191 | Ασμ 192 | Ασμ.ασμ 193 | Αστ.Δ 194 | Αστ.Χρον 195 | Ασ 196 | Ατομ.Γνωμ 197 | Αυγ 198 | Αφρ 199 | Αχ.Νομ 200 | Α 201 | Α.Εγχ.Π 202 | Α.Κ.΄Υδρας 203 | Β΄Έσδρ 204 | Β΄Έσδ 205 | Β΄Βασ 206 | Β΄Θεσ 207 | Β΄Ιω 208 | Β΄Κορινθ 209 | Β΄Κορ 210 | Β΄Μακκ 211 | Β΄Μακ 212 | Β΄Πέτρ 213 | Β΄Πέτ 214 | Β΄Πέ 215 | Β΄Παραλ 216 | Β΄Σαμ 217 | Β΄Τιμ 218 | Β΄Χρον 219 | Β΄Χρ 220 | Β.Ι.Π.Ε 221 | Β.Κ.Τ 222 | Β.Κ.Ψ.Β 223 | Β.Μ 224 | Β.Ο.Α.Κ 225 | Β.Ο.Α 226 | Β.Ο.Δ 227 | Βίβλ 228 | Βαρ 229 | ΒεΘ 230 | Βι.Περ 231 | Βιπερ 232 | Βιργ 233 | Βλγ 234 | Βούλ 235 | Βρ 236 | Γ΄Βασ 237 | Γ΄Μακκ 238 | ΓΕΝμλ 239 | Γέν 240 | Γαλ 241 | Γεν 242 | Γλ 243 | Γν.Ν.Σ.Κρ 244 | Γνωμ 245 | Γν 246 | Γράμμ 247 | Γρηγ.Ναζ 248 | Γρηγ.Νύσ 249 | Γ Νοσ 250 | Γ' Ογκολ 251 | Γ.Ν 252 | Δ΄Βασ 253 | Δ.Β 254 | Δ.Δίκη 255 | Δ.Δίκ 256 | Δ.Ε.Σ 257 | Δ.Ε.Φ.Α 258 | Δ.Ε.Φ 259 | Δ.Εργ.Ν 260 | Δαμ 261 | Δαμ.μνημ.έργ 262 | Δαν 263 | Δασ.Κ 264 | Δεκ 265 | Δελτ.Δικ.Ε.Τ.Ε 266 | Δελτ.Νομ 267 | Δελτ.Συνδ.Α.Ε 268 | Δερμ 269 | Δευτ 270 | Δεύτ 271 | Δημοσθ 272 | Δημόκρ 273 | Δι.Δικ 274 | Διάτ 275 | Διαιτ.Απ 276 | Διαιτ 277 | Διαρκ.Στρατ 278 | Δικ 279 | Διοίκ.Πρωτ 280 | ΔιοικΔνη 281 | Διοικ.Εφ 282 | Διον.Αρ 283 | Διόρθ.Λαθ 284 | Δ.κ.Π 285 | Δνη 286 | Δν 287 | Δογμ.Όρος 288 | Δρ 289 | Δ.τ.Α 290 | Δτ 291 | ΔωδΝομ 292 | Δ.Περ 293 | Δ.Στρ 294 | ΕΔΠολ 295 | ΕΕυρΚ 296 | ΕΙΣ 297 | ΕΝαυτΔ 298 | ΕΣΑμΕΑ 299 | ΕΣΘ 300 | ΕΣυγκΔ 301 | ΕΤρΑξΧρΔ 302 | Ε.Φ.Ε.Τ 303 | Ε.Φ.Ι 304 | Ε.Φ.Ο.Επ.Α 305 | Εβδ 306 | Εβρ 307 | Εγκύκλ.Επιστ 308 | Εγκ 309 | Εε.Αιγ 310 | Εθν.Κ.Τ 311 | Εθν 312 | Ειδ.Δικ.Αγ.Κακ 313 | Εικ 314 | Ειρ.Αθ 315 | Ειρην.Αθ 316 | Ειρην 317 | Έλεγχ 318 | Ειρ 319 | Εισ.Α.Π 320 | Εισ.Ε 321 | Εισ.Ν.Α.Κ 322 | Εισ.Ν.Κ.Πολ.Δ 323 | Εισ.Πρωτ 324 | Εισηγ.Έκθ 325 | Εισ 326 | Εκκλ 327 | Εκκ 328 | Εκ 329 | Ελλ.Δνη 330 | Εν.Ε 331 | Εξ 332 | Επ.Αν 333 | Επ.Εργ.Δ 334 | Επ.Εφ 335 | Επ.Κυπ.Δ 336 | Επ.Μεσ.Αρχ 337 | Επ.Νομ 338 | Επίκτ 339 | Επίκ 340 | Επι.Δ.Ε 341 | Επιθ.Ναυτ.Δικ 342 | Επικ 343 | Επισκ.Ε.Δ 344 | Επισκ.Εμπ.Δικ 345 | Επιστ.Επετ.Αρμ 346 | Επιστ.Επετ 347 | Επιστ.Ιερ 348 | Επιτρ.Προστ.Συνδ.Στελ 349 | Επιφάν 350 | Επτ.Εφ 351 | Επ.Ιρ 352 | Επ.Ι 353 | Εργ.Ασφ.Νομ 354 | Ερμ.Α.Κ 355 | Ερμη.Σ 356 | Εσθ 357 | Εσπερ 358 | Ετρ.Δ 359 | Ευκλ 360 | Ευρ.Δ.Δ.Α 361 | Ευρ.Σ.Δ.Α 362 | Ευρ.ΣτΕ 363 | Ευρατόμ 364 | Ευρ.Άλκ 365 | Ευρ.Ανδρομ 366 | Ευρ.Βάκχ 367 | Ευρ.Εκ 368 | Ευρ.Ελ 369 | Ευρ.Ηλ 370 | Ευρ.Ηρακ 371 | Ευρ.Ηρ 372 | Ευρ.Ηρ.Μαιν 373 | Ευρ.Ικέτ 374 | Ευρ.Ιππόλ 375 | Ευρ.Ιφ.Α 376 | Ευρ.Ιφ.Τ 377 | Ευρ.Ι.Τ 378 | Ευρ.Κύκλ 379 | Ευρ.Μήδ 380 | Ευρ.Ορ 381 | Ευρ.Ρήσ 382 | Ευρ.Τρωάδ 383 | Ευρ.Φοίν 384 | Εφ.Αθ 385 | Εφ.Εν 386 | Εφ.Επ 387 | Εφ.Θρ 388 | Εφ.Θ 389 | Εφ.Ι 390 | Εφ.Κερ 391 | Εφ.Κρ 392 | Εφ.Λ 393 | Εφ.Ν 394 | Εφ.Πατ 395 | Εφ.Πειρ 396 | 
Εφαρμ.Δ.Δ 397 | Εφαρμ 398 | Εφεσ 399 | Εφημ 400 | Εφ 401 | Ζαχ 402 | Ζιγ 403 | Ζυ 404 | Ζχ 405 | ΗΕ.Δ 406 | Ημερ 407 | Ηράκλ 408 | Ηροδ 409 | Ησίοδ 410 | Ησ 411 | Η.Ε.Γ 412 | ΘΗΣ 413 | ΘΡ 414 | Θαλ 415 | Θεοδ 416 | Θεοφ 417 | Θεσ 418 | Θεόδ.Μοψ 419 | Θεόκρ 420 | Θεόφιλ 421 | Θουκ 422 | Θρ 423 | Θρ.Ε 424 | Θρ.Ιερ 425 | Θρ.Ιρ 426 | Ιακ 427 | Ιαν 428 | Ιβ 429 | Ιδθ 430 | Ιδ 431 | Ιεζ 432 | Ιερ 433 | Ιζ 434 | Ιησ 435 | Ιησ.Ν 436 | Ικ 437 | Ιλ 438 | Ιν 439 | Ιουδ 440 | Ιουστ 441 | Ιούδα 442 | Ιούλ 443 | Ιούν 444 | Ιπποκρ 445 | Ιππόλ 446 | Ιρ 447 | Ισίδ.Πηλ 448 | Ισοκρ 449 | Ισ.Ν 450 | Ιωβ 451 | Ιωλ 452 | Ιων 453 | Ιω 454 | ΚΟΣ 455 | ΚΟ.ΜΕ.ΚΟΝ 456 | ΚΠοινΔ 457 | ΚΠολΔ 458 | ΚαΒ 459 | Καλ 460 | Καλ.Τέχν 461 | ΚανΒ 462 | Καν.Διαδ 463 | Κατάργ 464 | Κλ 465 | ΚοινΔ 466 | Κολσ 467 | Κολ 468 | Κον 469 | Κορ 470 | Κος 471 | ΚριτΕπιθ 472 | ΚριτΕ 473 | Κριτ 474 | Κρ 475 | ΚτΒ 476 | ΚτΕ 477 | ΚτΠ 478 | Κυβ 479 | Κυπρ 480 | Κύριλ.Αλεξ 481 | Κύριλ.Ιερ 482 | Λεβ 483 | Λεξ.Σουίδα 484 | Λευϊτ 485 | Λευ 486 | Λκ 487 | Λογ 488 | ΛουκΑμ 489 | Λουκιαν 490 | Λουκ.Έρωτ 491 | Λουκ.Ενάλ.Διάλ 492 | Λουκ.Ερμ 493 | Λουκ.Εταιρ.Διάλ 494 | Λουκ.Ε.Δ 495 | Λουκ.Θε.Δ 496 | Λουκ.Ικ. 497 | Λουκ.Ιππ 498 | Λουκ.Λεξιφ 499 | Λουκ.Μεν 500 | Λουκ.Μισθ.Συν 501 | Λουκ.Ορχ 502 | Λουκ.Περ 503 | Λουκ.Συρ 504 | Λουκ.Τοξ 505 | Λουκ.Τυρ 506 | Λουκ.Φιλοψ 507 | Λουκ.Φιλ 508 | Λουκ.Χάρ 509 | Λουκ. 510 | Λουκ.Αλ 511 | Λοχ 512 | Λυδ 513 | Λυκ 514 | Λυσ 515 | Λωζ 516 | Λ1 517 | Λ2 518 | ΜΟΕφ 519 | Μάρκ 520 | Μέν 521 | Μαλ 522 | Ματθ 523 | Μα 524 | Μιχ 525 | Μκ 526 | Μλ 527 | Μμ 528 | Μον.Δ.Π 529 | Μον.Πρωτ 530 | Μον 531 | Μρ 532 | Μτ 533 | Μχ 534 | Μ.Βασ 535 | Μ.Πλ 536 | ΝΑ 537 | Ναυτ.Χρον 538 | Να 539 | Νδικ 540 | Νεεμ 541 | Νε 542 | Νικ 543 | ΝκΦ 544 | Νμ 545 | ΝοΒ 546 | Νομ.Δελτ.Τρ.Ελ 547 | Νομ.Δελτ 548 | Νομ.Σ.Κ 549 | Νομ.Χρ 550 | Νομ 551 | Νομ.Διεύθ 552 | Νοσ 553 | Ντ 554 | Νόσων 555 | Ν1 556 | Ν2 557 | Ν3 558 | Ν4 559 | Νtot 560 | Ξενοφ 561 | Ξεν 562 | Ξεν.Ανάβ 563 | Ξεν.Απολ 564 | Ξεν.Απομν 565 | Ξεν.Απομ 566 | Ξεν.Ελλ 567 | Ξεν.Ιέρ 568 | Ξεν.Ιππαρχ 569 | Ξεν.Ιππ 570 | Ξεν.Κυρ.Αν 571 | Ξεν.Κύρ.Παιδ 572 | Ξεν.Κ.Π 573 | Ξεν.Λακ.Πολ 574 | Ξεν.Οικ 575 | Ξεν.Προσ 576 | Ξεν.Συμπόσ 577 | Ξεν.Συμπ 578 | Ο΄ 579 | Οβδ 580 | Οβ 581 | ΟικΕ 582 | Οικ 583 | Οικ.Πατρ 584 | Οικ.Σύν.Βατ 585 | Ολομ 586 | Ολ 587 | Ολ.Α.Π 588 | Ομ.Ιλ 589 | Ομ.Οδ 590 | ΟπΤοιχ 591 | Οράτ 592 | Ορθ 593 | ΠΡΟ.ΠΟ 594 | Πίνδ 595 | Πίνδ.Ι 596 | Πίνδ.Νεμ 597 | Πίνδ.Ν 598 | Πίνδ.Ολ 599 | Πίνδ.Παθ 600 | Πίνδ.Πυθ 601 | Πίνδ.Π 602 | ΠαγΝμλγ 603 | Παν 604 | Παρμ 605 | Παροιμ 606 | Παρ 607 | Παυσ 608 | Πειθ.Συμβ 609 | ΠειρΝ 610 | Πελ 611 | ΠεντΣτρ 612 | Πεντ 613 | Πεντ.Εφ 614 | ΠερΔικ 615 | Περ.Γεν.Νοσ 616 | Πετ 617 | Πλάτ 618 | Πλάτ.Αλκ 619 | Πλάτ.Αντ 620 | Πλάτ.Αξίοχ 621 | Πλάτ.Απόλ 622 | Πλάτ.Γοργ 623 | Πλάτ.Ευθ 624 | Πλάτ.Θεαίτ 625 | Πλάτ.Κρατ 626 | Πλάτ.Κριτ 627 | Πλάτ.Λύσ 628 | Πλάτ.Μεν 629 | Πλάτ.Νόμ 630 | Πλάτ.Πολιτ 631 | Πλάτ.Πολ 632 | Πλάτ.Πρωτ 633 | Πλάτ.Σοφ. 
634 | Πλάτ.Συμπ 635 | Πλάτ.Τίμ 636 | Πλάτ.Φαίδρ 637 | Πλάτ.Φιλ 638 | Πλημ 639 | Πλούτ 640 | Πλούτ.Άρατ 641 | Πλούτ.Αιμ 642 | Πλούτ.Αλέξ 643 | Πλούτ.Αλκ 644 | Πλούτ.Αντ 645 | Πλούτ.Αρτ 646 | Πλούτ.Ηθ 647 | Πλούτ.Θεμ 648 | Πλούτ.Κάμ 649 | Πλούτ.Καίσ 650 | Πλούτ.Κικ 651 | Πλούτ.Κράσ 652 | Πλούτ.Κ 653 | Πλούτ.Λυκ 654 | Πλούτ.Μάρκ 655 | Πλούτ.Μάρ 656 | Πλούτ.Περ 657 | Πλούτ.Ρωμ 658 | Πλούτ.Σύλλ 659 | Πλούτ.Φλαμ 660 | Πλ 661 | Ποιν.Δικ 662 | Ποιν.Δ 663 | Ποιν.Ν 664 | Ποιν.Χρον 665 | Ποιν.Χρ 666 | Πολ.Δ 667 | Πολ.Πρωτ 668 | Πολ 669 | Πολ.Μηχ 670 | Πολ.Μ 671 | Πρακτ.Αναθ 672 | Πρακτ.Ολ 673 | Πραξ 674 | Πρμ 675 | Πρξ 676 | Πρωτ 677 | Πρ 678 | Πρ.Αν 679 | Πρ.Λογ 680 | Πταισμ 681 | Πυρ.Καλ 682 | Πόλη 683 | Π.Δ 684 | Π.Δ.Άσμ 685 | ΡΜ.Ε 686 | Ρθ 687 | Ρμ 688 | Ρωμ 689 | ΣΠλημ 690 | Σαπφ 691 | Σειρ 692 | Σολ 693 | Σοφ 694 | Σοφ.Αντιγ 695 | Σοφ.Αντ 696 | Σοφ.Αποσ 697 | Σοφ.Απ 698 | Σοφ.Ηλέκ 699 | Σοφ.Ηλ 700 | Σοφ.Οιδ.Κολ 701 | Σοφ.Οιδ.Τύρ 702 | Σοφ.Ο.Τ 703 | Σοφ.Σειρ 704 | Σοφ.Σολ 705 | Σοφ.Τραχ 706 | Σοφ.Φιλοκτ 707 | Σρ 708 | Σ.τ.Ε 709 | Σ.τ.Π 710 | Στρ.Π.Κ 711 | Στ.Ευρ 712 | Συζήτ 713 | Συλλ.Νομολ 714 | Συλ.Νομ 715 | ΣυμβΕπιθ 716 | Συμπ.Ν 717 | Συνθ.Αμ 718 | Συνθ.Ε.Ε 719 | Συνθ.Ε.Κ 720 | Συνθ.Ν 721 | Σφν 722 | Σφ 723 | Σφ.Σλ 724 | Σχ.Πολ.Δ 725 | Σχ.Συντ.Ε 726 | Σωσ 727 | Σύντ 728 | Σ.Πληρ 729 | ΤΘ 730 | ΤΣ.Δ 731 | Τίτ 732 | Τβ 733 | Τελ.Ενημ 734 | Τελ.Κ 735 | Τερτυλ 736 | Τιμ 737 | Τοπ.Α 738 | Τρ.Ο 739 | Τριμ 740 | Τριμ.Πλ 741 | Τρ.Πλημ 742 | Τρ.Π.Δ 743 | Τ.τ.Ε 744 | Ττ 745 | Τωβ 746 | Υγ 747 | Υπερ 748 | Υπ 749 | Υ.Γ 750 | Φιλήμ 751 | Φιλιπ 752 | Φιλ 753 | Φλμ 754 | Φλ 755 | Φορ.Β 756 | Φορ.Δ.Ε 757 | Φορ.Δνη 758 | Φορ.Δ 759 | Φορ.Επ 760 | Φώτ 761 | Χρ.Ι.Δ 762 | Χρ.Ιδ.Δ 763 | Χρ.Ο 764 | Χρυσ 765 | Ψήφ 766 | Ψαλμ 767 | Ψαλ 768 | Ψλ 769 | Ωριγ 770 | Ωσ 771 | Ω.Ρ.Λ 772 | άγν 773 | άγν.ετυμολ 774 | άγ 775 | άκλ 776 | άνθρ 777 | άπ 778 | άρθρ 779 | άρν 780 | άρ 781 | άτ 782 | άψ 783 | ά 784 | έκδ 785 | έκφρ 786 | έμψ 787 | ένθ.αν 788 | έτ 789 | έ.α 790 | ίδ 791 | αβεστ 792 | αβησσ 793 | αγγλ 794 | αγγ 795 | αδημ 796 | αεροναυτ 797 | αερον 798 | αεροπ 799 | αθλητ 800 | αθλ 801 | αθροιστ 802 | αιγυπτ 803 | αιγ 804 | αιτιολ 805 | αιτ 806 | αι 807 | ακαδ 808 | ακκαδ 809 | αλβ 810 | αλλ 811 | αλφαβητ 812 | αμα 813 | αμερικ 814 | αμερ 815 | αμετάβ 816 | αμτβ 817 | αμφιβ 818 | αμφισβ 819 | αμφ 820 | αμ 821 | ανάλ 822 | ανάπτ 823 | ανάτ 824 | αναβ 825 | αναδαν 826 | αναδιπλασ 827 | αναδιπλ 828 | αναδρ 829 | αναλ 830 | αναν 831 | ανασυλλ 832 | ανατολ 833 | ανατομ 834 | ανατυπ 835 | ανατ 836 | αναφορ 837 | αναφ 838 | ανα.ε 839 | ανδρων 840 | ανθρωπολ 841 | ανθρωπ 842 | ανθ 843 | ανομ 844 | αντίτ 845 | αντδ 846 | αντιγρ 847 | αντιθ 848 | αντικ 849 | αντιμετάθ 850 | αντων 851 | αντ 852 | ανωτ 853 | ανόργ 854 | ανών 855 | αορ 856 | απαρέμφ 857 | απαρφ 858 | απαρχ 859 | απαρ 860 | απλολ 861 | απλοπ 862 | αποβ 863 | αποηχηροπ 864 | αποθ 865 | αποκρυφ 866 | αποφ 867 | απρμφ 868 | απρφ 869 | απρόσ 870 | απόδ 871 | απόλ 872 | απόσπ 873 | απόφ 874 | αραβοτουρκ 875 | αραβ 876 | αραμ 877 | αρβαν 878 | αργκ 879 | αριθμτ 880 | αριθμ 881 | αριθ 882 | αρκτικόλ 883 | αρκ 884 | αρμεν 885 | αρμ 886 | αρνητ 887 | αρσ 888 | αρχαιολ 889 | αρχιτεκτ 890 | αρχιτ 891 | αρχκ 892 | αρχ 893 | αρωμουν 894 | αρωμ 895 | αρ 896 | αρ.μετρ 897 | αρ.φ 898 | ασσυρ 899 | αστρολ 900 | αστροναυτ 901 | αστρον 902 | αττ 903 | αυστραλ 904 | αυτοπ 905 | αυτ 906 | αφγαν 907 | αφηρ 908 | αφομ 909 | αφρικ 910 | αχώρ 911 | αόρ 912 | α.α 913 | α/α 914 | α0 915 | βαθμ 916 | βαθ 917 | βαπτ 918 | βασκ 919 | βεβαιωτ 920 | βεβ 921 | βεδ 922 | βενετ 923 | βεν 924 | 
βερβερ 925 | βιβλγρ 926 | βιολ 927 | βιομ 928 | βιοχημ 929 | βιοχ 930 | βλάχ 931 | βλ 932 | βλ.λ 933 | βοταν 934 | βοτ 935 | βουλγαρ 936 | βουλγ 937 | βούλ 938 | βραζιλ 939 | βρετον 940 | βόρ 941 | γαλλ 942 | γενικότ 943 | γενοβ 944 | γεν 945 | γερμαν 946 | γερμ 947 | γεωγρ 948 | γεωλ 949 | γεωμετρ 950 | γεωμ 951 | γεωπ 952 | γεωργ 953 | γλυπτ 954 | γλωσσολ 955 | γλωσσ 956 | γλ 957 | γνμδ 958 | γνμ 959 | γνωμ 960 | γοτθ 961 | γραμμ 962 | γραμ 963 | γρμ 964 | γρ 965 | γυμν 966 | δίδες 967 | δίκ 968 | δίφθ 969 | δαν 970 | δεικτ 971 | δεκατ 972 | δηλ 973 | δημογρ 974 | δημοτ 975 | δημώδ 976 | δημ 977 | διάγρ 978 | διάκρ 979 | διάλεξ 980 | διάλ 981 | διάσπ 982 | διαλεκτ 983 | διατρ 984 | διαφ 985 | διαχ 986 | διδα 987 | διεθν 988 | διεθ 989 | δικον 990 | διστ 991 | δισύλλ 992 | δισ 993 | διφθογγοπ 994 | δογμ 995 | δολ 996 | δοτ 997 | δρμ 998 | δρχ 999 | δρ(α) 1000 | δωρ 1001 | δ 1002 | εβρ 1003 | εγκλπ 1004 | εδ 1005 | εθνολ 1006 | εθν 1007 | ειδικότ 1008 | ειδ 1009 | ειδ.β 1010 | εικ 1011 | ειρ 1012 | εισ 1013 | εκατοστμ 1014 | εκατοστ 1015 | εκατστ.2 1016 | εκατστ.3 1017 | εκατ 1018 | εκδ 1019 | εκκλησ 1020 | εκκλ 1021 | εκ 1022 | ελλην 1023 | ελλ 1024 | ελνστ 1025 | ελπ 1026 | εμβ 1027 | εμφ 1028 | εναλλ 1029 | ενδ 1030 | ενεργ 1031 | ενεστ 1032 | ενικ 1033 | ενν 1034 | εν 1035 | εξέλ 1036 | εξακολ 1037 | εξομάλ 1038 | εξ 1039 | εο 1040 | επέκτ 1041 | επίδρ 1042 | επίθ 1043 | επίρρ 1044 | επίσ 1045 | επαγγελμ 1046 | επανάλ 1047 | επανέκδ 1048 | επιθ 1049 | επικ 1050 | επιμ 1051 | επιρρ 1052 | επιστ 1053 | επιτατ 1054 | επιφ 1055 | επών 1056 | επ 1057 | εργ 1058 | ερμ 1059 | ερρινοπ 1060 | ερωτ 1061 | ετρουσκ 1062 | ετυμ 1063 | ετ 1064 | ευφ 1065 | ευχετ 1066 | εφ 1067 | εύχρ 1068 | ε.α 1069 | ε/υ 1070 | ε0 1071 | ζωγρ 1072 | ζωολ 1073 | ηθικ 1074 | ηθ 1075 | ηλεκτρολ 1076 | ηλεκτρον 1077 | ηλεκτρ 1078 | ημίτ 1079 | ημίφ 1080 | ημιφ 1081 | ηχηροπ 1082 | ηχηρ 1083 | ηχομιμ 1084 | ηχ 1085 | η 1086 | θέατρ 1087 | θεολ 1088 | θετ 1089 | θηλ 1090 | θρακ 1091 | θρησκειολ 1092 | θρησκ 1093 | θ 1094 | ιαπων 1095 | ιατρ 1096 | ιδιωμ 1097 | ιδ 1098 | ινδ 1099 | ιραν 1100 | ισπαν 1101 | ιστορ 1102 | ιστ 1103 | ισχυροπ 1104 | ιταλ 1105 | ιχθυολ 1106 | ιων 1107 | κάτ 1108 | καθ 1109 | κακοσ 1110 | καν 1111 | καρ 1112 | κατάλ 1113 | κατατ 1114 | κατωτ 1115 | κατ 1116 | κα 1117 | κελτ 1118 | κεφ 1119 | κινεζ 1120 | κινημ 1121 | κλητ 1122 | κλιτ 1123 | κλπ 1124 | κλ 1125 | κν 1126 | κοινωνιολ 1127 | κοινων 1128 | κοπτ 1129 | κουτσοβλαχ 1130 | κουτσοβλ 1131 | κπ 1132 | κρ.γν 1133 | κτγ 1134 | κτην 1135 | κτητ 1136 | κτλ 1137 | κτ 1138 | κυριολ 1139 | κυρ 1140 | κύρ 1141 | κ 1142 | κ.ά 1143 | κ.ά.π 1144 | κ.α 1145 | κ.εξ 1146 | κ.επ 1147 | κ.ε 1148 | κ.λπ 1149 | κ.λ.π 1150 | κ.ού.κ 1151 | κ.ο.κ 1152 | κ.τ.λ 1153 | κ.τ.τ 1154 | κ.τ.ό 1155 | λέξ 1156 | λαογρ 1157 | λαπ 1158 | λατιν 1159 | λατ 1160 | λαϊκότρ 1161 | λαϊκ 1162 | λετ 1163 | λιθ 1164 | λογιστ 1165 | λογοτ 1166 | λογ 1167 | λουβ 1168 | λυδ 1169 | λόγ 1170 | λ 1171 | λ.χ 1172 | μέλλ 1173 | μέσ 1174 | μαθημ 1175 | μαθ 1176 | μαιευτ 1177 | μαλαισ 1178 | μαλτ 1179 | μαμμων 1180 | μεγεθ 1181 | μεε 1182 | μειωτ 1183 | μελ 1184 | μεξ 1185 | μεσν 1186 | μεσογ 1187 | μεσοπαθ 1188 | μεσοφ 1189 | μετάθ 1190 | μεταβτ 1191 | μεταβ 1192 | μετακ 1193 | μεταπλ 1194 | μεταπτωτ 1195 | μεταρ 1196 | μεταφορ 1197 | μετβ 1198 | μετεπιθ 1199 | μετεπιρρ 1200 | μετεωρολ 1201 | μετεωρ 1202 | μετον 1203 | μετουσ 1204 | μετοχ 1205 | μετρ 1206 | μετ 1207 | μητρων 1208 | μηχανολ 1209 | μηχ 1210 | μικροβιολ 1211 | μογγολ 1212 | μορφολ 1213 | μουσ 1214 | μπενελούξ 1215 | μσνλατ 
1216 | μσν 1217 | μτβ 1218 | μτγν 1219 | μτγ 1220 | μτφρδ 1221 | μτφρ 1222 | μτφ 1223 | μτχ 1224 | μυθ 1225 | μυκην 1226 | μυκ 1227 | μφ 1228 | μ 1229 | μ.ε 1230 | μ.μ 1231 | μ.π.ε 1232 | μ.π.π 1233 | μ0 1234 | ναυτ 1235 | νεοελλ 1236 | νεολατιν 1237 | νεολατ 1238 | νεολ 1239 | νεότ 1240 | νλατ 1241 | νομ 1242 | νορβ 1243 | νοσ 1244 | νότ 1245 | ν 1246 | ξ.λ 1247 | οικοδ 1248 | οικολ 1249 | οικον 1250 | οικ 1251 | ολλανδ 1252 | ολλ 1253 | ομηρ 1254 | ομόρρ 1255 | ονομ 1256 | ον 1257 | οπτ 1258 | ορθογρ 1259 | ορθ 1260 | οριστ 1261 | ορυκτολ 1262 | ορυκτ 1263 | ορ 1264 | οσετ 1265 | οσκ 1266 | ουαλ 1267 | ουγγρ 1268 | ουδ 1269 | ουσιαστικοπ 1270 | ουσιαστ 1271 | ουσ 1272 | πίν 1273 | παθητ 1274 | παθολ 1275 | παθ 1276 | παιδ 1277 | παλαιοντ 1278 | παλαιότ 1279 | παλ 1280 | παππων 1281 | παράγρ 1282 | παράγ 1283 | παράλλ 1284 | παράλ 1285 | παραγ 1286 | παρακ 1287 | παραλ 1288 | παραπ 1289 | παρατ 1290 | παρβ 1291 | παρετυμ 1292 | παροξ 1293 | παρων 1294 | παρωχ 1295 | παρ 1296 | παρ.φρ 1297 | πατριδων 1298 | πατρων 1299 | πβ 1300 | περιθ 1301 | περιλ 1302 | περιφρ 1303 | περσ 1304 | περ 1305 | πιθ 1306 | πληθ 1307 | πληροφ 1308 | ποδ 1309 | ποιητ 1310 | πολιτ 1311 | πολλαπλ 1312 | πολ 1313 | πορτογαλ 1314 | πορτ 1315 | ποσ 1316 | πρακριτ 1317 | πρβλ 1318 | πρβ 1319 | πργ 1320 | πρκμ 1321 | πρκ 1322 | πρλ 1323 | προέλ 1324 | προβηγκ 1325 | προελλ 1326 | προηγ 1327 | προθεμ 1328 | προπαραλ 1329 | προπαροξ 1330 | προπερισπ 1331 | προσαρμ 1332 | προσηγορ 1333 | προσταχτ 1334 | προστ 1335 | προσφών 1336 | προσ 1337 | προτακτ 1338 | προτ.Εισ 1339 | προφ 1340 | προχωρ 1341 | πρτ 1342 | πρόθ 1343 | πρόσθ 1344 | πρόσ 1345 | πρότ 1346 | πρ 1347 | πρ.Εφ 1348 | πτ 1349 | πυ 1350 | π 1351 | π.Χ 1352 | π.μ 1353 | π.χ 1354 | ρήμ 1355 | ρίζ 1356 | ρηματ 1357 | ρητορ 1358 | ριν 1359 | ρουμ 1360 | ρωμ 1361 | ρωσ 1362 | ρ 1363 | σανσκρ 1364 | σαξ 1365 | σελ 1366 | σερβοκρ 1367 | σερβ 1368 | σημασιολ 1369 | σημδ 1370 | σημειολ 1371 | σημερ 1372 | σημιτ 1373 | σημ 1374 | σκανδ 1375 | σκυθ 1376 | σκωπτ 1377 | σλαβ 1378 | σλοβ 1379 | σουηδ 1380 | σουμερ 1381 | σουπ 1382 | σπάν 1383 | σπανιότ 1384 | σπ 1385 | σσ 1386 | στατ 1387 | στερ 1388 | στιγμ 1389 | στιχ 1390 | στρέμ 1391 | στρατιωτ 1392 | στρατ 1393 | στ 1394 | συγγ 1395 | συγκρ 1396 | συγκ 1397 | συμπερ 1398 | συμπλεκτ 1399 | συμπλ 1400 | συμπροφ 1401 | συμφυρ 1402 | συμφ 1403 | συνήθ 1404 | συνίζ 1405 | συναίρ 1406 | συναισθ 1407 | συνδετ 1408 | συνδ 1409 | συνεκδ 1410 | συνηρ 1411 | συνθετ 1412 | συνθ 1413 | συνοπτ 1414 | συντελ 1415 | συντομογρ 1416 | συντ 1417 | συν 1418 | συρ 1419 | σχημ 1420 | σχ 1421 | σύγκρ 1422 | σύμπλ 1423 | σύμφ 1424 | σύνδ 1425 | σύνθ 1426 | σύντμ 1427 | σύντ 1428 | σ 1429 | σ.π 1430 | σ/β 1431 | τακτ 1432 | τελ 1433 | τετρ 1434 | τετρ.μ 1435 | τεχνλ 1436 | τεχνολ 1437 | τεχν 1438 | τεύχ 1439 | τηλεπικ 1440 | τηλεόρ 1441 | τιμ 1442 | τιμ.τομ 1443 | τοΣ 1444 | τον 1445 | τοπογρ 1446 | τοπων 1447 | τοπ 1448 | τοσκ 1449 | τουρκ 1450 | τοχ 1451 | τριτοπρόσ 1452 | τροποπ 1453 | τροπ 1454 | τσεχ 1455 | τσιγγ 1456 | ττ 1457 | τυπ 1458 | τόμ 1459 | τόνν 1460 | τ 1461 | τ.μ 1462 | τ.χλμ 1463 | υβρ 1464 | υπερθ 1465 | υπερσ 1466 | υπερ 1467 | υπεύθ 1468 | υποθ 1469 | υποκορ 1470 | υποκ 1471 | υποσημ 1472 | υποτ 1473 | υποφ 1474 | υποχωρ 1475 | υπόλ 1476 | υπόχρ 1477 | υπ 1478 | υστλατ 1479 | υψόμ 1480 | υψ 1481 | φάκ 1482 | φαρμακολ 1483 | φαρμ 1484 | φιλολ 1485 | φιλοσ 1486 | φιλοτ 1487 | φινλ 1488 | φοινικ 1489 | φράγκ 1490 | φρανκον 1491 | φριζ 1492 | φρ 1493 | φυλλ 1494 | φυσιολ 1495 | φυσ 1496 | φωνηεντ 1497 | φωνητ 1498 | φωνολ 
1499 | φων 1500 | φωτογρ 1501 | φ 1502 | φ.τ.μ 1503 | χαμιτ 1504 | χαρτόσ 1505 | χαρτ 1506 | χασμ 1507 | χαϊδ 1508 | χγφ 1509 | χειλ 1510 | χεττ 1511 | χημ 1512 | χιλ 1513 | χλγρ 1514 | χλγ 1515 | χλμ 1516 | χλμ.2 1517 | χλμ.3 1518 | χλσγρ 1519 | χλστγρ 1520 | χλστμ 1521 | χλστμ.2 1522 | χλστμ.3 1523 | χλ 1524 | χργρ 1525 | χρημ 1526 | χρον 1527 | χρ 1528 | χφ 1529 | χ.ε 1530 | χ.κ 1531 | χ.ο 1532 | χ.σ 1533 | χ.τ 1534 | χ.χ 1535 | ψευδ 1536 | ψυχαν 1537 | ψυχιατρ 1538 | ψυχολ 1539 | ψυχ 1540 | ωκεαν 1541 | όμ 1542 | όν 1543 | όπ.παρ 1544 | όπ.π 1545 | ό.π 1546 | ύψ 1547 | 1Βσ 1548 | 1Εσ 1549 | 1Θσ 1550 | 1Ιν 1551 | 1Κρ 1552 | 1Μκ 1553 | 1Πρ 1554 | 1Πτ 1555 | 1Τμ 1556 | 2Βσ 1557 | 2Εσ 1558 | 2Θσ 1559 | 2Ιν 1560 | 2Κρ 1561 | 2Μκ 1562 | 2Πρ 1563 | 2Πτ 1564 | 2Τμ 1565 | 3Βσ 1566 | 3Ιν 1567 | 3Μκ 1568 | 4Βσ 1569 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.en: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 34 | Adj 35 | Adm 36 | Adv 37 | Asst 38 | Bart 39 | Bldg 40 | Brig 41 | Bros 42 | Capt 43 | Cmdr 44 | Col 45 | Comdr 46 | Con 47 | Corp 48 | Cpl 49 | DR 50 | Dr 51 | Drs 52 | Ens 53 | Gen 54 | Gov 55 | Hon 56 | Hr 57 | Hosp 58 | Insp 59 | Lt 60 | MM 61 | MR 62 | MRS 63 | MS 64 | Maj 65 | Messrs 66 | Mlle 67 | Mme 68 | Mr 69 | Mrs 70 | Ms 71 | Msgr 72 | Op 73 | Ord 74 | Pfc 75 | Ph 76 | Prof 77 | Pvt 78 | Rep 79 | Reps 80 | Res 81 | Rev 82 | Rt 83 | Sen 84 | Sens 85 | Sfc 86 | Sgt 87 | Sr 88 | St 89 | Supt 90 | Surg 91 | 92 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 93 | v 94 | vs 95 | i.e 96 | rev 97 | e.g 98 | 99 | #Numbers only. These should only induce breaks when followed by a numeric sequence 100 | # add NUMERIC_ONLY after the word for this function 101 | #This case is mostly for the english "No." which can either be a sentence of its own, or 102 | #if followed by a number, a non-breaking prefix 103 | No #NUMERIC_ONLY# 104 | Nos 105 | Art #NUMERIC_ONLY# 106 | Nr 107 | pp #NUMERIC_ONLY# 108 | 109 | #month abbreviations 110 | Jan 111 | Feb 112 | Mar 113 | Apr 114 | #May is a full word 115 | Jun 116 | Jul 117 | Aug 118 | Sep 119 | Oct 120 | Nov 121 | Dec 122 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.es: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 
3 | 4 | #any single upper case letter followed by a period is not a sentence ender 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | # Period-final abbreviation list from http://www.ctspanish.com/words/abbreviations.htm 34 | 35 | A.C 36 | Apdo 37 | Av 38 | Bco 39 | CC.AA 40 | Da 41 | Dep 42 | Dn 43 | Dr 44 | Dra 45 | EE.UU 46 | Excmo 47 | FF.CC 48 | Fil 49 | Gral 50 | J.C 51 | Let 52 | Lic 53 | N.B 54 | P.D 55 | P.V.P 56 | Prof 57 | Pts 58 | Rte 59 | S.A 60 | S.A.R 61 | S.E 62 | S.L 63 | S.R.C 64 | Sr 65 | Sra 66 | Srta 67 | Sta 68 | Sto 69 | T.V.E 70 | Tel 71 | Ud 72 | Uds 73 | V.B 74 | V.E 75 | Vd 76 | Vds 77 | a/c 78 | adj 79 | admón 80 | afmo 81 | apdo 82 | av 83 | c 84 | c.f 85 | c.g 86 | cap 87 | cm 88 | cta 89 | dcha 90 | doc 91 | ej 92 | entlo 93 | esq 94 | etc 95 | f.c 96 | gr 97 | grs 98 | izq 99 | kg 100 | km 101 | mg 102 | mm 103 | núm 104 | núm 105 | p 106 | p.a 107 | p.ej 108 | ptas 109 | pág 110 | págs 111 | pág 112 | págs 113 | q.e.g.e 114 | q.e.s.m 115 | s 116 | s.s.s 117 | vid 118 | vol 119 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.fi: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT 2 | #indicate an end-of-sentence marker. Special cases are included for prefixes 3 | #that ONLY appear before 0-9 numbers. 4 | 5 | #This list is compiled from omorfi database 6 | #by Tommi A Pirinen. 7 | 8 | 9 | #any single upper case letter followed by a period is not a sentence ender 10 | A 11 | B 12 | C 13 | D 14 | E 15 | F 16 | G 17 | H 18 | I 19 | J 20 | K 21 | L 22 | M 23 | N 24 | O 25 | P 26 | Q 27 | R 28 | S 29 | T 30 | U 31 | V 32 | W 33 | X 34 | Y 35 | Z 36 | Å 37 | Ä 38 | Ö 39 | 40 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 41 | alik 42 | alil 43 | amir 44 | apul 45 | apul.prof 46 | arkkit 47 | ass 48 | assist 49 | dipl 50 | dipl.arkkit 51 | dipl.ekon 52 | dipl.ins 53 | dipl.kielenk 54 | dipl.kirjeenv 55 | dipl.kosm 56 | dipl.urk 57 | dos 58 | erikoiseläinl 59 | erikoishammasl 60 | erikoisl 61 | erikoist 62 | ev.luutn 63 | evp 64 | fil 65 | ft 66 | hallinton 67 | hallintot 68 | hammaslääket 69 | jatk 70 | jääk 71 | kansaned 72 | kapt 73 | kapt.luutn 74 | kenr 75 | kenr.luutn 76 | kenr.maj 77 | kers 78 | kirjeenv 79 | kom 80 | kom.kapt 81 | komm 82 | konst 83 | korpr 84 | luutn 85 | maist 86 | maj 87 | Mr 88 | Mrs 89 | Ms 90 | M.Sc 91 | neuv 92 | nimim 93 | Ph.D 94 | prof 95 | puh.joht 96 | pääll 97 | res 98 | san 99 | siht 100 | suom 101 | sähköp 102 | säv 103 | toht 104 | toim 105 | toim.apul 106 | toim.joht 107 | toim.siht 108 | tuom 109 | ups 110 | vänr 111 | vääp 112 | ye.ups 113 | ylik 114 | ylil 115 | ylim 116 | ylimatr 117 | yliop 118 | yliopp 119 | ylip 120 | yliv 121 | 122 | #misc - odd period-ending items that NEVER indicate breaks (p.m. 
does NOT fall 123 | #into this category - it sometimes ends a sentence) 124 | e.g 125 | ent 126 | esim 127 | huom 128 | i.e 129 | ilm 130 | l 131 | mm 132 | myöh 133 | nk 134 | nyk 135 | par 136 | po 137 | t 138 | v 139 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.fr: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | # 4 | #any single upper case letter followed by a period is not a sentence ender 5 | #usually upper case letters are initials in a name 6 | #no French words end in single lower-case letters, so we throw those in too? 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | # Period-final abbreviation list for French 61 | A.C.N 62 | A.M 63 | art 64 | ann 65 | apr 66 | av 67 | auj 68 | lib 69 | B.P 70 | boul 71 | ca 72 | c.-à-d 73 | cf 74 | ch.-l 75 | chap 76 | contr 77 | C.P.I 78 | C.Q.F.D 79 | C.N 80 | C.N.S 81 | C.S 82 | dir 83 | éd 84 | e.g 85 | env 86 | al 87 | etc 88 | E.V 89 | ex 90 | fasc 91 | fém 92 | fig 93 | fr 94 | hab 95 | ibid 96 | id 97 | i.e 98 | inf 99 | LL.AA 100 | LL.AA.II 101 | LL.AA.RR 102 | LL.AA.SS 103 | L.D 104 | LL.EE 105 | LL.MM 106 | LL.MM.II.RR 107 | loc.cit 108 | masc 109 | MM 110 | ms 111 | N.B 112 | N.D.A 113 | N.D.L.R 114 | N.D.T 115 | n/réf 116 | NN.SS 117 | N.S 118 | N.D 119 | N.P.A.I 120 | p.c.c 121 | pl 122 | pp 123 | p.ex 124 | p.j 125 | P.S 126 | R.A.S 127 | R.-V 128 | R.P 129 | R.I.P 130 | SS 131 | S.S 132 | S.A 133 | S.A.I 134 | S.A.R 135 | S.A.S 136 | S.E 137 | sec 138 | sect 139 | sing 140 | S.M 141 | S.M.I.R 142 | sq 143 | sqq 144 | suiv 145 | sup 146 | suppl 147 | tél 148 | T.S.V.P 149 | vb 150 | vol 151 | vs 152 | X.O 153 | Z.I 154 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.hu: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | Á 33 | É 34 | Í 35 | Ó 36 | Ö 37 | Ő 38 | Ú 39 | Ü 40 | Ű 41 | 42 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 43 | Dr 44 | dr 45 | kb 46 | Kb 47 | vö 48 | Vö 49 | pl 50 | Pl 51 | ca 52 | Ca 53 | min 54 | Min 55 | max 56 | Max 57 | ún 58 | Ún 59 | prof 60 | Prof 61 | de 62 | De 63 | du 64 | Du 65 | Szt 66 | St 67 | 68 | #Numbers only. 
These should only induce breaks when followed by a numeric sequence 69 | # add NUMERIC_ONLY after the word for this function 70 | #This case is mostly for the english "No." which can either be a sentence of its own, or 71 | #if followed by a number, a non-breaking prefix 72 | 73 | # Month name abbreviations 74 | jan #NUMERIC_ONLY# 75 | Jan #NUMERIC_ONLY# 76 | Feb #NUMERIC_ONLY# 77 | feb #NUMERIC_ONLY# 78 | márc #NUMERIC_ONLY# 79 | Márc #NUMERIC_ONLY# 80 | ápr #NUMERIC_ONLY# 81 | Ápr #NUMERIC_ONLY# 82 | máj #NUMERIC_ONLY# 83 | Máj #NUMERIC_ONLY# 84 | jún #NUMERIC_ONLY# 85 | Jún #NUMERIC_ONLY# 86 | Júl #NUMERIC_ONLY# 87 | júl #NUMERIC_ONLY# 88 | aug #NUMERIC_ONLY# 89 | Aug #NUMERIC_ONLY# 90 | Szept #NUMERIC_ONLY# 91 | szept #NUMERIC_ONLY# 92 | okt #NUMERIC_ONLY# 93 | Okt #NUMERIC_ONLY# 94 | nov #NUMERIC_ONLY# 95 | Nov #NUMERIC_ONLY# 96 | dec #NUMERIC_ONLY# 97 | Dec #NUMERIC_ONLY# 98 | 99 | # Other abbreviations 100 | tel #NUMERIC_ONLY# 101 | Tel #NUMERIC_ONLY# 102 | Fax #NUMERIC_ONLY# 103 | fax #NUMERIC_ONLY# 104 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.is: -------------------------------------------------------------------------------- 1 | no #NUMERIC_ONLY# 2 | No #NUMERIC_ONLY# 3 | nr #NUMERIC_ONLY# 4 | Nr #NUMERIC_ONLY# 5 | nR #NUMERIC_ONLY# 6 | NR #NUMERIC_ONLY# 7 | a 8 | b 9 | c 10 | d 11 | e 12 | f 13 | g 14 | h 15 | i 16 | j 17 | k 18 | l 19 | m 20 | n 21 | o 22 | p 23 | q 24 | r 25 | s 26 | t 27 | u 28 | v 29 | w 30 | x 31 | y 32 | z 33 | ^ 34 | í 35 | á 36 | ó 37 | æ 38 | A 39 | B 40 | C 41 | D 42 | E 43 | F 44 | G 45 | H 46 | I 47 | J 48 | K 49 | L 50 | M 51 | N 52 | O 53 | P 54 | Q 55 | R 56 | S 57 | T 58 | U 59 | V 60 | W 61 | X 62 | Y 63 | Z 64 | ab.fn 65 | a.fn 66 | afs 67 | al 68 | alm 69 | alg 70 | andh 71 | ath 72 | aths 73 | atr 74 | ao 75 | au 76 | aukaf 77 | áfn 78 | áhrl.s 79 | áhrs 80 | ákv.gr 81 | ákv 82 | bh 83 | bls 84 | dr 85 | e.Kr 86 | et 87 | ef 88 | efn 89 | ennfr 90 | eink 91 | end 92 | e.st 93 | erl 94 | fél 95 | fskj 96 | fh 97 | f.hl 98 | físl 99 | fl 100 | fn 101 | fo 102 | forl 103 | frb 104 | frl 105 | frh 106 | frt 107 | fsl 108 | fsh 109 | fs 110 | fsk 111 | fst 112 | f.Kr 113 | ft 114 | fv 115 | fyrrn 116 | fyrrv 117 | germ 118 | gm 119 | gr 120 | hdl 121 | hdr 122 | hf 123 | hl 124 | hlsk 125 | hljsk 126 | hljv 127 | hljóðv 128 | hr 129 | hv 130 | hvk 131 | holl 132 | Hos 133 | höf 134 | hk 135 | hrl 136 | ísl 137 | kaf 138 | kap 139 | Khöfn 140 | kk 141 | kg 142 | kk 143 | km 144 | kl 145 | klst 146 | kr 147 | kt 148 | kgúrsk 149 | kvk 150 | leturbr 151 | lh 152 | lh.nt 153 | lh.þt 154 | lo 155 | ltr 156 | mlja 157 | mljó 158 | millj 159 | mm 160 | mms 161 | m.fl 162 | miðm 163 | mgr 164 | mst 165 | mín 166 | nf 167 | nh 168 | nhm 169 | nl 170 | nk 171 | nmgr 172 | no 173 | núv 174 | nt 175 | o.áfr 176 | o.m.fl 177 | ohf 178 | o.fl 179 | o.s.frv 180 | ófn 181 | ób 182 | óákv.gr 183 | óákv 184 | pfn 185 | PR 186 | pr 187 | Ritstj 188 | Rvík 189 | Rvk 190 | samb 191 | samhlj 192 | samn 193 | samn 194 | sbr 195 | sek 196 | sérn 197 | sf 198 | sfn 199 | sh 200 | sfn 201 | sh 202 | s.hl 203 | sk 204 | skv 205 | sl 206 | sn 207 | so 208 | ss.us 209 | s.st 210 | samþ 211 | sbr 212 | shlj 213 | sign 214 | skál 215 | st 216 | st.s 217 | stk 218 | sþ 219 | teg 220 | tbl 221 | tfn 222 | tl 223 | tvíhlj 224 | tvt 225 | till 226 | to 227 | umr 228 | uh 229 | us 230 | uppl 231 | útg 232 | vb 233 | Vf 234 | vh 235 | vkf 236 | Vl 237 | vl 238 | vlf 239 | vmf 240 | 8vo 241 | vsk 
242 | vth 243 | þt 244 | þf 245 | þjs 246 | þgf 247 | þlt 248 | þolm 249 | þm 250 | þml 251 | þýð 252 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.it: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 34 | Adj 35 | Adm 36 | Adv 37 | Amn 38 | Arch 39 | Asst 40 | Avv 41 | Bart 42 | Bcc 43 | Bldg 44 | Brig 45 | Bros 46 | C.A.P 47 | C.P 48 | Capt 49 | Cc 50 | Cmdr 51 | Co 52 | Col 53 | Comdr 54 | Con 55 | Corp 56 | Cpl 57 | DR 58 | Dott 59 | Dr 60 | Drs 61 | Egr 62 | Ens 63 | Gen 64 | Geom 65 | Gov 66 | Hon 67 | Hosp 68 | Hr 69 | Id 70 | Ing 71 | Insp 72 | Lt 73 | MM 74 | MR 75 | MRS 76 | MS 77 | Maj 78 | Messrs 79 | Mlle 80 | Mme 81 | Mo 82 | Mons 83 | Mr 84 | Mrs 85 | Ms 86 | Msgr 87 | N.B 88 | Op 89 | Ord 90 | P.S 91 | P.T 92 | Pfc 93 | Ph 94 | Prof 95 | Pvt 96 | RP 97 | RSVP 98 | Rag 99 | Rep 100 | Reps 101 | Res 102 | Rev 103 | Rif 104 | Rt 105 | S.A 106 | S.B.F 107 | S.P.M 108 | S.p.A 109 | S.r.l 110 | Sen 111 | Sens 112 | Sfc 113 | Sgt 114 | Sig 115 | Sigg 116 | Soc 117 | Spett 118 | Sr 119 | St 120 | Supt 121 | Surg 122 | V.P 123 | 124 | # other 125 | a.c 126 | acc 127 | all 128 | banc 129 | c.a 130 | c.c.p 131 | c.m 132 | c.p 133 | c.s 134 | c.v 135 | corr 136 | dott 137 | e.p.c 138 | ecc 139 | es 140 | fatt 141 | gg 142 | int 143 | lett 144 | ogg 145 | on 146 | p.c 147 | p.c.c 148 | p.es 149 | p.f 150 | p.r 151 | p.v 152 | post 153 | pp 154 | racc 155 | ric 156 | s.n.c 157 | seg 158 | sgg 159 | ss 160 | tel 161 | u.s 162 | v.r 163 | v.s 164 | 165 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 166 | v 167 | vs 168 | i.e 169 | rev 170 | e.g 171 | 172 | #Numbers only. These should only induce breaks when followed by a numeric sequence 173 | # add NUMERIC_ONLY after the word for this function 174 | #This case is mostly for the english "No." which can either be a sentence of its own, or 175 | #if followed by a number, a non-breaking prefix 176 | No #NUMERIC_ONLY# 177 | Nos 178 | Art #NUMERIC_ONLY# 179 | Nr 180 | pp #NUMERIC_ONLY# 181 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.lv: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 
3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | Ā 8 | B 9 | C 10 | Č 11 | D 12 | E 13 | Ē 14 | F 15 | G 16 | Ģ 17 | H 18 | I 19 | Ī 20 | J 21 | K 22 | Ķ 23 | L 24 | Ļ 25 | M 26 | N 27 | Ņ 28 | O 29 | P 30 | Q 31 | R 32 | S 33 | Š 34 | T 35 | U 36 | Ū 37 | V 38 | W 39 | X 40 | Y 41 | Z 42 | Ž 43 | 44 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 45 | dr 46 | Dr 47 | med 48 | prof 49 | Prof 50 | inž 51 | Inž 52 | ist.loc 53 | Ist.loc 54 | kor.loc 55 | Kor.loc 56 | v.i 57 | vietn 58 | Vietn 59 | 60 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 61 | a.l 62 | t.p 63 | pārb 64 | Pārb 65 | vec 66 | Vec 67 | inv 68 | Inv 69 | sk 70 | Sk 71 | spec 72 | Spec 73 | vienk 74 | Vienk 75 | virz 76 | Virz 77 | māksl 78 | Māksl 79 | mūz 80 | Mūz 81 | akad 82 | Akad 83 | soc 84 | Soc 85 | galv 86 | Galv 87 | vad 88 | Vad 89 | sertif 90 | Sertif 91 | folkl 92 | Folkl 93 | hum 94 | Hum 95 | 96 | #Numbers only. These should only induce breaks when followed by a numeric sequence 97 | # add NUMERIC_ONLY after the word for this function 98 | #This case is mostly for the english "No." which can either be a sentence of its own, or 99 | #if followed by a number, a non-breaking prefix 100 | Nr #NUMERIC_ONLY# 101 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.nl: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | #Sources: http://nl.wikipedia.org/wiki/Lijst_van_afkortingen 4 | # http://nl.wikipedia.org/wiki/Aanspreekvorm 5 | # http://nl.wikipedia.org/wiki/Titulatuur_in_het_Nederlands_hoger_onderwijs 6 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 7 | #usually upper case letters are initials in a name 8 | A 9 | B 10 | C 11 | D 12 | E 13 | F 14 | G 15 | H 16 | I 17 | J 18 | K 19 | L 20 | M 21 | N 22 | O 23 | P 24 | Q 25 | R 26 | S 27 | T 28 | U 29 | V 30 | W 31 | X 32 | Y 33 | Z 34 | 35 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 36 | bacc 37 | bc 38 | bgen 39 | c.i 40 | dhr 41 | dr 42 | dr.h.c 43 | drs 44 | drs 45 | ds 46 | eint 47 | fa 48 | Fa 49 | fam 50 | gen 51 | genm 52 | ing 53 | ir 54 | jhr 55 | jkvr 56 | jr 57 | kand 58 | kol 59 | lgen 60 | lkol 61 | Lt 62 | maj 63 | Mej 64 | mevr 65 | Mme 66 | mr 67 | mr 68 | Mw 69 | o.b.s 70 | plv 71 | prof 72 | ritm 73 | tint 74 | Vz 75 | Z.D 76 | Z.D.H 77 | Z.E 78 | Z.Em 79 | Z.H 80 | Z.K.H 81 | Z.K.M 82 | Z.M 83 | z.v 84 | 85 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 86 | #we seem to have a lot of these in dutch i.e.: i.p.v - in plaats van (in stead of) never ends a sentence 87 | a.g.v 88 | bijv 89 | bijz 90 | bv 91 | d.w.z 92 | e.c 93 | e.g 94 | e.k 95 | ev 96 | i.p.v 97 | i.s.m 98 | i.t.t 99 | i.v.m 100 | m.a.w 101 | m.b.t 102 | m.b.v 103 | m.h.o 104 | m.i 105 | m.i.v 106 | v.w.t 107 | 108 | #Numbers only. 
These should only induce breaks when followed by a numeric sequence 109 | # add NUMERIC_ONLY after the word for this function 110 | #This case is mostly for the english "No." which can either be a sentence of its own, or 111 | #if followed by a number, a non-breaking prefix 112 | Nr #NUMERIC_ONLY# 113 | Nrs 114 | nrs 115 | nr #NUMERIC_ONLY# 116 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.pl: -------------------------------------------------------------------------------- 1 | adw 2 | afr 3 | akad 4 | al 5 | Al 6 | am 7 | amer 8 | arch 9 | art 10 | Art 11 | artyst 12 | astr 13 | austr 14 | bałt 15 | bdb 16 | bł 17 | bm 18 | br 19 | bryg 20 | bryt 21 | centr 22 | ces 23 | chem 24 | chiń 25 | chir 26 | c.k 27 | c.o 28 | cyg 29 | cyw 30 | cyt 31 | czes 32 | czw 33 | cd 34 | Cd 35 | czyt 36 | ćw 37 | ćwicz 38 | daw 39 | dcn 40 | dekl 41 | demokr 42 | det 43 | diec 44 | dł 45 | dn 46 | dot 47 | dol 48 | dop 49 | dost 50 | dosł 51 | h.c 52 | ds 53 | dst 54 | duszp 55 | dypl 56 | egz 57 | ekol 58 | ekon 59 | elektr 60 | em 61 | ew 62 | fab 63 | farm 64 | fot 65 | fr 66 | gat 67 | gastr 68 | geogr 69 | geol 70 | gimn 71 | głęb 72 | gm 73 | godz 74 | górn 75 | gosp 76 | gr 77 | gram 78 | hist 79 | hiszp 80 | hr 81 | Hr 82 | hot 83 | id 84 | in 85 | im 86 | iron 87 | jn 88 | kard 89 | kat 90 | katol 91 | k.k 92 | kk 93 | kol 94 | kl 95 | k.p.a 96 | kpc 97 | k.p.c 98 | kpt 99 | kr 100 | k.r 101 | krak 102 | k.r.o 103 | kryt 104 | kult 105 | laic 106 | łac 107 | niem 108 | woj 109 | nb 110 | np 111 | Nb 112 | Np 113 | pol 114 | pow 115 | m.in 116 | pt 117 | ps 118 | Pt 119 | Ps 120 | cdn 121 | jw 122 | ryc 123 | rys 124 | Ryc 125 | Rys 126 | tj 127 | tzw 128 | Tzw 129 | tzn 130 | zob 131 | ang 132 | ub 133 | ul 134 | pw 135 | pn 136 | pl 137 | al 138 | k 139 | n 140 | nr #NUMERIC_ONLY# 141 | Nr #NUMERIC_ONLY# 142 | ww 143 | wł 144 | ur 145 | zm 146 | żyd 147 | żarg 148 | żyw 149 | wył 150 | bp 151 | bp 152 | wyst 153 | tow 154 | Tow 155 | o 156 | sp 157 | Sp 158 | st 159 | spółdz 160 | Spółdz 161 | społ 162 | spółgł 163 | stoł 164 | stow 165 | Stoł 166 | Stow 167 | zn 168 | zew 169 | zewn 170 | zdr 171 | zazw 172 | zast 173 | zaw 174 | zał 175 | zal 176 | zam 177 | zak 178 | zakł 179 | zagr 180 | zach 181 | adw 182 | Adw 183 | lek 184 | Lek 185 | med 186 | mec 187 | Mec 188 | doc 189 | Doc 190 | dyw 191 | dyr 192 | Dyw 193 | Dyr 194 | inż 195 | Inż 196 | mgr 197 | Mgr 198 | dh 199 | dr 200 | Dh 201 | Dr 202 | p 203 | P 204 | red 205 | Red 206 | prof 207 | prok 208 | Prof 209 | Prok 210 | hab 211 | płk 212 | Płk 213 | nadkom 214 | Nadkom 215 | podkom 216 | Podkom 217 | ks 218 | Ks 219 | gen 220 | Gen 221 | por 222 | Por 223 | reż 224 | Reż 225 | przyp 226 | Przyp 227 | śp 228 | św 229 | śW 230 | Śp 231 | Św 232 | ŚW 233 | szer 234 | Szer 235 | pkt #NUMERIC_ONLY# 236 | str #NUMERIC_ONLY# 237 | tab #NUMERIC_ONLY# 238 | Tab #NUMERIC_ONLY# 239 | tel 240 | ust #NUMERIC_ONLY# 241 | par #NUMERIC_ONLY# 242 | poz 243 | pok 244 | oo 245 | oO 246 | Oo 247 | OO 248 | r #NUMERIC_ONLY# 249 | l #NUMERIC_ONLY# 250 | s #NUMERIC_ONLY# 251 | najśw 252 | Najśw 253 | A 254 | B 255 | C 256 | D 257 | E 258 | F 259 | G 260 | H 261 | I 262 | J 263 | K 264 | L 265 | M 266 | N 267 | O 268 | P 269 | Q 270 | R 271 | S 272 | T 273 | U 274 | V 275 | W 276 | X 277 | Y 278 | Z 279 | Ś 280 | Ć 281 | Ż 282 | Ź 283 | Dz 284 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.pt: 
-------------------------------------------------------------------------------- 1 | #File adapted for PT by H. Leal Fontes from the EN & DE versions published with moses-2009-04-13. Last update: 10.11.2009. 2 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 3 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 4 | 5 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 6 | #usually upper case letters are initials in a name 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | 61 | #Roman Numerals. A dot after one of these is not a sentence break in Portuguese. 62 | I 63 | II 64 | III 65 | IV 66 | V 67 | VI 68 | VII 69 | VIII 70 | IX 71 | X 72 | XI 73 | XII 74 | XIII 75 | XIV 76 | XV 77 | XVI 78 | XVII 79 | XVIII 80 | XIX 81 | XX 82 | i 83 | ii 84 | iii 85 | iv 86 | v 87 | vi 88 | vii 89 | viii 90 | ix 91 | x 92 | xi 93 | xii 94 | xiii 95 | xiv 96 | xv 97 | xvi 98 | xvii 99 | xviii 100 | xix 101 | xx 102 | 103 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 104 | Adj 105 | Adm 106 | Adv 107 | Art 108 | Ca 109 | Capt 110 | Cmdr 111 | Col 112 | Comdr 113 | Con 114 | Corp 115 | Cpl 116 | DR 117 | DRA 118 | Dr 119 | Dra 120 | Dras 121 | Drs 122 | Eng 123 | Enga 124 | Engas 125 | Engos 126 | Ex 127 | Exo 128 | Exmo 129 | Fig 130 | Gen 131 | Hosp 132 | Insp 133 | Lda 134 | MM 135 | MR 136 | MRS 137 | MS 138 | Maj 139 | Mrs 140 | Ms 141 | Msgr 142 | Op 143 | Ord 144 | Pfc 145 | Ph 146 | Prof 147 | Pvt 148 | Rep 149 | Reps 150 | Res 151 | Rev 152 | Rt 153 | Sen 154 | Sens 155 | Sfc 156 | Sgt 157 | Sr 158 | Sra 159 | Sras 160 | Srs 161 | Sto 162 | Supt 163 | Surg 164 | adj 165 | adm 166 | adv 167 | art 168 | cit 169 | col 170 | con 171 | corp 172 | cpl 173 | dr 174 | dra 175 | dras 176 | drs 177 | eng 178 | enga 179 | engas 180 | engos 181 | ex 182 | exo 183 | exmo 184 | fig 185 | op 186 | prof 187 | sr 188 | sra 189 | sras 190 | srs 191 | sto 192 | 193 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 194 | v 195 | vs 196 | i.e 197 | rev 198 | e.g 199 | 200 | #Numbers only. These should only induce breaks when followed by a numeric sequence 201 | # add NUMERIC_ONLY after the word for this function 202 | #This case is mostly for the english "No." 
which can either be a sentence of its own, or 203 | #if followed by a number, a non-breaking prefix 204 | No #NUMERIC_ONLY# 205 | Nos 206 | Art #NUMERIC_ONLY# 207 | Nr 208 | p #NUMERIC_ONLY# 209 | pp #NUMERIC_ONLY# 210 | 211 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.ro: -------------------------------------------------------------------------------- 1 | A 2 | B 3 | C 4 | D 5 | E 6 | F 7 | G 8 | H 9 | I 10 | J 11 | K 12 | L 13 | M 14 | N 15 | O 16 | P 17 | Q 18 | R 19 | S 20 | T 21 | U 22 | V 23 | W 24 | X 25 | Y 26 | Z 27 | dpdv 28 | etc 29 | șamd 30 | M.Ap.N 31 | dl 32 | Dl 33 | d-na 34 | D-na 35 | dvs 36 | Dvs 37 | pt 38 | Pt 39 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.ru: -------------------------------------------------------------------------------- 1 | # added Cyrillic uppercase letters [А-Я] 2 | # removed 000D carriage return (this is not removed by chomp in tokenizer.perl, and prevents recognition of the prefixes) 3 | # edited by Kate Young (nspaceanalysis@earthlink.net) 21 May 2013 4 | А 5 | Б 6 | В 7 | Г 8 | Д 9 | Е 10 | Ж 11 | З 12 | И 13 | Й 14 | К 15 | Л 16 | М 17 | Н 18 | О 19 | П 20 | Р 21 | С 22 | Т 23 | У 24 | Ф 25 | Х 26 | Ц 27 | Ч 28 | Ш 29 | Щ 30 | Ъ 31 | Ы 32 | Ь 33 | Э 34 | Ю 35 | Я 36 | A 37 | B 38 | C 39 | D 40 | E 41 | F 42 | G 43 | H 44 | I 45 | J 46 | K 47 | L 48 | M 49 | N 50 | O 51 | P 52 | Q 53 | R 54 | S 55 | T 56 | U 57 | V 58 | W 59 | X 60 | Y 61 | Z 62 | 0гг 63 | 1гг 64 | 2гг 65 | 3гг 66 | 4гг 67 | 5гг 68 | 6гг 69 | 7гг 70 | 8гг 71 | 9гг 72 | 0г 73 | 1г 74 | 2г 75 | 3г 76 | 4г 77 | 5г 78 | 6г 79 | 7г 80 | 8г 81 | 9г 82 | Xвв 83 | Vвв 84 | Iвв 85 | Lвв 86 | Mвв 87 | Cвв 88 | Xв 89 | Vв 90 | Iв 91 | Lв 92 | Mв 93 | Cв 94 | 0м 95 | 1м 96 | 2м 97 | 3м 98 | 4м 99 | 5м 100 | 6м 101 | 7м 102 | 8м 103 | 9м 104 | 0мм 105 | 1мм 106 | 2мм 107 | 3мм 108 | 4мм 109 | 5мм 110 | 6мм 111 | 7мм 112 | 8мм 113 | 9мм 114 | 0см 115 | 1см 116 | 2см 117 | 3см 118 | 4см 119 | 5см 120 | 6см 121 | 7см 122 | 8см 123 | 9см 124 | 0дм 125 | 1дм 126 | 2дм 127 | 3дм 128 | 4дм 129 | 5дм 130 | 6дм 131 | 7дм 132 | 8дм 133 | 9дм 134 | 0л 135 | 1л 136 | 2л 137 | 3л 138 | 4л 139 | 5л 140 | 6л 141 | 7л 142 | 8л 143 | 9л 144 | 0км 145 | 1км 146 | 2км 147 | 3км 148 | 4км 149 | 5км 150 | 6км 151 | 7км 152 | 8км 153 | 9км 154 | 0га 155 | 1га 156 | 2га 157 | 3га 158 | 4га 159 | 5га 160 | 6га 161 | 7га 162 | 8га 163 | 9га 164 | 0кг 165 | 1кг 166 | 2кг 167 | 3кг 168 | 4кг 169 | 5кг 170 | 6кг 171 | 7кг 172 | 8кг 173 | 9кг 174 | 0т 175 | 1т 176 | 2т 177 | 3т 178 | 4т 179 | 5т 180 | 6т 181 | 7т 182 | 8т 183 | 9т 184 | 0г 185 | 1г 186 | 2г 187 | 3г 188 | 4г 189 | 5г 190 | 6г 191 | 7г 192 | 8г 193 | 9г 194 | 0мг 195 | 1мг 196 | 2мг 197 | 3мг 198 | 4мг 199 | 5мг 200 | 6мг 201 | 7мг 202 | 8мг 203 | 9мг 204 | бульв 205 | в 206 | вв 207 | г 208 | га 209 | гг 210 | гл 211 | гос 212 | д 213 | дм 214 | доп 215 | др 216 | е 217 | ед 218 | ед 219 | зам 220 | и 221 | инд 222 | исп 223 | Исп 224 | к 225 | кап 226 | кг 227 | кв 228 | кл 229 | км 230 | кол 231 | комн 232 | коп 233 | куб 234 | л 235 | лиц 236 | лл 237 | м 238 | макс 239 | мг 240 | мин 241 | мл 242 | млн 243 | млрд 244 | мм 245 | н 246 | наб 247 | нач 248 | неуд 249 | ном 250 | о 251 | обл 252 | обр 253 | общ 254 | ок 255 | ост 256 | отл 257 | п 258 | пер 259 | перераб 260 | пл 261 | пос 262 | пр 263 | просп 264 | проф 265 | р 266 | ред 267 | руб 268 | с 269 | сб 270 | св 271 | см 
272 | соч 273 | ср 274 | ст 275 | стр 276 | т 277 | тел 278 | Тел 279 | тех 280 | тт 281 | туп 282 | тыс 283 | уд 284 | ул 285 | уч 286 | физ 287 | х 288 | хор 289 | ч 290 | чел 291 | шт 292 | экз 293 | э 294 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.sk: -------------------------------------------------------------------------------- 1 | Bc 2 | Mgr 3 | RNDr 4 | PharmDr 5 | PhDr 6 | JUDr 7 | PaedDr 8 | ThDr 9 | Ing 10 | MUDr 11 | MDDr 12 | MVDr 13 | Dr 14 | ThLic 15 | PhD 16 | ArtD 17 | ThDr 18 | Dr 19 | DrSc 20 | CSs 21 | prof 22 | obr 23 | Obr 24 | Č 25 | č 26 | absol 27 | adj 28 | admin 29 | adr 30 | Adr 31 | adv 32 | advok 33 | afr 34 | ak 35 | akad 36 | akc 37 | akuz 38 | et 39 | al 40 | alch 41 | amer 42 | anat 43 | angl 44 | Angl 45 | anglosas 46 | anorg 47 | ap 48 | apod 49 | arch 50 | archeol 51 | archit 52 | arg 53 | art 54 | astr 55 | astrol 56 | astron 57 | atp 58 | atď 59 | austr 60 | Austr 61 | aut 62 | belg 63 | Belg 64 | bibl 65 | Bibl 66 | biol 67 | bot 68 | bud 69 | bás 70 | býv 71 | cest 72 | chem 73 | cirk 74 | csl 75 | čs 76 | Čs 77 | dat 78 | dep 79 | det 80 | dial 81 | diaľ 82 | dipl 83 | distrib 84 | dokl 85 | dosl 86 | dopr 87 | dram 88 | duš 89 | dv 90 | dvojčl 91 | dór 92 | ekol 93 | ekon 94 | el 95 | elektr 96 | elektrotech 97 | energet 98 | epic 99 | est 100 | etc 101 | etonym 102 | eufem 103 | európ 104 | Európ 105 | ev 106 | evid 107 | expr 108 | fa 109 | fam 110 | farm 111 | fem 112 | feud 113 | fil 114 | filat 115 | filoz 116 | fi 117 | fon 118 | form 119 | fot 120 | fr 121 | Fr 122 | franc 123 | Franc 124 | fraz 125 | fut 126 | fyz 127 | fyziol 128 | garb 129 | gen 130 | genet 131 | genpor 132 | geod 133 | geogr 134 | geol 135 | geom 136 | germ 137 | gr 138 | Gr 139 | gréc 140 | Gréc 141 | gréckokat 142 | hebr 143 | herald 144 | hist 145 | hlav 146 | hosp 147 | hromad 148 | hud 149 | hypok 150 | ident 151 | i.e 152 | ident 153 | imp 154 | impf 155 | indoeur 156 | inf 157 | inform 158 | instr 159 | int 160 | interj 161 | inšt 162 | inštr 163 | iron 164 | jap 165 | Jap 166 | jaz 167 | jedn 168 | juhoamer 169 | juhových 170 | juhozáp 171 | juž 172 | kanad 173 | Kanad 174 | kanc 175 | kapit 176 | kpt 177 | kart 178 | katastr 179 | knih 180 | kniž 181 | komp 182 | konj 183 | konkr 184 | kozmet 185 | krajč 186 | kresť 187 | kt 188 | kuch 189 | lat 190 | latinskoamer 191 | lek 192 | lex 193 | lingv 194 | lit 195 | litur 196 | log 197 | lok 198 | max 199 | Max 200 | maď 201 | Maď 202 | medzinár 203 | mest 204 | metr 205 | mil 206 | Mil 207 | min 208 | Min 209 | miner 210 | ml 211 | mld 212 | mn 213 | mod 214 | mytol 215 | napr 216 | nar 217 | Nar 218 | nasl 219 | nedok 220 | neg 221 | negat 222 | neklas 223 | nem 224 | Nem 225 | neodb 226 | neos 227 | neskl 228 | nesklon 229 | nespis 230 | nespráv 231 | neved 232 | než 233 | niekt 234 | niž 235 | nom 236 | náb 237 | nákl 238 | námor 239 | nár 240 | obch 241 | obj 242 | obv 243 | obyč 244 | obč 245 | občian 246 | odb 247 | odd 248 | ods 249 | ojed 250 | okr 251 | Okr 252 | opt 253 | opyt 254 | org 255 | os 256 | osob 257 | ot 258 | ovoc 259 | par 260 | part 261 | pejor 262 | pers 263 | pf 264 | Pf 265 | P.f 266 | p.f 267 | pl 268 | Plk 269 | pod 270 | podst 271 | pokl 272 | polit 273 | politol 274 | polygr 275 | pomn 276 | popl 277 | por 278 | porad 279 | porov 280 | posch 281 | potrav 282 | použ 283 | poz 284 | pozit 285 | poľ 286 | poľno 287 | poľnohosp 288 | poľov 289 | pošt 290 | pož 291 | prac 292 | predl 293 | pren 294 | 
prep 295 | preuk 296 | priezv 297 | Priezv 298 | privl 299 | prof 300 | práv 301 | príd 302 | príj 303 | prík 304 | príp 305 | prír 306 | prísl 307 | príslov 308 | príč 309 | psych 310 | publ 311 | pís 312 | písm 313 | pôv 314 | refl 315 | reg 316 | rep 317 | resp 318 | rozk 319 | rozlič 320 | rozpráv 321 | roč 322 | Roč 323 | ryb 324 | rádiotech 325 | rím 326 | samohl 327 | semest 328 | sev 329 | severoamer 330 | severových 331 | severozáp 332 | sg 333 | skr 334 | skup 335 | sl 336 | Sloven 337 | soc 338 | soch 339 | sociol 340 | sp 341 | spol 342 | Spol 343 | spoloč 344 | spoluhl 345 | správ 346 | spôs 347 | st 348 | star 349 | starogréc 350 | starorím 351 | s.r.o 352 | stol 353 | stor 354 | str 355 | stredoamer 356 | stredoškol 357 | subj 358 | subst 359 | superl 360 | sv 361 | sz 362 | súkr 363 | súp 364 | súvzť 365 | tal 366 | Tal 367 | tech 368 | tel 369 | Tel 370 | telef 371 | teles 372 | telev 373 | teol 374 | trans 375 | turist 376 | tuzem 377 | typogr 378 | tzn 379 | tzv 380 | ukaz 381 | ul 382 | Ul 383 | umel 384 | univ 385 | ust 386 | ved 387 | vedľ 388 | verb 389 | veter 390 | vin 391 | viď 392 | vl 393 | vod 394 | vodohosp 395 | pnl 396 | vulg 397 | vyj 398 | vys 399 | vysokoškol 400 | vzťaž 401 | vôb 402 | vých 403 | výd 404 | výrob 405 | výsk 406 | výsl 407 | výtv 408 | výtvar 409 | význ 410 | včel 411 | vš 412 | všeob 413 | zahr 414 | zar 415 | zariad 416 | zast 417 | zastar 418 | zastaráv 419 | zb 420 | zdravot 421 | združ 422 | zjemn 423 | zlat 424 | zn 425 | Zn 426 | zool 427 | zr 428 | zried 429 | zv 430 | záhr 431 | zák 432 | zákl 433 | zám 434 | záp 435 | západoeur 436 | zázn 437 | územ 438 | účt 439 | čast 440 | čes 441 | Čes 442 | čl 443 | čísl 444 | živ 445 | pr 446 | fak 447 | Kr 448 | p.n.l 449 | A 450 | B 451 | C 452 | D 453 | E 454 | F 455 | G 456 | H 457 | I 458 | J 459 | K 460 | L 461 | M 462 | N 463 | O 464 | P 465 | Q 466 | R 467 | S 468 | T 469 | U 470 | V 471 | W 472 | X 473 | Y 474 | Z 475 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.sl: -------------------------------------------------------------------------------- 1 | dr 2 | Dr 3 | itd 4 | itn 5 | št #NUMERIC_ONLY# 6 | Št #NUMERIC_ONLY# 7 | d 8 | jan 9 | Jan 10 | feb 11 | Feb 12 | mar 13 | Mar 14 | apr 15 | Apr 16 | jun 17 | Jun 18 | jul 19 | Jul 20 | avg 21 | Avg 22 | sept 23 | Sept 24 | sep 25 | Sep 26 | okt 27 | Okt 28 | nov 29 | Nov 30 | dec 31 | Dec 32 | tj 33 | Tj 34 | npr 35 | Npr 36 | sl 37 | Sl 38 | op 39 | Op 40 | gl 41 | Gl 42 | oz 43 | Oz 44 | prev 45 | dipl 46 | ing 47 | prim 48 | Prim 49 | cf 50 | Cf 51 | gl 52 | Gl 53 | A 54 | B 55 | C 56 | D 57 | E 58 | F 59 | G 60 | H 61 | I 62 | J 63 | K 64 | L 65 | M 66 | N 67 | O 68 | P 69 | Q 70 | R 71 | S 72 | T 73 | U 74 | V 75 | W 76 | X 77 | Y 78 | Z 79 | -------------------------------------------------------------------------------- /data/nonbreaking_prefixes/nonbreaking_prefix.sv: -------------------------------------------------------------------------------- 1 | #single upper case letter are usually initials 2 | A 3 | B 4 | C 5 | D 6 | E 7 | F 8 | G 9 | H 10 | I 11 | J 12 | K 13 | L 14 | M 15 | N 16 | O 17 | P 18 | Q 19 | R 20 | S 21 | T 22 | U 23 | V 24 | W 25 | X 26 | Y 27 | Z 28 | #misc abbreviations 29 | AB 30 | G 31 | VG 32 | dvs 33 | etc 34 | from 35 | iaf 36 | jfr 37 | kl 38 | kr 39 | mao 40 | mfl 41 | mm 42 | osv 43 | pga 44 | tex 45 | tom 46 | vs 47 | -------------------------------------------------------------------------------- 
/data/nonbreaking_prefixes/nonbreaking_prefix.ta: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | அ 7 | ஆ 8 | இ 9 | ஈ 10 | உ 11 | ஊ 12 | எ 13 | ஏ 14 | ஐ 15 | ஒ 16 | ஓ 17 | ஔ 18 | ஃ 19 | க 20 | கா 21 | கி 22 | கீ 23 | கு 24 | கூ 25 | கெ 26 | கே 27 | கை 28 | கொ 29 | கோ 30 | கௌ 31 | க் 32 | ச 33 | சா 34 | சி 35 | சீ 36 | சு 37 | சூ 38 | செ 39 | சே 40 | சை 41 | சொ 42 | சோ 43 | சௌ 44 | ச் 45 | ட 46 | டா 47 | டி 48 | டீ 49 | டு 50 | டூ 51 | டெ 52 | டே 53 | டை 54 | டொ 55 | டோ 56 | டௌ 57 | ட் 58 | த 59 | தா 60 | தி 61 | தீ 62 | து 63 | தூ 64 | தெ 65 | தே 66 | தை 67 | தொ 68 | தோ 69 | தௌ 70 | த் 71 | ப 72 | பா 73 | பி 74 | பீ 75 | பு 76 | பூ 77 | பெ 78 | பே 79 | பை 80 | பொ 81 | போ 82 | பௌ 83 | ப் 84 | ற 85 | றா 86 | றி 87 | றீ 88 | று 89 | றூ 90 | றெ 91 | றே 92 | றை 93 | றொ 94 | றோ 95 | றௌ 96 | ற் 97 | ய 98 | யா 99 | யி 100 | யீ 101 | யு 102 | யூ 103 | யெ 104 | யே 105 | யை 106 | யொ 107 | யோ 108 | யௌ 109 | ய் 110 | ர 111 | ரா 112 | ரி 113 | ரீ 114 | ரு 115 | ரூ 116 | ரெ 117 | ரே 118 | ரை 119 | ரொ 120 | ரோ 121 | ரௌ 122 | ர் 123 | ல 124 | லா 125 | லி 126 | லீ 127 | லு 128 | லூ 129 | லெ 130 | லே 131 | லை 132 | லொ 133 | லோ 134 | லௌ 135 | ல் 136 | வ 137 | வா 138 | வி 139 | வீ 140 | வு 141 | வூ 142 | வெ 143 | வே 144 | வை 145 | வொ 146 | வோ 147 | வௌ 148 | வ் 149 | ள 150 | ளா 151 | ளி 152 | ளீ 153 | ளு 154 | ளூ 155 | ளெ 156 | ளே 157 | ளை 158 | ளொ 159 | ளோ 160 | ளௌ 161 | ள் 162 | ழ 163 | ழா 164 | ழி 165 | ழீ 166 | ழு 167 | ழூ 168 | ழெ 169 | ழே 170 | ழை 171 | ழொ 172 | ழோ 173 | ழௌ 174 | ழ் 175 | ங 176 | ஙா 177 | ஙி 178 | ஙீ 179 | ஙு 180 | ஙூ 181 | ஙெ 182 | ஙே 183 | ஙை 184 | ஙொ 185 | ஙோ 186 | ஙௌ 187 | ங் 188 | ஞ 189 | ஞா 190 | ஞி 191 | ஞீ 192 | ஞு 193 | ஞூ 194 | ஞெ 195 | ஞே 196 | ஞை 197 | ஞொ 198 | ஞோ 199 | ஞௌ 200 | ஞ் 201 | ண 202 | ணா 203 | ணி 204 | ணீ 205 | ணு 206 | ணூ 207 | ணெ 208 | ணே 209 | ணை 210 | ணொ 211 | ணோ 212 | ணௌ 213 | ண் 214 | ந 215 | நா 216 | நி 217 | நீ 218 | நு 219 | நூ 220 | நெ 221 | நே 222 | நை 223 | நொ 224 | நோ 225 | நௌ 226 | ந் 227 | ம 228 | மா 229 | மி 230 | மீ 231 | மு 232 | மூ 233 | மெ 234 | மே 235 | மை 236 | மொ 237 | மோ 238 | மௌ 239 | ம் 240 | ன 241 | னா 242 | னி 243 | னீ 244 | னு 245 | னூ 246 | னெ 247 | னே 248 | னை 249 | னொ 250 | னோ 251 | னௌ 252 | ன் 253 | 254 | 255 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 256 | திரு 257 | திருமதி 258 | வண 259 | கௌரவ 260 | 261 | 262 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 263 | உ.ம் 264 | #கா.ம் 265 | #எ.ம் 266 | 267 | 268 | #Numbers only. These should only induce breaks when followed by a numeric sequence 269 | # add NUMERIC_ONLY after the word for this function 270 | #This case is mostly for the english "No." 
which can either be a sentence of its own, or 271 | #if followed by a number, a non-breaking prefix 272 | No #NUMERIC_ONLY# 273 | Nos 274 | Art #NUMERIC_ONLY# 275 | Nr 276 | pp #NUMERIC_ONLY# 277 | -------------------------------------------------------------------------------- /data/normalize-punctuation.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 5 | 6 | use warnings; 7 | use strict; 8 | 9 | my $language = "en"; 10 | my $PENN = 0; 11 | 12 | while (@ARGV) { 13 | $_ = shift; 14 | /^-b$/ && ($| = 1, next); # not buffered (flush each line) 15 | /^-l$/ && ($language = shift, next); 16 | /^[^\-]/ && ($language = $_, next); 17 | /^-penn$/ && ($PENN = 1, next); 18 | } 19 | 20 | while() { 21 | s/\r//g; 22 | # remove extra spaces 23 | s/\(/ \(/g; 24 | s/\)/\) /g; s/ +/ /g; 25 | s/\) ([\.\!\:\?\;\,])/\)$1/g; 26 | s/\( /\(/g; 27 | s/ \)/\)/g; 28 | s/(\d) \%/$1\%/g; 29 | s/ :/:/g; 30 | s/ ;/;/g; 31 | # normalize unicode punctuation 32 | if ($PENN == 0) { 33 | s/\`/\'/g; 34 | s/\'\'/ \" /g; 35 | } 36 | 37 | s/„/\"/g; 38 | s/“/\"/g; 39 | s/”/\"/g; 40 | s/–/-/g; 41 | s/—/ - /g; s/ +/ /g; 42 | s/´/\'/g; 43 | s/([a-z])‘([a-z])/$1\'$2/gi; 44 | s/([a-z])’([a-z])/$1\'$2/gi; 45 | s/‘/\"/g; 46 | s/‚/\"/g; 47 | s/’/\"/g; 48 | s/''/\"/g; 49 | s/´´/\"/g; 50 | s/…/.../g; 51 | # French quotes 52 | s/ « / \"/g; 53 | s/« /\"/g; 54 | s/«/\"/g; 55 | s/ » /\" /g; 56 | s/ »/\"/g; 57 | s/»/\"/g; 58 | # handle pseudo-spaces 59 | s/ \%/\%/g; 60 | s/nº /nº /g; 61 | s/ :/:/g; 62 | s/ ºC/ ºC/g; 63 | s/ cm/ cm/g; 64 | s/ \?/\?/g; 65 | s/ \!/\!/g; 66 | s/ ;/;/g; 67 | s/, /, /g; s/ +/ /g; 68 | 69 | # English "quotation," followed by comma, style 70 | if ($language eq "en") { 71 | s/\"([,\.]+)/$1\"/g; 72 | } 73 | # Czech is confused 74 | elsif ($language eq "cs" || $language eq "cz") { 75 | } 76 | # German/Spanish/French "quotation", followed by comma, style 77 | else { 78 | s/,\"/\",/g; 79 | s/(\.+)\"(\s*[^<])/\"$1$2/g; # don't fix period at end of sentence 80 | } 81 | 82 | 83 | if ($language eq "de" || $language eq "es" || $language eq "cz" || $language eq "cs" || $language eq "fr") { 84 | s/(\d) (\d)/$1,$2/g; 85 | } 86 | else { 87 | s/(\d) (\d)/$1.$2/g; 88 | } 89 | print $_; 90 | } 91 | -------------------------------------------------------------------------------- /data/postprocess.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # input path 4 | INPUT=$1 5 | 6 | # output path 7 | OUTPUT=$2 8 | 9 | # restore subword units to original segmentation 10 | sed -r 's/(@@ )|(@@ ?$)//g' ${INPUT} > ${OUTPUT} 11 | -------------------------------------------------------------------------------- /data/preprocess.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # source language suffix (example: en, cs, de, fr) 4 | S=$1 5 | 6 | # target language suffix (example: en, cs, de, fr) 7 | T=$2 8 | 9 | # path to corpus 10 | CORPUS=$3 11 | 12 | # maximum sequence length 13 | MAXLEN=$4 14 | 15 | echo "normalizing punctuation.." 16 | perl normalize-punctuation.perl -l ${S} < ${CORPUS}.${S} > ${CORPUS}.norm.${S} 17 | perl normalize-punctuation.perl -l ${T} < ${CORPUS}.${T} > ${CORPUS}.norm.${T} 18 | 19 | echo "tokenizing.." 
20 | perl tokenizer.perl -l ${S} -threads 10 < ${CORPUS}.norm.${S} > ${CORPUS}.tok.${S} 21 | perl tokenizer.perl -l ${T} -threads 10 < ${CORPUS}.norm.${T} > ${CORPUS}.tok.${T} 22 | 23 | echo "learning bpe.." 24 | # learn BPE on joint vocabulary 25 | cat ${CORPUS}.tok.${S} ${CORPUS}.tok.${T} | python subword_nmt/learn_bpe.py -s 30000 > ${S}${T}.bpe 26 | 27 | echo "applying bpe.." 28 | python subword_nmt/apply_bpe.py -c ${S}${T}.bpe < ${CORPUS}.tok.${S} > ${CORPUS}.bpe.${S} 29 | python subword_nmt/apply_bpe.py -c ${S}${T}.bpe < ${CORPUS}.tok.${T} > ${CORPUS}.bpe.${T} 30 | 31 | echo "cleaning: filtering sequences of length over ${MAXLEN}" 32 | perl clean-corpus-n.perl ${CORPUS}.bpe ${S} ${T} ${CORPUS}.clean 1 ${MAXLEN} 33 | 34 | echo "shuffling.." 35 | python shuffle.py ${CORPUS}.clean.${S} ${CORPUS}.clean.${T} 36 | 37 | mv ${CORPUS}.clean.${S}.shuf ${CORPUS}.shuf.${S} 38 | mv ${CORPUS}.clean.${T}.shuf ${CORPUS}.shuf.${T} 39 | 40 | echo "building dictionaries.." 41 | python build_dictionary.py ${CORPUS}.shuf.${S} ${CORPUS}.shuf.${T} 42 | 43 | echo "preprocessing complete.." 44 | python data_statistics.py ${CORPUS}.shuf.${S} ${CORPUS}.shuf.${T} 45 | -------------------------------------------------------------------------------- /data/sample.en: -------------------------------------------------------------------------------- 1 | Parliament Does Not Support Amendment Freeing Tymoshenko 2 | Today, the Ukraine parliament dismissed, within the Code of Criminal Procedure amendment, the motion to revoke an article based on which the opposition leader, Yulia Tymoshenko, was sentenced. 3 | The amendment that would lead to freeing the imprisoned former Prime Minister was revoked during second reading of the proposal for mitigation of sentences for economic offences. 4 | In October, Tymoshenko was sentenced to seven years in prison for entering into what was reported to be a disadvantageous gas deal with Russia. 5 | The verdict is not yet final; the court will hear Tymoshenko's appeal in December. 6 | Tymoshenko claims the verdict is a political revenge of the regime; in the West, the trial has also evoked suspicion of being biased. 7 | The proposal to remove Article 365 from the Code of Criminal Procedure, upon which the former Prime Minister was sentenced, was supported by 147 members of parliament. 8 | Its ratification would require 226 votes. 9 | Libya's Victory 10 | The story of Libya's liberation, or rebellion, already has its defeated. 11 | Muammar Kaddafi is buried at an unknown place in the desert. Without him, the war is over. 12 | It is time to define the winners. 13 | As a rule, Islamists win in the country; the question is whether they are the moderate or the radical ones. 14 | The transitional cabinet declared itself a follower of the customary Sharia law, of which we have already heard. 15 | Libya will become a crime free country, as the punishment for stealing is having one's hand amputated. 16 | Women can forget about emancipation; potential religious renegades will be executed; etc. 17 | Instead of a dictator, a society consisting of competing tribes will be united by Koran. 18 | Libya will be in an order we cannot imagine and surely would not ask for. 19 | However, our lifestyle is neither unique nor the best one and would not, most probably, be suitable for, for example, the people of Libya. 20 | In fact, it is a wonder that the Islamic fighters accepted help from the nonbelievers. 
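The sample.en text above is exactly the kind of raw input that the preprocessing pipeline consumes. One step that is easy to miss is the subword round trip: apply_bpe.py marks every non-final subword piece with the separator '@@ ', and postprocess.sh later strips those markers from decoder output to restore normal tokenization. Below is a minimal Python sketch of that restoration step, using an illustrative segmentation (real splits depend on the learned merge operations):

```python
import re

def restore_subwords(line):
    # Undo BPE segmentation by dropping the '@@ ' continuation markers,
    # mirroring the sed command used in postprocess.sh.
    return re.sub(r'(@@ )|(@@ ?$)', '', line)

print(restore_subwords("this is un@@ believ@@ able"))  # -> "this is unbelievable"
```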
-------------------------------------------------------------------------------- /data/sample.fr: -------------------------------------------------------------------------------- 1 | Le Parlement n'a pas ratifié l'amendement pour la libération de Tymosenko 2 | Le parlement ukrainien a refusé, dans la cadre d'un amendement au droit pénal, le projet d'annulation du paragraphe relatif à l'inculpation de Julia Tymosenko, la chef de l'opposition. 3 | Les députés ont refusé en deuxième lecture le projet de modification visant la réduction des peines pour délits économiques, qui aurait pu ouvrir les portes de la liberté pour l'ex-Première Ministre actuellement emprisonnée. 4 | Tymosenko a été condamnée en octobre à 7 ans de prison pour avoir conclu un accord à priori désavantageux avec la Russie pour l'achat de gaz naturel. 5 | Le jugement n'est pas définitif et le tribunal doit statuer sur l'appel de la condamnée en décembre. 6 | Tymosenko qualifie le jugement de vengeance politique du régime et a provoqué un processus de soupçons de partialité du tribunal également à l'ouest. 7 | Le projet d'annuler le paragraphe 365 du droit pénal, sur la base duquel l'ex-Premère Ministre a été condamnée, a été signé par 147 députés. 8 | Il aurait fallu 226 voix pour l'approuver. 9 | Victoire libyenne 10 | La libération ou rebellion libyenne a déjà ses vaincus. 11 | Muammar Kaddafi est enterré dans un endroit inconnu dans le désert et sans lui la guerre est terminée. 12 | Il reste à déterminer les vainqueurs. 13 | Comme il est de coutume dans la région, les vainqueurs aux élections sont les islamistes, mais s'agit-il des modérés ou des radicaux. 14 | Le Conseil National provisoire a décrété le droit habituel de la Charia et nous savons déjà de quoi il s'agit. 15 | La Libye devient un pays sans criminalité car pour un vol on coupe la main. 16 | Les femmes peuvent oublier l'émancipation, les non respectueuses éventuelles dela foi sont passibles de la peine capitale... 17 | C'est le Coran qui va unir à la place de la personnalité du dictateur la société composée de tribus opposées. 18 | En Libye règnera un tel ordre comme nous ne pouvons pas nous le représenter et comme nous n'aimerions sûrement pas. 19 | Mais notre façon de vivre n'est pas unique, n'est pas objectivement la meilleure et ne serait probablement pas avantageuse pour les libyens. 20 | Il est particulièrement étonnant que les guerriers islamistes ont accepté l'aide avec reconnaissance des chiens infidèles. 
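sample.en and sample.fr form a line-aligned parallel corpus: line i of one file is the translation of line i of the other, and every preprocessing step (cleaning, shuffling, BPE) has to preserve that pairing. A minimal sketch of iterating over such a pair in lock-step, with the file names taken from this repository:

```python
# Read the two sides of the parallel corpus together,
# yielding one aligned (source, target) sentence pair per line.
with open('sample.en') as src, open('sample.fr') as trg:
    for en_line, fr_line in zip(src, trg):
        print(en_line.strip(), '|||', fr_line.strip())
```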
-------------------------------------------------------------------------------- /data/shuffle.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import random 4 | 5 | import tempfile 6 | from subprocess import call 7 | 8 | ''' 9 | This code comes from shuffle.py of 10 | nematus proejct (https://github.com/rsennrich/nematus) 11 | ''' 12 | 13 | def main(files, temporary=False): 14 | 15 | tf_os, tpath = tempfile.mkstemp() 16 | tf = open(tpath, 'w') 17 | 18 | fds = [open(ff) for ff in files] 19 | 20 | for l in fds[0]: 21 | lines = [l.strip()] + [ff.readline().strip() for ff in fds[1:]] 22 | print >>tf, "|||".join(lines) 23 | 24 | [ff.close() for ff in fds] 25 | tf.close() 26 | 27 | lines = open(tpath, 'r').readlines() 28 | random.shuffle(lines) 29 | 30 | if temporary: 31 | fds = [] 32 | for ff in files: 33 | path, filename = os.path.split(os.path.realpath(ff)) 34 | fds.append(tempfile.TemporaryFile(prefix=filename+'.shuf', dir=path)) 35 | else: 36 | fds = [open(ff+'.shuf','w') for ff in files] 37 | 38 | for l in lines: 39 | s = l.strip().split('|||') 40 | for ii, fd in enumerate(fds): 41 | print >>fd, s[ii] 42 | 43 | if temporary: 44 | [ff.seek(0) for ff in fds] 45 | else: 46 | [ff.close() for ff in fds] 47 | 48 | os.close(tf_os) 49 | os.remove(tpath) 50 | 51 | return fds 52 | 53 | if __name__ == '__main__': 54 | main(sys.argv[1:]) 55 | 56 | 57 | 58 | 59 | -------------------------------------------------------------------------------- /data/strip_sgml.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import re 3 | 4 | ''' 5 | This code comes from strip_sgml.py of 6 | nematus proejct (https://github.com/rsennrich/nematus) 7 | ''' 8 | 9 | def main(): 10 | fin = sys.stdin 11 | fout = sys.stdout 12 | for l in fin: 13 | line = l.strip() 14 | text = re.sub('<[^<]+>', "", line).strip() 15 | if len(text) == 0: 16 | continue 17 | print >>fout, text 18 | 19 | 20 | if __name__ == "__main__": 21 | main() 22 | 23 | -------------------------------------------------------------------------------- /data/subword_nmt/README.md: -------------------------------------------------------------------------------- 1 | Preprocessing scripts to learn and apply subword units 2 | (https://github.com/rsennrich/subword-nmt) 3 | -------------------------------------------------------------------------------- /data/subword_nmt/apply_bpe.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | # Author: Rico Sennrich 4 | 5 | """Use operations learned with learn_bpe.py to encode a new text. 6 | The text will not be smaller, but use only a fixed vocabulary, with rare words 7 | encoded as variable-length sequences of subword units. 8 | 9 | Reference: 10 | Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. 11 | Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany. 
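Illustrative example (actual splits depend on the learned merge operations):
an input line such as "the lowest price" may be segmented into "the low@@ est price",
where the trailing "@@" marks a subword unit that is continued by the next token
(see the --separator option below).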
12 | """ 13 | 14 | from __future__ import unicode_literals, division 15 | 16 | import sys 17 | import codecs 18 | import argparse 19 | from collections import defaultdict 20 | 21 | # hack for python2/3 compatibility 22 | from io import open 23 | argparse.open = open 24 | 25 | # python 2/3 compatibility 26 | if sys.version_info < (3, 0): 27 | sys.stderr = codecs.getwriter('UTF-8')(sys.stderr) 28 | sys.stdout = codecs.getwriter('UTF-8')(sys.stdout) 29 | sys.stdin = codecs.getreader('UTF-8')(sys.stdin) 30 | 31 | class BPE(object): 32 | 33 | def __init__(self, codes, separator='@@'): 34 | self.bpe_codes = [tuple(item.split()) for item in codes] 35 | # some hacking to deal with duplicates (only consider first instance) 36 | self.bpe_codes = dict([(code,i) for (i,code) in reversed(list(enumerate(self.bpe_codes)))]) 37 | 38 | self.separator = separator 39 | 40 | def segment(self, sentence): 41 | """segment single sentence (whitespace-tokenized string) with BPE encoding""" 42 | 43 | output = [] 44 | for word in sentence.split(): 45 | new_word = encode(word, self.bpe_codes) 46 | 47 | for item in new_word[:-1]: 48 | output.append(item + self.separator) 49 | output.append(new_word[-1]) 50 | 51 | return ' '.join(output) 52 | 53 | def create_parser(): 54 | parser = argparse.ArgumentParser( 55 | formatter_class=argparse.RawDescriptionHelpFormatter, 56 | description="learn BPE-based word segmentation") 57 | 58 | parser.add_argument( 59 | '--input', '-i', type=argparse.FileType('r'), default=sys.stdin, 60 | metavar='PATH', 61 | help="Input file (default: standard input).") 62 | parser.add_argument( 63 | '--codes', '-c', type=argparse.FileType('r'), metavar='PATH', 64 | required=True, 65 | help="File with BPE codes (created by learn_bpe.py).") 66 | parser.add_argument( 67 | '--output', '-o', type=argparse.FileType('w'), default=sys.stdout, 68 | metavar='PATH', 69 | help="Output file (default: standard output)") 70 | parser.add_argument( 71 | '--separator', '-s', type=str, default='@@', metavar='STR', 72 | help="Separator between non-final subword units (default: '%(default)s'))") 73 | 74 | return parser 75 | 76 | def get_pairs(word): 77 | """Return set of symbol pairs in a word. 
78 | 79 | word is represented as tuple of symbols (symbols being variable-length strings) 80 | """ 81 | pairs = set() 82 | prev_char = word[0] 83 | for char in word[1:]: 84 | pairs.add((prev_char, char)) 85 | prev_char = char 86 | return pairs 87 | 88 | def encode(orig, bpe_codes, cache={}): 89 | """Encode word based on list of BPE merge operations, which are applied consecutively 90 | """ 91 | 92 | if orig in cache: 93 | return cache[orig] 94 | 95 | word = tuple(orig) + ('',) 96 | pairs = get_pairs(word) 97 | 98 | while True: 99 | bigram = min(pairs, key = lambda pair: bpe_codes.get(pair, float('inf'))) 100 | if bigram not in bpe_codes: 101 | break 102 | first, second = bigram 103 | new_word = [] 104 | i = 0 105 | while i < len(word): 106 | try: 107 | j = word.index(first, i) 108 | new_word.extend(word[i:j]) 109 | i = j 110 | except: 111 | new_word.extend(word[i:]) 112 | break 113 | 114 | if word[i] == first and i < len(word)-1 and word[i+1] == second: 115 | new_word.append(first+second) 116 | i += 2 117 | else: 118 | new_word.append(word[i]) 119 | i += 1 120 | new_word = tuple(new_word) 121 | word = new_word 122 | if len(word) == 1: 123 | break 124 | else: 125 | pairs = get_pairs(word) 126 | 127 | # don't print end-of-word symbols 128 | if word[-1] == '': 129 | word = word[:-1] 130 | elif word[-1].endswith(''): 131 | word = word[:-1] + (word[-1].replace('',''),) 132 | 133 | cache[orig] = word 134 | return word 135 | 136 | 137 | if __name__ == '__main__': 138 | parser = create_parser() 139 | args = parser.parse_args() 140 | 141 | bpe = BPE(args.codes, args.separator) 142 | 143 | for line in args.input: 144 | args.output.write(bpe.segment(line).strip()) 145 | args.output.write('\n') 146 | -------------------------------------------------------------------------------- /data/subword_nmt/chrF.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | # Author: Rico Sennrich 4 | 5 | """Compute chrF3 for machine translation evaluation 6 | 7 | Reference: 8 | Maja Popović (2015). chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translationn, pages 392–395, Lisbon, Portugal. 
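The reported score is the F-beta measure over character n-grams,

    chrF_beta = (1 + beta^2) * P * R / (beta^2 * P + R)

with beta = 3 by default, where precision P and recall R are each averaged over
n-gram orders 1..N (N = 6 by default) before being combined; see f1() below.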
9 | """ 10 | 11 | from __future__ import print_function, unicode_literals, division 12 | import sys 13 | import codecs 14 | import io 15 | import argparse 16 | from collections import defaultdict 17 | from math import log, exp 18 | 19 | # hack for python2/3 compatibility 20 | from io import open 21 | argparse.open = open 22 | 23 | # python 2/3 compatibility 24 | if sys.version_info < (3, 0): 25 | sys.stderr = codecs.getwriter('UTF-8')(sys.stderr) 26 | sys.stdout = codecs.getwriter('UTF-8')(sys.stdout) 27 | sys.stdin = codecs.getreader('UTF-8')(sys.stdin) 28 | 29 | 30 | def create_parser(): 31 | parser = argparse.ArgumentParser( 32 | formatter_class=argparse.RawDescriptionHelpFormatter, 33 | description="learn BPE-based word segmentation") 34 | 35 | parser.add_argument( 36 | '--ref', '-r', type=argparse.FileType('r'), required=True, 37 | metavar='PATH', 38 | help="Reference file") 39 | parser.add_argument( 40 | '--hyp', type=argparse.FileType('r'), metavar='PATH', 41 | default=sys.stdin, 42 | help="Hypothesis file (default: stdin).") 43 | parser.add_argument( 44 | '--beta', '-b', type=float, default=3, 45 | metavar='FLOAT', 46 | help="beta parameter (default: '%(default)s')") 47 | parser.add_argument( 48 | '--ngram', '-n', type=int, default=6, 49 | metavar='INT', 50 | help="ngram order (default: '%(default)s')") 51 | parser.add_argument( 52 | '--space', '-s', action='store_true', 53 | help="take spaces into account (default: '%(default)s')") 54 | parser.add_argument( 55 | '--precision', action='store_true', 56 | help="report precision (default: '%(default)s')") 57 | parser.add_argument( 58 | '--recall', action='store_true', 59 | help="report recall (default: '%(default)s')") 60 | 61 | return parser 62 | 63 | def extract_ngrams(words, max_length=4, spaces=False): 64 | 65 | if not spaces: 66 | words = ''.join(words.split()) 67 | else: 68 | words = words.strip() 69 | 70 | results = defaultdict(lambda: defaultdict(int)) 71 | for length in range(max_length): 72 | for start_pos in range(len(words)): 73 | end_pos = start_pos + length + 1 74 | if end_pos <= len(words): 75 | results[length][tuple(words[start_pos: end_pos])] += 1 76 | return results 77 | 78 | 79 | def get_correct(ngrams_ref, ngrams_test, correct, total): 80 | 81 | for rank in ngrams_test: 82 | for chain in ngrams_test[rank]: 83 | total[rank] += ngrams_test[rank][chain] 84 | if chain in ngrams_ref[rank]: 85 | correct[rank] += min(ngrams_test[rank][chain], ngrams_ref[rank][chain]) 86 | 87 | return correct, total 88 | 89 | 90 | def f1(correct, total_hyp, total_ref, max_length, beta=3, smooth=0): 91 | 92 | precision = 0 93 | recall = 0 94 | 95 | for i in range(max_length): 96 | if total_hyp[i] + smooth and total_ref[i] + smooth: 97 | precision += (correct[i] + smooth) / (total_hyp[i] + smooth) 98 | recall += (correct[i] + smooth) / (total_ref[i] + smooth) 99 | 100 | precision /= max_length 101 | recall /= max_length 102 | 103 | return (1 + beta**2) * (precision*recall) / ((beta**2 * precision) + recall), precision, recall 104 | 105 | def main(args): 106 | 107 | correct = [0]*args.ngram 108 | total = [0]*args.ngram 109 | total_ref = [0]*args.ngram 110 | for line in args.ref: 111 | line2 = args.hyp.readline() 112 | 113 | ngrams_ref = extract_ngrams(line, max_length=args.ngram, spaces=args.space) 114 | ngrams_test = extract_ngrams(line2, max_length=args.ngram, spaces=args.space) 115 | 116 | get_correct(ngrams_ref, ngrams_test, correct, total) 117 | 118 | for rank in ngrams_ref: 119 | for chain in ngrams_ref[rank]: 120 | total_ref[rank] += 
ngrams_ref[rank][chain] 121 | 122 | chrf, precision, recall = f1(correct, total, total_ref, args.ngram, args.beta) 123 | 124 | print('chrF3: {0:.4f}'.format(chrf)) 125 | if args.precision: 126 | print('chrPrec: {0:.4f}'.format(precision)) 127 | if args.recall: 128 | print('chrRec: {0:.4f}'.format(recall)) 129 | 130 | if __name__ == '__main__': 131 | 132 | parser = create_parser() 133 | args = parser.parse_args() 134 | 135 | main(args) 136 | -------------------------------------------------------------------------------- /data/subword_nmt/learn_bpe.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | # Author: Rico Sennrich 4 | 5 | """Use byte pair encoding (BPE) to learn a variable-length encoding of the vocabulary in a text. 6 | Unlike the original BPE, it does not compress the plain text, but can be used to reduce the vocabulary 7 | of a text to a configurable number of symbols, with only a small increase in the number of tokens. 8 | 9 | Reference: 10 | Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. 11 | Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany. 12 | """ 13 | 14 | from __future__ import unicode_literals 15 | 16 | import sys 17 | import codecs 18 | import re 19 | import copy 20 | import argparse 21 | from collections import defaultdict, Counter 22 | 23 | # hack for python2/3 compatibility 24 | from io import open 25 | argparse.open = open 26 | 27 | # python 2/3 compatibility 28 | if sys.version_info < (3, 0): 29 | sys.stderr = codecs.getwriter('UTF-8')(sys.stderr) 30 | sys.stdout = codecs.getwriter('UTF-8')(sys.stdout) 31 | sys.stdin = codecs.getreader('UTF-8')(sys.stdin) 32 | 33 | def create_parser(): 34 | parser = argparse.ArgumentParser( 35 | formatter_class=argparse.RawDescriptionHelpFormatter, 36 | description="learn BPE-based word segmentation") 37 | 38 | parser.add_argument( 39 | '--input', '-i', type=argparse.FileType('r'), default=sys.stdin, 40 | metavar='PATH', 41 | help="Input text (default: standard input).") 42 | parser.add_argument( 43 | '--output', '-o', type=argparse.FileType('w'), default=sys.stdout, 44 | metavar='PATH', 45 | help="Output file for BPE codes (default: standard output)") 46 | parser.add_argument( 47 | '--symbols', '-s', type=int, default=10000, 48 | help="Create this many new symbols (each representing a character n-gram) (default: %(default)s))") 49 | parser.add_argument( 50 | '--verbose', '-v', action="store_true", 51 | help="verbose mode.") 52 | 53 | return parser 54 | 55 | def get_vocabulary(fobj): 56 | """Read text and return dictionary that encodes vocabulary 57 | """ 58 | vocab = Counter() 59 | for line in fobj: 60 | for word in line.split(): 61 | vocab[word] += 1 62 | return vocab 63 | 64 | def update_pair_statistics(pair, changed, stats, indices): 65 | """Minimally update the indices and frequency of symbol pairs 66 | 67 | if we merge a pair of symbols, only pairs that overlap with occurrences 68 | of this pair are affected, and need to be updated. 
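For example, when the pair ('A', 'B') is merged inside an occurrence ('w', 'A', 'B', 'x'), the neighbouring pairs ('w', 'A') and ('B', 'x') each lose that word's frequency, while ('w', 'AB') and ('AB', 'x') each gain it; pairs that do not touch the merged occurrence keep their counts.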
69 | """ 70 | stats[pair] = 0 71 | indices[pair] = defaultdict(int) 72 | first, second = pair 73 | new_pair = first+second 74 | for j, word, old_word, freq in changed: 75 | 76 | # find all instances of pair, and update frequency/indices around it 77 | i = 0 78 | while True: 79 | try: 80 | i = old_word.index(first, i) 81 | except ValueError: 82 | break 83 | if i < len(old_word)-1 and old_word[i+1] == second: 84 | if i: 85 | prev = old_word[i-1:i+1] 86 | stats[prev] -= freq 87 | indices[prev][j] -= 1 88 | if i < len(old_word)-2: 89 | # don't double-count consecutive pairs 90 | if old_word[i+2] != first or i >= len(old_word)-3 or old_word[i+3] != second: 91 | nex = old_word[i+1:i+3] 92 | stats[nex] -= freq 93 | indices[nex][j] -= 1 94 | i += 2 95 | else: 96 | i += 1 97 | 98 | i = 0 99 | while True: 100 | try: 101 | i = word.index(new_pair, i) 102 | except ValueError: 103 | break 104 | if i: 105 | prev = word[i-1:i+1] 106 | stats[prev] += freq 107 | indices[prev][j] += 1 108 | # don't double-count consecutive pairs 109 | if i < len(word)-1 and word[i+1] != new_pair: 110 | nex = word[i:i+2] 111 | stats[nex] += freq 112 | indices[nex][j] += 1 113 | i += 1 114 | 115 | 116 | def get_pair_statistics(vocab): 117 | """Count frequency of all symbol pairs, and create index""" 118 | 119 | # data structure of pair frequencies 120 | stats = defaultdict(int) 121 | 122 | #index from pairs to words 123 | indices = defaultdict(lambda: defaultdict(int)) 124 | 125 | for i, (word, freq) in enumerate(vocab): 126 | prev_char = word[0] 127 | for char in word[1:]: 128 | stats[prev_char, char] += freq 129 | indices[prev_char, char][i] += 1 130 | prev_char = char 131 | 132 | return stats, indices 133 | 134 | 135 | def replace_pair(pair, vocab, indices): 136 | """Replace all occurrences of a symbol pair ('A', 'B') with a new symbol 'AB'""" 137 | first, second = pair 138 | pair_str = ''.join(pair) 139 | pair_str = pair_str.replace('\\','\\\\') 140 | changes = [] 141 | pattern = re.compile(r'(?',) ,y) for (x,y) in vocab.items()]) 181 | sorted_vocab = sorted(vocab.items(), key=lambda x: x[1], reverse=True) 182 | 183 | stats, indices = get_pair_statistics(sorted_vocab) 184 | big_stats = copy.deepcopy(stats) 185 | # threshold is inspired by Zipfian assumption, but should only affect speed 186 | threshold = max(stats.values()) / 10 187 | for i in range(args.symbols): 188 | if stats: 189 | most_frequent = max(stats, key=stats.get) 190 | 191 | # we probably missed the best pair because of pruning; go back to full statistics 192 | if not stats or (i and stats[most_frequent] < threshold): 193 | prune_stats(stats, big_stats, threshold) 194 | stats = copy.deepcopy(big_stats) 195 | most_frequent = max(stats, key=stats.get) 196 | # threshold is inspired by Zipfian assumption, but should only affect speed 197 | threshold = stats[most_frequent] * i/(i+10000.0) 198 | prune_stats(stats, big_stats, threshold) 199 | 200 | if stats[most_frequent] < 2: 201 | sys.stderr.write('no pair has frequency > 1. 
Stopping\n') 202 | break 203 | 204 | if args.verbose: 205 | sys.stderr.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\n'.format(i, most_frequent[0], most_frequent[1], stats[most_frequent])) 206 | args.output.write('{0} {1}\n'.format(*most_frequent)) 207 | changes = replace_pair(most_frequent, sorted_vocab, indices) 208 | update_pair_statistics(most_frequent, changes, stats, indices) 209 | stats[most_frequent] = 0 210 | if not i % 100: 211 | prune_stats(stats, big_stats, threshold) 212 | -------------------------------------------------------------------------------- /data/subword_nmt/segment-char-ngrams.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | # Author: Rico Sennrich 4 | 5 | from __future__ import unicode_literals, division 6 | 7 | import sys 8 | import codecs 9 | import argparse 10 | 11 | # hack for python2/3 compatibility 12 | from io import open 13 | argparse.open = open 14 | 15 | # python 2/3 compatibility 16 | if sys.version_info < (3, 0): 17 | sys.stderr = codecs.getwriter('UTF-8')(sys.stderr) 18 | sys.stdout = codecs.getwriter('UTF-8')(sys.stdout) 19 | sys.stdin = codecs.getreader('UTF-8')(sys.stdin) 20 | 21 | def create_parser(): 22 | parser = argparse.ArgumentParser( 23 | formatter_class=argparse.RawDescriptionHelpFormatter, 24 | description="segment rare words into character n-grams") 25 | 26 | parser.add_argument( 27 | '--input', '-i', type=argparse.FileType('r'), default=sys.stdin, 28 | metavar='PATH', 29 | help="Input file (default: standard input).") 30 | parser.add_argument( 31 | '--vocab', type=argparse.FileType('r'), metavar='PATH', 32 | required=True, 33 | help="Vocabulary file.") 34 | parser.add_argument( 35 | '--shortlist', type=int, metavar='INT', default=0, 36 | help="do not segment INT most frequent words in vocabulary (default: '%(default)s')).") 37 | parser.add_argument( 38 | '-n', type=int, metavar='INT', default=2, 39 | help="segment rare words into character n-grams of size INT (default: '%(default)s')).") 40 | parser.add_argument( 41 | '--output', '-o', type=argparse.FileType('w'), default=sys.stdout, 42 | metavar='PATH', 43 | help="Output file (default: standard output)") 44 | parser.add_argument( 45 | '--separator', '-s', type=str, default='@@', metavar='STR', 46 | help="Separator between non-final subword units (default: '%(default)s'))") 47 | 48 | return parser 49 | 50 | 51 | if __name__ == '__main__': 52 | 53 | parser = create_parser() 54 | args = parser.parse_args() 55 | 56 | vocab = [line.split()[0] for line in args.vocab if len(line.split()) == 2] 57 | vocab = dict((y,x) for (x,y) in enumerate(vocab)) 58 | 59 | for line in args.input: 60 | for word in line.split(): 61 | if word not in vocab or vocab[word] > args.shortlist: 62 | i = 0 63 | while i*args.n < len(word): 64 | args.output.write(word[i*args.n:i*args.n+args.n]) 65 | i += 1 66 | if i*args.n < len(word): 67 | args.output.write(args.separator) 68 | args.output.write(' ') 69 | else: 70 | args.output.write(word + ' ') 71 | args.output.write('\n') 72 | -------------------------------------------------------------------------------- /data/tokenizer.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 
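For reference, the learning loop in learn_bpe.py above boils down to: count adjacent symbol pairs over the vocabulary, merge the most frequent pair everywhere, emit that pair as one merge rule, and repeat --symbols times; prune_stats and update_pair_statistics only exist to keep this fast on large corpora. A minimal, self-contained sketch of the same idea (toy vocabulary from Sennrich et al., 2016; illustrative code, not part of this repository):

```python
# Naive BPE learning: no pruning, no incremental index updates.
import re
from collections import Counter

def get_stats(vocab):
    """Count adjacent symbol pairs over a {tuple_of_symbols: frequency} vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word so that the chosen pair becomes a single symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = ''.join(pair)
    return {tuple(pattern.sub(merged, ' '.join(word)).split()): freq
            for word, freq in vocab.items()}

vocab = {('l','o','w','</w>'): 5, ('l','o','w','e','r','</w>'): 2,
         ('n','e','w','e','s','t','</w>'): 6, ('w','i','d','e','s','t','</w>'): 3}
for _ in range(10):
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)   # e.g. ('e', 's') on the first iteration
    print('merge:', best)
    vocab = merge_pair(best, vocab)
```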
5 | 6 | use warnings; 7 | 8 | # Sample Tokenizer 9 | ### Version 1.1 10 | # written by Pidong Wang, based on the code written by Josh Schroeder and Philipp Koehn 11 | # Version 1.1 updates: 12 | # (1) add multithreading option "-threads NUM_THREADS" (default is 1); 13 | # (2) add a timing option "-time" to calculate the average speed of this tokenizer; 14 | # (3) add an option "-lines NUM_SENTENCES_PER_THREAD" to set the number of lines for each thread (default is 2000), and this option controls the memory amount needed: the larger this number is, the larger memory is required (the higher tokenization speed); 15 | ### Version 1.0 16 | # $Id: tokenizer.perl 915 2009-08-10 08:15:49Z philipp $ 17 | # written by Josh Schroeder, based on code by Philipp Koehn 18 | 19 | binmode(STDIN, ":utf8"); 20 | binmode(STDOUT, ":utf8"); 21 | 22 | use warnings; 23 | use FindBin qw($RealBin); 24 | use strict; 25 | use Time::HiRes; 26 | 27 | if (eval {require Thread;1;}) { 28 | #module loaded 29 | Thread->import(); 30 | } 31 | 32 | my $mydir = "$RealBin/nonbreaking_prefixes"; 33 | 34 | my %NONBREAKING_PREFIX = (); 35 | my @protected_patterns = (); 36 | my $protected_patterns_file = ""; 37 | my $language = "en"; 38 | my $QUIET = 0; 39 | my $HELP = 0; 40 | my $AGGRESSIVE = 0; 41 | my $SKIP_XML = 0; 42 | my $TIMING = 0; 43 | my $NUM_THREADS = 1; 44 | my $NUM_SENTENCES_PER_THREAD = 2000; 45 | my $PENN = 0; 46 | my $NO_ESCAPING = 0; 47 | while (@ARGV) 48 | { 49 | $_ = shift; 50 | /^-b$/ && ($| = 1, next); 51 | /^-l$/ && ($language = shift, next); 52 | /^-q$/ && ($QUIET = 1, next); 53 | /^-h$/ && ($HELP = 1, next); 54 | /^-x$/ && ($SKIP_XML = 1, next); 55 | /^-a$/ && ($AGGRESSIVE = 1, next); 56 | /^-time$/ && ($TIMING = 1, next); 57 | # Option to add list of regexps to be protected 58 | /^-protected/ && ($protected_patterns_file = shift, next); 59 | /^-threads$/ && ($NUM_THREADS = int(shift), next); 60 | /^-lines$/ && ($NUM_SENTENCES_PER_THREAD = int(shift), next); 61 | /^-penn$/ && ($PENN = 1, next); 62 | /^-no-escape/ && ($NO_ESCAPING = 1, next); 63 | } 64 | 65 | # for time calculation 66 | my $start_time; 67 | if ($TIMING) 68 | { 69 | $start_time = [ Time::HiRes::gettimeofday( ) ]; 70 | } 71 | 72 | # print help message 73 | if ($HELP) 74 | { 75 | print "Usage ./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile\n"; 76 | print "Options:\n"; 77 | print " -q ... quiet.\n"; 78 | print " -a ... aggressive hyphen splitting.\n"; 79 | print " -b ... disable Perl buffering.\n"; 80 | print " -time ... enable processing time calculation.\n"; 81 | print " -penn ... use Penn treebank-like tokenization.\n"; 82 | print " -protected FILE ... specify file with patters to be protected in tokenisation.\n"; 83 | print " -no-escape ... 
don't perform HTML escaping on apostrophy, quotes, etc.\n"; 84 | exit; 85 | } 86 | 87 | if (!$QUIET) 88 | { 89 | print STDERR "Tokenizer Version 1.1\n"; 90 | print STDERR "Language: $language\n"; 91 | print STDERR "Number of threads: $NUM_THREADS\n"; 92 | } 93 | 94 | # load the language-specific non-breaking prefix info from files in the directory nonbreaking_prefixes 95 | load_prefixes($language,\%NONBREAKING_PREFIX); 96 | 97 | if (scalar(%NONBREAKING_PREFIX) eq 0) 98 | { 99 | print STDERR "Warning: No known abbreviations for language '$language'\n"; 100 | } 101 | 102 | # Load protected patterns 103 | if ($protected_patterns_file) 104 | { 105 | open(PP,$protected_patterns_file) || die "Unable to open $protected_patterns_file"; 106 | while() { 107 | chomp; 108 | push @protected_patterns, $_; 109 | } 110 | } 111 | 112 | my @batch_sentences = (); 113 | my @thread_list = (); 114 | my $count_sentences = 0; 115 | 116 | if ($NUM_THREADS > 1) 117 | {# multi-threading tokenization 118 | while() 119 | { 120 | $count_sentences = $count_sentences + 1; 121 | push(@batch_sentences, $_); 122 | if (scalar(@batch_sentences)>=($NUM_SENTENCES_PER_THREAD*$NUM_THREADS)) 123 | { 124 | # assign each thread work 125 | for (my $i=0; $i<$NUM_THREADS; $i++) 126 | { 127 | my $start_index = $i*$NUM_SENTENCES_PER_THREAD; 128 | my $end_index = $start_index+$NUM_SENTENCES_PER_THREAD-1; 129 | my @subbatch_sentences = @batch_sentences[$start_index..$end_index]; 130 | my $new_thread = new Thread \&tokenize_batch, @subbatch_sentences; 131 | push(@thread_list, $new_thread); 132 | } 133 | foreach (@thread_list) 134 | { 135 | my $tokenized_list = $_->join; 136 | foreach (@$tokenized_list) 137 | { 138 | print $_; 139 | } 140 | } 141 | # reset for the new run 142 | @thread_list = (); 143 | @batch_sentences = (); 144 | } 145 | } 146 | # the last batch 147 | if (scalar(@batch_sentences)>0) 148 | { 149 | # assign each thread work 150 | for (my $i=0; $i<$NUM_THREADS; $i++) 151 | { 152 | my $start_index = $i*$NUM_SENTENCES_PER_THREAD; 153 | if ($start_index >= scalar(@batch_sentences)) 154 | { 155 | last; 156 | } 157 | my $end_index = $start_index+$NUM_SENTENCES_PER_THREAD-1; 158 | if ($end_index >= scalar(@batch_sentences)) 159 | { 160 | $end_index = scalar(@batch_sentences)-1; 161 | } 162 | my @subbatch_sentences = @batch_sentences[$start_index..$end_index]; 163 | my $new_thread = new Thread \&tokenize_batch, @subbatch_sentences; 164 | push(@thread_list, $new_thread); 165 | } 166 | foreach (@thread_list) 167 | { 168 | my $tokenized_list = $_->join; 169 | foreach (@$tokenized_list) 170 | { 171 | print $_; 172 | } 173 | } 174 | } 175 | } 176 | else 177 | {# single thread only 178 | while() 179 | { 180 | if (($SKIP_XML && /^<.+>$/) || /^\s*$/) 181 | { 182 | #don't try to tokenize XML/HTML tag lines 183 | print $_; 184 | } 185 | else 186 | { 187 | print &tokenize($_); 188 | } 189 | } 190 | } 191 | 192 | if ($TIMING) 193 | { 194 | my $duration = Time::HiRes::tv_interval( $start_time ); 195 | print STDERR ("TOTAL EXECUTION TIME: ".$duration."\n"); 196 | print STDERR ("TOKENIZATION SPEED: ".($duration/$count_sentences*1000)." 
milliseconds/line\n"); 197 | } 198 | 199 | ##################################################################################### 200 | # subroutines afterward 201 | 202 | # tokenize a batch of texts saved in an array 203 | # input: an array containing a batch of texts 204 | # return: another array containing a batch of tokenized texts for the input array 205 | sub tokenize_batch 206 | { 207 | my(@text_list) = @_; 208 | my(@tokenized_list) = (); 209 | foreach (@text_list) 210 | { 211 | if (($SKIP_XML && /^<.+>$/) || /^\s*$/) 212 | { 213 | #don't try to tokenize XML/HTML tag lines 214 | push(@tokenized_list, $_); 215 | } 216 | else 217 | { 218 | push(@tokenized_list, &tokenize($_)); 219 | } 220 | } 221 | return \@tokenized_list; 222 | } 223 | 224 | # the actual tokenize function which tokenizes one input string 225 | # input: one string 226 | # return: the tokenized string for the input string 227 | sub tokenize 228 | { 229 | my($text) = @_; 230 | 231 | if ($PENN) { 232 | return tokenize_penn($text); 233 | } 234 | 235 | chomp($text); 236 | $text = " $text "; 237 | 238 | # remove ASCII junk 239 | $text =~ s/\s+/ /g; 240 | $text =~ s/[\000-\037]//g; 241 | 242 | # Find protected patterns 243 | my @protected = (); 244 | foreach my $protected_pattern (@protected_patterns) { 245 | my $t = $text; 246 | while ($t =~ /($protected_pattern)(.*)$/) { 247 | push @protected, $1; 248 | $t = $2; 249 | } 250 | } 251 | 252 | for (my $i = 0; $i < scalar(@protected); ++$i) { 253 | my $subst = sprintf("THISISPROTECTED%.3d", $i); 254 | $text =~ s,\Q$protected[$i], $subst ,g; 255 | } 256 | $text =~ s/ +/ /g; 257 | $text =~ s/^ //g; 258 | $text =~ s/ $//g; 259 | 260 | # seperate out all "other" special characters 261 | $text =~ s/([^\p{IsAlnum}\s\.\'\`\,\-])/ $1 /g; 262 | 263 | # aggressive hyphen splitting 264 | if ($AGGRESSIVE) 265 | { 266 | $text =~ s/([\p{IsAlnum}])\-(?=[\p{IsAlnum}])/$1 \@-\@ /g; 267 | } 268 | 269 | #multi-dots stay together 270 | $text =~ s/\.([\.]+)/ DOTMULTI$1/g; 271 | while($text =~ /DOTMULTI\./) 272 | { 273 | $text =~ s/DOTMULTI\.([^\.])/DOTDOTMULTI $1/g; 274 | $text =~ s/DOTMULTI\./DOTDOTMULTI/g; 275 | } 276 | 277 | # seperate out "," except if within numbers (5,300) 278 | #$text =~ s/([^\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g; 279 | 280 | # separate out "," except if within numbers (5,300) 281 | # previous "global" application skips some: A,B,C,D,E > A , B,C , D,E 282 | # first application uses up B so rule can't see B,C 283 | # two-step version here may create extra spaces but these are removed later 284 | # will also space digit,letter or letter,digit forms (redundant with next section) 285 | $text =~ s/([^\p{IsN}])[,]/$1 , /g; 286 | $text =~ s/[,]([^\p{IsN}])/ , $1/g; 287 | 288 | # separate , pre and post number 289 | #$text =~ s/([\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g; 290 | #$text =~ s/([^\p{IsN}])[,]([\p{IsN}])/$1 , $2/g; 291 | 292 | # turn `into ' 293 | #$text =~ s/\`/\'/g; 294 | 295 | #turn '' into " 296 | #$text =~ s/\'\'/ \" /g; 297 | 298 | if ($language eq "en") 299 | { 300 | #split contractions right 301 | $text =~ s/([^\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g; 302 | $text =~ s/([^\p{IsAlpha}\p{IsN}])[']([\p{IsAlpha}])/$1 ' $2/g; 303 | $text =~ s/([\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g; 304 | $text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 '$2/g; 305 | #special case for "1990's" 306 | $text =~ s/([\p{IsN}])[']([s])/$1 '$2/g; 307 | } 308 | elsif (($language eq "fr") or ($language eq "it")) 309 | { 310 | #split contractions left 311 | $text =~ 
s/([^\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g; 312 | $text =~ s/([^\p{IsAlpha}])[']([\p{IsAlpha}])/$1 ' $2/g; 313 | $text =~ s/([\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g; 314 | $text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1' $2/g; 315 | } 316 | else 317 | { 318 | $text =~ s/\'/ \' /g; 319 | } 320 | 321 | #word token method 322 | my @words = split(/\s/,$text); 323 | $text = ""; 324 | for (my $i=0;$i<(scalar(@words));$i++) 325 | { 326 | my $word = $words[$i]; 327 | if ( $word =~ /^(\S+)\.$/) 328 | { 329 | my $pre = $1; 330 | if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i/\>/g; # xml 371 | $text =~ s/\'/\'/g; # xml 372 | $text =~ s/\"/\"/g; # xml 373 | $text =~ s/\[/\[/g; # syntax non-terminal 374 | $text =~ s/\]/\]/g; # syntax non-terminal 375 | } 376 | 377 | #ensure final line break 378 | $text .= "\n" unless $text =~ /\n$/; 379 | 380 | return $text; 381 | } 382 | 383 | sub tokenize_penn 384 | { 385 | # Improved compatibility with Penn Treebank tokenization. Useful if 386 | # the text is to later be parsed with a PTB-trained parser. 387 | # 388 | # Adapted from Robert MacIntyre's sed script: 389 | # http://www.cis.upenn.edu/~treebank/tokenizer.sed 390 | 391 | my($text) = @_; 392 | chomp($text); 393 | 394 | # remove ASCII junk 395 | $text =~ s/\s+/ /g; 396 | $text =~ s/[\000-\037]//g; 397 | 398 | # attempt to get correct directional quotes 399 | $text =~ s/^``/`` /g; 400 | $text =~ s/^"/`` /g; 401 | $text =~ s/^`([^`])/` $1/g; 402 | $text =~ s/^'/` /g; 403 | $text =~ s/([ ([{<])"/$1 `` /g; 404 | $text =~ s/([ ([{<])``/$1 `` /g; 405 | $text =~ s/([ ([{<])`([^`])/$1 ` $2/g; 406 | $text =~ s/([ ([{<])'/$1 ` /g; 407 | # close quotes handled at end 408 | 409 | $text =~ s=\.\.\.= _ELLIPSIS_ =g; 410 | 411 | # separate out "," except if within numbers (5,300) 412 | $text =~ s/([^\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g; 413 | # separate , pre and post number 414 | $text =~ s/([\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g; 415 | $text =~ s/([^\p{IsN}])[,]([\p{IsN}])/$1 , $2/g; 416 | 417 | #$text =~ s=([;:@#\$%&\p{IsSc}])= $1 =g; 418 | $text =~ s=([;:@#\$%&\p{IsSc}\p{IsSo}])= $1 =g; 419 | 420 | # Separate out intra-token slashes. PTB tokenization doesn't do this, so 421 | # the tokens should be merged prior to parsing with a PTB-trained parser 422 | # (see syntax-hyphen-splitting.perl). 423 | $text =~ s/([\p{IsAlnum}])\/([\p{IsAlnum}])/$1 \@\/\@ $2/g; 424 | 425 | # Assume sentence tokenization has been done first, so split FINAL periods 426 | # only. 427 | $text =~ s=([^.])([.])([\]\)}>"']*) ?$=$1 $2$3 =g; 428 | # however, we may as well split ALL question marks and exclamation points, 429 | # since they shouldn't have the abbrev.-marker ambiguity problem 430 | $text =~ s=([?!])= $1 =g; 431 | 432 | # parentheses, brackets, etc. 433 | $text =~ s=([\]\[\(\){}<>])= $1 =g; 434 | $text =~ s/\(/-LRB-/g; 435 | $text =~ s/\)/-RRB-/g; 436 | $text =~ s/\[/-LSB-/g; 437 | $text =~ s/\]/-RSB-/g; 438 | $text =~ s/{/-LCB-/g; 439 | $text =~ s/}/-RCB-/g; 440 | 441 | $text =~ s=--= -- =g; 442 | 443 | # First off, add a space to the beginning and end of each line, to reduce 444 | # necessary number of regexps. 
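The language-specific branches in tokenize() above split contractions so that the apostrophe stays attached to the suffix in English ("it's" -> "it 's") and to the prefix in French/Italian ("l'homme" -> "l' homme"). A rough Python illustration of just the English rules (simplified sketch; the Perl above also handles non-letter contexts and the later escaping pass):

```python
# Illustrative only: simplified port of the English contraction splitting.
import re

def split_english_contractions(text):
    # letter ' letter -> keep the apostrophe with the suffix ("it's" -> "it 's")
    text = re.sub(r"([^\W\d_])'([^\W\d_])", r"\1 '\2", text)
    # special case for decades: "1990's" -> "1990 's"
    text = re.sub(r"(\d)'(s)", r"\1 '\2", text)
    return text

print(split_english_contractions("it's the host's 1990's fashion"))
# it 's the host 's 1990 's fashion
```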
445 | $text =~ s=$= =; 446 | $text =~ s=^= =; 447 | 448 | $text =~ s="= '' =g; 449 | # possessive or close-single-quote 450 | $text =~ s=([^'])' =$1 ' =g; 451 | # as in it's, I'm, we'd 452 | $text =~ s='([sSmMdD]) = '$1 =g; 453 | $text =~ s='ll = 'll =g; 454 | $text =~ s='re = 're =g; 455 | $text =~ s='ve = 've =g; 456 | $text =~ s=n't = n't =g; 457 | $text =~ s='LL = 'LL =g; 458 | $text =~ s='RE = 'RE =g; 459 | $text =~ s='VE = 'VE =g; 460 | $text =~ s=N'T = N'T =g; 461 | 462 | $text =~ s= ([Cc])annot = $1an not =g; 463 | $text =~ s= ([Dd])'ye = $1' ye =g; 464 | $text =~ s= ([Gg])imme = $1im me =g; 465 | $text =~ s= ([Gg])onna = $1on na =g; 466 | $text =~ s= ([Gg])otta = $1ot ta =g; 467 | $text =~ s= ([Ll])emme = $1em me =g; 468 | $text =~ s= ([Mm])ore'n = $1ore 'n =g; 469 | $text =~ s= '([Tt])is = '$1 is =g; 470 | $text =~ s= '([Tt])was = '$1 was =g; 471 | $text =~ s= ([Ww])anna = $1an na =g; 472 | 473 | #word token method 474 | my @words = split(/\s/,$text); 475 | $text = ""; 476 | for (my $i=0;$i<(scalar(@words));$i++) 477 | { 478 | my $word = $words[$i]; 479 | if ( $word =~ /^(\S+)\.$/) 480 | { 481 | my $pre = $1; 482 | if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i/\>/g; # xml 511 | $text =~ s/\'/\'/g; # xml 512 | $text =~ s/\"/\"/g; # xml 513 | $text =~ s/\[/\[/g; # syntax non-terminal 514 | $text =~ s/\]/\]/g; # syntax non-terminal 515 | 516 | #ensure final line break 517 | $text .= "\n" unless $text =~ /\n$/; 518 | 519 | return $text; 520 | } 521 | 522 | sub load_prefixes 523 | { 524 | my ($language, $PREFIX_REF) = @_; 525 | 526 | my $prefixfile = "$mydir/nonbreaking_prefix.$language"; 527 | 528 | #default back to English if we don't have a language-specific prefix file 529 | if (!(-e $prefixfile)) 530 | { 531 | $prefixfile = "$mydir/nonbreaking_prefix.en"; 532 | print STDERR "WARNING: No known abbreviations for language '$language', attempting fall-back to English version...\n"; 533 | die ("ERROR: No abbreviations files found in $mydir\n") unless (-e $prefixfile); 534 | } 535 | 536 | if (-e "$prefixfile") 537 | { 538 | open(PREFIX, "<:utf8", "$prefixfile"); 539 | while () 540 | { 541 | my $item = $_; 542 | chomp($item); 543 | if (($item) && (substr($item,0,1) ne "#")) 544 | { 545 | if ($item =~ /(.*)[\s]+(\#NUMERIC_ONLY\#)/) 546 | { 547 | $PREFIX_REF->{$1} = 2; 548 | } 549 | else 550 | { 551 | $PREFIX_REF->{$item} = 1; 552 | } 553 | } 554 | } 555 | close(PREFIX); 556 | } 557 | } 558 | -------------------------------------------------------------------------------- /data/util.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Utility functions 3 | ''' 4 | 5 | ''' 6 | This code is based on the util.py of 7 | nematus project (https://github.com/rsennrich/nematus) 8 | ''' 9 | 10 | import sys 11 | import json 12 | import cPickle as pkl 13 | 14 | #json loads strings as unicode; we currently still work with Python 2 strings, and need conversion 15 | def unicode_to_utf8(d): 16 | return dict((key.encode("UTF-8"), value) for (key,value) in d.items()) 17 | 18 | def load_dict(filename): 19 | try: 20 | with open(filename, 'rb') as f: 21 | return unicode_to_utf8(json.load(f)) 22 | except: 23 | with open(filename, 'rb') as f: 24 | return pkl.load(f) 25 | 26 | 27 | def load_config(basename): 28 | try: 29 | with open('%s.json' % basename, 'rb') as f: 30 | return json.load(f) 31 | except: 32 | try: 33 | with open('%s.pkl' % basename, 'rb') as f: 34 | return pkl.load(f) 35 | except: 36 | 
sys.stderr.write('Error: config file {0}.json is missing\n'.format(basename)) 37 | sys.exit(1) 38 | 39 | -------------------------------------------------------------------------------- /decode.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "\n", 12 | "#!/usr/bin/env python\n", 13 | "# coding: utf-8\n", 14 | "\n", 15 | "import os\n", 16 | "import math\n", 17 | "import time\n", 18 | "import json\n", 19 | "import random\n", 20 | "\n", 21 | "from collections import OrderedDict\n", 22 | "\n", 23 | "import numpy as np\n", 24 | "import tensorflow as tf\n", 25 | "\n", 26 | "from data.data_iterator import TextIterator\n", 27 | "\n", 28 | "import data.util as util\n", 29 | "import data.data_utils as data_utils\n", 30 | "from data.data_utils import prepare_batch\n", 31 | "from data.data_utils import prepare_train_batch\n", 32 | "\n", 33 | "from seq2seq_model import Seq2SeqModel" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "collapsed": true 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "# Decoding parameters\n", 45 | "tf.app.flags.DEFINE_integer('beam_width', 12, 'Beam width used in beamsearch')\n", 46 | "tf.app.flags.DEFINE_integer('decode_batch_size', 80, 'Batch size used for decoding')\n", 47 | "tf.app.flags.DEFINE_integer('max_decode_step', 500, 'Maximum time step limit to decode')\n", 48 | "tf.app.flags.DEFINE_boolean('write_n_best', False, 'Write n-best list (n=beam_width)')\n", 49 | "tf.app.flags.DEFINE_string('model_path', None, 'Path to a specific model checkpoint.')\n", 50 | "tf.app.flags.DEFINE_string('decode_input', 'data/newstest2012.bpe.de', 'Decoding input path')\n", 51 | "tf.app.flags.DEFINE_string('decode_output', 'data/newstest2012.bpe.de.trans', 'Decoding output path')\n", 52 | "\n", 53 | "# Runtime parameters\n", 54 | "tf.app.flags.DEFINE_boolean('allow_soft_placement', True, 'Allow device soft placement')\n", 55 | "tf.app.flags.DEFINE_boolean('log_device_placement', False, 'Log placement of ops on devices')\n", 56 | "\n", 57 | "FLAGS = tf.app.flags.FLAGS" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": { 64 | "collapsed": true 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "def load_config(FLAGS):\n", 69 | " \n", 70 | " config = util.unicode_to_utf8(\n", 71 | " json.load(open('%s.json' % FLAGS.model_path, 'rb')))\n", 72 | " for key, value in FLAGS.__flags.items():\n", 73 | " config[key] = value\n", 74 | "\n", 75 | " return config\n", 76 | "\n", 77 | "\n", 78 | "def load_model(session, config):\n", 79 | " \n", 80 | " model = Seq2SeqModel(config, 'decode')\n", 81 | " if tf.train.checkpoint_exists(FLAGS.model_path):\n", 82 | " print 'Reloading model parameters..'\n", 83 | " model.restore(session, FLAGS.model_path)\n", 84 | " else:\n", 85 | " raise ValueError(\n", 86 | " 'No such file:[{}]'.format(FLAGS.model_path))\n", 87 | " return model" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "def decode():\n", 99 | " # Load model config\n", 100 | " config = load_config(FLAGS)\n", 101 | "\n", 102 | " # Load source data to decode\n", 103 | " test_set = TextIterator(source=config['decode_input'],\n", 104 | " batch_size=config['decode_batch_size'],\n", 105 
| " source_dict=config['source_vocabulary'],\n", 106 | " maxlen=None,\n", 107 | " n_words_source=config['num_encoder_symbols'])\n", 108 | "\n", 109 | " # Load inverse dictionary used in decoding\n", 110 | " target_inverse_dict = data_utils.load_inverse_dict(config['target_vocabulary'])\n", 111 | " \n", 112 | " # Initiate TF session\n", 113 | " with tf.Session(config=tf.ConfigProto(allow_soft_placement=FLAGS.allow_soft_placement, \n", 114 | " log_device_placement=FLAGS.log_device_placement, gpu_options=tf.GPUOptions(allow_growth=True))) as sess:\n", 115 | "\n", 116 | " # Reload existing checkpoint\n", 117 | " model = load_model(sess, config)\n", 118 | " try:\n", 119 | " print 'Decoding {}..'.format(FLAGS.decode_input)\n", 120 | " if FLAGS.write_n_best:\n", 121 | " fout = [data_utils.fopen((\"%s_%d\" % (FLAGS.decode_output, k)), 'w') \\\n", 122 | " for k in range(FLAGS.beam_width)]\n", 123 | " else:\n", 124 | " fout = [data_utils.fopen(FLAGS.decode_output, 'w')]\n", 125 | " \n", 126 | " for idx, source_seq in enumerate(test_set):\n", 127 | " source, source_len = prepare_batch(source_seq)\n", 128 | " # predicted_ids: GreedyDecoder; [batch_size, max_time_step, 1]\n", 129 | " # BeamSearchDecoder; [batch_size, max_time_step, beam_width]\n", 130 | " predicted_ids = model.predict(sess, encoder_inputs=source, \n", 131 | " encoder_inputs_length=source_len)\n", 132 | " \n", 133 | " # Write decoding results\n", 134 | " for k, f in reversed(list(enumerate(fout))):\n", 135 | " for seq in predicted_ids:\n", 136 | " f.write(str(data_utils.seq2words(seq[:,k], target_inverse_dict)) + '\\n')\n", 137 | " if not FLAGS.write_n_best:\n", 138 | " break\n", 139 | " print ' {}th line decoded'.format(idx * FLAGS.decode_batch_size)\n", 140 | " \n", 141 | " print 'Decoding terminated'\n", 142 | " except IOError:\n", 143 | " pass\n", 144 | " finally:\n", 145 | " [f.close() for f in fout]" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": { 152 | "collapsed": true 153 | }, 154 | "outputs": [], 155 | "source": [ 156 | "def main(_):\n", 157 | " decode()\n", 158 | "\n", 159 | "\n", 160 | "if __name__ == '__main__':\n", 161 | " tf.app.run()" 162 | ] 163 | } 164 | ], 165 | "metadata": { 166 | "kernelspec": { 167 | "display_name": "Python 2", 168 | "language": "python", 169 | "name": "python2" 170 | }, 171 | "language_info": { 172 | "codemirror_mode": { 173 | "name": "ipython", 174 | "version": 2 175 | }, 176 | "file_extension": ".py", 177 | "mimetype": "text/x-python", 178 | "name": "python", 179 | "nbconvert_exporter": "python", 180 | "pygments_lexer": "ipython2", 181 | "version": "2.7.10" 182 | } 183 | }, 184 | "nbformat": 4, 185 | "nbformat_minor": 0 186 | } 187 | -------------------------------------------------------------------------------- /decode.py: -------------------------------------------------------------------------------- 1 | 2 | #!/usr/bin/env python 3 | # coding: utf-8 4 | 5 | import os 6 | import math 7 | import time 8 | import json 9 | import random 10 | 11 | from collections import OrderedDict 12 | 13 | import numpy as np 14 | import tensorflow as tf 15 | 16 | from data.data_iterator import TextIterator 17 | 18 | import data.util as util 19 | import data.data_utils as data_utils 20 | from data.data_utils import prepare_batch 21 | from data.data_utils import prepare_train_batch 22 | 23 | from seq2seq_model import Seq2SeqModel 24 | 25 | # Decoding parameters 26 | tf.app.flags.DEFINE_integer('beam_width', 12, 'Beam width used in beamsearch') 27 | 
tf.app.flags.DEFINE_integer('decode_batch_size', 80, 'Batch size used for decoding') 28 | tf.app.flags.DEFINE_integer('max_decode_step', 500, 'Maximum time step limit to decode') 29 | tf.app.flags.DEFINE_boolean('write_n_best', False, 'Write n-best list (n=beam_width)') 30 | tf.app.flags.DEFINE_string('model_path', None, 'Path to a specific model checkpoint.') 31 | tf.app.flags.DEFINE_string('decode_input', 'data/newstest2012.bpe.de', 'Decoding input path') 32 | tf.app.flags.DEFINE_string('decode_output', 'data/newstest2012.bpe.de.trans', 'Decoding output path') 33 | 34 | # Runtime parameters 35 | tf.app.flags.DEFINE_boolean('allow_soft_placement', True, 'Allow device soft placement') 36 | tf.app.flags.DEFINE_boolean('log_device_placement', False, 'Log placement of ops on devices') 37 | 38 | FLAGS = tf.app.flags.FLAGS 39 | 40 | def load_config(FLAGS): 41 | 42 | config = util.unicode_to_utf8( 43 | json.load(open('%s.json' % FLAGS.model_path, 'rb'))) 44 | for key, value in FLAGS.__flags.items(): 45 | config[key] = value 46 | 47 | return config 48 | 49 | 50 | def load_model(session, config): 51 | 52 | model = Seq2SeqModel(config, 'decode') 53 | if tf.train.checkpoint_exists(FLAGS.model_path): 54 | print 'Reloading model parameters..' 55 | model.restore(session, FLAGS.model_path) 56 | else: 57 | raise ValueError( 58 | 'No such file:[{}]'.format(FLAGS.model_path)) 59 | return model 60 | 61 | 62 | def decode(): 63 | # Load model config 64 | config = load_config(FLAGS) 65 | 66 | # Load source data to decode 67 | test_set = TextIterator(source=config['decode_input'], 68 | batch_size=config['decode_batch_size'], 69 | source_dict=config['source_vocabulary'], 70 | maxlen=None, 71 | n_words_source=config['num_encoder_symbols']) 72 | 73 | # Load inverse dictionary used in decoding 74 | target_inverse_dict = data_utils.load_inverse_dict(config['target_vocabulary']) 75 | 76 | # Initiate TF session 77 | with tf.Session(config=tf.ConfigProto(allow_soft_placement=FLAGS.allow_soft_placement, 78 | log_device_placement=FLAGS.log_device_placement, gpu_options=tf.GPUOptions(allow_growth=True))) as sess: 79 | 80 | # Reload existing checkpoint 81 | model = load_model(sess, config) 82 | try: 83 | print 'Decoding {}..'.format(FLAGS.decode_input) 84 | if FLAGS.write_n_best: 85 | fout = [data_utils.fopen(("%s_%d" % (FLAGS.decode_output, k)), 'w') \ 86 | for k in range(FLAGS.beam_width)] 87 | else: 88 | fout = [data_utils.fopen(FLAGS.decode_output, 'w')] 89 | 90 | for idx, source_seq in enumerate(test_set): 91 | source, source_len = prepare_batch(source_seq) 92 | # predicted_ids: GreedyDecoder; [batch_size, max_time_step, 1] 93 | # BeamSearchDecoder; [batch_size, max_time_step, beam_width] 94 | predicted_ids = model.predict(sess, encoder_inputs=source, 95 | encoder_inputs_length=source_len) 96 | 97 | # Write decoding results 98 | for k, f in reversed(list(enumerate(fout))): 99 | for seq in predicted_ids: 100 | f.write(str(data_utils.seq2words(seq[:,k], target_inverse_dict)) + '\n') 101 | if not FLAGS.write_n_best: 102 | break 103 | print ' {}th line decoded'.format(idx * FLAGS.decode_batch_size) 104 | 105 | print 'Decoding terminated' 106 | except IOError: 107 | pass 108 | finally: 109 | [f.close() for f in fout] 110 | 111 | 112 | def main(_): 113 | decode() 114 | 115 | 116 | if __name__ == '__main__': 117 | tf.app.run() 118 | 119 | -------------------------------------------------------------------------------- /seq2seq_model.py: -------------------------------------------------------------------------------- 1 | 2 | 
#!/usr/bin/env python 3 | # -*- coding: utf-8 -*- 4 | import math 5 | 6 | import numpy as np 7 | import tensorflow as tf 8 | import tensorflow.contrib.seq2seq as seq2seq 9 | 10 | from tensorflow.python.ops.rnn_cell import GRUCell 11 | from tensorflow.python.ops.rnn_cell import LSTMCell 12 | from tensorflow.python.ops.rnn_cell import MultiRNNCell 13 | from tensorflow.python.ops.rnn_cell import DropoutWrapper, ResidualWrapper 14 | 15 | from tensorflow.python.ops import array_ops 16 | from tensorflow.python.ops import control_flow_ops 17 | from tensorflow.python.framework import constant_op 18 | from tensorflow.python.framework import dtypes 19 | from tensorflow.python.layers.core import Dense 20 | from tensorflow.python.util import nest 21 | 22 | from tensorflow.contrib.seq2seq.python.ops import attention_wrapper 23 | from tensorflow.contrib.seq2seq.python.ops import beam_search_decoder 24 | 25 | import data.data_utils as data_utils 26 | 27 | class Seq2SeqModel(object): 28 | 29 | def __init__(self, config, mode): 30 | 31 | assert mode.lower() in ['train', 'decode'] 32 | 33 | self.config = config 34 | self.mode = mode.lower() 35 | 36 | self.cell_type = config['cell_type'] 37 | self.hidden_units = config['hidden_units'] 38 | self.depth = config['depth'] 39 | self.attention_type = config['attention_type'] 40 | self.embedding_size = config['embedding_size'] 41 | #self.bidirectional = config.bidirectional 42 | 43 | self.num_encoder_symbols = config['num_encoder_symbols'] 44 | self.num_decoder_symbols = config['num_decoder_symbols'] 45 | 46 | self.use_residual = config['use_residual'] 47 | self.attn_input_feeding = config['attn_input_feeding'] 48 | self.use_dropout = config['use_dropout'] 49 | self.keep_prob = 1.0 - config['dropout_rate'] 50 | 51 | self.optimizer = config['optimizer'] 52 | self.learning_rate = config['learning_rate'] 53 | self.max_gradient_norm = config['max_gradient_norm'] 54 | self.global_step = tf.Variable(0, trainable=False, name='global_step') 55 | self.global_epoch_step = tf.Variable(0, trainable=False, name='global_epoch_step') 56 | self.global_epoch_step_op = \ 57 | tf.assign(self.global_epoch_step, self.global_epoch_step+1) 58 | 59 | self.dtype = tf.float16 if config['use_fp16'] else tf.float32 60 | self.keep_prob_placeholder = tf.placeholder(self.dtype, shape=[], name='keep_prob') 61 | 62 | self.use_beamsearch_decode=False 63 | if self.mode == 'decode': 64 | self.beam_width = config['beam_width'] 65 | self.use_beamsearch_decode = True if self.beam_width > 1 else False 66 | self.max_decode_step = config['max_decode_step'] 67 | 68 | self.build_model() 69 | 70 | 71 | def build_model(self): 72 | print("building model..") 73 | 74 | # Building encoder and decoder networks 75 | self.init_placeholders() 76 | self.build_encoder() 77 | self.build_decoder() 78 | 79 | # Merge all the training summaries 80 | self.summary_op = tf.summary.merge_all() 81 | 82 | 83 | def init_placeholders(self): 84 | # encoder_inputs: [batch_size, max_time_steps] 85 | self.encoder_inputs = tf.placeholder(dtype=tf.int32, 86 | shape=(None, None), name='encoder_inputs') 87 | 88 | # encoder_inputs_length: [batch_size] 89 | self.encoder_inputs_length = tf.placeholder( 90 | dtype=tf.int32, shape=(None,), name='encoder_inputs_length') 91 | 92 | # get dynamic batch_size 93 | self.batch_size = tf.shape(self.encoder_inputs)[0] 94 | if self.mode == 'train': 95 | 96 | # decoder_inputs: [batch_size, max_time_steps] 97 | self.decoder_inputs = tf.placeholder( 98 | dtype=tf.int32, shape=(None, None), 
name='decoder_inputs') 99 | # decoder_inputs_length: [batch_size] 100 | self.decoder_inputs_length = tf.placeholder( 101 | dtype=tf.int32, shape=(None,), name='decoder_inputs_length') 102 | 103 | decoder_start_token = tf.ones( 104 | shape=[self.batch_size, 1], dtype=tf.int32) * data_utils.start_token 105 | decoder_end_token = tf.ones( 106 | shape=[self.batch_size, 1], dtype=tf.int32) * data_utils.end_token 107 | 108 | # decoder_inputs_train: [batch_size , max_time_steps + 1] 109 | # insert _GO symbol in front of each decoder input 110 | self.decoder_inputs_train = tf.concat([decoder_start_token, 111 | self.decoder_inputs], axis=1) 112 | 113 | # decoder_inputs_length_train: [batch_size] 114 | self.decoder_inputs_length_train = self.decoder_inputs_length + 1 115 | 116 | # decoder_targets_train: [batch_size, max_time_steps + 1] 117 | # insert EOS symbol at the end of each decoder input 118 | self.decoder_targets_train = tf.concat([self.decoder_inputs, 119 | decoder_end_token], axis=1) 120 | 121 | 122 | def build_encoder(self): 123 | print("building encoder..") 124 | with tf.variable_scope('encoder'): 125 | # Building encoder_cell 126 | self.encoder_cell = self.build_encoder_cell() 127 | 128 | # Initialize encoder_embeddings to have variance=1. 129 | sqrt3 = math.sqrt(3) # Uniform(-sqrt(3), sqrt(3)) has variance=1. 130 | initializer = tf.random_uniform_initializer(-sqrt3, sqrt3, dtype=self.dtype) 131 | 132 | self.encoder_embeddings = tf.get_variable(name='embedding', 133 | shape=[self.num_encoder_symbols, self.embedding_size], 134 | initializer=initializer, dtype=self.dtype) 135 | 136 | # Embedded_inputs: [batch_size, time_step, embedding_size] 137 | self.encoder_inputs_embedded = tf.nn.embedding_lookup( 138 | params=self.encoder_embeddings, ids=self.encoder_inputs) 139 | 140 | # Input projection layer to feed embedded inputs to the cell 141 | # ** Essential when use_residual=True to match input/output dims 142 | input_layer = Dense(self.hidden_units, dtype=self.dtype, name='input_projection') 143 | 144 | # Embedded inputs having gone through input projection layer 145 | self.encoder_inputs_embedded = input_layer(self.encoder_inputs_embedded) 146 | 147 | # Encode input sequences into context vectors: 148 | # encoder_outputs: [batch_size, max_time_step, cell_output_size] 149 | # encoder_state: [batch_size, cell_output_size] 150 | self.encoder_outputs, self.encoder_last_state = tf.nn.dynamic_rnn( 151 | cell=self.encoder_cell, inputs=self.encoder_inputs_embedded, 152 | sequence_length=self.encoder_inputs_length, dtype=self.dtype, 153 | time_major=False) 154 | 155 | 156 | def build_decoder(self): 157 | print("building decoder and attention..") 158 | with tf.variable_scope('decoder'): 159 | # Building decoder_cell and decoder_initial_state 160 | self.decoder_cell, self.decoder_initial_state = self.build_decoder_cell() 161 | 162 | # Initialize decoder embeddings to have variance=1. 163 | sqrt3 = math.sqrt(3) # Uniform(-sqrt(3), sqrt(3)) has variance=1. 
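As a concrete picture of what init_placeholders() above does to a target batch: _GO is prepended to build the teacher-forcing decoder inputs and _EOS is appended to build the targets. A hypothetical toy batch (ids made up; in the model the ids come from data_utils.start_token / data_utils.end_token and padded positions are handled by the decoder_inputs_length_train mask):

```python
# Illustrative sketch only -- not part of seq2seq_model.py.
import numpy as np

GO, EOS, PAD = 1, 2, 0                      # made-up ids for the example
decoder_inputs = np.array([[7, 8, 9],
                           [5, 6, PAD]], dtype=np.int32)
batch_size = decoder_inputs.shape[0]

go_column  = np.full((batch_size, 1), GO,  dtype=np.int32)
eos_column = np.full((batch_size, 1), EOS, dtype=np.int32)

decoder_inputs_train  = np.concatenate([go_column,      decoder_inputs], axis=1)  # [[1 7 8 9], [1 5 6 0]]
decoder_targets_train = np.concatenate([decoder_inputs, eos_column],     axis=1)  # [[7 8 9 2], [5 6 0 2]]
```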
164 | initializer = tf.random_uniform_initializer(-sqrt3, sqrt3, dtype=self.dtype) 165 | 166 | self.decoder_embeddings = tf.get_variable(name='embedding', 167 | shape=[self.num_decoder_symbols, self.embedding_size], 168 | initializer=initializer, dtype=self.dtype) 169 | 170 | # Input projection layer to feed embedded inputs to the cell 171 | # ** Essential when use_residual=True to match input/output dims 172 | input_layer = Dense(self.hidden_units, dtype=self.dtype, name='input_projection') 173 | 174 | # Output projection layer to convert cell_outputs to logits 175 | output_layer = Dense(self.num_decoder_symbols, name='output_projection') 176 | 177 | if self.mode == 'train': 178 | # decoder_inputs_embedded: [batch_size, max_time_step + 1, embedding_size] 179 | self.decoder_inputs_embedded = tf.nn.embedding_lookup( 180 | params=self.decoder_embeddings, ids=self.decoder_inputs_train) 181 | 182 | # Embedded inputs having gone through input projection layer 183 | self.decoder_inputs_embedded = input_layer(self.decoder_inputs_embedded) 184 | 185 | # Helper to feed inputs for training: read inputs from dense ground truth vectors 186 | training_helper = seq2seq.TrainingHelper(inputs=self.decoder_inputs_embedded, 187 | sequence_length=self.decoder_inputs_length_train, 188 | time_major=False, 189 | name='training_helper') 190 | 191 | training_decoder = seq2seq.BasicDecoder(cell=self.decoder_cell, 192 | helper=training_helper, 193 | initial_state=self.decoder_initial_state, 194 | output_layer=output_layer) 195 | #output_layer=None) 196 | 197 | # Maximum decoder time_steps in current batch 198 | max_decoder_length = tf.reduce_max(self.decoder_inputs_length_train) 199 | 200 | # decoder_outputs_train: BasicDecoderOutput 201 | # namedtuple(rnn_outputs, sample_id) 202 | # decoder_outputs_train.rnn_output: [batch_size, max_time_step + 1, num_decoder_symbols] if output_time_major=False 203 | # [max_time_step + 1, batch_size, num_decoder_symbols] if output_time_major=True 204 | # decoder_outputs_train.sample_id: [batch_size], tf.int32 205 | (self.decoder_outputs_train, self.decoder_last_state_train, 206 | self.decoder_outputs_length_train) = (seq2seq.dynamic_decode( 207 | decoder=training_decoder, 208 | output_time_major=False, 209 | impute_finished=True, 210 | maximum_iterations=max_decoder_length)) 211 | 212 | # More efficient to do the projection on the batch-time-concatenated tensor 213 | # logits_train: [batch_size, max_time_step + 1, num_decoder_symbols] 214 | # self.decoder_logits_train = output_layer(self.decoder_outputs_train.rnn_output) 215 | self.decoder_logits_train = tf.identity(self.decoder_outputs_train.rnn_output) 216 | # Use argmax to extract decoder symbols to emit 217 | self.decoder_pred_train = tf.argmax(self.decoder_logits_train, axis=-1, 218 | name='decoder_pred_train') 219 | 220 | # masks: masking for valid and padded time steps, [batch_size, max_time_step + 1] 221 | masks = tf.sequence_mask(lengths=self.decoder_inputs_length_train, 222 | maxlen=max_decoder_length, dtype=self.dtype, name='masks') 223 | 224 | # Computes per word average cross-entropy over a batch 225 | # Internally calls 'nn_ops.sparse_softmax_cross_entropy_with_logits' by default 226 | self.loss = seq2seq.sequence_loss(logits=self.decoder_logits_train, 227 | targets=self.decoder_targets_train, 228 | weights=masks, 229 | average_across_timesteps=True, 230 | average_across_batch=True,) 231 | # Training summary for the current batch_loss 232 | tf.summary.scalar('loss', self.loss) 233 | 234 | # Contruct graphs for 
minimizing loss 235 | self.init_optimizer() 236 | 237 | elif self.mode == 'decode': 238 | 239 | # Start_tokens: [batch_size,] `int32` vector 240 | start_tokens = tf.ones([self.batch_size,], tf.int32) * data_utils.start_token 241 | end_token = data_utils.end_token 242 | 243 | def embed_and_input_proj(inputs): 244 | return input_layer(tf.nn.embedding_lookup(self.decoder_embeddings, inputs)) 245 | 246 | if not self.use_beamsearch_decode: 247 | # Helper to feed inputs for greedy decoding: uses the argmax of the output 248 | decoding_helper = seq2seq.GreedyEmbeddingHelper(start_tokens=start_tokens, 249 | end_token=end_token, 250 | embedding=embed_and_input_proj) 251 | # Basic decoder performs greedy decoding at each time step 252 | print("building greedy decoder..") 253 | inference_decoder = seq2seq.BasicDecoder(cell=self.decoder_cell, 254 | helper=decoding_helper, 255 | initial_state=self.decoder_initial_state, 256 | output_layer=output_layer) 257 | else: 258 | # Beamsearch is used to approximately find the most likely translation 259 | print("building beamsearch decoder..") 260 | inference_decoder = beam_search_decoder.BeamSearchDecoder(cell=self.decoder_cell, 261 | embedding=embed_and_input_proj, 262 | start_tokens=start_tokens, 263 | end_token=end_token, 264 | initial_state=self.decoder_initial_state, 265 | beam_width=self.beam_width, 266 | output_layer=output_layer,) 267 | # For GreedyDecoder, return 268 | # decoder_outputs_decode: BasicDecoderOutput instance 269 | # namedtuple(rnn_outputs, sample_id) 270 | # decoder_outputs_decode.rnn_output: [batch_size, max_time_step, num_decoder_symbols] if output_time_major=False 271 | # [max_time_step, batch_size, num_decoder_symbols] if output_time_major=True 272 | # decoder_outputs_decode.sample_id: [batch_size, max_time_step], tf.int32 if output_time_major=False 273 | # [max_time_step, batch_size], tf.int32 if output_time_major=True 274 | 275 | # For BeamSearchDecoder, return 276 | # decoder_outputs_decode: FinalBeamSearchDecoderOutput instance 277 | # namedtuple(predicted_ids, beam_search_decoder_output) 278 | # decoder_outputs_decode.predicted_ids: [batch_size, max_time_step, beam_width] if output_time_major=False 279 | # [max_time_step, batch_size, beam_width] if output_time_major=True 280 | # decoder_outputs_decode.beam_search_decoder_output: BeamSearchDecoderOutput instance 281 | # namedtuple(scores, predicted_ids, parent_ids) 282 | 283 | (self.decoder_outputs_decode, self.decoder_last_state_decode, 284 | self.decoder_outputs_length_decode) = (seq2seq.dynamic_decode( 285 | decoder=inference_decoder, 286 | output_time_major=False, 287 | #impute_finished=True, # error occurs 288 | maximum_iterations=self.max_decode_step)) 289 | 290 | if not self.use_beamsearch_decode: 291 | # decoder_outputs_decode.sample_id: [batch_size, max_time_step] 292 | # Or use argmax to find decoder symbols to emit: 293 | # self.decoder_pred_decode = tf.argmax(self.decoder_outputs_decode.rnn_output, 294 | # axis=-1, name='decoder_pred_decode') 295 | 296 | # Here, we use expand_dims to be compatible with the result of the beamsearch decoder 297 | # decoder_pred_decode: [batch_size, max_time_step, 1] (output_major=False) 298 | self.decoder_pred_decode = tf.expand_dims(self.decoder_outputs_decode.sample_id, -1) 299 | 300 | else: 301 | # Use beam search to approximately find the most likely translation 302 | # decoder_pred_decode: [batch_size, max_time_step, beam_width] (output_major=False) 303 | self.decoder_pred_decode = self.decoder_outputs_decode.predicted_ids 304 | 305 
| 306 | def build_single_cell(self): 307 | cell_type = LSTMCell 308 | if (self.cell_type.lower() == 'gru'): 309 | cell_type = GRUCell 310 | cell = cell_type(self.hidden_units) 311 | 312 | if self.use_dropout: 313 | cell = DropoutWrapper(cell, dtype=self.dtype, 314 | output_keep_prob=self.keep_prob_placeholder,) 315 | if self.use_residual: 316 | cell = ResidualWrapper(cell) 317 | 318 | return cell 319 | 320 | 321 | # Building encoder cell 322 | def build_encoder_cell (self): 323 | 324 | return MultiRNNCell([self.build_single_cell() for i in range(self.depth)]) 325 | 326 | 327 | # Building decoder cell and attention. Also returns decoder_initial_state 328 | def build_decoder_cell(self): 329 | 330 | encoder_outputs = self.encoder_outputs 331 | encoder_last_state = self.encoder_last_state 332 | encoder_inputs_length = self.encoder_inputs_length 333 | 334 | # To use BeamSearchDecoder, encoder_outputs, encoder_last_state, encoder_inputs_length 335 | # needs to be tiled so that: [batch_size, .., ..] -> [batch_size x beam_width, .., ..] 336 | if self.use_beamsearch_decode: 337 | print ("use beamsearch decoding..") 338 | encoder_outputs = seq2seq.tile_batch( 339 | self.encoder_outputs, multiplier=self.beam_width) 340 | encoder_last_state = nest.map_structure( 341 | lambda s: seq2seq.tile_batch(s, self.beam_width), self.encoder_last_state) 342 | encoder_inputs_length = seq2seq.tile_batch( 343 | self.encoder_inputs_length, multiplier=self.beam_width) 344 | 345 | # Building attention mechanism: Default Bahdanau 346 | # 'Bahdanau' style attention: https://arxiv.org/abs/1409.0473 347 | self.attention_mechanism = attention_wrapper.BahdanauAttention( 348 | num_units=self.hidden_units, memory=encoder_outputs, 349 | memory_sequence_length=encoder_inputs_length,) 350 | # 'Luong' style attention: https://arxiv.org/abs/1508.04025 351 | if self.attention_type.lower() == 'luong': 352 | self.attention_mechanism = attention_wrapper.LuongAttention( 353 | num_units=self.hidden_units, memory=encoder_outputs, 354 | memory_sequence_length=encoder_inputs_length,) 355 | 356 | # Building decoder_cell 357 | self.decoder_cell_list = [ 358 | self.build_single_cell() for i in range(self.depth)] 359 | decoder_initial_state = encoder_last_state 360 | 361 | def attn_decoder_input_fn(inputs, attention): 362 | if not self.attn_input_feeding: 363 | return inputs 364 | 365 | # Essential when use_residual=True 366 | _input_layer = Dense(self.hidden_units, dtype=self.dtype, 367 | name='attn_input_feeding') 368 | return _input_layer(array_ops.concat([inputs, attention], -1)) 369 | 370 | # AttentionWrapper wraps RNNCell with the attention_mechanism 371 | # Note: We implement Attention mechanism only on the top decoder layer 372 | self.decoder_cell_list[-1] = attention_wrapper.AttentionWrapper( 373 | cell=self.decoder_cell_list[-1], 374 | attention_mechanism=self.attention_mechanism, 375 | attention_layer_size=self.hidden_units, 376 | cell_input_fn=attn_decoder_input_fn, 377 | initial_cell_state=encoder_last_state[-1], 378 | alignment_history=False, 379 | name='Attention_Wrapper') 380 | 381 | # To be compatible with AttentionWrapper, the encoder last state 382 | # of the top layer should be converted into the AttentionWrapperState form 383 | # We can easily do this by calling AttentionWrapper.zero_state 384 | 385 | # Also if beamsearch decoding is used, the batch_size argument in .zero_state 386 | # should be ${decoder_beam_width} times to the origianl batch_size 387 | batch_size = self.batch_size if not self.use_beamsearch_decode \ 388 
| else self.batch_size * self.beam_width 389 | initial_state = [state for state in encoder_last_state] 390 | 391 | initial_state[-1] = self.decoder_cell_list[-1].zero_state( 392 | batch_size=batch_size, dtype=self.dtype) 393 | decoder_initial_state = tuple(initial_state) 394 | 395 | return MultiRNNCell(self.decoder_cell_list), decoder_initial_state 396 | 397 | 398 | def init_optimizer(self): 399 | print("setting optimizer..") 400 | # Gradients and SGD update operation for training the model 401 | trainable_params = tf.trainable_variables() 402 | if self.optimizer.lower() == 'adadelta': 403 | self.opt = tf.train.AdadeltaOptimizer(learning_rate=self.learning_rate) 404 | elif self.optimizer.lower() == 'adam': 405 | self.opt = tf.train.AdamOptimizer(learning_rate=self.learning_rate) 406 | elif self.optimizer.lower() == 'rmsprop': 407 | self.opt = tf.train.RMSPropOptimizer(learning_rate=self.learning_rate) 408 | else: 409 | self.opt = tf.train.GradientDescentOptimizer(learning_rate=self.learning_rate) 410 | 411 | # Compute gradients of loss w.r.t. all trainable variables 412 | gradients = tf.gradients(self.loss, trainable_params) 413 | 414 | # Clip gradients by the given max_gradient_norm 415 | clip_gradients, _ = tf.clip_by_global_norm(gradients, self.max_gradient_norm) 416 | 417 | # Update the model 418 | self.updates = self.opt.apply_gradients( 419 | zip(clip_gradients, trainable_params), global_step=self.global_step) 420 | 421 | def save(self, sess, path, var_list=None, global_step=None): 422 | # var_list = None means the Saver handles all saveable variables 423 | saver = tf.train.Saver(var_list) 424 | 425 | # temporary code 426 | #del tf.get_collection_ref('LAYER_NAME_UIDS')[0] 427 | save_path = saver.save(sess, save_path=path, global_step=global_step) 428 | print('model saved at %s' % save_path) 429 | 430 | 431 | def restore(self, sess, path, var_list=None): 432 | # var_list = None means the Saver handles all saveable variables 433 | saver = tf.train.Saver(var_list) 434 | saver.restore(sess, save_path=path) 435 | print('model restored from %s' % path) 436 | 437 | 438 | def train(self, sess, encoder_inputs, encoder_inputs_length, 439 | decoder_inputs, decoder_inputs_length): 440 | """Run a train step of the model feeding the given inputs. 441 | 442 | Args: 443 | sess: tensorflow session to use. 444 | encoder_inputs: a numpy int matrix of [batch_size, max_source_time_steps] 445 | to feed as encoder inputs 446 | encoder_inputs_length: a numpy int vector of [batch_size] 447 | to feed as sequence lengths for each element in the given batch 448 | decoder_inputs: a numpy int matrix of [batch_size, max_target_time_steps] 449 | to feed as decoder inputs 450 | decoder_inputs_length: a numpy int vector of [batch_size] 451 | to feed as sequence lengths for each element in the given batch 452 | 453 | Returns: 454 | A pair consisting of the average cross-entropy loss for the batch 455 | and the training summary for the event log.
456 | """ 457 | # Check if the model is 'training' mode 458 | if self.mode.lower() != 'train': 459 | raise ValueError("train step can only be operated in train mode") 460 | 461 | input_feed = self.check_feeds(encoder_inputs, encoder_inputs_length, 462 | decoder_inputs, decoder_inputs_length, False) 463 | # Input feeds for dropout 464 | input_feed[self.keep_prob_placeholder.name] = self.keep_prob 465 | 466 | output_feed = [self.updates, # Update Op that does optimization 467 | self.loss, # Loss for current batch 468 | self.summary_op] # Training summary 469 | 470 | outputs = sess.run(output_feed, input_feed) 471 | return outputs[1], outputs[2] # loss, summary 472 | 473 | 474 | def eval(self, sess, encoder_inputs, encoder_inputs_length, 475 | decoder_inputs, decoder_inputs_length): 476 | """Run an evaluation step of the model feeding the given inputs. 477 | 478 | Args: 479 | sess: tensorflow session to use. 480 | encoder_inputs: a numpy int matrix of [batch_size, max_source_time_steps] 481 | to feed as encoder inputs 482 | encoder_inputs_length: a numpy int vector of [batch_size] 483 | to feed as sequence lengths for each element in the given batch 484 | decoder_inputs: a numpy int matrix of [batch_size, max_target_time_steps] 485 | to feed as decoder inputs 486 | decoder_inputs_length: a numpy int vector of [batch_size] 487 | to feed as sequence lengths for each element in the given batch 488 | 489 | Returns: 490 | A pair consisting of the average cross-entropy loss for the batch 491 | and the evaluation summary. 492 | """ 493 | 494 | input_feed = self.check_feeds(encoder_inputs, encoder_inputs_length, 495 | decoder_inputs, decoder_inputs_length, False) 496 | # Input feeds for dropout 497 | input_feed[self.keep_prob_placeholder.name] = 1.0 498 | 499 | output_feed = [self.loss, # Loss for current batch 500 | self.summary_op] # Training summary 501 | outputs = sess.run(output_feed, input_feed) 502 | return outputs[0], outputs[1] # loss, summary 503 | 504 | 505 | def predict(self, sess, encoder_inputs, encoder_inputs_length): 506 | 507 | input_feed = self.check_feeds(encoder_inputs, encoder_inputs_length, 508 | decoder_inputs=None, decoder_inputs_length=None, 509 | decode=True) 510 | 511 | # Input feeds for dropout 512 | input_feed[self.keep_prob_placeholder.name] = 1.0 513 | 514 | output_feed = [self.decoder_pred_decode] 515 | outputs = sess.run(output_feed, input_feed) 516 | 517 | # GreedyDecoder: [batch_size, max_time_step] 518 | return outputs[0] # BeamSearchDecoder: [batch_size, max_time_step, beam_width] 519 | 520 | 521 | def check_feeds(self, encoder_inputs, encoder_inputs_length, 522 | decoder_inputs, decoder_inputs_length, decode): 523 | """ 524 | Args: 525 | encoder_inputs: a numpy int matrix of [batch_size, max_source_time_steps] 526 | to feed as encoder inputs 527 | encoder_inputs_length: a numpy int vector of [batch_size] 528 | to feed as sequence lengths for each element in the given batch 529 | decoder_inputs: a numpy int matrix of [batch_size, max_target_time_steps] 530 | to feed as decoder inputs 531 | decoder_inputs_length: a numpy int vector of [batch_size] 532 | to feed as sequence lengths for each element in the given batch 533 | decode: a scalar boolean that indicates decode mode 534 | Returns: 535 | A feed for the model that consists of encoder_inputs, encoder_inputs_length, 536 | decoder_inputs, decoder_inputs_length 537 | """ 538 | 539 | input_batch_size = encoder_inputs.shape[0] 540 | if input_batch_size != encoder_inputs_length.shape[0]: 541 | raise
ValueError("Encoder inputs and their lengths must be equal in their " 542 | "batch_size, %d != %d" % (input_batch_size, encoder_inputs_length.shape[0])) 543 | 544 | if not decode: 545 | target_batch_size = decoder_inputs.shape[0] 546 | if target_batch_size != input_batch_size: 547 | raise ValueError("Encoder inputs and Decoder inputs must be equal in their " 548 | "batch_size, %d != %d" % (input_batch_size, target_batch_size)) 549 | if target_batch_size != decoder_inputs_length.shape[0]: 550 | raise ValueError("Decoder targets and their lengths must be equal in their " 551 | "batch_size, %d != %d" % (target_batch_size, decoder_inputs_length.shape[0])) 552 | 553 | input_feed = {} 554 | 555 | input_feed[self.encoder_inputs.name] = encoder_inputs 556 | input_feed[self.encoder_inputs_length.name] = encoder_inputs_length 557 | 558 | if not decode: 559 | input_feed[self.decoder_inputs.name] = decoder_inputs 560 | input_feed[self.decoder_inputs_length.name] = decoder_inputs_length 561 | 562 | return input_feed 563 | 564 | -------------------------------------------------------------------------------- /train.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "#!/usr/bin/env python\n", 12 | "# coding: utf-8\n", 13 | "\n", 14 | "import os\n", 15 | "import math\n", 16 | "import time\n", 17 | "import json\n", 18 | "import random\n", 19 | "\n", 20 | "from collections import OrderedDict\n", 21 | "\n", 22 | "import numpy as np\n", 23 | "import tensorflow as tf\n", 24 | "\n", 25 | "from data.data_iterator import TextIterator\n", 26 | "from data.data_iterator import BiTextIterator\n", 27 | "\n", 28 | "import data.data_utils as data_utils\n", 29 | "from data.data_utils import prepare_batch\n", 30 | "from data.data_utils import prepare_train_batch\n", 31 | "\n", 32 | "from seq2seq_model import Seq2SeqModel" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": { 39 | "collapsed": true 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "# Data loading parameters\n", 44 | "tf.app.flags.DEFINE_string('source_vocabulary', 'data/europarl-v7.1.4M.de.json', 'Path to source vocabulary')\n", 45 | "tf.app.flags.DEFINE_string('target_vocabulary', 'data/europarl-v7.1.4M.fr.json', 'Path to target vocabulary')\n", 46 | "tf.app.flags.DEFINE_string('source_train_data', 'data/europarl-v7.1.4M.de', 'Path to source training data')\n", 47 | "tf.app.flags.DEFINE_string('target_train_data', 'data/europarl-v7.1.4M.fr', 'Path to target training data')\n", 48 | "tf.app.flags.DEFINE_string('source_valid_data', 'data/newstest2012.bpe.de', 'Path to source validation data')\n", 49 | "tf.app.flags.DEFINE_string('target_valid_data', 'data/newstest2012.bpe.fr', 'Path to target validation data')\n", 50 | "\n", 51 | "# Network parameters\n", 52 | "tf.app.flags.DEFINE_string('cell_type', 'lstm', 'RNN cell for encoder and decoder, default: lstm')\n", 53 | "tf.app.flags.DEFINE_string('attention_type', 'bahdanau', 'Attention mechanism: (bahdanau, luong), default: bahdanau')\n", 54 | "tf.app.flags.DEFINE_integer('hidden_units', 1024, 'Number of hidden units in each layer')\n", 55 | "tf.app.flags.DEFINE_integer('depth', 2, 'Number of layers in each encoder and decoder')\n", 56 | "tf.app.flags.DEFINE_integer('embedding_size', 500, 'Embedding dimensions of encoder and decoder inputs')\n", 57 | 
"tf.app.flags.DEFINE_integer('num_encoder_symbols', 30000, 'Source vocabulary size')\n", 58 | "tf.app.flags.DEFINE_integer('num_decoder_symbols', 30000, 'Target vocabulary size')\n", 59 | "\n", 60 | "tf.app.flags.DEFINE_boolean('use_residual', True, 'Use residual connection between layers')\n", 61 | "tf.app.flags.DEFINE_boolean('attn_input_feeding', False, 'Use input feeding method in attentional decoder')\n", 62 | "tf.app.flags.DEFINE_boolean('use_dropout', True, 'Use dropout in each rnn cell')\n", 63 | "tf.app.flags.DEFINE_float('dropout_rate', 0.3, 'Dropout probability for input/output/state units (0.0: no dropout)')\n", 64 | "\n", 65 | "# Training parameters\n", 66 | "tf.app.flags.DEFINE_float('learning_rate', 0.0002, 'Learning rate')\n", 67 | "tf.app.flags.DEFINE_float('max_gradient_norm', 1.0, 'Clip gradients to this norm')\n", 68 | "tf.app.flags.DEFINE_integer('batch_size', 128, 'Batch size')\n", 69 | "tf.app.flags.DEFINE_integer('max_epochs', 10, 'Maximum # of training epochs')\n", 70 | "tf.app.flags.DEFINE_integer('max_load_batches', 20, 'Maximum # of batches to load at one time')\n", 71 | "tf.app.flags.DEFINE_integer('max_seq_length', 50, 'Maximum sequence length')\n", 72 | "tf.app.flags.DEFINE_integer('display_freq', 100, 'Display training status every this iteration')\n", 73 | "tf.app.flags.DEFINE_integer('save_freq', 11500, 'Save model checkpoint every this iteration')\n", 74 | "tf.app.flags.DEFINE_integer('valid_freq', 1150000, 'Evaluate model every this iteration: valid_data needed')\n", 75 | "tf.app.flags.DEFINE_string('optimizer', 'adam', 'Optimizer for training: (adadelta, adam, rmsprop)')\n", 76 | "tf.app.flags.DEFINE_string('model_dir', 'model/', 'Path to save model checkpoints')\n", 77 | "tf.app.flags.DEFINE_string('summary_dir', 'model/summary', 'Path to save model summary')\n", 78 | "tf.app.flags.DEFINE_string('model_name', 'translate.ckpt', 'File name used for model checkpoints')\n", 79 | "tf.app.flags.DEFINE_boolean('shuffle_each_epoch', True, 'Shuffle training dataset for each epoch')\n", 80 | "tf.app.flags.DEFINE_boolean('sort_by_length', True, 'Sort pre-fetched minibatches by their target sequence lengths')\n", 81 | "tf.app.flags.DEFINE_boolean('use_fp16', False, 'Use half precision float16 instead of float32 as dtype')\n", 82 | "\n", 83 | "# Runtime parameters\n", 84 | "tf.app.flags.DEFINE_boolean('allow_soft_placement', True, 'Allow device soft placement')\n", 85 | "tf.app.flags.DEFINE_boolean('log_device_placement', False, 'Log placement of ops on devices')\n", 86 | "\n", 87 | "FLAGS = tf.app.flags.FLAGS" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "def create_model(session, FLAGS):\n", 99 | "\n", 100 | " config = OrderedDict(sorted(FLAGS.__flags.items()))\n", 101 | " model = Seq2SeqModel(config, 'train')\n", 102 | "\n", 103 | " ckpt = tf.train.get_checkpoint_state(FLAGS.model_dir)\n", 104 | " if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):\n", 105 | " print 'Reloading model parameters..'\n", 106 | " model.restore(session, ckpt.model_checkpoint_path)\n", 107 | " \n", 108 | " else:\n", 109 | " if not os.path.exists(FLAGS.model_dir):\n", 110 | " os.makedirs(FLAGS.model_dir)\n", 111 | " print 'Created new model parameters..'\n", 112 | " session.run(tf.global_variables_initializer())\n", 113 | " \n", 114 | " return model" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | 
"metadata": { 121 | "collapsed": true 122 | }, 123 | "outputs": [], 124 | "source": [ 125 | "def train():\n", 126 | " # Load parallel data to train\n", 127 | " print 'Loading training data..'\n", 128 | " train_set = BiTextIterator(source=FLAGS.source_train_data,\n", 129 | " target=FLAGS.target_train_data,\n", 130 | " source_dict=FLAGS.source_vocabulary,\n", 131 | " target_dict=FLAGS.target_vocabulary,\n", 132 | " batch_size=FLAGS.batch_size,\n", 133 | " maxlen=FLAGS.max_seq_length,\n", 134 | " n_words_source=FLAGS.num_encoder_symbols,\n", 135 | " n_words_target=FLAGS.num_decoder_symbols,\n", 136 | " shuffle_each_epoch=FLAGS.shuffle_each_epoch,\n", 137 | " sort_by_length=FLAGS.sort_by_length,\n", 138 | " maxibatch_size=FLAGS.max_load_batches)\n", 139 | "\n", 140 | " if FLAGS.source_valid_data and FLAGS.target_valid_data:\n", 141 | " print 'Loading validation data..'\n", 142 | " valid_set = BiTextIterator(source=FLAGS.source_valid_data,\n", 143 | " target=FLAGS.target_valid_data,\n", 144 | " source_dict=FLAGS.source_vocabulary,\n", 145 | " target_dict=FLAGS.target_vocabulary,\n", 146 | " batch_size=FLAGS.batch_size,\n", 147 | " maxlen=None,\n", 148 | " n_words_source=FLAGS.num_encoder_symbols,\n", 149 | " n_words_target=FLAGS.num_decoder_symbols)\n", 150 | " else:\n", 151 | " valid_set = None\n", 152 | "\n", 153 | " # Initiate TF session\n", 154 | " with tf.Session(config=tf.ConfigProto(allow_soft_placement=FLAGS.allow_soft_placement, \n", 155 | " log_device_placement=FLAGS.log_device_placement, gpu_options=tf.GPUOptions(allow_growth=True))) as sess:\n", 156 | "\n", 157 | " # Create a log writer object\n", 158 | " log_writer = tf.summary.FileWriter(FLAGS.model_dir, graph=sess.graph)\n", 159 | " \n", 160 | " # Create a new model or reload existing checkpoint\n", 161 | " model = create_model(sess, FLAGS)\n", 162 | "\n", 163 | " step_time, loss = 0.0, 0.0\n", 164 | " words_seen, sents_seen = 0, 0\n", 165 | " start_time = time.time()\n", 166 | "\n", 167 | " # Training loop\n", 168 | " print 'Training..'\n", 169 | " for epoch_idx in xrange(FLAGS.max_epochs):\n", 170 | " if model.global_epoch_step.eval() >= FLAGS.max_epochs:\n", 171 | " print 'Training is already complete.', \\\n", 172 | " 'current epoch:{}, max epoch:{}'.format(model.global_epoch_step.eval(), FLAGS.max_epochs)\n", 173 | " break\n", 174 | "\n", 175 | " for source_seq, target_seq in train_set: \n", 176 | " # Get a batch from training parallel data\n", 177 | " source, source_len, target, target_len = prepare_train_batch(source_seq, target_seq,\n", 178 | " FLAGS.max_seq_length)\n", 179 | " if source is None or target is None:\n", 180 | " print 'No samples under max_seq_length ', FLAGS.max_seq_length\n", 181 | " continue\n", 182 | "\n", 183 | " # Execute a single training step\n", 184 | " step_loss, summary = model.train(sess, encoder_inputs=source, encoder_inputs_length=source_len, \n", 185 | " decoder_inputs=target, decoder_inputs_length=target_len)\n", 186 | "\n", 187 | " loss += float(step_loss) / FLAGS.display_freq\n", 188 | " words_seen += float(np.sum(source_len+target_len))\n", 189 | " sents_seen += float(source.shape[0]) # batch_size\n", 190 | "\n", 191 | " if model.global_step.eval() % FLAGS.display_freq == 0:\n", 192 | "\n", 193 | " avg_perplexity = math.exp(float(loss)) if loss < 300 else float(\"inf\")\n", 194 | "\n", 195 | " time_elapsed = time.time() - start_time\n", 196 | " step_time = time_elapsed / FLAGS.display_freq\n", 197 | "\n", 198 | " words_per_sec = words_seen / time_elapsed\n", 199 | " sents_per_sec = 
sents_seen / time_elapsed\n", 200 | "\n", 201 | " print 'Epoch ', model.global_epoch_step.eval(), 'Step ', model.global_step.eval(), \\\n", 202 | " 'Perplexity {0:.2f}'.format(avg_perplexity), 'Step-time ', step_time, \\\n", 203 | " '{0:.2f} sents/s'.format(sents_per_sec), '{0:.2f} words/s'.format(words_per_sec)\n", 204 | "\n", 205 | " loss = 0\n", 206 | " words_seen = 0\n", 207 | " sents_seen = 0\n", 208 | " start_time = time.time()\n", 209 | " \n", 210 | " # Record training summary for the current batch\n", 211 | " log_writer.add_summary(summary, model.global_step.eval())\n", 212 | "\n", 213 | " # Execute a validation step\n", 214 | " if valid_set and model.global_step.eval() % FLAGS.valid_freq == 0:\n", 215 | " print 'Validation step'\n", 216 | " valid_loss = 0.0\n", 217 | " valid_sents_seen = 0\n", 218 | " for source_seq, target_seq in valid_set:\n", 219 | " # Get a batch from validation parallel data\n", 220 | " source, source_len, target, target_len = prepare_train_batch(source_seq, target_seq)\n", 221 | "\n", 222 | " # Compute validation loss: average per word cross entropy loss\n", 223 | " step_loss = model.eval(sess, encoder_inputs=source, encoder_inputs_length=source_len,\n", 224 | " decoder_inputs=target, decoder_inputs_length=target_len)\n", 225 | " batch_size = source.shape[0]\n", 226 | "\n", 227 | " valid_loss += step_loss * batch_size\n", 228 | " valid_sents_seen += batch_size\n", 229 | " print ' {} samples seen'.format(valid_sents_seen)\n", 230 | "\n", 231 | " valid_loss = valid_loss / valid_sents_seen\n", 232 | " print 'Valid perplexity: {0:.2f}'.format(math.exp(valid_loss))\n", 233 | "\n", 234 | " # Save the model checkpoint\n", 235 | " if model.global_step.eval() % FLAGS.save_freq == 0:\n", 236 | " print 'Saving the model..'\n", 237 | " checkpoint_path = os.path.join(FLAGS.model_dir, FLAGS.model_name)\n", 238 | " model.save(sess, checkpoint_path, global_step=model.global_step)\n", 239 | " json.dump(model.config,\n", 240 | " open('%s-%d.json' % (checkpoint_path, model.global_step.eval()), 'wb'),\n", 241 | " indent=2)\n", 242 | "\n", 243 | " # Increase the epoch index of the model\n", 244 | " model.global_epoch_step_op.eval()\n", 245 | " print 'Epoch {0:} DONE'.format(model.global_epoch_step.eval())\n", 246 | " \n", 247 | " print 'Saving the last model..'\n", 248 | " checkpoint_path = os.path.join(FLAGS.model_dir, FLAGS.model_name)\n", 249 | " model.save(sess, checkpoint_path, global_step=model.global_step)\n", 250 | " json.dump(model.config,\n", 251 | " open('%s-%d.json' % (checkpoint_path, model.global_step.eval()), 'wb'),\n", 252 | " indent=2)\n", 253 | " \n", 254 | " print 'Training Terminated'" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": { 261 | "collapsed": true 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "def main(_):\n", 266 | " train()\n", 267 | "\n", 268 | "\n", 269 | "if __name__ == '__main__':\n", 270 | " tf.app.run()" 271 | ] 272 | } 273 | ], 274 | "metadata": { 275 | "kernelspec": { 276 | "display_name": "Python 2", 277 | "language": "python", 278 | "name": "python2" 279 | }, 280 | "language_info": { 281 | "codemirror_mode": { 282 | "name": "ipython", 283 | "version": 2 284 | }, 285 | "file_extension": ".py", 286 | "mimetype": "text/x-python", 287 | "name": "python", 288 | "nbconvert_exporter": "python", 289 | "pygments_lexer": "ipython2", 290 | "version": "2.7.10" 291 | } 292 | }, 293 | "nbformat": 4, 294 | "nbformat_minor": 0 295 | } 296 | 
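The validation block in the notebook above averages the per-word cross-entropy loss over the whole validation set and reports exp(loss) as perplexity. Since `model.eval` in `seq2seq_model.py` returns a (loss, summary) pair, the loss should be unpacked before it is weighted by the batch size (as `train.py` below does). A minimal sketch of that aggregation; the helper name `corpus_perplexity` is illustrative and not part of the repository:

```python
import math


def corpus_perplexity(batch_losses, batch_sizes):
    """Aggregate per-batch average cross-entropy losses into a corpus perplexity.

    batch_losses: average per-word loss of each validation batch, i.e. the first
                  element of the (loss, summary) pair returned by model.eval
    batch_sizes:  number of sentences in each batch, used as weights
    """
    total_loss = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    total_sents = sum(batch_sizes)
    return math.exp(total_loss / total_sents)


# Example: two batches of 128 and 96 sentences with average losses 4.1 and 3.9
# give a weighted mean loss of about 4.01 and a perplexity of about 55.
print(corpus_perplexity([4.1, 3.9], [128, 96]))
```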
-------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | 2 | #!/usr/bin/env python 3 | # coding: utf-8 4 | 5 | import os 6 | import math 7 | import time 8 | import json 9 | import random 10 | 11 | from collections import OrderedDict 12 | 13 | import numpy as np 14 | import tensorflow as tf 15 | 16 | from data.data_iterator import TextIterator 17 | from data.data_iterator import BiTextIterator 18 | 19 | import data.data_utils as data_utils 20 | from data.data_utils import prepare_batch 21 | from data.data_utils import prepare_train_batch 22 | 23 | from seq2seq_model import Seq2SeqModel 24 | 25 | 26 | # Data loading parameters 27 | tf.app.flags.DEFINE_string('source_vocabulary', 'data/europarl-v7.1.4M.de.json', 'Path to source vocabulary') 28 | tf.app.flags.DEFINE_string('target_vocabulary', 'data/europarl-v7.1.4M.fr.json', 'Path to target vocabulary') 29 | tf.app.flags.DEFINE_string('source_train_data', 'data/europarl-v7.1.4M.de', 'Path to source training data') 30 | tf.app.flags.DEFINE_string('target_train_data', 'data/europarl-v7.1.4M.fr', 'Path to target training data') 31 | tf.app.flags.DEFINE_string('source_valid_data', 'data/newstest2012.bpe.de', 'Path to source validation data') 32 | tf.app.flags.DEFINE_string('target_valid_data', 'data/newstest2012.bpe.fr', 'Path to target validation data') 33 | 34 | # Network parameters 35 | tf.app.flags.DEFINE_string('cell_type', 'lstm', 'RNN cell for encoder and decoder, default: lstm') 36 | tf.app.flags.DEFINE_string('attention_type', 'bahdanau', 'Attention mechanism: (bahdanau, luong), default: bahdanau') 37 | tf.app.flags.DEFINE_integer('hidden_units', 1024, 'Number of hidden units in each layer') 38 | tf.app.flags.DEFINE_integer('depth', 2, 'Number of layers in each encoder and decoder') 39 | tf.app.flags.DEFINE_integer('embedding_size', 500, 'Embedding dimensions of encoder and decoder inputs') 40 | tf.app.flags.DEFINE_integer('num_encoder_symbols', 30000, 'Source vocabulary size') 41 | tf.app.flags.DEFINE_integer('num_decoder_symbols', 30000, 'Target vocabulary size') 42 | 43 | tf.app.flags.DEFINE_boolean('use_residual', True, 'Use residual connection between layers') 44 | tf.app.flags.DEFINE_boolean('attn_input_feeding', False, 'Use input feeding method in attentional decoder') 45 | tf.app.flags.DEFINE_boolean('use_dropout', True, 'Use dropout in each rnn cell') 46 | tf.app.flags.DEFINE_float('dropout_rate', 0.3, 'Dropout probability for input/output/state units (0.0: no dropout)') 47 | 48 | # Training parameters 49 | tf.app.flags.DEFINE_float('learning_rate', 0.0002, 'Learning rate') 50 | tf.app.flags.DEFINE_float('max_gradient_norm', 1.0, 'Clip gradients to this norm') 51 | tf.app.flags.DEFINE_integer('batch_size', 128, 'Batch size') 52 | tf.app.flags.DEFINE_integer('max_epochs', 10, 'Maximum # of training epochs') 53 | tf.app.flags.DEFINE_integer('max_load_batches', 20, 'Maximum # of batches to load at one time') 54 | tf.app.flags.DEFINE_integer('max_seq_length', 50, 'Maximum sequence length') 55 | tf.app.flags.DEFINE_integer('display_freq', 100, 'Display training status every this iteration') 56 | tf.app.flags.DEFINE_integer('save_freq', 11500, 'Save model checkpoint every this iteration') 57 | tf.app.flags.DEFINE_integer('valid_freq', 1150000, 'Evaluate model every this iteration: valid_data needed') 58 | tf.app.flags.DEFINE_string('optimizer', 'adam', 'Optimizer for training: (adadelta, adam, rmsprop)') 59 | 
tf.app.flags.DEFINE_string('model_dir', 'model/', 'Path to save model checkpoints') 60 | tf.app.flags.DEFINE_string('model_name', 'translate.ckpt', 'File name used for model checkpoints') 61 | tf.app.flags.DEFINE_boolean('shuffle_each_epoch', True, 'Shuffle training dataset for each epoch') 62 | tf.app.flags.DEFINE_boolean('sort_by_length', True, 'Sort pre-fetched minibatches by their target sequence lengths') 63 | tf.app.flags.DEFINE_boolean('use_fp16', False, 'Use half precision float16 instead of float32 as dtype') 64 | 65 | # Runtime parameters 66 | tf.app.flags.DEFINE_boolean('allow_soft_placement', True, 'Allow device soft placement') 67 | tf.app.flags.DEFINE_boolean('log_device_placement', False, 'Log placement of ops on devices') 68 | 69 | FLAGS = tf.app.flags.FLAGS 70 | 71 | def create_model(session, FLAGS): 72 | 73 | config = OrderedDict(sorted(FLAGS.__flags.items())) 74 | model = Seq2SeqModel(config, 'train') 75 | 76 | ckpt = tf.train.get_checkpoint_state(FLAGS.model_dir) 77 | if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path): 78 | print 'Reloading model parameters..' 79 | model.restore(session, ckpt.model_checkpoint_path) 80 | 81 | else: 82 | if not os.path.exists(FLAGS.model_dir): 83 | os.makedirs(FLAGS.model_dir) 84 | print 'Created new model parameters..' 85 | session.run(tf.global_variables_initializer()) 86 | 87 | return model 88 | 89 | def train(): 90 | # Load parallel data to train 91 | print 'Loading training data..' 92 | train_set = BiTextIterator(source=FLAGS.source_train_data, 93 | target=FLAGS.target_train_data, 94 | source_dict=FLAGS.source_vocabulary, 95 | target_dict=FLAGS.target_vocabulary, 96 | batch_size=FLAGS.batch_size, 97 | maxlen=FLAGS.max_seq_length, 98 | n_words_source=FLAGS.num_encoder_symbols, 99 | n_words_target=FLAGS.num_decoder_symbols, 100 | shuffle_each_epoch=FLAGS.shuffle_each_epoch, 101 | sort_by_length=FLAGS.sort_by_length, 102 | maxibatch_size=FLAGS.max_load_batches) 103 | 104 | if FLAGS.source_valid_data and FLAGS.target_valid_data: 105 | print 'Loading validation data..' 106 | valid_set = BiTextIterator(source=FLAGS.source_valid_data, 107 | target=FLAGS.target_valid_data, 108 | source_dict=FLAGS.source_vocabulary, 109 | target_dict=FLAGS.target_vocabulary, 110 | batch_size=FLAGS.batch_size, 111 | maxlen=None, 112 | n_words_source=FLAGS.num_encoder_symbols, 113 | n_words_target=FLAGS.num_decoder_symbols) 114 | else: 115 | valid_set = None 116 | 117 | # Initiate TF session 118 | with tf.Session(config=tf.ConfigProto(allow_soft_placement=FLAGS.allow_soft_placement, 119 | log_device_placement=FLAGS.log_device_placement, gpu_options=tf.GPUOptions(allow_growth=True))) as sess: 120 | 121 | # Create a new model or reload existing checkpoint 122 | model = create_model(sess, FLAGS) 123 | 124 | # Create a log writer object 125 | log_writer = tf.summary.FileWriter(FLAGS.model_dir, graph=sess.graph) 126 | 127 | 128 | 129 | step_time, loss = 0.0, 0.0 130 | words_seen, sents_seen = 0, 0 131 | start_time = time.time() 132 | 133 | # Training loop 134 | print 'Training..' 
135 | for epoch_idx in xrange(FLAGS.max_epochs): 136 | if model.global_epoch_step.eval() >= FLAGS.max_epochs: 137 | print 'Training is already complete.', \ 138 | 'current epoch:{}, max epoch:{}'.format(model.global_epoch_step.eval(), FLAGS.max_epochs) 139 | break 140 | 141 | for source_seq, target_seq in train_set: 142 | # Get a batch from training parallel data 143 | source, source_len, target, target_len = prepare_train_batch(source_seq, target_seq, 144 | FLAGS.max_seq_length) 145 | if source is None or target is None: 146 | print 'No samples under max_seq_length ', FLAGS.max_seq_length 147 | continue 148 | 149 | # Execute a single training step 150 | step_loss, summary = model.train(sess, encoder_inputs=source, encoder_inputs_length=source_len, 151 | decoder_inputs=target, decoder_inputs_length=target_len) 152 | 153 | loss += float(step_loss) / FLAGS.display_freq 154 | words_seen += float(np.sum(source_len+target_len)) 155 | sents_seen += float(source.shape[0]) # batch_size 156 | 157 | if model.global_step.eval() % FLAGS.display_freq == 0: 158 | 159 | avg_perplexity = math.exp(float(loss)) if loss < 300 else float("inf") 160 | 161 | time_elapsed = time.time() - start_time 162 | step_time = time_elapsed / FLAGS.display_freq 163 | 164 | words_per_sec = words_seen / time_elapsed 165 | sents_per_sec = sents_seen / time_elapsed 166 | 167 | print 'Epoch ', model.global_epoch_step.eval(), 'Step ', model.global_step.eval(), \ 168 | 'Perplexity {0:.2f}'.format(avg_perplexity), 'Step-time ', step_time, \ 169 | '{0:.2f} sents/s'.format(sents_per_sec), '{0:.2f} words/s'.format(words_per_sec) 170 | 171 | loss = 0 172 | words_seen = 0 173 | sents_seen = 0 174 | start_time = time.time() 175 | 176 | # Record training summary for the current batch 177 | log_writer.add_summary(summary, model.global_step.eval()) 178 | 179 | # Execute a validation step 180 | if valid_set and model.global_step.eval() % FLAGS.valid_freq == 0: 181 | print 'Validation step' 182 | valid_loss = 0.0 183 | valid_sents_seen = 0 184 | for source_seq, target_seq in valid_set: 185 | # Get a batch from validation parallel data 186 | source, source_len, target, target_len = prepare_train_batch(source_seq, target_seq) 187 | 188 | # Compute validation loss: average per word cross entropy loss 189 | step_loss, summary = model.eval(sess, encoder_inputs=source, encoder_inputs_length=source_len, 190 | decoder_inputs=target, decoder_inputs_length=target_len) 191 | batch_size = source.shape[0] 192 | 193 | valid_loss += step_loss * batch_size 194 | valid_sents_seen += batch_size 195 | print ' {} samples seen'.format(valid_sents_seen) 196 | 197 | valid_loss = valid_loss / valid_sents_seen 198 | print 'Valid perplexity: {0:.2f}'.format(math.exp(valid_loss)) 199 | 200 | # Save the model checkpoint 201 | if model.global_step.eval() % FLAGS.save_freq == 0: 202 | print 'Saving the model..' 203 | checkpoint_path = os.path.join(FLAGS.model_dir, FLAGS.model_name) 204 | model.save(sess, checkpoint_path, global_step=model.global_step) 205 | json.dump(model.config, 206 | open('%s-%d.json' % (checkpoint_path, model.global_step.eval()), 'wb'), 207 | indent=2) 208 | 209 | # Increase the epoch index of the model 210 | model.global_epoch_step_op.eval() 211 | print 'Epoch {0:} DONE'.format(model.global_epoch_step.eval()) 212 | 213 | print 'Saving the last model..' 
214 | checkpoint_path = os.path.join(FLAGS.model_dir, FLAGS.model_name) 215 | model.save(sess, checkpoint_path, global_step=model.global_step) 216 | json.dump(model.config, 217 | open('%s-%d.json' % (checkpoint_path, model.global_step.eval()), 'wb'), 218 | indent=2) 219 | 220 | print 'Training Terminated' 221 | 222 | 223 | 224 | def main(_): 225 | train() 226 | 227 | 228 | if __name__ == '__main__': 229 | tf.app.run() 230 | --------------------------------------------------------------------------------
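After training, `Seq2SeqModel.predict` (defined in `seq2seq_model.py` above) returns token ids of shape [batch_size, max_time_step] for greedy decoding and [batch_size, max_time_step, beam_width] when beam search is enabled. The sketch below shows one way to keep a single hypothesis per source sentence; it assumes the beams are ordered best-first, which is not verified here, and `best_hypotheses` is an illustrative helper rather than part of the repository:

```python
import numpy as np


def best_hypotheses(predicted_ids, use_beamsearch):
    """Select one output sequence per source sentence from Seq2SeqModel.predict().

    predicted_ids: array of token ids,
                   [batch_size, max_time_step] for greedy decoding or
                   [batch_size, max_time_step, beam_width] for beam search
    use_beamsearch: True if the model was built with a beam width greater than 1
    """
    if use_beamsearch:
        # Keep only the first beam of every sentence
        # (assumed to be the highest-scoring hypothesis).
        return predicted_ids[:, :, 0]
    return predicted_ids


# Shape check on dummy beam-search output with batch_size=2, 4 time steps, 3 beams.
demo = np.arange(24).reshape(2, 4, 3)
print(best_hypotheses(demo, use_beamsearch=True).shape)  # (2, 4)

# Usage sketch with a trained model (source, source_len prepared as in train.py):
#   outputs = model.predict(sess, encoder_inputs=source, encoder_inputs_length=source_len)
#   ids = best_hypotheses(np.asarray(outputs), use_beamsearch=beam_width > 1)
```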