├── .gitignore
├── README.md
├── data
│   ├── __init__.py
│   ├── build_dictionary.py
│   ├── clean-corpus-n.perl
│   ├── data_iterator.py
│   ├── data_statistics.py
│   ├── data_utils.py
│   ├── merge.sh
│   ├── multi-bleu.perl
│   ├── nonbreaking_prefixes
│   │   ├── README.txt
│   │   ├── nonbreaking_prefix.ca
│   │   ├── nonbreaking_prefix.cs
│   │   ├── nonbreaking_prefix.de
│   │   ├── nonbreaking_prefix.el
│   │   ├── nonbreaking_prefix.en
│   │   ├── nonbreaking_prefix.es
│   │   ├── nonbreaking_prefix.fi
│   │   ├── nonbreaking_prefix.fr
│   │   ├── nonbreaking_prefix.hu
│   │   ├── nonbreaking_prefix.is
│   │   ├── nonbreaking_prefix.it
│   │   ├── nonbreaking_prefix.lv
│   │   ├── nonbreaking_prefix.nl
│   │   ├── nonbreaking_prefix.pl
│   │   ├── nonbreaking_prefix.pt
│   │   ├── nonbreaking_prefix.ro
│   │   ├── nonbreaking_prefix.ru
│   │   ├── nonbreaking_prefix.sk
│   │   ├── nonbreaking_prefix.sl
│   │   ├── nonbreaking_prefix.sv
│   │   └── nonbreaking_prefix.ta
│   ├── normalize-punctuation.perl
│   ├── postprocess.sh
│   ├── preprocess.sh
│   ├── sample.en
│   ├── sample.fr
│   ├── shuffle.py
│   ├── strip_sgml.py
│   ├── subword_nmt
│   │   ├── README.md
│   │   ├── apply_bpe.py
│   │   ├── chrF.py
│   │   ├── learn_bpe.py
│   │   └── segment-char-ngrams.py
│   ├── tokenizer.perl
│   └── util.py
├── decode.ipynb
├── decode.py
├── seq2seq_model.py
├── train.ipynb
└── train.py
/.gitignore:
--------------------------------------------------------------------------------
1 | **/*.pyc
2 | **/*.swp
3 | model/
4 | europarl-v7.*
5 | newstest*
6 | .ipynb*
7 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # TF-seq2seq
2 | ## **Sequence to sequence (seq2seq) learning using TensorFlow.**
3 |
4 | The core building blocks are the RNN encoder-decoder architecture and the attention mechanism.
5 |
6 | The package was largely implemented using the latest (1.2) tf.contrib.seq2seq modules:
7 | - AttentionWrapper
8 | - Decoder
9 | - BasicDecoder
10 | - BeamSearchDecoder
11 |
12 | **The package supports**
13 | - Multi-layer GRU/LSTM
14 | - Residual connection
15 | - Dropout
16 | - Attention and input_feeding
17 | - Beamsearch decoding
18 | - Write n-best list
19 |
20 | # Dependencies
21 | - NumPy >= 1.11.1
22 | - TensorFlow >= 1.2
23 |
24 |
25 | # History
26 | - June 5, 2017: Major update
27 | - June 6, 2017: Supports batch beamsearch decoding
28 | - June 11, 2017: Separated training / decoding
29 | - June 22, 2017: Supports tf.1.2 (contrib.rnn -> python.ops.rnn_cell)
30 |
31 |
32 | # Usage Instructions
33 | ## **Data Preparation**
34 |
35 | To preprocess raw parallel data of `sample_data.src` and `sample_data.trg`, simply run
36 | ```ruby
37 | cd data/
38 | ./preprocess.sh src trg sample_data ${max_seq_len}
39 | ```
40 |
41 | Running the above code performs the following widely used preprocessing steps for Machine Translation (MT); a small sketch of the resulting vocabulary files is shown after the list.
42 |
43 | - Normalizing punctuation
44 | - Tokenizing
45 | - Byte-pair encoding (number of merge operations = 30000) (Sennrich et al., 2016)
46 | - Removing sequences longer than ${max_seq_len}
47 | - Shuffling
48 | - Building dictionaries
49 |
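For reference, `build_dictionary.py` (shown later under `data/`) writes one `<corpus_file>.json` vocabulary per input file, with the reserved tokens from `data/data_utils.py` occupying the first indices and the remaining words ordered by descending corpus frequency. A minimal sketch of inspecting such a vocabulary (the file name below is illustrative and depends on your corpus):

```python
import json

# Vocabulary written by data/build_dictionary.py; the path is illustrative.
with open('sample_data.src.json') as f:
    vocab = json.load(f)

# Reserved tokens from data/data_utils.py always occupy the first indices.
assert vocab['_GO'] == 0   # start_token
assert vocab['_EOS'] == 1  # end_token (also used as PAD)
assert vocab['_UNK'] == 2  # unk_token

# All other entries are indexed by descending corpus frequency.
print('%d vocabulary entries' % len(vocab))
```
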
50 | ## **Training**
51 | To train a seq2seq model,
52 | ```ruby
53 | $ python train.py --cell_type 'lstm' \
54 | --attention_type 'luong' \
55 | --hidden_units 1024 \
56 | --depth 2 \
57 | --embedding_size 500 \
58 | --num_encoder_symbols 30000 \
59 | --num_decoder_symbols 30000 ...
60 | ```
61 |
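The flags above map onto the tf.contrib.seq2seq building blocks listed at the top of this README. As a rough illustration (not the exact code in seq2seq_model.py), `--attention_type 'luong'` together with `--cell_type 'lstm'` and `--hidden_units 1024` corresponds to wrapping the decoder cell with an attention mechanism along these lines:

```python
import tensorflow as tf
from tensorflow.contrib import seq2seq

hidden_units = 1024  # --hidden_units

# Stand-ins for the encoder's outputs and the source sequence lengths.
encoder_outputs = tf.placeholder(tf.float32, [None, None, hidden_units])
encoder_inputs_length = tf.placeholder(tf.int32, [None])

# --attention_type: 'luong' -> LuongAttention, 'bahdanau' -> BahdanauAttention.
attention_mechanism = seq2seq.LuongAttention(
    num_units=hidden_units,
    memory=encoder_outputs,
    memory_sequence_length=encoder_inputs_length)

# --cell_type lstm: the decoder cell, wrapped with the attention mechanism.
decoder_cell = seq2seq.AttentionWrapper(
    cell=tf.contrib.rnn.LSTMCell(hidden_units),
    attention_mechanism=attention_mechanism,
    attention_layer_size=hidden_units)
```
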
62 | ## **Decoding**
63 | To run the trained model for decoding,
64 | ```ruby
65 | $ python decode.py --beam_width 5 \
66 | --decode_batch_size 30 \
67 | --model_path $PATH_TO_A_MODEL_CHECKPOINT (e.g. model/translate.ckpt-100) \
68 | --max_decode_step 300 \
69 | --write_n_best False \
70 | --decode_input $PATH_TO_DECODE_INPUT \
71 | --decode_output $PATH_TO_DECODE_OUTPUT
72 |
73 | ```
74 | If `--beam_width=1`, greedy decoding is performed at each time-step.
75 |
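For orientation, this is roughly how the two decoding modes differ in the tf.contrib.seq2seq API this package builds on (an illustrative sketch with toy sizes, not the exact code in decode.py):

```python
import tensorflow as tf
from tensorflow.contrib import seq2seq
from tensorflow.python.layers.core import Dense

# Toy sizes for illustration only.
vocab_size, embedding_size, hidden_units = 30000, 500, 1024
batch_size, beam_width, max_decode_step = 30, 5, 300
end_token = 1  # _EOS index from data/data_utils.py; _GO is index 0

embedding = tf.get_variable('embedding', [vocab_size, embedding_size])
decoder_cell = tf.contrib.rnn.LSTMCell(hidden_units)
start_tokens = tf.zeros([batch_size], dtype=tf.int32)  # all _GO
initial_state = decoder_cell.zero_state(batch_size, tf.float32)
output_layer = Dense(vocab_size)

if beam_width > 1:
    # Beam search: the initial state is tiled across beams.
    decoder = seq2seq.BeamSearchDecoder(
        cell=decoder_cell,
        embedding=embedding,
        start_tokens=start_tokens,
        end_token=end_token,
        initial_state=seq2seq.tile_batch(initial_state, beam_width),
        beam_width=beam_width,
        output_layer=output_layer)
else:
    # --beam_width 1: plain greedy decoding at each time-step.
    helper = seq2seq.GreedyEmbeddingHelper(embedding, start_tokens, end_token)
    decoder = seq2seq.BasicDecoder(decoder_cell, helper, initial_state, output_layer)

outputs, final_state, lengths = seq2seq.dynamic_decode(
    decoder, maximum_iterations=max_decode_step)
```
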
76 | ## **Arguments**
77 |
78 | **Data params**
79 | - `--source_vocabulary` : Path to source vocabulary
80 | - `--target_vocabulary` : Path to target vocabulary
81 | - `--source_train_data` : Path to source training data
82 | - `--target_train_data` : Path to target training data
83 | - `--source_valid_data` : Path to source validation data
84 | - `--target_valid_data` : Path to target validation data
85 |
86 | **Network params**
87 | - `--cell_type` : RNN cell to use for the encoder and decoder (default: lstm)
88 | - `--attention_type` : Attention mechanism (bahdanau, luong) (default: bahdanau)
89 | - `--depth` : Number of stacked RNN layers in the encoder and decoder (default: 2)
90 | - `--embedding_size` : Embedding dimension of encoder and decoder inputs (default: 500)
91 | - `--num_encoder_symbols` : Source vocabulary size to use (default: 30000)
92 | - `--num_decoder_symbols` : Target vocabulary size to use (default: 30000)
93 | - `--use_residual` : Use residual connections between layers (default: True)
94 | - `--attn_input_feeding` : Use the input feeding method in the attentional decoder (Luong et al., 2015) (default: True)
95 | - `--use_dropout` : Use dropout on RNN cell outputs (default: True)
96 | - `--dropout_rate` : Dropout probability for cell outputs (0.0: no dropout) (default: 0.3)
97 |
98 | **Training params**
99 | - `--learning_rate` : Learning rate (default: 0.0002)
100 | - `--max_gradient_norm` : Clip gradients to this norm (default: 1.0)
101 | - `--batch_size` : Batch size
102 | - `--max_epochs` : Maximum number of training epochs
103 | - `--max_load_batches` : Maximum number of batches to prefetch at one time
104 | - `--max_seq_length` : Maximum sequence length
105 | - `--display_freq` : Display training status every this many iterations
106 | - `--save_freq` : Save a model checkpoint every this many iterations
107 | - `--valid_freq` : Evaluate the model every this many iterations (validation data required)
108 | - `--optimizer` : Optimizer for training (adadelta, adam, rmsprop) (default: adam)
109 | - `--model_dir` : Path to save model checkpoints
110 | - `--model_name` : File name used for model checkpoints
111 | - `--shuffle_each_epoch` : Shuffle the training dataset at each epoch (default: True)
112 | - `--sort_by_length` : Sort pre-fetched minibatches by their target sequence lengths (default: True)
113 |
114 | **Decoding params**
115 | - `--beam_width` : Beam width used in beamsearch (default: 1)
116 | - `--decode_batch_size` : Batch size used in decoding
117 | - `--max_decode_step` : Maximum time step limit in decoding (default: 500)
118 | - `--write_n_best` : Write beamsearch n-best list (n = beam_width) (default: False)
119 | - `--decode_input` : Input file path to decode
120 | - `--decode_output` : Output file path of decoding output
121 |
122 | **Runtime params**
123 | - `--allow_soft_placement` : Allow device soft placement
124 | - `--log_device_placement` : Log placement of ops on devices
125 |
126 |
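All of the options above are ordinary command-line flags consumed by train.py and decode.py. As a hedged sketch of how flags of this kind are typically declared in a TF 1.x script (the actual definitions live in train.py / decode.py and may differ in detail):

```python
import tensorflow as tf

# Illustrative declarations mirroring a few of the arguments documented above;
# the defaults shown are the ones listed in this README.
tf.app.flags.DEFINE_string('cell_type', 'lstm', 'RNN cell to use for encoder and decoder')
tf.app.flags.DEFINE_string('attention_type', 'bahdanau', 'Attention mechanism (bahdanau, luong)')
tf.app.flags.DEFINE_integer('depth', 2, 'Number of stacked RNN layers')
tf.app.flags.DEFINE_integer('embedding_size', 500, 'Embedding dimension of encoder/decoder inputs')
tf.app.flags.DEFINE_float('learning_rate', 0.0002, 'Learning rate')
tf.app.flags.DEFINE_boolean('use_dropout', True, 'Use dropout on RNN cell outputs')

FLAGS = tf.app.flags.FLAGS

def main(_):
    # Flags are overridable from the command line, e.g. --cell_type 'gru' --depth 4.
    print('cell_type=%s, depth=%d' % (FLAGS.cell_type, FLAGS.depth))

if __name__ == '__main__':
    tf.app.run()
```
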
127 | ## Acknowledgements
128 |
129 | The implementation is based on the following projects:
130 | - [nematus](https://github.com/rsennrich/nematus/): Theano implementation of Neural Machine Translation, and the major reference for this project
131 | - [subword-nmt](https://github.com/rsennrich/subword-nmt/): Provides the subword-unit scripts used to preprocess the input data
132 | - [moses](https://github.com/moses-smt/mosesdecoder): Provides the corpus preprocessing scripts used here
133 | - [tf.seq2seq_legacy](https://github.com/tensorflow/models/tree/master/tutorials/rnn/translate): Legacy TensorFlow seq2seq tutorial
134 | - [tf_tutorial_plus](https://github.com/j-min/tf_tutorial_plus): Nice tutorials for the tf.contrib.seq2seq API
135 |
136 | For any comments and feedback, please email me at pjh0308@gmail.com or open an issue here.
137 |
--------------------------------------------------------------------------------
/data/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jayparks/tf-seq2seq/e55d88ec21090c127d24da16b9e2b6b9aa894821/data/__init__.py
--------------------------------------------------------------------------------
/data/build_dictionary.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | import numpy
4 | import json
5 |
6 | import sys
7 | import fileinput
8 |
9 | from collections import OrderedDict
10 | from data_utils import extra_tokens
11 |
12 | def main():
13 | for filename in sys.argv[1:]:
14 | print 'Processing', filename
15 | word_freqs = OrderedDict()
16 | with open(filename, 'r') as f:
17 | for line in f:
18 | words_in = line.strip().split(' ')
19 | for w in words_in:
20 | if w not in word_freqs:
21 | word_freqs[w] = 0
22 | word_freqs[w] += 1
23 | words = word_freqs.keys()
24 | freqs = word_freqs.values()
25 |
26 | sorted_idx = numpy.argsort(freqs)
27 | sorted_words = [words[ii] for ii in sorted_idx[::-1]]
28 |
29 | worddict = OrderedDict()
30 | for ii, ww in enumerate(extra_tokens):
31 | worddict[ww] = ii
32 | for ii, ww in enumerate(sorted_words):
33 | worddict[ww] = ii + len(extra_tokens)
34 |
35 | with open('%s.json'%filename, 'wb') as f:
36 | json.dump(worddict, f, indent=2, ensure_ascii=False)
37 |
38 | print 'Done'
39 |
40 | if __name__ == '__main__':
41 | main()
42 |
--------------------------------------------------------------------------------
/data/clean-corpus-n.perl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env perl
2 | #
3 | # This file is part of moses. Its use is licensed under the GNU Lesser General
4 | # Public License version 2.1 or, at your option, any later version.
5 |
6 | # $Id: clean-corpus-n.perl 3633 2010-10-21 09:49:27Z phkoehn $
7 | use warnings;
8 | use strict;
9 | use Getopt::Long;
10 | my $help;
11 | my $lc = 0; # lowercase the corpus?
12 | my $ignore_ratio = 0;
13 | my $ignore_xml = 0;
14 | my $enc = "utf8"; # encoding of the input and output files
15 | # set to anything else you wish, but I have not tested it yet
16 | my $max_word_length = 1000; # any segment with a word (or factor) exceeding this length in chars
17 | # is discarded; motivated by symal.cpp, which has its own such parameter (hardcoded to 1000)
18 | # and crashes if it encounters a word that exceeds it
19 | my $ratio = 9;
20 |
21 | GetOptions(
22 | "help" => \$help,
23 | "lowercase|lc" => \$lc,
24 | "encoding=s" => \$enc,
25 | "ratio=f" => \$ratio,
26 | "ignore-ratio" => \$ignore_ratio,
27 | "ignore-xml" => \$ignore_xml,
28 | "max-word-length|mwl=s" => \$max_word_length
29 | ) or exit(1);
30 |
31 | if (scalar(@ARGV) < 6 || $help) {
32 | print "syntax: clean-corpus-n.perl [-ratio n] corpus l1 l2 clean-corpus min max [lines retained file]\n";
33 | exit;
34 | }
35 |
36 | my $corpus = $ARGV[0];
37 | my $l1 = $ARGV[1];
38 | my $l2 = $ARGV[2];
39 | my $out = $ARGV[3];
40 | my $min = $ARGV[4];
41 | my $max = $ARGV[5];
42 |
43 | my $linesRetainedFile = "";
44 | if (scalar(@ARGV) > 6) {
45 | $linesRetainedFile = $ARGV[6];
46 | open(LINES_RETAINED,">$linesRetainedFile") or die "Can't write $linesRetainedFile";
47 | }
48 |
49 | print STDERR "clean-corpus.perl: processing $corpus.$l1 & .$l2 to $out, cutoff $min-$max, ratio $ratio\n";
50 |
51 | my $opn = undef;
52 | my $l1input = "$corpus.$l1";
53 | if (-e $l1input) {
54 | $opn = $l1input;
55 | } elsif (-e $l1input.".gz") {
56 | $opn = "gunzip -c $l1input.gz |";
57 | } else {
58 | die "Error: $l1input does not exist";
59 | }
60 | open(F,$opn) or die "Can't open '$opn'";
61 | $opn = undef;
62 | my $l2input = "$corpus.$l2";
63 | if (-e $l2input) {
64 | $opn = $l2input;
65 | } elsif (-e $l2input.".gz") {
66 | $opn = "gunzip -c $l2input.gz |";
67 | } else {
68 | die "Error: $l2input does not exist";
69 | }
70 |
71 | open(E,$opn) or die "Can't open '$opn'";
72 |
73 | open(FO,">$out.$l1") or die "Can't write $out.$l1";
74 | open(EO,">$out.$l2") or die "Can't write $out.$l2";
75 |
76 | # necessary for proper lowercasing
77 | my $binmode;
78 | if ($enc eq "utf8") {
79 | $binmode = ":utf8";
80 | } else {
81 | $binmode = ":encoding($enc)";
82 | }
83 | binmode(F, $binmode);
84 | binmode(E, $binmode);
85 | binmode(FO, $binmode);
86 | binmode(EO, $binmode);
87 |
88 | my $innr = 0;
89 | my $outnr = 0;
90 | my $factored_flag;
91 | while(my $f = <F>) {
92 | $innr++;
93 | print STDERR "." if $innr % 10000 == 0;
94 | print STDERR "($innr)" if $innr % 100000 == 0;
95 | my $e = <E>;
96 | die "$corpus.$l2 is too short!" if !defined $e;
97 | chomp($e);
98 | chomp($f);
99 | if ($innr == 1) {
100 | $factored_flag = ($e =~ /\|/ || $f =~ /\|/);
101 | }
102 |
103 | #if lowercasing, lowercase
104 | if ($lc) {
105 | $e = lc($e);
106 | $f = lc($f);
107 | }
108 |
109 | $e =~ s/\|//g unless $factored_flag;
110 | $e =~ s/\s+/ /g;
111 | $e =~ s/^ //;
112 | $e =~ s/ $//;
113 | $f =~ s/\|//g unless $factored_flag;
114 | $f =~ s/\s+/ /g;
115 | $f =~ s/^ //;
116 | $f =~ s/ $//;
117 | next if $f eq '';
118 | next if $e eq '';
119 |
120 | my $ec = &word_count($e);
121 | my $fc = &word_count($f);
122 | next if $ec > $max;
123 | next if $fc > $max;
124 | next if $ec < $min;
125 | next if $fc < $min;
126 | next if !$ignore_ratio && $ec/$fc > $ratio;
127 | next if !$ignore_ratio && $fc/$ec > $ratio;
128 | # Skip this segment if any factor is longer than $max_word_length
129 | my $max_word_length_plus_one = $max_word_length + 1;
130 | next if $e =~ /[^\s\|]{$max_word_length_plus_one}/;
131 | next if $f =~ /[^\s\|]{$max_word_length_plus_one}/;
132 |
133 | # An extra check: none of the factors can be blank!
134 | die "There is a blank factor in $corpus.$l1 on line $innr: $f"
135 | if $f =~ /[ \|]\|/;
136 | die "There is a blank factor in $corpus.$l2 on line $innr: $e"
137 | if $e =~ /[ \|]\|/;
138 |
139 | $outnr++;
140 | print FO $f."\n";
141 | print EO $e."\n";
142 |
143 | if ($linesRetainedFile ne "") {
144 | print LINES_RETAINED $innr."\n";
145 | }
146 | }
147 |
148 | if ($linesRetainedFile ne "") {
149 | close LINES_RETAINED;
150 | }
151 |
152 | print STDERR "\n";
153 | my $e = <E>;
154 | die "$corpus.$l2 is too long!" if defined $e;
155 |
156 | print STDERR "Input sentences: $innr Output sentences: $outnr\n";
157 |
158 | sub word_count {
159 | my ($line) = @_;
160 | if ($ignore_xml) {
161 | $line =~ s/<\S[^>]*\S>/ /g;
162 | $line =~ s/\s+/ /g;
163 | $line =~ s/^ //g;
164 | $line =~ s/ $//g;
165 | }
166 | my @w = split(/ /,$line);
167 | return scalar @w;
168 | }
169 |
--------------------------------------------------------------------------------
/data/data_iterator.py:
--------------------------------------------------------------------------------
1 |
2 | import numpy as np
3 | import shuffle
4 | from util import load_dict
5 |
6 | import data_utils
7 |
8 | '''
9 | Much of this code is based on the data_iterator.py of
10 | nematus project (https://github.com/rsennrich/nematus)
11 | '''
12 |
13 | class TextIterator:
14 | """Simple Text iterator."""
15 | def __init__(self, source, source_dict,
16 | batch_size=128, maxlen=None,
17 | n_words_source=-1,
18 | skip_empty=False,
19 | shuffle_each_epoch=False,
20 | sort_by_length=False,
21 | maxibatch_size=20,
22 | ):
23 |
24 | if shuffle_each_epoch:
25 | self.source_orig = source
26 | self.source = shuffle.main([self.source_orig], temporary=True)
27 | else:
28 | self.source = data_utils.fopen(source, 'r')
29 |
30 | self.source_dict = load_dict(source_dict)
31 | self.batch_size = batch_size
32 | self.maxlen = maxlen
33 | self.skip_empty = skip_empty
34 |
35 | self.n_words_source = n_words_source
36 |
37 | if self.n_words_source > 0:
38 | for key, idx in self.source_dict.items():
39 | if idx >= self.n_words_source:
40 | del self.source_dict[key]
41 |
42 | self.shuffle = shuffle_each_epoch
43 | self.sort_by_length = sort_by_length
44 |
45 | self.shuffle = shuffle_each_epoch
46 | self.sort_by_length = sort_by_length
47 |
48 | self.source_buffer = []
49 | self.k = batch_size * maxibatch_size
50 |
51 | self.end_of_data = False
52 |
53 | def __iter__(self):
54 | return self
55 |
56 | def __len__(self):
57 | return sum([1 for _ in self])
58 |
59 | def reset(self):
60 | if self.shuffle:
61 | self.source = shuffle.main([self.source_orig], temporary=True)
62 | else:
63 | self.source.seek(0)
64 |
65 | def next(self):
66 | if self.end_of_data:
67 | self.end_of_data = False
68 | self.reset()
69 | raise StopIteration
70 |
71 | source = []
72 |
73 | # fill buffer, if it's empty
74 | if len(self.source_buffer) == 0:
75 | for k_ in xrange(self.k):
76 | ss = self.source.readline()
77 | if ss == "":
78 | break
79 | self.source_buffer.append(ss.strip().split())
80 |
81 | # sort by buffer
82 | if self.sort_by_length:
83 | slen = np.array([len(s) for s in self.source_buffer])
84 | sidx = slen.argsort()
85 |
86 | _sbuf = [self.source_buffer[i] for i in sidx]
87 |
88 | self.source_buffer = _sbuf
89 | else:
90 | self.source_buffer.reverse()
91 |
92 | if len(self.source_buffer) == 0:
93 | self.end_of_data = False
94 | self.reset()
95 | raise StopIteration
96 |
97 | try:
98 | # actual work here
99 | while True:
100 | # read from source file and map to word index
101 | try:
102 | ss = self.source_buffer.pop()
103 | except IndexError:
104 | break
105 | ss = [self.source_dict[w] if w in self.source_dict
106 | else data_utils.unk_token for w in ss]
107 |
108 | if self.maxlen and len(ss) > self.maxlen:
109 | continue
110 | if self.skip_empty and (not ss):
111 | continue
112 | source.append(ss)
113 |
114 | if len(source) >= self.batch_size:
115 | break
116 | except IOError:
117 | self.end_of_data = True
118 |
119 | # all sentence pairs in maxibatch filtered out because of length
120 | if len(source) == 0:
121 | source = self.next()
122 |
123 | return source
124 |
125 | class BiTextIterator:
126 | """Simple Bitext iterator."""
127 | def __init__(self, source, target,
128 | source_dict, target_dict,
129 | batch_size=128,
130 | maxlen=100,
131 | n_words_source=-1,
132 | n_words_target=-1,
133 | skip_empty=False,
134 | shuffle_each_epoch=False,
135 | sort_by_length=True,
136 | maxibatch_size=20):
137 | if shuffle_each_epoch:
138 | self.source_orig = source
139 | self.target_orig = target
140 | self.source, self.target = shuffle.main([self.source_orig, self.target_orig], temporary=True)
141 | else:
142 | self.source = data_utils.fopen(source, 'r')
143 | self.target = data_utils.fopen(target, 'r')
144 |
145 | self.source_dict = load_dict(source_dict)
146 | self.target_dict = load_dict(target_dict)
147 |
148 | self.batch_size = batch_size
149 | self.maxlen = maxlen
150 | self.skip_empty = skip_empty
151 |
152 | self.n_words_source = n_words_source
153 | self.n_words_target = n_words_target
154 |
155 | if self.n_words_source > 0:
156 | for key, idx in self.source_dict.items():
157 | if idx >= self.n_words_source:
158 | del self.source_dict[key]
159 |
160 | if self.n_words_target > 0:
161 | for key, idx in self.target_dict.items():
162 | if idx >= self.n_words_target:
163 | del self.target_dict[key]
164 |
165 | self.shuffle = shuffle_each_epoch
166 | self.sort_by_length = sort_by_length
167 |
168 | self.source_buffer = []
169 | self.target_buffer = []
170 | self.k = batch_size * maxibatch_size
171 |
172 | self.end_of_data = False
173 |
174 | def __iter__(self):
175 | return self
176 |
177 | def __len__(self):
178 | return sum([1 for _ in self])
179 |
180 | def reset(self):
181 | if self.shuffle:
182 | self.source, self.target = shuffle.main([self.source_orig, self.target_orig], temporary=True)
183 | else:
184 | self.source.seek(0)
185 | self.target.seek(0)
186 |
187 | def next(self):
188 | if self.end_of_data:
189 | self.end_of_data = False
190 | self.reset()
191 | raise StopIteration
192 |
193 | source = []
194 | target = []
195 |
196 | # fill buffer, if it's empty
197 | assert len(self.source_buffer) == len(self.target_buffer), 'Buffer size mismatch!'
198 |
199 | if len(self.source_buffer) == 0:
200 | for k_ in xrange(self.k):
201 | ss = self.source.readline()
202 | if ss == "":
203 | break
204 | tt = self.target.readline()
205 | if tt == "":
206 | break
207 | self.source_buffer.append(ss.strip().split())
208 | self.target_buffer.append(tt.strip().split())
209 |
210 | # sort by target buffer
211 | if self.sort_by_length:
212 | tlen = np.array([len(t) for t in self.target_buffer])
213 | tidx = tlen.argsort()
214 |
215 | _sbuf = [self.source_buffer[i] for i in tidx]
216 | _tbuf = [self.target_buffer[i] for i in tidx]
217 |
218 | self.source_buffer = _sbuf
219 | self.target_buffer = _tbuf
220 |
221 | else:
222 | self.source_buffer.reverse()
223 | self.target_buffer.reverse()
224 |
225 | if len(self.source_buffer) == 0 or len(self.target_buffer) == 0:
226 | self.end_of_data = False
227 | self.reset()
228 | raise StopIteration
229 |
230 | try:
231 |
232 | # actual work here
233 | while True:
234 |
235 | # read from source file and map to word index
236 | try:
237 | ss = self.source_buffer.pop()
238 | except IndexError:
239 | break
240 | ss = [self.source_dict[w] if w in self.source_dict
241 | else data_utils.unk_token for w in ss]
242 |
243 | # read from target file and map to word index
244 | tt = self.target_buffer.pop()
245 | tt = [self.target_dict[w] if w in self.target_dict
246 | else data_utils.unk_token for w in tt]
247 | if self.n_words_target > 0:
248 | tt = [w if w < self.n_words_target
249 | else data_utils.unk_token for w in tt]
250 |
251 | if self.maxlen:
252 | if len(ss) > self.maxlen and len(tt) > self.maxlen:
253 | continue
254 | if self.skip_empty and (not ss or not tt):
255 | continue
256 |
257 | source.append(ss)
258 | target.append(tt)
259 |
260 | if len(source) >= self.batch_size or \
261 | len(target) >= self.batch_size:
262 | break
263 | except IOError:
264 | self.end_of_data = True
265 |
266 | # all sentence pairs in maxibatch filtered out because of length
267 | if len(source) == 0 or len(target) == 0:
268 | source, target = self.next()
269 |
270 | return source, target
271 |
--------------------------------------------------------------------------------
/data/data_statistics.py:
--------------------------------------------------------------------------------
1 |
2 | import sys
3 | import numpy as np
4 |
5 | def main(argv):
6 | for input_file in argv:
7 | lengths = []
8 | with open(input_file, 'r') as corpus:
9 | for line in corpus:
10 | lengths.append(len(line.split()))
11 | print("%s: size=%d, avg_length=%.2f, std=%.2f, min=%d, max=%d"
12 | % (input_file, len(lengths), np.mean(lengths), np.std(lengths), np.min(lengths), np.max(lengths)))
13 |
14 |
15 | if __name__ == "__main__":
16 | main(sys.argv[1:])
17 |
--------------------------------------------------------------------------------
/data/data_utils.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | import gzip
4 | from util import load_dict
5 |
6 | # Extra vocabulary symbols
7 | _GO = '_GO'
8 | EOS = '_EOS' # also function as PAD
9 | UNK = '_UNK'
10 |
11 | extra_tokens = [_GO, EOS, UNK]
12 |
13 | start_token = extra_tokens.index(_GO) # start_token = 0
14 | end_token = extra_tokens.index(EOS) # end_token = 1
15 | unk_token = extra_tokens.index(UNK)
16 |
17 |
18 | def fopen(filename, mode='r'):
19 | if filename.endswith('.gz'):
20 | return gzip.open(filename, mode)
21 | return open(filename, mode)
22 |
23 |
24 | def load_inverse_dict(dict_path):
25 | orig_dict = load_dict(dict_path)
26 | idict = {}
27 | for words, idx in orig_dict.iteritems():
28 | idict[idx] = words
29 | return idict
30 |
31 |
32 | def seq2words(seq, inverse_target_dictionary):
33 | words = []
34 | for w in seq:
35 | if w == end_token:
36 | break
37 | if w in inverse_target_dictionary:
38 | words.append(inverse_target_dictionary[w])
39 | else:
40 | words.append(UNK)
41 | return ' '.join(words)
42 |
43 |
44 | # batch preparation of a given sequence
45 | def prepare_batch(seqs_x, maxlen=None):
46 | # seqs_x: a list of sentences
47 | lengths_x = [len(s) for s in seqs_x]
48 |
49 | if maxlen is not None:
50 | new_seqs_x = []
51 | new_lengths_x = []
52 | for l_x, s_x in zip(lengths_x, seqs_x):
53 | if l_x <= maxlen:
54 | new_seqs_x.append(s_x)
55 | new_lengths_x.append(l_x)
56 | lengths_x = new_lengths_x
57 | seqs_x = new_seqs_x
58 |
59 | if len(lengths_x) < 1:
60 | return None, None
61 |
62 | batch_size = len(seqs_x)
63 |
64 | x_lengths = np.array(lengths_x)
65 | maxlen_x = np.max(x_lengths)
66 |
67 | x = np.ones((batch_size, maxlen_x)).astype('int32') * end_token
68 |
69 | for idx, s_x in enumerate(seqs_x):
70 | x[idx, :lengths_x[idx]] = s_x
71 | return x, x_lengths
72 |
73 |
74 | # batch preparation of a given sequence pair for training
75 | def prepare_train_batch(seqs_x, seqs_y, maxlen=None):
76 | # seqs_x, seqs_y: a list of sentences
77 | lengths_x = [len(s) for s in seqs_x]
78 | lengths_y = [len(s) for s in seqs_y]
79 |
80 | if maxlen is not None:
81 | new_seqs_x = []
82 | new_seqs_y = []
83 | new_lengths_x = []
84 | new_lengths_y = []
85 | for l_x, s_x, l_y, s_y in zip(lengths_x, seqs_x, lengths_y, seqs_y):
86 | if l_x <= maxlen and l_y <= maxlen:
87 | new_seqs_x.append(s_x)
88 | new_lengths_x.append(l_x)
89 | new_seqs_y.append(s_y)
90 | new_lengths_y.append(l_y)
91 | lengths_x = new_lengths_x
92 | seqs_x = new_seqs_x
93 | lengths_y = new_lengths_y
94 | seqs_y = new_seqs_y
95 |
96 | if len(lengths_x) < 1 or len(lengths_y) < 1:
97 | return None, None, None, None
98 |
99 | batch_size = len(seqs_x)
100 |
101 | x_lengths = np.array(lengths_x)
102 | y_lengths = np.array(lengths_y)
103 |
104 | maxlen_x = np.max(x_lengths)
105 | maxlen_y = np.max(y_lengths)
106 |
107 | x = np.ones((batch_size, maxlen_x)).astype('int32') * end_token
108 | y = np.ones((batch_size, maxlen_y)).astype('int32') * end_token
109 |
110 | for idx, [s_x, s_y] in enumerate(zip(seqs_x, seqs_y)):
111 | x[idx, :lengths_x[idx]] = s_x
112 | y[idx, :lengths_y[idx]] = s_y
113 | return x, x_lengths, y, y_lengths
114 |
--------------------------------------------------------------------------------
/data/merge.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 |
4 | SRC=$1
5 | TRG=$2
6 |
7 | FSRC=all_${1}-${2}.${1}
8 | FTRG=all_${1}-${2}.${2}
9 |
10 | echo "" > $FSRC
11 | for F in *${1}-${2}.${1}
12 | do
13 | if [ "$F" = "$FSRC" ]; then
14 | echo "pass"
15 | else
16 | cat $F >> $FSRC
17 | fi
18 | done
19 |
20 |
21 | echo "" > $FTRG
22 | for F in *${1}-${2}.${2}
23 | do
24 | if [ "$F" = "$FTRG" ]; then
25 | echo "pass"
26 | else
27 | cat $F >> $FTRG
28 | fi
29 | done
30 |
--------------------------------------------------------------------------------
/data/multi-bleu.perl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env perl
2 | #
3 | # This file is part of moses. Its use is licensed under the GNU Lesser General
4 | # Public License version 2.1 or, at your option, any later version.
5 |
6 | # $Id$
7 | use warnings;
8 | use strict;
9 |
10 | my $lowercase = 0;
11 | if ($ARGV[0] eq "-lc") {
12 | $lowercase = 1;
13 | shift;
14 | }
15 |
16 | my $stem = $ARGV[0];
17 | if (!defined $stem) {
18 | print STDERR "usage: multi-bleu.pl [-lc] reference < hypothesis\n";
19 | print STDERR "Reads the references from reference or reference0, reference1, ...\n";
20 | exit(1);
21 | }
22 |
23 | $stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0";
24 |
25 | my @REF;
26 | my $ref=0;
27 | while(-e "$stem$ref") {
28 | &add_to_ref("$stem$ref",\@REF);
29 | $ref++;
30 | }
31 | &add_to_ref($stem,\@REF) if -e $stem;
32 | die("ERROR: could not find reference file $stem") unless scalar @REF;
33 |
34 | sub add_to_ref {
35 | my ($file,$REF) = @_;
36 | my $s=0;
37 | open(REF,$file) or die "Can't read $file";
38 | while(<REF>) {
39 | chop;
40 | push @{$$REF[$s++]}, $_;
41 | }
42 | close(REF);
43 | }
44 |
45 | my(@CORRECT,@TOTAL,$length_translation,$length_reference);
46 | my $s=0;
47 | while(<STDIN>) {
48 | chop;
49 | $_ = lc if $lowercase;
50 | my @WORD = split;
51 | my %REF_NGRAM = ();
52 | my $length_translation_this_sentence = scalar(@WORD);
53 | my ($closest_diff,$closest_length) = (9999,9999);
54 | foreach my $reference (@{$REF[$s]}) {
55 | # print "$s $_ <=> $reference\n";
56 | $reference = lc($reference) if $lowercase;
57 | my @WORD = split(' ',$reference);
58 | my $length = scalar(@WORD);
59 | my $diff = abs($length_translation_this_sentence-$length);
60 | if ($diff < $closest_diff) {
61 | $closest_diff = $diff;
62 | $closest_length = $length;
63 | # print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." = abs($length_translation_this_sentence-$length), setting len: $closest_length\n";
64 | } elsif ($diff == $closest_diff) {
65 | $closest_length = $length if $length < $closest_length;
66 | # from two references with the same closeness to me
67 | # take the *shorter* into account, not the "first" one.
68 | }
69 | for(my $n=1;$n<=4;$n++) {
70 | my %REF_NGRAM_N = ();
71 | for(my $start=0;$start<=$#WORD-($n-1);$start++) {
72 | my $ngram = "$n";
73 | for(my $w=0;$w<$n;$w++) {
74 | $ngram .= " ".$WORD[$start+$w];
75 | }
76 | $REF_NGRAM_N{$ngram}++;
77 | }
78 | foreach my $ngram (keys %REF_NGRAM_N) {
79 | if (!defined($REF_NGRAM{$ngram}) ||
80 | $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) {
81 | $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram};
82 | # print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram}]
\n";
83 | }
84 | }
85 | }
86 | }
87 | $length_translation += $length_translation_this_sentence;
88 | $length_reference += $closest_length;
89 | for(my $n=1;$n<=4;$n++) {
90 | my %T_NGRAM = ();
91 | for(my $start=0;$start<=$#WORD-($n-1);$start++) {
92 | my $ngram = "$n";
93 | for(my $w=0;$w<$n;$w++) {
94 | $ngram .= " ".$WORD[$start+$w];
95 | }
96 | $T_NGRAM{$ngram}++;
97 | }
98 | foreach my $ngram (keys %T_NGRAM) {
99 | $ngram =~ /^(\d+) /;
100 | my $n = $1;
101 | # my $corr = 0;
102 | # print "$i e $ngram $T_NGRAM{$ngram}
\n";
103 | $TOTAL[$n] += $T_NGRAM{$ngram};
104 | if (defined($REF_NGRAM{$ngram})) {
105 | if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) {
106 | $CORRECT[$n] += $T_NGRAM{$ngram};
107 | # $corr = $T_NGRAM{$ngram};
108 | # print "$i e correct1 $T_NGRAM{$ngram}
\n";
109 | }
110 | else {
111 | $CORRECT[$n] += $REF_NGRAM{$ngram};
112 | # $corr = $REF_NGRAM{$ngram};
113 | # print "$i e correct2 $REF_NGRAM{$ngram}
\n";
114 | }
115 | }
116 | # $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram};
117 | # print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n"
118 | }
119 | }
120 | $s++;
121 | }
122 | my $brevity_penalty = 1;
123 | my $bleu = 0;
124 |
125 | my @bleu=();
126 |
127 | for(my $n=1;$n<=4;$n++) {
128 | if (defined ($TOTAL[$n])){
129 | $bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0;
130 | # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n";
131 | }else{
132 | $bleu[$n]=0;
133 | }
134 | }
135 |
136 | if ($length_reference==0){
137 | printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n";
138 | exit(1);
139 | }
140 |
141 | if ($length_translation<$length_reference) {
142 | $brevity_penalty = exp(1-$length_reference/$length_translation);
143 | }
144 | $bleu = $brevity_penalty * exp((my_log( $bleu[1] ) +
145 | my_log( $bleu[2] ) +
146 | my_log( $bleu[3] ) +
147 | my_log( $bleu[4] ) ) / 4) ;
148 | printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n",
149 | 100*$bleu,
150 | 100*$bleu[1],
151 | 100*$bleu[2],
152 | 100*$bleu[3],
153 | 100*$bleu[4],
154 | $brevity_penalty,
155 | $length_translation / $length_reference,
156 | $length_translation,
157 | $length_reference;
158 |
159 | sub my_log {
160 | return -9999999999 unless $_[0];
161 | return log($_[0]);
162 | }
163 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/README.txt:
--------------------------------------------------------------------------------
1 | The language suffix can be found here:
2 |
3 | http://www.loc.gov/standards/iso639-2/php/code_list.php
4 |
5 | This code includes data from Daniel Naber's Language Tools (Czech abbreviations).
6 | This code includes data from the Czech Wiktionary (also Czech abbreviations).
7 |
8 |
9 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.ca:
--------------------------------------------------------------------------------
1 | Dr
2 | Dra
3 | pàg
4 | p
5 | c
6 | av
7 | Sr
8 | Sra
9 | adm
10 | esq
11 | Prof
12 | S.A
13 | S.L
14 | p.e
15 | ptes
16 | Sta
17 | St
18 | pl
19 | màx
20 | cast
21 | dir
22 | nre
23 | fra
24 | admdora
25 | Emm
26 | Excma
27 | espf
28 | dc
29 | admdor
30 | tel
31 | angl
32 | aprox
33 | ca
34 | dept
35 | dj
36 | dl
37 | dt
38 | ds
39 | dg
40 | dv
41 | ed
42 | entl
43 | al
44 | i.e
45 | maj
46 | smin
47 | n
48 | núm
49 | pta
50 | A
51 | B
52 | C
53 | D
54 | E
55 | F
56 | G
57 | H
58 | I
59 | J
60 | K
61 | L
62 | M
63 | N
64 | O
65 | P
66 | Q
67 | R
68 | S
69 | T
70 | U
71 | V
72 | W
73 | X
74 | Y
75 | Z
76 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.cs:
--------------------------------------------------------------------------------
1 | Bc
2 | BcA
3 | Ing
4 | Ing.arch
5 | MUDr
6 | MVDr
7 | MgA
8 | Mgr
9 | JUDr
10 | PhDr
11 | RNDr
12 | PharmDr
13 | ThLic
14 | ThDr
15 | Ph.D
16 | Th.D
17 | prof
18 | doc
19 | CSc
20 | DrSc
21 | dr. h. c
22 | PaedDr
23 | Dr
24 | PhMr
25 | DiS
26 | abt
27 | ad
28 | a.i
29 | aj
30 | angl
31 | anon
32 | apod
33 | atd
34 | atp
35 | aut
36 | bd
37 | biogr
38 | b.m
39 | b.p
40 | b.r
41 | cca
42 | cit
43 | cizojaz
44 | c.k
45 | col
46 | čes
47 | čín
48 | čj
49 | ed
50 | facs
51 | fasc
52 | fol
53 | fot
54 | franc
55 | h.c
56 | hist
57 | hl
58 | hrsg
59 | ibid
60 | il
61 | ind
62 | inv.č
63 | jap
64 | jhdt
65 | jv
66 | koed
67 | kol
68 | korej
69 | kl
70 | krit
71 | lat
72 | lit
73 | m.a
74 | maď
75 | mj
76 | mp
77 | násl
78 | např
79 | nepubl
80 | něm
81 | no
82 | nr
83 | n.s
84 | okr
85 | odd
86 | odp
87 | obr
88 | opr
89 | orig
90 | phil
91 | pl
92 | pokrač
93 | pol
94 | port
95 | pozn
96 | př.kr
97 | př.n.l
98 | přel
99 | přeprac
100 | příl
101 | pseud
102 | pt
103 | red
104 | repr
105 | resp
106 | revid
107 | rkp
108 | roč
109 | roz
110 | rozš
111 | samost
112 | sect
113 | sest
114 | seš
115 | sign
116 | sl
117 | srv
118 | stol
119 | sv
120 | šk
121 | šk.ro
122 | špan
123 | tab
124 | t.č
125 | tis
126 | tj
127 | tř
128 | tzv
129 | univ
130 | uspoř
131 | vol
132 | vl.jm
133 | vs
134 | vyd
135 | vyobr
136 | zal
137 | zejm
138 | zkr
139 | zprac
140 | zvl
141 | n.p
142 | např
143 | než
144 | MUDr
145 | abl
146 | absol
147 | adj
148 | adv
149 | ak
150 | ak. sl
151 | akt
152 | alch
153 | amer
154 | anat
155 | angl
156 | anglosas
157 | arab
158 | arch
159 | archit
160 | arg
161 | astr
162 | astrol
163 | att
164 | bás
165 | belg
166 | bibl
167 | biol
168 | boh
169 | bot
170 | bulh
171 | círk
172 | csl
173 | č
174 | čas
175 | čes
176 | dat
177 | děj
178 | dep
179 | dět
180 | dial
181 | dór
182 | dopr
183 | dosl
184 | ekon
185 | epic
186 | etnonym
187 | eufem
188 | f
189 | fam
190 | fem
191 | fil
192 | film
193 | form
194 | fot
195 | fr
196 | fut
197 | fyz
198 | gen
199 | geogr
200 | geol
201 | geom
202 | germ
203 | gram
204 | hebr
205 | herald
206 | hist
207 | hl
208 | hovor
209 | hud
210 | hut
211 | chcsl
212 | chem
213 | ie
214 | imp
215 | impf
216 | ind
217 | indoevr
218 | inf
219 | instr
220 | interj
221 | ión
222 | iron
223 | it
224 | kanad
225 | katalán
226 | klas
227 | kniž
228 | komp
229 | konj
230 |
231 | konkr
232 | kř
233 | kuch
234 | lat
235 | lék
236 | les
237 | lid
238 | lit
239 | liturg
240 | lok
241 | log
242 | m
243 | mat
244 | meteor
245 | metr
246 | mod
247 | ms
248 | mysl
249 | n
250 | náb
251 | námoř
252 | neklas
253 | něm
254 | nesklon
255 | nom
256 | ob
257 | obch
258 | obyč
259 | ojed
260 | opt
261 | part
262 | pas
263 | pejor
264 | pers
265 | pf
266 | pl
267 | plpf
268 |
269 | práv
270 | prep
271 | předl
272 | přivl
273 | r
274 | rcsl
275 | refl
276 | reg
277 | rkp
278 | ř
279 | řec
280 | s
281 | samohl
282 | sg
283 | sl
284 | souhl
285 | spec
286 | srov
287 | stfr
288 | střv
289 | stsl
290 | subj
291 | subst
292 | superl
293 | sv
294 | sz
295 | táz
296 | tech
297 | telev
298 | teol
299 | trans
300 | typogr
301 | var
302 | vedl
303 | verb
304 | vl. jm
305 | voj
306 | vok
307 | vůb
308 | vulg
309 | výtv
310 | vztaž
311 | zahr
312 | zájm
313 | zast
314 | zejm
315 |
316 | zeměd
317 | zkr
318 | zř
319 | mj
320 | dl
321 | atp
322 | sport
323 | Mgr
324 | horn
325 | MVDr
326 | JUDr
327 | RSDr
328 | Bc
329 | PhDr
330 | ThDr
331 | Ing
332 | aj
333 | apod
334 | PharmDr
335 | pomn
336 | ev
337 | slang
338 | nprap
339 | odp
340 | dop
341 | pol
342 | st
343 | stol
344 | p. n. l
345 | před n. l
346 | n. l
347 | př. Kr
348 | po Kr
349 | př. n. l
350 | odd
351 | RNDr
352 | tzv
353 | atd
354 | tzn
355 | resp
356 | tj
357 | p
358 | br
359 | č. j
360 | čj
361 | č. p
362 | čp
363 | a. s
364 | s. r. o
365 | spol. s r. o
366 | p. o
367 | s. p
368 | v. o. s
369 | k. s
370 | o. p. s
371 | o. s
372 | v. r
373 | v z
374 | ml
375 | vč
376 | kr
377 | mld
378 | hod
379 | popř
380 | ap
381 | event
382 | rus
383 | slov
384 | rum
385 | švýc
386 | P. T
387 | zvl
388 | hor
389 | dol
390 | S.O.S
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.de:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
5 | #usually upper case letters are initials in a name
6 | #no german words end in single lower-case letters, so we throw those in too.
7 | A
8 | B
9 | C
10 | D
11 | E
12 | F
13 | G
14 | H
15 | I
16 | J
17 | K
18 | L
19 | M
20 | N
21 | O
22 | P
23 | Q
24 | R
25 | S
26 | T
27 | U
28 | V
29 | W
30 | X
31 | Y
32 | Z
33 | a
34 | b
35 | c
36 | d
37 | e
38 | f
39 | g
40 | h
41 | i
42 | j
43 | k
44 | l
45 | m
46 | n
47 | o
48 | p
49 | q
50 | r
51 | s
52 | t
53 | u
54 | v
55 | w
56 | x
57 | y
58 | z
59 |
60 |
61 | #Roman Numerals. A dot after one of these is not a sentence break in German.
62 | I
63 | II
64 | III
65 | IV
66 | V
67 | VI
68 | VII
69 | VIII
70 | IX
71 | X
72 | XI
73 | XII
74 | XIII
75 | XIV
76 | XV
77 | XVI
78 | XVII
79 | XVIII
80 | XIX
81 | XX
82 | i
83 | ii
84 | iii
85 | iv
86 | v
87 | vi
88 | vii
89 | viii
90 | ix
91 | x
92 | xi
93 | xii
94 | xiii
95 | xiv
96 | xv
97 | xvi
98 | xvii
99 | xviii
100 | xix
101 | xx
102 |
103 | #Titles and Honorifics
104 | Adj
105 | Adm
106 | Adv
107 | Asst
108 | Bart
109 | Bldg
110 | Brig
111 | Bros
112 | Capt
113 | Cmdr
114 | Col
115 | Comdr
116 | Con
117 | Corp
118 | Cpl
119 | DR
120 | Dr
121 | Ens
122 | Gen
123 | Gov
124 | Hon
125 | Hosp
126 | Insp
127 | Lt
128 | MM
129 | MR
130 | MRS
131 | MS
132 | Maj
133 | Messrs
134 | Mlle
135 | Mme
136 | Mr
137 | Mrs
138 | Ms
139 | Msgr
140 | Op
141 | Ord
142 | Pfc
143 | Ph
144 | Prof
145 | Pvt
146 | Rep
147 | Reps
148 | Res
149 | Rev
150 | Rt
151 | Sen
152 | Sens
153 | Sfc
154 | Sgt
155 | Sr
156 | St
157 | Supt
158 | Surg
159 |
160 | #Misc symbols
161 | Mio
162 | Mrd
163 | bzw
164 | v
165 | vs
166 | usw
167 | d.h
168 | z.B
169 | u.a
170 | etc
171 | Mrd
172 | MwSt
173 | ggf
174 | d.J
175 | D.h
176 | m.E
177 | vgl
178 | I.F
179 | z.T
180 | sogen
181 | ff
182 | u.E
183 | g.U
184 | g.g.A
185 | c.-à-d
186 | Buchst
187 | u.s.w
188 | sog
189 | u.ä
190 | Std
191 | evtl
192 | Zt
193 | Chr
194 | u.U
195 | o.ä
196 | Ltd
197 | b.A
198 | z.Zt
199 | spp
200 | sen
201 | SA
202 | k.o
203 | jun
204 | i.H.v
205 | dgl
206 | dergl
207 | Co
208 | zzt
209 | usf
210 | s.p.a
211 | Dkr
212 | Corp
213 | bzgl
214 | BSE
215 |
216 | #Number indicators
217 | # add #NUMERIC_ONLY# after the word if it should ONLY be non-breaking when a 0-9 digit follows it
218 | No
219 | Nos
220 | Art
221 | Nr
222 | pp
223 | ca
224 | Ca
225 |
226 | #Ordinals are done with . in German - "1." = "1st" in English
227 | 1
228 | 2
229 | 3
230 | 4
231 | 5
232 | 6
233 | 7
234 | 8
235 | 9
236 | 10
237 | 11
238 | 12
239 | 13
240 | 14
241 | 15
242 | 16
243 | 17
244 | 18
245 | 19
246 | 20
247 | 21
248 | 22
249 | 23
250 | 24
251 | 25
252 | 26
253 | 27
254 | 28
255 | 29
256 | 30
257 | 31
258 | 32
259 | 33
260 | 34
261 | 35
262 | 36
263 | 37
264 | 38
265 | 39
266 | 40
267 | 41
268 | 42
269 | 43
270 | 44
271 | 45
272 | 46
273 | 47
274 | 48
275 | 49
276 | 50
277 | 51
278 | 52
279 | 53
280 | 54
281 | 55
282 | 56
283 | 57
284 | 58
285 | 59
286 | 60
287 | 61
288 | 62
289 | 63
290 | 64
291 | 65
292 | 66
293 | 67
294 | 68
295 | 69
296 | 70
297 | 71
298 | 72
299 | 73
300 | 74
301 | 75
302 | 76
303 | 77
304 | 78
305 | 79
306 | 80
307 | 81
308 | 82
309 | 83
310 | 84
311 | 85
312 | 86
313 | 87
314 | 88
315 | 89
316 | 90
317 | 91
318 | 92
319 | 93
320 | 94
321 | 95
322 | 96
323 | 97
324 | 98
325 | 99
326 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.el:
--------------------------------------------------------------------------------
1 | # Single letters in upper-case are usually abbreviations of names
2 | Α
3 | Β
4 | Γ
5 | Δ
6 | Ε
7 | Ζ
8 | Η
9 | Θ
10 | Ι
11 | Κ
12 | Λ
13 | Μ
14 | Ν
15 | Ξ
16 | Ο
17 | Π
18 | Ρ
19 | Σ
20 | Τ
21 | Υ
22 | Φ
23 | Χ
24 | Ψ
25 | Ω
26 |
27 | # Includes abbreviations for the Greek language compiled from various sources (Greek grammar books, Greek language related web content).
28 | Άθαν
29 | Έγχρ
30 | Έκθ
31 | Έσδ
32 | Έφ
33 | Όμ
34 | Α΄Έσδρ
35 | Α΄Έσδ
36 | Α΄Βασ
37 | Α΄Θεσ
38 | Α΄Ιω
39 | Α΄Κορινθ
40 | Α΄Κορ
41 | Α΄Μακκ
42 | Α΄Μακ
43 | Α΄Πέτρ
44 | Α΄Πέτ
45 | Α΄Παραλ
46 | Α΄Πε
47 | Α΄Σαμ
48 | Α΄Τιμ
49 | Α΄Χρον
50 | Α΄Χρ
51 | Α.Β.Α
52 | Α.Β
53 | Α.Ε
54 | Α.Κ.Τ.Ο
55 | Αέθλ
56 | Αέτ
57 | Αίλ.Δ
58 | Αίλ.Τακτ
59 | Αίσ
60 | Αββακ
61 | Αβυδ
62 | Αβ
63 | Αγάκλ
64 | Αγάπ
65 | Αγάπ.Αμαρτ.Σ
66 | Αγάπ.Γεωπ
67 | Αγαθάγγ
68 | Αγαθήμ
69 | Αγαθιν
70 | Αγαθοκλ
71 | Αγαθρχ
72 | Αγαθ
73 | Αγαθ.Ιστ
74 | Αγαλλ
75 | Αγαπητ
76 | Αγγ
77 | Αγησ
78 | Αγλ
79 | Αγορ.Κ
80 | Αγρο.Κωδ
81 | Αγρ.Εξ
82 | Αγρ.Κ
83 | Αγ.Γρ
84 | Αδριαν
85 | Αδρ
86 | Αετ
87 | Αθάν
88 | Αθήν
89 | Αθήν.Επιγρ
90 | Αθήν.Επιτ
91 | Αθήν.Ιατρ
92 | Αθήν.Μηχ
93 | Αθανάσ
94 | Αθαν
95 | Αθηνί
96 | Αθηναγ
97 | Αθηνόδ
98 | Αθ
99 | Αθ.Αρχ
100 | Αιλ
101 | Αιλ.Επιστ
102 | Αιλ.ΖΙ
103 | Αιλ.ΠΙ
104 | Αιλ.απ
105 | Αιμιλ
106 | Αιν.Γαζ
107 | Αιν.Τακτ
108 | Αισχίν
109 | Αισχίν.Επιστ
110 | Αισχ
111 | Αισχ.Αγαμ
112 | Αισχ.Αγ
113 | Αισχ.Αλ
114 | Αισχ.Ελεγ
115 | Αισχ.Επτ.Θ
116 | Αισχ.Ευμ
117 | Αισχ.Ικέτ
118 | Αισχ.Ικ
119 | Αισχ.Περσ
120 | Αισχ.Προμ.Δεσμ
121 | Αισχ.Πρ
122 | Αισχ.Χοηφ
123 | Αισχ.Χο
124 | Αισχ.απ
125 | ΑιτΕ
126 | Αιτ
127 | Αλκ
128 | Αλχιας
129 | Αμ.Π.Ο
130 | Αμβ
131 | Αμμών
132 | Αμ.
133 | Αν.Πειθ.Συμβ.Δικ
134 | Ανακρ
135 | Ανακ
136 | Αναμν.Τόμ
137 | Αναπλ
138 | Ανδ
139 | Ανθλγος
140 | Ανθστης
141 | Αντισθ
142 | Ανχης
143 | Αν
144 | Αποκ
145 | Απρ
146 | Απόδ
147 | Απόφ
148 | Απόφ.Νομ
149 | Απ
150 | Απ.Δαπ
151 | Απ.Διατ
152 | Απ.Επιστ
153 | Αριθ
154 | Αριστοτ
155 | Αριστοφ
156 | Αριστοφ.Όρν
157 | Αριστοφ.Αχ
158 | Αριστοφ.Βάτρ
159 | Αριστοφ.Ειρ
160 | Αριστοφ.Εκκλ
161 | Αριστοφ.Θεσμ
162 | Αριστοφ.Ιππ
163 | Αριστοφ.Λυσ
164 | Αριστοφ.Νεφ
165 | Αριστοφ.Πλ
166 | Αριστοφ.Σφ
167 | Αριστ
168 | Αριστ.Αθ.Πολ
169 | Αριστ.Αισθ
170 | Αριστ.Αν.Πρ
171 | Αριστ.Ζ.Ι
172 | Αριστ.Ηθ.Ευδ
173 | Αριστ.Ηθ.Νικ
174 | Αριστ.Κατ
175 | Αριστ.Μετ
176 | Αριστ.Πολ
177 | Αριστ.Φυσιογν
178 | Αριστ.Φυσ
179 | Αριστ.Ψυχ
180 | Αριστ.Ρητ
181 | Αρμεν
182 | Αρμ
183 | Αρχ.Εκ.Καν.Δ
184 | Αρχ.Ευβ.Μελ
185 | Αρχ.Ιδ.Δ
186 | Αρχ.Νομ
187 | Αρχ.Ν
188 | Αρχ.Π.Ε
189 | Αρ
190 | Αρ.Φορ.Μητρ
191 | Ασμ
192 | Ασμ.ασμ
193 | Αστ.Δ
194 | Αστ.Χρον
195 | Ασ
196 | Ατομ.Γνωμ
197 | Αυγ
198 | Αφρ
199 | Αχ.Νομ
200 | Α
201 | Α.Εγχ.Π
202 | Α.Κ.΄Υδρας
203 | Β΄Έσδρ
204 | Β΄Έσδ
205 | Β΄Βασ
206 | Β΄Θεσ
207 | Β΄Ιω
208 | Β΄Κορινθ
209 | Β΄Κορ
210 | Β΄Μακκ
211 | Β΄Μακ
212 | Β΄Πέτρ
213 | Β΄Πέτ
214 | Β΄Πέ
215 | Β΄Παραλ
216 | Β΄Σαμ
217 | Β΄Τιμ
218 | Β΄Χρον
219 | Β΄Χρ
220 | Β.Ι.Π.Ε
221 | Β.Κ.Τ
222 | Β.Κ.Ψ.Β
223 | Β.Μ
224 | Β.Ο.Α.Κ
225 | Β.Ο.Α
226 | Β.Ο.Δ
227 | Βίβλ
228 | Βαρ
229 | ΒεΘ
230 | Βι.Περ
231 | Βιπερ
232 | Βιργ
233 | Βλγ
234 | Βούλ
235 | Βρ
236 | Γ΄Βασ
237 | Γ΄Μακκ
238 | ΓΕΝμλ
239 | Γέν
240 | Γαλ
241 | Γεν
242 | Γλ
243 | Γν.Ν.Σ.Κρ
244 | Γνωμ
245 | Γν
246 | Γράμμ
247 | Γρηγ.Ναζ
248 | Γρηγ.Νύσ
249 | Γ Νοσ
250 | Γ' Ογκολ
251 | Γ.Ν
252 | Δ΄Βασ
253 | Δ.Β
254 | Δ.Δίκη
255 | Δ.Δίκ
256 | Δ.Ε.Σ
257 | Δ.Ε.Φ.Α
258 | Δ.Ε.Φ
259 | Δ.Εργ.Ν
260 | Δαμ
261 | Δαμ.μνημ.έργ
262 | Δαν
263 | Δασ.Κ
264 | Δεκ
265 | Δελτ.Δικ.Ε.Τ.Ε
266 | Δελτ.Νομ
267 | Δελτ.Συνδ.Α.Ε
268 | Δερμ
269 | Δευτ
270 | Δεύτ
271 | Δημοσθ
272 | Δημόκρ
273 | Δι.Δικ
274 | Διάτ
275 | Διαιτ.Απ
276 | Διαιτ
277 | Διαρκ.Στρατ
278 | Δικ
279 | Διοίκ.Πρωτ
280 | ΔιοικΔνη
281 | Διοικ.Εφ
282 | Διον.Αρ
283 | Διόρθ.Λαθ
284 | Δ.κ.Π
285 | Δνη
286 | Δν
287 | Δογμ.Όρος
288 | Δρ
289 | Δ.τ.Α
290 | Δτ
291 | ΔωδΝομ
292 | Δ.Περ
293 | Δ.Στρ
294 | ΕΔΠολ
295 | ΕΕυρΚ
296 | ΕΙΣ
297 | ΕΝαυτΔ
298 | ΕΣΑμΕΑ
299 | ΕΣΘ
300 | ΕΣυγκΔ
301 | ΕΤρΑξΧρΔ
302 | Ε.Φ.Ε.Τ
303 | Ε.Φ.Ι
304 | Ε.Φ.Ο.Επ.Α
305 | Εβδ
306 | Εβρ
307 | Εγκύκλ.Επιστ
308 | Εγκ
309 | Εε.Αιγ
310 | Εθν.Κ.Τ
311 | Εθν
312 | Ειδ.Δικ.Αγ.Κακ
313 | Εικ
314 | Ειρ.Αθ
315 | Ειρην.Αθ
316 | Ειρην
317 | Έλεγχ
318 | Ειρ
319 | Εισ.Α.Π
320 | Εισ.Ε
321 | Εισ.Ν.Α.Κ
322 | Εισ.Ν.Κ.Πολ.Δ
323 | Εισ.Πρωτ
324 | Εισηγ.Έκθ
325 | Εισ
326 | Εκκλ
327 | Εκκ
328 | Εκ
329 | Ελλ.Δνη
330 | Εν.Ε
331 | Εξ
332 | Επ.Αν
333 | Επ.Εργ.Δ
334 | Επ.Εφ
335 | Επ.Κυπ.Δ
336 | Επ.Μεσ.Αρχ
337 | Επ.Νομ
338 | Επίκτ
339 | Επίκ
340 | Επι.Δ.Ε
341 | Επιθ.Ναυτ.Δικ
342 | Επικ
343 | Επισκ.Ε.Δ
344 | Επισκ.Εμπ.Δικ
345 | Επιστ.Επετ.Αρμ
346 | Επιστ.Επετ
347 | Επιστ.Ιερ
348 | Επιτρ.Προστ.Συνδ.Στελ
349 | Επιφάν
350 | Επτ.Εφ
351 | Επ.Ιρ
352 | Επ.Ι
353 | Εργ.Ασφ.Νομ
354 | Ερμ.Α.Κ
355 | Ερμη.Σ
356 | Εσθ
357 | Εσπερ
358 | Ετρ.Δ
359 | Ευκλ
360 | Ευρ.Δ.Δ.Α
361 | Ευρ.Σ.Δ.Α
362 | Ευρ.ΣτΕ
363 | Ευρατόμ
364 | Ευρ.Άλκ
365 | Ευρ.Ανδρομ
366 | Ευρ.Βάκχ
367 | Ευρ.Εκ
368 | Ευρ.Ελ
369 | Ευρ.Ηλ
370 | Ευρ.Ηρακ
371 | Ευρ.Ηρ
372 | Ευρ.Ηρ.Μαιν
373 | Ευρ.Ικέτ
374 | Ευρ.Ιππόλ
375 | Ευρ.Ιφ.Α
376 | Ευρ.Ιφ.Τ
377 | Ευρ.Ι.Τ
378 | Ευρ.Κύκλ
379 | Ευρ.Μήδ
380 | Ευρ.Ορ
381 | Ευρ.Ρήσ
382 | Ευρ.Τρωάδ
383 | Ευρ.Φοίν
384 | Εφ.Αθ
385 | Εφ.Εν
386 | Εφ.Επ
387 | Εφ.Θρ
388 | Εφ.Θ
389 | Εφ.Ι
390 | Εφ.Κερ
391 | Εφ.Κρ
392 | Εφ.Λ
393 | Εφ.Ν
394 | Εφ.Πατ
395 | Εφ.Πειρ
396 | Εφαρμ.Δ.Δ
397 | Εφαρμ
398 | Εφεσ
399 | Εφημ
400 | Εφ
401 | Ζαχ
402 | Ζιγ
403 | Ζυ
404 | Ζχ
405 | ΗΕ.Δ
406 | Ημερ
407 | Ηράκλ
408 | Ηροδ
409 | Ησίοδ
410 | Ησ
411 | Η.Ε.Γ
412 | ΘΗΣ
413 | ΘΡ
414 | Θαλ
415 | Θεοδ
416 | Θεοφ
417 | Θεσ
418 | Θεόδ.Μοψ
419 | Θεόκρ
420 | Θεόφιλ
421 | Θουκ
422 | Θρ
423 | Θρ.Ε
424 | Θρ.Ιερ
425 | Θρ.Ιρ
426 | Ιακ
427 | Ιαν
428 | Ιβ
429 | Ιδθ
430 | Ιδ
431 | Ιεζ
432 | Ιερ
433 | Ιζ
434 | Ιησ
435 | Ιησ.Ν
436 | Ικ
437 | Ιλ
438 | Ιν
439 | Ιουδ
440 | Ιουστ
441 | Ιούδα
442 | Ιούλ
443 | Ιούν
444 | Ιπποκρ
445 | Ιππόλ
446 | Ιρ
447 | Ισίδ.Πηλ
448 | Ισοκρ
449 | Ισ.Ν
450 | Ιωβ
451 | Ιωλ
452 | Ιων
453 | Ιω
454 | ΚΟΣ
455 | ΚΟ.ΜΕ.ΚΟΝ
456 | ΚΠοινΔ
457 | ΚΠολΔ
458 | ΚαΒ
459 | Καλ
460 | Καλ.Τέχν
461 | ΚανΒ
462 | Καν.Διαδ
463 | Κατάργ
464 | Κλ
465 | ΚοινΔ
466 | Κολσ
467 | Κολ
468 | Κον
469 | Κορ
470 | Κος
471 | ΚριτΕπιθ
472 | ΚριτΕ
473 | Κριτ
474 | Κρ
475 | ΚτΒ
476 | ΚτΕ
477 | ΚτΠ
478 | Κυβ
479 | Κυπρ
480 | Κύριλ.Αλεξ
481 | Κύριλ.Ιερ
482 | Λεβ
483 | Λεξ.Σουίδα
484 | Λευϊτ
485 | Λευ
486 | Λκ
487 | Λογ
488 | ΛουκΑμ
489 | Λουκιαν
490 | Λουκ.Έρωτ
491 | Λουκ.Ενάλ.Διάλ
492 | Λουκ.Ερμ
493 | Λουκ.Εταιρ.Διάλ
494 | Λουκ.Ε.Δ
495 | Λουκ.Θε.Δ
496 | Λουκ.Ικ.
497 | Λουκ.Ιππ
498 | Λουκ.Λεξιφ
499 | Λουκ.Μεν
500 | Λουκ.Μισθ.Συν
501 | Λουκ.Ορχ
502 | Λουκ.Περ
503 | Λουκ.Συρ
504 | Λουκ.Τοξ
505 | Λουκ.Τυρ
506 | Λουκ.Φιλοψ
507 | Λουκ.Φιλ
508 | Λουκ.Χάρ
509 | Λουκ.
510 | Λουκ.Αλ
511 | Λοχ
512 | Λυδ
513 | Λυκ
514 | Λυσ
515 | Λωζ
516 | Λ1
517 | Λ2
518 | ΜΟΕφ
519 | Μάρκ
520 | Μέν
521 | Μαλ
522 | Ματθ
523 | Μα
524 | Μιχ
525 | Μκ
526 | Μλ
527 | Μμ
528 | Μον.Δ.Π
529 | Μον.Πρωτ
530 | Μον
531 | Μρ
532 | Μτ
533 | Μχ
534 | Μ.Βασ
535 | Μ.Πλ
536 | ΝΑ
537 | Ναυτ.Χρον
538 | Να
539 | Νδικ
540 | Νεεμ
541 | Νε
542 | Νικ
543 | ΝκΦ
544 | Νμ
545 | ΝοΒ
546 | Νομ.Δελτ.Τρ.Ελ
547 | Νομ.Δελτ
548 | Νομ.Σ.Κ
549 | Νομ.Χρ
550 | Νομ
551 | Νομ.Διεύθ
552 | Νοσ
553 | Ντ
554 | Νόσων
555 | Ν1
556 | Ν2
557 | Ν3
558 | Ν4
559 | Νtot
560 | Ξενοφ
561 | Ξεν
562 | Ξεν.Ανάβ
563 | Ξεν.Απολ
564 | Ξεν.Απομν
565 | Ξεν.Απομ
566 | Ξεν.Ελλ
567 | Ξεν.Ιέρ
568 | Ξεν.Ιππαρχ
569 | Ξεν.Ιππ
570 | Ξεν.Κυρ.Αν
571 | Ξεν.Κύρ.Παιδ
572 | Ξεν.Κ.Π
573 | Ξεν.Λακ.Πολ
574 | Ξεν.Οικ
575 | Ξεν.Προσ
576 | Ξεν.Συμπόσ
577 | Ξεν.Συμπ
578 | Ο΄
579 | Οβδ
580 | Οβ
581 | ΟικΕ
582 | Οικ
583 | Οικ.Πατρ
584 | Οικ.Σύν.Βατ
585 | Ολομ
586 | Ολ
587 | Ολ.Α.Π
588 | Ομ.Ιλ
589 | Ομ.Οδ
590 | ΟπΤοιχ
591 | Οράτ
592 | Ορθ
593 | ΠΡΟ.ΠΟ
594 | Πίνδ
595 | Πίνδ.Ι
596 | Πίνδ.Νεμ
597 | Πίνδ.Ν
598 | Πίνδ.Ολ
599 | Πίνδ.Παθ
600 | Πίνδ.Πυθ
601 | Πίνδ.Π
602 | ΠαγΝμλγ
603 | Παν
604 | Παρμ
605 | Παροιμ
606 | Παρ
607 | Παυσ
608 | Πειθ.Συμβ
609 | ΠειρΝ
610 | Πελ
611 | ΠεντΣτρ
612 | Πεντ
613 | Πεντ.Εφ
614 | ΠερΔικ
615 | Περ.Γεν.Νοσ
616 | Πετ
617 | Πλάτ
618 | Πλάτ.Αλκ
619 | Πλάτ.Αντ
620 | Πλάτ.Αξίοχ
621 | Πλάτ.Απόλ
622 | Πλάτ.Γοργ
623 | Πλάτ.Ευθ
624 | Πλάτ.Θεαίτ
625 | Πλάτ.Κρατ
626 | Πλάτ.Κριτ
627 | Πλάτ.Λύσ
628 | Πλάτ.Μεν
629 | Πλάτ.Νόμ
630 | Πλάτ.Πολιτ
631 | Πλάτ.Πολ
632 | Πλάτ.Πρωτ
633 | Πλάτ.Σοφ.
634 | Πλάτ.Συμπ
635 | Πλάτ.Τίμ
636 | Πλάτ.Φαίδρ
637 | Πλάτ.Φιλ
638 | Πλημ
639 | Πλούτ
640 | Πλούτ.Άρατ
641 | Πλούτ.Αιμ
642 | Πλούτ.Αλέξ
643 | Πλούτ.Αλκ
644 | Πλούτ.Αντ
645 | Πλούτ.Αρτ
646 | Πλούτ.Ηθ
647 | Πλούτ.Θεμ
648 | Πλούτ.Κάμ
649 | Πλούτ.Καίσ
650 | Πλούτ.Κικ
651 | Πλούτ.Κράσ
652 | Πλούτ.Κ
653 | Πλούτ.Λυκ
654 | Πλούτ.Μάρκ
655 | Πλούτ.Μάρ
656 | Πλούτ.Περ
657 | Πλούτ.Ρωμ
658 | Πλούτ.Σύλλ
659 | Πλούτ.Φλαμ
660 | Πλ
661 | Ποιν.Δικ
662 | Ποιν.Δ
663 | Ποιν.Ν
664 | Ποιν.Χρον
665 | Ποιν.Χρ
666 | Πολ.Δ
667 | Πολ.Πρωτ
668 | Πολ
669 | Πολ.Μηχ
670 | Πολ.Μ
671 | Πρακτ.Αναθ
672 | Πρακτ.Ολ
673 | Πραξ
674 | Πρμ
675 | Πρξ
676 | Πρωτ
677 | Πρ
678 | Πρ.Αν
679 | Πρ.Λογ
680 | Πταισμ
681 | Πυρ.Καλ
682 | Πόλη
683 | Π.Δ
684 | Π.Δ.Άσμ
685 | ΡΜ.Ε
686 | Ρθ
687 | Ρμ
688 | Ρωμ
689 | ΣΠλημ
690 | Σαπφ
691 | Σειρ
692 | Σολ
693 | Σοφ
694 | Σοφ.Αντιγ
695 | Σοφ.Αντ
696 | Σοφ.Αποσ
697 | Σοφ.Απ
698 | Σοφ.Ηλέκ
699 | Σοφ.Ηλ
700 | Σοφ.Οιδ.Κολ
701 | Σοφ.Οιδ.Τύρ
702 | Σοφ.Ο.Τ
703 | Σοφ.Σειρ
704 | Σοφ.Σολ
705 | Σοφ.Τραχ
706 | Σοφ.Φιλοκτ
707 | Σρ
708 | Σ.τ.Ε
709 | Σ.τ.Π
710 | Στρ.Π.Κ
711 | Στ.Ευρ
712 | Συζήτ
713 | Συλλ.Νομολ
714 | Συλ.Νομ
715 | ΣυμβΕπιθ
716 | Συμπ.Ν
717 | Συνθ.Αμ
718 | Συνθ.Ε.Ε
719 | Συνθ.Ε.Κ
720 | Συνθ.Ν
721 | Σφν
722 | Σφ
723 | Σφ.Σλ
724 | Σχ.Πολ.Δ
725 | Σχ.Συντ.Ε
726 | Σωσ
727 | Σύντ
728 | Σ.Πληρ
729 | ΤΘ
730 | ΤΣ.Δ
731 | Τίτ
732 | Τβ
733 | Τελ.Ενημ
734 | Τελ.Κ
735 | Τερτυλ
736 | Τιμ
737 | Τοπ.Α
738 | Τρ.Ο
739 | Τριμ
740 | Τριμ.Πλ
741 | Τρ.Πλημ
742 | Τρ.Π.Δ
743 | Τ.τ.Ε
744 | Ττ
745 | Τωβ
746 | Υγ
747 | Υπερ
748 | Υπ
749 | Υ.Γ
750 | Φιλήμ
751 | Φιλιπ
752 | Φιλ
753 | Φλμ
754 | Φλ
755 | Φορ.Β
756 | Φορ.Δ.Ε
757 | Φορ.Δνη
758 | Φορ.Δ
759 | Φορ.Επ
760 | Φώτ
761 | Χρ.Ι.Δ
762 | Χρ.Ιδ.Δ
763 | Χρ.Ο
764 | Χρυσ
765 | Ψήφ
766 | Ψαλμ
767 | Ψαλ
768 | Ψλ
769 | Ωριγ
770 | Ωσ
771 | Ω.Ρ.Λ
772 | άγν
773 | άγν.ετυμολ
774 | άγ
775 | άκλ
776 | άνθρ
777 | άπ
778 | άρθρ
779 | άρν
780 | άρ
781 | άτ
782 | άψ
783 | ά
784 | έκδ
785 | έκφρ
786 | έμψ
787 | ένθ.αν
788 | έτ
789 | έ.α
790 | ίδ
791 | αβεστ
792 | αβησσ
793 | αγγλ
794 | αγγ
795 | αδημ
796 | αεροναυτ
797 | αερον
798 | αεροπ
799 | αθλητ
800 | αθλ
801 | αθροιστ
802 | αιγυπτ
803 | αιγ
804 | αιτιολ
805 | αιτ
806 | αι
807 | ακαδ
808 | ακκαδ
809 | αλβ
810 | αλλ
811 | αλφαβητ
812 | αμα
813 | αμερικ
814 | αμερ
815 | αμετάβ
816 | αμτβ
817 | αμφιβ
818 | αμφισβ
819 | αμφ
820 | αμ
821 | ανάλ
822 | ανάπτ
823 | ανάτ
824 | αναβ
825 | αναδαν
826 | αναδιπλασ
827 | αναδιπλ
828 | αναδρ
829 | αναλ
830 | αναν
831 | ανασυλλ
832 | ανατολ
833 | ανατομ
834 | ανατυπ
835 | ανατ
836 | αναφορ
837 | αναφ
838 | ανα.ε
839 | ανδρων
840 | ανθρωπολ
841 | ανθρωπ
842 | ανθ
843 | ανομ
844 | αντίτ
845 | αντδ
846 | αντιγρ
847 | αντιθ
848 | αντικ
849 | αντιμετάθ
850 | αντων
851 | αντ
852 | ανωτ
853 | ανόργ
854 | ανών
855 | αορ
856 | απαρέμφ
857 | απαρφ
858 | απαρχ
859 | απαρ
860 | απλολ
861 | απλοπ
862 | αποβ
863 | αποηχηροπ
864 | αποθ
865 | αποκρυφ
866 | αποφ
867 | απρμφ
868 | απρφ
869 | απρόσ
870 | απόδ
871 | απόλ
872 | απόσπ
873 | απόφ
874 | αραβοτουρκ
875 | αραβ
876 | αραμ
877 | αρβαν
878 | αργκ
879 | αριθμτ
880 | αριθμ
881 | αριθ
882 | αρκτικόλ
883 | αρκ
884 | αρμεν
885 | αρμ
886 | αρνητ
887 | αρσ
888 | αρχαιολ
889 | αρχιτεκτ
890 | αρχιτ
891 | αρχκ
892 | αρχ
893 | αρωμουν
894 | αρωμ
895 | αρ
896 | αρ.μετρ
897 | αρ.φ
898 | ασσυρ
899 | αστρολ
900 | αστροναυτ
901 | αστρον
902 | αττ
903 | αυστραλ
904 | αυτοπ
905 | αυτ
906 | αφγαν
907 | αφηρ
908 | αφομ
909 | αφρικ
910 | αχώρ
911 | αόρ
912 | α.α
913 | α/α
914 | α0
915 | βαθμ
916 | βαθ
917 | βαπτ
918 | βασκ
919 | βεβαιωτ
920 | βεβ
921 | βεδ
922 | βενετ
923 | βεν
924 | βερβερ
925 | βιβλγρ
926 | βιολ
927 | βιομ
928 | βιοχημ
929 | βιοχ
930 | βλάχ
931 | βλ
932 | βλ.λ
933 | βοταν
934 | βοτ
935 | βουλγαρ
936 | βουλγ
937 | βούλ
938 | βραζιλ
939 | βρετον
940 | βόρ
941 | γαλλ
942 | γενικότ
943 | γενοβ
944 | γεν
945 | γερμαν
946 | γερμ
947 | γεωγρ
948 | γεωλ
949 | γεωμετρ
950 | γεωμ
951 | γεωπ
952 | γεωργ
953 | γλυπτ
954 | γλωσσολ
955 | γλωσσ
956 | γλ
957 | γνμδ
958 | γνμ
959 | γνωμ
960 | γοτθ
961 | γραμμ
962 | γραμ
963 | γρμ
964 | γρ
965 | γυμν
966 | δίδες
967 | δίκ
968 | δίφθ
969 | δαν
970 | δεικτ
971 | δεκατ
972 | δηλ
973 | δημογρ
974 | δημοτ
975 | δημώδ
976 | δημ
977 | διάγρ
978 | διάκρ
979 | διάλεξ
980 | διάλ
981 | διάσπ
982 | διαλεκτ
983 | διατρ
984 | διαφ
985 | διαχ
986 | διδα
987 | διεθν
988 | διεθ
989 | δικον
990 | διστ
991 | δισύλλ
992 | δισ
993 | διφθογγοπ
994 | δογμ
995 | δολ
996 | δοτ
997 | δρμ
998 | δρχ
999 | δρ(α)
1000 | δωρ
1001 | δ
1002 | εβρ
1003 | εγκλπ
1004 | εδ
1005 | εθνολ
1006 | εθν
1007 | ειδικότ
1008 | ειδ
1009 | ειδ.β
1010 | εικ
1011 | ειρ
1012 | εισ
1013 | εκατοστμ
1014 | εκατοστ
1015 | εκατστ.2
1016 | εκατστ.3
1017 | εκατ
1018 | εκδ
1019 | εκκλησ
1020 | εκκλ
1021 | εκ
1022 | ελλην
1023 | ελλ
1024 | ελνστ
1025 | ελπ
1026 | εμβ
1027 | εμφ
1028 | εναλλ
1029 | ενδ
1030 | ενεργ
1031 | ενεστ
1032 | ενικ
1033 | ενν
1034 | εν
1035 | εξέλ
1036 | εξακολ
1037 | εξομάλ
1038 | εξ
1039 | εο
1040 | επέκτ
1041 | επίδρ
1042 | επίθ
1043 | επίρρ
1044 | επίσ
1045 | επαγγελμ
1046 | επανάλ
1047 | επανέκδ
1048 | επιθ
1049 | επικ
1050 | επιμ
1051 | επιρρ
1052 | επιστ
1053 | επιτατ
1054 | επιφ
1055 | επών
1056 | επ
1057 | εργ
1058 | ερμ
1059 | ερρινοπ
1060 | ερωτ
1061 | ετρουσκ
1062 | ετυμ
1063 | ετ
1064 | ευφ
1065 | ευχετ
1066 | εφ
1067 | εύχρ
1068 | ε.α
1069 | ε/υ
1070 | ε0
1071 | ζωγρ
1072 | ζωολ
1073 | ηθικ
1074 | ηθ
1075 | ηλεκτρολ
1076 | ηλεκτρον
1077 | ηλεκτρ
1078 | ημίτ
1079 | ημίφ
1080 | ημιφ
1081 | ηχηροπ
1082 | ηχηρ
1083 | ηχομιμ
1084 | ηχ
1085 | η
1086 | θέατρ
1087 | θεολ
1088 | θετ
1089 | θηλ
1090 | θρακ
1091 | θρησκειολ
1092 | θρησκ
1093 | θ
1094 | ιαπων
1095 | ιατρ
1096 | ιδιωμ
1097 | ιδ
1098 | ινδ
1099 | ιραν
1100 | ισπαν
1101 | ιστορ
1102 | ιστ
1103 | ισχυροπ
1104 | ιταλ
1105 | ιχθυολ
1106 | ιων
1107 | κάτ
1108 | καθ
1109 | κακοσ
1110 | καν
1111 | καρ
1112 | κατάλ
1113 | κατατ
1114 | κατωτ
1115 | κατ
1116 | κα
1117 | κελτ
1118 | κεφ
1119 | κινεζ
1120 | κινημ
1121 | κλητ
1122 | κλιτ
1123 | κλπ
1124 | κλ
1125 | κν
1126 | κοινωνιολ
1127 | κοινων
1128 | κοπτ
1129 | κουτσοβλαχ
1130 | κουτσοβλ
1131 | κπ
1132 | κρ.γν
1133 | κτγ
1134 | κτην
1135 | κτητ
1136 | κτλ
1137 | κτ
1138 | κυριολ
1139 | κυρ
1140 | κύρ
1141 | κ
1142 | κ.ά
1143 | κ.ά.π
1144 | κ.α
1145 | κ.εξ
1146 | κ.επ
1147 | κ.ε
1148 | κ.λπ
1149 | κ.λ.π
1150 | κ.ού.κ
1151 | κ.ο.κ
1152 | κ.τ.λ
1153 | κ.τ.τ
1154 | κ.τ.ό
1155 | λέξ
1156 | λαογρ
1157 | λαπ
1158 | λατιν
1159 | λατ
1160 | λαϊκότρ
1161 | λαϊκ
1162 | λετ
1163 | λιθ
1164 | λογιστ
1165 | λογοτ
1166 | λογ
1167 | λουβ
1168 | λυδ
1169 | λόγ
1170 | λ
1171 | λ.χ
1172 | μέλλ
1173 | μέσ
1174 | μαθημ
1175 | μαθ
1176 | μαιευτ
1177 | μαλαισ
1178 | μαλτ
1179 | μαμμων
1180 | μεγεθ
1181 | μεε
1182 | μειωτ
1183 | μελ
1184 | μεξ
1185 | μεσν
1186 | μεσογ
1187 | μεσοπαθ
1188 | μεσοφ
1189 | μετάθ
1190 | μεταβτ
1191 | μεταβ
1192 | μετακ
1193 | μεταπλ
1194 | μεταπτωτ
1195 | μεταρ
1196 | μεταφορ
1197 | μετβ
1198 | μετεπιθ
1199 | μετεπιρρ
1200 | μετεωρολ
1201 | μετεωρ
1202 | μετον
1203 | μετουσ
1204 | μετοχ
1205 | μετρ
1206 | μετ
1207 | μητρων
1208 | μηχανολ
1209 | μηχ
1210 | μικροβιολ
1211 | μογγολ
1212 | μορφολ
1213 | μουσ
1214 | μπενελούξ
1215 | μσνλατ
1216 | μσν
1217 | μτβ
1218 | μτγν
1219 | μτγ
1220 | μτφρδ
1221 | μτφρ
1222 | μτφ
1223 | μτχ
1224 | μυθ
1225 | μυκην
1226 | μυκ
1227 | μφ
1228 | μ
1229 | μ.ε
1230 | μ.μ
1231 | μ.π.ε
1232 | μ.π.π
1233 | μ0
1234 | ναυτ
1235 | νεοελλ
1236 | νεολατιν
1237 | νεολατ
1238 | νεολ
1239 | νεότ
1240 | νλατ
1241 | νομ
1242 | νορβ
1243 | νοσ
1244 | νότ
1245 | ν
1246 | ξ.λ
1247 | οικοδ
1248 | οικολ
1249 | οικον
1250 | οικ
1251 | ολλανδ
1252 | ολλ
1253 | ομηρ
1254 | ομόρρ
1255 | ονομ
1256 | ον
1257 | οπτ
1258 | ορθογρ
1259 | ορθ
1260 | οριστ
1261 | ορυκτολ
1262 | ορυκτ
1263 | ορ
1264 | οσετ
1265 | οσκ
1266 | ουαλ
1267 | ουγγρ
1268 | ουδ
1269 | ουσιαστικοπ
1270 | ουσιαστ
1271 | ουσ
1272 | πίν
1273 | παθητ
1274 | παθολ
1275 | παθ
1276 | παιδ
1277 | παλαιοντ
1278 | παλαιότ
1279 | παλ
1280 | παππων
1281 | παράγρ
1282 | παράγ
1283 | παράλλ
1284 | παράλ
1285 | παραγ
1286 | παρακ
1287 | παραλ
1288 | παραπ
1289 | παρατ
1290 | παρβ
1291 | παρετυμ
1292 | παροξ
1293 | παρων
1294 | παρωχ
1295 | παρ
1296 | παρ.φρ
1297 | πατριδων
1298 | πατρων
1299 | πβ
1300 | περιθ
1301 | περιλ
1302 | περιφρ
1303 | περσ
1304 | περ
1305 | πιθ
1306 | πληθ
1307 | πληροφ
1308 | ποδ
1309 | ποιητ
1310 | πολιτ
1311 | πολλαπλ
1312 | πολ
1313 | πορτογαλ
1314 | πορτ
1315 | ποσ
1316 | πρακριτ
1317 | πρβλ
1318 | πρβ
1319 | πργ
1320 | πρκμ
1321 | πρκ
1322 | πρλ
1323 | προέλ
1324 | προβηγκ
1325 | προελλ
1326 | προηγ
1327 | προθεμ
1328 | προπαραλ
1329 | προπαροξ
1330 | προπερισπ
1331 | προσαρμ
1332 | προσηγορ
1333 | προσταχτ
1334 | προστ
1335 | προσφών
1336 | προσ
1337 | προτακτ
1338 | προτ.Εισ
1339 | προφ
1340 | προχωρ
1341 | πρτ
1342 | πρόθ
1343 | πρόσθ
1344 | πρόσ
1345 | πρότ
1346 | πρ
1347 | πρ.Εφ
1348 | πτ
1349 | πυ
1350 | π
1351 | π.Χ
1352 | π.μ
1353 | π.χ
1354 | ρήμ
1355 | ρίζ
1356 | ρηματ
1357 | ρητορ
1358 | ριν
1359 | ρουμ
1360 | ρωμ
1361 | ρωσ
1362 | ρ
1363 | σανσκρ
1364 | σαξ
1365 | σελ
1366 | σερβοκρ
1367 | σερβ
1368 | σημασιολ
1369 | σημδ
1370 | σημειολ
1371 | σημερ
1372 | σημιτ
1373 | σημ
1374 | σκανδ
1375 | σκυθ
1376 | σκωπτ
1377 | σλαβ
1378 | σλοβ
1379 | σουηδ
1380 | σουμερ
1381 | σουπ
1382 | σπάν
1383 | σπανιότ
1384 | σπ
1385 | σσ
1386 | στατ
1387 | στερ
1388 | στιγμ
1389 | στιχ
1390 | στρέμ
1391 | στρατιωτ
1392 | στρατ
1393 | στ
1394 | συγγ
1395 | συγκρ
1396 | συγκ
1397 | συμπερ
1398 | συμπλεκτ
1399 | συμπλ
1400 | συμπροφ
1401 | συμφυρ
1402 | συμφ
1403 | συνήθ
1404 | συνίζ
1405 | συναίρ
1406 | συναισθ
1407 | συνδετ
1408 | συνδ
1409 | συνεκδ
1410 | συνηρ
1411 | συνθετ
1412 | συνθ
1413 | συνοπτ
1414 | συντελ
1415 | συντομογρ
1416 | συντ
1417 | συν
1418 | συρ
1419 | σχημ
1420 | σχ
1421 | σύγκρ
1422 | σύμπλ
1423 | σύμφ
1424 | σύνδ
1425 | σύνθ
1426 | σύντμ
1427 | σύντ
1428 | σ
1429 | σ.π
1430 | σ/β
1431 | τακτ
1432 | τελ
1433 | τετρ
1434 | τετρ.μ
1435 | τεχνλ
1436 | τεχνολ
1437 | τεχν
1438 | τεύχ
1439 | τηλεπικ
1440 | τηλεόρ
1441 | τιμ
1442 | τιμ.τομ
1443 | τοΣ
1444 | τον
1445 | τοπογρ
1446 | τοπων
1447 | τοπ
1448 | τοσκ
1449 | τουρκ
1450 | τοχ
1451 | τριτοπρόσ
1452 | τροποπ
1453 | τροπ
1454 | τσεχ
1455 | τσιγγ
1456 | ττ
1457 | τυπ
1458 | τόμ
1459 | τόνν
1460 | τ
1461 | τ.μ
1462 | τ.χλμ
1463 | υβρ
1464 | υπερθ
1465 | υπερσ
1466 | υπερ
1467 | υπεύθ
1468 | υποθ
1469 | υποκορ
1470 | υποκ
1471 | υποσημ
1472 | υποτ
1473 | υποφ
1474 | υποχωρ
1475 | υπόλ
1476 | υπόχρ
1477 | υπ
1478 | υστλατ
1479 | υψόμ
1480 | υψ
1481 | φάκ
1482 | φαρμακολ
1483 | φαρμ
1484 | φιλολ
1485 | φιλοσ
1486 | φιλοτ
1487 | φινλ
1488 | φοινικ
1489 | φράγκ
1490 | φρανκον
1491 | φριζ
1492 | φρ
1493 | φυλλ
1494 | φυσιολ
1495 | φυσ
1496 | φωνηεντ
1497 | φωνητ
1498 | φωνολ
1499 | φων
1500 | φωτογρ
1501 | φ
1502 | φ.τ.μ
1503 | χαμιτ
1504 | χαρτόσ
1505 | χαρτ
1506 | χασμ
1507 | χαϊδ
1508 | χγφ
1509 | χειλ
1510 | χεττ
1511 | χημ
1512 | χιλ
1513 | χλγρ
1514 | χλγ
1515 | χλμ
1516 | χλμ.2
1517 | χλμ.3
1518 | χλσγρ
1519 | χλστγρ
1520 | χλστμ
1521 | χλστμ.2
1522 | χλστμ.3
1523 | χλ
1524 | χργρ
1525 | χρημ
1526 | χρον
1527 | χρ
1528 | χφ
1529 | χ.ε
1530 | χ.κ
1531 | χ.ο
1532 | χ.σ
1533 | χ.τ
1534 | χ.χ
1535 | ψευδ
1536 | ψυχαν
1537 | ψυχιατρ
1538 | ψυχολ
1539 | ψυχ
1540 | ωκεαν
1541 | όμ
1542 | όν
1543 | όπ.παρ
1544 | όπ.π
1545 | ό.π
1546 | ύψ
1547 | 1Βσ
1548 | 1Εσ
1549 | 1Θσ
1550 | 1Ιν
1551 | 1Κρ
1552 | 1Μκ
1553 | 1Πρ
1554 | 1Πτ
1555 | 1Τμ
1556 | 2Βσ
1557 | 2Εσ
1558 | 2Θσ
1559 | 2Ιν
1560 | 2Κρ
1561 | 2Μκ
1562 | 2Πρ
1563 | 2Πτ
1564 | 2Τμ
1565 | 3Βσ
1566 | 3Ιν
1567 | 3Μκ
1568 | 4Βσ
1569 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.en:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
5 | #usually upper case letters are initials in a name
6 | A
7 | B
8 | C
9 | D
10 | E
11 | F
12 | G
13 | H
14 | I
15 | J
16 | K
17 | L
18 | M
19 | N
20 | O
21 | P
22 | Q
23 | R
24 | S
25 | T
26 | U
27 | V
28 | W
29 | X
30 | Y
31 | Z
32 |
33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
34 | Adj
35 | Adm
36 | Adv
37 | Asst
38 | Bart
39 | Bldg
40 | Brig
41 | Bros
42 | Capt
43 | Cmdr
44 | Col
45 | Comdr
46 | Con
47 | Corp
48 | Cpl
49 | DR
50 | Dr
51 | Drs
52 | Ens
53 | Gen
54 | Gov
55 | Hon
56 | Hr
57 | Hosp
58 | Insp
59 | Lt
60 | MM
61 | MR
62 | MRS
63 | MS
64 | Maj
65 | Messrs
66 | Mlle
67 | Mme
68 | Mr
69 | Mrs
70 | Ms
71 | Msgr
72 | Op
73 | Ord
74 | Pfc
75 | Ph
76 | Prof
77 | Pvt
78 | Rep
79 | Reps
80 | Res
81 | Rev
82 | Rt
83 | Sen
84 | Sens
85 | Sfc
86 | Sgt
87 | Sr
88 | St
89 | Supt
90 | Surg
91 |
92 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
93 | v
94 | vs
95 | i.e
96 | rev
97 | e.g
98 |
99 | #Numbers only. These should only induce breaks when followed by a numeric sequence
100 | # add NUMERIC_ONLY after the word for this function
101 | #This case is mostly for the english "No." which can either be a sentence of its own, or
102 | #if followed by a number, a non-breaking prefix
103 | No #NUMERIC_ONLY#
104 | Nos
105 | Art #NUMERIC_ONLY#
106 | Nr
107 | pp #NUMERIC_ONLY#
108 |
109 | #month abbreviations
110 | Jan
111 | Feb
112 | Mar
113 | Apr
114 | #May is a full word
115 | Jun
116 | Jul
117 | Aug
118 | Sep
119 | Oct
120 | Nov
121 | Dec
122 |
--------------------------------------------------------------------------------
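The header comments in the English prefix file above describe the convention: a listed token followed by a period does not end a sentence, and entries tagged `#NUMERIC_ONLY#` are non-breaking only when the next token starts with a digit. As a rough sketch of how such a list is typically consumed (the function names and parsing helper below are illustrative, not part of this repository; the actual handling lives in tokenizer.perl):

```python
import re

def load_prefixes(path):
    """Parse a nonbreaking_prefix.* file into {prefix: kind}, where kind is
    1 (always non-breaking) or 2 (non-breaking only before numbers)."""
    prefixes = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if "#NUMERIC_ONLY#" in line:
                prefixes[line.split()[0]] = 2
            else:
                prefixes[line] = 1
    return prefixes

def breaks_sentence(token, next_token, prefixes):
    """Decide whether a period after `token` should count as a sentence end."""
    word = token.rstrip(".")
    kind = prefixes.get(word)
    if kind == 1:                                  # e.g. "Mr." never breaks
        return False
    if kind == 2 and re.match(r"\d", next_token):  # e.g. "No. 7" does not break
        return False
    return True
```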
/data/nonbreaking_prefixes/nonbreaking_prefix.es:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender
5 | #usually upper case letters are initials in a name
6 | A
7 | B
8 | C
9 | D
10 | E
11 | F
12 | G
13 | H
14 | I
15 | J
16 | K
17 | L
18 | M
19 | N
20 | O
21 | P
22 | Q
23 | R
24 | S
25 | T
26 | U
27 | V
28 | W
29 | X
30 | Y
31 | Z
32 |
33 | # Period-final abbreviation list from http://www.ctspanish.com/words/abbreviations.htm
34 |
35 | A.C
36 | Apdo
37 | Av
38 | Bco
39 | CC.AA
40 | Da
41 | Dep
42 | Dn
43 | Dr
44 | Dra
45 | EE.UU
46 | Excmo
47 | FF.CC
48 | Fil
49 | Gral
50 | J.C
51 | Let
52 | Lic
53 | N.B
54 | P.D
55 | P.V.P
56 | Prof
57 | Pts
58 | Rte
59 | S.A
60 | S.A.R
61 | S.E
62 | S.L
63 | S.R.C
64 | Sr
65 | Sra
66 | Srta
67 | Sta
68 | Sto
69 | T.V.E
70 | Tel
71 | Ud
72 | Uds
73 | V.B
74 | V.E
75 | Vd
76 | Vds
77 | a/c
78 | adj
79 | admón
80 | afmo
81 | apdo
82 | av
83 | c
84 | c.f
85 | c.g
86 | cap
87 | cm
88 | cta
89 | dcha
90 | doc
91 | ej
92 | entlo
93 | esq
94 | etc
95 | f.c
96 | gr
97 | grs
98 | izq
99 | kg
100 | km
101 | mg
102 | mm
103 | núm
104 | núm
105 | p
106 | p.a
107 | p.ej
108 | ptas
109 | pág
110 | págs
111 | pág
112 | págs
113 | q.e.g.e
114 | q.e.s.m
115 | s
116 | s.s.s
117 | vid
118 | vol
119 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.fi:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT
2 | #indicate an end-of-sentence marker. Special cases are included for prefixes
3 | #that ONLY appear before 0-9 numbers.
4 |
5 | #This list is compiled from omorfi database
6 | #by Tommi A Pirinen.
7 |
8 |
9 | #any single upper case letter followed by a period is not a sentence ender
10 | A
11 | B
12 | C
13 | D
14 | E
15 | F
16 | G
17 | H
18 | I
19 | J
20 | K
21 | L
22 | M
23 | N
24 | O
25 | P
26 | Q
27 | R
28 | S
29 | T
30 | U
31 | V
32 | W
33 | X
34 | Y
35 | Z
36 | Å
37 | Ä
38 | Ö
39 |
40 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
41 | alik
42 | alil
43 | amir
44 | apul
45 | apul.prof
46 | arkkit
47 | ass
48 | assist
49 | dipl
50 | dipl.arkkit
51 | dipl.ekon
52 | dipl.ins
53 | dipl.kielenk
54 | dipl.kirjeenv
55 | dipl.kosm
56 | dipl.urk
57 | dos
58 | erikoiseläinl
59 | erikoishammasl
60 | erikoisl
61 | erikoist
62 | ev.luutn
63 | evp
64 | fil
65 | ft
66 | hallinton
67 | hallintot
68 | hammaslääket
69 | jatk
70 | jääk
71 | kansaned
72 | kapt
73 | kapt.luutn
74 | kenr
75 | kenr.luutn
76 | kenr.maj
77 | kers
78 | kirjeenv
79 | kom
80 | kom.kapt
81 | komm
82 | konst
83 | korpr
84 | luutn
85 | maist
86 | maj
87 | Mr
88 | Mrs
89 | Ms
90 | M.Sc
91 | neuv
92 | nimim
93 | Ph.D
94 | prof
95 | puh.joht
96 | pääll
97 | res
98 | san
99 | siht
100 | suom
101 | sähköp
102 | säv
103 | toht
104 | toim
105 | toim.apul
106 | toim.joht
107 | toim.siht
108 | tuom
109 | ups
110 | vänr
111 | vääp
112 | ye.ups
113 | ylik
114 | ylil
115 | ylim
116 | ylimatr
117 | yliop
118 | yliopp
119 | ylip
120 | yliv
121 |
122 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall
123 | #into this category - it sometimes ends a sentence)
124 | e.g
125 | ent
126 | esim
127 | huom
128 | i.e
129 | ilm
130 | l
131 | mm
132 | myöh
133 | nk
134 | nyk
135 | par
136 | po
137 | t
138 | v
139 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.fr:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 | #
4 | #any single upper case letter followed by a period is not a sentence ender
5 | #usually upper case letters are initials in a name
6 | #no French words end in single lower-case letters, so we throw those in too?
7 | A
8 | B
9 | C
10 | D
11 | E
12 | F
13 | G
14 | H
15 | I
16 | J
17 | K
18 | L
19 | M
20 | N
21 | O
22 | P
23 | Q
24 | R
25 | S
26 | T
27 | U
28 | V
29 | W
30 | X
31 | Y
32 | Z
33 | a
34 | b
35 | c
36 | d
37 | e
38 | f
39 | g
40 | h
41 | i
42 | j
43 | k
44 | l
45 | m
46 | n
47 | o
48 | p
49 | q
50 | r
51 | s
52 | t
53 | u
54 | v
55 | w
56 | x
57 | y
58 | z
59 |
60 | # Period-final abbreviation list for French
61 | A.C.N
62 | A.M
63 | art
64 | ann
65 | apr
66 | av
67 | auj
68 | lib
69 | B.P
70 | boul
71 | ca
72 | c.-à-d
73 | cf
74 | ch.-l
75 | chap
76 | contr
77 | C.P.I
78 | C.Q.F.D
79 | C.N
80 | C.N.S
81 | C.S
82 | dir
83 | éd
84 | e.g
85 | env
86 | al
87 | etc
88 | E.V
89 | ex
90 | fasc
91 | fém
92 | fig
93 | fr
94 | hab
95 | ibid
96 | id
97 | i.e
98 | inf
99 | LL.AA
100 | LL.AA.II
101 | LL.AA.RR
102 | LL.AA.SS
103 | L.D
104 | LL.EE
105 | LL.MM
106 | LL.MM.II.RR
107 | loc.cit
108 | masc
109 | MM
110 | ms
111 | N.B
112 | N.D.A
113 | N.D.L.R
114 | N.D.T
115 | n/réf
116 | NN.SS
117 | N.S
118 | N.D
119 | N.P.A.I
120 | p.c.c
121 | pl
122 | pp
123 | p.ex
124 | p.j
125 | P.S
126 | R.A.S
127 | R.-V
128 | R.P
129 | R.I.P
130 | SS
131 | S.S
132 | S.A
133 | S.A.I
134 | S.A.R
135 | S.A.S
136 | S.E
137 | sec
138 | sect
139 | sing
140 | S.M
141 | S.M.I.R
142 | sq
143 | sqq
144 | suiv
145 | sup
146 | suppl
147 | tél
148 | T.S.V.P
149 | vb
150 | vol
151 | vs
152 | X.O
153 | Z.I
154 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.hu:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
5 | #usually upper case letters are initials in a name
6 | A
7 | B
8 | C
9 | D
10 | E
11 | F
12 | G
13 | H
14 | I
15 | J
16 | K
17 | L
18 | M
19 | N
20 | O
21 | P
22 | Q
23 | R
24 | S
25 | T
26 | U
27 | V
28 | W
29 | X
30 | Y
31 | Z
32 | Á
33 | É
34 | Í
35 | Ó
36 | Ö
37 | Ő
38 | Ú
39 | Ü
40 | Ű
41 |
42 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
43 | Dr
44 | dr
45 | kb
46 | Kb
47 | vö
48 | Vö
49 | pl
50 | Pl
51 | ca
52 | Ca
53 | min
54 | Min
55 | max
56 | Max
57 | ún
58 | Ún
59 | prof
60 | Prof
61 | de
62 | De
63 | du
64 | Du
65 | Szt
66 | St
67 |
68 | #Numbers only. These should only induce breaks when followed by a numeric sequence
69 | # add NUMERIC_ONLY after the word for this function
70 | #This case is mostly for the english "No." which can either be a sentence of its own, or
71 | #if followed by a number, a non-breaking prefix
72 |
73 | # Month name abbreviations
74 | jan #NUMERIC_ONLY#
75 | Jan #NUMERIC_ONLY#
76 | Feb #NUMERIC_ONLY#
77 | feb #NUMERIC_ONLY#
78 | márc #NUMERIC_ONLY#
79 | Márc #NUMERIC_ONLY#
80 | ápr #NUMERIC_ONLY#
81 | Ápr #NUMERIC_ONLY#
82 | máj #NUMERIC_ONLY#
83 | Máj #NUMERIC_ONLY#
84 | jún #NUMERIC_ONLY#
85 | Jún #NUMERIC_ONLY#
86 | Júl #NUMERIC_ONLY#
87 | júl #NUMERIC_ONLY#
88 | aug #NUMERIC_ONLY#
89 | Aug #NUMERIC_ONLY#
90 | Szept #NUMERIC_ONLY#
91 | szept #NUMERIC_ONLY#
92 | okt #NUMERIC_ONLY#
93 | Okt #NUMERIC_ONLY#
94 | nov #NUMERIC_ONLY#
95 | Nov #NUMERIC_ONLY#
96 | dec #NUMERIC_ONLY#
97 | Dec #NUMERIC_ONLY#
98 |
99 | # Other abbreviations
100 | tel #NUMERIC_ONLY#
101 | Tel #NUMERIC_ONLY#
102 | Fax #NUMERIC_ONLY#
103 | fax #NUMERIC_ONLY#
104 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.is:
--------------------------------------------------------------------------------
1 | no #NUMERIC_ONLY#
2 | No #NUMERIC_ONLY#
3 | nr #NUMERIC_ONLY#
4 | Nr #NUMERIC_ONLY#
5 | nR #NUMERIC_ONLY#
6 | NR #NUMERIC_ONLY#
7 | a
8 | b
9 | c
10 | d
11 | e
12 | f
13 | g
14 | h
15 | i
16 | j
17 | k
18 | l
19 | m
20 | n
21 | o
22 | p
23 | q
24 | r
25 | s
26 | t
27 | u
28 | v
29 | w
30 | x
31 | y
32 | z
33 | ^
34 | í
35 | á
36 | ó
37 | æ
38 | A
39 | B
40 | C
41 | D
42 | E
43 | F
44 | G
45 | H
46 | I
47 | J
48 | K
49 | L
50 | M
51 | N
52 | O
53 | P
54 | Q
55 | R
56 | S
57 | T
58 | U
59 | V
60 | W
61 | X
62 | Y
63 | Z
64 | ab.fn
65 | a.fn
66 | afs
67 | al
68 | alm
69 | alg
70 | andh
71 | ath
72 | aths
73 | atr
74 | ao
75 | au
76 | aukaf
77 | áfn
78 | áhrl.s
79 | áhrs
80 | ákv.gr
81 | ákv
82 | bh
83 | bls
84 | dr
85 | e.Kr
86 | et
87 | ef
88 | efn
89 | ennfr
90 | eink
91 | end
92 | e.st
93 | erl
94 | fél
95 | fskj
96 | fh
97 | f.hl
98 | físl
99 | fl
100 | fn
101 | fo
102 | forl
103 | frb
104 | frl
105 | frh
106 | frt
107 | fsl
108 | fsh
109 | fs
110 | fsk
111 | fst
112 | f.Kr
113 | ft
114 | fv
115 | fyrrn
116 | fyrrv
117 | germ
118 | gm
119 | gr
120 | hdl
121 | hdr
122 | hf
123 | hl
124 | hlsk
125 | hljsk
126 | hljv
127 | hljóðv
128 | hr
129 | hv
130 | hvk
131 | holl
132 | Hos
133 | höf
134 | hk
135 | hrl
136 | ísl
137 | kaf
138 | kap
139 | Khöfn
140 | kk
141 | kg
142 | kk
143 | km
144 | kl
145 | klst
146 | kr
147 | kt
148 | kgúrsk
149 | kvk
150 | leturbr
151 | lh
152 | lh.nt
153 | lh.þt
154 | lo
155 | ltr
156 | mlja
157 | mljó
158 | millj
159 | mm
160 | mms
161 | m.fl
162 | miðm
163 | mgr
164 | mst
165 | mín
166 | nf
167 | nh
168 | nhm
169 | nl
170 | nk
171 | nmgr
172 | no
173 | núv
174 | nt
175 | o.áfr
176 | o.m.fl
177 | ohf
178 | o.fl
179 | o.s.frv
180 | ófn
181 | ób
182 | óákv.gr
183 | óákv
184 | pfn
185 | PR
186 | pr
187 | Ritstj
188 | Rvík
189 | Rvk
190 | samb
191 | samhlj
192 | samn
193 | samn
194 | sbr
195 | sek
196 | sérn
197 | sf
198 | sfn
199 | sh
200 | sfn
201 | sh
202 | s.hl
203 | sk
204 | skv
205 | sl
206 | sn
207 | so
208 | ss.us
209 | s.st
210 | samþ
211 | sbr
212 | shlj
213 | sign
214 | skál
215 | st
216 | st.s
217 | stk
218 | sþ
219 | teg
220 | tbl
221 | tfn
222 | tl
223 | tvíhlj
224 | tvt
225 | till
226 | to
227 | umr
228 | uh
229 | us
230 | uppl
231 | útg
232 | vb
233 | Vf
234 | vh
235 | vkf
236 | Vl
237 | vl
238 | vlf
239 | vmf
240 | 8vo
241 | vsk
242 | vth
243 | þt
244 | þf
245 | þjs
246 | þgf
247 | þlt
248 | þolm
249 | þm
250 | þml
251 | þýð
252 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.it:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
5 | #usually upper case letters are initials in a name
6 | A
7 | B
8 | C
9 | D
10 | E
11 | F
12 | G
13 | H
14 | I
15 | J
16 | K
17 | L
18 | M
19 | N
20 | O
21 | P
22 | Q
23 | R
24 | S
25 | T
26 | U
27 | V
28 | W
29 | X
30 | Y
31 | Z
32 |
33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
34 | Adj
35 | Adm
36 | Adv
37 | Amn
38 | Arch
39 | Asst
40 | Avv
41 | Bart
42 | Bcc
43 | Bldg
44 | Brig
45 | Bros
46 | C.A.P
47 | C.P
48 | Capt
49 | Cc
50 | Cmdr
51 | Co
52 | Col
53 | Comdr
54 | Con
55 | Corp
56 | Cpl
57 | DR
58 | Dott
59 | Dr
60 | Drs
61 | Egr
62 | Ens
63 | Gen
64 | Geom
65 | Gov
66 | Hon
67 | Hosp
68 | Hr
69 | Id
70 | Ing
71 | Insp
72 | Lt
73 | MM
74 | MR
75 | MRS
76 | MS
77 | Maj
78 | Messrs
79 | Mlle
80 | Mme
81 | Mo
82 | Mons
83 | Mr
84 | Mrs
85 | Ms
86 | Msgr
87 | N.B
88 | Op
89 | Ord
90 | P.S
91 | P.T
92 | Pfc
93 | Ph
94 | Prof
95 | Pvt
96 | RP
97 | RSVP
98 | Rag
99 | Rep
100 | Reps
101 | Res
102 | Rev
103 | Rif
104 | Rt
105 | S.A
106 | S.B.F
107 | S.P.M
108 | S.p.A
109 | S.r.l
110 | Sen
111 | Sens
112 | Sfc
113 | Sgt
114 | Sig
115 | Sigg
116 | Soc
117 | Spett
118 | Sr
119 | St
120 | Supt
121 | Surg
122 | V.P
123 |
124 | # other
125 | a.c
126 | acc
127 | all
128 | banc
129 | c.a
130 | c.c.p
131 | c.m
132 | c.p
133 | c.s
134 | c.v
135 | corr
136 | dott
137 | e.p.c
138 | ecc
139 | es
140 | fatt
141 | gg
142 | int
143 | lett
144 | ogg
145 | on
146 | p.c
147 | p.c.c
148 | p.es
149 | p.f
150 | p.r
151 | p.v
152 | post
153 | pp
154 | racc
155 | ric
156 | s.n.c
157 | seg
158 | sgg
159 | ss
160 | tel
161 | u.s
162 | v.r
163 | v.s
164 |
165 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
166 | v
167 | vs
168 | i.e
169 | rev
170 | e.g
171 |
172 | #Numbers only. These should only induce breaks when followed by a numeric sequence
173 | # add NUMERIC_ONLY after the word for this function
174 | #This case is mostly for the english "No." which can either be a sentence of its own, or
175 | #if followed by a number, a non-breaking prefix
176 | No #NUMERIC_ONLY#
177 | Nos
178 | Art #NUMERIC_ONLY#
179 | Nr
180 | pp #NUMERIC_ONLY#
181 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.lv:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
5 | #usually upper case letters are initials in a name
6 | A
7 | Ā
8 | B
9 | C
10 | Č
11 | D
12 | E
13 | Ē
14 | F
15 | G
16 | Ģ
17 | H
18 | I
19 | Ī
20 | J
21 | K
22 | Ķ
23 | L
24 | Ļ
25 | M
26 | N
27 | Ņ
28 | O
29 | P
30 | Q
31 | R
32 | S
33 | Š
34 | T
35 | U
36 | Ū
37 | V
38 | W
39 | X
40 | Y
41 | Z
42 | Ž
43 |
44 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
45 | dr
46 | Dr
47 | med
48 | prof
49 | Prof
50 | inž
51 | Inž
52 | ist.loc
53 | Ist.loc
54 | kor.loc
55 | Kor.loc
56 | v.i
57 | vietn
58 | Vietn
59 |
60 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
61 | a.l
62 | t.p
63 | pārb
64 | Pārb
65 | vec
66 | Vec
67 | inv
68 | Inv
69 | sk
70 | Sk
71 | spec
72 | Spec
73 | vienk
74 | Vienk
75 | virz
76 | Virz
77 | māksl
78 | Māksl
79 | mūz
80 | Mūz
81 | akad
82 | Akad
83 | soc
84 | Soc
85 | galv
86 | Galv
87 | vad
88 | Vad
89 | sertif
90 | Sertif
91 | folkl
92 | Folkl
93 | hum
94 | Hum
95 |
96 | #Numbers only. These should only induce breaks when followed by a numeric sequence
97 | # add NUMERIC_ONLY after the word for this function
98 | #This case is mostly for the english "No." which can either be a sentence of its own, or
99 | #if followed by a number, a non-breaking prefix
100 | Nr #NUMERIC_ONLY#
101 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.nl:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 | #Sources: http://nl.wikipedia.org/wiki/Lijst_van_afkortingen
4 | # http://nl.wikipedia.org/wiki/Aanspreekvorm
5 | # http://nl.wikipedia.org/wiki/Titulatuur_in_het_Nederlands_hoger_onderwijs
6 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
7 | #usually upper case letters are initials in a name
8 | A
9 | B
10 | C
11 | D
12 | E
13 | F
14 | G
15 | H
16 | I
17 | J
18 | K
19 | L
20 | M
21 | N
22 | O
23 | P
24 | Q
25 | R
26 | S
27 | T
28 | U
29 | V
30 | W
31 | X
32 | Y
33 | Z
34 |
35 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
36 | bacc
37 | bc
38 | bgen
39 | c.i
40 | dhr
41 | dr
42 | dr.h.c
43 | drs
44 | drs
45 | ds
46 | eint
47 | fa
48 | Fa
49 | fam
50 | gen
51 | genm
52 | ing
53 | ir
54 | jhr
55 | jkvr
56 | jr
57 | kand
58 | kol
59 | lgen
60 | lkol
61 | Lt
62 | maj
63 | Mej
64 | mevr
65 | Mme
66 | mr
67 | mr
68 | Mw
69 | o.b.s
70 | plv
71 | prof
72 | ritm
73 | tint
74 | Vz
75 | Z.D
76 | Z.D.H
77 | Z.E
78 | Z.Em
79 | Z.H
80 | Z.K.H
81 | Z.K.M
82 | Z.M
83 | z.v
84 |
85 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
86 | #we seem to have a lot of these in dutch i.e.: i.p.v - in plaats van (in stead of) never ends a sentence
87 | a.g.v
88 | bijv
89 | bijz
90 | bv
91 | d.w.z
92 | e.c
93 | e.g
94 | e.k
95 | ev
96 | i.p.v
97 | i.s.m
98 | i.t.t
99 | i.v.m
100 | m.a.w
101 | m.b.t
102 | m.b.v
103 | m.h.o
104 | m.i
105 | m.i.v
106 | v.w.t
107 |
108 | #Numbers only. These should only induce breaks when followed by a numeric sequence
109 | # add NUMERIC_ONLY after the word for this function
110 | #This case is mostly for the english "No." which can either be a sentence of its own, or
111 | #if followed by a number, a non-breaking prefix
112 | Nr #NUMERIC_ONLY#
113 | Nrs
114 | nrs
115 | nr #NUMERIC_ONLY#
116 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.pl:
--------------------------------------------------------------------------------
1 | adw
2 | afr
3 | akad
4 | al
5 | Al
6 | am
7 | amer
8 | arch
9 | art
10 | Art
11 | artyst
12 | astr
13 | austr
14 | bałt
15 | bdb
16 | bł
17 | bm
18 | br
19 | bryg
20 | bryt
21 | centr
22 | ces
23 | chem
24 | chiń
25 | chir
26 | c.k
27 | c.o
28 | cyg
29 | cyw
30 | cyt
31 | czes
32 | czw
33 | cd
34 | Cd
35 | czyt
36 | ćw
37 | ćwicz
38 | daw
39 | dcn
40 | dekl
41 | demokr
42 | det
43 | diec
44 | dł
45 | dn
46 | dot
47 | dol
48 | dop
49 | dost
50 | dosł
51 | h.c
52 | ds
53 | dst
54 | duszp
55 | dypl
56 | egz
57 | ekol
58 | ekon
59 | elektr
60 | em
61 | ew
62 | fab
63 | farm
64 | fot
65 | fr
66 | gat
67 | gastr
68 | geogr
69 | geol
70 | gimn
71 | głęb
72 | gm
73 | godz
74 | górn
75 | gosp
76 | gr
77 | gram
78 | hist
79 | hiszp
80 | hr
81 | Hr
82 | hot
83 | id
84 | in
85 | im
86 | iron
87 | jn
88 | kard
89 | kat
90 | katol
91 | k.k
92 | kk
93 | kol
94 | kl
95 | k.p.a
96 | kpc
97 | k.p.c
98 | kpt
99 | kr
100 | k.r
101 | krak
102 | k.r.o
103 | kryt
104 | kult
105 | laic
106 | łac
107 | niem
108 | woj
109 | nb
110 | np
111 | Nb
112 | Np
113 | pol
114 | pow
115 | m.in
116 | pt
117 | ps
118 | Pt
119 | Ps
120 | cdn
121 | jw
122 | ryc
123 | rys
124 | Ryc
125 | Rys
126 | tj
127 | tzw
128 | Tzw
129 | tzn
130 | zob
131 | ang
132 | ub
133 | ul
134 | pw
135 | pn
136 | pl
137 | al
138 | k
139 | n
140 | nr #NUMERIC_ONLY#
141 | Nr #NUMERIC_ONLY#
142 | ww
143 | wł
144 | ur
145 | zm
146 | żyd
147 | żarg
148 | żyw
149 | wył
150 | bp
151 | bp
152 | wyst
153 | tow
154 | Tow
155 | o
156 | sp
157 | Sp
158 | st
159 | spółdz
160 | Spółdz
161 | społ
162 | spółgł
163 | stoł
164 | stow
165 | Stoł
166 | Stow
167 | zn
168 | zew
169 | zewn
170 | zdr
171 | zazw
172 | zast
173 | zaw
174 | zał
175 | zal
176 | zam
177 | zak
178 | zakł
179 | zagr
180 | zach
181 | adw
182 | Adw
183 | lek
184 | Lek
185 | med
186 | mec
187 | Mec
188 | doc
189 | Doc
190 | dyw
191 | dyr
192 | Dyw
193 | Dyr
194 | inż
195 | Inż
196 | mgr
197 | Mgr
198 | dh
199 | dr
200 | Dh
201 | Dr
202 | p
203 | P
204 | red
205 | Red
206 | prof
207 | prok
208 | Prof
209 | Prok
210 | hab
211 | płk
212 | Płk
213 | nadkom
214 | Nadkom
215 | podkom
216 | Podkom
217 | ks
218 | Ks
219 | gen
220 | Gen
221 | por
222 | Por
223 | reż
224 | Reż
225 | przyp
226 | Przyp
227 | śp
228 | św
229 | śW
230 | Śp
231 | Św
232 | ŚW
233 | szer
234 | Szer
235 | pkt #NUMERIC_ONLY#
236 | str #NUMERIC_ONLY#
237 | tab #NUMERIC_ONLY#
238 | Tab #NUMERIC_ONLY#
239 | tel
240 | ust #NUMERIC_ONLY#
241 | par #NUMERIC_ONLY#
242 | poz
243 | pok
244 | oo
245 | oO
246 | Oo
247 | OO
248 | r #NUMERIC_ONLY#
249 | l #NUMERIC_ONLY#
250 | s #NUMERIC_ONLY#
251 | najśw
252 | Najśw
253 | A
254 | B
255 | C
256 | D
257 | E
258 | F
259 | G
260 | H
261 | I
262 | J
263 | K
264 | L
265 | M
266 | N
267 | O
268 | P
269 | Q
270 | R
271 | S
272 | T
273 | U
274 | V
275 | W
276 | X
277 | Y
278 | Z
279 | Ś
280 | Ć
281 | Ż
282 | Ź
283 | Dz
284 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.pt:
--------------------------------------------------------------------------------
1 | #File adapted for PT by H. Leal Fontes from the EN & DE versions published with moses-2009-04-13. Last update: 10.11.2009.
2 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
3 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
4 |
5 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
6 | #usually upper case letters are initials in a name
7 | A
8 | B
9 | C
10 | D
11 | E
12 | F
13 | G
14 | H
15 | I
16 | J
17 | K
18 | L
19 | M
20 | N
21 | O
22 | P
23 | Q
24 | R
25 | S
26 | T
27 | U
28 | V
29 | W
30 | X
31 | Y
32 | Z
33 | a
34 | b
35 | c
36 | d
37 | e
38 | f
39 | g
40 | h
41 | i
42 | j
43 | k
44 | l
45 | m
46 | n
47 | o
48 | p
49 | q
50 | r
51 | s
52 | t
53 | u
54 | v
55 | w
56 | x
57 | y
58 | z
59 |
60 |
61 | #Roman Numerals. A dot after one of these is not a sentence break in Portuguese.
62 | I
63 | II
64 | III
65 | IV
66 | V
67 | VI
68 | VII
69 | VIII
70 | IX
71 | X
72 | XI
73 | XII
74 | XIII
75 | XIV
76 | XV
77 | XVI
78 | XVII
79 | XVIII
80 | XIX
81 | XX
82 | i
83 | ii
84 | iii
85 | iv
86 | v
87 | vi
88 | vii
89 | viii
90 | ix
91 | x
92 | xi
93 | xii
94 | xiii
95 | xiv
96 | xv
97 | xvi
98 | xvii
99 | xviii
100 | xix
101 | xx
102 |
103 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
104 | Adj
105 | Adm
106 | Adv
107 | Art
108 | Ca
109 | Capt
110 | Cmdr
111 | Col
112 | Comdr
113 | Con
114 | Corp
115 | Cpl
116 | DR
117 | DRA
118 | Dr
119 | Dra
120 | Dras
121 | Drs
122 | Eng
123 | Enga
124 | Engas
125 | Engos
126 | Ex
127 | Exo
128 | Exmo
129 | Fig
130 | Gen
131 | Hosp
132 | Insp
133 | Lda
134 | MM
135 | MR
136 | MRS
137 | MS
138 | Maj
139 | Mrs
140 | Ms
141 | Msgr
142 | Op
143 | Ord
144 | Pfc
145 | Ph
146 | Prof
147 | Pvt
148 | Rep
149 | Reps
150 | Res
151 | Rev
152 | Rt
153 | Sen
154 | Sens
155 | Sfc
156 | Sgt
157 | Sr
158 | Sra
159 | Sras
160 | Srs
161 | Sto
162 | Supt
163 | Surg
164 | adj
165 | adm
166 | adv
167 | art
168 | cit
169 | col
170 | con
171 | corp
172 | cpl
173 | dr
174 | dra
175 | dras
176 | drs
177 | eng
178 | enga
179 | engas
180 | engos
181 | ex
182 | exo
183 | exmo
184 | fig
185 | op
186 | prof
187 | sr
188 | sra
189 | sras
190 | srs
191 | sto
192 |
193 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
194 | v
195 | vs
196 | i.e
197 | rev
198 | e.g
199 |
200 | #Numbers only. These should only induce breaks when followed by a numeric sequence
201 | # add NUMERIC_ONLY after the word for this function
202 | #This case is mostly for the english "No." which can either be a sentence of its own, or
203 | #if followed by a number, a non-breaking prefix
204 | No #NUMERIC_ONLY#
205 | Nos
206 | Art #NUMERIC_ONLY#
207 | Nr
208 | p #NUMERIC_ONLY#
209 | pp #NUMERIC_ONLY#
210 |
211 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.ro:
--------------------------------------------------------------------------------
1 | A
2 | B
3 | C
4 | D
5 | E
6 | F
7 | G
8 | H
9 | I
10 | J
11 | K
12 | L
13 | M
14 | N
15 | O
16 | P
17 | Q
18 | R
19 | S
20 | T
21 | U
22 | V
23 | W
24 | X
25 | Y
26 | Z
27 | dpdv
28 | etc
29 | șamd
30 | M.Ap.N
31 | dl
32 | Dl
33 | d-na
34 | D-na
35 | dvs
36 | Dvs
37 | pt
38 | Pt
39 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.ru:
--------------------------------------------------------------------------------
1 | # added Cyrillic uppercase letters [А-Я]
2 | # removed 000D carriage return (this is not removed by chomp in tokenizer.perl, and prevents recognition of the prefixes)
3 | # edited by Kate Young (nspaceanalysis@earthlink.net) 21 May 2013
4 | А
5 | Б
6 | В
7 | Г
8 | Д
9 | Е
10 | Ж
11 | З
12 | И
13 | Й
14 | К
15 | Л
16 | М
17 | Н
18 | О
19 | П
20 | Р
21 | С
22 | Т
23 | У
24 | Ф
25 | Х
26 | Ц
27 | Ч
28 | Ш
29 | Щ
30 | Ъ
31 | Ы
32 | Ь
33 | Э
34 | Ю
35 | Я
36 | A
37 | B
38 | C
39 | D
40 | E
41 | F
42 | G
43 | H
44 | I
45 | J
46 | K
47 | L
48 | M
49 | N
50 | O
51 | P
52 | Q
53 | R
54 | S
55 | T
56 | U
57 | V
58 | W
59 | X
60 | Y
61 | Z
62 | 0гг
63 | 1гг
64 | 2гг
65 | 3гг
66 | 4гг
67 | 5гг
68 | 6гг
69 | 7гг
70 | 8гг
71 | 9гг
72 | 0г
73 | 1г
74 | 2г
75 | 3г
76 | 4г
77 | 5г
78 | 6г
79 | 7г
80 | 8г
81 | 9г
82 | Xвв
83 | Vвв
84 | Iвв
85 | Lвв
86 | Mвв
87 | Cвв
88 | Xв
89 | Vв
90 | Iв
91 | Lв
92 | Mв
93 | Cв
94 | 0м
95 | 1м
96 | 2м
97 | 3м
98 | 4м
99 | 5м
100 | 6м
101 | 7м
102 | 8м
103 | 9м
104 | 0мм
105 | 1мм
106 | 2мм
107 | 3мм
108 | 4мм
109 | 5мм
110 | 6мм
111 | 7мм
112 | 8мм
113 | 9мм
114 | 0см
115 | 1см
116 | 2см
117 | 3см
118 | 4см
119 | 5см
120 | 6см
121 | 7см
122 | 8см
123 | 9см
124 | 0дм
125 | 1дм
126 | 2дм
127 | 3дм
128 | 4дм
129 | 5дм
130 | 6дм
131 | 7дм
132 | 8дм
133 | 9дм
134 | 0л
135 | 1л
136 | 2л
137 | 3л
138 | 4л
139 | 5л
140 | 6л
141 | 7л
142 | 8л
143 | 9л
144 | 0км
145 | 1км
146 | 2км
147 | 3км
148 | 4км
149 | 5км
150 | 6км
151 | 7км
152 | 8км
153 | 9км
154 | 0га
155 | 1га
156 | 2га
157 | 3га
158 | 4га
159 | 5га
160 | 6га
161 | 7га
162 | 8га
163 | 9га
164 | 0кг
165 | 1кг
166 | 2кг
167 | 3кг
168 | 4кг
169 | 5кг
170 | 6кг
171 | 7кг
172 | 8кг
173 | 9кг
174 | 0т
175 | 1т
176 | 2т
177 | 3т
178 | 4т
179 | 5т
180 | 6т
181 | 7т
182 | 8т
183 | 9т
184 | 0г
185 | 1г
186 | 2г
187 | 3г
188 | 4г
189 | 5г
190 | 6г
191 | 7г
192 | 8г
193 | 9г
194 | 0мг
195 | 1мг
196 | 2мг
197 | 3мг
198 | 4мг
199 | 5мг
200 | 6мг
201 | 7мг
202 | 8мг
203 | 9мг
204 | бульв
205 | в
206 | вв
207 | г
208 | га
209 | гг
210 | гл
211 | гос
212 | д
213 | дм
214 | доп
215 | др
216 | е
217 | ед
218 | ед
219 | зам
220 | и
221 | инд
222 | исп
223 | Исп
224 | к
225 | кап
226 | кг
227 | кв
228 | кл
229 | км
230 | кол
231 | комн
232 | коп
233 | куб
234 | л
235 | лиц
236 | лл
237 | м
238 | макс
239 | мг
240 | мин
241 | мл
242 | млн
243 | млрд
244 | мм
245 | н
246 | наб
247 | нач
248 | неуд
249 | ном
250 | о
251 | обл
252 | обр
253 | общ
254 | ок
255 | ост
256 | отл
257 | п
258 | пер
259 | перераб
260 | пл
261 | пос
262 | пр
263 | просп
264 | проф
265 | р
266 | ред
267 | руб
268 | с
269 | сб
270 | св
271 | см
272 | соч
273 | ср
274 | ст
275 | стр
276 | т
277 | тел
278 | Тел
279 | тех
280 | тт
281 | туп
282 | тыс
283 | уд
284 | ул
285 | уч
286 | физ
287 | х
288 | хор
289 | ч
290 | чел
291 | шт
292 | экз
293 | э
294 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.sk:
--------------------------------------------------------------------------------
1 | Bc
2 | Mgr
3 | RNDr
4 | PharmDr
5 | PhDr
6 | JUDr
7 | PaedDr
8 | ThDr
9 | Ing
10 | MUDr
11 | MDDr
12 | MVDr
13 | Dr
14 | ThLic
15 | PhD
16 | ArtD
17 | ThDr
18 | Dr
19 | DrSc
20 | CSs
21 | prof
22 | obr
23 | Obr
24 | Č
25 | č
26 | absol
27 | adj
28 | admin
29 | adr
30 | Adr
31 | adv
32 | advok
33 | afr
34 | ak
35 | akad
36 | akc
37 | akuz
38 | et
39 | al
40 | alch
41 | amer
42 | anat
43 | angl
44 | Angl
45 | anglosas
46 | anorg
47 | ap
48 | apod
49 | arch
50 | archeol
51 | archit
52 | arg
53 | art
54 | astr
55 | astrol
56 | astron
57 | atp
58 | atď
59 | austr
60 | Austr
61 | aut
62 | belg
63 | Belg
64 | bibl
65 | Bibl
66 | biol
67 | bot
68 | bud
69 | bás
70 | býv
71 | cest
72 | chem
73 | cirk
74 | csl
75 | čs
76 | Čs
77 | dat
78 | dep
79 | det
80 | dial
81 | diaľ
82 | dipl
83 | distrib
84 | dokl
85 | dosl
86 | dopr
87 | dram
88 | duš
89 | dv
90 | dvojčl
91 | dór
92 | ekol
93 | ekon
94 | el
95 | elektr
96 | elektrotech
97 | energet
98 | epic
99 | est
100 | etc
101 | etonym
102 | eufem
103 | európ
104 | Európ
105 | ev
106 | evid
107 | expr
108 | fa
109 | fam
110 | farm
111 | fem
112 | feud
113 | fil
114 | filat
115 | filoz
116 | fi
117 | fon
118 | form
119 | fot
120 | fr
121 | Fr
122 | franc
123 | Franc
124 | fraz
125 | fut
126 | fyz
127 | fyziol
128 | garb
129 | gen
130 | genet
131 | genpor
132 | geod
133 | geogr
134 | geol
135 | geom
136 | germ
137 | gr
138 | Gr
139 | gréc
140 | Gréc
141 | gréckokat
142 | hebr
143 | herald
144 | hist
145 | hlav
146 | hosp
147 | hromad
148 | hud
149 | hypok
150 | ident
151 | i.e
152 | ident
153 | imp
154 | impf
155 | indoeur
156 | inf
157 | inform
158 | instr
159 | int
160 | interj
161 | inšt
162 | inštr
163 | iron
164 | jap
165 | Jap
166 | jaz
167 | jedn
168 | juhoamer
169 | juhových
170 | juhozáp
171 | juž
172 | kanad
173 | Kanad
174 | kanc
175 | kapit
176 | kpt
177 | kart
178 | katastr
179 | knih
180 | kniž
181 | komp
182 | konj
183 | konkr
184 | kozmet
185 | krajč
186 | kresť
187 | kt
188 | kuch
189 | lat
190 | latinskoamer
191 | lek
192 | lex
193 | lingv
194 | lit
195 | litur
196 | log
197 | lok
198 | max
199 | Max
200 | maď
201 | Maď
202 | medzinár
203 | mest
204 | metr
205 | mil
206 | Mil
207 | min
208 | Min
209 | miner
210 | ml
211 | mld
212 | mn
213 | mod
214 | mytol
215 | napr
216 | nar
217 | Nar
218 | nasl
219 | nedok
220 | neg
221 | negat
222 | neklas
223 | nem
224 | Nem
225 | neodb
226 | neos
227 | neskl
228 | nesklon
229 | nespis
230 | nespráv
231 | neved
232 | než
233 | niekt
234 | niž
235 | nom
236 | náb
237 | nákl
238 | námor
239 | nár
240 | obch
241 | obj
242 | obv
243 | obyč
244 | obč
245 | občian
246 | odb
247 | odd
248 | ods
249 | ojed
250 | okr
251 | Okr
252 | opt
253 | opyt
254 | org
255 | os
256 | osob
257 | ot
258 | ovoc
259 | par
260 | part
261 | pejor
262 | pers
263 | pf
264 | Pf
265 | P.f
266 | p.f
267 | pl
268 | Plk
269 | pod
270 | podst
271 | pokl
272 | polit
273 | politol
274 | polygr
275 | pomn
276 | popl
277 | por
278 | porad
279 | porov
280 | posch
281 | potrav
282 | použ
283 | poz
284 | pozit
285 | poľ
286 | poľno
287 | poľnohosp
288 | poľov
289 | pošt
290 | pož
291 | prac
292 | predl
293 | pren
294 | prep
295 | preuk
296 | priezv
297 | Priezv
298 | privl
299 | prof
300 | práv
301 | príd
302 | príj
303 | prík
304 | príp
305 | prír
306 | prísl
307 | príslov
308 | príč
309 | psych
310 | publ
311 | pís
312 | písm
313 | pôv
314 | refl
315 | reg
316 | rep
317 | resp
318 | rozk
319 | rozlič
320 | rozpráv
321 | roč
322 | Roč
323 | ryb
324 | rádiotech
325 | rím
326 | samohl
327 | semest
328 | sev
329 | severoamer
330 | severových
331 | severozáp
332 | sg
333 | skr
334 | skup
335 | sl
336 | Sloven
337 | soc
338 | soch
339 | sociol
340 | sp
341 | spol
342 | Spol
343 | spoloč
344 | spoluhl
345 | správ
346 | spôs
347 | st
348 | star
349 | starogréc
350 | starorím
351 | s.r.o
352 | stol
353 | stor
354 | str
355 | stredoamer
356 | stredoškol
357 | subj
358 | subst
359 | superl
360 | sv
361 | sz
362 | súkr
363 | súp
364 | súvzť
365 | tal
366 | Tal
367 | tech
368 | tel
369 | Tel
370 | telef
371 | teles
372 | telev
373 | teol
374 | trans
375 | turist
376 | tuzem
377 | typogr
378 | tzn
379 | tzv
380 | ukaz
381 | ul
382 | Ul
383 | umel
384 | univ
385 | ust
386 | ved
387 | vedľ
388 | verb
389 | veter
390 | vin
391 | viď
392 | vl
393 | vod
394 | vodohosp
395 | pnl
396 | vulg
397 | vyj
398 | vys
399 | vysokoškol
400 | vzťaž
401 | vôb
402 | vých
403 | výd
404 | výrob
405 | výsk
406 | výsl
407 | výtv
408 | výtvar
409 | význ
410 | včel
411 | vš
412 | všeob
413 | zahr
414 | zar
415 | zariad
416 | zast
417 | zastar
418 | zastaráv
419 | zb
420 | zdravot
421 | združ
422 | zjemn
423 | zlat
424 | zn
425 | Zn
426 | zool
427 | zr
428 | zried
429 | zv
430 | záhr
431 | zák
432 | zákl
433 | zám
434 | záp
435 | západoeur
436 | zázn
437 | územ
438 | účt
439 | čast
440 | čes
441 | Čes
442 | čl
443 | čísl
444 | živ
445 | pr
446 | fak
447 | Kr
448 | p.n.l
449 | A
450 | B
451 | C
452 | D
453 | E
454 | F
455 | G
456 | H
457 | I
458 | J
459 | K
460 | L
461 | M
462 | N
463 | O
464 | P
465 | Q
466 | R
467 | S
468 | T
469 | U
470 | V
471 | W
472 | X
473 | Y
474 | Z
475 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.sl:
--------------------------------------------------------------------------------
1 | dr
2 | Dr
3 | itd
4 | itn
5 | št #NUMERIC_ONLY#
6 | Št #NUMERIC_ONLY#
7 | d
8 | jan
9 | Jan
10 | feb
11 | Feb
12 | mar
13 | Mar
14 | apr
15 | Apr
16 | jun
17 | Jun
18 | jul
19 | Jul
20 | avg
21 | Avg
22 | sept
23 | Sept
24 | sep
25 | Sep
26 | okt
27 | Okt
28 | nov
29 | Nov
30 | dec
31 | Dec
32 | tj
33 | Tj
34 | npr
35 | Npr
36 | sl
37 | Sl
38 | op
39 | Op
40 | gl
41 | Gl
42 | oz
43 | Oz
44 | prev
45 | dipl
46 | ing
47 | prim
48 | Prim
49 | cf
50 | Cf
51 | gl
52 | Gl
53 | A
54 | B
55 | C
56 | D
57 | E
58 | F
59 | G
60 | H
61 | I
62 | J
63 | K
64 | L
65 | M
66 | N
67 | O
68 | P
69 | Q
70 | R
71 | S
72 | T
73 | U
74 | V
75 | W
76 | X
77 | Y
78 | Z
79 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.sv:
--------------------------------------------------------------------------------
1 | #single upper case letters are usually initials
2 | A
3 | B
4 | C
5 | D
6 | E
7 | F
8 | G
9 | H
10 | I
11 | J
12 | K
13 | L
14 | M
15 | N
16 | O
17 | P
18 | Q
19 | R
20 | S
21 | T
22 | U
23 | V
24 | W
25 | X
26 | Y
27 | Z
28 | #misc abbreviations
29 | AB
30 | G
31 | VG
32 | dvs
33 | etc
34 | from
35 | iaf
36 | jfr
37 | kl
38 | kr
39 | mao
40 | mfl
41 | mm
42 | osv
43 | pga
44 | tex
45 | tom
46 | vs
47 |
--------------------------------------------------------------------------------
/data/nonbreaking_prefixes/nonbreaking_prefix.ta:
--------------------------------------------------------------------------------
1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers.
3 |
4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
5 | #usually upper case letters are initials in a name
6 | அ
7 | ஆ
8 | இ
9 | ஈ
10 | உ
11 | ஊ
12 | எ
13 | ஏ
14 | ஐ
15 | ஒ
16 | ஓ
17 | ஔ
18 | ஃ
19 | க
20 | கா
21 | கி
22 | கீ
23 | கு
24 | கூ
25 | கெ
26 | கே
27 | கை
28 | கொ
29 | கோ
30 | கௌ
31 | க்
32 | ச
33 | சா
34 | சி
35 | சீ
36 | சு
37 | சூ
38 | செ
39 | சே
40 | சை
41 | சொ
42 | சோ
43 | சௌ
44 | ச்
45 | ட
46 | டா
47 | டி
48 | டீ
49 | டு
50 | டூ
51 | டெ
52 | டே
53 | டை
54 | டொ
55 | டோ
56 | டௌ
57 | ட்
58 | த
59 | தா
60 | தி
61 | தீ
62 | து
63 | தூ
64 | தெ
65 | தே
66 | தை
67 | தொ
68 | தோ
69 | தௌ
70 | த்
71 | ப
72 | பா
73 | பி
74 | பீ
75 | பு
76 | பூ
77 | பெ
78 | பே
79 | பை
80 | பொ
81 | போ
82 | பௌ
83 | ப்
84 | ற
85 | றா
86 | றி
87 | றீ
88 | று
89 | றூ
90 | றெ
91 | றே
92 | றை
93 | றொ
94 | றோ
95 | றௌ
96 | ற்
97 | ய
98 | யா
99 | யி
100 | யீ
101 | யு
102 | யூ
103 | யெ
104 | யே
105 | யை
106 | யொ
107 | யோ
108 | யௌ
109 | ய்
110 | ர
111 | ரா
112 | ரி
113 | ரீ
114 | ரு
115 | ரூ
116 | ரெ
117 | ரே
118 | ரை
119 | ரொ
120 | ரோ
121 | ரௌ
122 | ர்
123 | ல
124 | லா
125 | லி
126 | லீ
127 | லு
128 | லூ
129 | லெ
130 | லே
131 | லை
132 | லொ
133 | லோ
134 | லௌ
135 | ல்
136 | வ
137 | வா
138 | வி
139 | வீ
140 | வு
141 | வூ
142 | வெ
143 | வே
144 | வை
145 | வொ
146 | வோ
147 | வௌ
148 | வ்
149 | ள
150 | ளா
151 | ளி
152 | ளீ
153 | ளு
154 | ளூ
155 | ளெ
156 | ளே
157 | ளை
158 | ளொ
159 | ளோ
160 | ளௌ
161 | ள்
162 | ழ
163 | ழா
164 | ழி
165 | ழீ
166 | ழு
167 | ழூ
168 | ழெ
169 | ழே
170 | ழை
171 | ழொ
172 | ழோ
173 | ழௌ
174 | ழ்
175 | ங
176 | ஙா
177 | ஙி
178 | ஙீ
179 | ஙு
180 | ஙூ
181 | ஙெ
182 | ஙே
183 | ஙை
184 | ஙொ
185 | ஙோ
186 | ஙௌ
187 | ங்
188 | ஞ
189 | ஞா
190 | ஞி
191 | ஞீ
192 | ஞு
193 | ஞூ
194 | ஞெ
195 | ஞே
196 | ஞை
197 | ஞொ
198 | ஞோ
199 | ஞௌ
200 | ஞ்
201 | ண
202 | ணா
203 | ணி
204 | ணீ
205 | ணு
206 | ணூ
207 | ணெ
208 | ணே
209 | ணை
210 | ணொ
211 | ணோ
212 | ணௌ
213 | ண்
214 | ந
215 | நா
216 | நி
217 | நீ
218 | நு
219 | நூ
220 | நெ
221 | நே
222 | நை
223 | நொ
224 | நோ
225 | நௌ
226 | ந்
227 | ம
228 | மா
229 | மி
230 | மீ
231 | மு
232 | மூ
233 | மெ
234 | மே
235 | மை
236 | மொ
237 | மோ
238 | மௌ
239 | ம்
240 | ன
241 | னா
242 | னி
243 | னீ
244 | னு
245 | னூ
246 | னெ
247 | னே
248 | னை
249 | னொ
250 | னோ
251 | னௌ
252 | ன்
253 |
254 |
255 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks
256 | திரு
257 | திருமதி
258 | வண
259 | கௌரவ
260 |
261 |
262 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence)
263 | உ.ம்
264 | #கா.ம்
265 | #எ.ம்
266 |
267 |
268 | #Numbers only. These should only induce breaks when followed by a numeric sequence
269 | # add NUMERIC_ONLY after the word for this function
270 | #This case is mostly for the english "No." which can either be a sentence of its own, or
271 | #if followed by a number, a non-breaking prefix
272 | No #NUMERIC_ONLY#
273 | Nos
274 | Art #NUMERIC_ONLY#
275 | Nr
276 | pp #NUMERIC_ONLY#
277 |
--------------------------------------------------------------------------------
/data/normalize-punctuation.perl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env perl
2 | #
3 | # This file is part of moses. Its use is licensed under the GNU Lesser General
4 | # Public License version 2.1 or, at your option, any later version.
5 |
6 | use warnings;
7 | use strict;
8 |
9 | my $language = "en";
10 | my $PENN = 0;
11 |
12 | while (@ARGV) {
13 | $_ = shift;
14 | /^-b$/ && ($| = 1, next); # not buffered (flush each line)
15 | /^-l$/ && ($language = shift, next);
16 | /^[^\-]/ && ($language = $_, next);
17 | /^-penn$/ && ($PENN = 1, next);
18 | }
19 |
20 | while(<STDIN>) {
21 | s/\r//g;
22 | # remove extra spaces
23 | s/\(/ \(/g;
24 | s/\)/\) /g; s/ +/ /g;
25 | s/\) ([\.\!\:\?\;\,])/\)$1/g;
26 | s/\( /\(/g;
27 | s/ \)/\)/g;
28 | s/(\d) \%/$1\%/g;
29 | s/ :/:/g;
30 | s/ ;/;/g;
31 | # normalize unicode punctuation
32 | if ($PENN == 0) {
33 | s/\`/\'/g;
34 | s/\'\'/ \" /g;
35 | }
36 |
37 | s/„/\"/g;
38 | s/“/\"/g;
39 | s/”/\"/g;
40 | s/–/-/g;
41 | s/—/ - /g; s/ +/ /g;
42 | s/´/\'/g;
43 | s/([a-z])‘([a-z])/$1\'$2/gi;
44 | s/([a-z])’([a-z])/$1\'$2/gi;
45 | s/‘/\"/g;
46 | s/‚/\"/g;
47 | s/’/\"/g;
48 | s/''/\"/g;
49 | s/´´/\"/g;
50 | s/…/.../g;
51 | # French quotes
52 | s/ « / \"/g;
53 | s/« /\"/g;
54 | s/«/\"/g;
55 | s/ » /\" /g;
56 | s/ »/\"/g;
57 | s/»/\"/g;
58 | # handle pseudo-spaces
59 | s/ \%/\%/g;
60 | s/nº /nº /g;
61 | s/ :/:/g;
62 | s/ ºC/ ºC/g;
63 | s/ cm/ cm/g;
64 | s/ \?/\?/g;
65 | s/ \!/\!/g;
66 | s/ ;/;/g;
67 | s/, /, /g; s/ +/ /g;
68 |
69 | # English "quotation," followed by comma, style
70 | if ($language eq "en") {
71 | s/\"([,\.]+)/$1\"/g;
72 | }
73 | # Czech is confused
74 | elsif ($language eq "cs" || $language eq "cz") {
75 | }
76 | # German/Spanish/French "quotation", followed by comma, style
77 | else {
78 | s/,\"/\",/g;
79 | s/(\.+)\"(\s*[^<])/\"$1$2/g; # don't fix period at end of sentence
80 | }
81 |
82 |
83 | if ($language eq "de" || $language eq "es" || $language eq "cz" || $language eq "cs" || $language eq "fr") {
84 | s/(\d) (\d)/$1,$2/g;
85 | }
86 | else {
87 | s/(\d) (\d)/$1.$2/g;
88 | }
89 | print $_;
90 | }
91 |
--------------------------------------------------------------------------------
/data/postprocess.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 |
3 | # input path
4 | INPUT=$1
5 |
6 | # output path
7 | OUTPUT=$2
8 |
9 | # restore subword units to original segmentation
10 | sed -r 's/(@@ )|(@@ ?$)//g' ${INPUT} > ${OUTPUT}
11 |
--------------------------------------------------------------------------------
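The sed expression above simply removes the `@@ ` continuation markers that apply_bpe.py inserts, so `un@@ believ@@ able` becomes `unbelievable` again. A hypothetical Python equivalent of the same desegmentation step, reading from stdin (not a script that ships with the repository):

```python
import re
import sys

# Undo BPE segmentation: strip "@@ " (and a trailing "@@") from every line.
bpe_marker = re.compile(r"(@@ )|(@@ ?$)")

for line in sys.stdin:
    sys.stdout.write(bpe_marker.sub("", line))
```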
/data/preprocess.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # source language suffix (example: en, cs, de, fr)
4 | S=$1
5 |
6 | # target language suffix (example: en, cs, de, fr)
7 | T=$2
8 |
9 | # path to corpus
10 | CORPUS=$3
11 |
12 | # maximum sequence length
13 | MAXLEN=$4
14 |
15 | echo "normalizing punctuation.."
16 | perl normalize-punctuation.perl -l ${S} < ${CORPUS}.${S} > ${CORPUS}.norm.${S}
17 | perl normalize-punctuation.perl -l ${T} < ${CORPUS}.${T} > ${CORPUS}.norm.${T}
18 |
19 | echo "tokenizing.."
20 | perl tokenizer.perl -l ${S} -threads 10 < ${CORPUS}.norm.${S} > ${CORPUS}.tok.${S}
21 | perl tokenizer.perl -l ${T} -threads 10 < ${CORPUS}.norm.${T} > ${CORPUS}.tok.${T}
22 |
23 | echo "learning bpe.."
24 | # learn BPE on joint vocabulary
25 | cat ${CORPUS}.tok.${S} ${CORPUS}.tok.${T} | python subword_nmt/learn_bpe.py -s 30000 > ${S}${T}.bpe
26 |
27 | echo "applying bpe.."
28 | python subword_nmt/apply_bpe.py -c ${S}${T}.bpe < ${CORPUS}.tok.${S} > ${CORPUS}.bpe.${S}
29 | python subword_nmt/apply_bpe.py -c ${S}${T}.bpe < ${CORPUS}.tok.${T} > ${CORPUS}.bpe.${T}
30 |
31 | echo "cleaning: filtering sequences of length over ${MAXLEN}"
32 | perl clean-corpus-n.perl ${CORPUS}.bpe ${S} ${T} ${CORPUS}.clean 1 ${MAXLEN}
33 |
34 | echo "shuffling.."
35 | python shuffle.py ${CORPUS}.clean.${S} ${CORPUS}.clean.${T}
36 |
37 | mv ${CORPUS}.clean.${S}.shuf ${CORPUS}.shuf.${S}
38 | mv ${CORPUS}.clean.${T}.shuf ${CORPUS}.shuf.${T}
39 |
40 | echo "building dictionaries.."
41 | python build_dictionary.py ${CORPUS}.shuf.${S} ${CORPUS}.shuf.${T}
42 |
43 | echo "preprocessing complete.."
44 | python data_statistics.py ${CORPUS}.shuf.${S} ${CORPUS}.shuf.${T}
45 |
--------------------------------------------------------------------------------
/data/sample.en:
--------------------------------------------------------------------------------
1 | Parliament Does Not Support Amendment Freeing Tymoshenko
2 | Today, the Ukraine parliament dismissed, within the Code of Criminal Procedure amendment, the motion to revoke an article based on which the opposition leader, Yulia Tymoshenko, was sentenced.
3 | The amendment that would lead to freeing the imprisoned former Prime Minister was revoked during second reading of the proposal for mitigation of sentences for economic offences.
4 | In October, Tymoshenko was sentenced to seven years in prison for entering into what was reported to be a disadvantageous gas deal with Russia.
5 | The verdict is not yet final; the court will hear Tymoshenko's appeal in December.
6 | Tymoshenko claims the verdict is a political revenge of the regime; in the West, the trial has also evoked suspicion of being biased.
7 | The proposal to remove Article 365 from the Code of Criminal Procedure, upon which the former Prime Minister was sentenced, was supported by 147 members of parliament.
8 | Its ratification would require 226 votes.
9 | Libya's Victory
10 | The story of Libya's liberation, or rebellion, already has its defeated.
11 | Muammar Kaddafi is buried at an unknown place in the desert. Without him, the war is over.
12 | It is time to define the winners.
13 | As a rule, Islamists win in the country; the question is whether they are the moderate or the radical ones.
14 | The transitional cabinet declared itself a follower of the customary Sharia law, of which we have already heard.
15 | Libya will become a crime free country, as the punishment for stealing is having one's hand amputated.
16 | Women can forget about emancipation; potential religious renegades will be executed; etc.
17 | Instead of a dictator, a society consisting of competing tribes will be united by Koran.
18 | Libya will be in an order we cannot imagine and surely would not ask for.
19 | However, our lifestyle is neither unique nor the best one and would not, most probably, be suitable for, for example, the people of Libya.
20 | In fact, it is a wonder that the Islamic fighters accepted help from the nonbelievers.
--------------------------------------------------------------------------------
/data/sample.fr:
--------------------------------------------------------------------------------
1 | Le Parlement n'a pas ratifié l'amendement pour la libération de Tymosenko
2 | Le parlement ukrainien a refusé, dans la cadre d'un amendement au droit pénal, le projet d'annulation du paragraphe relatif à l'inculpation de Julia Tymosenko, la chef de l'opposition.
3 | Les députés ont refusé en deuxième lecture le projet de modification visant la réduction des peines pour délits économiques, qui aurait pu ouvrir les portes de la liberté pour l'ex-Première Ministre actuellement emprisonnée.
4 | Tymosenko a été condamnée en octobre à 7 ans de prison pour avoir conclu un accord à priori désavantageux avec la Russie pour l'achat de gaz naturel.
5 | Le jugement n'est pas définitif et le tribunal doit statuer sur l'appel de la condamnée en décembre.
6 | Tymosenko qualifie le jugement de vengeance politique du régime et a provoqué un processus de soupçons de partialité du tribunal également à l'ouest.
7 | Le projet d'annuler le paragraphe 365 du droit pénal, sur la base duquel l'ex-Premère Ministre a été condamnée, a été signé par 147 députés.
8 | Il aurait fallu 226 voix pour l'approuver.
9 | Victoire libyenne
10 | La libération ou rebellion libyenne a déjà ses vaincus.
11 | Muammar Kaddafi est enterré dans un endroit inconnu dans le désert et sans lui la guerre est terminée.
12 | Il reste à déterminer les vainqueurs.
13 | Comme il est de coutume dans la région, les vainqueurs aux élections sont les islamistes, mais s'agit-il des modérés ou des radicaux.
14 | Le Conseil National provisoire a décrété le droit habituel de la Charia et nous savons déjà de quoi il s'agit.
15 | La Libye devient un pays sans criminalité car pour un vol on coupe la main.
16 | Les femmes peuvent oublier l'émancipation, les non respectueuses éventuelles dela foi sont passibles de la peine capitale...
17 | C'est le Coran qui va unir à la place de la personnalité du dictateur la société composée de tribus opposées.
18 | En Libye règnera un tel ordre comme nous ne pouvons pas nous le représenter et comme nous n'aimerions sûrement pas.
19 | Mais notre façon de vivre n'est pas unique, n'est pas objectivement la meilleure et ne serait probablement pas avantageuse pour les libyens.
20 | Il est particulièrement étonnant que les guerriers islamistes ont accepté l'aide avec reconnaissance des chiens infidèles.
--------------------------------------------------------------------------------
/data/shuffle.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import random
4 |
5 | import tempfile
6 | from subprocess import call
7 |
8 | '''
9 | This code comes from shuffle.py of
10 | nematus project (https://github.com/rsennrich/nematus)
11 | '''
12 |
13 | def main(files, temporary=False):
14 |
15 | tf_os, tpath = tempfile.mkstemp()
16 | tf = open(tpath, 'w')
17 |
18 | fds = [open(ff) for ff in files]
19 |
20 | for l in fds[0]:
21 | lines = [l.strip()] + [ff.readline().strip() for ff in fds[1:]]
22 | print >>tf, "|||".join(lines)
23 |
24 | [ff.close() for ff in fds]
25 | tf.close()
26 |
27 | lines = open(tpath, 'r').readlines()
28 | random.shuffle(lines)
29 |
30 | if temporary:
31 | fds = []
32 | for ff in files:
33 | path, filename = os.path.split(os.path.realpath(ff))
34 | fds.append(tempfile.TemporaryFile(prefix=filename+'.shuf', dir=path))
35 | else:
36 | fds = [open(ff+'.shuf','w') for ff in files]
37 |
38 | for l in lines:
39 | s = l.strip().split('|||')
40 | for ii, fd in enumerate(fds):
41 | print >>fd, s[ii]
42 |
43 | if temporary:
44 | [ff.seek(0) for ff in fds]
45 | else:
46 | [ff.close() for ff in fds]
47 |
48 | os.close(tf_os)
49 | os.remove(tpath)
50 |
51 | return fds
52 |
53 | if __name__ == '__main__':
54 | main(sys.argv[1:])
55 |
56 |
57 |
58 |
59 |
--------------------------------------------------------------------------------
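shuffle.py keeps the two sides of the parallel corpus aligned by joining corresponding lines with a `|||` delimiter, shuffling the joined lines, and splitting them back into per-file outputs written as `<file>.shuf`. The same idea in miniature, with made-up toy data:

```python
import random

# Pair up parallel lines, shuffle the pairs, then unzip them again,
# so line i of the shuffled source still matches line i of the target.
src = ["a b c", "d e f", "g h i"]
trg = ["1 2 3", "4 5 6", "7 8 9"]

pairs = list(zip(src, trg))
random.shuffle(pairs)
src_shuf, trg_shuf = zip(*pairs)
```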
/data/strip_sgml.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import re
3 |
4 | '''
5 | This code comes from strip_sgml.py of
6 | nematus project (https://github.com/rsennrich/nematus)
7 | '''
8 |
9 | def main():
10 | fin = sys.stdin
11 | fout = sys.stdout
12 | for l in fin:
13 | line = l.strip()
14 | text = re.sub('<[^<]+>', "", line).strip()
15 | if len(text) == 0:
16 | continue
17 | print >>fout, text
18 |
19 |
20 | if __name__ == "__main__":
21 | main()
22 |
23 |
--------------------------------------------------------------------------------
/data/subword_nmt/README.md:
--------------------------------------------------------------------------------
1 | Preprocessing scripts to learn and apply subword units
2 | (https://github.com/rsennrich/subword-nmt)
3 |
--------------------------------------------------------------------------------
/data/subword_nmt/apply_bpe.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 | # Author: Rico Sennrich
4 |
5 | """Use operations learned with learn_bpe.py to encode a new text.
6 | The text will not be smaller, but use only a fixed vocabulary, with rare words
7 | encoded as variable-length sequences of subword units.
8 |
9 | Reference:
10 | Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units.
11 | Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
12 | """
13 |
14 | from __future__ import unicode_literals, division
15 |
16 | import sys
17 | import codecs
18 | import argparse
19 | from collections import defaultdict
20 |
21 | # hack for python2/3 compatibility
22 | from io import open
23 | argparse.open = open
24 |
25 | # python 2/3 compatibility
26 | if sys.version_info < (3, 0):
27 | sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
28 | sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
29 | sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
30 |
31 | class BPE(object):
32 |
33 | def __init__(self, codes, separator='@@'):
34 | self.bpe_codes = [tuple(item.split()) for item in codes]
35 | # some hacking to deal with duplicates (only consider first instance)
36 | self.bpe_codes = dict([(code,i) for (i,code) in reversed(list(enumerate(self.bpe_codes)))])
37 |
38 | self.separator = separator
39 |
40 | def segment(self, sentence):
41 | """segment single sentence (whitespace-tokenized string) with BPE encoding"""
42 |
43 | output = []
44 | for word in sentence.split():
45 | new_word = encode(word, self.bpe_codes)
46 |
47 | for item in new_word[:-1]:
48 | output.append(item + self.separator)
49 | output.append(new_word[-1])
50 |
51 | return ' '.join(output)
52 |
53 | def create_parser():
54 | parser = argparse.ArgumentParser(
55 | formatter_class=argparse.RawDescriptionHelpFormatter,
56 | description="apply BPE-based word segmentation")
57 |
58 | parser.add_argument(
59 | '--input', '-i', type=argparse.FileType('r'), default=sys.stdin,
60 | metavar='PATH',
61 | help="Input file (default: standard input).")
62 | parser.add_argument(
63 | '--codes', '-c', type=argparse.FileType('r'), metavar='PATH',
64 | required=True,
65 | help="File with BPE codes (created by learn_bpe.py).")
66 | parser.add_argument(
67 | '--output', '-o', type=argparse.FileType('w'), default=sys.stdout,
68 | metavar='PATH',
69 | help="Output file (default: standard output)")
70 | parser.add_argument(
71 | '--separator', '-s', type=str, default='@@', metavar='STR',
72 | help="Separator between non-final subword units (default: '%(default)s'))")
73 |
74 | return parser
75 |
76 | def get_pairs(word):
77 | """Return set of symbol pairs in a word.
78 |
79 | word is represented as tuple of symbols (symbols being variable-length strings)
80 | """
81 | pairs = set()
82 | prev_char = word[0]
83 | for char in word[1:]:
84 | pairs.add((prev_char, char))
85 | prev_char = char
86 | return pairs
87 |
88 | def encode(orig, bpe_codes, cache={}):
89 | """Encode word based on list of BPE merge operations, which are applied consecutively
90 | """
91 |
92 | if orig in cache:
93 | return cache[orig]
94 |
95 | word = tuple(orig) + ('</w>',)
96 | pairs = get_pairs(word)
97 |
98 | while True:
99 | bigram = min(pairs, key = lambda pair: bpe_codes.get(pair, float('inf')))
100 | if bigram not in bpe_codes:
101 | break
102 | first, second = bigram
103 | new_word = []
104 | i = 0
105 | while i < len(word):
106 | try:
107 | j = word.index(first, i)
108 | new_word.extend(word[i:j])
109 | i = j
110 | except:
111 | new_word.extend(word[i:])
112 | break
113 |
114 | if word[i] == first and i < len(word)-1 and word[i+1] == second:
115 | new_word.append(first+second)
116 | i += 2
117 | else:
118 | new_word.append(word[i])
119 | i += 1
120 | new_word = tuple(new_word)
121 | word = new_word
122 | if len(word) == 1:
123 | break
124 | else:
125 | pairs = get_pairs(word)
126 |
127 | # don't print end-of-word symbols
128 | if word[-1] == '</w>':
129 | word = word[:-1]
130 | elif word[-1].endswith('</w>'):
131 | word = word[:-1] + (word[-1].replace('</w>',''),)
132 |
133 | cache[orig] = word
134 | return word
135 |
136 |
137 | if __name__ == '__main__':
138 | parser = create_parser()
139 | args = parser.parse_args()
140 |
141 | bpe = BPE(args.codes, args.separator)
142 |
143 | for line in args.input:
144 | args.output.write(bpe.segment(line).strip())
145 | args.output.write('\n')
146 |
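A minimal sketch of calling the `BPE` class directly from Python, assuming a hypothetical merge-operations file `codes.bpe` produced by learn_bpe.py:

```python
import codecs
from apply_bpe import BPE   # assumes this script is importable from the working directory

# 'codes.bpe' is a placeholder for a file of learned merge operations.
with codecs.open('codes.bpe', encoding='utf-8') as codes:
    bpe = BPE(codes, separator='@@')

# Rare words come out as several subword units joined by '@@ ';
# frequent words are usually left intact.
print(bpe.segment('the unrelatedness of subword units'))
```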
--------------------------------------------------------------------------------
/data/subword_nmt/chrF.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 | # Author: Rico Sennrich
4 |
5 | """Compute chrF3 for machine translation evaluation
6 |
7 | Reference:
8 | Maja Popović (2015). chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal.
9 | """
10 |
11 | from __future__ import print_function, unicode_literals, division
12 | import sys
13 | import codecs
14 | import io
15 | import argparse
16 | from collections import defaultdict
17 | from math import log, exp
18 |
19 | # hack for python2/3 compatibility
20 | from io import open
21 | argparse.open = open
22 |
23 | # python 2/3 compatibility
24 | if sys.version_info < (3, 0):
25 | sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
26 | sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
27 | sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
28 |
29 |
30 | def create_parser():
31 | parser = argparse.ArgumentParser(
32 | formatter_class=argparse.RawDescriptionHelpFormatter,
33 | description="compute chrF3 for machine translation evaluation")
34 |
35 | parser.add_argument(
36 | '--ref', '-r', type=argparse.FileType('r'), required=True,
37 | metavar='PATH',
38 | help="Reference file")
39 | parser.add_argument(
40 | '--hyp', type=argparse.FileType('r'), metavar='PATH',
41 | default=sys.stdin,
42 | help="Hypothesis file (default: stdin).")
43 | parser.add_argument(
44 | '--beta', '-b', type=float, default=3,
45 | metavar='FLOAT',
46 | help="beta parameter (default: '%(default)s')")
47 | parser.add_argument(
48 | '--ngram', '-n', type=int, default=6,
49 | metavar='INT',
50 | help="ngram order (default: '%(default)s')")
51 | parser.add_argument(
52 | '--space', '-s', action='store_true',
53 | help="take spaces into account (default: '%(default)s')")
54 | parser.add_argument(
55 | '--precision', action='store_true',
56 | help="report precision (default: '%(default)s')")
57 | parser.add_argument(
58 | '--recall', action='store_true',
59 | help="report recall (default: '%(default)s')")
60 |
61 | return parser
62 |
63 | def extract_ngrams(words, max_length=4, spaces=False):
64 |
65 | if not spaces:
66 | words = ''.join(words.split())
67 | else:
68 | words = words.strip()
69 |
70 | results = defaultdict(lambda: defaultdict(int))
71 | for length in range(max_length):
72 | for start_pos in range(len(words)):
73 | end_pos = start_pos + length + 1
74 | if end_pos <= len(words):
75 | results[length][tuple(words[start_pos: end_pos])] += 1
76 | return results
77 |
78 |
79 | def get_correct(ngrams_ref, ngrams_test, correct, total):
80 |
81 | for rank in ngrams_test:
82 | for chain in ngrams_test[rank]:
83 | total[rank] += ngrams_test[rank][chain]
84 | if chain in ngrams_ref[rank]:
85 | correct[rank] += min(ngrams_test[rank][chain], ngrams_ref[rank][chain])
86 |
87 | return correct, total
88 |
89 |
90 | def f1(correct, total_hyp, total_ref, max_length, beta=3, smooth=0):
91 |
92 | precision = 0
93 | recall = 0
94 |
95 | for i in range(max_length):
96 | if total_hyp[i] + smooth and total_ref[i] + smooth:
97 | precision += (correct[i] + smooth) / (total_hyp[i] + smooth)
98 | recall += (correct[i] + smooth) / (total_ref[i] + smooth)
99 |
100 | precision /= max_length
101 | recall /= max_length
102 |
103 | return (1 + beta**2) * (precision*recall) / ((beta**2 * precision) + recall), precision, recall
104 |
105 | def main(args):
106 |
107 | correct = [0]*args.ngram
108 | total = [0]*args.ngram
109 | total_ref = [0]*args.ngram
110 | for line in args.ref:
111 | line2 = args.hyp.readline()
112 |
113 | ngrams_ref = extract_ngrams(line, max_length=args.ngram, spaces=args.space)
114 | ngrams_test = extract_ngrams(line2, max_length=args.ngram, spaces=args.space)
115 |
116 | get_correct(ngrams_ref, ngrams_test, correct, total)
117 |
118 | for rank in ngrams_ref:
119 | for chain in ngrams_ref[rank]:
120 | total_ref[rank] += ngrams_ref[rank][chain]
121 |
122 | chrf, precision, recall = f1(correct, total, total_ref, args.ngram, args.beta)
123 |
124 | print('chrF3: {0:.4f}'.format(chrf))
125 | if args.precision:
126 | print('chrPrec: {0:.4f}'.format(precision))
127 | if args.recall:
128 | print('chrRec: {0:.4f}'.format(recall))
129 |
130 | if __name__ == '__main__':
131 |
132 | parser = create_parser()
133 | args = parser.parse_args()
134 |
135 | main(args)
136 |
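The score averages character n-gram precision and recall over all orders (n = 1..6 by default) and combines them with beta = 3, which weights recall nine times as heavily as precision. A toy illustration of the F-beta step, with hypothetical averaged values:

```python
beta = 3.0
precision, recall = 0.70, 0.55     # hypothetical averaged n-gram precision / recall

# Same combination as in f1() above: (1 + beta^2) * P * R / (beta^2 * P + R)
chrf = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)
print('chrF3: {0:.4f}'.format(chrf))
```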
--------------------------------------------------------------------------------
/data/subword_nmt/learn_bpe.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 | # Author: Rico Sennrich
4 |
5 | """Use byte pair encoding (BPE) to learn a variable-length encoding of the vocabulary in a text.
6 | Unlike the original BPE, it does not compress the plain text, but can be used to reduce the vocabulary
7 | of a text to a configurable number of symbols, with only a small increase in the number of tokens.
8 |
9 | Reference:
10 | Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units.
11 | Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
12 | """
13 |
14 | from __future__ import unicode_literals
15 |
16 | import sys
17 | import codecs
18 | import re
19 | import copy
20 | import argparse
21 | from collections import defaultdict, Counter
22 |
23 | # hack for python2/3 compatibility
24 | from io import open
25 | argparse.open = open
26 |
27 | # python 2/3 compatibility
28 | if sys.version_info < (3, 0):
29 | sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
30 | sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
31 | sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
32 |
33 | def create_parser():
34 | parser = argparse.ArgumentParser(
35 | formatter_class=argparse.RawDescriptionHelpFormatter,
36 | description="learn BPE-based word segmentation")
37 |
38 | parser.add_argument(
39 | '--input', '-i', type=argparse.FileType('r'), default=sys.stdin,
40 | metavar='PATH',
41 | help="Input text (default: standard input).")
42 | parser.add_argument(
43 | '--output', '-o', type=argparse.FileType('w'), default=sys.stdout,
44 | metavar='PATH',
45 | help="Output file for BPE codes (default: standard output)")
46 | parser.add_argument(
47 | '--symbols', '-s', type=int, default=10000,
48 | help="Create this many new symbols (each representing a character n-gram) (default: %(default)s))")
49 | parser.add_argument(
50 | '--verbose', '-v', action="store_true",
51 | help="verbose mode.")
52 |
53 | return parser
54 |
55 | def get_vocabulary(fobj):
56 | """Read text and return dictionary that encodes vocabulary
57 | """
58 | vocab = Counter()
59 | for line in fobj:
60 | for word in line.split():
61 | vocab[word] += 1
62 | return vocab
63 |
64 | def update_pair_statistics(pair, changed, stats, indices):
65 | """Minimally update the indices and frequency of symbol pairs
66 |
67 | if we merge a pair of symbols, only pairs that overlap with occurrences
68 | of this pair are affected, and need to be updated.
69 | """
70 | stats[pair] = 0
71 | indices[pair] = defaultdict(int)
72 | first, second = pair
73 | new_pair = first+second
74 | for j, word, old_word, freq in changed:
75 |
76 | # find all instances of pair, and update frequency/indices around it
77 | i = 0
78 | while True:
79 | try:
80 | i = old_word.index(first, i)
81 | except ValueError:
82 | break
83 | if i < len(old_word)-1 and old_word[i+1] == second:
84 | if i:
85 | prev = old_word[i-1:i+1]
86 | stats[prev] -= freq
87 | indices[prev][j] -= 1
88 | if i < len(old_word)-2:
89 | # don't double-count consecutive pairs
90 | if old_word[i+2] != first or i >= len(old_word)-3 or old_word[i+3] != second:
91 | nex = old_word[i+1:i+3]
92 | stats[nex] -= freq
93 | indices[nex][j] -= 1
94 | i += 2
95 | else:
96 | i += 1
97 |
98 | i = 0
99 | while True:
100 | try:
101 | i = word.index(new_pair, i)
102 | except ValueError:
103 | break
104 | if i:
105 | prev = word[i-1:i+1]
106 | stats[prev] += freq
107 | indices[prev][j] += 1
108 | # don't double-count consecutive pairs
109 | if i < len(word)-1 and word[i+1] != new_pair:
110 | nex = word[i:i+2]
111 | stats[nex] += freq
112 | indices[nex][j] += 1
113 | i += 1
114 |
115 |
116 | def get_pair_statistics(vocab):
117 | """Count frequency of all symbol pairs, and create index"""
118 |
119 | # data structure of pair frequencies
120 | stats = defaultdict(int)
121 |
122 | #index from pairs to words
123 | indices = defaultdict(lambda: defaultdict(int))
124 |
125 | for i, (word, freq) in enumerate(vocab):
126 | prev_char = word[0]
127 | for char in word[1:]:
128 | stats[prev_char, char] += freq
129 | indices[prev_char, char][i] += 1
130 | prev_char = char
131 |
132 | return stats, indices
133 |
134 |
135 | def replace_pair(pair, vocab, indices):
136 | """Replace all occurrences of a symbol pair ('A', 'B') with a new symbol 'AB'"""
137 | first, second = pair
138 | pair_str = ''.join(pair)
139 | pair_str = pair_str.replace('\\','\\\\')
140 | changes = []
137 | pattern = re.compile(r'(?<!\S)' + re.escape(first + ' ' + second) + r'(?!\S)')
180 | vocab = dict([(tuple(x)+('</w>',) ,y) for (x,y) in vocab.items()])
181 | sorted_vocab = sorted(vocab.items(), key=lambda x: x[1], reverse=True)
182 |
183 | stats, indices = get_pair_statistics(sorted_vocab)
184 | big_stats = copy.deepcopy(stats)
185 | # threshold is inspired by Zipfian assumption, but should only affect speed
186 | threshold = max(stats.values()) / 10
187 | for i in range(args.symbols):
188 | if stats:
189 | most_frequent = max(stats, key=stats.get)
190 |
191 | # we probably missed the best pair because of pruning; go back to full statistics
192 | if not stats or (i and stats[most_frequent] < threshold):
193 | prune_stats(stats, big_stats, threshold)
194 | stats = copy.deepcopy(big_stats)
195 | most_frequent = max(stats, key=stats.get)
196 | # threshold is inspired by Zipfian assumption, but should only affect speed
197 | threshold = stats[most_frequent] * i/(i+10000.0)
198 | prune_stats(stats, big_stats, threshold)
199 |
200 | if stats[most_frequent] < 2:
201 | sys.stderr.write('no pair has frequency > 1. Stopping\n')
202 | break
203 |
204 | if args.verbose:
205 | sys.stderr.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\n'.format(i, most_frequent[0], most_frequent[1], stats[most_frequent]))
206 | args.output.write('{0} {1}\n'.format(*most_frequent))
207 | changes = replace_pair(most_frequent, sorted_vocab, indices)
208 | update_pair_statistics(most_frequent, changes, stats, indices)
209 | stats[most_frequent] = 0
210 | if not i % 100:
211 | prune_stats(stats, big_stats, threshold)
212 |
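At each iteration the script counts adjacent symbol pairs weighted by word frequency and merges the most frequent pair into a new symbol. A stripped-down sketch of one such counting step on a hypothetical two-word vocabulary (not the script itself):

```python
from collections import Counter

# words are tuples of symbols ending in the end-of-word marker '</w>'
vocab = {('l', 'o', 'w', '</w>'): 5, ('l', 'o', 'w', 'e', 'r', '</w>'): 2}

pairs = Counter()
for word, freq in vocab.items():
    for left, right in zip(word, word[1:]):
        pairs[(left, right)] += freq

best = max(pairs, key=pairs.get)
print(best, pairs[best])   # ('l', 'o') or ('o', 'w'); both occur 7 times here
```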
--------------------------------------------------------------------------------
/data/subword_nmt/segment-char-ngrams.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 | # Author: Rico Sennrich
4 |
5 | from __future__ import unicode_literals, division
6 |
7 | import sys
8 | import codecs
9 | import argparse
10 |
11 | # hack for python2/3 compatibility
12 | from io import open
13 | argparse.open = open
14 |
15 | # python 2/3 compatibility
16 | if sys.version_info < (3, 0):
17 | sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
18 | sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
19 | sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
20 |
21 | def create_parser():
22 | parser = argparse.ArgumentParser(
23 | formatter_class=argparse.RawDescriptionHelpFormatter,
24 | description="segment rare words into character n-grams")
25 |
26 | parser.add_argument(
27 | '--input', '-i', type=argparse.FileType('r'), default=sys.stdin,
28 | metavar='PATH',
29 | help="Input file (default: standard input).")
30 | parser.add_argument(
31 | '--vocab', type=argparse.FileType('r'), metavar='PATH',
32 | required=True,
33 | help="Vocabulary file.")
34 | parser.add_argument(
35 | '--shortlist', type=int, metavar='INT', default=0,
36 | help="do not segment INT most frequent words in vocabulary (default: '%(default)s')).")
37 | parser.add_argument(
38 | '-n', type=int, metavar='INT', default=2,
39 | help="segment rare words into character n-grams of size INT (default: '%(default)s')).")
40 | parser.add_argument(
41 | '--output', '-o', type=argparse.FileType('w'), default=sys.stdout,
42 | metavar='PATH',
43 | help="Output file (default: standard output)")
44 | parser.add_argument(
45 | '--separator', '-s', type=str, default='@@', metavar='STR',
46 | help="Separator between non-final subword units (default: '%(default)s'))")
47 |
48 | return parser
49 |
50 |
51 | if __name__ == '__main__':
52 |
53 | parser = create_parser()
54 | args = parser.parse_args()
55 |
56 | vocab = [line.split()[0] for line in args.vocab if len(line.split()) == 2]
57 | vocab = dict((y,x) for (x,y) in enumerate(vocab))
58 |
59 | for line in args.input:
60 | for word in line.split():
61 | if word not in vocab or vocab[word] > args.shortlist:
62 | i = 0
63 | while i*args.n < len(word):
64 | args.output.write(word[i*args.n:i*args.n+args.n])
65 | i += 1
66 | if i*args.n < len(word):
67 | args.output.write(args.separator)
68 | args.output.write(' ')
69 | else:
70 | args.output.write(word + ' ')
71 | args.output.write('\n')
72 |
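Words outside the vocabulary shortlist are rewritten as fixed-size character n-grams, with the separator appended to every non-final chunk. A small sketch of that splitting rule on a hypothetical word:

```python
def split_rare(word, n=2, separator='@@'):
    # chop the word into chunks of n characters and mark all non-final chunks
    chunks = [word[i:i + n] for i in range(0, len(word), n)]
    return (separator + ' ').join(chunks)

print(split_rare('tokenization', n=3))   # tok@@ eni@@ zat@@ ion
```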
--------------------------------------------------------------------------------
/data/tokenizer.perl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env perl
2 | #
3 | # This file is part of moses. Its use is licensed under the GNU Lesser General
4 | # Public License version 2.1 or, at your option, any later version.
5 |
6 | use warnings;
7 |
8 | # Sample Tokenizer
9 | ### Version 1.1
10 | # written by Pidong Wang, based on the code written by Josh Schroeder and Philipp Koehn
11 | # Version 1.1 updates:
12 | # (1) add multithreading option "-threads NUM_THREADS" (default is 1);
13 | # (2) add a timing option "-time" to calculate the average speed of this tokenizer;
14 | # (3) add an option "-lines NUM_SENTENCES_PER_THREAD" to set the number of lines for each thread (default is 2000), and this option controls the memory amount needed: the larger this number is, the larger memory is required (the higher tokenization speed);
15 | ### Version 1.0
16 | # $Id: tokenizer.perl 915 2009-08-10 08:15:49Z philipp $
17 | # written by Josh Schroeder, based on code by Philipp Koehn
18 |
19 | binmode(STDIN, ":utf8");
20 | binmode(STDOUT, ":utf8");
21 |
22 | use warnings;
23 | use FindBin qw($RealBin);
24 | use strict;
25 | use Time::HiRes;
26 |
27 | if (eval {require Thread;1;}) {
28 | #module loaded
29 | Thread->import();
30 | }
31 |
32 | my $mydir = "$RealBin/nonbreaking_prefixes";
33 |
34 | my %NONBREAKING_PREFIX = ();
35 | my @protected_patterns = ();
36 | my $protected_patterns_file = "";
37 | my $language = "en";
38 | my $QUIET = 0;
39 | my $HELP = 0;
40 | my $AGGRESSIVE = 0;
41 | my $SKIP_XML = 0;
42 | my $TIMING = 0;
43 | my $NUM_THREADS = 1;
44 | my $NUM_SENTENCES_PER_THREAD = 2000;
45 | my $PENN = 0;
46 | my $NO_ESCAPING = 0;
47 | while (@ARGV)
48 | {
49 | $_ = shift;
50 | /^-b$/ && ($| = 1, next);
51 | /^-l$/ && ($language = shift, next);
52 | /^-q$/ && ($QUIET = 1, next);
53 | /^-h$/ && ($HELP = 1, next);
54 | /^-x$/ && ($SKIP_XML = 1, next);
55 | /^-a$/ && ($AGGRESSIVE = 1, next);
56 | /^-time$/ && ($TIMING = 1, next);
57 | # Option to add list of regexps to be protected
58 | /^-protected/ && ($protected_patterns_file = shift, next);
59 | /^-threads$/ && ($NUM_THREADS = int(shift), next);
60 | /^-lines$/ && ($NUM_SENTENCES_PER_THREAD = int(shift), next);
61 | /^-penn$/ && ($PENN = 1, next);
62 | /^-no-escape/ && ($NO_ESCAPING = 1, next);
63 | }
64 |
65 | # for time calculation
66 | my $start_time;
67 | if ($TIMING)
68 | {
69 | $start_time = [ Time::HiRes::gettimeofday( ) ];
70 | }
71 |
72 | # print help message
73 | if ($HELP)
74 | {
75 | print "Usage ./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile\n";
76 | print "Options:\n";
77 | print " -q ... quiet.\n";
78 | print " -a ... aggressive hyphen splitting.\n";
79 | print " -b ... disable Perl buffering.\n";
80 | print " -time ... enable processing time calculation.\n";
81 | print " -penn ... use Penn treebank-like tokenization.\n";
82 | print " -protected FILE ... specify file with patters to be protected in tokenisation.\n";
83 | print " -no-escape ... don't perform HTML escaping on apostrophy, quotes, etc.\n";
84 | exit;
85 | }
86 |
87 | if (!$QUIET)
88 | {
89 | print STDERR "Tokenizer Version 1.1\n";
90 | print STDERR "Language: $language\n";
91 | print STDERR "Number of threads: $NUM_THREADS\n";
92 | }
93 |
94 | # load the language-specific non-breaking prefix info from files in the directory nonbreaking_prefixes
95 | load_prefixes($language,\%NONBREAKING_PREFIX);
96 |
97 | if (scalar(%NONBREAKING_PREFIX) eq 0)
98 | {
99 | print STDERR "Warning: No known abbreviations for language '$language'\n";
100 | }
101 |
102 | # Load protected patterns
103 | if ($protected_patterns_file)
104 | {
105 | open(PP,$protected_patterns_file) || die "Unable to open $protected_patterns_file";
106 | while(<PP>) {
107 | chomp;
108 | push @protected_patterns, $_;
109 | }
110 | }
111 |
112 | my @batch_sentences = ();
113 | my @thread_list = ();
114 | my $count_sentences = 0;
115 |
116 | if ($NUM_THREADS > 1)
117 | {# multi-threading tokenization
118 | while(<STDIN>)
119 | {
120 | $count_sentences = $count_sentences + 1;
121 | push(@batch_sentences, $_);
122 | if (scalar(@batch_sentences)>=($NUM_SENTENCES_PER_THREAD*$NUM_THREADS))
123 | {
124 | # assign each thread work
125 | for (my $i=0; $i<$NUM_THREADS; $i++)
126 | {
127 | my $start_index = $i*$NUM_SENTENCES_PER_THREAD;
128 | my $end_index = $start_index+$NUM_SENTENCES_PER_THREAD-1;
129 | my @subbatch_sentences = @batch_sentences[$start_index..$end_index];
130 | my $new_thread = new Thread \&tokenize_batch, @subbatch_sentences;
131 | push(@thread_list, $new_thread);
132 | }
133 | foreach (@thread_list)
134 | {
135 | my $tokenized_list = $_->join;
136 | foreach (@$tokenized_list)
137 | {
138 | print $_;
139 | }
140 | }
141 | # reset for the new run
142 | @thread_list = ();
143 | @batch_sentences = ();
144 | }
145 | }
146 | # the last batch
147 | if (scalar(@batch_sentences)>0)
148 | {
149 | # assign each thread work
150 | for (my $i=0; $i<$NUM_THREADS; $i++)
151 | {
152 | my $start_index = $i*$NUM_SENTENCES_PER_THREAD;
153 | if ($start_index >= scalar(@batch_sentences))
154 | {
155 | last;
156 | }
157 | my $end_index = $start_index+$NUM_SENTENCES_PER_THREAD-1;
158 | if ($end_index >= scalar(@batch_sentences))
159 | {
160 | $end_index = scalar(@batch_sentences)-1;
161 | }
162 | my @subbatch_sentences = @batch_sentences[$start_index..$end_index];
163 | my $new_thread = new Thread \&tokenize_batch, @subbatch_sentences;
164 | push(@thread_list, $new_thread);
165 | }
166 | foreach (@thread_list)
167 | {
168 | my $tokenized_list = $_->join;
169 | foreach (@$tokenized_list)
170 | {
171 | print $_;
172 | }
173 | }
174 | }
175 | }
176 | else
177 | {# single thread only
178 | while(<STDIN>)
179 | {
180 | if (($SKIP_XML && /^<.+>$/) || /^\s*$/)
181 | {
182 | #don't try to tokenize XML/HTML tag lines
183 | print $_;
184 | }
185 | else
186 | {
187 | print &tokenize($_);
188 | }
189 | }
190 | }
191 |
192 | if ($TIMING)
193 | {
194 | my $duration = Time::HiRes::tv_interval( $start_time );
195 | print STDERR ("TOTAL EXECUTION TIME: ".$duration."\n");
196 | print STDERR ("TOKENIZATION SPEED: ".($duration/$count_sentences*1000)." milliseconds/line\n");
197 | }
198 |
199 | #####################################################################################
200 | # subroutines afterward
201 |
202 | # tokenize a batch of texts saved in an array
203 | # input: an array containing a batch of texts
204 | # return: another array containing a batch of tokenized texts for the input array
205 | sub tokenize_batch
206 | {
207 | my(@text_list) = @_;
208 | my(@tokenized_list) = ();
209 | foreach (@text_list)
210 | {
211 | if (($SKIP_XML && /^<.+>$/) || /^\s*$/)
212 | {
213 | #don't try to tokenize XML/HTML tag lines
214 | push(@tokenized_list, $_);
215 | }
216 | else
217 | {
218 | push(@tokenized_list, &tokenize($_));
219 | }
220 | }
221 | return \@tokenized_list;
222 | }
223 |
224 | # the actual tokenize function which tokenizes one input string
225 | # input: one string
226 | # return: the tokenized string for the input string
227 | sub tokenize
228 | {
229 | my($text) = @_;
230 |
231 | if ($PENN) {
232 | return tokenize_penn($text);
233 | }
234 |
235 | chomp($text);
236 | $text = " $text ";
237 |
238 | # remove ASCII junk
239 | $text =~ s/\s+/ /g;
240 | $text =~ s/[\000-\037]//g;
241 |
242 | # Find protected patterns
243 | my @protected = ();
244 | foreach my $protected_pattern (@protected_patterns) {
245 | my $t = $text;
246 | while ($t =~ /($protected_pattern)(.*)$/) {
247 | push @protected, $1;
248 | $t = $2;
249 | }
250 | }
251 |
252 | for (my $i = 0; $i < scalar(@protected); ++$i) {
253 | my $subst = sprintf("THISISPROTECTED%.3d", $i);
254 | $text =~ s,\Q$protected[$i], $subst ,g;
255 | }
256 | $text =~ s/ +/ /g;
257 | $text =~ s/^ //g;
258 | $text =~ s/ $//g;
259 |
260 | # separate out all "other" special characters
261 | $text =~ s/([^\p{IsAlnum}\s\.\'\`\,\-])/ $1 /g;
262 |
263 | # aggressive hyphen splitting
264 | if ($AGGRESSIVE)
265 | {
266 | $text =~ s/([\p{IsAlnum}])\-(?=[\p{IsAlnum}])/$1 \@-\@ /g;
267 | }
268 |
269 | #multi-dots stay together
270 | $text =~ s/\.([\.]+)/ DOTMULTI$1/g;
271 | while($text =~ /DOTMULTI\./)
272 | {
273 | $text =~ s/DOTMULTI\.([^\.])/DOTDOTMULTI $1/g;
274 | $text =~ s/DOTMULTI\./DOTDOTMULTI/g;
275 | }
276 |
277 | # separate out "," except if within numbers (5,300)
278 | #$text =~ s/([^\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g;
279 |
280 | # separate out "," except if within numbers (5,300)
281 | # previous "global" application skips some: A,B,C,D,E > A , B,C , D,E
282 | # first application uses up B so rule can't see B,C
283 | # two-step version here may create extra spaces but these are removed later
284 | # will also space digit,letter or letter,digit forms (redundant with next section)
285 | $text =~ s/([^\p{IsN}])[,]/$1 , /g;
286 | $text =~ s/[,]([^\p{IsN}])/ , $1/g;
287 |
288 | # separate , pre and post number
289 | #$text =~ s/([\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g;
290 | #$text =~ s/([^\p{IsN}])[,]([\p{IsN}])/$1 , $2/g;
291 |
292 | # turn `into '
293 | #$text =~ s/\`/\'/g;
294 |
295 | #turn '' into "
296 | #$text =~ s/\'\'/ \" /g;
297 |
298 | if ($language eq "en")
299 | {
300 | #split contractions right
301 | $text =~ s/([^\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g;
302 | $text =~ s/([^\p{IsAlpha}\p{IsN}])[']([\p{IsAlpha}])/$1 ' $2/g;
303 | $text =~ s/([\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g;
304 | $text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 '$2/g;
305 | #special case for "1990's"
306 | $text =~ s/([\p{IsN}])[']([s])/$1 '$2/g;
307 | }
308 | elsif (($language eq "fr") or ($language eq "it"))
309 | {
310 | #split contractions left
311 | $text =~ s/([^\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g;
312 | $text =~ s/([^\p{IsAlpha}])[']([\p{IsAlpha}])/$1 ' $2/g;
313 | $text =~ s/([\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g;
314 | $text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1' $2/g;
315 | }
316 | else
317 | {
318 | $text =~ s/\'/ \' /g;
319 | }
320 |
321 | #word token method
322 | my @words = split(/\s/,$text);
323 | $text = "";
324 | for (my $i=0;$i<(scalar(@words));$i++)
325 | {
326 | my $word = $words[$i];
327 | if ( $word =~ /^(\S+)\.$/)
328 | {
329 | my $pre = $1;
330 | if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)))
370 | $text =~ s/\>/\&gt;/g; # xml
371 | $text =~ s/\'/\&apos;/g; # xml
372 | $text =~ s/\"/\&quot;/g; # xml
373 | $text =~ s/\[/\&#91;/g; # syntax non-terminal
374 | $text =~ s/\]/\&#93;/g; # syntax non-terminal
375 | }
376 |
377 | #ensure final line break
378 | $text .= "\n" unless $text =~ /\n$/;
379 |
380 | return $text;
381 | }
382 |
383 | sub tokenize_penn
384 | {
385 | # Improved compatibility with Penn Treebank tokenization. Useful if
386 | # the text is to later be parsed with a PTB-trained parser.
387 | #
388 | # Adapted from Robert MacIntyre's sed script:
389 | # http://www.cis.upenn.edu/~treebank/tokenizer.sed
390 |
391 | my($text) = @_;
392 | chomp($text);
393 |
394 | # remove ASCII junk
395 | $text =~ s/\s+/ /g;
396 | $text =~ s/[\000-\037]//g;
397 |
398 | # attempt to get correct directional quotes
399 | $text =~ s/^``/`` /g;
400 | $text =~ s/^"/`` /g;
401 | $text =~ s/^`([^`])/` $1/g;
402 | $text =~ s/^'/` /g;
403 | $text =~ s/([ ([{<])"/$1 `` /g;
404 | $text =~ s/([ ([{<])``/$1 `` /g;
405 | $text =~ s/([ ([{<])`([^`])/$1 ` $2/g;
406 | $text =~ s/([ ([{<])'/$1 ` /g;
407 | # close quotes handled at end
408 |
409 | $text =~ s=\.\.\.= _ELLIPSIS_ =g;
410 |
411 | # separate out "," except if within numbers (5,300)
412 | $text =~ s/([^\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g;
413 | # separate , pre and post number
414 | $text =~ s/([\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g;
415 | $text =~ s/([^\p{IsN}])[,]([\p{IsN}])/$1 , $2/g;
416 |
417 | #$text =~ s=([;:@#\$%&\p{IsSc}])= $1 =g;
418 | $text =~ s=([;:@#\$%&\p{IsSc}\p{IsSo}])= $1 =g;
419 |
420 | # Separate out intra-token slashes. PTB tokenization doesn't do this, so
421 | # the tokens should be merged prior to parsing with a PTB-trained parser
422 | # (see syntax-hyphen-splitting.perl).
423 | $text =~ s/([\p{IsAlnum}])\/([\p{IsAlnum}])/$1 \@\/\@ $2/g;
424 |
425 | # Assume sentence tokenization has been done first, so split FINAL periods
426 | # only.
427 | $text =~ s=([^.])([.])([\]\)}>"']*) ?$=$1 $2$3 =g;
428 | # however, we may as well split ALL question marks and exclamation points,
429 | # since they shouldn't have the abbrev.-marker ambiguity problem
430 | $text =~ s=([?!])= $1 =g;
431 |
432 | # parentheses, brackets, etc.
433 | $text =~ s=([\]\[\(\){}<>])= $1 =g;
434 | $text =~ s/\(/-LRB-/g;
435 | $text =~ s/\)/-RRB-/g;
436 | $text =~ s/\[/-LSB-/g;
437 | $text =~ s/\]/-RSB-/g;
438 | $text =~ s/{/-LCB-/g;
439 | $text =~ s/}/-RCB-/g;
440 |
441 | $text =~ s=--= -- =g;
442 |
443 | # First off, add a space to the beginning and end of each line, to reduce
444 | # necessary number of regexps.
445 | $text =~ s=$= =;
446 | $text =~ s=^= =;
447 |
448 | $text =~ s="= '' =g;
449 | # possessive or close-single-quote
450 | $text =~ s=([^'])' =$1 ' =g;
451 | # as in it's, I'm, we'd
452 | $text =~ s='([sSmMdD]) = '$1 =g;
453 | $text =~ s='ll = 'll =g;
454 | $text =~ s='re = 're =g;
455 | $text =~ s='ve = 've =g;
456 | $text =~ s=n't = n't =g;
457 | $text =~ s='LL = 'LL =g;
458 | $text =~ s='RE = 'RE =g;
459 | $text =~ s='VE = 'VE =g;
460 | $text =~ s=N'T = N'T =g;
461 |
462 | $text =~ s= ([Cc])annot = $1an not =g;
463 | $text =~ s= ([Dd])'ye = $1' ye =g;
464 | $text =~ s= ([Gg])imme = $1im me =g;
465 | $text =~ s= ([Gg])onna = $1on na =g;
466 | $text =~ s= ([Gg])otta = $1ot ta =g;
467 | $text =~ s= ([Ll])emme = $1em me =g;
468 | $text =~ s= ([Mm])ore'n = $1ore 'n =g;
469 | $text =~ s= '([Tt])is = '$1 is =g;
470 | $text =~ s= '([Tt])was = '$1 was =g;
471 | $text =~ s= ([Ww])anna = $1an na =g;
472 |
473 | #word token method
474 | my @words = split(/\s/,$text);
475 | $text = "";
476 | for (my $i=0;$i<(scalar(@words));$i++)
477 | {
478 | my $word = $words[$i];
479 | if ( $word =~ /^(\S+)\.$/)
480 | {
481 | my $pre = $1;
482 | if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)))
510 | $text =~ s/\>/\&gt;/g; # xml
511 | $text =~ s/\'/\&apos;/g; # xml
512 | $text =~ s/\"/\&quot;/g; # xml
513 | $text =~ s/\[/\&#91;/g; # syntax non-terminal
514 | $text =~ s/\]/\&#93;/g; # syntax non-terminal
515 |
516 | #ensure final line break
517 | $text .= "\n" unless $text =~ /\n$/;
518 |
519 | return $text;
520 | }
521 |
522 | sub load_prefixes
523 | {
524 | my ($language, $PREFIX_REF) = @_;
525 |
526 | my $prefixfile = "$mydir/nonbreaking_prefix.$language";
527 |
528 | #default back to English if we don't have a language-specific prefix file
529 | if (!(-e $prefixfile))
530 | {
531 | $prefixfile = "$mydir/nonbreaking_prefix.en";
532 | print STDERR "WARNING: No known abbreviations for language '$language', attempting fall-back to English version...\n";
533 | die ("ERROR: No abbreviations files found in $mydir\n") unless (-e $prefixfile);
534 | }
535 |
536 | if (-e "$prefixfile")
537 | {
538 | open(PREFIX, "<:utf8", "$prefixfile");
539 | while (<PREFIX>)
540 | {
541 | my $item = $_;
542 | chomp($item);
543 | if (($item) && (substr($item,0,1) ne "#"))
544 | {
545 | if ($item =~ /(.*)[\s]+(\#NUMERIC_ONLY\#)/)
546 | {
547 | $PREFIX_REF->{$1} = 2;
548 | }
549 | else
550 | {
551 | $PREFIX_REF->{$item} = 1;
552 | }
553 | }
554 | }
555 | close(PREFIX);
556 | }
557 | }
558 |
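From Python, the tokenizer can be driven through a pipe, mirroring the usage string printed by `-h` (`./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile`). A rough sketch, assuming it is run from the repository root:

```python
import subprocess

proc = subprocess.Popen(['perl', 'data/tokenizer.perl', '-l', 'en'],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
tokenized, _ = proc.communicate(u"Isn't this a test sentence, No. 5?\n".encode('utf-8'))
print(tokenized.decode('utf-8'))
```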
--------------------------------------------------------------------------------
/data/util.py:
--------------------------------------------------------------------------------
1 | '''
2 | Utility functions
3 | '''
4 |
5 | '''
6 | This code is based on the util.py of
7 | nematus project (https://github.com/rsennrich/nematus)
8 | '''
9 |
10 | import sys
11 | import json
12 | import cPickle as pkl
13 |
14 | #json loads strings as unicode; we currently still work with Python 2 strings, and need conversion
15 | def unicode_to_utf8(d):
16 | return dict((key.encode("UTF-8"), value) for (key,value) in d.items())
17 |
18 | def load_dict(filename):
19 | try:
20 | with open(filename, 'rb') as f:
21 | return unicode_to_utf8(json.load(f))
22 | except:
23 | with open(filename, 'rb') as f:
24 | return pkl.load(f)
25 |
26 |
27 | def load_config(basename):
28 | try:
29 | with open('%s.json' % basename, 'rb') as f:
30 | return json.load(f)
31 | except:
32 | try:
33 | with open('%s.pkl' % basename, 'rb') as f:
34 | return pkl.load(f)
35 | except:
36 | sys.stderr.write('Error: config file {0}.json is missing\n'.format(basename))
37 | sys.exit(1)
38 |
39 |
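`load_dict` accepts either a JSON or a pickled token-to-id mapping, such as a word-to-index vocabulary file. A small sketch under the repository's Python 2 setup, using a throwaway vocabulary file:

```python
import json
import util   # assumes the data/ directory is on sys.path

# write a tiny hypothetical vocabulary, then read it back
with open('vocab.example.json', 'w') as f:
    json.dump({'hello': 0, 'world': 1}, f)

word2id = util.load_dict('vocab.example.json')
print(word2id['hello'])   # 0
```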
--------------------------------------------------------------------------------
/decode.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "\n",
12 | "#!/usr/bin/env python\n",
13 | "# coding: utf-8\n",
14 | "\n",
15 | "import os\n",
16 | "import math\n",
17 | "import time\n",
18 | "import json\n",
19 | "import random\n",
20 | "\n",
21 | "from collections import OrderedDict\n",
22 | "\n",
23 | "import numpy as np\n",
24 | "import tensorflow as tf\n",
25 | "\n",
26 | "from data.data_iterator import TextIterator\n",
27 | "\n",
28 | "import data.util as util\n",
29 | "import data.data_utils as data_utils\n",
30 | "from data.data_utils import prepare_batch\n",
31 | "from data.data_utils import prepare_train_batch\n",
32 | "\n",
33 | "from seq2seq_model import Seq2SeqModel"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": null,
39 | "metadata": {
40 | "collapsed": true
41 | },
42 | "outputs": [],
43 | "source": [
44 | "# Decoding parameters\n",
45 | "tf.app.flags.DEFINE_integer('beam_width', 12, 'Beam width used in beamsearch')\n",
46 | "tf.app.flags.DEFINE_integer('decode_batch_size', 80, 'Batch size used for decoding')\n",
47 | "tf.app.flags.DEFINE_integer('max_decode_step', 500, 'Maximum time step limit to decode')\n",
48 | "tf.app.flags.DEFINE_boolean('write_n_best', False, 'Write n-best list (n=beam_width)')\n",
49 | "tf.app.flags.DEFINE_string('model_path', None, 'Path to a specific model checkpoint.')\n",
50 | "tf.app.flags.DEFINE_string('decode_input', 'data/newstest2012.bpe.de', 'Decoding input path')\n",
51 | "tf.app.flags.DEFINE_string('decode_output', 'data/newstest2012.bpe.de.trans', 'Decoding output path')\n",
52 | "\n",
53 | "# Runtime parameters\n",
54 | "tf.app.flags.DEFINE_boolean('allow_soft_placement', True, 'Allow device soft placement')\n",
55 | "tf.app.flags.DEFINE_boolean('log_device_placement', False, 'Log placement of ops on devices')\n",
56 | "\n",
57 | "FLAGS = tf.app.flags.FLAGS"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": null,
63 | "metadata": {
64 | "collapsed": true
65 | },
66 | "outputs": [],
67 | "source": [
68 | "def load_config(FLAGS):\n",
69 | " \n",
70 | " config = util.unicode_to_utf8(\n",
71 | " json.load(open('%s.json' % FLAGS.model_path, 'rb')))\n",
72 | " for key, value in FLAGS.__flags.items():\n",
73 | " config[key] = value\n",
74 | "\n",
75 | " return config\n",
76 | "\n",
77 | "\n",
78 | "def load_model(session, config):\n",
79 | " \n",
80 | " model = Seq2SeqModel(config, 'decode')\n",
81 | " if tf.train.checkpoint_exists(FLAGS.model_path):\n",
82 | " print 'Reloading model parameters..'\n",
83 | " model.restore(session, FLAGS.model_path)\n",
84 | " else:\n",
85 | " raise ValueError(\n",
86 | " 'No such file:[{}]'.format(FLAGS.model_path))\n",
87 | " return model"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {
94 | "collapsed": true
95 | },
96 | "outputs": [],
97 | "source": [
98 | "def decode():\n",
99 | " # Load model config\n",
100 | " config = load_config(FLAGS)\n",
101 | "\n",
102 | " # Load source data to decode\n",
103 | " test_set = TextIterator(source=config['decode_input'],\n",
104 | " batch_size=config['decode_batch_size'],\n",
105 | " source_dict=config['source_vocabulary'],\n",
106 | " maxlen=None,\n",
107 | " n_words_source=config['num_encoder_symbols'])\n",
108 | "\n",
109 | " # Load inverse dictionary used in decoding\n",
110 | " target_inverse_dict = data_utils.load_inverse_dict(config['target_vocabulary'])\n",
111 | " \n",
112 | " # Initiate TF session\n",
113 | " with tf.Session(config=tf.ConfigProto(allow_soft_placement=FLAGS.allow_soft_placement, \n",
114 | " log_device_placement=FLAGS.log_device_placement, gpu_options=tf.GPUOptions(allow_growth=True))) as sess:\n",
115 | "\n",
116 | " # Reload existing checkpoint\n",
117 | " model = load_model(sess, config)\n",
118 | " try:\n",
119 | " print 'Decoding {}..'.format(FLAGS.decode_input)\n",
120 | " if FLAGS.write_n_best:\n",
121 | " fout = [data_utils.fopen((\"%s_%d\" % (FLAGS.decode_output, k)), 'w') \\\n",
122 | " for k in range(FLAGS.beam_width)]\n",
123 | " else:\n",
124 | " fout = [data_utils.fopen(FLAGS.decode_output, 'w')]\n",
125 | " \n",
126 | " for idx, source_seq in enumerate(test_set):\n",
127 | " source, source_len = prepare_batch(source_seq)\n",
128 | " # predicted_ids: GreedyDecoder; [batch_size, max_time_step, 1]\n",
129 | " # BeamSearchDecoder; [batch_size, max_time_step, beam_width]\n",
130 | " predicted_ids = model.predict(sess, encoder_inputs=source, \n",
131 | " encoder_inputs_length=source_len)\n",
132 | " \n",
133 | " # Write decoding results\n",
134 | " for k, f in reversed(list(enumerate(fout))):\n",
135 | " for seq in predicted_ids:\n",
136 | " f.write(str(data_utils.seq2words(seq[:,k], target_inverse_dict)) + '\\n')\n",
137 | " if not FLAGS.write_n_best:\n",
138 | " break\n",
139 | " print ' {}th line decoded'.format(idx * FLAGS.decode_batch_size)\n",
140 | " \n",
141 | " print 'Decoding terminated'\n",
142 | " except IOError:\n",
143 | " pass\n",
144 | " finally:\n",
145 | " [f.close() for f in fout]"
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": null,
151 | "metadata": {
152 | "collapsed": true
153 | },
154 | "outputs": [],
155 | "source": [
156 | "def main(_):\n",
157 | " decode()\n",
158 | "\n",
159 | "\n",
160 | "if __name__ == '__main__':\n",
161 | " tf.app.run()"
162 | ]
163 | }
164 | ],
165 | "metadata": {
166 | "kernelspec": {
167 | "display_name": "Python 2",
168 | "language": "python",
169 | "name": "python2"
170 | },
171 | "language_info": {
172 | "codemirror_mode": {
173 | "name": "ipython",
174 | "version": 2
175 | },
176 | "file_extension": ".py",
177 | "mimetype": "text/x-python",
178 | "name": "python",
179 | "nbconvert_exporter": "python",
180 | "pygments_lexer": "ipython2",
181 | "version": "2.7.10"
182 | }
183 | },
184 | "nbformat": 4,
185 | "nbformat_minor": 0
186 | }
187 |
--------------------------------------------------------------------------------
/decode.py:
--------------------------------------------------------------------------------
1 |
2 | #!/usr/bin/env python
3 | # coding: utf-8
4 |
5 | import os
6 | import math
7 | import time
8 | import json
9 | import random
10 |
11 | from collections import OrderedDict
12 |
13 | import numpy as np
14 | import tensorflow as tf
15 |
16 | from data.data_iterator import TextIterator
17 |
18 | import data.util as util
19 | import data.data_utils as data_utils
20 | from data.data_utils import prepare_batch
21 | from data.data_utils import prepare_train_batch
22 |
23 | from seq2seq_model import Seq2SeqModel
24 |
25 | # Decoding parameters
26 | tf.app.flags.DEFINE_integer('beam_width', 12, 'Beam width used in beamsearch')
27 | tf.app.flags.DEFINE_integer('decode_batch_size', 80, 'Batch size used for decoding')
28 | tf.app.flags.DEFINE_integer('max_decode_step', 500, 'Maximum time step limit to decode')
29 | tf.app.flags.DEFINE_boolean('write_n_best', False, 'Write n-best list (n=beam_width)')
30 | tf.app.flags.DEFINE_string('model_path', None, 'Path to a specific model checkpoint.')
31 | tf.app.flags.DEFINE_string('decode_input', 'data/newstest2012.bpe.de', 'Decoding input path')
32 | tf.app.flags.DEFINE_string('decode_output', 'data/newstest2012.bpe.de.trans', 'Decoding output path')
33 |
34 | # Runtime parameters
35 | tf.app.flags.DEFINE_boolean('allow_soft_placement', True, 'Allow device soft placement')
36 | tf.app.flags.DEFINE_boolean('log_device_placement', False, 'Log placement of ops on devices')
37 |
38 | FLAGS = tf.app.flags.FLAGS
39 |
40 | def load_config(FLAGS):
41 |
42 | config = util.unicode_to_utf8(
43 | json.load(open('%s.json' % FLAGS.model_path, 'rb')))
44 | for key, value in FLAGS.__flags.items():
45 | config[key] = value
46 |
47 | return config
48 |
49 |
50 | def load_model(session, config):
51 |
52 | model = Seq2SeqModel(config, 'decode')
53 | if tf.train.checkpoint_exists(FLAGS.model_path):
54 | print 'Reloading model parameters..'
55 | model.restore(session, FLAGS.model_path)
56 | else:
57 | raise ValueError(
58 | 'No such file:[{}]'.format(FLAGS.model_path))
59 | return model
60 |
61 |
62 | def decode():
63 | # Load model config
64 | config = load_config(FLAGS)
65 |
66 | # Load source data to decode
67 | test_set = TextIterator(source=config['decode_input'],
68 | batch_size=config['decode_batch_size'],
69 | source_dict=config['source_vocabulary'],
70 | maxlen=None,
71 | n_words_source=config['num_encoder_symbols'])
72 |
73 | # Load inverse dictionary used in decoding
74 | target_inverse_dict = data_utils.load_inverse_dict(config['target_vocabulary'])
75 |
76 | # Initiate TF session
77 | with tf.Session(config=tf.ConfigProto(allow_soft_placement=FLAGS.allow_soft_placement,
78 | log_device_placement=FLAGS.log_device_placement, gpu_options=tf.GPUOptions(allow_growth=True))) as sess:
79 |
80 | # Reload existing checkpoint
81 | model = load_model(sess, config)
82 | try:
83 | print 'Decoding {}..'.format(FLAGS.decode_input)
84 | if FLAGS.write_n_best:
85 | fout = [data_utils.fopen(("%s_%d" % (FLAGS.decode_output, k)), 'w') \
86 | for k in range(FLAGS.beam_width)]
87 | else:
88 | fout = [data_utils.fopen(FLAGS.decode_output, 'w')]
89 |
90 | for idx, source_seq in enumerate(test_set):
91 | source, source_len = prepare_batch(source_seq)
92 | # predicted_ids: GreedyDecoder; [batch_size, max_time_step, 1]
93 | # BeamSearchDecoder; [batch_size, max_time_step, beam_width]
94 | predicted_ids = model.predict(sess, encoder_inputs=source,
95 | encoder_inputs_length=source_len)
96 |
97 | # Write decoding results
98 | for k, f in reversed(list(enumerate(fout))):
99 | for seq in predicted_ids:
100 | f.write(str(data_utils.seq2words(seq[:,k], target_inverse_dict)) + '\n')
101 | if not FLAGS.write_n_best:
102 | break
103 | print ' {}th line decoded'.format(idx * FLAGS.decode_batch_size)
104 |
105 | print 'Decoding terminated'
106 | except IOError:
107 | pass
108 | finally:
109 | [f.close() for f in fout]
110 |
111 |
112 | def main(_):
113 | decode()
114 |
115 |
116 | if __name__ == '__main__':
117 | tf.app.run()
118 |
119 |
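As the comments in decode() note, `predicted_ids` has shape `[batch_size, max_time_step, 1]` for greedy decoding and `[batch_size, max_time_step, beam_width]` for beam search, so beam `k` of a sentence `seq` is the column `seq[:, k]`. A toy illustration with made-up ids and a hypothetical inverse dictionary:

```python
import numpy as np

predicted_ids = np.array([[[4, 4], [5, 6], [2, 2]]])      # batch=1, time=3, beam=2
id2word = {2: '_EOS', 4: 'hello', 5: 'world', 6: 'there'}  # hypothetical inverse dict

for seq in predicted_ids:
    for k in range(seq.shape[1]):
        print(' '.join(id2word[i] for i in seq[:, k]))
# beam 0: hello world _EOS
# beam 1: hello there _EOS
```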
--------------------------------------------------------------------------------
/seq2seq_model.py:
--------------------------------------------------------------------------------
1 |
2 | #!/usr/bin/env python
3 | # -*- coding: utf-8 -*-
4 | import math
5 |
6 | import numpy as np
7 | import tensorflow as tf
8 | import tensorflow.contrib.seq2seq as seq2seq
9 |
10 | from tensorflow.python.ops.rnn_cell import GRUCell
11 | from tensorflow.python.ops.rnn_cell import LSTMCell
12 | from tensorflow.python.ops.rnn_cell import MultiRNNCell
13 | from tensorflow.python.ops.rnn_cell import DropoutWrapper, ResidualWrapper
14 |
15 | from tensorflow.python.ops import array_ops
16 | from tensorflow.python.ops import control_flow_ops
17 | from tensorflow.python.framework import constant_op
18 | from tensorflow.python.framework import dtypes
19 | from tensorflow.python.layers.core import Dense
20 | from tensorflow.python.util import nest
21 |
22 | from tensorflow.contrib.seq2seq.python.ops import attention_wrapper
23 | from tensorflow.contrib.seq2seq.python.ops import beam_search_decoder
24 |
25 | import data.data_utils as data_utils
26 |
27 | class Seq2SeqModel(object):
28 |
29 | def __init__(self, config, mode):
30 |
31 | assert mode.lower() in ['train', 'decode']
32 |
33 | self.config = config
34 | self.mode = mode.lower()
35 |
36 | self.cell_type = config['cell_type']
37 | self.hidden_units = config['hidden_units']
38 | self.depth = config['depth']
39 | self.attention_type = config['attention_type']
40 | self.embedding_size = config['embedding_size']
41 | #self.bidirectional = config.bidirectional
42 |
43 | self.num_encoder_symbols = config['num_encoder_symbols']
44 | self.num_decoder_symbols = config['num_decoder_symbols']
45 |
46 | self.use_residual = config['use_residual']
47 | self.attn_input_feeding = config['attn_input_feeding']
48 | self.use_dropout = config['use_dropout']
49 | self.keep_prob = 1.0 - config['dropout_rate']
50 |
51 | self.optimizer = config['optimizer']
52 | self.learning_rate = config['learning_rate']
53 | self.max_gradient_norm = config['max_gradient_norm']
54 | self.global_step = tf.Variable(0, trainable=False, name='global_step')
55 | self.global_epoch_step = tf.Variable(0, trainable=False, name='global_epoch_step')
56 | self.global_epoch_step_op = \
57 | tf.assign(self.global_epoch_step, self.global_epoch_step+1)
58 |
59 | self.dtype = tf.float16 if config['use_fp16'] else tf.float32
60 | self.keep_prob_placeholder = tf.placeholder(self.dtype, shape=[], name='keep_prob')
61 |
62 | self.use_beamsearch_decode=False
63 | if self.mode == 'decode':
64 | self.beam_width = config['beam_width']
65 | self.use_beamsearch_decode = True if self.beam_width > 1 else False
66 | self.max_decode_step = config['max_decode_step']
67 |
68 | self.build_model()
69 |
70 |
71 | def build_model(self):
72 | print("building model..")
73 |
74 | # Building encoder and decoder networks
75 | self.init_placeholders()
76 | self.build_encoder()
77 | self.build_decoder()
78 |
79 | # Merge all the training summaries
80 | self.summary_op = tf.summary.merge_all()
81 |
82 |
83 | def init_placeholders(self):
84 | # encoder_inputs: [batch_size, max_time_steps]
85 | self.encoder_inputs = tf.placeholder(dtype=tf.int32,
86 | shape=(None, None), name='encoder_inputs')
87 |
88 | # encoder_inputs_length: [batch_size]
89 | self.encoder_inputs_length = tf.placeholder(
90 | dtype=tf.int32, shape=(None,), name='encoder_inputs_length')
91 |
92 | # get dynamic batch_size
93 | self.batch_size = tf.shape(self.encoder_inputs)[0]
94 | if self.mode == 'train':
95 |
96 | # decoder_inputs: [batch_size, max_time_steps]
97 | self.decoder_inputs = tf.placeholder(
98 | dtype=tf.int32, shape=(None, None), name='decoder_inputs')
99 | # decoder_inputs_length: [batch_size]
100 | self.decoder_inputs_length = tf.placeholder(
101 | dtype=tf.int32, shape=(None,), name='decoder_inputs_length')
102 |
103 | decoder_start_token = tf.ones(
104 | shape=[self.batch_size, 1], dtype=tf.int32) * data_utils.start_token
105 | decoder_end_token = tf.ones(
106 | shape=[self.batch_size, 1], dtype=tf.int32) * data_utils.end_token
107 |
108 | # decoder_inputs_train: [batch_size , max_time_steps + 1]
109 | # insert _GO symbol in front of each decoder input
110 | self.decoder_inputs_train = tf.concat([decoder_start_token,
111 | self.decoder_inputs], axis=1)
112 |
113 | # decoder_inputs_length_train: [batch_size]
114 | self.decoder_inputs_length_train = self.decoder_inputs_length + 1
115 |
116 | # decoder_targets_train: [batch_size, max_time_steps + 1]
117 | # insert EOS symbol at the end of each decoder input
118 | self.decoder_targets_train = tf.concat([self.decoder_inputs,
119 | decoder_end_token], axis=1)
120 |
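# (Illustration) For a target sequence [w1, w2, w3] the two tensors above are
#   decoder_inputs_train  = [_GO, w1, w2, w3]
#   decoder_targets_train = [w1, w2, w3, EOS]
# i.e. the training inputs are the targets shifted right by one step.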
121 |
122 | def build_encoder(self):
123 | print("building encoder..")
124 | with tf.variable_scope('encoder'):
125 | # Building encoder_cell
126 | self.encoder_cell = self.build_encoder_cell()
127 |
128 | # Initialize encoder_embeddings to have variance=1.
129 | sqrt3 = math.sqrt(3) # Uniform(-sqrt(3), sqrt(3)) has variance=1.
130 | initializer = tf.random_uniform_initializer(-sqrt3, sqrt3, dtype=self.dtype)
131 |
132 | self.encoder_embeddings = tf.get_variable(name='embedding',
133 | shape=[self.num_encoder_symbols, self.embedding_size],
134 | initializer=initializer, dtype=self.dtype)
135 |
136 | # Embedded_inputs: [batch_size, time_step, embedding_size]
137 | self.encoder_inputs_embedded = tf.nn.embedding_lookup(
138 | params=self.encoder_embeddings, ids=self.encoder_inputs)
139 |
140 | # Input projection layer to feed embedded inputs to the cell
141 | # ** Essential when use_residual=True to match input/output dims
142 | input_layer = Dense(self.hidden_units, dtype=self.dtype, name='input_projection')
143 |
144 | # Embedded inputs having gone through input projection layer
145 | self.encoder_inputs_embedded = input_layer(self.encoder_inputs_embedded)
146 |
147 | # Encode input sequences into context vectors:
148 | # encoder_outputs: [batch_size, max_time_step, cell_output_size]
149 | # encoder_state: [batch_size, cell_output_size]
150 | self.encoder_outputs, self.encoder_last_state = tf.nn.dynamic_rnn(
151 | cell=self.encoder_cell, inputs=self.encoder_inputs_embedded,
152 | sequence_length=self.encoder_inputs_length, dtype=self.dtype,
153 | time_major=False)
154 |
155 |
156 | def build_decoder(self):
157 | print("building decoder and attention..")
158 | with tf.variable_scope('decoder'):
159 | # Building decoder_cell and decoder_initial_state
160 | self.decoder_cell, self.decoder_initial_state = self.build_decoder_cell()
161 |
162 | # Initialize decoder embeddings to have variance=1.
163 | sqrt3 = math.sqrt(3) # Uniform(-sqrt(3), sqrt(3)) has variance=1.
164 | initializer = tf.random_uniform_initializer(-sqrt3, sqrt3, dtype=self.dtype)
165 |
166 | self.decoder_embeddings = tf.get_variable(name='embedding',
167 | shape=[self.num_decoder_symbols, self.embedding_size],
168 | initializer=initializer, dtype=self.dtype)
169 |
170 | # Input projection layer to feed embedded inputs to the cell
171 | # ** Essential when use_residual=True to match input/output dims
172 | input_layer = Dense(self.hidden_units, dtype=self.dtype, name='input_projection')
173 |
174 | # Output projection layer to convert cell_outputs to logits
175 | output_layer = Dense(self.num_decoder_symbols, name='output_projection')
176 |
177 | if self.mode == 'train':
178 | # decoder_inputs_embedded: [batch_size, max_time_step + 1, embedding_size]
179 | self.decoder_inputs_embedded = tf.nn.embedding_lookup(
180 | params=self.decoder_embeddings, ids=self.decoder_inputs_train)
181 |
182 | # Embedded inputs having gone through input projection layer
183 | self.decoder_inputs_embedded = input_layer(self.decoder_inputs_embedded)
184 |
185 | # Helper to feed inputs for training: read inputs from dense ground truth vectors
186 | training_helper = seq2seq.TrainingHelper(inputs=self.decoder_inputs_embedded,
187 | sequence_length=self.decoder_inputs_length_train,
188 | time_major=False,
189 | name='training_helper')
190 |
191 | training_decoder = seq2seq.BasicDecoder(cell=self.decoder_cell,
192 | helper=training_helper,
193 | initial_state=self.decoder_initial_state,
194 | output_layer=output_layer)
195 | #output_layer=None)
196 |
197 | # Maximum decoder time_steps in current batch
198 | max_decoder_length = tf.reduce_max(self.decoder_inputs_length_train)
199 |
200 | # decoder_outputs_train: BasicDecoderOutput
201 | # namedtuple(rnn_outputs, sample_id)
202 | # decoder_outputs_train.rnn_output: [batch_size, max_time_step + 1, num_decoder_symbols] if output_time_major=False
203 | # [max_time_step + 1, batch_size, num_decoder_symbols] if output_time_major=True
204 | # decoder_outputs_train.sample_id: [batch_size], tf.int32
205 | (self.decoder_outputs_train, self.decoder_last_state_train,
206 | self.decoder_outputs_length_train) = (seq2seq.dynamic_decode(
207 | decoder=training_decoder,
208 | output_time_major=False,
209 | impute_finished=True,
210 | maximum_iterations=max_decoder_length))
211 |
212 | # More efficient to do the projection on the batch-time-concatenated tensor
213 | # logits_train: [batch_size, max_time_step + 1, num_decoder_symbols]
214 | # self.decoder_logits_train = output_layer(self.decoder_outputs_train.rnn_output)
215 | self.decoder_logits_train = tf.identity(self.decoder_outputs_train.rnn_output)
216 | # Use argmax to extract decoder symbols to emit
217 | self.decoder_pred_train = tf.argmax(self.decoder_logits_train, axis=-1,
218 | name='decoder_pred_train')
219 |
220 | # masks: masking for valid and padded time steps, [batch_size, max_time_step + 1]
221 | masks = tf.sequence_mask(lengths=self.decoder_inputs_length_train,
222 | maxlen=max_decoder_length, dtype=self.dtype, name='masks')
223 |
224 | # Computes per word average cross-entropy over a batch
225 | # Internally calls 'nn_ops.sparse_softmax_cross_entropy_with_logits' by default
226 | self.loss = seq2seq.sequence_loss(logits=self.decoder_logits_train,
227 | targets=self.decoder_targets_train,
228 | weights=masks,
229 | average_across_timesteps=True,
230 | average_across_batch=True,)
231 | # Training summary for the current batch_loss
232 | tf.summary.scalar('loss', self.loss)
233 |
234 | # Construct graphs for minimizing loss
235 | self.init_optimizer()
236 |
237 | elif self.mode == 'decode':
238 |
239 | # Start_tokens: [batch_size,] `int32` vector
240 | start_tokens = tf.ones([self.batch_size,], tf.int32) * data_utils.start_token
241 | end_token = data_utils.end_token
242 |
243 | def embed_and_input_proj(inputs):
244 | return input_layer(tf.nn.embedding_lookup(self.decoder_embeddings, inputs))
245 |
246 | if not self.use_beamsearch_decode:
247 | # Helper to feed inputs for greedy decoding: uses the argmax of the output
248 | decoding_helper = seq2seq.GreedyEmbeddingHelper(start_tokens=start_tokens,
249 | end_token=end_token,
250 | embedding=embed_and_input_proj)
251 | # Basic decoder performs greedy decoding at each time step
252 | print("building greedy decoder..")
253 | inference_decoder = seq2seq.BasicDecoder(cell=self.decoder_cell,
254 | helper=decoding_helper,
255 | initial_state=self.decoder_initial_state,
256 | output_layer=output_layer)
257 | else:
258 | # Beamsearch is used to approximately find the most likely translation
259 | print("building beamsearch decoder..")
260 | inference_decoder = beam_search_decoder.BeamSearchDecoder(cell=self.decoder_cell,
261 | embedding=embed_and_input_proj,
262 | start_tokens=start_tokens,
263 | end_token=end_token,
264 | initial_state=self.decoder_initial_state,
265 | beam_width=self.beam_width,
266 | output_layer=output_layer,)
267 | # For GreedyDecoder, return
268 | # decoder_outputs_decode: BasicDecoderOutput instance
269 | # namedtuple(rnn_outputs, sample_id)
270 | # decoder_outputs_decode.rnn_output: [batch_size, max_time_step, num_decoder_symbols] if output_time_major=False
271 | # [max_time_step, batch_size, num_decoder_symbols] if output_time_major=True
272 | # decoder_outputs_decode.sample_id: [batch_size, max_time_step], tf.int32 if output_time_major=False
273 | # [max_time_step, batch_size], tf.int32 if output_time_major=True
274 |
275 | # For BeamSearchDecoder, return
276 | # decoder_outputs_decode: FinalBeamSearchDecoderOutput instance
277 | # namedtuple(predicted_ids, beam_search_decoder_output)
278 | # decoder_outputs_decode.predicted_ids: [batch_size, max_time_step, beam_width] if output_time_major=False
279 | # [max_time_step, batch_size, beam_width] if output_time_major=True
280 | # decoder_outputs_decode.beam_search_decoder_output: BeamSearchDecoderOutput instance
281 | # namedtuple(scores, predicted_ids, parent_ids)
282 |
283 | (self.decoder_outputs_decode, self.decoder_last_state_decode,
284 | self.decoder_outputs_length_decode) = (seq2seq.dynamic_decode(
285 | decoder=inference_decoder,
286 | output_time_major=False,
287 | #impute_finished=True, # error occurs
288 | maximum_iterations=self.max_decode_step))
289 |
290 | if not self.use_beamsearch_decode:
291 | # decoder_outputs_decode.sample_id: [batch_size, max_time_step]
292 | # Or use argmax to find decoder symbols to emit:
293 | # self.decoder_pred_decode = tf.argmax(self.decoder_outputs_decode.rnn_output,
294 | # axis=-1, name='decoder_pred_decode')
295 |
296 | # Here, we use expand_dims to be compatible with the result of the beamsearch decoder
297 | # decoder_pred_decode: [batch_size, max_time_step, 1] (output_major=False)
298 | self.decoder_pred_decode = tf.expand_dims(self.decoder_outputs_decode.sample_id, -1)
299 |
300 | else:
301 | # Use beam search to approximately find the most likely translation
302 | # decoder_pred_decode: [batch_size, max_time_step, beam_width] (output_major=False)
303 | self.decoder_pred_decode = self.decoder_outputs_decode.predicted_ids
304 |
305 |
306 | def build_single_cell(self):
307 | cell_type = LSTMCell
308 | if (self.cell_type.lower() == 'gru'):
309 | cell_type = GRUCell
310 | cell = cell_type(self.hidden_units)
311 |
312 | if self.use_dropout:
313 | cell = DropoutWrapper(cell, dtype=self.dtype,
314 | output_keep_prob=self.keep_prob_placeholder,)
315 | if self.use_residual:
316 | cell = ResidualWrapper(cell)
317 |
318 | return cell
319 |
320 |
321 | # Building encoder cell
322 |     def build_encoder_cell(self):
323 |
324 | return MultiRNNCell([self.build_single_cell() for i in range(self.depth)])
325 |
326 |
327 | # Building decoder cell and attention. Also returns decoder_initial_state
328 | def build_decoder_cell(self):
329 |
330 | encoder_outputs = self.encoder_outputs
331 | encoder_last_state = self.encoder_last_state
332 | encoder_inputs_length = self.encoder_inputs_length
333 |
334 |         # To use BeamSearchDecoder, encoder_outputs, encoder_last_state and encoder_inputs_length
335 |         # need to be tiled so that: [batch_size, .., ..] -> [batch_size x beam_width, .., ..]
336 | if self.use_beamsearch_decode:
337 | print ("use beamsearch decoding..")
338 | encoder_outputs = seq2seq.tile_batch(
339 | self.encoder_outputs, multiplier=self.beam_width)
340 | encoder_last_state = nest.map_structure(
341 | lambda s: seq2seq.tile_batch(s, self.beam_width), self.encoder_last_state)
342 | encoder_inputs_length = seq2seq.tile_batch(
343 | self.encoder_inputs_length, multiplier=self.beam_width)
344 |
345 | # Building attention mechanism: Default Bahdanau
346 | # 'Bahdanau' style attention: https://arxiv.org/abs/1409.0473
347 | self.attention_mechanism = attention_wrapper.BahdanauAttention(
348 | num_units=self.hidden_units, memory=encoder_outputs,
349 | memory_sequence_length=encoder_inputs_length,)
350 | # 'Luong' style attention: https://arxiv.org/abs/1508.04025
351 | if self.attention_type.lower() == 'luong':
352 | self.attention_mechanism = attention_wrapper.LuongAttention(
353 | num_units=self.hidden_units, memory=encoder_outputs,
354 | memory_sequence_length=encoder_inputs_length,)
355 |
356 | # Building decoder_cell
357 | self.decoder_cell_list = [
358 | self.build_single_cell() for i in range(self.depth)]
359 | decoder_initial_state = encoder_last_state
360 |
361 | def attn_decoder_input_fn(inputs, attention):
362 | if not self.attn_input_feeding:
363 | return inputs
364 |
365 | # Essential when use_residual=True
366 | _input_layer = Dense(self.hidden_units, dtype=self.dtype,
367 | name='attn_input_feeding')
368 | return _input_layer(array_ops.concat([inputs, attention], -1))
369 |
370 | # AttentionWrapper wraps RNNCell with the attention_mechanism
371 | # Note: We implement Attention mechanism only on the top decoder layer
372 | self.decoder_cell_list[-1] = attention_wrapper.AttentionWrapper(
373 | cell=self.decoder_cell_list[-1],
374 | attention_mechanism=self.attention_mechanism,
375 | attention_layer_size=self.hidden_units,
376 | cell_input_fn=attn_decoder_input_fn,
377 | initial_cell_state=encoder_last_state[-1],
378 | alignment_history=False,
379 | name='Attention_Wrapper')
380 |
381 | # To be compatible with AttentionWrapper, the encoder last state
382 | # of the top layer should be converted into the AttentionWrapperState form
383 | # We can easily do this by calling AttentionWrapper.zero_state
384 |
385 |         # Also, if beamsearch decoding is used, the batch_size argument of .zero_state
386 |         # should be beam_width times the original batch_size
387 | batch_size = self.batch_size if not self.use_beamsearch_decode \
388 | else self.batch_size * self.beam_width
389 |         initial_state = list(encoder_last_state)
390 |
391 | initial_state[-1] = self.decoder_cell_list[-1].zero_state(
392 | batch_size=batch_size, dtype=self.dtype)
393 | decoder_initial_state = tuple(initial_state)
394 |
395 | return MultiRNNCell(self.decoder_cell_list), decoder_initial_state
396 |
397 |
398 | def init_optimizer(self):
399 | print("setting optimizer..")
400 |         # Gradient computation and parameter-update operation for training the model
401 | trainable_params = tf.trainable_variables()
402 | if self.optimizer.lower() == 'adadelta':
403 | self.opt = tf.train.AdadeltaOptimizer(learning_rate=self.learning_rate)
404 | elif self.optimizer.lower() == 'adam':
405 | self.opt = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
406 | elif self.optimizer.lower() == 'rmsprop':
407 | self.opt = tf.train.RMSPropOptimizer(learning_rate=self.learning_rate)
408 | else:
409 | self.opt = tf.train.GradientDescentOptimizer(learning_rate=self.learning_rate)
410 |
411 | # Compute gradients of loss w.r.t. all trainable variables
412 | gradients = tf.gradients(self.loss, trainable_params)
413 |
414 | # Clip gradients by a given maximum_gradient_norm
415 | clip_gradients, _ = tf.clip_by_global_norm(gradients, self.max_gradient_norm)
416 |
417 | # Update the model
418 | self.updates = self.opt.apply_gradients(
419 | zip(clip_gradients, trainable_params), global_step=self.global_step)
420 |
421 | def save(self, sess, path, var_list=None, global_step=None):
422 |         # Passing var_list=None makes the Saver handle all saveable variables
423 | saver = tf.train.Saver(var_list)
424 |
425 | # temporary code
426 | #del tf.get_collection_ref('LAYER_NAME_UIDS')[0]
427 | save_path = saver.save(sess, save_path=path, global_step=global_step)
428 | print('model saved at %s' % save_path)
429 |
430 |
431 | def restore(self, sess, path, var_list=None):
432 |         # Passing var_list=None makes the Saver handle all saveable variables
433 | saver = tf.train.Saver(var_list)
434 | saver.restore(sess, save_path=path)
435 | print('model restored from %s' % path)
436 |
437 |
438 | def train(self, sess, encoder_inputs, encoder_inputs_length,
439 | decoder_inputs, decoder_inputs_length):
440 | """Run a train step of the model feeding the given inputs.
441 |
442 | Args:
443 | session: tensorflow session to use.
444 | encoder_inputs: a numpy int matrix of [batch_size, max_source_time_steps]
445 | to feed as encoder inputs
446 | encoder_inputs_length: a numpy int vector of [batch_size]
447 | to feed as sequence lengths for each element in the given batch
448 | decoder_inputs: a numpy int matrix of [batch_size, max_target_time_steps]
449 | to feed as decoder inputs
450 | decoder_inputs_length: a numpy int vector of [batch_size]
451 | to feed as sequence lengths for each element in the given batch
452 |
453 | Returns:
454 |             A pair consisting of the training loss for the batch
455 |             and the merged training summary.
456 | """
457 | # Check if the model is 'training' mode
458 | if self.mode.lower() != 'train':
459 | raise ValueError("train step can only be operated in train mode")
460 |
461 | input_feed = self.check_feeds(encoder_inputs, encoder_inputs_length,
462 | decoder_inputs, decoder_inputs_length, False)
463 | # Input feeds for dropout
464 | input_feed[self.keep_prob_placeholder.name] = self.keep_prob
465 |
466 | output_feed = [self.updates, # Update Op that does optimization
467 | self.loss, # Loss for current batch
468 | self.summary_op] # Training summary
469 |
470 | outputs = sess.run(output_feed, input_feed)
471 | return outputs[1], outputs[2] # loss, summary
472 |
473 |
474 | def eval(self, sess, encoder_inputs, encoder_inputs_length,
475 | decoder_inputs, decoder_inputs_length):
476 |         """Run an evaluation step of the model feeding the given inputs.
477 |
478 | Args:
479 | session: tensorflow session to use.
480 | encoder_inputs: a numpy int matrix of [batch_size, max_source_time_steps]
481 | to feed as encoder inputs
482 | encoder_inputs_length: a numpy int vector of [batch_size]
483 | to feed as sequence lengths for each element in the given batch
484 | decoder_inputs: a numpy int matrix of [batch_size, max_target_time_steps]
485 | to feed as decoder inputs
486 | decoder_inputs_length: a numpy int vector of [batch_size]
487 | to feed as sequence lengths for each element in the given batch
488 |
489 | Returns:
490 |             A pair consisting of the evaluation loss for the batch
491 |             and the merged summary.
492 | """
493 |
494 | input_feed = self.check_feeds(encoder_inputs, encoder_inputs_length,
495 | decoder_inputs, decoder_inputs_length, False)
496 | # Input feeds for dropout
497 | input_feed[self.keep_prob_placeholder.name] = 1.0
498 |
499 | output_feed = [self.loss, # Loss for current batch
500 | self.summary_op] # Training summary
501 | outputs = sess.run(output_feed, input_feed)
502 |         return outputs[0], outputs[1]   # loss, summary
503 |
504 |
505 | def predict(self, sess, encoder_inputs, encoder_inputs_length):
506 |
507 | input_feed = self.check_feeds(encoder_inputs, encoder_inputs_length,
508 | decoder_inputs=None, decoder_inputs_length=None,
509 | decode=True)
510 |
511 | # Input feeds for dropout
512 | input_feed[self.keep_prob_placeholder.name] = 1.0
513 |
514 | output_feed = [self.decoder_pred_decode]
515 | outputs = sess.run(output_feed, input_feed)
516 |
517 |         # GreedyDecoder: [batch_size, max_time_step, 1]
518 | return outputs[0] # BeamSearchDecoder: [batch_size, max_time_step, beam_width]
519 |
520 |
521 | def check_feeds(self, encoder_inputs, encoder_inputs_length,
522 | decoder_inputs, decoder_inputs_length, decode):
523 | """
524 | Args:
525 | encoder_inputs: a numpy int matrix of [batch_size, max_source_time_steps]
526 | to feed as encoder inputs
527 | encoder_inputs_length: a numpy int vector of [batch_size]
528 | to feed as sequence lengths for each element in the given batch
529 | decoder_inputs: a numpy int matrix of [batch_size, max_target_time_steps]
530 | to feed as decoder inputs
531 | decoder_inputs_length: a numpy int vector of [batch_size]
532 | to feed as sequence lengths for each element in the given batch
533 | decode: a scalar boolean that indicates decode mode
534 | Returns:
535 | A feed for the model that consists of encoder_inputs, encoder_inputs_length,
536 | decoder_inputs, decoder_inputs_length
537 | """
538 |
539 | input_batch_size = encoder_inputs.shape[0]
540 | if input_batch_size != encoder_inputs_length.shape[0]:
541 | raise ValueError("Encoder inputs and their lengths must be equal in their "
542 | "batch_size, %d != %d" % (input_batch_size, encoder_inputs_length.shape[0]))
543 |
544 | if not decode:
545 | target_batch_size = decoder_inputs.shape[0]
546 | if target_batch_size != input_batch_size:
547 | raise ValueError("Encoder inputs and Decoder inputs must be equal in their "
548 | "batch_size, %d != %d" % (input_batch_size, target_batch_size))
549 | if target_batch_size != decoder_inputs_length.shape[0]:
550 | raise ValueError("Decoder targets and their lengths must be equal in their "
551 | "batch_size, %d != %d" % (target_batch_size, decoder_inputs_length.shape[0]))
552 |
553 | input_feed = {}
554 |
555 | input_feed[self.encoder_inputs.name] = encoder_inputs
556 | input_feed[self.encoder_inputs_length.name] = encoder_inputs_length
557 |
558 | if not decode:
559 | input_feed[self.decoder_inputs.name] = decoder_inputs
560 | input_feed[self.decoder_inputs_length.name] = decoder_inputs_length
561 |
562 | return input_feed
563 |
564 |
--------------------------------------------------------------------------------
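
Note: the `train()`, `eval()` and `predict()` methods above all consume padded NumPy batches shaped as documented in their docstrings, and `predict()` returns the id tensor produced by the decoding branch built earlier in the file. Below is a minimal sketch of what such a batch looks like, using a hypothetical `pad_batch` helper in place of the repository's `data.data_utils.prepare_train_batch`, and of how the prediction tensor can be sliced to keep only the best hypothesis.

```python
import numpy as np

# Hypothetical stand-in for data.data_utils.prepare_train_batch: pad a list of
# token-id sequences into the [batch_size, max_time_steps] int32 matrix plus the
# [batch_size] length vector expected by Seq2SeqModel.train()/eval()/predict().
def pad_batch(sequences, pad_id=0):
    lengths = np.array([len(seq) for seq in sequences], dtype=np.int32)
    batch = np.full((len(sequences), lengths.max()), pad_id, dtype=np.int32)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq
    return batch, lengths

source, source_len = pad_batch([[4, 8, 15], [16, 23]])
# source.shape == (2, 3), source_len == [3, 2]

# predict() returns [batch_size, max_time_step, 1] for the greedy decoder and
# [batch_size, max_time_step, beam_width] for the beam-search decoder, so the
# highest-scoring hypothesis is preds[:, :, 0] in either case.
preds = np.zeros((2, 7, 5), dtype=np.int32)   # stand-in for model.predict(...) output
best_hypothesis = preds[:, :, 0]              # shape (2, 7)
```
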
/train.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "#!/usr/bin/env python\n",
12 | "# coding: utf-8\n",
13 | "\n",
14 | "import os\n",
15 | "import math\n",
16 | "import time\n",
17 | "import json\n",
18 | "import random\n",
19 | "\n",
20 | "from collections import OrderedDict\n",
21 | "\n",
22 | "import numpy as np\n",
23 | "import tensorflow as tf\n",
24 | "\n",
25 | "from data.data_iterator import TextIterator\n",
26 | "from data.data_iterator import BiTextIterator\n",
27 | "\n",
28 | "import data.data_utils as data_utils\n",
29 | "from data.data_utils import prepare_batch\n",
30 | "from data.data_utils import prepare_train_batch\n",
31 | "\n",
32 | "from seq2seq_model import Seq2SeqModel"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "metadata": {
39 | "collapsed": true
40 | },
41 | "outputs": [],
42 | "source": [
43 | "# Data loading parameters\n",
44 | "tf.app.flags.DEFINE_string('source_vocabulary', 'data/europarl-v7.1.4M.de.json', 'Path to source vocabulary')\n",
45 | "tf.app.flags.DEFINE_string('target_vocabulary', 'data/europarl-v7.1.4M.fr.json', 'Path to target vocabulary')\n",
46 | "tf.app.flags.DEFINE_string('source_train_data', 'data/europarl-v7.1.4M.de', 'Path to source training data')\n",
47 | "tf.app.flags.DEFINE_string('target_train_data', 'data/europarl-v7.1.4M.fr', 'Path to target training data')\n",
48 | "tf.app.flags.DEFINE_string('source_valid_data', 'data/newstest2012.bpe.de', 'Path to source validation data')\n",
49 | "tf.app.flags.DEFINE_string('target_valid_data', 'data/newstest2012.bpe.fr', 'Path to target validation data')\n",
50 | "\n",
51 | "# Network parameters\n",
52 | "tf.app.flags.DEFINE_string('cell_type', 'lstm', 'RNN cell for encoder and decoder, default: lstm')\n",
53 | "tf.app.flags.DEFINE_string('attention_type', 'bahdanau', 'Attention mechanism: (bahdanau, luong), default: bahdanau')\n",
54 | "tf.app.flags.DEFINE_integer('hidden_units', 1024, 'Number of hidden units in each layer')\n",
55 | "tf.app.flags.DEFINE_integer('depth', 2, 'Number of layers in each encoder and decoder')\n",
56 | "tf.app.flags.DEFINE_integer('embedding_size', 500, 'Embedding dimensions of encoder and decoder inputs')\n",
57 | "tf.app.flags.DEFINE_integer('num_encoder_symbols', 30000, 'Source vocabulary size')\n",
58 | "tf.app.flags.DEFINE_integer('num_decoder_symbols', 30000, 'Target vocabulary size')\n",
59 | "\n",
60 | "tf.app.flags.DEFINE_boolean('use_residual', True, 'Use residual connection between layers')\n",
61 | "tf.app.flags.DEFINE_boolean('attn_input_feeding', False, 'Use input feeding method in attentional decoder')\n",
62 | "tf.app.flags.DEFINE_boolean('use_dropout', True, 'Use dropout in each rnn cell')\n",
63 | "tf.app.flags.DEFINE_float('dropout_rate', 0.3, 'Dropout probability for input/output/state units (0.0: no dropout)')\n",
64 | "\n",
65 | "# Training parameters\n",
66 | "tf.app.flags.DEFINE_float('learning_rate', 0.0002, 'Learning rate')\n",
67 | "tf.app.flags.DEFINE_float('max_gradient_norm', 1.0, 'Clip gradients to this norm')\n",
68 | "tf.app.flags.DEFINE_integer('batch_size', 128, 'Batch size')\n",
69 | "tf.app.flags.DEFINE_integer('max_epochs', 10, 'Maximum # of training epochs')\n",
70 | "tf.app.flags.DEFINE_integer('max_load_batches', 20, 'Maximum # of batches to load at one time')\n",
71 | "tf.app.flags.DEFINE_integer('max_seq_length', 50, 'Maximum sequence length')\n",
72 | "tf.app.flags.DEFINE_integer('display_freq', 100, 'Display training status every this iteration')\n",
73 | "tf.app.flags.DEFINE_integer('save_freq', 11500, 'Save model checkpoint every this iteration')\n",
74 | "tf.app.flags.DEFINE_integer('valid_freq', 1150000, 'Evaluate model every this iteration: valid_data needed')\n",
75 | "tf.app.flags.DEFINE_string('optimizer', 'adam', 'Optimizer for training: (adadelta, adam, rmsprop)')\n",
76 | "tf.app.flags.DEFINE_string('model_dir', 'model/', 'Path to save model checkpoints')\n",
77 | "tf.app.flags.DEFINE_string('summary_dir', 'model/summary', 'Path to save model summary')\n",
78 | "tf.app.flags.DEFINE_string('model_name', 'translate.ckpt', 'File name used for model checkpoints')\n",
79 | "tf.app.flags.DEFINE_boolean('shuffle_each_epoch', True, 'Shuffle training dataset for each epoch')\n",
80 | "tf.app.flags.DEFINE_boolean('sort_by_length', True, 'Sort pre-fetched minibatches by their target sequence lengths')\n",
81 | "tf.app.flags.DEFINE_boolean('use_fp16', False, 'Use half precision float16 instead of float32 as dtype')\n",
82 | "\n",
83 | "# Runtime parameters\n",
84 | "tf.app.flags.DEFINE_boolean('allow_soft_placement', True, 'Allow device soft placement')\n",
85 | "tf.app.flags.DEFINE_boolean('log_device_placement', False, 'Log placement of ops on devices')\n",
86 | "\n",
87 | "FLAGS = tf.app.flags.FLAGS"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {
94 | "collapsed": true
95 | },
96 | "outputs": [],
97 | "source": [
98 | "def create_model(session, FLAGS):\n",
99 | "\n",
100 | " config = OrderedDict(sorted(FLAGS.__flags.items()))\n",
101 | " model = Seq2SeqModel(config, 'train')\n",
102 | "\n",
103 | " ckpt = tf.train.get_checkpoint_state(FLAGS.model_dir)\n",
104 | " if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):\n",
105 | " print 'Reloading model parameters..'\n",
106 | " model.restore(session, ckpt.model_checkpoint_path)\n",
107 | " \n",
108 | " else:\n",
109 | " if not os.path.exists(FLAGS.model_dir):\n",
110 | " os.makedirs(FLAGS.model_dir)\n",
111 | " print 'Created new model parameters..'\n",
112 | " session.run(tf.global_variables_initializer())\n",
113 | " \n",
114 | " return model"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {
121 | "collapsed": true
122 | },
123 | "outputs": [],
124 | "source": [
125 | "def train():\n",
126 | " # Load parallel data to train\n",
127 | " print 'Loading training data..'\n",
128 | " train_set = BiTextIterator(source=FLAGS.source_train_data,\n",
129 | " target=FLAGS.target_train_data,\n",
130 | " source_dict=FLAGS.source_vocabulary,\n",
131 | " target_dict=FLAGS.target_vocabulary,\n",
132 | " batch_size=FLAGS.batch_size,\n",
133 | " maxlen=FLAGS.max_seq_length,\n",
134 | " n_words_source=FLAGS.num_encoder_symbols,\n",
135 | " n_words_target=FLAGS.num_decoder_symbols,\n",
136 | " shuffle_each_epoch=FLAGS.shuffle_each_epoch,\n",
137 | " sort_by_length=FLAGS.sort_by_length,\n",
138 | " maxibatch_size=FLAGS.max_load_batches)\n",
139 | "\n",
140 | " if FLAGS.source_valid_data and FLAGS.target_valid_data:\n",
141 | " print 'Loading validation data..'\n",
142 | " valid_set = BiTextIterator(source=FLAGS.source_valid_data,\n",
143 | " target=FLAGS.target_valid_data,\n",
144 | " source_dict=FLAGS.source_vocabulary,\n",
145 | " target_dict=FLAGS.target_vocabulary,\n",
146 | " batch_size=FLAGS.batch_size,\n",
147 | " maxlen=None,\n",
148 | " n_words_source=FLAGS.num_encoder_symbols,\n",
149 | " n_words_target=FLAGS.num_decoder_symbols)\n",
150 | " else:\n",
151 | " valid_set = None\n",
152 | "\n",
153 | " # Initiate TF session\n",
154 | " with tf.Session(config=tf.ConfigProto(allow_soft_placement=FLAGS.allow_soft_placement, \n",
155 | " log_device_placement=FLAGS.log_device_placement, gpu_options=tf.GPUOptions(allow_growth=True))) as sess:\n",
156 | "\n",
157 | " # Create a log writer object\n",
158 | " log_writer = tf.summary.FileWriter(FLAGS.model_dir, graph=sess.graph)\n",
159 | " \n",
160 | " # Create a new model or reload existing checkpoint\n",
161 | " model = create_model(sess, FLAGS)\n",
162 | "\n",
163 | " step_time, loss = 0.0, 0.0\n",
164 | " words_seen, sents_seen = 0, 0\n",
165 | " start_time = time.time()\n",
166 | "\n",
167 | " # Training loop\n",
168 | " print 'Training..'\n",
169 | " for epoch_idx in xrange(FLAGS.max_epochs):\n",
170 | " if model.global_epoch_step.eval() >= FLAGS.max_epochs:\n",
171 | " print 'Training is already complete.', \\\n",
172 | " 'current epoch:{}, max epoch:{}'.format(model.global_epoch_step.eval(), FLAGS.max_epochs)\n",
173 | " break\n",
174 | "\n",
175 | " for source_seq, target_seq in train_set: \n",
176 | " # Get a batch from training parallel data\n",
177 | " source, source_len, target, target_len = prepare_train_batch(source_seq, target_seq,\n",
178 | " FLAGS.max_seq_length)\n",
179 | " if source is None or target is None:\n",
180 | " print 'No samples under max_seq_length ', FLAGS.max_seq_length\n",
181 | " continue\n",
182 | "\n",
183 | " # Execute a single training step\n",
184 | " step_loss, summary = model.train(sess, encoder_inputs=source, encoder_inputs_length=source_len, \n",
185 | " decoder_inputs=target, decoder_inputs_length=target_len)\n",
186 | "\n",
187 | " loss += float(step_loss) / FLAGS.display_freq\n",
188 | " words_seen += float(np.sum(source_len+target_len))\n",
189 | " sents_seen += float(source.shape[0]) # batch_size\n",
190 | "\n",
191 | " if model.global_step.eval() % FLAGS.display_freq == 0:\n",
192 | "\n",
193 | " avg_perplexity = math.exp(float(loss)) if loss < 300 else float(\"inf\")\n",
194 | "\n",
195 | " time_elapsed = time.time() - start_time\n",
196 | " step_time = time_elapsed / FLAGS.display_freq\n",
197 | "\n",
198 | " words_per_sec = words_seen / time_elapsed\n",
199 | " sents_per_sec = sents_seen / time_elapsed\n",
200 | "\n",
201 | " print 'Epoch ', model.global_epoch_step.eval(), 'Step ', model.global_step.eval(), \\\n",
202 | " 'Perplexity {0:.2f}'.format(avg_perplexity), 'Step-time ', step_time, \\\n",
203 | " '{0:.2f} sents/s'.format(sents_per_sec), '{0:.2f} words/s'.format(words_per_sec)\n",
204 | "\n",
205 | " loss = 0\n",
206 | " words_seen = 0\n",
207 | " sents_seen = 0\n",
208 | " start_time = time.time()\n",
209 | " \n",
210 | " # Record training summary for the current batch\n",
211 | " log_writer.add_summary(summary, model.global_step.eval())\n",
212 | "\n",
213 | " # Execute a validation step\n",
214 | " if valid_set and model.global_step.eval() % FLAGS.valid_freq == 0:\n",
215 | " print 'Validation step'\n",
216 | " valid_loss = 0.0\n",
217 | " valid_sents_seen = 0\n",
218 | " for source_seq, target_seq in valid_set:\n",
219 | " # Get a batch from validation parallel data\n",
220 | " source, source_len, target, target_len = prepare_train_batch(source_seq, target_seq)\n",
221 | "\n",
222 | " # Compute validation loss: average per word cross entropy loss\n",
223 | "                step_loss, summary = model.eval(sess, encoder_inputs=source, encoder_inputs_length=source_len,\n",
224 | " decoder_inputs=target, decoder_inputs_length=target_len)\n",
225 | " batch_size = source.shape[0]\n",
226 | "\n",
227 | " valid_loss += step_loss * batch_size\n",
228 | " valid_sents_seen += batch_size\n",
229 | " print ' {} samples seen'.format(valid_sents_seen)\n",
230 | "\n",
231 | " valid_loss = valid_loss / valid_sents_seen\n",
232 | " print 'Valid perplexity: {0:.2f}'.format(math.exp(valid_loss))\n",
233 | "\n",
234 | " # Save the model checkpoint\n",
235 | " if model.global_step.eval() % FLAGS.save_freq == 0:\n",
236 | " print 'Saving the model..'\n",
237 | " checkpoint_path = os.path.join(FLAGS.model_dir, FLAGS.model_name)\n",
238 | " model.save(sess, checkpoint_path, global_step=model.global_step)\n",
239 | " json.dump(model.config,\n",
240 | " open('%s-%d.json' % (checkpoint_path, model.global_step.eval()), 'wb'),\n",
241 | " indent=2)\n",
242 | "\n",
243 | " # Increase the epoch index of the model\n",
244 | " model.global_epoch_step_op.eval()\n",
245 | " print 'Epoch {0:} DONE'.format(model.global_epoch_step.eval())\n",
246 | " \n",
247 | " print 'Saving the last model..'\n",
248 | " checkpoint_path = os.path.join(FLAGS.model_dir, FLAGS.model_name)\n",
249 | " model.save(sess, checkpoint_path, global_step=model.global_step)\n",
250 | " json.dump(model.config,\n",
251 | " open('%s-%d.json' % (checkpoint_path, model.global_step.eval()), 'wb'),\n",
252 | " indent=2)\n",
253 | " \n",
254 | " print 'Training Terminated'"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {
261 | "collapsed": true
262 | },
263 | "outputs": [],
264 | "source": [
265 | "def main(_):\n",
266 | " train()\n",
267 | "\n",
268 | "\n",
269 | "if __name__ == '__main__':\n",
270 | " tf.app.run()"
271 | ]
272 | }
273 | ],
274 | "metadata": {
275 | "kernelspec": {
276 | "display_name": "Python 2",
277 | "language": "python",
278 | "name": "python2"
279 | },
280 | "language_info": {
281 | "codemirror_mode": {
282 | "name": "ipython",
283 | "version": 2
284 | },
285 | "file_extension": ".py",
286 | "mimetype": "text/x-python",
287 | "name": "python",
288 | "nbconvert_exporter": "python",
289 | "pygments_lexer": "ipython2",
290 | "version": "2.7.10"
291 | }
292 | },
293 | "nbformat": 4,
294 | "nbformat_minor": 0
295 | }
296 |
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
1 |
2 | #!/usr/bin/env python
3 | # coding: utf-8
4 |
5 | import os
6 | import math
7 | import time
8 | import json
9 | import random
10 |
11 | from collections import OrderedDict
12 |
13 | import numpy as np
14 | import tensorflow as tf
15 |
16 | from data.data_iterator import TextIterator
17 | from data.data_iterator import BiTextIterator
18 |
19 | import data.data_utils as data_utils
20 | from data.data_utils import prepare_batch
21 | from data.data_utils import prepare_train_batch
22 |
23 | from seq2seq_model import Seq2SeqModel
24 |
25 |
26 | # Data loading parameters
27 | tf.app.flags.DEFINE_string('source_vocabulary', 'data/europarl-v7.1.4M.de.json', 'Path to source vocabulary')
28 | tf.app.flags.DEFINE_string('target_vocabulary', 'data/europarl-v7.1.4M.fr.json', 'Path to target vocabulary')
29 | tf.app.flags.DEFINE_string('source_train_data', 'data/europarl-v7.1.4M.de', 'Path to source training data')
30 | tf.app.flags.DEFINE_string('target_train_data', 'data/europarl-v7.1.4M.fr', 'Path to target training data')
31 | tf.app.flags.DEFINE_string('source_valid_data', 'data/newstest2012.bpe.de', 'Path to source validation data')
32 | tf.app.flags.DEFINE_string('target_valid_data', 'data/newstest2012.bpe.fr', 'Path to target validation data')
33 |
34 | # Network parameters
35 | tf.app.flags.DEFINE_string('cell_type', 'lstm', 'RNN cell for encoder and decoder, default: lstm')
36 | tf.app.flags.DEFINE_string('attention_type', 'bahdanau', 'Attention mechanism: (bahdanau, luong), default: bahdanau')
37 | tf.app.flags.DEFINE_integer('hidden_units', 1024, 'Number of hidden units in each layer')
38 | tf.app.flags.DEFINE_integer('depth', 2, 'Number of layers in each encoder and decoder')
39 | tf.app.flags.DEFINE_integer('embedding_size', 500, 'Embedding dimensions of encoder and decoder inputs')
40 | tf.app.flags.DEFINE_integer('num_encoder_symbols', 30000, 'Source vocabulary size')
41 | tf.app.flags.DEFINE_integer('num_decoder_symbols', 30000, 'Target vocabulary size')
42 |
43 | tf.app.flags.DEFINE_boolean('use_residual', True, 'Use residual connection between layers')
44 | tf.app.flags.DEFINE_boolean('attn_input_feeding', False, 'Use input feeding method in attentional decoder')
45 | tf.app.flags.DEFINE_boolean('use_dropout', True, 'Use dropout in each rnn cell')
46 | tf.app.flags.DEFINE_float('dropout_rate', 0.3, 'Dropout probability for input/output/state units (0.0: no dropout)')
47 |
48 | # Training parameters
49 | tf.app.flags.DEFINE_float('learning_rate', 0.0002, 'Learning rate')
50 | tf.app.flags.DEFINE_float('max_gradient_norm', 1.0, 'Clip gradients to this norm')
51 | tf.app.flags.DEFINE_integer('batch_size', 128, 'Batch size')
52 | tf.app.flags.DEFINE_integer('max_epochs', 10, 'Maximum # of training epochs')
53 | tf.app.flags.DEFINE_integer('max_load_batches', 20, 'Maximum # of batches to load at one time')
54 | tf.app.flags.DEFINE_integer('max_seq_length', 50, 'Maximum sequence length')
55 | tf.app.flags.DEFINE_integer('display_freq', 100, 'Display training status every this iteration')
56 | tf.app.flags.DEFINE_integer('save_freq', 11500, 'Save model checkpoint every this iteration')
57 | tf.app.flags.DEFINE_integer('valid_freq', 1150000, 'Evaluate model every this iteration: valid_data needed')
58 | tf.app.flags.DEFINE_string('optimizer', 'adam', 'Optimizer for training: (adadelta, adam, rmsprop)')
59 | tf.app.flags.DEFINE_string('model_dir', 'model/', 'Path to save model checkpoints')
60 | tf.app.flags.DEFINE_string('model_name', 'translate.ckpt', 'File name used for model checkpoints')
61 | tf.app.flags.DEFINE_boolean('shuffle_each_epoch', True, 'Shuffle training dataset for each epoch')
62 | tf.app.flags.DEFINE_boolean('sort_by_length', True, 'Sort pre-fetched minibatches by their target sequence lengths')
63 | tf.app.flags.DEFINE_boolean('use_fp16', False, 'Use half precision float16 instead of float32 as dtype')
64 |
65 | # Runtime parameters
66 | tf.app.flags.DEFINE_boolean('allow_soft_placement', True, 'Allow device soft placement')
67 | tf.app.flags.DEFINE_boolean('log_device_placement', False, 'Log placement of ops on devices')
68 |
69 | FLAGS = tf.app.flags.FLAGS
70 |
71 | def create_model(session, FLAGS):
72 |
73 | config = OrderedDict(sorted(FLAGS.__flags.items()))
74 | model = Seq2SeqModel(config, 'train')
75 |
76 | ckpt = tf.train.get_checkpoint_state(FLAGS.model_dir)
77 | if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
78 | print 'Reloading model parameters..'
79 | model.restore(session, ckpt.model_checkpoint_path)
80 |
81 | else:
82 | if not os.path.exists(FLAGS.model_dir):
83 | os.makedirs(FLAGS.model_dir)
84 | print 'Created new model parameters..'
85 | session.run(tf.global_variables_initializer())
86 |
87 | return model
88 |
89 | def train():
90 | # Load parallel data to train
91 | print 'Loading training data..'
92 | train_set = BiTextIterator(source=FLAGS.source_train_data,
93 | target=FLAGS.target_train_data,
94 | source_dict=FLAGS.source_vocabulary,
95 | target_dict=FLAGS.target_vocabulary,
96 | batch_size=FLAGS.batch_size,
97 | maxlen=FLAGS.max_seq_length,
98 | n_words_source=FLAGS.num_encoder_symbols,
99 | n_words_target=FLAGS.num_decoder_symbols,
100 | shuffle_each_epoch=FLAGS.shuffle_each_epoch,
101 | sort_by_length=FLAGS.sort_by_length,
102 | maxibatch_size=FLAGS.max_load_batches)
103 |
104 | if FLAGS.source_valid_data and FLAGS.target_valid_data:
105 | print 'Loading validation data..'
106 | valid_set = BiTextIterator(source=FLAGS.source_valid_data,
107 | target=FLAGS.target_valid_data,
108 | source_dict=FLAGS.source_vocabulary,
109 | target_dict=FLAGS.target_vocabulary,
110 | batch_size=FLAGS.batch_size,
111 | maxlen=None,
112 | n_words_source=FLAGS.num_encoder_symbols,
113 | n_words_target=FLAGS.num_decoder_symbols)
114 | else:
115 | valid_set = None
116 |
117 | # Initiate TF session
118 | with tf.Session(config=tf.ConfigProto(allow_soft_placement=FLAGS.allow_soft_placement,
119 | log_device_placement=FLAGS.log_device_placement, gpu_options=tf.GPUOptions(allow_growth=True))) as sess:
120 |
121 | # Create a new model or reload existing checkpoint
122 | model = create_model(sess, FLAGS)
123 |
124 | # Create a log writer object
125 | log_writer = tf.summary.FileWriter(FLAGS.model_dir, graph=sess.graph)
126 |
127 |
128 |
129 | step_time, loss = 0.0, 0.0
130 | words_seen, sents_seen = 0, 0
131 | start_time = time.time()
132 |
133 | # Training loop
134 | print 'Training..'
135 | for epoch_idx in xrange(FLAGS.max_epochs):
136 | if model.global_epoch_step.eval() >= FLAGS.max_epochs:
137 | print 'Training is already complete.', \
138 | 'current epoch:{}, max epoch:{}'.format(model.global_epoch_step.eval(), FLAGS.max_epochs)
139 | break
140 |
141 | for source_seq, target_seq in train_set:
142 | # Get a batch from training parallel data
143 | source, source_len, target, target_len = prepare_train_batch(source_seq, target_seq,
144 | FLAGS.max_seq_length)
145 | if source is None or target is None:
146 | print 'No samples under max_seq_length ', FLAGS.max_seq_length
147 | continue
148 |
149 | # Execute a single training step
150 | step_loss, summary = model.train(sess, encoder_inputs=source, encoder_inputs_length=source_len,
151 | decoder_inputs=target, decoder_inputs_length=target_len)
152 |
153 | loss += float(step_loss) / FLAGS.display_freq
154 | words_seen += float(np.sum(source_len+target_len))
155 | sents_seen += float(source.shape[0]) # batch_size
156 |
157 | if model.global_step.eval() % FLAGS.display_freq == 0:
158 |
159 | avg_perplexity = math.exp(float(loss)) if loss < 300 else float("inf")
160 |
161 | time_elapsed = time.time() - start_time
162 | step_time = time_elapsed / FLAGS.display_freq
163 |
164 | words_per_sec = words_seen / time_elapsed
165 | sents_per_sec = sents_seen / time_elapsed
166 |
167 | print 'Epoch ', model.global_epoch_step.eval(), 'Step ', model.global_step.eval(), \
168 | 'Perplexity {0:.2f}'.format(avg_perplexity), 'Step-time ', step_time, \
169 | '{0:.2f} sents/s'.format(sents_per_sec), '{0:.2f} words/s'.format(words_per_sec)
170 |
171 | loss = 0
172 | words_seen = 0
173 | sents_seen = 0
174 | start_time = time.time()
175 |
176 | # Record training summary for the current batch
177 | log_writer.add_summary(summary, model.global_step.eval())
178 |
179 | # Execute a validation step
180 | if valid_set and model.global_step.eval() % FLAGS.valid_freq == 0:
181 | print 'Validation step'
182 | valid_loss = 0.0
183 | valid_sents_seen = 0
184 | for source_seq, target_seq in valid_set:
185 | # Get a batch from validation parallel data
186 | source, source_len, target, target_len = prepare_train_batch(source_seq, target_seq)
187 |
188 | # Compute validation loss: average per word cross entropy loss
189 | step_loss, summary = model.eval(sess, encoder_inputs=source, encoder_inputs_length=source_len,
190 | decoder_inputs=target, decoder_inputs_length=target_len)
191 | batch_size = source.shape[0]
192 |
193 | valid_loss += step_loss * batch_size
194 | valid_sents_seen += batch_size
195 | print ' {} samples seen'.format(valid_sents_seen)
196 |
197 | valid_loss = valid_loss / valid_sents_seen
198 | print 'Valid perplexity: {0:.2f}'.format(math.exp(valid_loss))
199 |
200 | # Save the model checkpoint
201 | if model.global_step.eval() % FLAGS.save_freq == 0:
202 | print 'Saving the model..'
203 | checkpoint_path = os.path.join(FLAGS.model_dir, FLAGS.model_name)
204 | model.save(sess, checkpoint_path, global_step=model.global_step)
205 | json.dump(model.config,
206 | open('%s-%d.json' % (checkpoint_path, model.global_step.eval()), 'wb'),
207 | indent=2)
208 |
209 | # Increase the epoch index of the model
210 | model.global_epoch_step_op.eval()
211 | print 'Epoch {0:} DONE'.format(model.global_epoch_step.eval())
212 |
213 | print 'Saving the last model..'
214 | checkpoint_path = os.path.join(FLAGS.model_dir, FLAGS.model_name)
215 | model.save(sess, checkpoint_path, global_step=model.global_step)
216 | json.dump(model.config,
217 | open('%s-%d.json' % (checkpoint_path, model.global_step.eval()), 'wb'),
218 | indent=2)
219 |
220 | print 'Training Terminated'
221 |
222 |
223 |
224 | def main(_):
225 | train()
226 |
227 |
228 | if __name__ == '__main__':
229 | tf.app.run()
230 |
--------------------------------------------------------------------------------
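
For reference, the running perplexity that `train.py` (and the notebook) prints every `display_freq` steps is simply the exponential of the mean per-step cross-entropy accumulated since the last report. A minimal numeric sketch with made-up step losses:

```python
import math

display_freq = 4
step_losses = [4.2, 4.0, 3.9, 3.8]   # cross-entropy values as returned by model.train()

loss = 0.0
for step_loss in step_losses:
    loss += float(step_loss) / display_freq   # same accumulation as the training loop

# Guard against overflow exactly as the training loop does
avg_perplexity = math.exp(loss) if loss < 300 else float("inf")
print('Perplexity {0:.2f}'.format(avg_perplexity))   # ~53.25
```
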