├── README.md ├── lib ├── __init__.py ├── __init__.pyc ├── data │ ├── Constants.py │ ├── Constants.pyc │ ├── Dataset.py │ ├── Dataset.pyc │ ├── Dict.py │ ├── Dict.pyc │ ├── __init__.py │ └── __init__.pyc ├── eval │ ├── Evaluator.py │ ├── Evaluator.pyc │ ├── __init__.py │ └── __init__.pyc ├── metric │ ├── Bleu.py │ ├── Bleu.pyc │ ├── Loss.py │ ├── Loss.pyc │ ├── PertFunction.py │ ├── PertFunction.pyc │ ├── Reward.py │ ├── Reward.pyc │ ├── __init__.py │ ├── __init__.pyc │ └── test_shaping.py ├── model │ ├── EncoderDecoder.py │ ├── EncoderDecoder.pyc │ ├── Generator.py │ ├── Generator.pyc │ ├── GlobalAttention.py │ ├── GlobalAttention.pyc │ ├── __init__.py │ └── __init__.pyc └── train │ ├── Optim.py │ ├── Optim.pyc │ ├── ReinforceTrainer.py │ ├── ReinforceTrainer.pyc │ ├── Trainer.py │ ├── Trainer.pyc │ ├── __init__.py │ └── __init__.pyc ├── preprocess.py ├── requirements.txt ├── scripts ├── extract_parallel.py ├── lowercase.perl ├── multi-bleu.perl ├── output.py ├── parse.py ├── prepare_data.sh ├── preprocess.py ├── sgm.perl ├── strip.py ├── tokenizer.perl ├── train.sh └── translate.sh ├── share └── nonbreaking_prefixes │ ├── README.txt │ ├── nonbreaking_prefix.ca │ ├── nonbreaking_prefix.cs │ ├── nonbreaking_prefix.de │ ├── nonbreaking_prefix.el │ ├── nonbreaking_prefix.en │ ├── nonbreaking_prefix.es │ ├── nonbreaking_prefix.fi │ ├── nonbreaking_prefix.fr │ ├── nonbreaking_prefix.ga │ ├── nonbreaking_prefix.hu │ ├── nonbreaking_prefix.is │ ├── nonbreaking_prefix.it │ ├── nonbreaking_prefix.lt │ ├── nonbreaking_prefix.lv │ ├── nonbreaking_prefix.nl │ ├── nonbreaking_prefix.pl │ ├── nonbreaking_prefix.pt │ ├── nonbreaking_prefix.ro │ ├── nonbreaking_prefix.ru │ ├── nonbreaking_prefix.sk │ ├── nonbreaking_prefix.sl │ ├── nonbreaking_prefix.sv │ ├── nonbreaking_prefix.ta │ ├── nonbreaking_prefix.yue │ └── nonbreaking_prefix.zh ├── train.py └── translate.py
/README.md:
--------------------------------------------------------------------------------
1 | # Multilingual Neural Machine Translation System for TV News
2 | 
3 | _This is my [Google Summer of Code 2018](https://summerofcode.withgoogle.com/projects/#6685973346254848) project with [the Distributed Little Red Hen Lab](http://www.redhenlab.org/)._
4 | 
5 | The aim of this project is to build a Multilingual Neural Machine Translation System capable of translating Red Hen Lab's TV news transcripts from different source languages into English.
6 | 
7 | The system uses reinforcement learning (the advantage actor-critic algorithm) on top of a neural encoder-decoder architecture and outperforms simple neural machine translation based on maximum log-likelihood training. Our system achieves close to state-of-the-art results on the standard WMT (Workshop on Machine Translation) test datasets.
8 | 
9 | This project is inspired by the approaches described in the paper [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086).
10 | 
11 | I maintain a GSoC blog; please refer to it for all my GSoC blog posts about the progress made so far.
12 | Blog link: https://vikrant97.github.io/gsoc_blog/
13 | 
14 | The following languages are supported as source languages, together with their language codes:
15 | 1) **German - de**
16 | 2) **French - fr**
17 | 3) **Russian - ru**
18 | 4) **Czech - cs**
19 | 5) **Spanish - es**
20 | 6) **Portuguese - pt**
21 | 7) **Danish - da**
22 | 8) **Swedish - sv**
23 | 9) **Chinese - zh**
24 | The target language is English (en).
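The actor-critic setup referred to above is implemented in `lib/train/ReinforceTrainer.py` (included further down in this dump): sampled translations are scored with a smoothed sentence-level BLEU reward, a critic network predicts the expected reward, and the actor is updated using the advantage (reward minus the critic's baseline) as a per-token weight on the log-likelihood. The snippet below is only a minimal, self-contained sketch of that objective; the function and variable names are illustrative and are not part of the repository (the critic itself is trained separately with a masked regression loss, see `lib/metric/Loss.py`).

```python
# Illustrative sketch only -- not repository code.
def actor_critic_loss(log_probs, reward, baselines):
    """log_probs : log-probability the actor assigned to each sampled token
       reward    : sentence-level reward (e.g. smoothed BLEU) for the sample
       baselines : critic's predicted reward at each decoding step
       Returns a scalar whose gradient matches the policy-gradient update."""
    loss = 0.0
    for lp, b in zip(log_probs, baselines):
        advantage = reward - b   # how much better the sample did than the critic expected
        loss += -advantage * lp  # raise log-prob of tokens with positive advantage
    return loss

# Toy usage: a 3-token sample with BLEU 0.4 and a critic expecting roughly 0.3.
print(actor_critic_loss([-1.2, -0.7, -0.3], 0.4, [0.30, 0.31, 0.29]))
```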
25 | 
26 | ## Getting Started
27 | 
28 | ### Prerequisites
29 | 
30 | * Python-2.7
31 | * PyTorch-0.3
32 | * Tensorflow-gpu
33 | * Numpy
34 | * CUDA
35 | 
36 | ### Installation & Setup Instructions on CASE HPC
37 | 
38 | * Users who want the pipeline to work on CASE HPC should simply copy the directory named **nmt** from the home directory of my HPC account, i.e. **/home/vxg195**, and then follow the instructions described below for training & translation.
39 | 
40 | * The **nmt** directory contains the following subdirectories:
41 |   * singularity
42 |   * data
43 |   * models
44 |   * Neural-Machine-Translation
45 |   * myenv
46 | 
47 | * The **singularity** directory contains a Singularity image (rh_xenial_20180308.img) which is copied from the home directory of **Mr. Michael Pacchioli's CASE HPC account**. This Singularity image provides modules such as CUDA and cuDNN needed by the system.
48 | 
49 | * The **data** directory consists of cleaned & processed datasets for the respective language pairs. Its subdirectories should be named like **de-en**, where **de** & **en** are the language codes for **German** & **English**. So for any language pair whose source language is **$src** and whose target language is **$tgt**, the language data subdirectory should be named **$src-$tgt** and should contain the following files (train, validation & test):
50 |   * train.$src-$tgt.$src.processed
51 |   * train.$src-$tgt.$tgt.processed
52 |   * valid.$src-$tgt.$src.processed
53 |   * valid.$src-$tgt.$tgt.processed
54 |   * test.$src-$tgt.$src.processed
55 |   * test.$src-$tgt.$tgt.processed
56 | 
57 | * The **models** directory consists of trained models for the respective language pairs and follows the same subdirectory structure as the **data** directory. For example, **models/de-en** contains the trained models for the **German-English** language pair.
58 | 
59 | * The following commands were used to install the dependencies for the project:
60 | ```bash
61 | $ git clone https://github.com/RedHenLab/Neural-Machine-Translation.git
62 | $ virtualenv myenv
63 | $ source myenv/bin/activate
64 | $ pip install -r Neural-Machine-Translation/requirements.txt
65 | ```
66 | * **Note** that the virtual environment (myenv) created with the virtualenv command above must be a **Python 2** environment.
67 | 
68 | ## Data Preparation and Preprocessing
69 | 
70 | Please note that these data preparation steps have to be done manually, as we are dealing with a multilingual system and each language pair may have different sources of data. For instance, I used several data sources such as Europarl, News Commentary, CommonCrawl & other open-source datasets. One can have a look at the WMT shared tasks on machine translation to find suitable datasets. I wrote a bash script which processes & prepares a dataset for MT. The following steps can be used to prepare a dataset for MT:
71 | 1) First copy the raw dataset files into the language ($src-$tgt) subdirectory of the data directory in the following format:
72 |   * train.$src-$tgt.$src
73 |   * train.$src-$tgt.$tgt
74 |   * valid.$src-$tgt.$src
75 |   * valid.$src-$tgt.$tgt
76 |   * test.$src-$tgt.$src
77 |   * test.$src-$tgt.$tgt
78 | 
79 | 2) Now create an empty directory named $src-$tgt in the Neural-Machine-Translation/subword_nmt directory. Copy the file named "prepare_data.sh" into the language subdirectory for which we need to prepare the dataset.
Then use the following commands to process the dataset for training:
80 | ```bash
81 | bash prepare_data.sh $src $tgt
82 | ```
83 | After this step, clear the entire language directory & keep only the \*.processed files. Your processed dataset is ready!
84 | 
85 | ## Training
86 | 
87 | To train a model on CASE HPC, run the train.sh file placed in the Neural-Machine-Translation/scripts folder. The training parameters are chosen so that a model can be trained efficiently for any newly introduced language pair, but they should still be tuned to the dataset. The prerequisite for training a model is that the parallel data, as described in the **Installation & Setup** section, resides in the corresponding language-pair directory inside the data folder. The trained models will be saved in the language-pair directory inside the models folder. To train a model on CASE HPC, run the following commands:
88 | 
89 | ```bash
90 | cd Neural-Machine-Translation/scripts
91 | sbatch train.sh $src $tgt
92 | # For example, to train a model for German->English, run the following command
93 | sbatch train.sh de en
94 | ```
95 | After training, the trained model will be saved in the language ($src-$tgt) subdirectory of the models directory. The saved model will be named something like "model_15.pt" and should be renamed to "model_15_best.pt".
96 | 
97 | ## Translation
98 | This project supports translation of both normal text files and news transcripts in any supported language pair.
99 | To translate any input news transcript, run the following commands:
100 | ```bash
101 | cd Neural-Machine-Translation/scripts
102 | sbatch translate.sh 0
103 | ```
104 | To translate any normal text file, run the following commands:
105 | ```bash
106 | cd Neural-Machine-Translation/scripts
107 | sbatch translate.sh 1
108 | ```
109 | **Note that the translated output file will be saved in the same directory as the input file, with the string ".pred" appended to the input file's name.**
110 | 
111 | ## Evaluation of the trained model
112 | For evaluation, first generate translations of a source test corpus. We then need to measure their quality against the original target test corpus. For this, we use the multi-bleu.perl script residing in the scripts directory, which computes the corpus-level BLEU score.
Usage instructions:
113 | ```bash
114 | perl scripts/multi-bleu.perl $reference-file < $hypothesis-file
115 | ```
116 | 
117 | ## Acknowledgements
118 | 
119 | * [Google Summer of Code 2018](https://summerofcode.withgoogle.com/)
120 | * [Red Hen Lab](http://www.redhenlab.org/)
121 | * [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py)
122 | * [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086)
123 | * [Europarl](http://www.statmt.org/europarl/)
124 | * [Moses](https://github.com/moses-smt/mosesdecoder)
125 | 
--------------------------------------------------------------------------------
/lib/__init__.py:
--------------------------------------------------------------------------------
1 | from .data import *
2 | from .eval import *
3 | from .metric import *
4 | from .model import *
5 | from .train import *
6 | 
--------------------------------------------------------------------------------
/lib/__init__.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/__init__.pyc
--------------------------------------------------------------------------------
/lib/data/Constants.py:
--------------------------------------------------------------------------------
1 | 
2 | PAD = 0
3 | UNK = 1
4 | BOS = 2
5 | EOS = 3
6 | 
7 | PAD_WORD = '<blank>'
8 | UNK_WORD = '<unk>'
9 | BOS_WORD = '<s>'
10 | EOS_WORD = '</s>'
11 | 
--------------------------------------------------------------------------------
/lib/data/Constants.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/data/Constants.pyc
--------------------------------------------------------------------------------
/lib/data/Dataset.py:
--------------------------------------------------------------------------------
1 | from __future__ import division
2 | 
3 | import math
4 | import random
5 | 
6 | import torch
7 | from torch.autograd import Variable
8 | 
9 | import lib
10 | 
11 | 
12 | class Dataset(object):
13 |     def __init__(self, data, batchSize, cuda, eval=False):
14 |         self.src = data["src"]
15 |         self.tgt = data["tgt"]
16 |         self.pos = data["pos"]
17 |         assert(len(self.src) == len(self.tgt))
18 |         self.cuda = cuda
19 | 
20 |         self.batchSize = batchSize
21 |         self.numBatches = math.ceil(len(self.src)/batchSize)
22 |         self.eval = eval
23 | 
24 |     def _batchify(self, data, align_right=False, include_lengths=False):
25 |         lengths = [x.size(0) for x in data]
26 |         max_length = max(lengths)
27 |         out = data[0].new(len(data), max_length).fill_(lib.Constants.PAD)
28 |         for i in range(len(data)):
29 |             data_length = data[i].size(0)
30 |             offset = max_length - data_length if align_right else 0
31 |             out[i].narrow(0, offset, data_length).copy_(data[i])
32 | 
33 |         if include_lengths:
34 |             return out, lengths
35 |         else:
36 |             return out
37 | 
38 |     def __getitem__(self, index):
39 |         assert index < self.numBatches, "%d > %d" % (index, self.numBatches)
40 |         srcBatch, lengths = self._batchify(self.src[index*self.batchSize:(index+1)*self.batchSize],
41 |                                            include_lengths=True)
42 | 
43 |         tgtBatch = self._batchify(self.tgt[index*self.batchSize:(index+1)*self.batchSize])
44 | 
45 |         # within batch sort by decreasing length.
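        # (pack_padded_sequence, used by the encoder, expects source lengths in decreasing order; `indices` keeps the original positions so the order can be restored later.)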
46 | indices = range(len(srcBatch)) 47 | batch = zip(indices, srcBatch, tgtBatch) 48 | batch, lengths = zip(*sorted(zip(batch, lengths), key=lambda x: -x[1])) 49 | indices, srcBatch, tgtBatch = zip(*batch) 50 | 51 | def wrap(b): 52 | b = torch.stack(b, 0).t().contiguous() 53 | if self.cuda: 54 | b = b.cuda() 55 | b = Variable(b, volatile=self.eval) 56 | return b 57 | 58 | return (wrap(srcBatch), lengths), wrap(tgtBatch), indices 59 | 60 | def __len__(self): 61 | return self.numBatches 62 | 63 | def shuffle(self): 64 | data = list(zip(self.src, self.tgt, self.pos)) 65 | random.shuffle(data) 66 | self.src, self.tgt, self.pos = zip(*data) 67 | 68 | def restore_pos(self, sents): 69 | sorted_sents = [None] * len(self.pos) 70 | for sent, idx in zip(sents, self.pos): 71 | sorted_sents[idx] = sent 72 | return sorted_sents 73 | -------------------------------------------------------------------------------- /lib/data/Dataset.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/data/Dataset.pyc -------------------------------------------------------------------------------- /lib/data/Dict.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | class Dict(object): 5 | def __init__(self, data=None): 6 | self.idxToLabel = {} 7 | self.labelToIdx = {} 8 | self.frequencies = {} 9 | 10 | # Special entries will not be pruned. 11 | self.special = [] 12 | 13 | if data is not None: 14 | if type(data) == str: 15 | self.loadFile(data) 16 | else: 17 | self.addSpecials(data) 18 | 19 | def size(self): 20 | return len(self.idxToLabel) 21 | 22 | # Load entries from a file. 23 | def loadFile(self, filename): 24 | for line in open(filename): 25 | fields = line.split() 26 | label = fields[0] 27 | idx = int(fields[1]) 28 | self.add(label, idx) 29 | 30 | # Write entries to a file. 31 | def writeFile(self, filename): 32 | with open(filename, 'w') as file: 33 | for i in range(self.size()): 34 | label = self.idxToLabel[i] 35 | file.write('%s %d\n' % (label, i)) 36 | 37 | file.close() 38 | 39 | def lookup(self, key, default=None): 40 | try: 41 | return self.labelToIdx[key] 42 | except KeyError: 43 | return default 44 | 45 | def getLabel(self, idx, default=None): 46 | try: 47 | return self.idxToLabel[idx] 48 | except KeyError: 49 | return default 50 | 51 | # Mark this `label` and `idx` as special (i.e. will not be pruned). 52 | def addSpecial(self, label, idx=None): 53 | idx = self.add(label, idx) 54 | self.special += [idx] 55 | 56 | # Mark all labels in `labels` as specials (i.e. will not be pruned). 57 | def addSpecials(self, labels): 58 | for label in labels: 59 | self.addSpecial(label) 60 | 61 | # Add `label` in the dictionary. Use `idx` as its index if given. 62 | def add(self, label, idx=None): 63 | if idx is not None: 64 | self.idxToLabel[idx] = label 65 | self.labelToIdx[label] = idx 66 | else: 67 | if label in self.labelToIdx: 68 | idx = self.labelToIdx[label] 69 | else: 70 | idx = len(self.idxToLabel) 71 | self.idxToLabel[idx] = label 72 | self.labelToIdx[label] = idx 73 | 74 | if idx not in self.frequencies: 75 | self.frequencies[idx] = 1 76 | else: 77 | self.frequencies[idx] += 1 78 | 79 | return idx 80 | 81 | # Return a new dictionary with the `size` most frequent entries. 82 | def prune(self, size): 83 | if size >= self.size(): 84 | return self 85 | 86 | # Only keep the `size` most frequent entries. 
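        # Sort indices by frequency (descending), rebuild a fresh Dict from the top `size` labels, and re-add the special tokens so they always survive pruning.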
87 | freq = torch.Tensor( 88 | [self.frequencies[i] for i in range(len(self.frequencies))]) 89 | _, idx = torch.sort(freq, 0, True) 90 | 91 | newDict = Dict() 92 | 93 | # Add special entries in all cases. 94 | for i in self.special: 95 | newDict.addSpecial(self.idxToLabel[i]) 96 | 97 | for i in idx[:size]: 98 | newDict.add(self.idxToLabel[i]) 99 | 100 | return newDict 101 | 102 | # Convert `labels` to indices. Use `unkWord` if not found. 103 | # Optionally insert `bosWord` at the beginning and `eosWord` at the . 104 | def convertToIdx(self, labels, unkWord, bosWord=None, eosWord=None): 105 | vec = [] 106 | 107 | if bosWord is not None: 108 | vec += [self.lookup(bosWord)] 109 | 110 | unk = self.lookup(unkWord) 111 | vec += [self.lookup(label.lower(), default=unk) for label in labels] 112 | 113 | if eosWord is not None: 114 | vec += [self.lookup(eosWord)] 115 | 116 | return torch.LongTensor(vec) 117 | 118 | # Convert `idx` to labels. If index `stop` is reached, convert it and return. 119 | def convertToLabels(self, idx, stop): 120 | labels = [] 121 | 122 | for i in idx: 123 | labels += [self.getLabel(i)] 124 | if i == stop: 125 | break 126 | 127 | return labels 128 | -------------------------------------------------------------------------------- /lib/data/Dict.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/data/Dict.pyc -------------------------------------------------------------------------------- /lib/data/__init__.py: -------------------------------------------------------------------------------- 1 | from .Dict import Dict 2 | from .Dataset import Dataset 3 | from .Constants import * 4 | -------------------------------------------------------------------------------- /lib/data/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/data/__init__.pyc -------------------------------------------------------------------------------- /lib/eval/Evaluator.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | import lib 3 | 4 | class Evaluator(object): 5 | def __init__(self, model, metrics, dicts, opt): 6 | self.model = model 7 | self.loss_func = metrics["nmt_loss"] 8 | self.sent_reward_func = metrics["sent_reward"] 9 | self.corpus_reward_func = metrics["corp_reward"] 10 | self.dicts = dicts 11 | self.max_length = opt.max_predict_length 12 | 13 | def eval(self, data, pred_file=None): 14 | self.model.eval() 15 | 16 | total_loss = 0 17 | total_words = 0 18 | total_sents = 0 19 | total_sent_reward = 0 20 | 21 | all_preds = [] 22 | all_targets = [] 23 | for i in range(len(data)): 24 | batch = data[i] 25 | targets = batch[1] 26 | 27 | attention_mask = batch[0][0].data.eq(lib.Constants.PAD).t() 28 | self.model.decoder.attn.applyMask(attention_mask) 29 | outputs = self.model(batch, True) 30 | 31 | 32 | weights = targets.ne(lib.Constants.PAD).float() 33 | num_words = weights.data.sum() 34 | _, loss = self.model.predict(outputs, targets, weights, self.loss_func) 35 | 36 | preds = self.model.translate(batch, self.max_length) 37 | preds = preds.t().tolist() 38 | targets = targets.data.t().tolist() 39 | rewards, _ = self.sent_reward_func(preds, targets) 40 | 41 | #hack 42 | indices=batch[2] 43 | new_batch=zip(preds,targets) 44 | 
new_batch,indices=zip(*sorted(zip(new_batch,indices),key=lambda x: x[1])) 45 | preds,targets=zip(*new_batch) 46 | ### 47 | 48 | all_preds.extend(preds) 49 | all_targets.extend(targets) 50 | 51 | total_loss += loss 52 | total_words += num_words 53 | total_sent_reward += sum(rewards) 54 | total_sents += batch[1].size(1) 55 | 56 | loss = total_loss / total_words 57 | sent_reward = total_sent_reward / total_sents 58 | corpus_reward = self.corpus_reward_func(all_preds, all_targets) 59 | 60 | if pred_file is not None: 61 | self._convert_and_report(data, pred_file, all_preds, 62 | (loss, sent_reward, corpus_reward)) 63 | 64 | return loss, sent_reward, corpus_reward 65 | 66 | def _convert_and_report(self, data, pred_file, preds, metrics): 67 | preds = data.restore_pos(preds) 68 | with open(pred_file, "w") as f: 69 | for sent in preds: 70 | sent = lib.Reward.clean_up_sentence(sent, remove_unk=False, remove_eos=True) 71 | sent = [self.dicts["tgt"].getLabel(w) for w in sent] 72 | x=" ".join(sent)+'\n' 73 | f.write(x) 74 | f.close() 75 | loss, sent_reward, corpus_reward = metrics 76 | print("") 77 | print("Loss: %.6f" % loss) 78 | print("Sentence reward: %.2f" % (sent_reward * 100)) 79 | print("Corpus reward: %.2f" % (corpus_reward * 100)) 80 | print("Predictions saved to %s" % pred_file) 81 | 82 | 83 | -------------------------------------------------------------------------------- /lib/eval/Evaluator.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/eval/Evaluator.pyc -------------------------------------------------------------------------------- /lib/eval/__init__.py: -------------------------------------------------------------------------------- 1 | from .Evaluator import Evaluator 2 | 3 | -------------------------------------------------------------------------------- /lib/eval/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/eval/__init__.pyc -------------------------------------------------------------------------------- /lib/metric/Bleu.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from collections import defaultdict 3 | import math 4 | 5 | def _update_ngrams_count(sent, ngrams, count): 6 | length = len(sent) 7 | for n in range(1, ngrams + 1): 8 | for i in range(length - n + 1): 9 | ngram = tuple(sent[i : (i + n)]) 10 | count[ngram] += 1 11 | 12 | def _compute_bleu(p, len_pred, len_gold, smooth): 13 | # Brevity penalty. 14 | log_brevity = 1 - max(1, (len_gold + smooth) / (len_pred + smooth)) 15 | log_score = 0 16 | ngrams = len(p) - 1 17 | for n in range(1, ngrams + 1): 18 | if p[n][1] > 0: 19 | if p[n][0] == 0: 20 | p[n][0] = 1e-16 21 | log_precision = math.log((p[n][0] + smooth) / (p[n][1] + smooth)) 22 | log_score += log_precision 23 | log_score /= ngrams 24 | return math.exp(log_score + log_brevity) 25 | 26 | 27 | # Calculate BLEU of prefixes of pred. 28 | def score_sentence(pred, gold, ngrams, smooth=0): 29 | scores = [] 30 | # Get ngrams count for gold. 31 | count_gold = defaultdict(int) 32 | _update_ngrams_count(gold, ngrams, count_gold) 33 | # Init ngrams count for pred to 0. 34 | count_pred = defaultdict(int) 35 | # p[n][0] stores the number of overlapped n-grams. 
36 | # p[n][1] is total # of n-grams in pred. 37 | p = [] 38 | for n in range(ngrams + 1): 39 | p.append([0, 0]) 40 | for i in range(len(pred)): 41 | for n in range(1, ngrams + 1): 42 | if i - n + 1 < 0: 43 | continue 44 | # n-gram is from i - n + 1 to i. 45 | ngram = tuple(pred[(i - n + 1) : (i + 1)]) 46 | # Update n-gram count. 47 | count_pred[ngram] += 1 48 | # Update p[n]. 49 | p[n][1] += 1 50 | if count_pred[ngram] <= count_gold[ngram]: 51 | p[n][0] += 1 52 | scores.append(_compute_bleu(p, i + 1, len(gold), smooth)) 53 | return scores 54 | 55 | # Calculate BLEU of a corpus. 56 | def score_corpus(preds, golds, ngrams, smooth=0): 57 | assert len(preds) == len(golds) 58 | p = [] 59 | for n in range(ngrams + 1): 60 | p.append([0, 0]) 61 | len_pred = len_gold = 0 62 | for pred, gold in zip(preds, golds): 63 | len_gold += len(gold) 64 | count_gold = defaultdict(int) 65 | _update_ngrams_count(gold, ngrams, count_gold) 66 | 67 | len_pred += len(pred) 68 | count_pred = defaultdict(int) 69 | _update_ngrams_count(pred, ngrams, count_pred) 70 | 71 | for k, v in count_pred.items(): 72 | n = len(k) 73 | p[n][0] += min(v, count_gold[k]) 74 | p[n][1] += v 75 | 76 | return _compute_bleu(p, len_pred, len_gold, smooth) 77 | 78 | -------------------------------------------------------------------------------- /lib/metric/Bleu.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/metric/Bleu.pyc -------------------------------------------------------------------------------- /lib/metric/Loss.py: -------------------------------------------------------------------------------- 1 | from torch.autograd import Variable 2 | import numpy as np 3 | import torch 4 | import torch.nn as nn 5 | import torch.nn.functional as F 6 | 7 | def weighted_xent_loss(logits, targets, weights): 8 | log_dist = F.log_softmax(logits) 9 | losses = -log_dist.gather(1, targets.unsqueeze(1)).squeeze(1) 10 | losses = losses * weights 11 | return losses.sum() 12 | 13 | def weighted_mse(logits, targets, weights): 14 | losses = (logits - targets)**2 15 | losses = losses * weights 16 | return losses.sum() 17 | -------------------------------------------------------------------------------- /lib/metric/Loss.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/metric/Loss.pyc -------------------------------------------------------------------------------- /lib/metric/PertFunction.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import random 3 | 4 | def _adver(rs, _not_use): 5 | return [1 - r for r in rs] 6 | 7 | def _random(rs, _not_use): 8 | return [random.random() for i in xrange(len(rs))] 9 | 10 | def _bin(rs, b): 11 | return [round(r * b) / b for r in rs] 12 | 13 | def _variance(rs, scale): 14 | res = [] 15 | for r in rs: 16 | # Use 0.67 instead of 67 because scores are in [0,1] instead of [0,100] as in human eval data. 
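        # std is tent-shaped in r: zero at r=0 and r=1 and largest for mid-range scores, so extreme rewards are perturbed the least.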
17 | std = min(r * 0.64, -0.67 * r + 0.67) * scale 18 | r_new = np.random.normal(r, std) 19 | r_new = max(0., min(r_new, 1.)) 20 | res.append(r_new) 21 | return res 22 | 23 | #def _noise(rs, std): 24 | # noises = np.random.normal(0, std, size=len(rs)).tolist() 25 | # return [r + noise for r, noise in zip(rs, noises)] 26 | 27 | def _curve(rs, p): 28 | return [r**p for r in rs] 29 | 30 | class PertFunction(object): 31 | def __init__(self, func_name, param): 32 | self.param = param 33 | if func_name == "bin": 34 | self.func = _bin 35 | elif func_name == "skew": 36 | self.func = _skew 37 | elif func_name == "variance": 38 | self.func = _variance 39 | elif func_name == "random": 40 | self.func = _random 41 | elif func_name == "adver": 42 | self.func = _adver 43 | 44 | def __call__(self, r): 45 | return self.func(r, self.param) 46 | -------------------------------------------------------------------------------- /lib/metric/PertFunction.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/metric/PertFunction.pyc -------------------------------------------------------------------------------- /lib/metric/Reward.py: -------------------------------------------------------------------------------- 1 | import lib 2 | 3 | def clean_up_sentence(sent, remove_unk=False, remove_eos=False): 4 | if lib.Constants.EOS in sent: 5 | sent = sent[:sent.index(lib.Constants.EOS) + 1] 6 | if remove_unk: 7 | sent = filter(lambda x: x != lib.Constants.UNK, sent) 8 | if remove_eos: 9 | if len(sent) > 0 and sent[-1] == lib.Constants.EOS: 10 | sent = sent[:-1] 11 | return sent 12 | 13 | def single_sentence_bleu(pair): 14 | length = len(pair[0]) 15 | pred, gold = pair 16 | pred = clean_up_sentence(pred, remove_unk=False, remove_eos=False) 17 | gold = clean_up_sentence(gold, remove_unk=False, remove_eos=False) 18 | len_pred = len(pred) 19 | if len_pred == 0: 20 | score = 0. 
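        # An empty prediction gets zero BLEU; it is padded back to the original length below so downstream tensors keep a consistent shape.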
21 | pred = [lib.Constants.PAD] * length 22 | else: 23 | score = lib.Bleu.score_sentence(pred, gold, 4, smooth=1)[-1] 24 | while len(pred) < length: 25 | pred.append(lib.Constants.PAD) 26 | 27 | #print pred 28 | #print gold 29 | #print score 30 | #print 31 | 32 | return score, pred 33 | 34 | def sentence_bleu(preds, golds): 35 | results = map(single_sentence_bleu, zip(preds, golds)) 36 | scores, preds = zip(*results) 37 | return scores, preds 38 | 39 | def corpus_bleu(preds, golds): 40 | assert len(preds) == len(golds) 41 | clean_preds = [] 42 | clean_golds = [] 43 | for pred, gold in zip(preds, golds): 44 | pred = clean_up_sentence(pred, remove_unk=False, remove_eos=True) 45 | gold = clean_up_sentence(gold, remove_unk=False, remove_eos=True) 46 | clean_preds.append(pred) 47 | clean_golds.append(gold) 48 | return lib.Bleu.score_corpus(clean_preds, clean_golds, 4) 49 | -------------------------------------------------------------------------------- /lib/metric/Reward.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/metric/Reward.pyc -------------------------------------------------------------------------------- /lib/metric/__init__.py: -------------------------------------------------------------------------------- 1 | from .PertFunction import PertFunction 2 | from .Loss import * 3 | from .Reward import * 4 | from .Bleu import * 5 | -------------------------------------------------------------------------------- /lib/metric/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/metric/__init__.pyc -------------------------------------------------------------------------------- /lib/metric/test_shaping.py: -------------------------------------------------------------------------------- 1 | from RewardShaping import * 2 | 3 | func = RewardShaping("bin", 5) 4 | print func.param 5 | print "Binning: " 6 | for i in np.arange(0, 1, 0.05): 7 | print i, " ---> ", func(i) 8 | print 9 | 10 | print "Noise: " 11 | func = RewardShaping("noise", 0.1) 12 | print func.param 13 | for i in xrange(10): 14 | r = 0.3 15 | print r, " ---> ", func(r) 16 | print 17 | 18 | 19 | def test_curve(func): 20 | print func.param 21 | for i in np.arange(0, 1, 0.1): 22 | print i, " ---> ", func(i), "Diff = ", i - func(i) 23 | print 24 | 25 | print "Curving: " 26 | func = RewardShaping("curve", 1.1) 27 | test_curve(func) 28 | func = RewardShaping("curve", 0.9) 29 | test_curve(func) 30 | func = RewardShaping("curve", 0.8) 31 | test_curve(func) 32 | 33 | func = RewardShaping("curve", 1.2) 34 | test_curve(func) 35 | 36 | func = RewardShaping("curve", 0.5) 37 | test_curve(func) 38 | 39 | func = RewardShaping("curve", 1.5) 40 | test_curve(func) 41 | 42 | -------------------------------------------------------------------------------- /lib/model/EncoderDecoder.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from torch.autograd import Variable 5 | from torch.nn.utils.rnn import pad_packed_sequence as unpack 6 | from torch.nn.utils.rnn import pack_padded_sequence as pack 7 | 8 | import lib 9 | 10 | class Encoder(nn.Module): 11 | def __init__(self, opt, dicts): 12 | self.layers = opt.layers 13 | 
self.num_directions = 2 if opt.brnn else 1 14 | assert opt.rnn_size % self.num_directions == 0 15 | self.hidden_size = opt.rnn_size // self.num_directions 16 | 17 | super(Encoder, self).__init__() 18 | self.word_lut = nn.Embedding(dicts.size(), opt.word_vec_size, padding_idx=lib.Constants.PAD) 19 | self.rnn = nn.LSTM(opt.word_vec_size, self.hidden_size, 20 | num_layers=opt.layers, dropout=opt.dropout, bidirectional=opt.brnn) 21 | 22 | def forward(self, inputs, hidden=None): 23 | emb = pack(self.word_lut(inputs[0]), inputs[1]) 24 | outputs, hidden_t = self.rnn(emb, hidden) 25 | outputs = unpack(outputs)[0] 26 | return hidden_t, outputs 27 | 28 | 29 | class StackedLSTM(nn.Module): 30 | def __init__(self, num_layers, input_size, rnn_size, dropout): 31 | super(StackedLSTM, self).__init__() 32 | self.dropout = nn.Dropout(dropout) 33 | self.num_layers = num_layers 34 | self.layers = nn.ModuleList() 35 | 36 | for i in range(num_layers): 37 | self.layers.append(nn.LSTMCell(input_size, rnn_size)) 38 | input_size = rnn_size 39 | 40 | def forward(self, inputs, hidden): 41 | h_0, c_0 = hidden 42 | h_1, c_1 = [], [] 43 | for i, layer in enumerate(self.layers): 44 | h_1_i, c_1_i = layer(inputs, (h_0[i], c_0[i])) 45 | inputs = h_1_i 46 | if i != self.num_layers: 47 | inputs = self.dropout(inputs) 48 | h_1 += [h_1_i] 49 | c_1 += [c_1_i] 50 | 51 | h_1 = torch.stack(h_1) 52 | c_1 = torch.stack(c_1) 53 | 54 | return inputs, (h_1, c_1) 55 | 56 | 57 | class Decoder(nn.Module): 58 | def __init__(self, opt, dicts): 59 | self.layers = opt.layers 60 | self.input_feed = opt.input_feed 61 | input_size = opt.word_vec_size 62 | if self.input_feed: 63 | input_size += opt.rnn_size 64 | 65 | super(Decoder, self).__init__() 66 | self.word_lut = nn.Embedding(dicts.size(), opt.word_vec_size, padding_idx=lib.Constants.PAD) 67 | self.rnn = StackedLSTM(opt.layers, input_size, opt.rnn_size, opt.dropout) 68 | self.attn = lib.GlobalAttention(opt.rnn_size) 69 | self.dropout = nn.Dropout(opt.dropout) 70 | self.hidden_size = opt.rnn_size 71 | 72 | def step(self, emb, output, hidden, context): 73 | if self.input_feed: 74 | emb = torch.cat([emb, output], 1) 75 | output, hidden = self.rnn(emb, hidden) 76 | output, attn = self.attn(output, context) 77 | output = self.dropout(output) 78 | return output, hidden 79 | 80 | def forward(self, inputs, init_states): 81 | emb, output, hidden, context = init_states 82 | embs = self.word_lut(inputs) 83 | 84 | outputs = [] 85 | for i in range(inputs.size(0)): 86 | output, hidden = self.step(emb, output, hidden, context) 87 | outputs.append(output) 88 | emb = embs[i] 89 | 90 | outputs = torch.stack(outputs) 91 | return outputs 92 | 93 | 94 | class NMTModel(nn.Module): 95 | 96 | def __init__(self, encoder, decoder, generator, opt): 97 | super(NMTModel, self).__init__() 98 | self.encoder = encoder 99 | self.decoder = decoder 100 | self.generator = generator 101 | self.opt = opt 102 | 103 | def make_init_decoder_output(self, context): 104 | batch_size = context.size(1) 105 | h_size = (batch_size, self.decoder.hidden_size) 106 | return Variable(context.data.new(*h_size).zero_(), requires_grad=False) 107 | 108 | def _fix_enc_hidden(self, h): 109 | # the encoder hidden is (layers*directions) x batch x dim 110 | # we need to convert it to layers x batch x (directions*dim) 111 | if self.encoder.num_directions == 2: 112 | return h.view(h.size(0) // 2, 2, h.size(1), h.size(2)) \ 113 | .transpose(1, 2).contiguous() \ 114 | .view(h.size(0) // 2, h.size(1), h.size(2) * 2) 115 | else: 116 | return h 117 | 118 | 
def initialize(self, inputs, eval): 119 | src = inputs[0] 120 | tgt = inputs[1] 121 | enc_hidden, context = self.encoder(src) 122 | init_output = self.make_init_decoder_output(context) 123 | enc_hidden = (self._fix_enc_hidden(enc_hidden[0]), 124 | self._fix_enc_hidden(enc_hidden[1])) 125 | init_token = Variable(torch.LongTensor( 126 | [lib.Constants.BOS] * init_output.size(0)), volatile=eval) 127 | if self.opt.cuda: 128 | init_token = init_token.cuda() 129 | emb = self.decoder.word_lut(init_token) 130 | return tgt, (emb, init_output, enc_hidden, context.transpose(0, 1)) 131 | 132 | def forward(self, inputs, eval, regression=False): 133 | targets, init_states = self.initialize(inputs, eval) 134 | outputs = self.decoder(targets, init_states) 135 | 136 | if regression: 137 | logits = self.generator(outputs) 138 | return logits.view_as(targets) 139 | return outputs 140 | 141 | def backward(self, outputs, targets, weights, normalizer, criterion, regression=False): 142 | grad_output, loss = self.generator.backward(outputs, targets, weights, normalizer, criterion, regression) 143 | outputs.backward(grad_output) 144 | return loss 145 | 146 | def predict(self, outputs, targets, weights, criterion): 147 | return self.generator.predict(outputs, targets, weights, criterion) 148 | 149 | def translate(self, inputs, max_length): 150 | targets, init_states = self.initialize(inputs, eval=True) 151 | emb, output, hidden, context = init_states 152 | 153 | preds = [] 154 | batch_size = targets.size(1) 155 | num_eos = targets[0].data.byte().new(batch_size).zero_() 156 | 157 | for i in range(max_length): 158 | output, hidden = self.decoder.step(emb, output, hidden, context) 159 | logit = self.generator(output) 160 | pred = logit.max(1)[1].view(-1).data 161 | preds.append(pred) 162 | 163 | # Stop if all sentences reach EOS. 164 | num_eos |= (pred == lib.Constants.EOS) 165 | if num_eos.sum() == batch_size: break 166 | 167 | emb = self.decoder.word_lut(Variable(pred)) 168 | 169 | preds = torch.stack(preds) 170 | return preds 171 | 172 | def sample(self, inputs, max_length): 173 | targets, init_states = self.initialize(inputs, eval=False) 174 | emb, output, hidden, context = init_states 175 | 176 | outputs = [] 177 | samples = [] 178 | batch_size = targets.size(1) 179 | num_eos = targets[0].data.byte().new(batch_size).zero_() 180 | 181 | for i in range(max_length): 182 | output, hidden = self.decoder.step(emb, output, hidden, context) 183 | outputs.append(output) 184 | dist = F.softmax(self.generator(output)) 185 | sample = dist.multinomial(1, replacement=False).view(-1).data 186 | samples.append(sample) 187 | 188 | # Stop if all sentences reach EOS. 
189 | num_eos |= (sample == lib.Constants.EOS) 190 | if num_eos.sum() == batch_size: break 191 | 192 | emb = self.decoder.word_lut(Variable(sample)) 193 | 194 | outputs = torch.stack(outputs) 195 | samples = torch.stack(samples) 196 | return samples, outputs 197 | 198 | 199 | -------------------------------------------------------------------------------- /lib/model/EncoderDecoder.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/model/EncoderDecoder.pyc -------------------------------------------------------------------------------- /lib/model/Generator.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from torch.autograd import Variable 5 | 6 | 7 | class BaseGenerator(nn.Module): 8 | def __init__(self, generator, opt): 9 | super(BaseGenerator, self).__init__() 10 | self.generator = generator 11 | self.opt = opt 12 | 13 | def forward(self, inputs): 14 | return self.generator(inputs.contiguous().view(-1, inputs.size(-1))) 15 | 16 | def backward(self, outputs, targets, weights, normalizer, criterion, regression=False): 17 | outputs = Variable(outputs.data, requires_grad=True) 18 | 19 | logits = outputs.contiguous().view(-1) if regression else self.forward(outputs) 20 | 21 | loss = criterion(logits, targets.contiguous().view(-1), weights.contiguous().view(-1)) 22 | loss.div(normalizer).backward() 23 | loss = loss.data[0] 24 | 25 | if outputs.grad is None: 26 | grad_output = torch.zeros(outputs.size()) 27 | else: 28 | grad_output = outputs.grad.data 29 | 30 | return grad_output, loss 31 | 32 | def predict(self, outputs, targets, weights, criterion): 33 | logits = self.forward(outputs) 34 | preds = logits.data.max(1)[1].view(outputs.size(0), -1) 35 | 36 | loss = criterion(logits, targets.contiguous().view(-1), weights.contiguous().view(-1)).data[0] 37 | 38 | return preds, loss 39 | 40 | 41 | class MemEfficientGenerator(BaseGenerator): 42 | def __init__(self, generator, opt, dim=1): 43 | super(MemEfficientGenerator, self).__init__(generator, opt) 44 | self.batch_size = opt.max_generator_batches 45 | self.dim = dim 46 | 47 | def backward(self, outputs, targets, weights, normalizer, criterion, regression=False): 48 | outputs_split = torch.split(outputs, self.batch_size, self.dim) 49 | targets_split = torch.split(targets, self.batch_size, self.dim) 50 | weights_split = torch.split(weights, self.batch_size, self.dim) 51 | 52 | grad_output = [] 53 | loss = 0 54 | for out_t, targ_t, w_t in zip(outputs_split, targets_split, weights_split): 55 | grad_output_t, loss_t = super(MemEfficientGenerator, self).backward( 56 | out_t, targ_t, w_t, normalizer, criterion, regression) 57 | grad_output.append(grad_output_t) 58 | loss += loss_t 59 | 60 | grad_output = torch.cat(grad_output, self.dim) 61 | return grad_output, loss 62 | 63 | def predict(self, outputs, targets, weights, criterion): 64 | outputs_split = torch.split(outputs, self.batch_size, self.dim) 65 | targets_split = torch.split(targets, self.batch_size, self.dim) 66 | weights_split = torch.split(weights, self.batch_size, self.dim) 67 | 68 | preds = [] 69 | loss = 0 70 | for out_t, targ_t, w_t in zip(outputs_split, targets_split, weights_split): 71 | preds_t, loss_t = super(MemEfficientGenerator, self).predict( 72 | out_t, targ_t, w_t, criterion) 73 | preds.append(preds_t) 74 | loss += 
loss_t 75 | 76 | preds = torch.cat(preds, self.dim) 77 | return preds, loss 78 | 79 | 80 | -------------------------------------------------------------------------------- /lib/model/Generator.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/model/Generator.pyc -------------------------------------------------------------------------------- /lib/model/GlobalAttention.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import math 4 | 5 | _INF = float('inf') 6 | 7 | class GlobalAttention(nn.Module): 8 | def __init__(self, dim): 9 | super(GlobalAttention, self).__init__() 10 | self.linear_in = nn.Linear(dim, dim, bias=False) 11 | self.sm = nn.Softmax() 12 | self.linear_out = nn.Linear(dim*2, dim, bias=False) 13 | self.tanh = nn.Tanh() 14 | self.mask = None 15 | 16 | def applyMask(self, mask): 17 | self.mask = mask 18 | 19 | def forward(self, inputs, context): 20 | """ 21 | inputs: batch x dim 22 | context: batch x sourceL x dim 23 | """ 24 | targetT = self.linear_in(inputs).unsqueeze(2) # batch x dim x 1 25 | 26 | # Get attention 27 | attn = torch.bmm(context, targetT).squeeze(2) # batch x sourceL 28 | if self.mask is not None: 29 | attn.data.masked_fill_(self.mask, -_INF) 30 | attn = self.sm(attn) 31 | attn3 = attn.view(attn.size(0), 1, attn.size(1)) # batch x 1 x sourceL 32 | 33 | weightedContext = torch.bmm(attn3, context).squeeze(1) # batch x dim 34 | contextCombined = torch.cat((weightedContext, inputs), 1) 35 | 36 | contextOutput = self.tanh(self.linear_out(contextCombined)) 37 | 38 | return contextOutput, attn 39 | -------------------------------------------------------------------------------- /lib/model/GlobalAttention.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/model/GlobalAttention.pyc -------------------------------------------------------------------------------- /lib/model/__init__.py: -------------------------------------------------------------------------------- 1 | from .GlobalAttention import * 2 | from .EncoderDecoder import * 3 | from .Generator import * 4 | -------------------------------------------------------------------------------- /lib/model/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/model/__init__.pyc -------------------------------------------------------------------------------- /lib/train/Optim.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch.optim as optim 3 | 4 | 5 | class Optim(object): 6 | def _makeOptimizer(self): 7 | if self.method == 'sgd': 8 | self.optimizer = optim.SGD(self.params, lr=self.lr) 9 | elif self.method == 'adagrad': 10 | self.optimizer = optim.Adagrad(self.params, lr=self.lr) 11 | elif self.method == 'adadelta': 12 | self.optimizer = optim.Adadelta(self.params, lr=self.lr) 13 | elif self.method == 'adam': 14 | self.optimizer = optim.Adam(self.params, lr=self.lr) 15 | else: 16 | raise RuntimeError("Invalid optim method: " + self.method) 17 | 18 | def __init__(self, params, method, lr, max_grad_norm, lr_decay=1, 
start_decay_at=None): 19 | self.params = list(params) # careful: params may be a generator 20 | self.last_loss = None 21 | self.lr = lr 22 | self.max_grad_norm = max_grad_norm 23 | self.method = method 24 | self.lr_decay = lr_decay 25 | self.start_decay_at = start_decay_at 26 | 27 | self._makeOptimizer() 28 | 29 | def step(self): 30 | # Compute gradients norm. 31 | grad_norm = 0 32 | for param in self.params: 33 | grad_norm += math.pow(param.grad.data.norm(), 2) 34 | 35 | grad_norm = math.sqrt(grad_norm) 36 | shrinkage = self.max_grad_norm / grad_norm 37 | 38 | for param in self.params: 39 | if shrinkage < 1: 40 | param.grad.data.mul_(shrinkage) 41 | 42 | self.optimizer.step() 43 | return grad_norm 44 | 45 | def set_lr(self, lr): 46 | self.lr = lr 47 | self.optimizer.param_groups[0]["lr"] = lr 48 | 49 | def updateLearningRate(self, loss, epoch): 50 | if self.start_decay_at is not None and epoch >= self.start_decay_at: 51 | if self.last_loss is not None and loss > self.last_loss: 52 | self.set_lr(self.lr * self.lr_decay) 53 | self.last_loss = loss 54 | -------------------------------------------------------------------------------- /lib/train/Optim.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/train/Optim.pyc -------------------------------------------------------------------------------- /lib/train/ReinforceTrainer.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import math 3 | import os 4 | import time 5 | 6 | from torch.autograd import Variable 7 | import torch 8 | 9 | import lib 10 | 11 | class ReinforceTrainer(object): 12 | 13 | def __init__(self, actor, critic, train_data, eval_data, metrics, dicts, optim, critic_optim, opt): 14 | self.actor = actor 15 | self.critic = critic 16 | 17 | self.train_data = train_data 18 | self.eval_data = eval_data 19 | self.evaluator = lib.Evaluator(actor, metrics, dicts, opt) 20 | 21 | self.actor_loss_func = metrics["nmt_loss"] 22 | self.critic_loss_func = metrics["critic_loss"] 23 | self.sent_reward_func = metrics["sent_reward"] 24 | 25 | self.dicts = dicts 26 | 27 | self.optim = optim 28 | self.critic_optim = critic_optim 29 | 30 | self.max_length = opt.max_predict_length 31 | self.pert_func = opt.pert_func 32 | self.opt = opt 33 | 34 | print("") 35 | print(actor) 36 | print("") 37 | print(critic) 38 | 39 | def train(self, start_epoch, end_epoch, pretrain_critic, start_time=None): 40 | if start_time is None: 41 | self.start_time = time.time() 42 | else: 43 | self.start_time = start_time 44 | self.optim.last_loss = self.critic_optim.last_loss = None 45 | self.optim.set_lr(self.opt.reinforce_lr) 46 | 47 | # Use large learning rate for critic during pre-training. 
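        # (While pre-training the critic, the actor itself is not updated -- see the guard in train_epoch below.)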
48 | if pretrain_critic: 49 | self.critic_optim.set_lr(1e-3) 50 | else: 51 | self.critic_optim.set_lr(self.opt.reinforce_lr) 52 | 53 | for epoch in range(start_epoch, end_epoch + 1): 54 | print("") 55 | 56 | print("* REINFORCE epoch *") 57 | print("Actor optim lr: %g; Critic optim lr: %g" % 58 | (self.optim.lr, self.critic_optim.lr)) 59 | if pretrain_critic: 60 | print("Pretrain critic...") 61 | no_update = self.opt.no_update and (not pretrain_critic) and \ 62 | (epoch == start_epoch) 63 | 64 | if no_update: print("No update...") 65 | 66 | train_reward, critic_loss = self.train_epoch(epoch, pretrain_critic, no_update) 67 | print("Train sentence reward: %.2f" % (train_reward * 100)) 68 | print("Critic loss: %g" % critic_loss) 69 | 70 | valid_loss, valid_sent_reward, valid_corpus_reward = self.evaluator.eval(self.eval_data) 71 | valid_ppl = math.exp(min(valid_loss, 100)) 72 | print("Validation perplexity: %.2f" % valid_ppl) 73 | print("Validation sentence reward: %.2f" % (valid_sent_reward * 100)) 74 | print("Validation corpus reward: %.2f" % 75 | (valid_corpus_reward * 100)) 76 | 77 | if no_update: break 78 | 79 | self.optim.updateLearningRate(-valid_sent_reward, epoch) 80 | # Actor and critic use the same lr when jointly trained. 81 | # TODO: using small lr for critic is better? 82 | if not pretrain_critic: 83 | self.critic_optim.set_lr(self.optim.lr) 84 | 85 | checkpoint = { 86 | "model": self.actor, 87 | "critic": self.critic, 88 | "dicts": self.dicts, 89 | "opt": self.opt, 90 | "epoch": epoch, 91 | "optim": self.optim, 92 | "critic_optim": self.critic_optim 93 | } 94 | model_name = os.path.join(self.opt.save_dir, "model_%d" % epoch) 95 | if pretrain_critic: 96 | model_name += "_pretrain" 97 | else: 98 | model_name += "_reinforce" 99 | model_name += ".pt" 100 | torch.save(checkpoint, model_name) 101 | print("Save model as %s" % model_name) 102 | 103 | def train_epoch(self, epoch, pretrain_critic, no_update): 104 | self.actor.train() 105 | 106 | total_reward, report_reward = 0, 0 107 | total_critic_loss, report_critic_loss = 0, 0 108 | total_sents, report_sents = 0, 0 109 | total_words, report_words = 0, 0 110 | last_time = time.time() 111 | for i in range(len(self.train_data)): 112 | batch = self.train_data[i] 113 | sources = batch[0] 114 | targets = batch[1] 115 | batch_size = targets.size(1) 116 | 117 | self.actor.zero_grad() 118 | self.critic.zero_grad() 119 | 120 | # Sample translations 121 | attention_mask = sources[0].data.eq(lib.Constants.PAD).t() 122 | self.actor.decoder.attn.applyMask(attention_mask) 123 | samples, outputs = self.actor.sample(batch, self.max_length) 124 | 125 | # Calculate rewards 126 | rewards, samples = self.sent_reward_func(samples.t().tolist(), targets.data.t().tolist()) 127 | reward = sum(rewards) 128 | 129 | # Perturb rewards (if specified). 130 | if self.pert_func is not None: 131 | rewards = self.pert_func(rewards) 132 | 133 | samples = Variable(torch.LongTensor(samples).t().contiguous()) 134 | rewards = Variable(torch.FloatTensor([rewards] * samples.size(0)).contiguous()) 135 | if self.opt.cuda: 136 | samples = samples.cuda() 137 | rewards = rewards.cuda() 138 | 139 | # Update critic. 
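            # The critic is trained as a regressor: its per-token baseline is pushed towards the sentence reward, masked so PAD positions contribute nothing (the critic_loss metric -- likely the weighted MSE in lib/metric/Loss.py).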
140 | critic_weights = samples.ne(lib.Constants.PAD).float() 141 | num_words = critic_weights.data.sum() 142 | if not no_update: 143 | baselines = self.critic((sources, samples), eval=False, regression=True) 144 | critic_loss = self.critic.backward( 145 | baselines, rewards, critic_weights, num_words, self.critic_loss_func, regression=True) 146 | self.critic_optim.step() 147 | else: 148 | critic_loss = 0 149 | 150 | # Update actor 151 | if not pretrain_critic and not no_update: 152 | # Subtract baseline from reward 153 | norm_rewards = Variable((rewards - baselines).data) 154 | actor_weights = norm_rewards * critic_weights 155 | # TODO: can use PyTorch reinforce() here but that function is a black box. 156 | # This is an alternative way where you specify an objective that gives the same gradient 157 | # as the policy gradient's objective, which looks much like weighted log-likelihood. 158 | actor_loss = self.actor.backward(outputs, samples, actor_weights, 1, self.actor_loss_func) 159 | self.optim.step() 160 | 161 | # Gather stats 162 | total_reward += reward 163 | report_reward += reward 164 | total_sents += batch_size 165 | report_sents += batch_size 166 | total_critic_loss += critic_loss 167 | report_critic_loss += critic_loss 168 | total_words += num_words 169 | report_words += num_words 170 | if i % self.opt.log_interval == 0 and i > 0: 171 | print("""Epoch %3d, %6d/%d batches; 172 | actor reward: %.4f; critic loss: %f; %5.0f tokens/s; %s elapsed""" % 173 | (epoch, i, len(self.train_data), 174 | (report_reward / report_sents) * 100, 175 | report_critic_loss / report_words, 176 | report_words / (time.time() - last_time), 177 | str(datetime.timedelta(seconds=int(time.time() - self.start_time))))) 178 | 179 | report_reward = report_sents = report_critic_loss = report_words = 0 180 | last_time = time.time() 181 | 182 | return total_reward / total_sents, total_critic_loss / total_words 183 | 184 | -------------------------------------------------------------------------------- /lib/train/ReinforceTrainer.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/train/ReinforceTrainer.pyc -------------------------------------------------------------------------------- /lib/train/Trainer.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import math 3 | import os 4 | import time 5 | 6 | import torch 7 | 8 | import lib 9 | 10 | class Trainer(object): 11 | def __init__(self, model, train_data, eval_data, metrics, dicts, 12 | optim, opt): 13 | 14 | self.model = model 15 | self.train_data = train_data 16 | self.eval_data = eval_data 17 | self.evaluator = lib.Evaluator(model, metrics, dicts, opt) 18 | self.loss_func = metrics["nmt_loss"] 19 | self.dicts = dicts 20 | self.optim = optim 21 | self.opt = opt 22 | 23 | print(model) 24 | 25 | def train(self, start_epoch, end_epoch, start_time=None): 26 | if start_time is None: 27 | self.start_time = time.time() 28 | else: 29 | self.start_time = start_time 30 | for epoch in range(start_epoch, end_epoch + 1): 31 | print('') 32 | 33 | print("* XENT epoch *") 34 | print("Model optim lr: %g" % self.optim.lr) 35 | train_loss = self.train_epoch(epoch) 36 | print('Train perplexity: %.2f' % math.exp(min(train_loss, 100))) 37 | 38 | valid_loss, valid_sent_reward, valid_corpus_reward = self.evaluator.eval(self.eval_data) 39 | valid_ppl = math.exp(min(valid_loss, 
100)) 40 | print('Validation perplexity: %.2f' % valid_ppl) 41 | print('Validation sentence reward: %.2f' % (valid_sent_reward * 100)) 42 | print('Validation corpus reward: %.2f' % 43 | (valid_corpus_reward * 100)) 44 | 45 | self.optim.updateLearningRate(valid_loss, epoch) 46 | 47 | checkpoint = { 48 | 'model': self.model, 49 | 'dicts': self.dicts, 50 | 'opt': self.opt, 51 | 'epoch': epoch, 52 | 'optim': self.optim, 53 | } 54 | model_name = os.path.join(self.opt.save_dir, "model_%d.pt" % epoch) 55 | torch.save(checkpoint, model_name) 56 | print("Save model as %s" % model_name) 57 | 58 | 59 | def train_epoch(self, epoch): 60 | self.model.train() 61 | 62 | self.train_data.shuffle() 63 | 64 | total_loss, report_loss = 0, 0 65 | total_words, report_words = 0, 0 66 | last_time = time.time() 67 | for i in range(len(self.train_data)): 68 | batch = self.train_data[i] 69 | targets = batch[1] 70 | 71 | self.model.zero_grad() 72 | attention_mask = batch[0][0].data.eq(lib.Constants.PAD).t() 73 | self.model.decoder.attn.applyMask(attention_mask) 74 | outputs = self.model(batch, eval=False) 75 | 76 | weights = targets.ne(lib.Constants.PAD).float() 77 | num_words = weights.data.sum() 78 | loss = self.model.backward(outputs, targets, weights, num_words, self.loss_func) 79 | 80 | self.optim.step() 81 | 82 | report_loss += loss 83 | total_loss += loss 84 | total_words += num_words 85 | report_words += num_words 86 | if i % self.opt.log_interval == 0 and i > 0: 87 | print("""Epoch %3d, %6d/%d batches; 88 | perplexity: %8.2f; %5.0f tokens/s; %s elapsed""" % 89 | (epoch, i, len(self.train_data), 90 | math.exp(report_loss / report_words), 91 | report_words / (time.time() - last_time), 92 | str(datetime.timedelta(seconds=int(time.time() - self.start_time))))) 93 | 94 | report_loss = report_words = 0 95 | last_time = time.time() 96 | 97 | return total_loss / total_words 98 | 99 | -------------------------------------------------------------------------------- /lib/train/Trainer.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/train/Trainer.pyc -------------------------------------------------------------------------------- /lib/train/__init__.py: -------------------------------------------------------------------------------- 1 | from .Optim import * 2 | from .ReinforceTrainer import * 3 | from .Trainer import * 4 | -------------------------------------------------------------------------------- /lib/train/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/train/__init__.pyc -------------------------------------------------------------------------------- /preprocess.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import torch 3 | 4 | import lib 5 | 6 | parser = argparse.ArgumentParser(description="preprocess.py") 7 | 8 | parser.add_argument("-train_src", required=True, 9 | help="Path to the training source data") 10 | parser.add_argument("-train_tgt", required=True, 11 | help="Path to the training target data") 12 | 13 | parser.add_argument("-train_xe_src", required=True, 14 | help="Path to the pre-training source data") 15 | parser.add_argument("-train_xe_tgt", required=True, 16 | help="Path to the pre-training target data") 17 | 18 | 
parser.add_argument("-train_pg_src", required=True, 19 | help="Path to the bandit training source data") 20 | parser.add_argument("-train_pg_tgt", required=True, 21 | help="Path to the bandit training target data") 22 | 23 | parser.add_argument("-valid_src", required=True, 24 | help="Path to the validation source data") 25 | parser.add_argument("-valid_tgt", required=True, 26 | help="Path to the validation target data") 27 | 28 | parser.add_argument("-test_src", required=True, 29 | help="Path to the test source data") 30 | parser.add_argument("-test_tgt", required=True, 31 | help="Path to the test target data") 32 | 33 | parser.add_argument("-save_data", required=True, 34 | help="Output file for the prepared data") 35 | 36 | parser.add_argument("-src_vocab_size", type=int, default=50000, 37 | help="Size of the source vocabulary") 38 | parser.add_argument("-tgt_vocab_size", type=int, default=50000, 39 | help="Size of the target vocabulary") 40 | 41 | parser.add_argument("-seq_length", type=int, default=80, 42 | help="Maximum sequence length") 43 | parser.add_argument("-seed", type=int, default=3435, 44 | help="Random seed") 45 | 46 | parser.add_argument("-report_every", type=int, default=100000, 47 | help="Report status every this many sentences") 48 | 49 | opt = parser.parse_args() 50 | torch.manual_seed(opt.seed) 51 | 52 | 53 | def makeVocabulary(filename, size): 54 | vocab = lib.Dict([lib.Constants.PAD_WORD, lib.Constants.UNK_WORD, 55 | lib.Constants.BOS_WORD, lib.Constants.EOS_WORD]) 56 | 57 | with open(filename) as f: 58 | for sent in f.readlines(): 59 | for word in sent.split(): 60 | #vocab.add(word) 61 | vocab.add(word.lower()) # Lowercase all words 62 | 63 | originalSize = vocab.size() 64 | vocab = vocab.prune(size) 65 | print("Created dictionary of size %d (pruned from %d)" % 66 | (vocab.size(), originalSize)) 67 | 68 | return vocab 69 | 70 | 71 | def initVocabulary(name, dataFile, vocabSize, saveFile): 72 | print("Building " + name + " vocabulary...") 73 | vocab = makeVocabulary(dataFile, vocabSize) 74 | print("Saving " + name + " vocabulary to \"" + saveFile + "\"...") 75 | vocab.writeFile(saveFile) 76 | return vocab 77 | 78 | '''def reorderSentences(pos, src, tgt, perm): 79 | new_pos = [pos[idx] for idx in perm] 80 | new_src = [src[idx] for idx in perm] 81 | new_tgt = [tgt[idx] for idx in perm] 82 | return new_pos, new_src, new_tgt 83 | ''' 84 | def makeData(which, srcFile, tgtFile, srcDicts, tgtDicts): 85 | src, tgt = [], [] 86 | sizes = [] 87 | count, ignored = 0, 0 88 | 89 | print("Processing %s & %s ..." 
% (srcFile, tgtFile)) 90 | srcF = open(srcFile) 91 | tgtF = open(tgtFile) 92 | 93 | while True: 94 | srcWords = srcF.readline().split() 95 | tgtWords = tgtF.readline().split() 96 | 97 | if not srcWords or not tgtWords: 98 | if srcWords and not tgtWords or not srcWords and tgtWords: 99 | print("WARNING: source and target do not have the same number of sentences") 100 | break 101 | 102 | if len(srcWords) <= opt.seq_length and len(tgtWords) <= opt.seq_length: 103 | src += [srcDicts.convertToIdx(srcWords, 104 | lib.Constants.UNK_WORD)] 105 | tgt += [tgtDicts.convertToIdx(tgtWords, 106 | lib.Constants.UNK_WORD, 107 | eosWord=lib.Constants.EOS_WORD)] 108 | sizes += [len(srcWords)] 109 | else: 110 | if which!="test": 111 | ignored += 1 112 | else: 113 | src += [srcDicts.convertToIdx(srcWords, 114 | lib.Constants.UNK_WORD)] 115 | tgt += [tgtDicts.convertToIdx(tgtWords, 116 | lib.Constants.UNK_WORD, 117 | eosWord=lib.Constants.EOS_WORD)] 118 | sizes += [len(srcWords)] 119 | 120 | 121 | count += 1 122 | if count % opt.report_every == 0: 123 | print("... %d sentences prepared" % count) 124 | 125 | srcF.close() 126 | tgtF.close() 127 | 128 | assert len(src) == len(tgt) 129 | print("Prepared %d sentences (%d ignored due to length == 0 or > %d)" % (len(src), ignored, opt.seq_length)) 130 | 131 | return src, tgt, range(len(src)) 132 | 133 | 134 | def makeDataGeneral(which, src_path, tgt_path, dicts): 135 | print("Preparing " + which + "...") 136 | res = {} 137 | res["src"], res["tgt"], res["pos"] = makeData(which, src_path, tgt_path, 138 | dicts["src"], dicts["tgt"]) 139 | return res 140 | 141 | 142 | def main(): 143 | dicts = {} 144 | dicts["src"] = initVocabulary("source", opt.train_src, opt.src_vocab_size, 145 | opt.save_data + ".src.dict") 146 | dicts["tgt"] = initVocabulary("target", opt.train_tgt, opt.tgt_vocab_size, 147 | opt.save_data + ".tgt.dict") 148 | 149 | save_data = {} 150 | save_data["dicts"] = dicts 151 | save_data["train_xe"] = makeDataGeneral("train_xe", opt.train_xe_src, 152 | opt.train_xe_tgt, dicts) 153 | save_data["train_pg"] = makeDataGeneral("train_pg", opt.train_pg_src, 154 | opt.train_pg_tgt, dicts) 155 | save_data["valid"] = makeDataGeneral("valid", opt.valid_src, opt.valid_tgt, 156 | dicts) 157 | save_data["test"] = makeDataGeneral("test", opt.test_src, opt.test_tgt, 158 | dicts) 159 | 160 | print("Saving data to \"" + opt.save_data + "-train.pt\"...") 161 | torch.save(save_data, opt.save_data + "-train.pt") 162 | 163 | 164 | if __name__ == "__main__": 165 | main() 166 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==0.1.10 2 | backports.functools-lru-cache==1.5 3 | backports.weakref==1.0.post1 4 | bleach==1.5.0 5 | certifi==2018.1.18 6 | chardet==3.0.4 7 | cycler==0.10.0 8 | enum34==1.1.6 9 | funcsigs==1.0.2 10 | future==0.16.0 11 | futures==3.2.0 12 | html5lib==0.9999999 13 | idna==2.6 14 | Markdown==2.6.11 15 | matplotlib==2.1.2 16 | mock==2.0.0 17 | nltk==3.2.5 18 | numpy==1.14.1 19 | pbr==3.1.1 20 | Pillow==5.0.0 21 | protobuf==3.5.1 22 | pyparsing==2.2.0 23 | pyrouge==0.1.3 24 | python-dateutil==2.6.1 25 | pytz==2018.3 26 | PyYAML==3.12 27 | requests>=2.20.0 28 | six==1.11.0 29 | subprocess32==3.2.7 30 | tensorflow-gpu==1.5.0 31 | tensorflow-tensorboard==1.5.1 32 | torch==0.3.1 33 | torchtext==0.2.1 34 | torchvision==0.2.0 35 | tqdm==4.19.5 36 | urllib3>=1.23 37 | Werkzeug==0.14.1 38 | 
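Before moving on to the helper scripts: preprocess.py above bundles everything train.py consumes into a single `$save_data-train.pt` file — the two pruned vocabularies plus the indexed `train_xe`, `train_pg`, `valid` and `test` splits. A minimal sketch of inspecting that file (the name `processed_all-train.pt` is just the example used in scripts/train.sh; the snippet is illustrative and not part of the pipeline):

```python
# Illustrative only: inspect the dataset file written by preprocess.py.
# Run from the repository root so the pickled lib.Dict objects can be rebuilt;
# "processed_all-train.pt" is the -save_data name used in scripts/train.sh.
import torch
import lib  # needed so torch.load can unpickle the lib.Dict vocabularies

data = torch.load("processed_all-train.pt")

src_dict, tgt_dict = data["dicts"]["src"], data["dicts"]["tgt"]
print("source vocab: %d, target vocab: %d" % (src_dict.size(), tgt_dict.size()))

# Each split holds parallel lists of index tensors plus the original positions.
for split in ("train_xe", "train_pg", "valid", "test"):
    print("%s: %d sentence pairs" % (split, len(data[split]["src"])))
```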
-------------------------------------------------------------------------------- /scripts/extract_parallel.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | import sys,os 3 | 4 | input_dir=sys.argv[1] 5 | files=os.listdir(input_dir) 6 | 7 | 8 | fields = set(["TOP", "COL", "UID", "PID", "ACQ", "DUR", "VID", "TTL", "URL", "TTS", "SRC", "CMT", "LAN", "TTP", "HED", "OBT", "LBT", "END", "CC1", "CC2"]) 9 | 10 | f1=open("french.txt",'a') 11 | f2=open("english.txt",'a') 12 | 13 | for input_file in files: 14 | with open(os.path.join(input_dir,input_file)) as f: 15 | content=f.readlines() 16 | content=[x.strip() for x in content] 17 | f.close() 18 | c1=0 19 | c2=0 20 | for i in range(len(content)-1): 21 | line1=content[i] 22 | l1=line1.split('|') 23 | line2=content[i+1] 24 | l2=line2.split('|') 25 | if (l1[0] not in fields and (l1[2]=="CC1" or l1[2]=="CC2")) and (l2[0] not in fields and (l2[2]=="CC1" or l2[2]=="CC2")): 26 | if l1[2]=="CC1" and l2[2]=="CC2": 27 | print(l1[3],file=f1) 28 | print(l2[3],file=f2) 29 | c1+=1 30 | c2+=1 31 | i+=1 32 | if c1!=c2: 33 | print(input_file) 34 | 35 | f1.close() 36 | f2.close() 37 | 38 | 39 | -------------------------------------------------------------------------------- /scripts/lowercase.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 5 | 6 | use warnings; 7 | use strict; 8 | 9 | binmode(STDIN, ":utf8"); 10 | binmode(STDOUT, ":utf8"); 11 | 12 | while() { 13 | print lc($_); 14 | } 15 | -------------------------------------------------------------------------------- /scripts/multi-bleu.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 
5 | 6 | # $Id$ 7 | use warnings; 8 | use strict; 9 | 10 | my $lowercase = 0; 11 | if ($ARGV[0] eq "-lc") { 12 | $lowercase = 1; 13 | shift; 14 | } 15 | 16 | my $stem = $ARGV[0]; 17 | if (!defined $stem) { 18 | print STDERR "usage: multi-bleu.pl [-lc] reference < hypothesis\n"; 19 | print STDERR "Reads the references from reference or reference0, reference1, ...\n"; 20 | exit(1); 21 | } 22 | 23 | $stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0"; 24 | 25 | my @REF; 26 | my $ref=0; 27 | while(-e "$stem$ref") { 28 | &add_to_ref("$stem$ref",\@REF); 29 | $ref++; 30 | } 31 | &add_to_ref($stem,\@REF) if -e $stem; 32 | die("ERROR: could not find reference file $stem") unless scalar @REF; 33 | 34 | # add additional references explicitly specified on the command line 35 | shift; 36 | foreach my $stem (@ARGV) { 37 | &add_to_ref($stem,\@REF) if -e $stem; 38 | } 39 | 40 | 41 | 42 | sub add_to_ref { 43 | my ($file,$REF) = @_; 44 | my $s=0; 45 | if ($file =~ /.gz$/) { 46 | open(REF,"gzip -dc $file|") or die "Can't read $file"; 47 | } else { 48 | open(REF,$file) or die "Can't read $file"; 49 | } 50 | while() { 51 | chop; 52 | push @{$$REF[$s++]}, $_; 53 | } 54 | close(REF); 55 | } 56 | 57 | my(@CORRECT,@TOTAL,$length_translation,$length_reference); 58 | my $s=0; 59 | while() { 60 | chop; 61 | $_ = lc if $lowercase; 62 | my @WORD = split; 63 | my %REF_NGRAM = (); 64 | my $length_translation_this_sentence = scalar(@WORD); 65 | my ($closest_diff,$closest_length) = (9999,9999); 66 | foreach my $reference (@{$REF[$s]}) { 67 | # print "$s $_ <=> $reference\n"; 68 | $reference = lc($reference) if $lowercase; 69 | my @WORD = split(' ',$reference); 70 | my $length = scalar(@WORD); 71 | my $diff = abs($length_translation_this_sentence-$length); 72 | if ($diff < $closest_diff) { 73 | $closest_diff = $diff; 74 | $closest_length = $length; 75 | # print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." = abs($length_translation_this_sentence-$length), setting len: $closest_length\n"; 76 | } elsif ($diff == $closest_diff) { 77 | $closest_length = $length if $length < $closest_length; 78 | # from two references with the same closeness to me 79 | # take the *shorter* into account, not the "first" one. 80 | } 81 | for(my $n=1;$n<=4;$n++) { 82 | my %REF_NGRAM_N = (); 83 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 84 | my $ngram = "$n"; 85 | for(my $w=0;$w<$n;$w++) { 86 | $ngram .= " ".$WORD[$start+$w]; 87 | } 88 | $REF_NGRAM_N{$ngram}++; 89 | } 90 | foreach my $ngram (keys %REF_NGRAM_N) { 91 | if (!defined($REF_NGRAM{$ngram}) || 92 | $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) { 93 | $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram}; 94 | # print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram}
\n"; 95 | } 96 | } 97 | } 98 | } 99 | $length_translation += $length_translation_this_sentence; 100 | $length_reference += $closest_length; 101 | for(my $n=1;$n<=4;$n++) { 102 | my %T_NGRAM = (); 103 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 104 | my $ngram = "$n"; 105 | for(my $w=0;$w<$n;$w++) { 106 | $ngram .= " ".$WORD[$start+$w]; 107 | } 108 | $T_NGRAM{$ngram}++; 109 | } 110 | foreach my $ngram (keys %T_NGRAM) { 111 | $ngram =~ /^(\d+) /; 112 | my $n = $1; 113 | # my $corr = 0; 114 | # print "$i e $ngram $T_NGRAM{$ngram}
\n"; 115 | $TOTAL[$n] += $T_NGRAM{$ngram}; 116 | if (defined($REF_NGRAM{$ngram})) { 117 | if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) { 118 | $CORRECT[$n] += $T_NGRAM{$ngram}; 119 | # $corr = $T_NGRAM{$ngram}; 120 | # print "$i e correct1 $T_NGRAM{$ngram}
\n"; 121 | } 122 | else { 123 | $CORRECT[$n] += $REF_NGRAM{$ngram}; 124 | # $corr = $REF_NGRAM{$ngram}; 125 | # print "$i e correct2 $REF_NGRAM{$ngram}
\n"; 126 | } 127 | } 128 | # $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram}; 129 | # print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n" 130 | } 131 | } 132 | $s++; 133 | } 134 | my $brevity_penalty = 1; 135 | my $bleu = 0; 136 | 137 | my @bleu=(); 138 | 139 | for(my $n=1;$n<=4;$n++) { 140 | if (defined ($TOTAL[$n])){ 141 | $bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0; 142 | # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n"; 143 | }else{ 144 | $bleu[$n]=0; 145 | } 146 | } 147 | 148 | if ($length_reference==0){ 149 | printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n"; 150 | exit(1); 151 | } 152 | 153 | if ($length_translation<$length_reference) { 154 | $brevity_penalty = exp(1-$length_reference/$length_translation); 155 | } 156 | $bleu = $brevity_penalty * exp((my_log( $bleu[1] ) + 157 | my_log( $bleu[2] ) + 158 | my_log( $bleu[3] ) + 159 | my_log( $bleu[4] ) ) / 4) ; 160 | printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n", 161 | 100*$bleu, 162 | 100*$bleu[1], 163 | 100*$bleu[2], 164 | 100*$bleu[3], 165 | 100*$bleu[4], 166 | $brevity_penalty, 167 | $length_translation / $length_reference, 168 | $length_translation, 169 | $length_reference; 170 | 171 | 172 | print STDERR "It is in-advisable to publish scores from multi-bleu.perl. The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups. Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization. Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.\n"; 173 | 174 | sub my_log { 175 | return -9999999999 unless $_[0]; 176 | return log($_[0]); 177 | } 178 | -------------------------------------------------------------------------------- /scripts/output.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | import sys,os 3 | import datetime 4 | 5 | input_file=sys.argv[1] 6 | fields = set(["TOP", "COL", "UID", "PID", "ACQ", "DUR", "VID", "TTL", "URL", "TTS", "SRC", "CMT", "LAN", "TTP", "HED", "OBT", "LBT", "END", "CC1"]) 7 | 8 | with open(input_file) as f: 9 | content=f.readlines() 10 | content=[x.strip() for x in content] 11 | f.close() 12 | 13 | with open("tmp.txt.pred") as f: 14 | pred_content=f.readlines() 15 | pred_content=[x.strip() for x in pred_content] 16 | f.close() 17 | 18 | f1=open(input_file+".pred",'a') 19 | sent_index=0 20 | credit_flag=0 21 | lang="" 22 | for line in content: 23 | l=line.split('|') 24 | if l[0] in fields: 25 | print(line,file=f1) 26 | if l[0]=="LAN": 27 | lang=l[1] 28 | elif l[0] not in fields: 29 | if not credit_flag: 30 | timestamp=datetime.datetime.now().strftime("%Y-%m-%d %H:%M") 31 | source_program="Neural Machine Translation 1.0, translate.sh" 32 | source_person="Vikrant Goyal" 33 | print(lang+"_01" + '|' + timestamp + '|' + "Source_Program=" + source_program + '|' + "Source_Person=" + source_person ,file=f1) 34 | credit_flag=1 35 | l[2]=lang+"_01" 36 | l[3]=pred_content[sent_index] 37 | print(l[0]+'|'+l[1]+'|'+l[2]+'|'+l[3],file=f1) 38 | sent_index+=1 39 | f1.close() 40 | -------------------------------------------------------------------------------- /scripts/parse.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | import sys,os 3 | 4 | input_file=sys.argv[1] 5 | 6 | with open(input_file) as f: 7 | 
content=f.readlines() 8 | content=[x.strip() for x in content] 9 | f.close() 10 | 11 | fields = set(["TOP", "COL", "UID", "PID", "ACQ", "DUR", "VID", "TTL", "URL", "TTS", "SRC", "CMT", "LAN", "TTP", "HED", "OBT", "LBT", "END", "CC1"]) 12 | 13 | f1=open("tmp.txt",'a') 14 | for line in content: 15 | l=line.split('|') 16 | if l[0] not in fields: 17 | sent=l[3] 18 | print(sent,file=f1) 19 | f1.close() 20 | 21 | -------------------------------------------------------------------------------- /scripts/prepare_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | src=$1 4 | tgt=$2 5 | lang=$1-$2 6 | script=../../Neural-Machine-Translation/scripts/ 7 | 8 | python $script/strip.py train.$lang.$src train.$lang.$tgt 9 | perl $script/lowercase.perl < train.$lang.$src.cleaned > train.$lang.$src.cleaned.low 10 | perl $script/lowercase.perl < train.$lang.$tgt.cleaned > train.$lang.$tgt.cleaned.low 11 | perl $script/tokenizer.perl -l $src < train.$lang.$src.cleaned.low > train.$lang.$src.cleaned.low.tok 12 | perl $script/tokenizer.perl -l $tgt < train.$lang.$tgt.cleaned.low > train.$lang.$tgt.cleaned.low.tok 13 | cat train.$lang.$src.cleaned.low.tok train.$lang.$tgt.cleaned.low.tok | ~/Neural-Machine-Translation/subword-nmt/learn_bpe.py -s 32000 > ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 14 | ~/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < train.$lang.$src.cleaned.low.tok > train.$lang.$src.cleaned.low.tok.bpe 15 | ~/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < train.$lang.$tgt.cleaned.low.tok > train.$lang.$tgt.cleaned.low.tok.bpe 16 | mv train.$lang.$src.cleaned.low.tok.bpe train.$lang.$src.processed 17 | mv train.$lang.$tgt.cleaned.low.tok.bpe train.$lang.$tgt.processed 18 | 19 | 20 | python $script/strip.py valid.$lang.$src valid.$lang.$tgt 21 | perl $script/lowercase.perl < valid.$lang.$src.cleaned > valid.$lang.$src.cleaned.low 22 | perl $script/lowercase.perl < valid.$lang.$tgt.cleaned > valid.$lang.$tgt.cleaned.low 23 | perl $script/tokenizer.perl -l $src < valid.$lang.$src.cleaned.low > valid.$lang.$src.cleaned.low.tok 24 | perl $script/tokenizer.perl -l $tgt < valid.$lang.$tgt.cleaned.low > valid.$lang.$tgt.cleaned.low.tok 25 | ~/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < valid.$lang.$src.cleaned.low.tok > valid.$lang.$src.cleaned.low.tok.bpe 26 | ~/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < valid.$lang.$tgt.cleaned.low.tok > valid.$lang.$tgt.cleaned.low.tok.bpe 27 | mv valid.$lang.$src.cleaned.low.tok.bpe valid.$lang.$src.processed 28 | mv valid.$lang.$tgt.cleaned.low.tok.bpe valid.$lang.$tgt.processed 29 | 30 | 31 | python $script/strip.py test.$lang.$src test.$lang.$tgt 32 | perl $script/lowercase.perl < test.$lang.$src.cleaned > test.$lang.$src.cleaned.low 33 | perl $script/lowercase.perl < test.$lang.$tgt.cleaned > test.$lang.$tgt.cleaned.low 34 | perl $script/tokenizer.perl -l $src < test.$lang.$src.cleaned.low > test.$lang.$src.cleaned.low.tok 35 | perl $script/tokenizer.perl -l $tgt < test.$lang.$tgt.cleaned.low > test.$lang.$tgt.cleaned.low.tok 36 | ~/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < test.$lang.$src.cleaned.low.tok > test.$lang.$src.cleaned.low.tok.bpe 37 | 
~/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < test.$lang.$tgt.cleaned.low.tok > test.$lang.$tgt.cleaned.low.tok.bpe 38 | mv test.$lang.$src.cleaned.low.tok.bpe test.$lang.$src.processed 39 | mv test.$lang.$tgt.cleaned.low.tok.bpe test.$lang.$tgt.processed 40 | 41 | rm *tok 42 | rm *cleaned 43 | rm *low 44 | 45 | -------------------------------------------------------------------------------- /scripts/preprocess.py: -------------------------------------------------------------------------------- 1 | ##python code to clean and tokenize a file 2 | from __future__ import print_function 3 | import string 4 | import re,sys,os,codecs 5 | from unicodedata import normalize 6 | from mosestokenizer import * 7 | # load document into memory 8 | def load_doc(filename): 9 | # open the file as read only 10 | file = codecs.open(filename, 'r', encoding='utf-8') 11 | # read all text 12 | text = file.read() 13 | # close the file 14 | file.close() 15 | return text 16 | 17 | # split a loaded document into sentences 18 | def to_sentences(doc): 19 | return doc.strip().split('\n') 20 | 21 | # clean a list of lines 22 | def clean_lines(lines,lang): 23 | cleaned = list() 24 | tokenize=MosesTokenizer(lang) 25 | for line in lines: 26 | # tokenize usig moses tokenizer 27 | line=tokenize(line) 28 | # convert to lower case 29 | line = [word.lower() for word in line] 30 | # store it as a string 31 | line = ' '.join(line) 32 | cleaned.append(line) 33 | return cleaned 34 | 35 | # save a list of clean sentences to file 36 | def save_clean_sentences(sentences, filename): 37 | fout=open(filename,'a') 38 | for line in sentences: 39 | print(line, file=fout) 40 | fout.close() 41 | print('Saved: %s' % filename) 42 | 43 | if __name__=="__main__": 44 | # load data for cleaning 45 | filename = sys.argv[1] 46 | lang=sys.argv[2] 47 | doc = load_doc(filename) 48 | sentences = to_sentences(doc) 49 | print(sentences[-1]) 50 | sentences = clean_lines(sentences,lang) 51 | print(sentences[-1]) 52 | save_clean_sentences(sentences, filename+'.processed') 53 | -------------------------------------------------------------------------------- /scripts/sgm.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 
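# (Added note) This appears to be Moses' input-from-sgm.perl (see the usage message
# below): it reads a WMT-style .sgm test-set file on STDIN, joins <seg> elements that
# run across several physical lines, and prints the plain text of each segment, one
# sentence per line.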
5 | 6 | use warnings; 7 | use strict; 8 | 9 | die("ERROR syntax: input-from-sgm.perl < in.sgm > in.txt") 10 | unless scalar @ARGV == 0; 11 | 12 | while(my $line = ) { 13 | chop($line); 14 | while ($line =~ /]+>\s*$/i) { 15 | my $next_line = ; 16 | $line .= $next_line; 17 | chop($line); 18 | } 19 | while ($line =~ /]+>\s*(.*)\s*$/i && 20 | $line !~ /]+>\s*(.*)\s*<\/seg>/i) { 21 | my $next_line = ; 22 | $line .= $next_line; 23 | chop($line); 24 | } 25 | if ($line =~ /]+>\s*(.*)\s*<\/seg>/i) { 26 | my $input = $1; 27 | $input =~ s/\s+/ /g; 28 | $input =~ s/^ //g; 29 | $input =~ s/ $//g; 30 | print $input."\n"; 31 | } 32 | } 33 | -------------------------------------------------------------------------------- /scripts/strip.py: -------------------------------------------------------------------------------- 1 | ##python code to tokenize a file 2 | from __future__ import print_function 3 | from itertools import izip 4 | import string 5 | import re,sys,os,codecs 6 | file1=sys.argv[1] 7 | file2=sys.argv[2] 8 | fout1=open(file1+".cleaned",'a') 9 | fout2=open(file2+".cleaned",'a') 10 | with open(file1) as f1, open(file2) as f2: 11 | text1=f1.read().split('\n') 12 | text2=f2.read().split('\n') 13 | for x,y in izip(text1,text2): 14 | x=x.strip() 15 | y=y.strip() 16 | #filtrate = re.compile(u'[^\u4E00-\u9FA5]') 17 | #y=y.decode("utf-8") 18 | #y=filtrate.sub(r'',y) 19 | #y=y.encode("utf-8") 20 | x=x.decode("utf-8").replace(u"\uFDD3",'').encode("utf-8") 21 | y=y.decode("utf-8").replace(u"\uFDD3",'').encode("utf-8") 22 | if len(x)>=1 and len(y)>=1: 23 | print(x,file=fout1) 24 | print(y,file=fout2) 25 | f1.close() 26 | f2.close() 27 | -------------------------------------------------------------------------------- /scripts/tokenizer.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 
5 | 6 | use warnings; 7 | 8 | # Sample Tokenizer 9 | ### Version 1.1 10 | # written by Pidong Wang, based on the code written by Josh Schroeder and Philipp Koehn 11 | # Version 1.1 updates: 12 | # (1) add multithreading option "-threads NUM_THREADS" (default is 1); 13 | # (2) add a timing option "-time" to calculate the average speed of this tokenizer; 14 | # (3) add an option "-lines NUM_SENTENCES_PER_THREAD" to set the number of lines for each thread (default is 2000), and this option controls the memory amount needed: the larger this number is, the larger memory is required (the higher tokenization speed); 15 | ### Version 1.0 16 | # $Id: tokenizer.perl 915 2009-08-10 08:15:49Z philipp $ 17 | # written by Josh Schroeder, based on code by Philipp Koehn 18 | 19 | binmode(STDIN, ":utf8"); 20 | binmode(STDOUT, ":utf8"); 21 | 22 | use warnings; 23 | use FindBin qw($RealBin); 24 | use strict; 25 | use Time::HiRes; 26 | 27 | if (eval {require Thread;1;}) { 28 | #module loaded 29 | Thread->import(); 30 | } 31 | 32 | my $mydir = "$RealBin/../share/nonbreaking_prefixes"; 33 | 34 | my %NONBREAKING_PREFIX = (); 35 | my @protected_patterns = (); 36 | my $protected_patterns_file = ""; 37 | my $language = "en"; 38 | my $QUIET = 0; 39 | my $HELP = 0; 40 | my $AGGRESSIVE = 0; 41 | my $SKIP_XML = 0; 42 | my $TIMING = 0; 43 | my $NUM_THREADS = 1; 44 | my $NUM_SENTENCES_PER_THREAD = 2000; 45 | my $PENN = 0; 46 | my $NO_ESCAPING = 0; 47 | while (@ARGV) 48 | { 49 | $_ = shift; 50 | /^-b$/ && ($| = 1, next); 51 | /^-l$/ && ($language = shift, next); 52 | /^-q$/ && ($QUIET = 1, next); 53 | /^-h$/ && ($HELP = 1, next); 54 | /^-x$/ && ($SKIP_XML = 1, next); 55 | /^-a$/ && ($AGGRESSIVE = 1, next); 56 | /^-time$/ && ($TIMING = 1, next); 57 | # Option to add list of regexps to be protected 58 | /^-protected/ && ($protected_patterns_file = shift, next); 59 | /^-threads$/ && ($NUM_THREADS = int(shift), next); 60 | /^-lines$/ && ($NUM_SENTENCES_PER_THREAD = int(shift), next); 61 | /^-penn$/ && ($PENN = 1, next); 62 | /^-no-escape/ && ($NO_ESCAPING = 1, next); 63 | } 64 | 65 | # for time calculation 66 | my $start_time; 67 | if ($TIMING) 68 | { 69 | $start_time = [ Time::HiRes::gettimeofday( ) ]; 70 | } 71 | 72 | # print help message 73 | if ($HELP) 74 | { 75 | print "Usage ./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile\n"; 76 | print "Options:\n"; 77 | print " -q ... quiet.\n"; 78 | print " -a ... aggressive hyphen splitting.\n"; 79 | print " -b ... disable Perl buffering.\n"; 80 | print " -time ... enable processing time calculation.\n"; 81 | print " -penn ... use Penn treebank-like tokenization.\n"; 82 | print " -protected FILE ... specify file with patters to be protected in tokenisation.\n"; 83 | print " -no-escape ... 
don't perform HTML escaping on apostrophy, quotes, etc.\n"; 84 | exit; 85 | } 86 | 87 | if (!$QUIET) 88 | { 89 | print STDERR "Tokenizer Version 1.1\n"; 90 | print STDERR "Language: $language\n"; 91 | print STDERR "Number of threads: $NUM_THREADS\n"; 92 | } 93 | 94 | # load the language-specific non-breaking prefix info from files in the directory nonbreaking_prefixes 95 | load_prefixes($language,\%NONBREAKING_PREFIX); 96 | 97 | if (scalar(%NONBREAKING_PREFIX) eq 0) 98 | { 99 | print STDERR "Warning: No known abbreviations for language '$language'\n"; 100 | } 101 | 102 | # Load protected patterns 103 | if ($protected_patterns_file) 104 | { 105 | open(PP,$protected_patterns_file) || die "Unable to open $protected_patterns_file"; 106 | while() { 107 | chomp; 108 | push @protected_patterns, $_; 109 | } 110 | } 111 | 112 | my @batch_sentences = (); 113 | my @thread_list = (); 114 | my $count_sentences = 0; 115 | 116 | if ($NUM_THREADS > 1) 117 | {# multi-threading tokenization 118 | while() 119 | { 120 | $count_sentences = $count_sentences + 1; 121 | push(@batch_sentences, $_); 122 | if (scalar(@batch_sentences)>=($NUM_SENTENCES_PER_THREAD*$NUM_THREADS)) 123 | { 124 | # assign each thread work 125 | for (my $i=0; $i<$NUM_THREADS; $i++) 126 | { 127 | my $start_index = $i*$NUM_SENTENCES_PER_THREAD; 128 | my $end_index = $start_index+$NUM_SENTENCES_PER_THREAD-1; 129 | my @subbatch_sentences = @batch_sentences[$start_index..$end_index]; 130 | my $new_thread = new Thread \&tokenize_batch, @subbatch_sentences; 131 | push(@thread_list, $new_thread); 132 | } 133 | foreach (@thread_list) 134 | { 135 | my $tokenized_list = $_->join; 136 | foreach (@$tokenized_list) 137 | { 138 | print $_; 139 | } 140 | } 141 | # reset for the new run 142 | @thread_list = (); 143 | @batch_sentences = (); 144 | } 145 | } 146 | # the last batch 147 | if (scalar(@batch_sentences)>0) 148 | { 149 | # assign each thread work 150 | for (my $i=0; $i<$NUM_THREADS; $i++) 151 | { 152 | my $start_index = $i*$NUM_SENTENCES_PER_THREAD; 153 | if ($start_index >= scalar(@batch_sentences)) 154 | { 155 | last; 156 | } 157 | my $end_index = $start_index+$NUM_SENTENCES_PER_THREAD-1; 158 | if ($end_index >= scalar(@batch_sentences)) 159 | { 160 | $end_index = scalar(@batch_sentences)-1; 161 | } 162 | my @subbatch_sentences = @batch_sentences[$start_index..$end_index]; 163 | my $new_thread = new Thread \&tokenize_batch, @subbatch_sentences; 164 | push(@thread_list, $new_thread); 165 | } 166 | foreach (@thread_list) 167 | { 168 | my $tokenized_list = $_->join; 169 | foreach (@$tokenized_list) 170 | { 171 | print $_; 172 | } 173 | } 174 | } 175 | } 176 | else 177 | {# single thread only 178 | while() 179 | { 180 | if (($SKIP_XML && /^<.+>$/) || /^\s*$/) 181 | { 182 | #don't try to tokenize XML/HTML tag lines 183 | print $_; 184 | } 185 | else 186 | { 187 | print &tokenize($_); 188 | } 189 | } 190 | } 191 | 192 | if ($TIMING) 193 | { 194 | my $duration = Time::HiRes::tv_interval( $start_time ); 195 | print STDERR ("TOTAL EXECUTION TIME: ".$duration."\n"); 196 | print STDERR ("TOKENIZATION SPEED: ".($duration/$count_sentences*1000)." 
milliseconds/line\n"); 197 | } 198 | 199 | ##################################################################################### 200 | # subroutines afterward 201 | 202 | # tokenize a batch of texts saved in an array 203 | # input: an array containing a batch of texts 204 | # return: another array containing a batch of tokenized texts for the input array 205 | sub tokenize_batch 206 | { 207 | my(@text_list) = @_; 208 | my(@tokenized_list) = (); 209 | foreach (@text_list) 210 | { 211 | if (($SKIP_XML && /^<.+>$/) || /^\s*$/) 212 | { 213 | #don't try to tokenize XML/HTML tag lines 214 | push(@tokenized_list, $_); 215 | } 216 | else 217 | { 218 | push(@tokenized_list, &tokenize($_)); 219 | } 220 | } 221 | return \@tokenized_list; 222 | } 223 | 224 | # the actual tokenize function which tokenizes one input string 225 | # input: one string 226 | # return: the tokenized string for the input string 227 | sub tokenize 228 | { 229 | my($text) = @_; 230 | 231 | if ($PENN) { 232 | return tokenize_penn($text); 233 | } 234 | 235 | chomp($text); 236 | $text = " $text "; 237 | 238 | # remove ASCII junk 239 | $text =~ s/\s+/ /g; 240 | $text =~ s/[\000-\037]//g; 241 | 242 | # Find protected patterns 243 | my @protected = (); 244 | foreach my $protected_pattern (@protected_patterns) { 245 | my $t = $text; 246 | while ($t =~ /(?$protected_pattern)(?.*)$/) { 247 | push @protected, $+{PATTERN}; 248 | $t = $+{TAIL}; 249 | } 250 | } 251 | 252 | for (my $i = 0; $i < scalar(@protected); ++$i) { 253 | my $subst = sprintf("THISISPROTECTED%.3d", $i); 254 | $text =~ s,\Q$protected[$i], $subst ,g; 255 | } 256 | $text =~ s/ +/ /g; 257 | $text =~ s/^ //g; 258 | $text =~ s/ $//g; 259 | 260 | # separate out all "other" special characters 261 | if (($language eq "fi") or ($language eq "sv")) { 262 | # in Finnish and Swedish, the colon can be used inside words as an apostrophe-like character: 263 | # USA:n, 20:een, EU:ssa, USA:s, S:t 264 | $text =~ s/([^\p{IsAlnum}\s\.\:\'\`\,\-])/ $1 /g; 265 | # if a colon is not immediately followed by lower-case characters, separate it out anyway 266 | $text =~ s/(:)(?=$|[^\p{Ll}])/ $1 /g; 267 | } 268 | else { 269 | $text =~ s/([^\p{IsAlnum}\s\.\'\`\,\-])/ $1 /g; 270 | } 271 | 272 | # aggressive hyphen splitting 273 | if ($AGGRESSIVE) 274 | { 275 | $text =~ s/([\p{IsAlnum}])\-(?=[\p{IsAlnum}])/$1 \@-\@ /g; 276 | } 277 | 278 | #multi-dots stay together 279 | $text =~ s/\.([\.]+)/ DOTMULTI$1/g; 280 | while($text =~ /DOTMULTI\./) 281 | { 282 | $text =~ s/DOTMULTI\.([^\.])/DOTDOTMULTI $1/g; 283 | $text =~ s/DOTMULTI\./DOTDOTMULTI/g; 284 | } 285 | 286 | # seperate out "," except if within numbers (5,300) 287 | #$text =~ s/([^\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g; 288 | 289 | # separate out "," except if within numbers (5,300) 290 | # previous "global" application skips some: A,B,C,D,E > A , B,C , D,E 291 | # first application uses up B so rule can't see B,C 292 | # two-step version here may create extra spaces but these are removed later 293 | # will also space digit,letter or letter,digit forms (redundant with next section) 294 | $text =~ s/([^\p{IsN}])[,]/$1 , /g; 295 | $text =~ s/[,]([^\p{IsN}])/ , $1/g; 296 | 297 | # separate "," after a number if it's the end of a sentence 298 | $text =~ s/([\p{IsN}])[,]$/$1 ,/g; 299 | 300 | # separate , pre and post number 301 | #$text =~ s/([\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g; 302 | #$text =~ s/([^\p{IsN}])[,]([\p{IsN}])/$1 , $2/g; 303 | 304 | # turn `into ' 305 | #$text =~ s/\`/\'/g; 306 | 307 | #turn '' into " 308 | #$text =~ s/\'\'/ \" /g; 309 | 310 | 
if ($language eq "en") 311 | { 312 | #split contractions right 313 | $text =~ s/([^\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g; 314 | $text =~ s/([^\p{IsAlpha}\p{IsN}])[']([\p{IsAlpha}])/$1 ' $2/g; 315 | $text =~ s/([\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g; 316 | $text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 '$2/g; 317 | #special case for "1990's" 318 | $text =~ s/([\p{IsN}])[']([s])/$1 '$2/g; 319 | } 320 | elsif (($language eq "fr") or ($language eq "it") or ($language eq "ga")) 321 | { 322 | #split contractions left 323 | $text =~ s/([^\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g; 324 | $text =~ s/([^\p{IsAlpha}])[']([\p{IsAlpha}])/$1 ' $2/g; 325 | $text =~ s/([\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g; 326 | $text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1' $2/g; 327 | } 328 | else 329 | { 330 | $text =~ s/\'/ \' /g; 331 | } 332 | 333 | #word token method 334 | my @words = split(/\s/,$text); 335 | $text = ""; 336 | for (my $i=0;$i<(scalar(@words));$i++) 337 | { 338 | my $word = $words[$i]; 339 | if ( $word =~ /^(\S+)\.$/) 340 | { 341 | my $pre = $1; 342 | if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i/\>/g; # xml 386 | $text =~ s/\'/\'/g; # xml 387 | $text =~ s/\"/\"/g; # xml 388 | $text =~ s/\[/\[/g; # syntax non-terminal 389 | $text =~ s/\]/\]/g; # syntax non-terminal 390 | } 391 | 392 | #ensure final line break 393 | $text .= "\n" unless $text =~ /\n$/; 394 | 395 | return $text; 396 | } 397 | 398 | sub tokenize_penn 399 | { 400 | # Improved compatibility with Penn Treebank tokenization. Useful if 401 | # the text is to later be parsed with a PTB-trained parser. 402 | # 403 | # Adapted from Robert MacIntyre's sed script: 404 | # http://www.cis.upenn.edu/~treebank/tokenizer.sed 405 | 406 | my($text) = @_; 407 | chomp($text); 408 | 409 | # remove ASCII junk 410 | $text =~ s/\s+/ /g; 411 | $text =~ s/[\000-\037]//g; 412 | 413 | # attempt to get correct directional quotes 414 | $text =~ s/^``/`` /g; 415 | $text =~ s/^"/`` /g; 416 | $text =~ s/^`([^`])/` $1/g; 417 | $text =~ s/^'/` /g; 418 | $text =~ s/([ ([{<])"/$1 `` /g; 419 | $text =~ s/([ ([{<])``/$1 `` /g; 420 | $text =~ s/([ ([{<])`([^`])/$1 ` $2/g; 421 | $text =~ s/([ ([{<])'/$1 ` /g; 422 | # close quotes handled at end 423 | 424 | $text =~ s=\.\.\.= _ELLIPSIS_ =g; 425 | 426 | # separate out "," except if within numbers (5,300) 427 | $text =~ s/([^\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g; 428 | # separate , pre and post number 429 | $text =~ s/([\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g; 430 | $text =~ s/([^\p{IsN}])[,]([\p{IsN}])/$1 , $2/g; 431 | 432 | #$text =~ s=([;:@#\$%&\p{IsSc}])= $1 =g; 433 | $text =~ s=([;:@#\$%&\p{IsSc}\p{IsSo}])= $1 =g; 434 | 435 | # Separate out intra-token slashes. PTB tokenization doesn't do this, so 436 | # the tokens should be merged prior to parsing with a PTB-trained parser 437 | # (see syntax-hyphen-splitting.perl). 438 | $text =~ s/([\p{IsAlnum}])\/([\p{IsAlnum}])/$1 \@\/\@ $2/g; 439 | 440 | # Assume sentence tokenization has been done first, so split FINAL periods 441 | # only. 442 | $text =~ s=([^.])([.])([\]\)}>"']*) ?$=$1 $2$3 =g; 443 | # however, we may as well split ALL question marks and exclamation points, 444 | # since they shouldn't have the abbrev.-marker ambiguity problem 445 | $text =~ s=([?!])= $1 =g; 446 | 447 | # parentheses, brackets, etc. 
448 | $text =~ s=([\]\[\(\){}<>])= $1 =g; 449 | $text =~ s/\(/-LRB-/g; 450 | $text =~ s/\)/-RRB-/g; 451 | $text =~ s/\[/-LSB-/g; 452 | $text =~ s/\]/-RSB-/g; 453 | $text =~ s/{/-LCB-/g; 454 | $text =~ s/}/-RCB-/g; 455 | 456 | $text =~ s=--= -- =g; 457 | 458 | # First off, add a space to the beginning and end of each line, to reduce 459 | # necessary number of regexps. 460 | $text =~ s=$= =; 461 | $text =~ s=^= =; 462 | 463 | $text =~ s="= '' =g; 464 | # possessive or close-single-quote 465 | $text =~ s=([^'])' =$1 ' =g; 466 | # as in it's, I'm, we'd 467 | $text =~ s='([sSmMdD]) = '$1 =g; 468 | $text =~ s='ll = 'll =g; 469 | $text =~ s='re = 're =g; 470 | $text =~ s='ve = 've =g; 471 | $text =~ s=n't = n't =g; 472 | $text =~ s='LL = 'LL =g; 473 | $text =~ s='RE = 'RE =g; 474 | $text =~ s='VE = 'VE =g; 475 | $text =~ s=N'T = N'T =g; 476 | 477 | $text =~ s= ([Cc])annot = $1an not =g; 478 | $text =~ s= ([Dd])'ye = $1' ye =g; 479 | $text =~ s= ([Gg])imme = $1im me =g; 480 | $text =~ s= ([Gg])onna = $1on na =g; 481 | $text =~ s= ([Gg])otta = $1ot ta =g; 482 | $text =~ s= ([Ll])emme = $1em me =g; 483 | $text =~ s= ([Mm])ore'n = $1ore 'n =g; 484 | $text =~ s= '([Tt])is = '$1 is =g; 485 | $text =~ s= '([Tt])was = '$1 was =g; 486 | $text =~ s= ([Ww])anna = $1an na =g; 487 | 488 | #word token method 489 | my @words = split(/\s/,$text); 490 | $text = ""; 491 | for (my $i=0;$i<(scalar(@words));$i++) 492 | { 493 | my $word = $words[$i]; 494 | if ( $word =~ /^(\S+)\.$/) 495 | { 496 | my $pre = $1; 497 | if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i/\>/g; # xml 526 | $text =~ s/\'/\'/g; # xml 527 | $text =~ s/\"/\"/g; # xml 528 | $text =~ s/\[/\[/g; # syntax non-terminal 529 | $text =~ s/\]/\]/g; # syntax non-terminal 530 | 531 | #ensure final line break 532 | $text .= "\n" unless $text =~ /\n$/; 533 | 534 | return $text; 535 | } 536 | 537 | sub load_prefixes 538 | { 539 | my ($language, $PREFIX_REF) = @_; 540 | 541 | my $prefixfile = "$mydir/nonbreaking_prefix.$language"; 542 | 543 | #default back to English if we don't have a language-specific prefix file 544 | if (!(-e $prefixfile)) 545 | { 546 | $prefixfile = "$mydir/nonbreaking_prefix.en"; 547 | print STDERR "WARNING: No known abbreviations for language '$language', attempting fall-back to English version...\n"; 548 | die ("ERROR: No abbreviations files found in $mydir\n") unless (-e $prefixfile); 549 | } 550 | 551 | if (-e "$prefixfile") 552 | { 553 | open(PREFIX, "<:utf8", "$prefixfile"); 554 | while () 555 | { 556 | my $item = $_; 557 | chomp($item); 558 | if (($item) && (substr($item,0,1) ne "#")) 559 | { 560 | if ($item =~ /(.*)[\s]+(\#NUMERIC_ONLY\#)/) 561 | { 562 | $PREFIX_REF->{$1} = 2; 563 | } 564 | else 565 | { 566 | $PREFIX_REF->{$item} = 1; 567 | } 568 | } 569 | } 570 | close(PREFIX); 571 | } 572 | } 573 | -------------------------------------------------------------------------------- /scripts/train.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH -N 1 3 | #SBATCH -c 12 4 | #SBATCH --mem-per-cpu=6G 5 | #SBATCH -p gpu -C gpuk40 --gres=gpu:1 6 | #SBATCH --time=10-00:30:00 7 | #SBATCH --mail-type=ALL 8 | #SBATCH --output=slurm-train.out 9 | #SBATCH --job-name="nmt-train" 10 | 11 | if [[ $# != 2 ]] ; then 12 | echo 'Error, command should be: ' 13 | exit 1 14 | fi 15 | 16 | src=$1 17 | tgt=$2 18 | lang=${1}-${2} 19 | 20 | export HOME=$(pwd)/../.. 
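# (Added note) The usage message in the argument check above appears to have lost its
# placeholders; the expected call is presumably: sbatch (or bash) train.sh <src> <tgt>
# e.g. "train.sh de en", launched from Neural-Machine-Translation/scripts so that
# HOME above resolves to the nmt directory two levels up.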
21 | export DATA=$HOME/data 22 | export DATA_PREP=$DATA/$lang 23 | export MODELS=$HOME/models/$lang 24 | export SCRIPT=$HOME/Neural-Machine-Translation/scripts 25 | 26 | if [ ! -d "$HOME/models" ]; then 27 | mkdir $HOME/models 28 | fi 29 | 30 | module load singularity/2.5.1 31 | cd $HOME/singularity 32 | singularity shell -w --nv rh_xenial_20180308.img 33 | 34 | cd $SCRIPT 35 | source $HOME/myenv/bin/activate 36 | 37 | ##Creates data in a format required by train.py 38 | python ../preprocess.py \ 39 | -train_src $DATA_PREP/train.$lang.$src.processed \ 40 | -train_tgt $DATA_PREP/train.$lang.$tgt.processed \ 41 | -train_xe_src $DATA_PREP/train.$lang.$src.processed \ 42 | -train_xe_tgt $DATA_PREP/train.$lang.$tgt.processed \ 43 | -train_pg_src $DATA_PREP/train.$lang.$src.processed \ 44 | -train_pg_tgt $DATA_PREP/train.$lang.$tgt.processed \ 45 | -valid_src $DATA_PREP/valid.$lang.$src.processed \ 46 | -valid_tgt $DATA_PREP/valid.$lang.$tgt.processed \ 47 | -test_src $DATA_PREP/test.$lang.$src.processed \ 48 | -test_tgt $DATA_PREP/test.$lang.$tgt.processed \ 49 | -save_data $DATA_PREP/processed_all 50 | 51 | ##Train a model(might take days for training) 52 | python $HOME/Neural-Machine-Translation/train.py -data $DATA_PREP/processed_all-train.pt -layers 4 -word_vec_size 512 -brnn -batch_size 128 -dropout 0.3 -save_dir $MODELS -end_epoch 15 53 | -------------------------------------------------------------------------------- /scripts/translate.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH -N 1 3 | #SBATCH -c 12 4 | #SBATCH --mem-per-cpu=4G 5 | #SBATCH -p gpu -C gpuk40 --gres=gpu:1 6 | #SBATCH --time=10-00:30:00 7 | #SBATCH --mail-type=ALL 8 | #SBATCH --output=slurm-translate.out 9 | #SBATCH --job-name="nmt-translate" 10 | 11 | if [[ $# != 4 ]] ; then 12 | echo 'Error, command should be: ' 13 | exit 1 14 | fi 15 | 16 | src=$1 17 | tgt=$2 18 | input_file=$3 19 | lang=${1}-${2} 20 | toggle=$4 21 | 22 | export HOME=$(pwd)/../.. 
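# (Added note) As with train.sh, the usage message above appears to have lost its
# placeholders; the expected call is presumably:
#   sbatch (or bash) translate.sh <src> <tgt> <input_file> <toggle>
# where toggle=0 translates a Red Hen news-transcript file (parse.py / output.py strip
# and re-insert the metadata fields) and toggle=1 translates a plain
# one-sentence-per-line text file.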
23 | export DATA=$HOME/data 24 | export DATA_PREP=$DATA/$lang 25 | export MODELS=$HOME/models/$lang 26 | export SCRIPT=$HOME/Neural-Machine-Translation/scripts 27 | 28 | module load singularity/2.5.1 29 | cd $HOME/singularity 30 | singularity shell -w --nv rh_xenial_20180308.img 31 | 32 | cd $SCRIPT 33 | source $HOME/myenv/bin/activate 34 | 35 | if [ $toggle -eq 0 ] 36 | then 37 | python $SCRIPT/parse.py $input_file 38 | if [[ $src = "zh" ]] 39 | then 40 | bash $HOME/stanford-segmenter-2018-02-27/segment.sh pku tmp.txt UTF-8 0 > seg.txt 41 | mv seg.txt tmp.txt 42 | fi 43 | perl $SCRIPT/tokenizer.perl -l $src < tmp.txt > tmp.txt.tok 44 | perl $SCRIPT/lowercase.perl < tmp.txt.tok > tmp.txt.tok.low 45 | $HOME/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c $HOME/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < tmp.txt.tok.low > tmp.txt.tok.low.bpe 46 | mv tmp.txt.tok.low.bpe tmp.txt 47 | python $HOME/Neural-Machine-Translation/translate.py -data $DATA_PREP/processed_all-train.pt -load_from $MODELS/model*_best.pt -test_src $SCRIPT/tmp.txt 48 | sed -r -i 's/(@@ )|(@@ ?$)//g' tmp.txt.pred 49 | python $SCRIPT/output.py $input_file 50 | rm $SCRIPT/tmp.txt* 51 | fi 52 | 53 | #### To translate a simple file not in the news transcript format: 54 | if [ $toggle -eq 1 ] 55 | then 56 | if [[ $src = "zh" ]] 57 | then 58 | bash $HOME/stanford-segmenter-2018-02-27/segment.sh ctb $input_file UTF-8 0 > tmp.txt 59 | #cp $input_file tmp.txt 60 | else 61 | cp $input_file tmp.txt 62 | fi 63 | perl $SCRIPT/tokenizer.perl -l $src < tmp.txt > tmp.txt.tok 64 | perl $SCRIPT/lowercase.perl < tmp.txt.tok > tmp.txt.tok.low 65 | $HOME/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c $HOME/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < tmp.txt.tok.low > tmp.txt.tok.low.bpe 66 | mv tmp.txt.tok.low.bpe tmp.txt 67 | python $HOME/Neural-Machine-Translation/translate.py -data $DATA_PREP/processed_all-train.pt -load_from $MODELS/model*_best.pt -test_src tmp.txt 68 | sed -r -i 's/(@@ )|(@@ ?$)//g' tmp.txt.pred 69 | cp tmp.txt.pred "$input_file.pred" 70 | rm $SCRIPT/tmp.txt* 71 | fi 72 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/README.txt: -------------------------------------------------------------------------------- 1 | The language suffix can be found here: 2 | 3 | http://www.loc.gov/standards/iso639-2/php/code_list.php 4 | 5 | This code includes data from Daniel Naber's Language Tools (czech abbreviations). 6 | This code includes data from czech wiktionary (also czech abbreviations). 
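The prefix lists that follow are consumed by load_prefixes() in scripts/tokenizer.perl: one prefix per line, lines starting with "#" are comments, and a prefix followed by #NUMERIC_ONLY# is treated as non-breaking only when the next token starts with a digit. A rough Python equivalent of that parsing, for illustration only (the file path is just an example):

```python
# Rough Python equivalent of load_prefixes() in scripts/tokenizer.perl
# (illustrative only; the pipeline itself uses the Perl tokenizer above).
import codecs
import re

def load_prefixes(path):
    prefixes = {}
    with codecs.open(path, "r", "utf-8") as f:
        for line in f:
            item = line.strip()
            if not item or item.startswith("#"):
                continue  # skip blank lines and comments
            m = re.match(r"(.*)\s+#NUMERIC_ONLY#", item)
            if m:
                prefixes[m.group(1)] = 2  # non-breaking only before digits, e.g. "No. 3"
            else:
                prefixes[item] = 1        # always non-breaking, e.g. "Dr."
    return prefixes

# e.g. prefixes = load_prefixes("share/nonbreaking_prefixes/nonbreaking_prefix.de")
```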
7 | 8 | 9 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.ca: -------------------------------------------------------------------------------- 1 | Dr 2 | Dra 3 | pàg 4 | p 5 | c 6 | av 7 | Sr 8 | Sra 9 | adm 10 | esq 11 | Prof 12 | S.A 13 | S.L 14 | p.e 15 | ptes 16 | Sta 17 | St 18 | pl 19 | màx 20 | cast 21 | dir 22 | nre 23 | fra 24 | admdora 25 | Emm 26 | Excma 27 | espf 28 | dc 29 | admdor 30 | tel 31 | angl 32 | aprox 33 | ca 34 | dept 35 | dj 36 | dl 37 | dt 38 | ds 39 | dg 40 | dv 41 | ed 42 | entl 43 | al 44 | i.e 45 | maj 46 | smin 47 | n 48 | núm 49 | pta 50 | A 51 | B 52 | C 53 | D 54 | E 55 | F 56 | G 57 | H 58 | I 59 | J 60 | K 61 | L 62 | M 63 | N 64 | O 65 | P 66 | Q 67 | R 68 | S 69 | T 70 | U 71 | V 72 | W 73 | X 74 | Y 75 | Z 76 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.cs: -------------------------------------------------------------------------------- 1 | Bc 2 | BcA 3 | Ing 4 | Ing.arch 5 | MUDr 6 | MVDr 7 | MgA 8 | Mgr 9 | JUDr 10 | PhDr 11 | RNDr 12 | PharmDr 13 | ThLic 14 | ThDr 15 | Ph.D 16 | Th.D 17 | prof 18 | doc 19 | CSc 20 | DrSc 21 | dr. h. c 22 | PaedDr 23 | Dr 24 | PhMr 25 | DiS 26 | abt 27 | ad 28 | a.i 29 | aj 30 | angl 31 | anon 32 | apod 33 | atd 34 | atp 35 | aut 36 | bd 37 | biogr 38 | b.m 39 | b.p 40 | b.r 41 | cca 42 | cit 43 | cizojaz 44 | c.k 45 | col 46 | čes 47 | čín 48 | čj 49 | ed 50 | facs 51 | fasc 52 | fol 53 | fot 54 | franc 55 | h.c 56 | hist 57 | hl 58 | hrsg 59 | ibid 60 | il 61 | ind 62 | inv.č 63 | jap 64 | jhdt 65 | jv 66 | koed 67 | kol 68 | korej 69 | kl 70 | krit 71 | lat 72 | lit 73 | m.a 74 | maď 75 | mj 76 | mp 77 | násl 78 | např 79 | nepubl 80 | něm 81 | no 82 | nr 83 | n.s 84 | okr 85 | odd 86 | odp 87 | obr 88 | opr 89 | orig 90 | phil 91 | pl 92 | pokrač 93 | pol 94 | port 95 | pozn 96 | př.kr 97 | př.n.l 98 | přel 99 | přeprac 100 | příl 101 | pseud 102 | pt 103 | red 104 | repr 105 | resp 106 | revid 107 | rkp 108 | roč 109 | roz 110 | rozš 111 | samost 112 | sect 113 | sest 114 | seš 115 | sign 116 | sl 117 | srv 118 | stol 119 | sv 120 | šk 121 | šk.ro 122 | špan 123 | tab 124 | t.č 125 | tis 126 | tj 127 | tř 128 | tzv 129 | univ 130 | uspoř 131 | vol 132 | vl.jm 133 | vs 134 | vyd 135 | vyobr 136 | zal 137 | zejm 138 | zkr 139 | zprac 140 | zvl 141 | n.p 142 | např 143 | než 144 | MUDr 145 | abl 146 | absol 147 | adj 148 | adv 149 | ak 150 | ak. 
sl 151 | akt 152 | alch 153 | amer 154 | anat 155 | angl 156 | anglosas 157 | arab 158 | arch 159 | archit 160 | arg 161 | astr 162 | astrol 163 | att 164 | bás 165 | belg 166 | bibl 167 | biol 168 | boh 169 | bot 170 | bulh 171 | círk 172 | csl 173 | č 174 | čas 175 | čes 176 | dat 177 | děj 178 | dep 179 | dět 180 | dial 181 | dór 182 | dopr 183 | dosl 184 | ekon 185 | epic 186 | etnonym 187 | eufem 188 | f 189 | fam 190 | fem 191 | fil 192 | film 193 | form 194 | fot 195 | fr 196 | fut 197 | fyz 198 | gen 199 | geogr 200 | geol 201 | geom 202 | germ 203 | gram 204 | hebr 205 | herald 206 | hist 207 | hl 208 | hovor 209 | hud 210 | hut 211 | chcsl 212 | chem 213 | ie 214 | imp 215 | impf 216 | ind 217 | indoevr 218 | inf 219 | instr 220 | interj 221 | ión 222 | iron 223 | it 224 | kanad 225 | katalán 226 | klas 227 | kniž 228 | komp 229 | konj 230 | 231 | konkr 232 | kř 233 | kuch 234 | lat 235 | lék 236 | les 237 | lid 238 | lit 239 | liturg 240 | lok 241 | log 242 | m 243 | mat 244 | meteor 245 | metr 246 | mod 247 | ms 248 | mysl 249 | n 250 | náb 251 | námoř 252 | neklas 253 | něm 254 | nesklon 255 | nom 256 | ob 257 | obch 258 | obyč 259 | ojed 260 | opt 261 | part 262 | pas 263 | pejor 264 | pers 265 | pf 266 | pl 267 | plpf 268 | 269 | práv 270 | prep 271 | předl 272 | přivl 273 | r 274 | rcsl 275 | refl 276 | reg 277 | rkp 278 | ř 279 | řec 280 | s 281 | samohl 282 | sg 283 | sl 284 | souhl 285 | spec 286 | srov 287 | stfr 288 | střv 289 | stsl 290 | subj 291 | subst 292 | superl 293 | sv 294 | sz 295 | táz 296 | tech 297 | telev 298 | teol 299 | trans 300 | typogr 301 | var 302 | vedl 303 | verb 304 | vl. jm 305 | voj 306 | vok 307 | vůb 308 | vulg 309 | výtv 310 | vztaž 311 | zahr 312 | zájm 313 | zast 314 | zejm 315 | 316 | zeměd 317 | zkr 318 | zř 319 | mj 320 | dl 321 | atp 322 | sport 323 | Mgr 324 | horn 325 | MVDr 326 | JUDr 327 | RSDr 328 | Bc 329 | PhDr 330 | ThDr 331 | Ing 332 | aj 333 | apod 334 | PharmDr 335 | pomn 336 | ev 337 | slang 338 | nprap 339 | odp 340 | dop 341 | pol 342 | st 343 | stol 344 | p. n. l 345 | před n. l 346 | n. l 347 | př. Kr 348 | po Kr 349 | př. n. l 350 | odd 351 | RNDr 352 | tzv 353 | atd 354 | tzn 355 | resp 356 | tj 357 | p 358 | br 359 | č. j 360 | čj 361 | č. p 362 | čp 363 | a. s 364 | s. r. o 365 | spol. s r. o 366 | p. o 367 | s. p 368 | v. o. s 369 | k. s 370 | o. p. s 371 | o. s 372 | v. r 373 | v z 374 | ml 375 | vč 376 | kr 377 | mld 378 | hod 379 | popř 380 | ap 381 | event 382 | rus 383 | slov 384 | rum 385 | švýc 386 | P. T 387 | zvl 388 | hor 389 | dol 390 | S.O.S -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.de: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | #no german words end in single lower-case letters, so we throw those in too. 
7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | 61 | #Roman Numerals. A dot after one of these is not a sentence break in German. 62 | I 63 | II 64 | III 65 | IV 66 | V 67 | VI 68 | VII 69 | VIII 70 | IX 71 | X 72 | XI 73 | XII 74 | XIII 75 | XIV 76 | XV 77 | XVI 78 | XVII 79 | XVIII 80 | XIX 81 | XX 82 | i 83 | ii 84 | iii 85 | iv 86 | v 87 | vi 88 | vii 89 | viii 90 | ix 91 | x 92 | xi 93 | xii 94 | xiii 95 | xiv 96 | xv 97 | xvi 98 | xvii 99 | xviii 100 | xix 101 | xx 102 | 103 | #Titles and Honorifics 104 | Adj 105 | Adm 106 | Adv 107 | Asst 108 | Bart 109 | Bldg 110 | Brig 111 | Bros 112 | Capt 113 | Cmdr 114 | Col 115 | Comdr 116 | Con 117 | Corp 118 | Cpl 119 | DR 120 | Dr 121 | Ens 122 | Gen 123 | Gov 124 | Hon 125 | Hosp 126 | Insp 127 | Lt 128 | MM 129 | MR 130 | MRS 131 | MS 132 | Maj 133 | Messrs 134 | Mlle 135 | Mme 136 | Mr 137 | Mrs 138 | Ms 139 | Msgr 140 | Op 141 | Ord 142 | Pfc 143 | Ph 144 | Prof 145 | Pvt 146 | Rep 147 | Reps 148 | Res 149 | Rev 150 | Rt 151 | Sen 152 | Sens 153 | Sfc 154 | Sgt 155 | Sr 156 | St 157 | Supt 158 | Surg 159 | 160 | #Misc symbols 161 | Mio 162 | Mrd 163 | bzw 164 | v 165 | vs 166 | usw 167 | d.h 168 | z.B 169 | u.a 170 | etc 171 | Mrd 172 | MwSt 173 | ggf 174 | d.J 175 | D.h 176 | m.E 177 | vgl 178 | I.F 179 | z.T 180 | sogen 181 | ff 182 | u.E 183 | g.U 184 | g.g.A 185 | c.-à-d 186 | Buchst 187 | u.s.w 188 | sog 189 | u.ä 190 | Std 191 | evtl 192 | Zt 193 | Chr 194 | u.U 195 | o.ä 196 | Ltd 197 | b.A 198 | z.Zt 199 | spp 200 | sen 201 | SA 202 | k.o 203 | jun 204 | i.H.v 205 | dgl 206 | dergl 207 | Co 208 | zzt 209 | usf 210 | s.p.a 211 | Dkr 212 | Corp 213 | bzgl 214 | BSE 215 | 216 | #Number indicators 217 | # add #NUMERIC_ONLY# after the word if it should ONLY be non-breaking when a 0-9 digit follows it 218 | No 219 | Nos 220 | Art 221 | Nr 222 | pp 223 | ca 224 | Ca 225 | 226 | #Ordinals are done with . in German - "1." 
= "1st" in English 227 | 1 228 | 2 229 | 3 230 | 4 231 | 5 232 | 6 233 | 7 234 | 8 235 | 9 236 | 10 237 | 11 238 | 12 239 | 13 240 | 14 241 | 15 242 | 16 243 | 17 244 | 18 245 | 19 246 | 20 247 | 21 248 | 22 249 | 23 250 | 24 251 | 25 252 | 26 253 | 27 254 | 28 255 | 29 256 | 30 257 | 31 258 | 32 259 | 33 260 | 34 261 | 35 262 | 36 263 | 37 264 | 38 265 | 39 266 | 40 267 | 41 268 | 42 269 | 43 270 | 44 271 | 45 272 | 46 273 | 47 274 | 48 275 | 49 276 | 50 277 | 51 278 | 52 279 | 53 280 | 54 281 | 55 282 | 56 283 | 57 284 | 58 285 | 59 286 | 60 287 | 61 288 | 62 289 | 63 290 | 64 291 | 65 292 | 66 293 | 67 294 | 68 295 | 69 296 | 70 297 | 71 298 | 72 299 | 73 300 | 74 301 | 75 302 | 76 303 | 77 304 | 78 305 | 79 306 | 80 307 | 81 308 | 82 309 | 83 310 | 84 311 | 85 312 | 86 313 | 87 314 | 88 315 | 89 316 | 90 317 | 91 318 | 92 319 | 93 320 | 94 321 | 95 322 | 96 323 | 97 324 | 98 325 | 99 326 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.el: -------------------------------------------------------------------------------- 1 | # Sigle letters in upper-case are usually abbreviations of names 2 | Α 3 | Β 4 | Γ 5 | Δ 6 | Ε 7 | Ζ 8 | Η 9 | Θ 10 | Ι 11 | Κ 12 | Λ 13 | Μ 14 | Ν 15 | Ξ 16 | Ο 17 | Π 18 | Ρ 19 | Σ 20 | Τ 21 | Υ 22 | Φ 23 | Χ 24 | Ψ 25 | Ω 26 | 27 | # Includes abbreviations for the Greek language compiled from various sources (Greek grammar books, Greek language related web content). 28 | Άθαν 29 | Έγχρ 30 | Έκθ 31 | Έσδ 32 | Έφ 33 | Όμ 34 | Α΄Έσδρ 35 | Α΄Έσδ 36 | Α΄Βασ 37 | Α΄Θεσ 38 | Α΄Ιω 39 | Α΄Κορινθ 40 | Α΄Κορ 41 | Α΄Μακκ 42 | Α΄Μακ 43 | Α΄Πέτρ 44 | Α΄Πέτ 45 | Α΄Παραλ 46 | Α΄Πε 47 | Α΄Σαμ 48 | Α΄Τιμ 49 | Α΄Χρον 50 | Α΄Χρ 51 | Α.Β.Α 52 | Α.Β 53 | Α.Ε 54 | Α.Κ.Τ.Ο 55 | Αέθλ 56 | Αέτ 57 | Αίλ.Δ 58 | Αίλ.Τακτ 59 | Αίσ 60 | Αββακ 61 | Αβυδ 62 | Αβ 63 | Αγάκλ 64 | Αγάπ 65 | Αγάπ.Αμαρτ.Σ 66 | Αγάπ.Γεωπ 67 | Αγαθάγγ 68 | Αγαθήμ 69 | Αγαθιν 70 | Αγαθοκλ 71 | Αγαθρχ 72 | Αγαθ 73 | Αγαθ.Ιστ 74 | Αγαλλ 75 | Αγαπητ 76 | Αγγ 77 | Αγησ 78 | Αγλ 79 | Αγορ.Κ 80 | Αγρο.Κωδ 81 | Αγρ.Εξ 82 | Αγρ.Κ 83 | Αγ.Γρ 84 | Αδριαν 85 | Αδρ 86 | Αετ 87 | Αθάν 88 | Αθήν 89 | Αθήν.Επιγρ 90 | Αθήν.Επιτ 91 | Αθήν.Ιατρ 92 | Αθήν.Μηχ 93 | Αθανάσ 94 | Αθαν 95 | Αθηνί 96 | Αθηναγ 97 | Αθηνόδ 98 | Αθ 99 | Αθ.Αρχ 100 | Αιλ 101 | Αιλ.Επιστ 102 | Αιλ.ΖΙ 103 | Αιλ.ΠΙ 104 | Αιλ.απ 105 | Αιμιλ 106 | Αιν.Γαζ 107 | Αιν.Τακτ 108 | Αισχίν 109 | Αισχίν.Επιστ 110 | Αισχ 111 | Αισχ.Αγαμ 112 | Αισχ.Αγ 113 | Αισχ.Αλ 114 | Αισχ.Ελεγ 115 | Αισχ.Επτ.Θ 116 | Αισχ.Ευμ 117 | Αισχ.Ικέτ 118 | Αισχ.Ικ 119 | Αισχ.Περσ 120 | Αισχ.Προμ.Δεσμ 121 | Αισχ.Πρ 122 | Αισχ.Χοηφ 123 | Αισχ.Χο 124 | Αισχ.απ 125 | ΑιτΕ 126 | Αιτ 127 | Αλκ 128 | Αλχιας 129 | Αμ.Π.Ο 130 | Αμβ 131 | Αμμών 132 | Αμ. 
133 | Αν.Πειθ.Συμβ.Δικ 134 | Ανακρ 135 | Ανακ 136 | Αναμν.Τόμ 137 | Αναπλ 138 | Ανδ 139 | Ανθλγος 140 | Ανθστης 141 | Αντισθ 142 | Ανχης 143 | Αν 144 | Αποκ 145 | Απρ 146 | Απόδ 147 | Απόφ 148 | Απόφ.Νομ 149 | Απ 150 | Απ.Δαπ 151 | Απ.Διατ 152 | Απ.Επιστ 153 | Αριθ 154 | Αριστοτ 155 | Αριστοφ 156 | Αριστοφ.Όρν 157 | Αριστοφ.Αχ 158 | Αριστοφ.Βάτρ 159 | Αριστοφ.Ειρ 160 | Αριστοφ.Εκκλ 161 | Αριστοφ.Θεσμ 162 | Αριστοφ.Ιππ 163 | Αριστοφ.Λυσ 164 | Αριστοφ.Νεφ 165 | Αριστοφ.Πλ 166 | Αριστοφ.Σφ 167 | Αριστ 168 | Αριστ.Αθ.Πολ 169 | Αριστ.Αισθ 170 | Αριστ.Αν.Πρ 171 | Αριστ.Ζ.Ι 172 | Αριστ.Ηθ.Ευδ 173 | Αριστ.Ηθ.Νικ 174 | Αριστ.Κατ 175 | Αριστ.Μετ 176 | Αριστ.Πολ 177 | Αριστ.Φυσιογν 178 | Αριστ.Φυσ 179 | Αριστ.Ψυχ 180 | Αριστ.Ρητ 181 | Αρμεν 182 | Αρμ 183 | Αρχ.Εκ.Καν.Δ 184 | Αρχ.Ευβ.Μελ 185 | Αρχ.Ιδ.Δ 186 | Αρχ.Νομ 187 | Αρχ.Ν 188 | Αρχ.Π.Ε 189 | Αρ 190 | Αρ.Φορ.Μητρ 191 | Ασμ 192 | Ασμ.ασμ 193 | Αστ.Δ 194 | Αστ.Χρον 195 | Ασ 196 | Ατομ.Γνωμ 197 | Αυγ 198 | Αφρ 199 | Αχ.Νομ 200 | Α 201 | Α.Εγχ.Π 202 | Α.Κ.΄Υδρας 203 | Β΄Έσδρ 204 | Β΄Έσδ 205 | Β΄Βασ 206 | Β΄Θεσ 207 | Β΄Ιω 208 | Β΄Κορινθ 209 | Β΄Κορ 210 | Β΄Μακκ 211 | Β΄Μακ 212 | Β΄Πέτρ 213 | Β΄Πέτ 214 | Β΄Πέ 215 | Β΄Παραλ 216 | Β΄Σαμ 217 | Β΄Τιμ 218 | Β΄Χρον 219 | Β΄Χρ 220 | Β.Ι.Π.Ε 221 | Β.Κ.Τ 222 | Β.Κ.Ψ.Β 223 | Β.Μ 224 | Β.Ο.Α.Κ 225 | Β.Ο.Α 226 | Β.Ο.Δ 227 | Βίβλ 228 | Βαρ 229 | ΒεΘ 230 | Βι.Περ 231 | Βιπερ 232 | Βιργ 233 | Βλγ 234 | Βούλ 235 | Βρ 236 | Γ΄Βασ 237 | Γ΄Μακκ 238 | ΓΕΝμλ 239 | Γέν 240 | Γαλ 241 | Γεν 242 | Γλ 243 | Γν.Ν.Σ.Κρ 244 | Γνωμ 245 | Γν 246 | Γράμμ 247 | Γρηγ.Ναζ 248 | Γρηγ.Νύσ 249 | Γ Νοσ 250 | Γ' Ογκολ 251 | Γ.Ν 252 | Δ΄Βασ 253 | Δ.Β 254 | Δ.Δίκη 255 | Δ.Δίκ 256 | Δ.Ε.Σ 257 | Δ.Ε.Φ.Α 258 | Δ.Ε.Φ 259 | Δ.Εργ.Ν 260 | Δαμ 261 | Δαμ.μνημ.έργ 262 | Δαν 263 | Δασ.Κ 264 | Δεκ 265 | Δελτ.Δικ.Ε.Τ.Ε 266 | Δελτ.Νομ 267 | Δελτ.Συνδ.Α.Ε 268 | Δερμ 269 | Δευτ 270 | Δεύτ 271 | Δημοσθ 272 | Δημόκρ 273 | Δι.Δικ 274 | Διάτ 275 | Διαιτ.Απ 276 | Διαιτ 277 | Διαρκ.Στρατ 278 | Δικ 279 | Διοίκ.Πρωτ 280 | ΔιοικΔνη 281 | Διοικ.Εφ 282 | Διον.Αρ 283 | Διόρθ.Λαθ 284 | Δ.κ.Π 285 | Δνη 286 | Δν 287 | Δογμ.Όρος 288 | Δρ 289 | Δ.τ.Α 290 | Δτ 291 | ΔωδΝομ 292 | Δ.Περ 293 | Δ.Στρ 294 | ΕΔΠολ 295 | ΕΕυρΚ 296 | ΕΙΣ 297 | ΕΝαυτΔ 298 | ΕΣΑμΕΑ 299 | ΕΣΘ 300 | ΕΣυγκΔ 301 | ΕΤρΑξΧρΔ 302 | Ε.Φ.Ε.Τ 303 | Ε.Φ.Ι 304 | Ε.Φ.Ο.Επ.Α 305 | Εβδ 306 | Εβρ 307 | Εγκύκλ.Επιστ 308 | Εγκ 309 | Εε.Αιγ 310 | Εθν.Κ.Τ 311 | Εθν 312 | Ειδ.Δικ.Αγ.Κακ 313 | Εικ 314 | Ειρ.Αθ 315 | Ειρην.Αθ 316 | Ειρην 317 | Έλεγχ 318 | Ειρ 319 | Εισ.Α.Π 320 | Εισ.Ε 321 | Εισ.Ν.Α.Κ 322 | Εισ.Ν.Κ.Πολ.Δ 323 | Εισ.Πρωτ 324 | Εισηγ.Έκθ 325 | Εισ 326 | Εκκλ 327 | Εκκ 328 | Εκ 329 | Ελλ.Δνη 330 | Εν.Ε 331 | Εξ 332 | Επ.Αν 333 | Επ.Εργ.Δ 334 | Επ.Εφ 335 | Επ.Κυπ.Δ 336 | Επ.Μεσ.Αρχ 337 | Επ.Νομ 338 | Επίκτ 339 | Επίκ 340 | Επι.Δ.Ε 341 | Επιθ.Ναυτ.Δικ 342 | Επικ 343 | Επισκ.Ε.Δ 344 | Επισκ.Εμπ.Δικ 345 | Επιστ.Επετ.Αρμ 346 | Επιστ.Επετ 347 | Επιστ.Ιερ 348 | Επιτρ.Προστ.Συνδ.Στελ 349 | Επιφάν 350 | Επτ.Εφ 351 | Επ.Ιρ 352 | Επ.Ι 353 | Εργ.Ασφ.Νομ 354 | Ερμ.Α.Κ 355 | Ερμη.Σ 356 | Εσθ 357 | Εσπερ 358 | Ετρ.Δ 359 | Ευκλ 360 | Ευρ.Δ.Δ.Α 361 | Ευρ.Σ.Δ.Α 362 | Ευρ.ΣτΕ 363 | Ευρατόμ 364 | Ευρ.Άλκ 365 | Ευρ.Ανδρομ 366 | Ευρ.Βάκχ 367 | Ευρ.Εκ 368 | Ευρ.Ελ 369 | Ευρ.Ηλ 370 | Ευρ.Ηρακ 371 | Ευρ.Ηρ 372 | Ευρ.Ηρ.Μαιν 373 | Ευρ.Ικέτ 374 | Ευρ.Ιππόλ 375 | Ευρ.Ιφ.Α 376 | Ευρ.Ιφ.Τ 377 | Ευρ.Ι.Τ 378 | Ευρ.Κύκλ 379 | Ευρ.Μήδ 380 | Ευρ.Ορ 381 | Ευρ.Ρήσ 382 | Ευρ.Τρωάδ 383 | Ευρ.Φοίν 384 | Εφ.Αθ 385 | Εφ.Εν 386 | Εφ.Επ 387 | Εφ.Θρ 388 | Εφ.Θ 389 | Εφ.Ι 390 | Εφ.Κερ 391 | Εφ.Κρ 392 | Εφ.Λ 393 | Εφ.Ν 394 | Εφ.Πατ 395 | Εφ.Πειρ 396 | 
Εφαρμ.Δ.Δ 397 | Εφαρμ 398 | Εφεσ 399 | Εφημ 400 | Εφ 401 | Ζαχ 402 | Ζιγ 403 | Ζυ 404 | Ζχ 405 | ΗΕ.Δ 406 | Ημερ 407 | Ηράκλ 408 | Ηροδ 409 | Ησίοδ 410 | Ησ 411 | Η.Ε.Γ 412 | ΘΗΣ 413 | ΘΡ 414 | Θαλ 415 | Θεοδ 416 | Θεοφ 417 | Θεσ 418 | Θεόδ.Μοψ 419 | Θεόκρ 420 | Θεόφιλ 421 | Θουκ 422 | Θρ 423 | Θρ.Ε 424 | Θρ.Ιερ 425 | Θρ.Ιρ 426 | Ιακ 427 | Ιαν 428 | Ιβ 429 | Ιδθ 430 | Ιδ 431 | Ιεζ 432 | Ιερ 433 | Ιζ 434 | Ιησ 435 | Ιησ.Ν 436 | Ικ 437 | Ιλ 438 | Ιν 439 | Ιουδ 440 | Ιουστ 441 | Ιούδα 442 | Ιούλ 443 | Ιούν 444 | Ιπποκρ 445 | Ιππόλ 446 | Ιρ 447 | Ισίδ.Πηλ 448 | Ισοκρ 449 | Ισ.Ν 450 | Ιωβ 451 | Ιωλ 452 | Ιων 453 | Ιω 454 | ΚΟΣ 455 | ΚΟ.ΜΕ.ΚΟΝ 456 | ΚΠοινΔ 457 | ΚΠολΔ 458 | ΚαΒ 459 | Καλ 460 | Καλ.Τέχν 461 | ΚανΒ 462 | Καν.Διαδ 463 | Κατάργ 464 | Κλ 465 | ΚοινΔ 466 | Κολσ 467 | Κολ 468 | Κον 469 | Κορ 470 | Κος 471 | ΚριτΕπιθ 472 | ΚριτΕ 473 | Κριτ 474 | Κρ 475 | ΚτΒ 476 | ΚτΕ 477 | ΚτΠ 478 | Κυβ 479 | Κυπρ 480 | Κύριλ.Αλεξ 481 | Κύριλ.Ιερ 482 | Λεβ 483 | Λεξ.Σουίδα 484 | Λευϊτ 485 | Λευ 486 | Λκ 487 | Λογ 488 | ΛουκΑμ 489 | Λουκιαν 490 | Λουκ.Έρωτ 491 | Λουκ.Ενάλ.Διάλ 492 | Λουκ.Ερμ 493 | Λουκ.Εταιρ.Διάλ 494 | Λουκ.Ε.Δ 495 | Λουκ.Θε.Δ 496 | Λουκ.Ικ. 497 | Λουκ.Ιππ 498 | Λουκ.Λεξιφ 499 | Λουκ.Μεν 500 | Λουκ.Μισθ.Συν 501 | Λουκ.Ορχ 502 | Λουκ.Περ 503 | Λουκ.Συρ 504 | Λουκ.Τοξ 505 | Λουκ.Τυρ 506 | Λουκ.Φιλοψ 507 | Λουκ.Φιλ 508 | Λουκ.Χάρ 509 | Λουκ. 510 | Λουκ.Αλ 511 | Λοχ 512 | Λυδ 513 | Λυκ 514 | Λυσ 515 | Λωζ 516 | Λ1 517 | Λ2 518 | ΜΟΕφ 519 | Μάρκ 520 | Μέν 521 | Μαλ 522 | Ματθ 523 | Μα 524 | Μιχ 525 | Μκ 526 | Μλ 527 | Μμ 528 | Μον.Δ.Π 529 | Μον.Πρωτ 530 | Μον 531 | Μρ 532 | Μτ 533 | Μχ 534 | Μ.Βασ 535 | Μ.Πλ 536 | ΝΑ 537 | Ναυτ.Χρον 538 | Να 539 | Νδικ 540 | Νεεμ 541 | Νε 542 | Νικ 543 | ΝκΦ 544 | Νμ 545 | ΝοΒ 546 | Νομ.Δελτ.Τρ.Ελ 547 | Νομ.Δελτ 548 | Νομ.Σ.Κ 549 | Νομ.Χρ 550 | Νομ 551 | Νομ.Διεύθ 552 | Νοσ 553 | Ντ 554 | Νόσων 555 | Ν1 556 | Ν2 557 | Ν3 558 | Ν4 559 | Νtot 560 | Ξενοφ 561 | Ξεν 562 | Ξεν.Ανάβ 563 | Ξεν.Απολ 564 | Ξεν.Απομν 565 | Ξεν.Απομ 566 | Ξεν.Ελλ 567 | Ξεν.Ιέρ 568 | Ξεν.Ιππαρχ 569 | Ξεν.Ιππ 570 | Ξεν.Κυρ.Αν 571 | Ξεν.Κύρ.Παιδ 572 | Ξεν.Κ.Π 573 | Ξεν.Λακ.Πολ 574 | Ξεν.Οικ 575 | Ξεν.Προσ 576 | Ξεν.Συμπόσ 577 | Ξεν.Συμπ 578 | Ο΄ 579 | Οβδ 580 | Οβ 581 | ΟικΕ 582 | Οικ 583 | Οικ.Πατρ 584 | Οικ.Σύν.Βατ 585 | Ολομ 586 | Ολ 587 | Ολ.Α.Π 588 | Ομ.Ιλ 589 | Ομ.Οδ 590 | ΟπΤοιχ 591 | Οράτ 592 | Ορθ 593 | ΠΡΟ.ΠΟ 594 | Πίνδ 595 | Πίνδ.Ι 596 | Πίνδ.Νεμ 597 | Πίνδ.Ν 598 | Πίνδ.Ολ 599 | Πίνδ.Παθ 600 | Πίνδ.Πυθ 601 | Πίνδ.Π 602 | ΠαγΝμλγ 603 | Παν 604 | Παρμ 605 | Παροιμ 606 | Παρ 607 | Παυσ 608 | Πειθ.Συμβ 609 | ΠειρΝ 610 | Πελ 611 | ΠεντΣτρ 612 | Πεντ 613 | Πεντ.Εφ 614 | ΠερΔικ 615 | Περ.Γεν.Νοσ 616 | Πετ 617 | Πλάτ 618 | Πλάτ.Αλκ 619 | Πλάτ.Αντ 620 | Πλάτ.Αξίοχ 621 | Πλάτ.Απόλ 622 | Πλάτ.Γοργ 623 | Πλάτ.Ευθ 624 | Πλάτ.Θεαίτ 625 | Πλάτ.Κρατ 626 | Πλάτ.Κριτ 627 | Πλάτ.Λύσ 628 | Πλάτ.Μεν 629 | Πλάτ.Νόμ 630 | Πλάτ.Πολιτ 631 | Πλάτ.Πολ 632 | Πλάτ.Πρωτ 633 | Πλάτ.Σοφ. 
634 | Πλάτ.Συμπ 635 | Πλάτ.Τίμ 636 | Πλάτ.Φαίδρ 637 | Πλάτ.Φιλ 638 | Πλημ 639 | Πλούτ 640 | Πλούτ.Άρατ 641 | Πλούτ.Αιμ 642 | Πλούτ.Αλέξ 643 | Πλούτ.Αλκ 644 | Πλούτ.Αντ 645 | Πλούτ.Αρτ 646 | Πλούτ.Ηθ 647 | Πλούτ.Θεμ 648 | Πλούτ.Κάμ 649 | Πλούτ.Καίσ 650 | Πλούτ.Κικ 651 | Πλούτ.Κράσ 652 | Πλούτ.Κ 653 | Πλούτ.Λυκ 654 | Πλούτ.Μάρκ 655 | Πλούτ.Μάρ 656 | Πλούτ.Περ 657 | Πλούτ.Ρωμ 658 | Πλούτ.Σύλλ 659 | Πλούτ.Φλαμ 660 | Πλ 661 | Ποιν.Δικ 662 | Ποιν.Δ 663 | Ποιν.Ν 664 | Ποιν.Χρον 665 | Ποιν.Χρ 666 | Πολ.Δ 667 | Πολ.Πρωτ 668 | Πολ 669 | Πολ.Μηχ 670 | Πολ.Μ 671 | Πρακτ.Αναθ 672 | Πρακτ.Ολ 673 | Πραξ 674 | Πρμ 675 | Πρξ 676 | Πρωτ 677 | Πρ 678 | Πρ.Αν 679 | Πρ.Λογ 680 | Πταισμ 681 | Πυρ.Καλ 682 | Πόλη 683 | Π.Δ 684 | Π.Δ.Άσμ 685 | ΡΜ.Ε 686 | Ρθ 687 | Ρμ 688 | Ρωμ 689 | ΣΠλημ 690 | Σαπφ 691 | Σειρ 692 | Σολ 693 | Σοφ 694 | Σοφ.Αντιγ 695 | Σοφ.Αντ 696 | Σοφ.Αποσ 697 | Σοφ.Απ 698 | Σοφ.Ηλέκ 699 | Σοφ.Ηλ 700 | Σοφ.Οιδ.Κολ 701 | Σοφ.Οιδ.Τύρ 702 | Σοφ.Ο.Τ 703 | Σοφ.Σειρ 704 | Σοφ.Σολ 705 | Σοφ.Τραχ 706 | Σοφ.Φιλοκτ 707 | Σρ 708 | Σ.τ.Ε 709 | Σ.τ.Π 710 | Στρ.Π.Κ 711 | Στ.Ευρ 712 | Συζήτ 713 | Συλλ.Νομολ 714 | Συλ.Νομ 715 | ΣυμβΕπιθ 716 | Συμπ.Ν 717 | Συνθ.Αμ 718 | Συνθ.Ε.Ε 719 | Συνθ.Ε.Κ 720 | Συνθ.Ν 721 | Σφν 722 | Σφ 723 | Σφ.Σλ 724 | Σχ.Πολ.Δ 725 | Σχ.Συντ.Ε 726 | Σωσ 727 | Σύντ 728 | Σ.Πληρ 729 | ΤΘ 730 | ΤΣ.Δ 731 | Τίτ 732 | Τβ 733 | Τελ.Ενημ 734 | Τελ.Κ 735 | Τερτυλ 736 | Τιμ 737 | Τοπ.Α 738 | Τρ.Ο 739 | Τριμ 740 | Τριμ.Πλ 741 | Τρ.Πλημ 742 | Τρ.Π.Δ 743 | Τ.τ.Ε 744 | Ττ 745 | Τωβ 746 | Υγ 747 | Υπερ 748 | Υπ 749 | Υ.Γ 750 | Φιλήμ 751 | Φιλιπ 752 | Φιλ 753 | Φλμ 754 | Φλ 755 | Φορ.Β 756 | Φορ.Δ.Ε 757 | Φορ.Δνη 758 | Φορ.Δ 759 | Φορ.Επ 760 | Φώτ 761 | Χρ.Ι.Δ 762 | Χρ.Ιδ.Δ 763 | Χρ.Ο 764 | Χρυσ 765 | Ψήφ 766 | Ψαλμ 767 | Ψαλ 768 | Ψλ 769 | Ωριγ 770 | Ωσ 771 | Ω.Ρ.Λ 772 | άγν 773 | άγν.ετυμολ 774 | άγ 775 | άκλ 776 | άνθρ 777 | άπ 778 | άρθρ 779 | άρν 780 | άρ 781 | άτ 782 | άψ 783 | ά 784 | έκδ 785 | έκφρ 786 | έμψ 787 | ένθ.αν 788 | έτ 789 | έ.α 790 | ίδ 791 | αβεστ 792 | αβησσ 793 | αγγλ 794 | αγγ 795 | αδημ 796 | αεροναυτ 797 | αερον 798 | αεροπ 799 | αθλητ 800 | αθλ 801 | αθροιστ 802 | αιγυπτ 803 | αιγ 804 | αιτιολ 805 | αιτ 806 | αι 807 | ακαδ 808 | ακκαδ 809 | αλβ 810 | αλλ 811 | αλφαβητ 812 | αμα 813 | αμερικ 814 | αμερ 815 | αμετάβ 816 | αμτβ 817 | αμφιβ 818 | αμφισβ 819 | αμφ 820 | αμ 821 | ανάλ 822 | ανάπτ 823 | ανάτ 824 | αναβ 825 | αναδαν 826 | αναδιπλασ 827 | αναδιπλ 828 | αναδρ 829 | αναλ 830 | αναν 831 | ανασυλλ 832 | ανατολ 833 | ανατομ 834 | ανατυπ 835 | ανατ 836 | αναφορ 837 | αναφ 838 | ανα.ε 839 | ανδρων 840 | ανθρωπολ 841 | ανθρωπ 842 | ανθ 843 | ανομ 844 | αντίτ 845 | αντδ 846 | αντιγρ 847 | αντιθ 848 | αντικ 849 | αντιμετάθ 850 | αντων 851 | αντ 852 | ανωτ 853 | ανόργ 854 | ανών 855 | αορ 856 | απαρέμφ 857 | απαρφ 858 | απαρχ 859 | απαρ 860 | απλολ 861 | απλοπ 862 | αποβ 863 | αποηχηροπ 864 | αποθ 865 | αποκρυφ 866 | αποφ 867 | απρμφ 868 | απρφ 869 | απρόσ 870 | απόδ 871 | απόλ 872 | απόσπ 873 | απόφ 874 | αραβοτουρκ 875 | αραβ 876 | αραμ 877 | αρβαν 878 | αργκ 879 | αριθμτ 880 | αριθμ 881 | αριθ 882 | αρκτικόλ 883 | αρκ 884 | αρμεν 885 | αρμ 886 | αρνητ 887 | αρσ 888 | αρχαιολ 889 | αρχιτεκτ 890 | αρχιτ 891 | αρχκ 892 | αρχ 893 | αρωμουν 894 | αρωμ 895 | αρ 896 | αρ.μετρ 897 | αρ.φ 898 | ασσυρ 899 | αστρολ 900 | αστροναυτ 901 | αστρον 902 | αττ 903 | αυστραλ 904 | αυτοπ 905 | αυτ 906 | αφγαν 907 | αφηρ 908 | αφομ 909 | αφρικ 910 | αχώρ 911 | αόρ 912 | α.α 913 | α/α 914 | α0 915 | βαθμ 916 | βαθ 917 | βαπτ 918 | βασκ 919 | βεβαιωτ 920 | βεβ 921 | βεδ 922 | βενετ 923 | βεν 924 | 
βερβερ 925 | βιβλγρ 926 | βιολ 927 | βιομ 928 | βιοχημ 929 | βιοχ 930 | βλάχ 931 | βλ 932 | βλ.λ 933 | βοταν 934 | βοτ 935 | βουλγαρ 936 | βουλγ 937 | βούλ 938 | βραζιλ 939 | βρετον 940 | βόρ 941 | γαλλ 942 | γενικότ 943 | γενοβ 944 | γεν 945 | γερμαν 946 | γερμ 947 | γεωγρ 948 | γεωλ 949 | γεωμετρ 950 | γεωμ 951 | γεωπ 952 | γεωργ 953 | γλυπτ 954 | γλωσσολ 955 | γλωσσ 956 | γλ 957 | γνμδ 958 | γνμ 959 | γνωμ 960 | γοτθ 961 | γραμμ 962 | γραμ 963 | γρμ 964 | γρ 965 | γυμν 966 | δίδες 967 | δίκ 968 | δίφθ 969 | δαν 970 | δεικτ 971 | δεκατ 972 | δηλ 973 | δημογρ 974 | δημοτ 975 | δημώδ 976 | δημ 977 | διάγρ 978 | διάκρ 979 | διάλεξ 980 | διάλ 981 | διάσπ 982 | διαλεκτ 983 | διατρ 984 | διαφ 985 | διαχ 986 | διδα 987 | διεθν 988 | διεθ 989 | δικον 990 | διστ 991 | δισύλλ 992 | δισ 993 | διφθογγοπ 994 | δογμ 995 | δολ 996 | δοτ 997 | δρμ 998 | δρχ 999 | δρ(α) 1000 | δωρ 1001 | δ 1002 | εβρ 1003 | εγκλπ 1004 | εδ 1005 | εθνολ 1006 | εθν 1007 | ειδικότ 1008 | ειδ 1009 | ειδ.β 1010 | εικ 1011 | ειρ 1012 | εισ 1013 | εκατοστμ 1014 | εκατοστ 1015 | εκατστ.2 1016 | εκατστ.3 1017 | εκατ 1018 | εκδ 1019 | εκκλησ 1020 | εκκλ 1021 | εκ 1022 | ελλην 1023 | ελλ 1024 | ελνστ 1025 | ελπ 1026 | εμβ 1027 | εμφ 1028 | εναλλ 1029 | ενδ 1030 | ενεργ 1031 | ενεστ 1032 | ενικ 1033 | ενν 1034 | εν 1035 | εξέλ 1036 | εξακολ 1037 | εξομάλ 1038 | εξ 1039 | εο 1040 | επέκτ 1041 | επίδρ 1042 | επίθ 1043 | επίρρ 1044 | επίσ 1045 | επαγγελμ 1046 | επανάλ 1047 | επανέκδ 1048 | επιθ 1049 | επικ 1050 | επιμ 1051 | επιρρ 1052 | επιστ 1053 | επιτατ 1054 | επιφ 1055 | επών 1056 | επ 1057 | εργ 1058 | ερμ 1059 | ερρινοπ 1060 | ερωτ 1061 | ετρουσκ 1062 | ετυμ 1063 | ετ 1064 | ευφ 1065 | ευχετ 1066 | εφ 1067 | εύχρ 1068 | ε.α 1069 | ε/υ 1070 | ε0 1071 | ζωγρ 1072 | ζωολ 1073 | ηθικ 1074 | ηθ 1075 | ηλεκτρολ 1076 | ηλεκτρον 1077 | ηλεκτρ 1078 | ημίτ 1079 | ημίφ 1080 | ημιφ 1081 | ηχηροπ 1082 | ηχηρ 1083 | ηχομιμ 1084 | ηχ 1085 | η 1086 | θέατρ 1087 | θεολ 1088 | θετ 1089 | θηλ 1090 | θρακ 1091 | θρησκειολ 1092 | θρησκ 1093 | θ 1094 | ιαπων 1095 | ιατρ 1096 | ιδιωμ 1097 | ιδ 1098 | ινδ 1099 | ιραν 1100 | ισπαν 1101 | ιστορ 1102 | ιστ 1103 | ισχυροπ 1104 | ιταλ 1105 | ιχθυολ 1106 | ιων 1107 | κάτ 1108 | καθ 1109 | κακοσ 1110 | καν 1111 | καρ 1112 | κατάλ 1113 | κατατ 1114 | κατωτ 1115 | κατ 1116 | κα 1117 | κελτ 1118 | κεφ 1119 | κινεζ 1120 | κινημ 1121 | κλητ 1122 | κλιτ 1123 | κλπ 1124 | κλ 1125 | κν 1126 | κοινωνιολ 1127 | κοινων 1128 | κοπτ 1129 | κουτσοβλαχ 1130 | κουτσοβλ 1131 | κπ 1132 | κρ.γν 1133 | κτγ 1134 | κτην 1135 | κτητ 1136 | κτλ 1137 | κτ 1138 | κυριολ 1139 | κυρ 1140 | κύρ 1141 | κ 1142 | κ.ά 1143 | κ.ά.π 1144 | κ.α 1145 | κ.εξ 1146 | κ.επ 1147 | κ.ε 1148 | κ.λπ 1149 | κ.λ.π 1150 | κ.ού.κ 1151 | κ.ο.κ 1152 | κ.τ.λ 1153 | κ.τ.τ 1154 | κ.τ.ό 1155 | λέξ 1156 | λαογρ 1157 | λαπ 1158 | λατιν 1159 | λατ 1160 | λαϊκότρ 1161 | λαϊκ 1162 | λετ 1163 | λιθ 1164 | λογιστ 1165 | λογοτ 1166 | λογ 1167 | λουβ 1168 | λυδ 1169 | λόγ 1170 | λ 1171 | λ.χ 1172 | μέλλ 1173 | μέσ 1174 | μαθημ 1175 | μαθ 1176 | μαιευτ 1177 | μαλαισ 1178 | μαλτ 1179 | μαμμων 1180 | μεγεθ 1181 | μεε 1182 | μειωτ 1183 | μελ 1184 | μεξ 1185 | μεσν 1186 | μεσογ 1187 | μεσοπαθ 1188 | μεσοφ 1189 | μετάθ 1190 | μεταβτ 1191 | μεταβ 1192 | μετακ 1193 | μεταπλ 1194 | μεταπτωτ 1195 | μεταρ 1196 | μεταφορ 1197 | μετβ 1198 | μετεπιθ 1199 | μετεπιρρ 1200 | μετεωρολ 1201 | μετεωρ 1202 | μετον 1203 | μετουσ 1204 | μετοχ 1205 | μετρ 1206 | μετ 1207 | μητρων 1208 | μηχανολ 1209 | μηχ 1210 | μικροβιολ 1211 | μογγολ 1212 | μορφολ 1213 | μουσ 1214 | μπενελούξ 1215 | μσνλατ 
1216 | μσν 1217 | μτβ 1218 | μτγν 1219 | μτγ 1220 | μτφρδ 1221 | μτφρ 1222 | μτφ 1223 | μτχ 1224 | μυθ 1225 | μυκην 1226 | μυκ 1227 | μφ 1228 | μ 1229 | μ.ε 1230 | μ.μ 1231 | μ.π.ε 1232 | μ.π.π 1233 | μ0 1234 | ναυτ 1235 | νεοελλ 1236 | νεολατιν 1237 | νεολατ 1238 | νεολ 1239 | νεότ 1240 | νλατ 1241 | νομ 1242 | νορβ 1243 | νοσ 1244 | νότ 1245 | ν 1246 | ξ.λ 1247 | οικοδ 1248 | οικολ 1249 | οικον 1250 | οικ 1251 | ολλανδ 1252 | ολλ 1253 | ομηρ 1254 | ομόρρ 1255 | ονομ 1256 | ον 1257 | οπτ 1258 | ορθογρ 1259 | ορθ 1260 | οριστ 1261 | ορυκτολ 1262 | ορυκτ 1263 | ορ 1264 | οσετ 1265 | οσκ 1266 | ουαλ 1267 | ουγγρ 1268 | ουδ 1269 | ουσιαστικοπ 1270 | ουσιαστ 1271 | ουσ 1272 | πίν 1273 | παθητ 1274 | παθολ 1275 | παθ 1276 | παιδ 1277 | παλαιοντ 1278 | παλαιότ 1279 | παλ 1280 | παππων 1281 | παράγρ 1282 | παράγ 1283 | παράλλ 1284 | παράλ 1285 | παραγ 1286 | παρακ 1287 | παραλ 1288 | παραπ 1289 | παρατ 1290 | παρβ 1291 | παρετυμ 1292 | παροξ 1293 | παρων 1294 | παρωχ 1295 | παρ 1296 | παρ.φρ 1297 | πατριδων 1298 | πατρων 1299 | πβ 1300 | περιθ 1301 | περιλ 1302 | περιφρ 1303 | περσ 1304 | περ 1305 | πιθ 1306 | πληθ 1307 | πληροφ 1308 | ποδ 1309 | ποιητ 1310 | πολιτ 1311 | πολλαπλ 1312 | πολ 1313 | πορτογαλ 1314 | πορτ 1315 | ποσ 1316 | πρακριτ 1317 | πρβλ 1318 | πρβ 1319 | πργ 1320 | πρκμ 1321 | πρκ 1322 | πρλ 1323 | προέλ 1324 | προβηγκ 1325 | προελλ 1326 | προηγ 1327 | προθεμ 1328 | προπαραλ 1329 | προπαροξ 1330 | προπερισπ 1331 | προσαρμ 1332 | προσηγορ 1333 | προσταχτ 1334 | προστ 1335 | προσφών 1336 | προσ 1337 | προτακτ 1338 | προτ.Εισ 1339 | προφ 1340 | προχωρ 1341 | πρτ 1342 | πρόθ 1343 | πρόσθ 1344 | πρόσ 1345 | πρότ 1346 | πρ 1347 | πρ.Εφ 1348 | πτ 1349 | πυ 1350 | π 1351 | π.Χ 1352 | π.μ 1353 | π.χ 1354 | ρήμ 1355 | ρίζ 1356 | ρηματ 1357 | ρητορ 1358 | ριν 1359 | ρουμ 1360 | ρωμ 1361 | ρωσ 1362 | ρ 1363 | σανσκρ 1364 | σαξ 1365 | σελ 1366 | σερβοκρ 1367 | σερβ 1368 | σημασιολ 1369 | σημδ 1370 | σημειολ 1371 | σημερ 1372 | σημιτ 1373 | σημ 1374 | σκανδ 1375 | σκυθ 1376 | σκωπτ 1377 | σλαβ 1378 | σλοβ 1379 | σουηδ 1380 | σουμερ 1381 | σουπ 1382 | σπάν 1383 | σπανιότ 1384 | σπ 1385 | σσ 1386 | στατ 1387 | στερ 1388 | στιγμ 1389 | στιχ 1390 | στρέμ 1391 | στρατιωτ 1392 | στρατ 1393 | στ 1394 | συγγ 1395 | συγκρ 1396 | συγκ 1397 | συμπερ 1398 | συμπλεκτ 1399 | συμπλ 1400 | συμπροφ 1401 | συμφυρ 1402 | συμφ 1403 | συνήθ 1404 | συνίζ 1405 | συναίρ 1406 | συναισθ 1407 | συνδετ 1408 | συνδ 1409 | συνεκδ 1410 | συνηρ 1411 | συνθετ 1412 | συνθ 1413 | συνοπτ 1414 | συντελ 1415 | συντομογρ 1416 | συντ 1417 | συν 1418 | συρ 1419 | σχημ 1420 | σχ 1421 | σύγκρ 1422 | σύμπλ 1423 | σύμφ 1424 | σύνδ 1425 | σύνθ 1426 | σύντμ 1427 | σύντ 1428 | σ 1429 | σ.π 1430 | σ/β 1431 | τακτ 1432 | τελ 1433 | τετρ 1434 | τετρ.μ 1435 | τεχνλ 1436 | τεχνολ 1437 | τεχν 1438 | τεύχ 1439 | τηλεπικ 1440 | τηλεόρ 1441 | τιμ 1442 | τιμ.τομ 1443 | τοΣ 1444 | τον 1445 | τοπογρ 1446 | τοπων 1447 | τοπ 1448 | τοσκ 1449 | τουρκ 1450 | τοχ 1451 | τριτοπρόσ 1452 | τροποπ 1453 | τροπ 1454 | τσεχ 1455 | τσιγγ 1456 | ττ 1457 | τυπ 1458 | τόμ 1459 | τόνν 1460 | τ 1461 | τ.μ 1462 | τ.χλμ 1463 | υβρ 1464 | υπερθ 1465 | υπερσ 1466 | υπερ 1467 | υπεύθ 1468 | υποθ 1469 | υποκορ 1470 | υποκ 1471 | υποσημ 1472 | υποτ 1473 | υποφ 1474 | υποχωρ 1475 | υπόλ 1476 | υπόχρ 1477 | υπ 1478 | υστλατ 1479 | υψόμ 1480 | υψ 1481 | φάκ 1482 | φαρμακολ 1483 | φαρμ 1484 | φιλολ 1485 | φιλοσ 1486 | φιλοτ 1487 | φινλ 1488 | φοινικ 1489 | φράγκ 1490 | φρανκον 1491 | φριζ 1492 | φρ 1493 | φυλλ 1494 | φυσιολ 1495 | φυσ 1496 | φωνηεντ 1497 | φωνητ 1498 | φωνολ 
1499 | φων 1500 | φωτογρ 1501 | φ 1502 | φ.τ.μ 1503 | χαμιτ 1504 | χαρτόσ 1505 | χαρτ 1506 | χασμ 1507 | χαϊδ 1508 | χγφ 1509 | χειλ 1510 | χεττ 1511 | χημ 1512 | χιλ 1513 | χλγρ 1514 | χλγ 1515 | χλμ 1516 | χλμ.2 1517 | χλμ.3 1518 | χλσγρ 1519 | χλστγρ 1520 | χλστμ 1521 | χλστμ.2 1522 | χλστμ.3 1523 | χλ 1524 | χργρ 1525 | χρημ 1526 | χρον 1527 | χρ 1528 | χφ 1529 | χ.ε 1530 | χ.κ 1531 | χ.ο 1532 | χ.σ 1533 | χ.τ 1534 | χ.χ 1535 | ψευδ 1536 | ψυχαν 1537 | ψυχιατρ 1538 | ψυχολ 1539 | ψυχ 1540 | ωκεαν 1541 | όμ 1542 | όν 1543 | όπ.παρ 1544 | όπ.π 1545 | ό.π 1546 | ύψ 1547 | 1Βσ 1548 | 1Εσ 1549 | 1Θσ 1550 | 1Ιν 1551 | 1Κρ 1552 | 1Μκ 1553 | 1Πρ 1554 | 1Πτ 1555 | 1Τμ 1556 | 2Βσ 1557 | 2Εσ 1558 | 2Θσ 1559 | 2Ιν 1560 | 2Κρ 1561 | 2Μκ 1562 | 2Πρ 1563 | 2Πτ 1564 | 2Τμ 1565 | 3Βσ 1566 | 3Ιν 1567 | 3Μκ 1568 | 4Βσ 1569 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.en: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 34 | Adj 35 | Adm 36 | Adv 37 | Asst 38 | Bart 39 | Bldg 40 | Brig 41 | Bros 42 | Capt 43 | Cmdr 44 | Col 45 | Comdr 46 | Con 47 | Corp 48 | Cpl 49 | DR 50 | Dr 51 | Drs 52 | Ens 53 | Gen 54 | Gov 55 | Hon 56 | Hr 57 | Hosp 58 | Insp 59 | Lt 60 | MM 61 | MR 62 | MRS 63 | MS 64 | Maj 65 | Messrs 66 | Mlle 67 | Mme 68 | Mr 69 | Mrs 70 | Ms 71 | Msgr 72 | Op 73 | Ord 74 | Pfc 75 | Ph 76 | Prof 77 | Pvt 78 | Rep 79 | Reps 80 | Res 81 | Rev 82 | Rt 83 | Sen 84 | Sens 85 | Sfc 86 | Sgt 87 | Sr 88 | St 89 | Supt 90 | Surg 91 | 92 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 93 | v 94 | vs 95 | i.e 96 | rev 97 | e.g 98 | 99 | #Numbers only. These should only induce breaks when followed by a numeric sequence 100 | # add NUMERIC_ONLY after the word for this function 101 | #This case is mostly for the english "No." which can either be a sentence of its own, or 102 | #if followed by a number, a non-breaking prefix 103 | No #NUMERIC_ONLY# 104 | Nos 105 | Art #NUMERIC_ONLY# 106 | Nr 107 | pp #NUMERIC_ONLY# 108 | 109 | #month abbreviations 110 | Jan 111 | Feb 112 | Mar 113 | Apr 114 | #May is a full word 115 | Jun 116 | Jul 117 | Aug 118 | Sep 119 | Oct 120 | Nov 121 | Dec 122 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.es: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 
3 | 4 | #any single upper case letter followed by a period is not a sentence ender 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | # Period-final abbreviation list from http://www.ctspanish.com/words/abbreviations.htm 34 | 35 | A.C 36 | Apdo 37 | Av 38 | Bco 39 | CC.AA 40 | Da 41 | Dep 42 | Dn 43 | Dr 44 | Dra 45 | EE.UU 46 | Excmo 47 | FF.CC 48 | Fil 49 | Gral 50 | J.C 51 | Let 52 | Lic 53 | N.B 54 | P.D 55 | P.V.P 56 | Prof 57 | Pts 58 | Rte 59 | S.A 60 | S.A.R 61 | S.E 62 | S.L 63 | S.R.C 64 | Sr 65 | Sra 66 | Srta 67 | Sta 68 | Sto 69 | T.V.E 70 | Tel 71 | Ud 72 | Uds 73 | V.B 74 | V.E 75 | Vd 76 | Vds 77 | a/c 78 | adj 79 | admón 80 | afmo 81 | apdo 82 | av 83 | c 84 | c.f 85 | c.g 86 | cap 87 | cm 88 | cta 89 | dcha 90 | doc 91 | ej 92 | entlo 93 | esq 94 | etc 95 | f.c 96 | gr 97 | grs 98 | izq 99 | kg 100 | km 101 | mg 102 | mm 103 | núm 104 | núm 105 | p 106 | p.a 107 | p.ej 108 | ptas 109 | pág 110 | págs 111 | pág 112 | págs 113 | q.e.g.e 114 | q.e.s.m 115 | s 116 | s.s.s 117 | vid 118 | vol 119 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.fi: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT 2 | #indicate an end-of-sentence marker. Special cases are included for prefixes 3 | #that ONLY appear before 0-9 numbers. 4 | 5 | #This list is compiled from omorfi database 6 | #by Tommi A Pirinen. 7 | 8 | 9 | #any single upper case letter followed by a period is not a sentence ender 10 | A 11 | B 12 | C 13 | D 14 | E 15 | F 16 | G 17 | H 18 | I 19 | J 20 | K 21 | L 22 | M 23 | N 24 | O 25 | P 26 | Q 27 | R 28 | S 29 | T 30 | U 31 | V 32 | W 33 | X 34 | Y 35 | Z 36 | Å 37 | Ä 38 | Ö 39 | 40 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 41 | alik 42 | alil 43 | amir 44 | apul 45 | apul.prof 46 | arkkit 47 | ass 48 | assist 49 | dipl 50 | dipl.arkkit 51 | dipl.ekon 52 | dipl.ins 53 | dipl.kielenk 54 | dipl.kirjeenv 55 | dipl.kosm 56 | dipl.urk 57 | dos 58 | erikoiseläinl 59 | erikoishammasl 60 | erikoisl 61 | erikoist 62 | ev.luutn 63 | evp 64 | fil 65 | ft 66 | hallinton 67 | hallintot 68 | hammaslääket 69 | jatk 70 | jääk 71 | kansaned 72 | kapt 73 | kapt.luutn 74 | kenr 75 | kenr.luutn 76 | kenr.maj 77 | kers 78 | kirjeenv 79 | kom 80 | kom.kapt 81 | komm 82 | konst 83 | korpr 84 | luutn 85 | maist 86 | maj 87 | Mr 88 | Mrs 89 | Ms 90 | M.Sc 91 | neuv 92 | nimim 93 | Ph.D 94 | prof 95 | puh.joht 96 | pääll 97 | res 98 | san 99 | siht 100 | suom 101 | sähköp 102 | säv 103 | toht 104 | toim 105 | toim.apul 106 | toim.joht 107 | toim.siht 108 | tuom 109 | ups 110 | vänr 111 | vääp 112 | ye.ups 113 | ylik 114 | ylil 115 | ylim 116 | ylimatr 117 | yliop 118 | yliopp 119 | ylip 120 | yliv 121 | 122 | #misc - odd period-ending items that NEVER indicate breaks (p.m. 
does NOT fall 123 | #into this category - it sometimes ends a sentence) 124 | e.g 125 | ent 126 | esim 127 | huom 128 | i.e 129 | ilm 130 | l 131 | mm 132 | myöh 133 | nk 134 | nyk 135 | par 136 | po 137 | t 138 | v 139 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.fr: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | # 4 | #any single upper case letter followed by a period is not a sentence ender 5 | #usually upper case letters are initials in a name 6 | #no French words end in single lower-case letters, so we throw those in too? 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | #a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | # Period-final abbreviation list for French 61 | A.C.N 62 | A.M 63 | art 64 | ann 65 | apr 66 | av 67 | auj 68 | lib 69 | B.P 70 | boul 71 | ca 72 | c.-à-d 73 | cf 74 | ch.-l 75 | chap 76 | contr 77 | C.P.I 78 | C.Q.F.D 79 | C.N 80 | C.N.S 81 | C.S 82 | dir 83 | éd 84 | e.g 85 | env 86 | al 87 | etc 88 | E.V 89 | ex 90 | fasc 91 | fém 92 | fig 93 | fr 94 | hab 95 | ibid 96 | id 97 | i.e 98 | inf 99 | LL.AA 100 | LL.AA.II 101 | LL.AA.RR 102 | LL.AA.SS 103 | L.D 104 | LL.EE 105 | LL.MM 106 | LL.MM.II.RR 107 | loc.cit 108 | masc 109 | MM 110 | ms 111 | N.B 112 | N.D.A 113 | N.D.L.R 114 | N.D.T 115 | n/réf 116 | NN.SS 117 | N.S 118 | N.D 119 | N.P.A.I 120 | p.c.c 121 | pl 122 | pp 123 | p.ex 124 | p.j 125 | P.S 126 | R.A.S 127 | R.-V 128 | R.P 129 | R.I.P 130 | SS 131 | S.S 132 | S.A 133 | S.A.I 134 | S.A.R 135 | S.A.S 136 | S.E 137 | sec 138 | sect 139 | sing 140 | S.M 141 | S.M.I.R 142 | sq 143 | sqq 144 | suiv 145 | sup 146 | suppl 147 | tél 148 | T.S.V.P 149 | vb 150 | vol 151 | vs 152 | X.O 153 | Z.I 154 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.ga: -------------------------------------------------------------------------------- 1 | 2 | A 3 | B 4 | C 5 | D 6 | E 7 | F 8 | G 9 | H 10 | I 11 | J 12 | K 13 | L 14 | M 15 | N 16 | O 17 | P 18 | Q 19 | R 20 | S 21 | T 22 | U 23 | V 24 | W 25 | X 26 | Y 27 | Z 28 | Á 29 | É 30 | Í 31 | Ó 32 | Ú 33 | 34 | Uacht 35 | Dr 36 | B.Arch 37 | 38 | m.sh 39 | .i 40 | Co 41 | Cf 42 | cf 43 | i.e 44 | r 45 | Chr 46 | lch #NUMERIC_ONLY# 47 | lgh #NUMERIC_ONLY# 48 | uimh #NUMERIC_ONLY# 49 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.hu: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 
3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | Á 33 | É 34 | Í 35 | Ó 36 | Ö 37 | Ő 38 | Ú 39 | Ü 40 | Ű 41 | 42 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 43 | Dr 44 | dr 45 | kb 46 | Kb 47 | vö 48 | Vö 49 | pl 50 | Pl 51 | ca 52 | Ca 53 | min 54 | Min 55 | max 56 | Max 57 | ún 58 | Ún 59 | prof 60 | Prof 61 | de 62 | De 63 | du 64 | Du 65 | Szt 66 | St 67 | 68 | #Numbers only. These should only induce breaks when followed by a numeric sequence 69 | # add NUMERIC_ONLY after the word for this function 70 | #This case is mostly for the english "No." which can either be a sentence of its own, or 71 | #if followed by a number, a non-breaking prefix 72 | 73 | # Month name abbreviations 74 | jan #NUMERIC_ONLY# 75 | Jan #NUMERIC_ONLY# 76 | Feb #NUMERIC_ONLY# 77 | feb #NUMERIC_ONLY# 78 | márc #NUMERIC_ONLY# 79 | Márc #NUMERIC_ONLY# 80 | ápr #NUMERIC_ONLY# 81 | Ápr #NUMERIC_ONLY# 82 | máj #NUMERIC_ONLY# 83 | Máj #NUMERIC_ONLY# 84 | jún #NUMERIC_ONLY# 85 | Jún #NUMERIC_ONLY# 86 | Júl #NUMERIC_ONLY# 87 | júl #NUMERIC_ONLY# 88 | aug #NUMERIC_ONLY# 89 | Aug #NUMERIC_ONLY# 90 | Szept #NUMERIC_ONLY# 91 | szept #NUMERIC_ONLY# 92 | okt #NUMERIC_ONLY# 93 | Okt #NUMERIC_ONLY# 94 | nov #NUMERIC_ONLY# 95 | Nov #NUMERIC_ONLY# 96 | dec #NUMERIC_ONLY# 97 | Dec #NUMERIC_ONLY# 98 | 99 | # Other abbreviations 100 | tel #NUMERIC_ONLY# 101 | Tel #NUMERIC_ONLY# 102 | Fax #NUMERIC_ONLY# 103 | fax #NUMERIC_ONLY# 104 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.is: -------------------------------------------------------------------------------- 1 | no #NUMERIC_ONLY# 2 | No #NUMERIC_ONLY# 3 | nr #NUMERIC_ONLY# 4 | Nr #NUMERIC_ONLY# 5 | nR #NUMERIC_ONLY# 6 | NR #NUMERIC_ONLY# 7 | a 8 | b 9 | c 10 | d 11 | e 12 | f 13 | g 14 | h 15 | i 16 | j 17 | k 18 | l 19 | m 20 | n 21 | o 22 | p 23 | q 24 | r 25 | s 26 | t 27 | u 28 | v 29 | w 30 | x 31 | y 32 | z 33 | ^ 34 | í 35 | á 36 | ó 37 | æ 38 | A 39 | B 40 | C 41 | D 42 | E 43 | F 44 | G 45 | H 46 | I 47 | J 48 | K 49 | L 50 | M 51 | N 52 | O 53 | P 54 | Q 55 | R 56 | S 57 | T 58 | U 59 | V 60 | W 61 | X 62 | Y 63 | Z 64 | ab.fn 65 | a.fn 66 | afs 67 | al 68 | alm 69 | alg 70 | andh 71 | ath 72 | aths 73 | atr 74 | ao 75 | au 76 | aukaf 77 | áfn 78 | áhrl.s 79 | áhrs 80 | ákv.gr 81 | ákv 82 | bh 83 | bls 84 | dr 85 | e.Kr 86 | et 87 | ef 88 | efn 89 | ennfr 90 | eink 91 | end 92 | e.st 93 | erl 94 | fél 95 | fskj 96 | fh 97 | f.hl 98 | físl 99 | fl 100 | fn 101 | fo 102 | forl 103 | frb 104 | frl 105 | frh 106 | frt 107 | fsl 108 | fsh 109 | fs 110 | fsk 111 | fst 112 | f.Kr 113 | ft 114 | fv 115 | fyrrn 116 | fyrrv 117 | germ 118 | gm 119 | gr 120 | hdl 121 | hdr 122 | hf 123 | hl 124 | hlsk 125 | hljsk 126 | hljv 127 | hljóðv 128 | hr 129 | hv 130 | hvk 131 | holl 132 | Hos 133 | höf 134 | hk 135 | hrl 136 | ísl 137 | kaf 138 | kap 139 | Khöfn 140 | kk 141 | kg 142 | kk 143 | km 144 | kl 145 | klst 146 | kr 147 | kt 148 | kgúrsk 149 | kvk 150 | leturbr 151 | lh 152 | lh.nt 153 | lh.þt 154 | lo 155 | ltr 156 | mlja 157 | mljó 158 | millj 159 | mm 160 | mms 161 | m.fl 162 | miðm 163 | mgr 164 | mst 165 | mín 166 | nf 167 
| nh 168 | nhm 169 | nl 170 | nk 171 | nmgr 172 | no 173 | núv 174 | nt 175 | o.áfr 176 | o.m.fl 177 | ohf 178 | o.fl 179 | o.s.frv 180 | ófn 181 | ób 182 | óákv.gr 183 | óákv 184 | pfn 185 | PR 186 | pr 187 | Ritstj 188 | Rvík 189 | Rvk 190 | samb 191 | samhlj 192 | samn 193 | samn 194 | sbr 195 | sek 196 | sérn 197 | sf 198 | sfn 199 | sh 200 | sfn 201 | sh 202 | s.hl 203 | sk 204 | skv 205 | sl 206 | sn 207 | so 208 | ss.us 209 | s.st 210 | samþ 211 | sbr 212 | shlj 213 | sign 214 | skál 215 | st 216 | st.s 217 | stk 218 | sþ 219 | teg 220 | tbl 221 | tfn 222 | tl 223 | tvíhlj 224 | tvt 225 | till 226 | to 227 | umr 228 | uh 229 | us 230 | uppl 231 | útg 232 | vb 233 | Vf 234 | vh 235 | vkf 236 | Vl 237 | vl 238 | vlf 239 | vmf 240 | 8vo 241 | vsk 242 | vth 243 | þt 244 | þf 245 | þjs 246 | þgf 247 | þlt 248 | þolm 249 | þm 250 | þml 251 | þýð 252 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.it: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 34 | Adj 35 | Adm 36 | Adv 37 | Amn 38 | Arch 39 | Asst 40 | Avv 41 | Bart 42 | Bcc 43 | Bldg 44 | Brig 45 | Bros 46 | C.A.P 47 | C.P 48 | Capt 49 | Cc 50 | Cmdr 51 | Co 52 | Col 53 | Comdr 54 | Con 55 | Corp 56 | Cpl 57 | DR 58 | Dott 59 | Dr 60 | Drs 61 | Egr 62 | Ens 63 | Gen 64 | Geom 65 | Gov 66 | Hon 67 | Hosp 68 | Hr 69 | Id 70 | Ing 71 | Insp 72 | Lt 73 | MM 74 | MR 75 | MRS 76 | MS 77 | Maj 78 | Messrs 79 | Mlle 80 | Mme 81 | Mo 82 | Mons 83 | Mr 84 | Mrs 85 | Ms 86 | Msgr 87 | N.B 88 | Op 89 | Ord 90 | P.S 91 | P.T 92 | Pfc 93 | Ph 94 | Prof 95 | Pvt 96 | RP 97 | RSVP 98 | Rag 99 | Rep 100 | Reps 101 | Res 102 | Rev 103 | Rif 104 | Rt 105 | S.A 106 | S.B.F 107 | S.P.M 108 | S.p.A 109 | S.r.l 110 | Sen 111 | Sens 112 | Sfc 113 | Sgt 114 | Sig 115 | Sigg 116 | Soc 117 | Spett 118 | Sr 119 | St 120 | Supt 121 | Surg 122 | V.P 123 | 124 | # other 125 | a.c 126 | acc 127 | all 128 | banc 129 | c.a 130 | c.c.p 131 | c.m 132 | c.p 133 | c.s 134 | c.v 135 | corr 136 | dott 137 | e.p.c 138 | ecc 139 | es 140 | fatt 141 | gg 142 | int 143 | lett 144 | ogg 145 | on 146 | p.c 147 | p.c.c 148 | p.es 149 | p.f 150 | p.r 151 | p.v 152 | post 153 | pp 154 | racc 155 | ric 156 | s.n.c 157 | seg 158 | sgg 159 | ss 160 | tel 161 | u.s 162 | v.r 163 | v.s 164 | 165 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 166 | v 167 | vs 168 | i.e 169 | rev 170 | e.g 171 | 172 | #Numbers only. These should only induce breaks when followed by a numeric sequence 173 | # add NUMERIC_ONLY after the word for this function 174 | #This case is mostly for the english "No." 
which can either be a sentence of its own, or 175 | #if followed by a number, a non-breaking prefix 176 | No #NUMERIC_ONLY# 177 | Nos 178 | Art #NUMERIC_ONLY# 179 | Nr 180 | pp #NUMERIC_ONLY# 181 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.lt: -------------------------------------------------------------------------------- 1 | # Anything in this file, followed by a period (and an upper-case word), 2 | # does NOT indicate an end-of-sentence marker. 3 | # Special cases are included for prefixes that ONLY appear before 0-9 numbers. 4 | 5 | # Any single upper case letter followed by a period is not a sentence ender 6 | # (excluding I occasionally, but we leave it in) 7 | # usually upper case letters are initials in a name 8 | A 9 | Ā 10 | B 11 | C 12 | Č 13 | D 14 | E 15 | Ē 16 | F 17 | G 18 | Ģ 19 | H 20 | I 21 | Ī 22 | J 23 | K 24 | Ķ 25 | L 26 | Ļ 27 | M 28 | N 29 | Ņ 30 | O 31 | P 32 | Q 33 | R 34 | S 35 | Š 36 | T 37 | U 38 | Ū 39 | V 40 | W 41 | X 42 | Y 43 | Z 44 | Ž 45 | 46 | # Initialis -- Džonas 47 | Dz 48 | Dž 49 | Just 50 | 51 | # Day and month abbreviations 52 | # m. menesis d. diena g. gimes 53 | m 54 | mėn 55 | d 56 | g 57 | gim 58 | # Pirmadienis Penktadienis 59 | Pr 60 | Pn 61 | Pirm 62 | Antr 63 | Treč 64 | Ketv 65 | Penkt 66 | Šešt 67 | Sekm 68 | Saus 69 | Vas 70 | Kov 71 | Bal 72 | Geg 73 | Birž 74 | Liep 75 | Rugpj 76 | Rugs 77 | Spal 78 | Lapkr 79 | Gruod 80 | 81 | # Business, governmental, geographical terms 82 | a 83 | # aikštė 84 | adv 85 | # advokatas 86 | akad 87 | # akademikas 88 | aklg 89 | # akligatvis 90 | akt 91 | # aktorius 92 | al 93 | # alėja 94 | A.V 95 | # antspaudo vieta 96 | aps 97 | apskr 98 | # apskritis 99 | apyg 100 | # apygarda 101 | aps 102 | apskr 103 | # apskritis 104 | asist 105 | # asistentas 106 | asmv 107 | avd 108 | # asmenvardis 109 | a.k 110 | asm 111 | asm.k 112 | # asmens kodas 113 | atsak 114 | # atsakingasis 115 | atsisk 116 | sąsk 117 | # atsiskaitomoji sąskaita 118 | aut 119 | # autorius 120 | b 121 | k 122 | b.k 123 | # banko kodas 124 | bkl 125 | # bakalauras 126 | bt 127 | # butas 128 | buv 129 | # buvęs, -usi 130 | dail 131 | # dailininkas 132 | dek 133 | # dekanas 134 | dėst 135 | # dėstytojas 136 | dir 137 | # direktorius 138 | dirig 139 | # dirigentas 140 | doc 141 | # docentas 142 | drp 143 | # durpynas 144 | dš 145 | # dešinysis 146 | egz 147 | # egzempliorius 148 | eil 149 | # eilutė 150 | ekon 151 | # ekonomika 152 | el 153 | # elektroninis 154 | etc 155 | ež 156 | # ežeras 157 | faks 158 | # faksas 159 | fak 160 | # fakultetas 161 | gen 162 | # generolas 163 | gyd 164 | # gydytojas 165 | gv 166 | # gyvenvietė 167 | įl 168 | # įlanka 169 | Įn 170 | # įnagininkas 171 | insp 172 | # inspektorius 173 | pan 174 | # ir panašiai 175 | t.t 176 | # ir taip toliau 177 | k.a 178 | # kaip antai 179 | kand 180 | # kandidatas 181 | kat 182 | # katedra 183 | kyš 184 | # kyšulys 185 | kl 186 | # klasė 187 | kln 188 | # kalnas 189 | kn 190 | # knyga 191 | koresp 192 | # korespondentas 193 | kpt 194 | # kapitonas 195 | kr 196 | # kairysis 197 | kt 198 | # kitas 199 | kun 200 | # kunigas 201 | l 202 | e 203 | p 204 | l.e.p 205 | # laikinai einantis pareigas 206 | ltn 207 | # leitenantas 208 | m 209 | mst 210 | # miestas 211 | m.e 212 | # mūsų eros 213 | m.m 214 | # mokslo metai 215 | mot 216 | # moteris 217 | mstl 218 | # miestelis 219 | mgr 220 | # magistras 221 | mgnt 222 | # magistrantas 223 | mjr 224 | # majoras 225 | mln 226 | # milijonas 227 | mlrd 
228 | # milijardas 229 | mok 230 | # mokinys 231 | mokyt 232 | # mokytojas 233 | moksl 234 | # mokslinis 235 | nkt 236 | # nekaitomas 237 | ntk 238 | # neteiktinas 239 | Nr 240 | nr 241 | # numeris 242 | p 243 | # ponas 244 | p.d 245 | a.d 246 | # pašto dėžutė, abonentinė dėžutė 247 | p.m.e 248 | # prieš mūsų erą 249 | pan 250 | # ir panašiai 251 | pav 252 | # paveikslas 253 | pavad 254 | # pavaduotojas 255 | pirm 256 | # pirmininkas 257 | pl 258 | # plentas 259 | plg 260 | # palygink 261 | plk 262 | # pulkininkas; pelkė 263 | pr 264 | # prospektas 265 | Kr 266 | pr.Kr 267 | # prieš Kristų 268 | prok 269 | # prokuroras 270 | prot 271 | # protokolas 272 | pss 273 | # pusiasalis 274 | pšt 275 | # paštas 276 | pvz 277 | # pavyzdžiui 278 | r 279 | # rajonas 280 | red 281 | # redaktorius 282 | rš 283 | # raštų kalbos 284 | sąs 285 | # sąsiuvinis 286 | saviv 287 | sav 288 | # savivaldybė 289 | sekr 290 | # sekretorius 291 | sen 292 | # seniūnija, seniūnas 293 | sk 294 | # skaityk; skyrius 295 | skg 296 | # skersgatvis 297 | skyr 298 | sk 299 | # skyrius 300 | skv 301 | # skveras 302 | sp 303 | # spauda; spaustuvė 304 | spec 305 | # specialistas 306 | sr 307 | # sritis 308 | st 309 | # stotis 310 | str 311 | # straipsnis 312 | stud 313 | # studentas 314 | š 315 | š.m 316 | # šių metų 317 | šnek 318 | # šnekamosios 319 | tir 320 | # tiražas 321 | tūkst 322 | # tūkstantis 323 | up 324 | # upė 325 | upl 326 | # upelis 327 | vad 328 | # vadinamasis, -oji 329 | vlsč 330 | # valsčius 331 | ved 332 | # vedėjas 333 | vet 334 | # veterinarija 335 | virš 336 | # viršininkas, viršaitis 337 | vyr 338 | # vyriausiasis, -ioji; vyras 339 | vyresn 340 | # vyresnysis 341 | vlsč 342 | # valsčius 343 | vs 344 | # viensėdis 345 | Vt 346 | vt 347 | # vietininkas 348 | vtv 349 | vv 350 | # vietovardis 351 | žml 352 | # žemėlapis 353 | 354 | # Technical terms, abbreviations used in guidebooks, advertisments, etc. 355 | # Generally lower-case. 356 | air 357 | # airiškai 358 | amer 359 | # amerikanizmas 360 | anat 361 | # anatomija 362 | angl 363 | # angl. angliskai 364 | arab 365 | # arabų 366 | archeol 367 | archit 368 | asm 369 | # asmuo 370 | astr 371 | # astronomija 372 | austral 373 | # australiškai 374 | aut 375 | # automobilis 376 | av 377 | # aviacija 378 | bažn 379 | bdv 380 | # būdvardis 381 | bibl 382 | # Biblija 383 | biol 384 | # biologija 385 | bot 386 | # botanika 387 | brt 388 | # burtai, burtažodis. 389 | brus 390 | # baltarusių 391 | buh 392 | # buhalterija 393 | chem 394 | # chemija 395 | col 396 | # collectivum 397 | con 398 | conj 399 | # conjunctivus, jungtukas 400 | dab 401 | # dab. 
dabartine 402 | dgs 403 | # daugiskaita 404 | dial 405 | # dialektizmas 406 | dipl 407 | dktv 408 | # daiktavardis 409 | džn 410 | # dažnai 411 | ekon 412 | el 413 | # elektra 414 | esam 415 | # esamasis laikas 416 | euf 417 | # eufemizmas 418 | fam 419 | # familiariai 420 | farm 421 | # farmacija 422 | filol 423 | # filologija 424 | filos 425 | # filosofija 426 | fin 427 | # finansai 428 | fiz 429 | # fizika 430 | fiziol 431 | # fiziologija 432 | flk 433 | # folkloras 434 | fon 435 | # fonetika 436 | fot 437 | # fotografija 438 | geod 439 | # geodezija 440 | geogr 441 | geol 442 | # geologija 443 | geom 444 | # geometrija 445 | glžk 446 | gr 447 | # graikų 448 | gram 449 | her 450 | # heraldika 451 | hidr 452 | # hidrotechnika 453 | ind 454 | # Indų 455 | iron 456 | # ironiškai 457 | isp 458 | # ispanų 459 | ist 460 | istor 461 | # istorija 462 | it 463 | # italų 464 | įv 465 | reikšm 466 | įv.reikšm 467 | # įvairiomis reikšmėmis 468 | jap 469 | # japonų 470 | juok 471 | # juokaujamai 472 | jūr 473 | # jūrininkystė 474 | kalb 475 | # kalbotyra 476 | kar 477 | # karyba 478 | kas 479 | # kasyba 480 | kin 481 | # kinematografija 482 | klaus 483 | # klausiamasis 484 | knyg 485 | # knyginis 486 | kom 487 | # komercija 488 | komp 489 | # kompiuteris 490 | kosm 491 | # kosmonautika 492 | kt 493 | # kitas 494 | kul 495 | # kulinarija 496 | kuop 497 | # kuopine 498 | l 499 | # laikas 500 | lit 501 | # literatūrinis 502 | lingv 503 | # lingvistika 504 | log 505 | # logika 506 | lot 507 | # lotynų 508 | mat 509 | # matematika 510 | maž 511 | # mažybinis 512 | med 513 | # medicina 514 | medž 515 | # medžioklė 516 | men 517 | # menas 518 | menk 519 | # menkinamai 520 | metal 521 | # metalurgija 522 | meteor 523 | min 524 | # mineralogija 525 | mit 526 | # mitologija 527 | mok 528 | # mokyklinis 529 | ms 530 | # mįslė 531 | muz 532 | # muzikinis 533 | n 534 | # naujasis 535 | neig 536 | # neigiamasis 537 | neol 538 | # neologizmas 539 | niek 540 | # niekinamai 541 | ofic 542 | # oficialus 543 | opt 544 | # optika 545 | orig 546 | # original 547 | p 548 | # pietūs 549 | pan 550 | # panašiai 551 | parl 552 | # parlamentas 553 | pat 554 | # patarlė 555 | paž 556 | # pažodžiui 557 | plg 558 | # palygink 559 | poet 560 | # poetizmas 561 | poez 562 | # poezija 563 | poligr 564 | # poligrafija 565 | polit 566 | # politika 567 | ppr 568 | # paprastai 569 | pranc 570 | pr 571 | # prancūzų, prūsų 572 | priet 573 | # prietaras 574 | prek 575 | # prekyba 576 | prk 577 | # perkeltine 578 | prs 579 | # persona, asmuo 580 | psn 581 | # pasenęs žodis 582 | psich 583 | # psichologija 584 | pvz 585 | # pavyzdžiui 586 | r 587 | # rytai 588 | rad 589 | # radiotechnika 590 | rel 591 | # religija 592 | ret 593 | # retai 594 | rus 595 | # rusų 596 | sen 597 | # senasis 598 | sl 599 | # slengas, slavų 600 | sov 601 | # sovietinis 602 | spec 603 | # specialus 604 | sport 605 | stat 606 | # statyba 607 | sudurt 608 | # sudurtinis 609 | sutr 610 | # sutrumpintas 611 | suv 612 | # suvalkiečių 613 | š 614 | # šiaurė 615 | šach 616 | # šachmatai 617 | šiaur 618 | škot 619 | # škotiškai 620 | šnek 621 | # šnekamoji 622 | teatr 623 | tech 624 | techn 625 | # technika 626 | teig 627 | # teigiamas 628 | teis 629 | # teisė 630 | tekst 631 | # tekstilė 632 | tel 633 | # telefonas 634 | teol 635 | # teologija 636 | v 637 | # tik vyriškosios, vakarai 638 | t.p 639 | t 640 | p 641 | # ir taip pat 642 | t.t 643 | # ir taip toliau 644 | t.y 645 | # tai yra 646 | vaik 647 | # vaikų 648 | vart 649 | # vartojama 650 | vet 651 | # veterinarija 
652 | vid 653 | # vidurinis 654 | vksm 655 | # veiksmažodis 656 | vns 657 | # vienaskaita 658 | vok 659 | # vokiečių 660 | vulg 661 | # vulgariai 662 | zool 663 | # zoologija 664 | žr 665 | # žiūrėk 666 | ž.ū 667 | ž 668 | ū 669 | # žemės ūkis 670 | 671 | # List of titles. These are often followed by upper-case names, but do 672 | # not indicate sentence breaks 673 | # 674 | # Jo Eminencija 675 | Em. 676 | # Gerbiamasis 677 | Gerb 678 | gerb 679 | # malonus 680 | malon 681 | # profesorius 682 | Prof 683 | prof 684 | # daktaras (mokslų) 685 | Dr 686 | dr 687 | habil 688 | med 689 | # inž inžinierius 690 | inž 691 | Inž 692 | 693 | 694 | #Numbers only. These should only induce breaks when followed by a numeric sequence 695 | # add NUMERIC_ONLY after the word for this function 696 | #This case is mostly for the english "No." which can either be a sentence of its own, or 697 | #if followed by a number, a non-breaking prefix 698 | No #NUMERIC_ONLY# 699 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.lv: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | Ā 8 | B 9 | C 10 | Č 11 | D 12 | E 13 | Ē 14 | F 15 | G 16 | Ģ 17 | H 18 | I 19 | Ī 20 | J 21 | K 22 | Ķ 23 | L 24 | Ļ 25 | M 26 | N 27 | Ņ 28 | O 29 | P 30 | Q 31 | R 32 | S 33 | Š 34 | T 35 | U 36 | Ū 37 | V 38 | W 39 | X 40 | Y 41 | Z 42 | Ž 43 | 44 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 45 | dr 46 | Dr 47 | med 48 | prof 49 | Prof 50 | inž 51 | Inž 52 | ist.loc 53 | Ist.loc 54 | kor.loc 55 | Kor.loc 56 | v.i 57 | vietn 58 | Vietn 59 | 60 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 61 | a.l 62 | t.p 63 | pārb 64 | Pārb 65 | vec 66 | Vec 67 | inv 68 | Inv 69 | sk 70 | Sk 71 | spec 72 | Spec 73 | vienk 74 | Vienk 75 | virz 76 | Virz 77 | māksl 78 | Māksl 79 | mūz 80 | Mūz 81 | akad 82 | Akad 83 | soc 84 | Soc 85 | galv 86 | Galv 87 | vad 88 | Vad 89 | sertif 90 | Sertif 91 | folkl 92 | Folkl 93 | hum 94 | Hum 95 | 96 | #Numbers only. These should only induce breaks when followed by a numeric sequence 97 | # add NUMERIC_ONLY after the word for this function 98 | #This case is mostly for the english "No." which can either be a sentence of its own, or 99 | #if followed by a number, a non-breaking prefix 100 | Nr #NUMERIC_ONLY# 101 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.nl: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 
3 | #Sources: http://nl.wikipedia.org/wiki/Lijst_van_afkortingen 4 | # http://nl.wikipedia.org/wiki/Aanspreekvorm 5 | # http://nl.wikipedia.org/wiki/Titulatuur_in_het_Nederlands_hoger_onderwijs 6 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 7 | #usually upper case letters are initials in a name 8 | A 9 | B 10 | C 11 | D 12 | E 13 | F 14 | G 15 | H 16 | I 17 | J 18 | K 19 | L 20 | M 21 | N 22 | O 23 | P 24 | Q 25 | R 26 | S 27 | T 28 | U 29 | V 30 | W 31 | X 32 | Y 33 | Z 34 | 35 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 36 | bacc 37 | bc 38 | bgen 39 | c.i 40 | dhr 41 | dr 42 | dr.h.c 43 | drs 44 | drs 45 | ds 46 | eint 47 | fa 48 | Fa 49 | fam 50 | gen 51 | genm 52 | ing 53 | ir 54 | jhr 55 | jkvr 56 | jr 57 | kand 58 | kol 59 | lgen 60 | lkol 61 | Lt 62 | maj 63 | Mej 64 | mevr 65 | Mme 66 | mr 67 | mr 68 | Mw 69 | o.b.s 70 | plv 71 | prof 72 | ritm 73 | tint 74 | Vz 75 | Z.D 76 | Z.D.H 77 | Z.E 78 | Z.Em 79 | Z.H 80 | Z.K.H 81 | Z.K.M 82 | Z.M 83 | z.v 84 | 85 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 86 | #we seem to have a lot of these in dutch i.e.: i.p.v - in plaats van (in stead of) never ends a sentence 87 | a.g.v 88 | bijv 89 | bijz 90 | bv 91 | d.w.z 92 | e.c 93 | e.g 94 | e.k 95 | ev 96 | i.p.v 97 | i.s.m 98 | i.t.t 99 | i.v.m 100 | m.a.w 101 | m.b.t 102 | m.b.v 103 | m.h.o 104 | m.i 105 | m.i.v 106 | v.w.t 107 | 108 | #Numbers only. These should only induce breaks when followed by a numeric sequence 109 | # add NUMERIC_ONLY after the word for this function 110 | #This case is mostly for the english "No." which can either be a sentence of its own, or 111 | #if followed by a number, a non-breaking prefix 112 | Nr #NUMERIC_ONLY# 113 | Nrs 114 | nrs 115 | nr #NUMERIC_ONLY# 116 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.pl: -------------------------------------------------------------------------------- 1 | adw 2 | afr 3 | akad 4 | al 5 | Al 6 | am 7 | amer 8 | arch 9 | art 10 | Art 11 | artyst 12 | astr 13 | austr 14 | bałt 15 | bdb 16 | bł 17 | bm 18 | br 19 | bryg 20 | bryt 21 | centr 22 | ces 23 | chem 24 | chiń 25 | chir 26 | c.k 27 | c.o 28 | cyg 29 | cyw 30 | cyt 31 | czes 32 | czw 33 | cd 34 | Cd 35 | czyt 36 | ćw 37 | ćwicz 38 | daw 39 | dcn 40 | dekl 41 | demokr 42 | det 43 | diec 44 | dł 45 | dn 46 | dot 47 | dol 48 | dop 49 | dost 50 | dosł 51 | h.c 52 | ds 53 | dst 54 | duszp 55 | dypl 56 | egz 57 | ekol 58 | ekon 59 | elektr 60 | em 61 | ew 62 | fab 63 | farm 64 | fot 65 | fr 66 | gat 67 | gastr 68 | geogr 69 | geol 70 | gimn 71 | głęb 72 | gm 73 | godz 74 | górn 75 | gosp 76 | gr 77 | gram 78 | hist 79 | hiszp 80 | hr 81 | Hr 82 | hot 83 | id 84 | in 85 | im 86 | iron 87 | jn 88 | kard 89 | kat 90 | katol 91 | k.k 92 | kk 93 | kol 94 | kl 95 | k.p.a 96 | kpc 97 | k.p.c 98 | kpt 99 | kr 100 | k.r 101 | krak 102 | k.r.o 103 | kryt 104 | kult 105 | laic 106 | łac 107 | niem 108 | woj 109 | nb 110 | np 111 | Nb 112 | Np 113 | pol 114 | pow 115 | m.in 116 | pt 117 | ps 118 | Pt 119 | Ps 120 | cdn 121 | jw 122 | ryc 123 | rys 124 | Ryc 125 | Rys 126 | tj 127 | tzw 128 | Tzw 129 | tzn 130 | zob 131 | ang 132 | ub 133 | ul 134 | pw 135 | pn 136 | pl 137 | al 138 | k 139 | n 140 | nr #NUMERIC_ONLY# 141 | Nr #NUMERIC_ONLY# 142 | ww 143 | wł 144 | ur 145 | zm 146 | żyd 
147 | żarg 148 | żyw 149 | wył 150 | bp 151 | bp 152 | wyst 153 | tow 154 | Tow 155 | o 156 | sp 157 | Sp 158 | st 159 | spółdz 160 | Spółdz 161 | społ 162 | spółgł 163 | stoł 164 | stow 165 | Stoł 166 | Stow 167 | zn 168 | zew 169 | zewn 170 | zdr 171 | zazw 172 | zast 173 | zaw 174 | zał 175 | zal 176 | zam 177 | zak 178 | zakł 179 | zagr 180 | zach 181 | adw 182 | Adw 183 | lek 184 | Lek 185 | med 186 | mec 187 | Mec 188 | doc 189 | Doc 190 | dyw 191 | dyr 192 | Dyw 193 | Dyr 194 | inż 195 | Inż 196 | mgr 197 | Mgr 198 | dh 199 | dr 200 | Dh 201 | Dr 202 | p 203 | P 204 | red 205 | Red 206 | prof 207 | prok 208 | Prof 209 | Prok 210 | hab 211 | płk 212 | Płk 213 | nadkom 214 | Nadkom 215 | podkom 216 | Podkom 217 | ks 218 | Ks 219 | gen 220 | Gen 221 | por 222 | Por 223 | reż 224 | Reż 225 | przyp 226 | Przyp 227 | śp 228 | św 229 | śW 230 | Śp 231 | Św 232 | ŚW 233 | szer 234 | Szer 235 | pkt #NUMERIC_ONLY# 236 | str #NUMERIC_ONLY# 237 | tab #NUMERIC_ONLY# 238 | Tab #NUMERIC_ONLY# 239 | tel 240 | ust #NUMERIC_ONLY# 241 | par #NUMERIC_ONLY# 242 | poz 243 | pok 244 | oo 245 | oO 246 | Oo 247 | OO 248 | r #NUMERIC_ONLY# 249 | l #NUMERIC_ONLY# 250 | s #NUMERIC_ONLY# 251 | najśw 252 | Najśw 253 | A 254 | B 255 | C 256 | D 257 | E 258 | F 259 | G 260 | H 261 | I 262 | J 263 | K 264 | L 265 | M 266 | N 267 | O 268 | P 269 | Q 270 | R 271 | S 272 | T 273 | U 274 | V 275 | W 276 | X 277 | Y 278 | Z 279 | Ś 280 | Ć 281 | Ż 282 | Ź 283 | Dz 284 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.pt: -------------------------------------------------------------------------------- 1 | #File adapted for PT by H. Leal Fontes from the EN & DE versions published with moses-2009-04-13. Last update: 10.11.2009. 2 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 3 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 4 | 5 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 6 | #usually upper case letters are initials in a name 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | 61 | #Roman Numerals. A dot after one of these is not a sentence break in Portuguese. 62 | I 63 | II 64 | III 65 | IV 66 | V 67 | VI 68 | VII 69 | VIII 70 | IX 71 | X 72 | XI 73 | XII 74 | XIII 75 | XIV 76 | XV 77 | XVI 78 | XVII 79 | XVIII 80 | XIX 81 | XX 82 | i 83 | ii 84 | iii 85 | iv 86 | v 87 | vi 88 | vii 89 | viii 90 | ix 91 | x 92 | xi 93 | xii 94 | xiii 95 | xiv 96 | xv 97 | xvi 98 | xvii 99 | xviii 100 | xix 101 | xx 102 | 103 | #List of titles. 
These are often followed by upper-case names, but do not indicate sentence breaks 104 | Adj 105 | Adm 106 | Adv 107 | Art 108 | Ca 109 | Capt 110 | Cmdr 111 | Col 112 | Comdr 113 | Con 114 | Corp 115 | Cpl 116 | DR 117 | DRA 118 | Dr 119 | Dra 120 | Dras 121 | Drs 122 | Eng 123 | Enga 124 | Engas 125 | Engos 126 | Ex 127 | Exo 128 | Exmo 129 | Fig 130 | Gen 131 | Hosp 132 | Insp 133 | Lda 134 | MM 135 | MR 136 | MRS 137 | MS 138 | Maj 139 | Mrs 140 | Ms 141 | Msgr 142 | Op 143 | Ord 144 | Pfc 145 | Ph 146 | Prof 147 | Pvt 148 | Rep 149 | Reps 150 | Res 151 | Rev 152 | Rt 153 | Sen 154 | Sens 155 | Sfc 156 | Sgt 157 | Sr 158 | Sra 159 | Sras 160 | Srs 161 | Sto 162 | Supt 163 | Surg 164 | adj 165 | adm 166 | adv 167 | art 168 | cit 169 | col 170 | con 171 | corp 172 | cpl 173 | dr 174 | dra 175 | dras 176 | drs 177 | eng 178 | enga 179 | engas 180 | engos 181 | ex 182 | exo 183 | exmo 184 | fig 185 | op 186 | prof 187 | sr 188 | sra 189 | sras 190 | srs 191 | sto 192 | 193 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 194 | v 195 | vs 196 | i.e 197 | rev 198 | e.g 199 | 200 | #Numbers only. These should only induce breaks when followed by a numeric sequence 201 | # add NUMERIC_ONLY after the word for this function 202 | #This case is mostly for the english "No." which can either be a sentence of its own, or 203 | #if followed by a number, a non-breaking prefix 204 | No #NUMERIC_ONLY# 205 | Nos 206 | Art #NUMERIC_ONLY# 207 | Nr 208 | p #NUMERIC_ONLY# 209 | pp #NUMERIC_ONLY# 210 | 211 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.ro: -------------------------------------------------------------------------------- 1 | A 2 | B 3 | C 4 | D 5 | E 6 | F 7 | G 8 | H 9 | I 10 | J 11 | K 12 | L 13 | M 14 | N 15 | O 16 | P 17 | Q 18 | R 19 | S 20 | T 21 | U 22 | V 23 | W 24 | X 25 | Y 26 | Z 27 | dpdv 28 | etc 29 | șamd 30 | M.Ap.N 31 | dl 32 | Dl 33 | d-na 34 | D-na 35 | dvs 36 | Dvs 37 | pt 38 | Pt 39 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.ru: -------------------------------------------------------------------------------- 1 | # added Cyrillic uppercase letters [А-Я] 2 | # removed 000D carriage return (this is not removed by chomp in tokenizer.perl, and prevents recognition of the prefixes) 3 | # edited by Kate Young (nspaceanalysis@earthlink.net) 21 May 2013 4 | А 5 | Б 6 | В 7 | Г 8 | Д 9 | Е 10 | Ж 11 | З 12 | И 13 | Й 14 | К 15 | Л 16 | М 17 | Н 18 | О 19 | П 20 | Р 21 | С 22 | Т 23 | У 24 | Ф 25 | Х 26 | Ц 27 | Ч 28 | Ш 29 | Щ 30 | Ъ 31 | Ы 32 | Ь 33 | Э 34 | Ю 35 | Я 36 | A 37 | B 38 | C 39 | D 40 | E 41 | F 42 | G 43 | H 44 | I 45 | J 46 | K 47 | L 48 | M 49 | N 50 | O 51 | P 52 | Q 53 | R 54 | S 55 | T 56 | U 57 | V 58 | W 59 | X 60 | Y 61 | Z 62 | 0гг 63 | 1гг 64 | 2гг 65 | 3гг 66 | 4гг 67 | 5гг 68 | 6гг 69 | 7гг 70 | 8гг 71 | 9гг 72 | 0г 73 | 1г 74 | 2г 75 | 3г 76 | 4г 77 | 5г 78 | 6г 79 | 7г 80 | 8г 81 | 9г 82 | Xвв 83 | Vвв 84 | Iвв 85 | Lвв 86 | Mвв 87 | Cвв 88 | Xв 89 | Vв 90 | Iв 91 | Lв 92 | Mв 93 | Cв 94 | 0м 95 | 1м 96 | 2м 97 | 3м 98 | 4м 99 | 5м 100 | 6м 101 | 7м 102 | 8м 103 | 9м 104 | 0мм 105 | 1мм 106 | 2мм 107 | 3мм 108 | 4мм 109 | 5мм 110 | 6мм 111 | 7мм 112 | 8мм 113 | 9мм 114 | 0см 115 | 1см 116 | 2см 117 | 3см 118 | 4см 119 | 5см 120 | 6см 121 | 7см 122 | 8см 123 | 9см 124 | 0дм 125 | 1дм 126 | 2дм 127 | 3дм 128 | 
4дм 129 | 5дм 130 | 6дм 131 | 7дм 132 | 8дм 133 | 9дм 134 | 0л 135 | 1л 136 | 2л 137 | 3л 138 | 4л 139 | 5л 140 | 6л 141 | 7л 142 | 8л 143 | 9л 144 | 0км 145 | 1км 146 | 2км 147 | 3км 148 | 4км 149 | 5км 150 | 6км 151 | 7км 152 | 8км 153 | 9км 154 | 0га 155 | 1га 156 | 2га 157 | 3га 158 | 4га 159 | 5га 160 | 6га 161 | 7га 162 | 8га 163 | 9га 164 | 0кг 165 | 1кг 166 | 2кг 167 | 3кг 168 | 4кг 169 | 5кг 170 | 6кг 171 | 7кг 172 | 8кг 173 | 9кг 174 | 0т 175 | 1т 176 | 2т 177 | 3т 178 | 4т 179 | 5т 180 | 6т 181 | 7т 182 | 8т 183 | 9т 184 | 0г 185 | 1г 186 | 2г 187 | 3г 188 | 4г 189 | 5г 190 | 6г 191 | 7г 192 | 8г 193 | 9г 194 | 0мг 195 | 1мг 196 | 2мг 197 | 3мг 198 | 4мг 199 | 5мг 200 | 6мг 201 | 7мг 202 | 8мг 203 | 9мг 204 | бульв 205 | в 206 | вв 207 | г 208 | га 209 | гг 210 | гл 211 | гос 212 | д 213 | дм 214 | доп 215 | др 216 | е 217 | ед 218 | ед 219 | зам 220 | и 221 | инд 222 | исп 223 | Исп 224 | к 225 | кап 226 | кг 227 | кв 228 | кл 229 | км 230 | кол 231 | комн 232 | коп 233 | куб 234 | л 235 | лиц 236 | лл 237 | м 238 | макс 239 | мг 240 | мин 241 | мл 242 | млн 243 | млрд 244 | мм 245 | н 246 | наб 247 | нач 248 | неуд 249 | ном 250 | о 251 | обл 252 | обр 253 | общ 254 | ок 255 | ост 256 | отл 257 | п 258 | пер 259 | перераб 260 | пл 261 | пос 262 | пр 263 | просп 264 | проф 265 | р 266 | ред 267 | руб 268 | с 269 | сб 270 | св 271 | см 272 | соч 273 | ср 274 | ст 275 | стр 276 | т 277 | тел 278 | Тел 279 | тех 280 | тт 281 | туп 282 | тыс 283 | уд 284 | ул 285 | уч 286 | физ 287 | х 288 | хор 289 | ч 290 | чел 291 | шт 292 | экз 293 | э 294 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.sk: -------------------------------------------------------------------------------- 1 | Bc 2 | Mgr 3 | RNDr 4 | PharmDr 5 | PhDr 6 | JUDr 7 | PaedDr 8 | ThDr 9 | Ing 10 | MUDr 11 | MDDr 12 | MVDr 13 | Dr 14 | ThLic 15 | PhD 16 | ArtD 17 | ThDr 18 | Dr 19 | DrSc 20 | CSs 21 | prof 22 | obr 23 | Obr 24 | Č 25 | č 26 | absol 27 | adj 28 | admin 29 | adr 30 | Adr 31 | adv 32 | advok 33 | afr 34 | ak 35 | akad 36 | akc 37 | akuz 38 | et 39 | al 40 | alch 41 | amer 42 | anat 43 | angl 44 | Angl 45 | anglosas 46 | anorg 47 | ap 48 | apod 49 | arch 50 | archeol 51 | archit 52 | arg 53 | art 54 | astr 55 | astrol 56 | astron 57 | atp 58 | atď 59 | austr 60 | Austr 61 | aut 62 | belg 63 | Belg 64 | bibl 65 | Bibl 66 | biol 67 | bot 68 | bud 69 | bás 70 | býv 71 | cest 72 | chem 73 | cirk 74 | csl 75 | čs 76 | Čs 77 | dat 78 | dep 79 | det 80 | dial 81 | diaľ 82 | dipl 83 | distrib 84 | dokl 85 | dosl 86 | dopr 87 | dram 88 | duš 89 | dv 90 | dvojčl 91 | dór 92 | ekol 93 | ekon 94 | el 95 | elektr 96 | elektrotech 97 | energet 98 | epic 99 | est 100 | etc 101 | etonym 102 | eufem 103 | európ 104 | Európ 105 | ev 106 | evid 107 | expr 108 | fa 109 | fam 110 | farm 111 | fem 112 | feud 113 | fil 114 | filat 115 | filoz 116 | fi 117 | fon 118 | form 119 | fot 120 | fr 121 | Fr 122 | franc 123 | Franc 124 | fraz 125 | fut 126 | fyz 127 | fyziol 128 | garb 129 | gen 130 | genet 131 | genpor 132 | geod 133 | geogr 134 | geol 135 | geom 136 | germ 137 | gr 138 | Gr 139 | gréc 140 | Gréc 141 | gréckokat 142 | hebr 143 | herald 144 | hist 145 | hlav 146 | hosp 147 | hromad 148 | hud 149 | hypok 150 | ident 151 | i.e 152 | ident 153 | imp 154 | impf 155 | indoeur 156 | inf 157 | inform 158 | instr 159 | int 160 | interj 161 | inšt 162 | inštr 163 | iron 164 | jap 165 | Jap 166 | jaz 167 | jedn 168 | juhoamer 169 | juhových 170 | 
juhozáp 171 | juž 172 | kanad 173 | Kanad 174 | kanc 175 | kapit 176 | kpt 177 | kart 178 | katastr 179 | knih 180 | kniž 181 | komp 182 | konj 183 | konkr 184 | kozmet 185 | krajč 186 | kresť 187 | kt 188 | kuch 189 | lat 190 | latinskoamer 191 | lek 192 | lex 193 | lingv 194 | lit 195 | litur 196 | log 197 | lok 198 | max 199 | Max 200 | maď 201 | Maď 202 | medzinár 203 | mest 204 | metr 205 | mil 206 | Mil 207 | min 208 | Min 209 | miner 210 | ml 211 | mld 212 | mn 213 | mod 214 | mytol 215 | napr 216 | nar 217 | Nar 218 | nasl 219 | nedok 220 | neg 221 | negat 222 | neklas 223 | nem 224 | Nem 225 | neodb 226 | neos 227 | neskl 228 | nesklon 229 | nespis 230 | nespráv 231 | neved 232 | než 233 | niekt 234 | niž 235 | nom 236 | náb 237 | nákl 238 | námor 239 | nár 240 | obch 241 | obj 242 | obv 243 | obyč 244 | obč 245 | občian 246 | odb 247 | odd 248 | ods 249 | ojed 250 | okr 251 | Okr 252 | opt 253 | opyt 254 | org 255 | os 256 | osob 257 | ot 258 | ovoc 259 | par 260 | part 261 | pejor 262 | pers 263 | pf 264 | Pf 265 | P.f 266 | p.f 267 | pl 268 | Plk 269 | pod 270 | podst 271 | pokl 272 | polit 273 | politol 274 | polygr 275 | pomn 276 | popl 277 | por 278 | porad 279 | porov 280 | posch 281 | potrav 282 | použ 283 | poz 284 | pozit 285 | poľ 286 | poľno 287 | poľnohosp 288 | poľov 289 | pošt 290 | pož 291 | prac 292 | predl 293 | pren 294 | prep 295 | preuk 296 | priezv 297 | Priezv 298 | privl 299 | prof 300 | práv 301 | príd 302 | príj 303 | prík 304 | príp 305 | prír 306 | prísl 307 | príslov 308 | príč 309 | psych 310 | publ 311 | pís 312 | písm 313 | pôv 314 | refl 315 | reg 316 | rep 317 | resp 318 | rozk 319 | rozlič 320 | rozpráv 321 | roč 322 | Roč 323 | ryb 324 | rádiotech 325 | rím 326 | samohl 327 | semest 328 | sev 329 | severoamer 330 | severových 331 | severozáp 332 | sg 333 | skr 334 | skup 335 | sl 336 | Sloven 337 | soc 338 | soch 339 | sociol 340 | sp 341 | spol 342 | Spol 343 | spoloč 344 | spoluhl 345 | správ 346 | spôs 347 | st 348 | star 349 | starogréc 350 | starorím 351 | s.r.o 352 | stol 353 | stor 354 | str 355 | stredoamer 356 | stredoškol 357 | subj 358 | subst 359 | superl 360 | sv 361 | sz 362 | súkr 363 | súp 364 | súvzť 365 | tal 366 | Tal 367 | tech 368 | tel 369 | Tel 370 | telef 371 | teles 372 | telev 373 | teol 374 | trans 375 | turist 376 | tuzem 377 | typogr 378 | tzn 379 | tzv 380 | ukaz 381 | ul 382 | Ul 383 | umel 384 | univ 385 | ust 386 | ved 387 | vedľ 388 | verb 389 | veter 390 | vin 391 | viď 392 | vl 393 | vod 394 | vodohosp 395 | pnl 396 | vulg 397 | vyj 398 | vys 399 | vysokoškol 400 | vzťaž 401 | vôb 402 | vých 403 | výd 404 | výrob 405 | výsk 406 | výsl 407 | výtv 408 | výtvar 409 | význ 410 | včel 411 | vš 412 | všeob 413 | zahr 414 | zar 415 | zariad 416 | zast 417 | zastar 418 | zastaráv 419 | zb 420 | zdravot 421 | združ 422 | zjemn 423 | zlat 424 | zn 425 | Zn 426 | zool 427 | zr 428 | zried 429 | zv 430 | záhr 431 | zák 432 | zákl 433 | zám 434 | záp 435 | západoeur 436 | zázn 437 | územ 438 | účt 439 | čast 440 | čes 441 | Čes 442 | čl 443 | čísl 444 | živ 445 | pr 446 | fak 447 | Kr 448 | p.n.l 449 | A 450 | B 451 | C 452 | D 453 | E 454 | F 455 | G 456 | H 457 | I 458 | J 459 | K 460 | L 461 | M 462 | N 463 | O 464 | P 465 | Q 466 | R 467 | S 468 | T 469 | U 470 | V 471 | W 472 | X 473 | Y 474 | Z 475 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.sl: -------------------------------------------------------------------------------- 1 | dr 
2 | Dr 3 | itd 4 | itn 5 | št #NUMERIC_ONLY# 6 | Št #NUMERIC_ONLY# 7 | d 8 | jan 9 | Jan 10 | feb 11 | Feb 12 | mar 13 | Mar 14 | apr 15 | Apr 16 | jun 17 | Jun 18 | jul 19 | Jul 20 | avg 21 | Avg 22 | sept 23 | Sept 24 | sep 25 | Sep 26 | okt 27 | Okt 28 | nov 29 | Nov 30 | dec 31 | Dec 32 | tj 33 | Tj 34 | npr 35 | Npr 36 | sl 37 | Sl 38 | op 39 | Op 40 | gl 41 | Gl 42 | oz 43 | Oz 44 | prev 45 | dipl 46 | ing 47 | prim 48 | Prim 49 | cf 50 | Cf 51 | gl 52 | Gl 53 | A 54 | B 55 | C 56 | D 57 | E 58 | F 59 | G 60 | H 61 | I 62 | J 63 | K 64 | L 65 | M 66 | N 67 | O 68 | P 69 | Q 70 | R 71 | S 72 | T 73 | U 74 | V 75 | W 76 | X 77 | Y 78 | Z 79 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.sv: -------------------------------------------------------------------------------- 1 | #single upper case letter are usually initials 2 | A 3 | B 4 | C 5 | D 6 | E 7 | F 8 | G 9 | H 10 | I 11 | J 12 | K 13 | L 14 | M 15 | N 16 | O 17 | P 18 | Q 19 | R 20 | S 21 | T 22 | U 23 | V 24 | W 25 | X 26 | Y 27 | Z 28 | #misc abbreviations 29 | AB 30 | G 31 | VG 32 | dvs 33 | etc 34 | from 35 | iaf 36 | jfr 37 | kl 38 | kr 39 | mao 40 | mfl 41 | mm 42 | osv 43 | pga 44 | tex 45 | tom 46 | vs 47 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.ta: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | அ 7 | ஆ 8 | இ 9 | ஈ 10 | உ 11 | ஊ 12 | எ 13 | ஏ 14 | ஐ 15 | ஒ 16 | ஓ 17 | ஔ 18 | ஃ 19 | க 20 | கா 21 | கி 22 | கீ 23 | கு 24 | கூ 25 | கெ 26 | கே 27 | கை 28 | கொ 29 | கோ 30 | கௌ 31 | க் 32 | ச 33 | சா 34 | சி 35 | சீ 36 | சு 37 | சூ 38 | செ 39 | சே 40 | சை 41 | சொ 42 | சோ 43 | சௌ 44 | ச் 45 | ட 46 | டா 47 | டி 48 | டீ 49 | டு 50 | டூ 51 | டெ 52 | டே 53 | டை 54 | டொ 55 | டோ 56 | டௌ 57 | ட் 58 | த 59 | தா 60 | தி 61 | தீ 62 | து 63 | தூ 64 | தெ 65 | தே 66 | தை 67 | தொ 68 | தோ 69 | தௌ 70 | த் 71 | ப 72 | பா 73 | பி 74 | பீ 75 | பு 76 | பூ 77 | பெ 78 | பே 79 | பை 80 | பொ 81 | போ 82 | பௌ 83 | ப் 84 | ற 85 | றா 86 | றி 87 | றீ 88 | று 89 | றூ 90 | றெ 91 | றே 92 | றை 93 | றொ 94 | றோ 95 | றௌ 96 | ற் 97 | ய 98 | யா 99 | யி 100 | யீ 101 | யு 102 | யூ 103 | யெ 104 | யே 105 | யை 106 | யொ 107 | யோ 108 | யௌ 109 | ய் 110 | ர 111 | ரா 112 | ரி 113 | ரீ 114 | ரு 115 | ரூ 116 | ரெ 117 | ரே 118 | ரை 119 | ரொ 120 | ரோ 121 | ரௌ 122 | ர் 123 | ல 124 | லா 125 | லி 126 | லீ 127 | லு 128 | லூ 129 | லெ 130 | லே 131 | லை 132 | லொ 133 | லோ 134 | லௌ 135 | ல் 136 | வ 137 | வா 138 | வி 139 | வீ 140 | வு 141 | வூ 142 | வெ 143 | வே 144 | வை 145 | வொ 146 | வோ 147 | வௌ 148 | வ் 149 | ள 150 | ளா 151 | ளி 152 | ளீ 153 | ளு 154 | ளூ 155 | ளெ 156 | ளே 157 | ளை 158 | ளொ 159 | ளோ 160 | ளௌ 161 | ள் 162 | ழ 163 | ழா 164 | ழி 165 | ழீ 166 | ழு 167 | ழூ 168 | ழெ 169 | ழே 170 | ழை 171 | ழொ 172 | ழோ 173 | ழௌ 174 | ழ் 175 | ங 176 | ஙா 177 | ஙி 178 | ஙீ 179 | ஙு 180 | ஙூ 181 | ஙெ 182 | ஙே 183 | ஙை 184 | ஙொ 185 | ஙோ 186 | ஙௌ 187 | ங் 188 | ஞ 189 | ஞா 190 | ஞி 191 | ஞீ 192 | ஞு 193 | ஞூ 194 | ஞெ 195 | ஞே 196 | ஞை 197 | ஞொ 198 | ஞோ 199 | ஞௌ 200 | ஞ் 201 | ண 202 | ணா 203 | ணி 204 | ணீ 205 | ணு 
206 | ணூ 207 | ணெ 208 | ணே 209 | ணை 210 | ணொ 211 | ணோ 212 | ணௌ 213 | ண் 214 | ந 215 | நா 216 | நி 217 | நீ 218 | நு 219 | நூ 220 | நெ 221 | நே 222 | நை 223 | நொ 224 | நோ 225 | நௌ 226 | ந் 227 | ம 228 | மா 229 | மி 230 | மீ 231 | மு 232 | மூ 233 | மெ 234 | மே 235 | மை 236 | மொ 237 | மோ 238 | மௌ 239 | ம் 240 | ன 241 | னா 242 | னி 243 | னீ 244 | னு 245 | னூ 246 | னெ 247 | னே 248 | னை 249 | னொ 250 | னோ 251 | னௌ 252 | ன் 253 | 254 | 255 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 256 | திரு 257 | திருமதி 258 | வண 259 | கௌரவ 260 | 261 | 262 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 263 | உ.ம் 264 | #கா.ம் 265 | #எ.ம் 266 | 267 | 268 | #Numbers only. These should only induce breaks when followed by a numeric sequence 269 | # add NUMERIC_ONLY after the word for this function 270 | #This case is mostly for the english "No." which can either be a sentence of its own, or 271 | #if followed by a number, a non-breaking prefix 272 | No #NUMERIC_ONLY# 273 | Nos 274 | Art #NUMERIC_ONLY# 275 | Nr 276 | pp #NUMERIC_ONLY# 277 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.yue: -------------------------------------------------------------------------------- 1 | # 2 | # Cantonese (Chinese) 3 | # 4 | # Anything in this file, followed by a period, 5 | # does NOT indicate an end-of-sentence marker. 6 | # 7 | # English/Euro-language given-name initials (appearing in 8 | # news, periodicals, etc.) 9 | A 10 | Ā 11 | B 12 | C 13 | Č 14 | D 15 | E 16 | Ē 17 | F 18 | G 19 | Ģ 20 | H 21 | I 22 | Ī 23 | J 24 | K 25 | Ķ 26 | L 27 | Ļ 28 | M 29 | N 30 | Ņ 31 | O 32 | P 33 | Q 34 | R 35 | S 36 | Š 37 | T 38 | U 39 | Ū 40 | V 41 | W 42 | X 43 | Y 44 | Z 45 | Ž 46 | 47 | # Numbers only. These should only induce breaks when followed by 48 | # a numeric sequence. 49 | # Add NUMERIC_ONLY after the word for this function. This case is 50 | # mostly for the english "No." which can either be a sentence of its 51 | # own, or if followed by a number, a non-breaking prefix. 52 | No #NUMERIC_ONLY# 53 | Nr #NUMERIC_ONLY# 54 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.zh: -------------------------------------------------------------------------------- 1 | # 2 | # Mandarin (Chinese) 3 | # 4 | # Anything in this file, followed by a period, 5 | # does NOT indicate an end-of-sentence marker. 6 | # 7 | # English/Euro-language given-name initials (appearing in 8 | # news, periodicals, etc.) 9 | A 10 | Ā 11 | B 12 | C 13 | Č 14 | D 15 | E 16 | Ē 17 | F 18 | G 19 | Ģ 20 | H 21 | I 22 | Ī 23 | J 24 | K 25 | Ķ 26 | L 27 | Ļ 28 | M 29 | N 30 | Ņ 31 | O 32 | P 33 | Q 34 | R 35 | S 36 | Š 37 | T 38 | U 39 | Ū 40 | V 41 | W 42 | X 43 | Y 44 | Z 45 | Ž 46 | 47 | # Numbers only. These should only induce breaks when followed by 48 | # a numeric sequence. 49 | # Add NUMERIC_ONLY after the word for this function. This case is 50 | # mostly for the english "No." which can either be a sentence of its 51 | # own, or if followed by a number, a non-breaking prefix. 
52 | No #NUMERIC_ONLY# 53 | Nr #NUMERIC_ONLY# 54 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import numpy as np 4 | import random 5 | import time 6 | 7 | import torch 8 | import torch.nn as nn 9 | from torch import cuda 10 | from torch.autograd import Variable 11 | 12 | import lib 13 | 14 | parser = argparse.ArgumentParser(description="train.py") 15 | 16 | ## Data options 17 | parser.add_argument("-data", required=True, 18 | help="Path to the *-train.pt file from preprocess.py") 19 | parser.add_argument("-save_dir", required=True, 20 | help="Directory to save models") 21 | parser.add_argument("-load_from", help="Path to load a pretrained model.") 22 | 23 | ## Model options 24 | 25 | parser.add_argument("-layers", type=int, default=1, 26 | help="Number of layers in the LSTM encoder/decoder") 27 | parser.add_argument("-rnn_size", type=int, default=500, 28 | help="Size of LSTM hidden states") 29 | parser.add_argument("-word_vec_size", type=int, default=500, 30 | help="Size of word embeddings") 31 | parser.add_argument("-input_feed", type=int, default=1, 32 | help="""Feed the context vector at each time step as 33 | additional input (via concatenation with the word 34 | embeddings) to the decoder.""") 35 | parser.add_argument("-brnn", action="store_true", 36 | help="Use a bidirectional encoder") 37 | parser.add_argument("-brnn_merge", default="concat", 38 | help="""Merge action for the bidirectional hidden states: 39 | [concat|sum]""") 40 | 41 | ## Optimization options 42 | 43 | parser.add_argument("-batch_size", type=int, default=64, 44 | help="Maximum batch size") 45 | parser.add_argument("-max_generator_batches", type=int, default=32, 46 | help="""Split softmax input into small batches for memory efficiency. 47 | Higher is faster, but uses more memory.""") 48 | parser.add_argument("-end_epoch", type=int, default=50, 49 | help="Epoch to stop training.") 50 | parser.add_argument("-start_epoch", type=int, default=1, 51 | help="Epoch to start training.") 52 | parser.add_argument("-param_init", type=float, default=0.1, 53 | help="""Parameters are initialized over uniform distribution 54 | with support (-param_init, param_init)""") 55 | parser.add_argument("-optim", default="adam", 56 | help="Optimization method. 
[sgd|adagrad|adadelta|adam]") 57 | parser.add_argument("-lr", type=float, default=1e-3, 58 | help="Initial learning rate") 59 | parser.add_argument("-max_grad_norm", type=float, default=5, 60 | help="""If the norm of the gradient vector exceeds this, 61 | renormalize it to have the norm equal to max_grad_norm""") 62 | parser.add_argument("-dropout", type=float, default=0, 63 | help="Dropout probability; applied between LSTM stacks.") 64 | parser.add_argument("-learning_rate_decay", type=float, default=0.5, 65 | help="""Decay learning rate by this much if (i) perplexity 66 | does not decrease on the validation set or (ii) epoch has 67 | gone past the start_decay_at_limit""") 68 | parser.add_argument("-start_decay_at", type=int, default=5, 69 | help="Start decay after this epoch") 70 | 71 | # GPU 72 | parser.add_argument("-gpus", default=[0], nargs="+", type=int, 73 | help="Use CUDA") 74 | parser.add_argument("-log_interval", type=int, default=100, 75 | help="Print stats at this interval.") 76 | parser.add_argument("-seed", type=int, default=3435, 77 | help="Seed for random initialization") 78 | 79 | # Critic 80 | parser.add_argument("-start_reinforce", type=int, default=None, 81 | help="""Epoch to start reinforcement training. 82 | Use -1 to start immediately.""") 83 | parser.add_argument("-critic_pretrain_epochs", type=int, default=0, 84 | help="Number of epochs to pretrain critic (actor fixed).") 85 | parser.add_argument("-reinforce_lr", type=float, default=1e-4, 86 | help="""Learning rate for reinforcement training.""") 87 | 88 | # Evaluation 89 | parser.add_argument("-eval", action="store_true", help="Evaluate model only") 90 | parser.add_argument("-eval_sample", action="store_true", default=False, 91 | help="Eval by sampling") 92 | parser.add_argument("-max_predict_length", type=int, default=80, 93 | help="Maximum length of predictions.") 94 | 95 | 96 | # Reward shaping 97 | parser.add_argument("-pert_func", type=str, default=None, 98 | help="Reward-shaping function.") 99 | parser.add_argument("-pert_param", type=float, default=None, 100 | help="Reward-shaping parameter.") 101 | 102 | # Others 103 | parser.add_argument("-no_update", action="store_true", default=False, 104 | help="No update round. 
Use to evaluate model samples.") 105 | parser.add_argument("-sup_train_on_bandit", action="store_true", default=False, 106 | help="Supervised learning update round.") 107 | 108 | opt = parser.parse_args() 109 | print(opt) 110 | 111 | # Set seed 112 | torch.manual_seed(opt.seed) 113 | np.random.seed(opt.seed) 114 | random.seed(opt.seed) 115 | 116 | opt.cuda = len(opt.gpus) 117 | 118 | if opt.save_dir and not os.path.exists(opt.save_dir): 119 | os.makedirs(opt.save_dir) 120 | 121 | if torch.cuda.is_available() and not opt.cuda: 122 | print("WARNING: You have a CUDA device, so you should probably run with -gpus 1") 123 | 124 | if opt.cuda: 125 | cuda.set_device(opt.gpus[0]) 126 | torch.cuda.manual_seed(opt.seed) 127 | 128 | def init(model): 129 | for p in model.parameters(): 130 | p.data.uniform_(-opt.param_init, opt.param_init) 131 | 132 | def create_optim(model): 133 | optim = lib.Optim( 134 | model.parameters(), opt.optim, opt.lr, opt.max_grad_norm, 135 | lr_decay=opt.learning_rate_decay, start_decay_at=opt.start_decay_at 136 | ) 137 | return optim 138 | 139 | def create_model(model_class, dicts, gen_out_size): 140 | encoder = lib.Encoder(opt, dicts["src"]) 141 | decoder = lib.Decoder(opt, dicts["tgt"]) 142 | # Use memory efficient generator when output size is large and 143 | # max_generator_batches is smaller than batch_size. 144 | if opt.max_generator_batches < opt.batch_size and gen_out_size > 1: 145 | generator = lib.MemEfficientGenerator(nn.Linear(opt.rnn_size, gen_out_size), opt) 146 | else: 147 | generator = lib.BaseGenerator(nn.Linear(opt.rnn_size, gen_out_size), opt) 148 | model = model_class(encoder, decoder, generator, opt) 149 | init(model) 150 | optim = create_optim(model) 151 | return model, optim 152 | 153 | def create_critic(checkpoint, dicts, opt): 154 | if opt.load_from is not None and "critic" in checkpoint: 155 | critic = checkpoint["critic"] 156 | critic_optim = checkpoint["critic_optim"] 157 | else: 158 | critic, critic_optim = create_model(lib.NMTModel, dicts, 1) 159 | if opt.cuda: 160 | critic.cuda(opt.gpus[0]) 161 | return critic, critic_optim 162 | 163 | def main(): 164 | 165 | print('Loading data from "%s"' % opt.data) 166 | 167 | dataset = torch.load(opt.data) 168 | 169 | supervised_data = lib.Dataset(dataset["train_xe"], opt.batch_size, opt.cuda, eval=False) 170 | bandit_data = lib.Dataset(dataset["train_pg"], opt.batch_size, opt.cuda, eval=False) 171 | valid_data = lib.Dataset(dataset["valid"], opt.batch_size, opt.cuda, eval=True) 172 | test_data = lib.Dataset(dataset["test"], opt.batch_size, opt.cuda, eval=True) 173 | 174 | dicts = dataset["dicts"] 175 | print(" * vocabulary size. source = %d; target = %d" % 176 | (dicts["src"].size(), dicts["tgt"].size())) 177 | print(" * number of XENT training sentences. %d" % 178 | len(dataset["train_xe"]["src"])) 179 | print(" * number of PG training sentences. %d" % 180 | len(dataset["train_pg"]["src"])) 181 | print(" * maximum batch size. %d" % opt.batch_size) 182 | print("Building model...") 183 | 184 | use_critic = opt.start_reinforce is not None 185 | 186 | if opt.load_from is None: 187 | model, optim = create_model(lib.NMTModel, dicts, dicts["tgt"].size()) 188 | checkpoint = None 189 | else: 190 | print("Loading from checkpoint at %s" % opt.load_from) 191 | checkpoint = torch.load(opt.load_from) 192 | model = checkpoint["model"] 193 | optim = checkpoint["optim"] 194 | opt.start_epoch = checkpoint["epoch"] + 1 195 | 196 | # GPU. 
197 | if opt.cuda: 198 | model.cuda(opt.gpus[0]) 199 | 200 | # Start reinforce training immediately. 201 | if opt.start_reinforce == -1: 202 | opt.start_decay_at = opt.start_epoch 203 | opt.start_reinforce = opt.start_epoch 204 | 205 | # Check if end_epoch is large enough. 206 | if use_critic: 207 | assert opt.start_epoch + opt.critic_pretrain_epochs - 1 <= \ 208 | opt.end_epoch, "Please increase -end_epoch to perform pretraining!" 209 | 210 | nParams = sum([p.nelement() for p in model.parameters()]) 211 | print("* number of parameters: %d" % nParams) 212 | 213 | # Metrics. 214 | metrics = {} 215 | metrics["nmt_loss"] = lib.Loss.weighted_xent_loss 216 | metrics["critic_loss"] = lib.Loss.weighted_mse 217 | metrics["sent_reward"] = lib.Reward.sentence_bleu 218 | metrics["corp_reward"] = lib.Reward.corpus_bleu 219 | if opt.pert_func is not None: 220 | opt.pert_func = lib.PertFunction(opt.pert_func, opt.pert_param) 221 | 222 | 223 | # Evaluate model on heldout dataset. 224 | if opt.eval: 225 | evaluator = lib.Evaluator(model, metrics, dicts, opt) 226 | # On validation set. 227 | pred_file = opt.load_from.replace(".pt", ".valid.pred") 228 | evaluator.eval(valid_data, pred_file) 229 | # On test set. 230 | pred_file = opt.load_from.replace(".pt", ".test.pred") 231 | evaluator.eval(test_data, pred_file) 232 | elif opt.eval_sample: 233 | opt.no_update = True 234 | critic, critic_optim = create_critic(checkpoint, dicts, opt) 235 | reinforce_trainer = lib.ReinforceTrainer(model, critic, bandit_data, test_data, 236 | metrics, dicts, optim, critic_optim, opt) 237 | reinforce_trainer.train(opt.start_epoch, opt.start_epoch, False) 238 | elif opt.sup_train_on_bandit: 239 | optim.set_lr(opt.reinforce_lr) 240 | xent_trainer = lib.Trainer(model, bandit_data, test_data, metrics, dicts, optim, opt) 241 | xent_trainer.train(opt.start_epoch, opt.start_epoch) 242 | else: 243 | print("Starting supervised training...") 244 | xent_trainer = lib.Trainer(model, supervised_data, valid_data, metrics, dicts, optim, opt) 245 | if use_critic: 246 | start_time = time.time() 247 | # Supervised training. 248 | xent_trainer.train(opt.start_epoch, opt.start_reinforce - 1, start_time) 249 | # Create critic here to not affect random seed. 250 | critic, critic_optim = create_critic(checkpoint, dicts, opt) 251 | # Pretrain critic. 252 | if opt.critic_pretrain_epochs > 0: 253 | reinforce_trainer = lib.ReinforceTrainer(model, critic, supervised_data, test_data, 254 | metrics, dicts, optim, critic_optim, opt) 255 | reinforce_trainer.train(opt.start_reinforce, 256 | opt.start_reinforce + opt.critic_pretrain_epochs - 1, True, start_time) 257 | # Reinforce training. 258 | reinforce_trainer = lib.ReinforceTrainer(model, critic, bandit_data, test_data, 259 | metrics, dicts, optim, critic_optim, opt) 260 | reinforce_trainer.train(opt.start_reinforce + opt.critic_pretrain_epochs, opt.end_epoch, 261 | False, start_time) 262 | # Supervised training only. 
263 | else: 264 | xent_trainer.train(opt.start_epoch, opt.end_epoch) 265 | 266 | 267 | if __name__ == "__main__": 268 | main() 269 | -------------------------------------------------------------------------------- /translate.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import numpy as np 4 | import random 5 | import time 6 | 7 | import torch 8 | import torch.nn as nn 9 | import torch.nn.parallel 10 | from torch import cuda 11 | from torch.autograd import Variable 12 | 13 | import lib 14 | 15 | parser = argparse.ArgumentParser() 16 | 17 | ## Data options 18 | parser.add_argument("-data", required=True, 19 | help="Path to the *-train.pt file from preprocess.py") 20 | parser.add_argument("-batch_size", default=32, help="Batch Size") 21 | parser.add_argument("-save_dir", help="Directory to save predictions") 22 | parser.add_argument("-load_from", required=True, help="Path to load a trained model.") 23 | parser.add_argument("-test_src", required=True, help="Path to the file to be translated.") 24 | 25 | # GPU 26 | parser.add_argument("-gpus", default=[0], nargs="+", type=int, 27 | help="Use CUDA") 28 | parser.add_argument("-log_interval", type=int, default=100, 29 | help="Print stats at this interval.") 30 | parser.add_argument("-seed", type=int, default=3435, 31 | help="Seed for random initialization") 32 | 33 | opt = parser.parse_args() 34 | print(opt) 35 | 36 | # Set seed 37 | torch.manual_seed(opt.seed) 38 | np.random.seed(opt.seed) 39 | random.seed(opt.seed) 40 | 41 | opt.cuda = len(opt.gpus) 42 | 43 | if opt.save_dir and not os.path.exists(opt.save_dir): 44 | os.makedirs(opt.save_dir) 45 | 46 | if torch.cuda.is_available() and not opt.cuda: 47 | print("WARNING: You have a CUDA device, so you should probably run with -gpus 1") 48 | 49 | #device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 50 | 51 | if opt.cuda: 52 | cuda.set_device(opt.gpus[0]) 53 | torch.cuda.manual_seed(opt.seed) 54 | 55 | def makeTestData(srcFile,dicts): 56 | print("Processing %s ..." % srcFile) 57 | srcF = open(srcFile,'r') 58 | text = srcF.read() 59 | srcF.close() 60 | lines=text.strip().split('\n') 61 | src=[] 62 | tgt=[] 63 | srcDicts = dicts["src"] 64 | count=0 65 | for line in lines: 66 | srcWords = line.split() 67 | src += [srcDicts.convertToIdx(srcWords, 68 | lib.Constants.UNK_WORD)] 69 | count += 1 70 | print("... 
%d sentences prepared for testing" % count) 71 | tgt=src # no reference translations at test time; reuse the source as a placeholder target 72 | return src,tgt,range(len(src)) 73 | 74 | def predict(model,dicts,data,pred_file): 75 | model.eval() 76 | all_preds=[] 77 | max_length=50 78 | for i in range(len(data)): 79 | batch=data[i] 80 | targets=batch[1] 81 | attention_mask=batch[0][0].data.eq(lib.Constants.PAD).t() 82 | model.decoder.attn.applyMask(attention_mask) 83 | preds = model.translate(batch, max_length) 84 | preds = preds.t().tolist() 85 | targets=targets.data.t().tolist() 86 | # Hack: restore the original sentence order using the indices stored in the batch 87 | indices=batch[2] 88 | new_batch=zip(preds,targets) 89 | new_batch,indices=zip(*sorted(zip(new_batch,indices),key=lambda x: x[1])) 90 | preds,targets=zip(*new_batch) 91 | ### 92 | all_preds.extend(preds) 93 | 94 | with open(pred_file, "w") as f: 95 | for sent in all_preds: 96 | sent = lib.Reward.clean_up_sentence(sent, remove_unk=False, remove_eos=True) 97 | sent = [dicts["tgt"].getLabel(w) for w in sent] 98 | x=" ".join(sent)+'\n' 99 | f.write(x) 100 | 101 | 102 | def main(): 103 | print('Loading train data from "%s"' % opt.data) 104 | 105 | dataset = torch.load(opt.data) 106 | dicts = dataset["dicts"] 107 | 108 | if opt.load_from is None: 109 | print("REQUIRES PATH TO THE TRAINED MODEL\n") 110 | else: 111 | print("Loading from checkpoint at %s" % opt.load_from) 112 | checkpoint = torch.load(opt.load_from) 113 | model = checkpoint["model"] 114 | optim = checkpoint["optim"] 115 | 116 | # GPU. 117 | if opt.cuda: 118 | model.cuda(opt.gpus[0]) 119 | #model=torch.nn.DataParallel(model) 120 | #torch.distributed.init_process_group(backend='tcp',rank=0,world_size=2) 121 | #model = torch.nn.parallel.DistributedDataParallel(model) 122 | 123 | 124 | # Generating Translations for test set 125 | print('Creating test data\n') 126 | src,tgt,pos=makeTestData(opt.test_src,dicts) 127 | res={} 128 | res["src"]=src 129 | res["tgt"]=tgt 130 | res["pos"]=pos 131 | test_data = lib.Dataset(res, opt.batch_size, opt.cuda, eval=False) 132 | pred_file = opt.test_src+".pred" 133 | predict(model,dicts,test_data,pred_file) 134 | print('Generated translations successfully\n') 135 | 136 | if __name__ == "__main__": 137 | main() 138 | --------------------------------------------------------------------------------
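For quick reference, a training run is driven entirely by the flags defined in train.py above: supervised cross-entropy training runs up to -start_reinforce, the critic is then pretrained for -critic_pretrain_epochs epochs, and actor-critic (REINFORCE) training continues until -end_epoch. The sketch below is a hypothetical invocation; the dataset and model paths are placeholders and the epoch numbers are illustrative, not recommended settings.

```bash
# Hypothetical invocation; the paths are placeholders, not files shipped with the repo.
# Epochs 1-10:  supervised (cross-entropy) training of the actor
# Epoch  11:    critic pretraining (actor fixed)
# Epochs 12-20: advantage actor-critic (REINFORCE) training
python train.py \
    -data data/de-en/de-en-train.pt \
    -save_dir models/de-en \
    -brnn \
    -end_epoch 20 \
    -start_reinforce 11 \
    -critic_pretrain_epochs 1 \
    -gpus 0
```

Passing -start_reinforce -1 starts actor-critic training immediately from the first (or loaded) epoch, and -eval switches the script into evaluation-only mode on the validation and test sets of the preprocessed dataset.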
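translate.py follows the same conventions: it loads the preprocessed *-train.pt file only for its dictionaries, takes a trained checkpoint and a tokenized source file, and writes the predictions next to the input as <test_src>.pred. The command below is a sketch with placeholder paths; the checkpoint name is hypothetical.

```bash
# Hypothetical invocation; the checkpoint and test file names are placeholders.
python translate.py \
    -data data/de-en/de-en-train.pt \
    -load_from models/de-en/model_20.pt \
    -test_src data/de-en/test.de-en.de.processed \
    -gpus 0
# Predictions are written to data/de-en/test.de-en.de.processed.pred
```

Note that -batch_size is declared without type=int, so a value passed on the command line arrives as a string; the default of 32 is the safe choice unless that flag is adjusted in the script.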