├── README.md ├── lib ├── __init__.py ├── __init__.pyc ├── data │ ├── Constants.py │ ├── Constants.pyc │ ├── Dataset.py │ ├── Dataset.pyc │ ├── Dict.py │ ├── Dict.pyc │ ├── __init__.py │ └── __init__.pyc ├── eval │ ├── Evaluator.py │ ├── Evaluator.pyc │ ├── __init__.py │ └── __init__.pyc ├── metric │ ├── Bleu.py │ ├── Bleu.pyc │ ├── Loss.py │ ├── Loss.pyc │ ├── PertFunction.py │ ├── PertFunction.pyc │ ├── Reward.py │ ├── Reward.pyc │ ├── __init__.py │ ├── __init__.pyc │ └── test_shaping.py ├── model │ ├── EncoderDecoder.py │ ├── EncoderDecoder.pyc │ ├── Generator.py │ ├── Generator.pyc │ ├── GlobalAttention.py │ ├── GlobalAttention.pyc │ ├── __init__.py │ └── __init__.pyc └── train │ ├── Optim.py │ ├── Optim.pyc │ ├── ReinforceTrainer.py │ ├── ReinforceTrainer.pyc │ ├── Trainer.py │ ├── Trainer.pyc │ ├── __init__.py │ └── __init__.pyc ├── preprocess.py ├── requirements.txt ├── scripts ├── extract_parallel.py ├── lowercase.perl ├── multi-bleu.perl ├── output.py ├── parse.py ├── prepare_data.sh ├── preprocess.py ├── sgm.perl ├── strip.py ├── tokenizer.perl ├── train.sh └── translate.sh ├── share └── nonbreaking_prefixes │ ├── README.txt │ ├── nonbreaking_prefix.ca │ ├── nonbreaking_prefix.cs │ ├── nonbreaking_prefix.de │ ├── nonbreaking_prefix.el │ ├── nonbreaking_prefix.en │ ├── nonbreaking_prefix.es │ ├── nonbreaking_prefix.fi │ ├── nonbreaking_prefix.fr │ ├── nonbreaking_prefix.ga │ ├── nonbreaking_prefix.hu │ ├── nonbreaking_prefix.is │ ├── nonbreaking_prefix.it │ ├── nonbreaking_prefix.lt │ ├── nonbreaking_prefix.lv │ ├── nonbreaking_prefix.nl │ ├── nonbreaking_prefix.pl │ ├── nonbreaking_prefix.pt │ ├── nonbreaking_prefix.ro │ ├── nonbreaking_prefix.ru │ ├── nonbreaking_prefix.sk │ ├── nonbreaking_prefix.sl │ ├── nonbreaking_prefix.sv │ ├── nonbreaking_prefix.ta │ ├── nonbreaking_prefix.yue │ └── nonbreaking_prefix.zh ├── train.py └── translate.py
/README.md:
--------------------------------------------------------------------------------
1 | # Multilingual Neural Machine Translation System for TV News
2 | 
3 | _This is my [Google Summer of Code 2018](https://summerofcode.withgoogle.com/projects/#6685973346254848) project with [the Distributed Little Red Hen Lab](http://www.redhenlab.org/)._
4 | 
5 | The aim of this project is to build a Multilingual Neural Machine Translation System capable of translating Red Hen Lab's TV news transcripts from different source languages into English.
6 | 
7 | The system uses reinforcement learning (the advantage actor-critic algorithm) on top of a neural encoder-decoder architecture and outperforms simple neural machine translation based on maximum log-likelihood training. Our system achieves close to state-of-the-art results on the standard WMT (Workshop on Machine Translation) test datasets.
8 | 
9 | This project is inspired by the approaches described in the paper [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086).
10 | 
11 | I maintain a GSoC blog; please refer to it for all my GSoC blog posts about the progress made so far.
12 | Blog link: https://vikrant97.github.io/gsoc_blog/
13 | 
14 | The following languages are supported as source languages, together with their language codes:
15 | 1) **German - de**
16 | 2) **French - fr**
17 | 3) **Russian - ru**
18 | 4) **Czech - cs**
19 | 5) **Spanish - es**
20 | 6) **Portuguese - pt**
21 | 7) **Danish - da**
22 | 8) **Swedish - sv**
23 | 9) **Chinese - zh**
24 | The target language is English (en).
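The actor-critic setup referred to above is implemented in `lib/train/ReinforceTrainer.py` (included further down in this dump): sampled translations are scored with a smoothed sentence-level BLEU reward, a critic network predicts the expected reward, and the actor is updated using the advantage (reward minus the critic's baseline) as a per-token weight on the log-likelihood. The snippet below is only a minimal, self-contained sketch of that objective; the function and variable names are illustrative and are not part of the repository (the critic itself is trained separately with a masked regression loss, see `lib/metric/Loss.py`).

```python
# Illustrative sketch only -- not repository code.
def actor_critic_loss(log_probs, reward, baselines):
    """log_probs : log-probability the actor assigned to each sampled token
       reward    : sentence-level reward (e.g. smoothed BLEU) for the sample
       baselines : critic's predicted reward at each decoding step
       Returns a scalar whose gradient matches the policy-gradient update."""
    loss = 0.0
    for lp, b in zip(log_probs, baselines):
        advantage = reward - b   # how much better the sample did than the critic expected
        loss += -advantage * lp  # raise log-prob of tokens with positive advantage
    return loss

# Toy usage: a 3-token sample with BLEU 0.4 and a critic expecting roughly 0.3.
print(actor_critic_loss([-1.2, -0.7, -0.3], 0.4, [0.30, 0.31, 0.29]))
```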
25 | 
26 | ## Getting Started
27 | 
28 | ### Prerequisites
29 | 
30 | * Python-2.7
31 | * PyTorch-0.3
32 | * Tensorflow-gpu
33 | * Numpy
34 | * CUDA
35 | 
36 | ### Installation & Setup Instructions on CASE HPC
37 | 
38 | * Users who want the pipeline to work on CASE HPC should simply copy the directory named **nmt** from the home directory of my HPC account, i.e. **/home/vxg195**, and then follow the instructions described below for training & translation.
39 | 
40 | * The **nmt** directory contains the following subdirectories:
41 |   * singularity
42 |   * data
43 |   * models
44 |   * Neural-Machine-Translation
45 |   * myenv
46 | 
47 | * The **singularity** directory contains a Singularity image (rh_xenial_20180308.img) which is copied from the home directory of **Mr. Michael Pacchioli's CASE HPC account**. This Singularity image provides modules such as CUDA and cuDNN needed by the system.
48 | 
49 | * The **data** directory consists of cleaned & processed datasets for the respective language pairs. Its subdirectories should be named like **de-en**, where **de** & **en** are the language codes for **German** & **English**. So for any language pair whose source language is **$src** and whose target language is **$tgt**, the language data subdirectory should be named **$src-$tgt** and should contain the following files (train, validation & test):
50 |   * train.$src-$tgt.$src.processed
51 |   * train.$src-$tgt.$tgt.processed
52 |   * valid.$src-$tgt.$src.processed
53 |   * valid.$src-$tgt.$tgt.processed
54 |   * test.$src-$tgt.$src.processed
55 |   * test.$src-$tgt.$tgt.processed
56 | 
57 | * The **models** directory consists of trained models for the respective language pairs and follows the same subdirectory structure as the **data** directory. For example, **models/de-en** contains the trained models for the **German-English** language pair.
58 | 
59 | * The following commands were used to install the dependencies for the project:
60 | ```bash
61 | $ git clone https://github.com/RedHenLab/Neural-Machine-Translation.git
62 | $ virtualenv myenv
63 | $ source myenv/bin/activate
64 | $ pip install -r Neural-Machine-Translation/requirements.txt
65 | ```
66 | * **Note** that the virtual environment (myenv) created with the virtualenv command above must be a **Python 2** environment.
67 | 
68 | ## Data Preparation and Preprocessing
69 | 
70 | Please note that these data preparation steps have to be done manually, as we are dealing with a multilingual system and each language pair may have different sources of data. For instance, I used several data sources such as Europarl, News Commentary, CommonCrawl & other open-source datasets. One can have a look at the WMT shared tasks on machine translation to find suitable datasets. I wrote a bash script which processes & prepares a dataset for MT. The following steps can be used to prepare a dataset for MT:
71 | 1) First copy the raw dataset files into the language ($src-$tgt) subdirectory of the data directory in the following format:
72 |   * train.$src-$tgt.$src
73 |   * train.$src-$tgt.$tgt
74 |   * valid.$src-$tgt.$src
75 |   * valid.$src-$tgt.$tgt
76 |   * test.$src-$tgt.$src
77 |   * test.$src-$tgt.$tgt
78 | 
79 | 2) Now create an empty directory named $src-$tgt in the Neural-Machine-Translation/subword_nmt directory. Copy the file named "prepare_data.sh" into the language subdirectory for which we need to prepare the dataset.
Then use the following commands to process the dataset for training:
80 | ```bash
81 | bash prepare_data.sh $src $tgt
82 | ```
83 | After this step, clear the entire language directory & keep only the \*.processed files. Your processed dataset is ready!
84 | 
85 | ## Training
86 | 
87 | To train a model on CASE HPC, run the train.sh file placed in the Neural-Machine-Translation/scripts folder. The training parameters are chosen so that a model can be trained efficiently for any newly introduced language pair, but they should still be tuned to the dataset. The prerequisite for training a model is that the parallel data, as described in the **Installation & Setup** section, resides in the corresponding language-pair directory inside the data folder. The trained models will be saved in the language-pair directory inside the models folder. To train a model on CASE HPC, run the following commands:
88 | 
89 | ```bash
90 | cd Neural-Machine-Translation/scripts
91 | sbatch train.sh $src $tgt
92 | # For example, to train a model for German->English, run the following command
93 | sbatch train.sh de en
94 | ```
95 | After training, the trained model will be saved in the language ($src-$tgt) subdirectory of the models directory. The saved model will be named something like "model_15.pt" and should be renamed to "model_15_best.pt".
96 | 
97 | ## Translation
98 | This project supports translation of both normal text files and news transcripts in any supported language pair.
99 | To translate any input news transcript, run the following commands:
100 | ```bash
101 | cd Neural-Machine-Translation/scripts
102 | sbatch translate.sh 0
103 | ```
104 | To translate any normal text file, run the following commands:
105 | ```bash
106 | cd Neural-Machine-Translation/scripts
107 | sbatch translate.sh 1
108 | ```
109 | **Note that the translated output file will be saved in the same directory as the input file, with the string ".pred" appended to the input file's name.**
110 | 
111 | ## Evaluation of the trained model
112 | For evaluation, first generate translations of a source test corpus. We then need to measure their quality against the original target test corpus. For this, we use the multi-bleu.perl script residing in the scripts directory, which computes the corpus-level BLEU score.
Usage instructions:
113 | ```bash
114 | perl scripts/multi-bleu.perl $reference-file < $hypothesis-file
115 | ```
116 | 
117 | ## Acknowledgements
118 | 
119 | * [Google Summer of Code 2018](https://summerofcode.withgoogle.com/)
120 | * [Red Hen Lab](http://www.redhenlab.org/)
121 | * [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py)
122 | * [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086)
123 | * [Europarl](http://www.statmt.org/europarl/)
124 | * [Moses](https://github.com/moses-smt/mosesdecoder)
125 | 
--------------------------------------------------------------------------------
/lib/__init__.py:
--------------------------------------------------------------------------------
1 | from .data import *
2 | from .eval import *
3 | from .metric import *
4 | from .model import *
5 | from .train import *
6 | 
--------------------------------------------------------------------------------
/lib/__init__.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/__init__.pyc
--------------------------------------------------------------------------------
/lib/data/Constants.py:
--------------------------------------------------------------------------------
1 | 
2 | PAD = 0
3 | UNK = 1
4 | BOS = 2
5 | EOS = 3
6 | 
7 | PAD_WORD = '<blank>'
8 | UNK_WORD = '<unk>'
9 | BOS_WORD = '<s>'
10 | EOS_WORD = '</s>'
11 | 
--------------------------------------------------------------------------------
/lib/data/Constants.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/data/Constants.pyc
--------------------------------------------------------------------------------
/lib/data/Dataset.py:
--------------------------------------------------------------------------------
1 | from __future__ import division
2 | 
3 | import math
4 | import random
5 | 
6 | import torch
7 | from torch.autograd import Variable
8 | 
9 | import lib
10 | 
11 | 
12 | class Dataset(object):
13 |     def __init__(self, data, batchSize, cuda, eval=False):
14 |         self.src = data["src"]
15 |         self.tgt = data["tgt"]
16 |         self.pos = data["pos"]
17 |         assert(len(self.src) == len(self.tgt))
18 |         self.cuda = cuda
19 | 
20 |         self.batchSize = batchSize
21 |         self.numBatches = math.ceil(len(self.src)/batchSize)
22 |         self.eval = eval
23 | 
24 |     def _batchify(self, data, align_right=False, include_lengths=False):
25 |         lengths = [x.size(0) for x in data]
26 |         max_length = max(lengths)
27 |         out = data[0].new(len(data), max_length).fill_(lib.Constants.PAD)
28 |         for i in range(len(data)):
29 |             data_length = data[i].size(0)
30 |             offset = max_length - data_length if align_right else 0
31 |             out[i].narrow(0, offset, data_length).copy_(data[i])
32 | 
33 |         if include_lengths:
34 |             return out, lengths
35 |         else:
36 |             return out
37 | 
38 |     def __getitem__(self, index):
39 |         assert index < self.numBatches, "%d > %d" % (index, self.numBatches)
40 |         srcBatch, lengths = self._batchify(self.src[index*self.batchSize:(index+1)*self.batchSize],
41 |                                            include_lengths=True)
42 | 
43 |         tgtBatch = self._batchify(self.tgt[index*self.batchSize:(index+1)*self.batchSize])
44 | 
45 |         # within batch sort by decreasing length.
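        # (pack_padded_sequence, used by the encoder, expects source lengths in decreasing order; `indices` keeps the original positions so the order can be restored later.)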
46 | indices = range(len(srcBatch)) 47 | batch = zip(indices, srcBatch, tgtBatch) 48 | batch, lengths = zip(*sorted(zip(batch, lengths), key=lambda x: -x[1])) 49 | indices, srcBatch, tgtBatch = zip(*batch) 50 | 51 | def wrap(b): 52 | b = torch.stack(b, 0).t().contiguous() 53 | if self.cuda: 54 | b = b.cuda() 55 | b = Variable(b, volatile=self.eval) 56 | return b 57 | 58 | return (wrap(srcBatch), lengths), wrap(tgtBatch), indices 59 | 60 | def __len__(self): 61 | return self.numBatches 62 | 63 | def shuffle(self): 64 | data = list(zip(self.src, self.tgt, self.pos)) 65 | random.shuffle(data) 66 | self.src, self.tgt, self.pos = zip(*data) 67 | 68 | def restore_pos(self, sents): 69 | sorted_sents = [None] * len(self.pos) 70 | for sent, idx in zip(sents, self.pos): 71 | sorted_sents[idx] = sent 72 | return sorted_sents 73 | -------------------------------------------------------------------------------- /lib/data/Dataset.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/data/Dataset.pyc -------------------------------------------------------------------------------- /lib/data/Dict.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | class Dict(object): 5 | def __init__(self, data=None): 6 | self.idxToLabel = {} 7 | self.labelToIdx = {} 8 | self.frequencies = {} 9 | 10 | # Special entries will not be pruned. 11 | self.special = [] 12 | 13 | if data is not None: 14 | if type(data) == str: 15 | self.loadFile(data) 16 | else: 17 | self.addSpecials(data) 18 | 19 | def size(self): 20 | return len(self.idxToLabel) 21 | 22 | # Load entries from a file. 23 | def loadFile(self, filename): 24 | for line in open(filename): 25 | fields = line.split() 26 | label = fields[0] 27 | idx = int(fields[1]) 28 | self.add(label, idx) 29 | 30 | # Write entries to a file. 31 | def writeFile(self, filename): 32 | with open(filename, 'w') as file: 33 | for i in range(self.size()): 34 | label = self.idxToLabel[i] 35 | file.write('%s %d\n' % (label, i)) 36 | 37 | file.close() 38 | 39 | def lookup(self, key, default=None): 40 | try: 41 | return self.labelToIdx[key] 42 | except KeyError: 43 | return default 44 | 45 | def getLabel(self, idx, default=None): 46 | try: 47 | return self.idxToLabel[idx] 48 | except KeyError: 49 | return default 50 | 51 | # Mark this `label` and `idx` as special (i.e. will not be pruned). 52 | def addSpecial(self, label, idx=None): 53 | idx = self.add(label, idx) 54 | self.special += [idx] 55 | 56 | # Mark all labels in `labels` as specials (i.e. will not be pruned). 57 | def addSpecials(self, labels): 58 | for label in labels: 59 | self.addSpecial(label) 60 | 61 | # Add `label` in the dictionary. Use `idx` as its index if given. 62 | def add(self, label, idx=None): 63 | if idx is not None: 64 | self.idxToLabel[idx] = label 65 | self.labelToIdx[label] = idx 66 | else: 67 | if label in self.labelToIdx: 68 | idx = self.labelToIdx[label] 69 | else: 70 | idx = len(self.idxToLabel) 71 | self.idxToLabel[idx] = label 72 | self.labelToIdx[label] = idx 73 | 74 | if idx not in self.frequencies: 75 | self.frequencies[idx] = 1 76 | else: 77 | self.frequencies[idx] += 1 78 | 79 | return idx 80 | 81 | # Return a new dictionary with the `size` most frequent entries. 82 | def prune(self, size): 83 | if size >= self.size(): 84 | return self 85 | 86 | # Only keep the `size` most frequent entries. 
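        # Sort indices by frequency (descending), rebuild a fresh Dict from the top `size` labels, and re-add the special tokens so they always survive pruning.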
87 | freq = torch.Tensor( 88 | [self.frequencies[i] for i in range(len(self.frequencies))]) 89 | _, idx = torch.sort(freq, 0, True) 90 | 91 | newDict = Dict() 92 | 93 | # Add special entries in all cases. 94 | for i in self.special: 95 | newDict.addSpecial(self.idxToLabel[i]) 96 | 97 | for i in idx[:size]: 98 | newDict.add(self.idxToLabel[i]) 99 | 100 | return newDict 101 | 102 | # Convert `labels` to indices. Use `unkWord` if not found. 103 | # Optionally insert `bosWord` at the beginning and `eosWord` at the . 104 | def convertToIdx(self, labels, unkWord, bosWord=None, eosWord=None): 105 | vec = [] 106 | 107 | if bosWord is not None: 108 | vec += [self.lookup(bosWord)] 109 | 110 | unk = self.lookup(unkWord) 111 | vec += [self.lookup(label.lower(), default=unk) for label in labels] 112 | 113 | if eosWord is not None: 114 | vec += [self.lookup(eosWord)] 115 | 116 | return torch.LongTensor(vec) 117 | 118 | # Convert `idx` to labels. If index `stop` is reached, convert it and return. 119 | def convertToLabels(self, idx, stop): 120 | labels = [] 121 | 122 | for i in idx: 123 | labels += [self.getLabel(i)] 124 | if i == stop: 125 | break 126 | 127 | return labels 128 | -------------------------------------------------------------------------------- /lib/data/Dict.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/data/Dict.pyc -------------------------------------------------------------------------------- /lib/data/__init__.py: -------------------------------------------------------------------------------- 1 | from .Dict import Dict 2 | from .Dataset import Dataset 3 | from .Constants import * 4 | -------------------------------------------------------------------------------- /lib/data/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/data/__init__.pyc -------------------------------------------------------------------------------- /lib/eval/Evaluator.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | import lib 3 | 4 | class Evaluator(object): 5 | def __init__(self, model, metrics, dicts, opt): 6 | self.model = model 7 | self.loss_func = metrics["nmt_loss"] 8 | self.sent_reward_func = metrics["sent_reward"] 9 | self.corpus_reward_func = metrics["corp_reward"] 10 | self.dicts = dicts 11 | self.max_length = opt.max_predict_length 12 | 13 | def eval(self, data, pred_file=None): 14 | self.model.eval() 15 | 16 | total_loss = 0 17 | total_words = 0 18 | total_sents = 0 19 | total_sent_reward = 0 20 | 21 | all_preds = [] 22 | all_targets = [] 23 | for i in range(len(data)): 24 | batch = data[i] 25 | targets = batch[1] 26 | 27 | attention_mask = batch[0][0].data.eq(lib.Constants.PAD).t() 28 | self.model.decoder.attn.applyMask(attention_mask) 29 | outputs = self.model(batch, True) 30 | 31 | 32 | weights = targets.ne(lib.Constants.PAD).float() 33 | num_words = weights.data.sum() 34 | _, loss = self.model.predict(outputs, targets, weights, self.loss_func) 35 | 36 | preds = self.model.translate(batch, self.max_length) 37 | preds = preds.t().tolist() 38 | targets = targets.data.t().tolist() 39 | rewards, _ = self.sent_reward_func(preds, targets) 40 | 41 | #hack 42 | indices=batch[2] 43 | new_batch=zip(preds,targets) 44 | 
new_batch,indices=zip(*sorted(zip(new_batch,indices),key=lambda x: x[1])) 45 | preds,targets=zip(*new_batch) 46 | ### 47 | 48 | all_preds.extend(preds) 49 | all_targets.extend(targets) 50 | 51 | total_loss += loss 52 | total_words += num_words 53 | total_sent_reward += sum(rewards) 54 | total_sents += batch[1].size(1) 55 | 56 | loss = total_loss / total_words 57 | sent_reward = total_sent_reward / total_sents 58 | corpus_reward = self.corpus_reward_func(all_preds, all_targets) 59 | 60 | if pred_file is not None: 61 | self._convert_and_report(data, pred_file, all_preds, 62 | (loss, sent_reward, corpus_reward)) 63 | 64 | return loss, sent_reward, corpus_reward 65 | 66 | def _convert_and_report(self, data, pred_file, preds, metrics): 67 | preds = data.restore_pos(preds) 68 | with open(pred_file, "w") as f: 69 | for sent in preds: 70 | sent = lib.Reward.clean_up_sentence(sent, remove_unk=False, remove_eos=True) 71 | sent = [self.dicts["tgt"].getLabel(w) for w in sent] 72 | x=" ".join(sent)+'\n' 73 | f.write(x) 74 | f.close() 75 | loss, sent_reward, corpus_reward = metrics 76 | print("") 77 | print("Loss: %.6f" % loss) 78 | print("Sentence reward: %.2f" % (sent_reward * 100)) 79 | print("Corpus reward: %.2f" % (corpus_reward * 100)) 80 | print("Predictions saved to %s" % pred_file) 81 | 82 | 83 | -------------------------------------------------------------------------------- /lib/eval/Evaluator.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/eval/Evaluator.pyc -------------------------------------------------------------------------------- /lib/eval/__init__.py: -------------------------------------------------------------------------------- 1 | from .Evaluator import Evaluator 2 | 3 | -------------------------------------------------------------------------------- /lib/eval/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/eval/__init__.pyc -------------------------------------------------------------------------------- /lib/metric/Bleu.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from collections import defaultdict 3 | import math 4 | 5 | def _update_ngrams_count(sent, ngrams, count): 6 | length = len(sent) 7 | for n in range(1, ngrams + 1): 8 | for i in range(length - n + 1): 9 | ngram = tuple(sent[i : (i + n)]) 10 | count[ngram] += 1 11 | 12 | def _compute_bleu(p, len_pred, len_gold, smooth): 13 | # Brevity penalty. 14 | log_brevity = 1 - max(1, (len_gold + smooth) / (len_pred + smooth)) 15 | log_score = 0 16 | ngrams = len(p) - 1 17 | for n in range(1, ngrams + 1): 18 | if p[n][1] > 0: 19 | if p[n][0] == 0: 20 | p[n][0] = 1e-16 21 | log_precision = math.log((p[n][0] + smooth) / (p[n][1] + smooth)) 22 | log_score += log_precision 23 | log_score /= ngrams 24 | return math.exp(log_score + log_brevity) 25 | 26 | 27 | # Calculate BLEU of prefixes of pred. 28 | def score_sentence(pred, gold, ngrams, smooth=0): 29 | scores = [] 30 | # Get ngrams count for gold. 31 | count_gold = defaultdict(int) 32 | _update_ngrams_count(gold, ngrams, count_gold) 33 | # Init ngrams count for pred to 0. 34 | count_pred = defaultdict(int) 35 | # p[n][0] stores the number of overlapped n-grams. 
36 | # p[n][1] is total # of n-grams in pred. 37 | p = [] 38 | for n in range(ngrams + 1): 39 | p.append([0, 0]) 40 | for i in range(len(pred)): 41 | for n in range(1, ngrams + 1): 42 | if i - n + 1 < 0: 43 | continue 44 | # n-gram is from i - n + 1 to i. 45 | ngram = tuple(pred[(i - n + 1) : (i + 1)]) 46 | # Update n-gram count. 47 | count_pred[ngram] += 1 48 | # Update p[n]. 49 | p[n][1] += 1 50 | if count_pred[ngram] <= count_gold[ngram]: 51 | p[n][0] += 1 52 | scores.append(_compute_bleu(p, i + 1, len(gold), smooth)) 53 | return scores 54 | 55 | # Calculate BLEU of a corpus. 56 | def score_corpus(preds, golds, ngrams, smooth=0): 57 | assert len(preds) == len(golds) 58 | p = [] 59 | for n in range(ngrams + 1): 60 | p.append([0, 0]) 61 | len_pred = len_gold = 0 62 | for pred, gold in zip(preds, golds): 63 | len_gold += len(gold) 64 | count_gold = defaultdict(int) 65 | _update_ngrams_count(gold, ngrams, count_gold) 66 | 67 | len_pred += len(pred) 68 | count_pred = defaultdict(int) 69 | _update_ngrams_count(pred, ngrams, count_pred) 70 | 71 | for k, v in count_pred.items(): 72 | n = len(k) 73 | p[n][0] += min(v, count_gold[k]) 74 | p[n][1] += v 75 | 76 | return _compute_bleu(p, len_pred, len_gold, smooth) 77 | 78 | -------------------------------------------------------------------------------- /lib/metric/Bleu.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/metric/Bleu.pyc -------------------------------------------------------------------------------- /lib/metric/Loss.py: -------------------------------------------------------------------------------- 1 | from torch.autograd import Variable 2 | import numpy as np 3 | import torch 4 | import torch.nn as nn 5 | import torch.nn.functional as F 6 | 7 | def weighted_xent_loss(logits, targets, weights): 8 | log_dist = F.log_softmax(logits) 9 | losses = -log_dist.gather(1, targets.unsqueeze(1)).squeeze(1) 10 | losses = losses * weights 11 | return losses.sum() 12 | 13 | def weighted_mse(logits, targets, weights): 14 | losses = (logits - targets)**2 15 | losses = losses * weights 16 | return losses.sum() 17 | -------------------------------------------------------------------------------- /lib/metric/Loss.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/metric/Loss.pyc -------------------------------------------------------------------------------- /lib/metric/PertFunction.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import random 3 | 4 | def _adver(rs, _not_use): 5 | return [1 - r for r in rs] 6 | 7 | def _random(rs, _not_use): 8 | return [random.random() for i in xrange(len(rs))] 9 | 10 | def _bin(rs, b): 11 | return [round(r * b) / b for r in rs] 12 | 13 | def _variance(rs, scale): 14 | res = [] 15 | for r in rs: 16 | # Use 0.67 instead of 67 because scores are in [0,1] instead of [0,100] as in human eval data. 
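        # std is tent-shaped in r: zero at r=0 and r=1 and largest for mid-range scores, so extreme rewards are perturbed the least.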
17 | std = min(r * 0.64, -0.67 * r + 0.67) * scale 18 | r_new = np.random.normal(r, std) 19 | r_new = max(0., min(r_new, 1.)) 20 | res.append(r_new) 21 | return res 22 | 23 | #def _noise(rs, std): 24 | # noises = np.random.normal(0, std, size=len(rs)).tolist() 25 | # return [r + noise for r, noise in zip(rs, noises)] 26 | 27 | def _curve(rs, p): 28 | return [r**p for r in rs] 29 | 30 | class PertFunction(object): 31 | def __init__(self, func_name, param): 32 | self.param = param 33 | if func_name == "bin": 34 | self.func = _bin 35 | elif func_name == "skew": 36 | self.func = _skew 37 | elif func_name == "variance": 38 | self.func = _variance 39 | elif func_name == "random": 40 | self.func = _random 41 | elif func_name == "adver": 42 | self.func = _adver 43 | 44 | def __call__(self, r): 45 | return self.func(r, self.param) 46 | -------------------------------------------------------------------------------- /lib/metric/PertFunction.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/metric/PertFunction.pyc -------------------------------------------------------------------------------- /lib/metric/Reward.py: -------------------------------------------------------------------------------- 1 | import lib 2 | 3 | def clean_up_sentence(sent, remove_unk=False, remove_eos=False): 4 | if lib.Constants.EOS in sent: 5 | sent = sent[:sent.index(lib.Constants.EOS) + 1] 6 | if remove_unk: 7 | sent = filter(lambda x: x != lib.Constants.UNK, sent) 8 | if remove_eos: 9 | if len(sent) > 0 and sent[-1] == lib.Constants.EOS: 10 | sent = sent[:-1] 11 | return sent 12 | 13 | def single_sentence_bleu(pair): 14 | length = len(pair[0]) 15 | pred, gold = pair 16 | pred = clean_up_sentence(pred, remove_unk=False, remove_eos=False) 17 | gold = clean_up_sentence(gold, remove_unk=False, remove_eos=False) 18 | len_pred = len(pred) 19 | if len_pred == 0: 20 | score = 0. 
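        # An empty prediction gets zero BLEU; it is padded back to the original length below so downstream tensors keep a consistent shape.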
21 | pred = [lib.Constants.PAD] * length 22 | else: 23 | score = lib.Bleu.score_sentence(pred, gold, 4, smooth=1)[-1] 24 | while len(pred) < length: 25 | pred.append(lib.Constants.PAD) 26 | 27 | #print pred 28 | #print gold 29 | #print score 30 | #print 31 | 32 | return score, pred 33 | 34 | def sentence_bleu(preds, golds): 35 | results = map(single_sentence_bleu, zip(preds, golds)) 36 | scores, preds = zip(*results) 37 | return scores, preds 38 | 39 | def corpus_bleu(preds, golds): 40 | assert len(preds) == len(golds) 41 | clean_preds = [] 42 | clean_golds = [] 43 | for pred, gold in zip(preds, golds): 44 | pred = clean_up_sentence(pred, remove_unk=False, remove_eos=True) 45 | gold = clean_up_sentence(gold, remove_unk=False, remove_eos=True) 46 | clean_preds.append(pred) 47 | clean_golds.append(gold) 48 | return lib.Bleu.score_corpus(clean_preds, clean_golds, 4) 49 | -------------------------------------------------------------------------------- /lib/metric/Reward.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/metric/Reward.pyc -------------------------------------------------------------------------------- /lib/metric/__init__.py: -------------------------------------------------------------------------------- 1 | from .PertFunction import PertFunction 2 | from .Loss import * 3 | from .Reward import * 4 | from .Bleu import * 5 | -------------------------------------------------------------------------------- /lib/metric/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/metric/__init__.pyc -------------------------------------------------------------------------------- /lib/metric/test_shaping.py: -------------------------------------------------------------------------------- 1 | from RewardShaping import * 2 | 3 | func = RewardShaping("bin", 5) 4 | print func.param 5 | print "Binning: " 6 | for i in np.arange(0, 1, 0.05): 7 | print i, " ---> ", func(i) 8 | print 9 | 10 | print "Noise: " 11 | func = RewardShaping("noise", 0.1) 12 | print func.param 13 | for i in xrange(10): 14 | r = 0.3 15 | print r, " ---> ", func(r) 16 | print 17 | 18 | 19 | def test_curve(func): 20 | print func.param 21 | for i in np.arange(0, 1, 0.1): 22 | print i, " ---> ", func(i), "Diff = ", i - func(i) 23 | print 24 | 25 | print "Curving: " 26 | func = RewardShaping("curve", 1.1) 27 | test_curve(func) 28 | func = RewardShaping("curve", 0.9) 29 | test_curve(func) 30 | func = RewardShaping("curve", 0.8) 31 | test_curve(func) 32 | 33 | func = RewardShaping("curve", 1.2) 34 | test_curve(func) 35 | 36 | func = RewardShaping("curve", 0.5) 37 | test_curve(func) 38 | 39 | func = RewardShaping("curve", 1.5) 40 | test_curve(func) 41 | 42 | -------------------------------------------------------------------------------- /lib/model/EncoderDecoder.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from torch.autograd import Variable 5 | from torch.nn.utils.rnn import pad_packed_sequence as unpack 6 | from torch.nn.utils.rnn import pack_padded_sequence as pack 7 | 8 | import lib 9 | 10 | class Encoder(nn.Module): 11 | def __init__(self, opt, dicts): 12 | self.layers = opt.layers 13 | 
self.num_directions = 2 if opt.brnn else 1 14 | assert opt.rnn_size % self.num_directions == 0 15 | self.hidden_size = opt.rnn_size // self.num_directions 16 | 17 | super(Encoder, self).__init__() 18 | self.word_lut = nn.Embedding(dicts.size(), opt.word_vec_size, padding_idx=lib.Constants.PAD) 19 | self.rnn = nn.LSTM(opt.word_vec_size, self.hidden_size, 20 | num_layers=opt.layers, dropout=opt.dropout, bidirectional=opt.brnn) 21 | 22 | def forward(self, inputs, hidden=None): 23 | emb = pack(self.word_lut(inputs[0]), inputs[1]) 24 | outputs, hidden_t = self.rnn(emb, hidden) 25 | outputs = unpack(outputs)[0] 26 | return hidden_t, outputs 27 | 28 | 29 | class StackedLSTM(nn.Module): 30 | def __init__(self, num_layers, input_size, rnn_size, dropout): 31 | super(StackedLSTM, self).__init__() 32 | self.dropout = nn.Dropout(dropout) 33 | self.num_layers = num_layers 34 | self.layers = nn.ModuleList() 35 | 36 | for i in range(num_layers): 37 | self.layers.append(nn.LSTMCell(input_size, rnn_size)) 38 | input_size = rnn_size 39 | 40 | def forward(self, inputs, hidden): 41 | h_0, c_0 = hidden 42 | h_1, c_1 = [], [] 43 | for i, layer in enumerate(self.layers): 44 | h_1_i, c_1_i = layer(inputs, (h_0[i], c_0[i])) 45 | inputs = h_1_i 46 | if i != self.num_layers: 47 | inputs = self.dropout(inputs) 48 | h_1 += [h_1_i] 49 | c_1 += [c_1_i] 50 | 51 | h_1 = torch.stack(h_1) 52 | c_1 = torch.stack(c_1) 53 | 54 | return inputs, (h_1, c_1) 55 | 56 | 57 | class Decoder(nn.Module): 58 | def __init__(self, opt, dicts): 59 | self.layers = opt.layers 60 | self.input_feed = opt.input_feed 61 | input_size = opt.word_vec_size 62 | if self.input_feed: 63 | input_size += opt.rnn_size 64 | 65 | super(Decoder, self).__init__() 66 | self.word_lut = nn.Embedding(dicts.size(), opt.word_vec_size, padding_idx=lib.Constants.PAD) 67 | self.rnn = StackedLSTM(opt.layers, input_size, opt.rnn_size, opt.dropout) 68 | self.attn = lib.GlobalAttention(opt.rnn_size) 69 | self.dropout = nn.Dropout(opt.dropout) 70 | self.hidden_size = opt.rnn_size 71 | 72 | def step(self, emb, output, hidden, context): 73 | if self.input_feed: 74 | emb = torch.cat([emb, output], 1) 75 | output, hidden = self.rnn(emb, hidden) 76 | output, attn = self.attn(output, context) 77 | output = self.dropout(output) 78 | return output, hidden 79 | 80 | def forward(self, inputs, init_states): 81 | emb, output, hidden, context = init_states 82 | embs = self.word_lut(inputs) 83 | 84 | outputs = [] 85 | for i in range(inputs.size(0)): 86 | output, hidden = self.step(emb, output, hidden, context) 87 | outputs.append(output) 88 | emb = embs[i] 89 | 90 | outputs = torch.stack(outputs) 91 | return outputs 92 | 93 | 94 | class NMTModel(nn.Module): 95 | 96 | def __init__(self, encoder, decoder, generator, opt): 97 | super(NMTModel, self).__init__() 98 | self.encoder = encoder 99 | self.decoder = decoder 100 | self.generator = generator 101 | self.opt = opt 102 | 103 | def make_init_decoder_output(self, context): 104 | batch_size = context.size(1) 105 | h_size = (batch_size, self.decoder.hidden_size) 106 | return Variable(context.data.new(*h_size).zero_(), requires_grad=False) 107 | 108 | def _fix_enc_hidden(self, h): 109 | # the encoder hidden is (layers*directions) x batch x dim 110 | # we need to convert it to layers x batch x (directions*dim) 111 | if self.encoder.num_directions == 2: 112 | return h.view(h.size(0) // 2, 2, h.size(1), h.size(2)) \ 113 | .transpose(1, 2).contiguous() \ 114 | .view(h.size(0) // 2, h.size(1), h.size(2) * 2) 115 | else: 116 | return h 117 | 118 | 
def initialize(self, inputs, eval): 119 | src = inputs[0] 120 | tgt = inputs[1] 121 | enc_hidden, context = self.encoder(src) 122 | init_output = self.make_init_decoder_output(context) 123 | enc_hidden = (self._fix_enc_hidden(enc_hidden[0]), 124 | self._fix_enc_hidden(enc_hidden[1])) 125 | init_token = Variable(torch.LongTensor( 126 | [lib.Constants.BOS] * init_output.size(0)), volatile=eval) 127 | if self.opt.cuda: 128 | init_token = init_token.cuda() 129 | emb = self.decoder.word_lut(init_token) 130 | return tgt, (emb, init_output, enc_hidden, context.transpose(0, 1)) 131 | 132 | def forward(self, inputs, eval, regression=False): 133 | targets, init_states = self.initialize(inputs, eval) 134 | outputs = self.decoder(targets, init_states) 135 | 136 | if regression: 137 | logits = self.generator(outputs) 138 | return logits.view_as(targets) 139 | return outputs 140 | 141 | def backward(self, outputs, targets, weights, normalizer, criterion, regression=False): 142 | grad_output, loss = self.generator.backward(outputs, targets, weights, normalizer, criterion, regression) 143 | outputs.backward(grad_output) 144 | return loss 145 | 146 | def predict(self, outputs, targets, weights, criterion): 147 | return self.generator.predict(outputs, targets, weights, criterion) 148 | 149 | def translate(self, inputs, max_length): 150 | targets, init_states = self.initialize(inputs, eval=True) 151 | emb, output, hidden, context = init_states 152 | 153 | preds = [] 154 | batch_size = targets.size(1) 155 | num_eos = targets[0].data.byte().new(batch_size).zero_() 156 | 157 | for i in range(max_length): 158 | output, hidden = self.decoder.step(emb, output, hidden, context) 159 | logit = self.generator(output) 160 | pred = logit.max(1)[1].view(-1).data 161 | preds.append(pred) 162 | 163 | # Stop if all sentences reach EOS. 164 | num_eos |= (pred == lib.Constants.EOS) 165 | if num_eos.sum() == batch_size: break 166 | 167 | emb = self.decoder.word_lut(Variable(pred)) 168 | 169 | preds = torch.stack(preds) 170 | return preds 171 | 172 | def sample(self, inputs, max_length): 173 | targets, init_states = self.initialize(inputs, eval=False) 174 | emb, output, hidden, context = init_states 175 | 176 | outputs = [] 177 | samples = [] 178 | batch_size = targets.size(1) 179 | num_eos = targets[0].data.byte().new(batch_size).zero_() 180 | 181 | for i in range(max_length): 182 | output, hidden = self.decoder.step(emb, output, hidden, context) 183 | outputs.append(output) 184 | dist = F.softmax(self.generator(output)) 185 | sample = dist.multinomial(1, replacement=False).view(-1).data 186 | samples.append(sample) 187 | 188 | # Stop if all sentences reach EOS. 
189 | num_eos |= (sample == lib.Constants.EOS) 190 | if num_eos.sum() == batch_size: break 191 | 192 | emb = self.decoder.word_lut(Variable(sample)) 193 | 194 | outputs = torch.stack(outputs) 195 | samples = torch.stack(samples) 196 | return samples, outputs 197 | 198 | 199 | -------------------------------------------------------------------------------- /lib/model/EncoderDecoder.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/model/EncoderDecoder.pyc -------------------------------------------------------------------------------- /lib/model/Generator.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from torch.autograd import Variable 5 | 6 | 7 | class BaseGenerator(nn.Module): 8 | def __init__(self, generator, opt): 9 | super(BaseGenerator, self).__init__() 10 | self.generator = generator 11 | self.opt = opt 12 | 13 | def forward(self, inputs): 14 | return self.generator(inputs.contiguous().view(-1, inputs.size(-1))) 15 | 16 | def backward(self, outputs, targets, weights, normalizer, criterion, regression=False): 17 | outputs = Variable(outputs.data, requires_grad=True) 18 | 19 | logits = outputs.contiguous().view(-1) if regression else self.forward(outputs) 20 | 21 | loss = criterion(logits, targets.contiguous().view(-1), weights.contiguous().view(-1)) 22 | loss.div(normalizer).backward() 23 | loss = loss.data[0] 24 | 25 | if outputs.grad is None: 26 | grad_output = torch.zeros(outputs.size()) 27 | else: 28 | grad_output = outputs.grad.data 29 | 30 | return grad_output, loss 31 | 32 | def predict(self, outputs, targets, weights, criterion): 33 | logits = self.forward(outputs) 34 | preds = logits.data.max(1)[1].view(outputs.size(0), -1) 35 | 36 | loss = criterion(logits, targets.contiguous().view(-1), weights.contiguous().view(-1)).data[0] 37 | 38 | return preds, loss 39 | 40 | 41 | class MemEfficientGenerator(BaseGenerator): 42 | def __init__(self, generator, opt, dim=1): 43 | super(MemEfficientGenerator, self).__init__(generator, opt) 44 | self.batch_size = opt.max_generator_batches 45 | self.dim = dim 46 | 47 | def backward(self, outputs, targets, weights, normalizer, criterion, regression=False): 48 | outputs_split = torch.split(outputs, self.batch_size, self.dim) 49 | targets_split = torch.split(targets, self.batch_size, self.dim) 50 | weights_split = torch.split(weights, self.batch_size, self.dim) 51 | 52 | grad_output = [] 53 | loss = 0 54 | for out_t, targ_t, w_t in zip(outputs_split, targets_split, weights_split): 55 | grad_output_t, loss_t = super(MemEfficientGenerator, self).backward( 56 | out_t, targ_t, w_t, normalizer, criterion, regression) 57 | grad_output.append(grad_output_t) 58 | loss += loss_t 59 | 60 | grad_output = torch.cat(grad_output, self.dim) 61 | return grad_output, loss 62 | 63 | def predict(self, outputs, targets, weights, criterion): 64 | outputs_split = torch.split(outputs, self.batch_size, self.dim) 65 | targets_split = torch.split(targets, self.batch_size, self.dim) 66 | weights_split = torch.split(weights, self.batch_size, self.dim) 67 | 68 | preds = [] 69 | loss = 0 70 | for out_t, targ_t, w_t in zip(outputs_split, targets_split, weights_split): 71 | preds_t, loss_t = super(MemEfficientGenerator, self).predict( 72 | out_t, targ_t, w_t, criterion) 73 | preds.append(preds_t) 74 | loss += 
loss_t 75 | 76 | preds = torch.cat(preds, self.dim) 77 | return preds, loss 78 | 79 | 80 | -------------------------------------------------------------------------------- /lib/model/Generator.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/model/Generator.pyc -------------------------------------------------------------------------------- /lib/model/GlobalAttention.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import math 4 | 5 | _INF = float('inf') 6 | 7 | class GlobalAttention(nn.Module): 8 | def __init__(self, dim): 9 | super(GlobalAttention, self).__init__() 10 | self.linear_in = nn.Linear(dim, dim, bias=False) 11 | self.sm = nn.Softmax() 12 | self.linear_out = nn.Linear(dim*2, dim, bias=False) 13 | self.tanh = nn.Tanh() 14 | self.mask = None 15 | 16 | def applyMask(self, mask): 17 | self.mask = mask 18 | 19 | def forward(self, inputs, context): 20 | """ 21 | inputs: batch x dim 22 | context: batch x sourceL x dim 23 | """ 24 | targetT = self.linear_in(inputs).unsqueeze(2) # batch x dim x 1 25 | 26 | # Get attention 27 | attn = torch.bmm(context, targetT).squeeze(2) # batch x sourceL 28 | if self.mask is not None: 29 | attn.data.masked_fill_(self.mask, -_INF) 30 | attn = self.sm(attn) 31 | attn3 = attn.view(attn.size(0), 1, attn.size(1)) # batch x 1 x sourceL 32 | 33 | weightedContext = torch.bmm(attn3, context).squeeze(1) # batch x dim 34 | contextCombined = torch.cat((weightedContext, inputs), 1) 35 | 36 | contextOutput = self.tanh(self.linear_out(contextCombined)) 37 | 38 | return contextOutput, attn 39 | -------------------------------------------------------------------------------- /lib/model/GlobalAttention.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/model/GlobalAttention.pyc -------------------------------------------------------------------------------- /lib/model/__init__.py: -------------------------------------------------------------------------------- 1 | from .GlobalAttention import * 2 | from .EncoderDecoder import * 3 | from .Generator import * 4 | -------------------------------------------------------------------------------- /lib/model/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/model/__init__.pyc -------------------------------------------------------------------------------- /lib/train/Optim.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch.optim as optim 3 | 4 | 5 | class Optim(object): 6 | def _makeOptimizer(self): 7 | if self.method == 'sgd': 8 | self.optimizer = optim.SGD(self.params, lr=self.lr) 9 | elif self.method == 'adagrad': 10 | self.optimizer = optim.Adagrad(self.params, lr=self.lr) 11 | elif self.method == 'adadelta': 12 | self.optimizer = optim.Adadelta(self.params, lr=self.lr) 13 | elif self.method == 'adam': 14 | self.optimizer = optim.Adam(self.params, lr=self.lr) 15 | else: 16 | raise RuntimeError("Invalid optim method: " + self.method) 17 | 18 | def __init__(self, params, method, lr, max_grad_norm, lr_decay=1, 
start_decay_at=None): 19 | self.params = list(params) # careful: params may be a generator 20 | self.last_loss = None 21 | self.lr = lr 22 | self.max_grad_norm = max_grad_norm 23 | self.method = method 24 | self.lr_decay = lr_decay 25 | self.start_decay_at = start_decay_at 26 | 27 | self._makeOptimizer() 28 | 29 | def step(self): 30 | # Compute gradients norm. 31 | grad_norm = 0 32 | for param in self.params: 33 | grad_norm += math.pow(param.grad.data.norm(), 2) 34 | 35 | grad_norm = math.sqrt(grad_norm) 36 | shrinkage = self.max_grad_norm / grad_norm 37 | 38 | for param in self.params: 39 | if shrinkage < 1: 40 | param.grad.data.mul_(shrinkage) 41 | 42 | self.optimizer.step() 43 | return grad_norm 44 | 45 | def set_lr(self, lr): 46 | self.lr = lr 47 | self.optimizer.param_groups[0]["lr"] = lr 48 | 49 | def updateLearningRate(self, loss, epoch): 50 | if self.start_decay_at is not None and epoch >= self.start_decay_at: 51 | if self.last_loss is not None and loss > self.last_loss: 52 | self.set_lr(self.lr * self.lr_decay) 53 | self.last_loss = loss 54 | -------------------------------------------------------------------------------- /lib/train/Optim.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/train/Optim.pyc -------------------------------------------------------------------------------- /lib/train/ReinforceTrainer.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import math 3 | import os 4 | import time 5 | 6 | from torch.autograd import Variable 7 | import torch 8 | 9 | import lib 10 | 11 | class ReinforceTrainer(object): 12 | 13 | def __init__(self, actor, critic, train_data, eval_data, metrics, dicts, optim, critic_optim, opt): 14 | self.actor = actor 15 | self.critic = critic 16 | 17 | self.train_data = train_data 18 | self.eval_data = eval_data 19 | self.evaluator = lib.Evaluator(actor, metrics, dicts, opt) 20 | 21 | self.actor_loss_func = metrics["nmt_loss"] 22 | self.critic_loss_func = metrics["critic_loss"] 23 | self.sent_reward_func = metrics["sent_reward"] 24 | 25 | self.dicts = dicts 26 | 27 | self.optim = optim 28 | self.critic_optim = critic_optim 29 | 30 | self.max_length = opt.max_predict_length 31 | self.pert_func = opt.pert_func 32 | self.opt = opt 33 | 34 | print("") 35 | print(actor) 36 | print("") 37 | print(critic) 38 | 39 | def train(self, start_epoch, end_epoch, pretrain_critic, start_time=None): 40 | if start_time is None: 41 | self.start_time = time.time() 42 | else: 43 | self.start_time = start_time 44 | self.optim.last_loss = self.critic_optim.last_loss = None 45 | self.optim.set_lr(self.opt.reinforce_lr) 46 | 47 | # Use large learning rate for critic during pre-training. 
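        # (While pre-training the critic, the actor itself is not updated -- see the guard in train_epoch below.)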
48 | if pretrain_critic: 49 | self.critic_optim.set_lr(1e-3) 50 | else: 51 | self.critic_optim.set_lr(self.opt.reinforce_lr) 52 | 53 | for epoch in range(start_epoch, end_epoch + 1): 54 | print("") 55 | 56 | print("* REINFORCE epoch *") 57 | print("Actor optim lr: %g; Critic optim lr: %g" % 58 | (self.optim.lr, self.critic_optim.lr)) 59 | if pretrain_critic: 60 | print("Pretrain critic...") 61 | no_update = self.opt.no_update and (not pretrain_critic) and \ 62 | (epoch == start_epoch) 63 | 64 | if no_update: print("No update...") 65 | 66 | train_reward, critic_loss = self.train_epoch(epoch, pretrain_critic, no_update) 67 | print("Train sentence reward: %.2f" % (train_reward * 100)) 68 | print("Critic loss: %g" % critic_loss) 69 | 70 | valid_loss, valid_sent_reward, valid_corpus_reward = self.evaluator.eval(self.eval_data) 71 | valid_ppl = math.exp(min(valid_loss, 100)) 72 | print("Validation perplexity: %.2f" % valid_ppl) 73 | print("Validation sentence reward: %.2f" % (valid_sent_reward * 100)) 74 | print("Validation corpus reward: %.2f" % 75 | (valid_corpus_reward * 100)) 76 | 77 | if no_update: break 78 | 79 | self.optim.updateLearningRate(-valid_sent_reward, epoch) 80 | # Actor and critic use the same lr when jointly trained. 81 | # TODO: using small lr for critic is better? 82 | if not pretrain_critic: 83 | self.critic_optim.set_lr(self.optim.lr) 84 | 85 | checkpoint = { 86 | "model": self.actor, 87 | "critic": self.critic, 88 | "dicts": self.dicts, 89 | "opt": self.opt, 90 | "epoch": epoch, 91 | "optim": self.optim, 92 | "critic_optim": self.critic_optim 93 | } 94 | model_name = os.path.join(self.opt.save_dir, "model_%d" % epoch) 95 | if pretrain_critic: 96 | model_name += "_pretrain" 97 | else: 98 | model_name += "_reinforce" 99 | model_name += ".pt" 100 | torch.save(checkpoint, model_name) 101 | print("Save model as %s" % model_name) 102 | 103 | def train_epoch(self, epoch, pretrain_critic, no_update): 104 | self.actor.train() 105 | 106 | total_reward, report_reward = 0, 0 107 | total_critic_loss, report_critic_loss = 0, 0 108 | total_sents, report_sents = 0, 0 109 | total_words, report_words = 0, 0 110 | last_time = time.time() 111 | for i in range(len(self.train_data)): 112 | batch = self.train_data[i] 113 | sources = batch[0] 114 | targets = batch[1] 115 | batch_size = targets.size(1) 116 | 117 | self.actor.zero_grad() 118 | self.critic.zero_grad() 119 | 120 | # Sample translations 121 | attention_mask = sources[0].data.eq(lib.Constants.PAD).t() 122 | self.actor.decoder.attn.applyMask(attention_mask) 123 | samples, outputs = self.actor.sample(batch, self.max_length) 124 | 125 | # Calculate rewards 126 | rewards, samples = self.sent_reward_func(samples.t().tolist(), targets.data.t().tolist()) 127 | reward = sum(rewards) 128 | 129 | # Perturb rewards (if specified). 130 | if self.pert_func is not None: 131 | rewards = self.pert_func(rewards) 132 | 133 | samples = Variable(torch.LongTensor(samples).t().contiguous()) 134 | rewards = Variable(torch.FloatTensor([rewards] * samples.size(0)).contiguous()) 135 | if self.opt.cuda: 136 | samples = samples.cuda() 137 | rewards = rewards.cuda() 138 | 139 | # Update critic. 
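            # The critic is trained as a regressor: its per-token baseline is pushed towards the sentence reward, masked so PAD positions contribute nothing (the critic_loss metric -- likely the weighted MSE in lib/metric/Loss.py).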
140 | critic_weights = samples.ne(lib.Constants.PAD).float() 141 | num_words = critic_weights.data.sum() 142 | if not no_update: 143 | baselines = self.critic((sources, samples), eval=False, regression=True) 144 | critic_loss = self.critic.backward( 145 | baselines, rewards, critic_weights, num_words, self.critic_loss_func, regression=True) 146 | self.critic_optim.step() 147 | else: 148 | critic_loss = 0 149 | 150 | # Update actor 151 | if not pretrain_critic and not no_update: 152 | # Subtract baseline from reward 153 | norm_rewards = Variable((rewards - baselines).data) 154 | actor_weights = norm_rewards * critic_weights 155 | # TODO: can use PyTorch reinforce() here but that function is a black box. 156 | # This is an alternative way where you specify an objective that gives the same gradient 157 | # as the policy gradient's objective, which looks much like weighted log-likelihood. 158 | actor_loss = self.actor.backward(outputs, samples, actor_weights, 1, self.actor_loss_func) 159 | self.optim.step() 160 | 161 | # Gather stats 162 | total_reward += reward 163 | report_reward += reward 164 | total_sents += batch_size 165 | report_sents += batch_size 166 | total_critic_loss += critic_loss 167 | report_critic_loss += critic_loss 168 | total_words += num_words 169 | report_words += num_words 170 | if i % self.opt.log_interval == 0 and i > 0: 171 | print("""Epoch %3d, %6d/%d batches; 172 | actor reward: %.4f; critic loss: %f; %5.0f tokens/s; %s elapsed""" % 173 | (epoch, i, len(self.train_data), 174 | (report_reward / report_sents) * 100, 175 | report_critic_loss / report_words, 176 | report_words / (time.time() - last_time), 177 | str(datetime.timedelta(seconds=int(time.time() - self.start_time))))) 178 | 179 | report_reward = report_sents = report_critic_loss = report_words = 0 180 | last_time = time.time() 181 | 182 | return total_reward / total_sents, total_critic_loss / total_words 183 | 184 | -------------------------------------------------------------------------------- /lib/train/ReinforceTrainer.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/train/ReinforceTrainer.pyc -------------------------------------------------------------------------------- /lib/train/Trainer.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import math 3 | import os 4 | import time 5 | 6 | import torch 7 | 8 | import lib 9 | 10 | class Trainer(object): 11 | def __init__(self, model, train_data, eval_data, metrics, dicts, 12 | optim, opt): 13 | 14 | self.model = model 15 | self.train_data = train_data 16 | self.eval_data = eval_data 17 | self.evaluator = lib.Evaluator(model, metrics, dicts, opt) 18 | self.loss_func = metrics["nmt_loss"] 19 | self.dicts = dicts 20 | self.optim = optim 21 | self.opt = opt 22 | 23 | print(model) 24 | 25 | def train(self, start_epoch, end_epoch, start_time=None): 26 | if start_time is None: 27 | self.start_time = time.time() 28 | else: 29 | self.start_time = start_time 30 | for epoch in range(start_epoch, end_epoch + 1): 31 | print('') 32 | 33 | print("* XENT epoch *") 34 | print("Model optim lr: %g" % self.optim.lr) 35 | train_loss = self.train_epoch(epoch) 36 | print('Train perplexity: %.2f' % math.exp(min(train_loss, 100))) 37 | 38 | valid_loss, valid_sent_reward, valid_corpus_reward = self.evaluator.eval(self.eval_data) 39 | valid_ppl = math.exp(min(valid_loss, 
100)) 40 | print('Validation perplexity: %.2f' % valid_ppl) 41 | print('Validation sentence reward: %.2f' % (valid_sent_reward * 100)) 42 | print('Validation corpus reward: %.2f' % 43 | (valid_corpus_reward * 100)) 44 | 45 | self.optim.updateLearningRate(valid_loss, epoch) 46 | 47 | checkpoint = { 48 | 'model': self.model, 49 | 'dicts': self.dicts, 50 | 'opt': self.opt, 51 | 'epoch': epoch, 52 | 'optim': self.optim, 53 | } 54 | model_name = os.path.join(self.opt.save_dir, "model_%d.pt" % epoch) 55 | torch.save(checkpoint, model_name) 56 | print("Save model as %s" % model_name) 57 | 58 | 59 | def train_epoch(self, epoch): 60 | self.model.train() 61 | 62 | self.train_data.shuffle() 63 | 64 | total_loss, report_loss = 0, 0 65 | total_words, report_words = 0, 0 66 | last_time = time.time() 67 | for i in range(len(self.train_data)): 68 | batch = self.train_data[i] 69 | targets = batch[1] 70 | 71 | self.model.zero_grad() 72 | attention_mask = batch[0][0].data.eq(lib.Constants.PAD).t() 73 | self.model.decoder.attn.applyMask(attention_mask) 74 | outputs = self.model(batch, eval=False) 75 | 76 | weights = targets.ne(lib.Constants.PAD).float() 77 | num_words = weights.data.sum() 78 | loss = self.model.backward(outputs, targets, weights, num_words, self.loss_func) 79 | 80 | self.optim.step() 81 | 82 | report_loss += loss 83 | total_loss += loss 84 | total_words += num_words 85 | report_words += num_words 86 | if i % self.opt.log_interval == 0 and i > 0: 87 | print("""Epoch %3d, %6d/%d batches; 88 | perplexity: %8.2f; %5.0f tokens/s; %s elapsed""" % 89 | (epoch, i, len(self.train_data), 90 | math.exp(report_loss / report_words), 91 | report_words / (time.time() - last_time), 92 | str(datetime.timedelta(seconds=int(time.time() - self.start_time))))) 93 | 94 | report_loss = report_words = 0 95 | last_time = time.time() 96 | 97 | return total_loss / total_words 98 | 99 | -------------------------------------------------------------------------------- /lib/train/Trainer.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/train/Trainer.pyc -------------------------------------------------------------------------------- /lib/train/__init__.py: -------------------------------------------------------------------------------- 1 | from .Optim import * 2 | from .ReinforceTrainer import * 3 | from .Trainer import * 4 | -------------------------------------------------------------------------------- /lib/train/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RedHenLab/Neural-Machine-Translation/99bc3c9b30fd55c73ddfaf2d1735fcb627e42266/lib/train/__init__.pyc -------------------------------------------------------------------------------- /preprocess.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import torch 3 | 4 | import lib 5 | 6 | parser = argparse.ArgumentParser(description="preprocess.py") 7 | 8 | parser.add_argument("-train_src", required=True, 9 | help="Path to the training source data") 10 | parser.add_argument("-train_tgt", required=True, 11 | help="Path to the training target data") 12 | 13 | parser.add_argument("-train_xe_src", required=True, 14 | help="Path to the pre-training source data") 15 | parser.add_argument("-train_xe_tgt", required=True, 16 | help="Path to the pre-training target data") 17 | 18 | 
parser.add_argument("-train_pg_src", required=True, 19 | help="Path to the bandit training source data") 20 | parser.add_argument("-train_pg_tgt", required=True, 21 | help="Path to the bandit training target data") 22 | 23 | parser.add_argument("-valid_src", required=True, 24 | help="Path to the validation source data") 25 | parser.add_argument("-valid_tgt", required=True, 26 | help="Path to the validation target data") 27 | 28 | parser.add_argument("-test_src", required=True, 29 | help="Path to the test source data") 30 | parser.add_argument("-test_tgt", required=True, 31 | help="Path to the test target data") 32 | 33 | parser.add_argument("-save_data", required=True, 34 | help="Output file for the prepared data") 35 | 36 | parser.add_argument("-src_vocab_size", type=int, default=50000, 37 | help="Size of the source vocabulary") 38 | parser.add_argument("-tgt_vocab_size", type=int, default=50000, 39 | help="Size of the target vocabulary") 40 | 41 | parser.add_argument("-seq_length", type=int, default=80, 42 | help="Maximum sequence length") 43 | parser.add_argument("-seed", type=int, default=3435, 44 | help="Random seed") 45 | 46 | parser.add_argument("-report_every", type=int, default=100000, 47 | help="Report status every this many sentences") 48 | 49 | opt = parser.parse_args() 50 | torch.manual_seed(opt.seed) 51 | 52 | 53 | def makeVocabulary(filename, size): 54 | vocab = lib.Dict([lib.Constants.PAD_WORD, lib.Constants.UNK_WORD, 55 | lib.Constants.BOS_WORD, lib.Constants.EOS_WORD]) 56 | 57 | with open(filename) as f: 58 | for sent in f.readlines(): 59 | for word in sent.split(): 60 | #vocab.add(word) 61 | vocab.add(word.lower()) # Lowercase all words 62 | 63 | originalSize = vocab.size() 64 | vocab = vocab.prune(size) 65 | print("Created dictionary of size %d (pruned from %d)" % 66 | (vocab.size(), originalSize)) 67 | 68 | return vocab 69 | 70 | 71 | def initVocabulary(name, dataFile, vocabSize, saveFile): 72 | print("Building " + name + " vocabulary...") 73 | vocab = makeVocabulary(dataFile, vocabSize) 74 | print("Saving " + name + " vocabulary to \"" + saveFile + "\"...") 75 | vocab.writeFile(saveFile) 76 | return vocab 77 | 78 | '''def reorderSentences(pos, src, tgt, perm): 79 | new_pos = [pos[idx] for idx in perm] 80 | new_src = [src[idx] for idx in perm] 81 | new_tgt = [tgt[idx] for idx in perm] 82 | return new_pos, new_src, new_tgt 83 | ''' 84 | def makeData(which, srcFile, tgtFile, srcDicts, tgtDicts): 85 | src, tgt = [], [] 86 | sizes = [] 87 | count, ignored = 0, 0 88 | 89 | print("Processing %s & %s ..." 
% (srcFile, tgtFile)) 90 | srcF = open(srcFile) 91 | tgtF = open(tgtFile) 92 | 93 | while True: 94 | srcWords = srcF.readline().split() 95 | tgtWords = tgtF.readline().split() 96 | 97 | if not srcWords or not tgtWords: 98 | if srcWords and not tgtWords or not srcWords and tgtWords: 99 | print("WARNING: source and target do not have the same number of sentences") 100 | break 101 | 102 | if len(srcWords) <= opt.seq_length and len(tgtWords) <= opt.seq_length: 103 | src += [srcDicts.convertToIdx(srcWords, 104 | lib.Constants.UNK_WORD)] 105 | tgt += [tgtDicts.convertToIdx(tgtWords, 106 | lib.Constants.UNK_WORD, 107 | eosWord=lib.Constants.EOS_WORD)] 108 | sizes += [len(srcWords)] 109 | else: 110 | if which!="test": 111 | ignored += 1 112 | else: 113 | src += [srcDicts.convertToIdx(srcWords, 114 | lib.Constants.UNK_WORD)] 115 | tgt += [tgtDicts.convertToIdx(tgtWords, 116 | lib.Constants.UNK_WORD, 117 | eosWord=lib.Constants.EOS_WORD)] 118 | sizes += [len(srcWords)] 119 | 120 | 121 | count += 1 122 | if count % opt.report_every == 0: 123 | print("... %d sentences prepared" % count) 124 | 125 | srcF.close() 126 | tgtF.close() 127 | 128 | assert len(src) == len(tgt) 129 | print("Prepared %d sentences (%d ignored due to length == 0 or > %d)" % (len(src), ignored, opt.seq_length)) 130 | 131 | return src, tgt, range(len(src)) 132 | 133 | 134 | def makeDataGeneral(which, src_path, tgt_path, dicts): 135 | print("Preparing " + which + "...") 136 | res = {} 137 | res["src"], res["tgt"], res["pos"] = makeData(which, src_path, tgt_path, 138 | dicts["src"], dicts["tgt"]) 139 | return res 140 | 141 | 142 | def main(): 143 | dicts = {} 144 | dicts["src"] = initVocabulary("source", opt.train_src, opt.src_vocab_size, 145 | opt.save_data + ".src.dict") 146 | dicts["tgt"] = initVocabulary("target", opt.train_tgt, opt.tgt_vocab_size, 147 | opt.save_data + ".tgt.dict") 148 | 149 | save_data = {} 150 | save_data["dicts"] = dicts 151 | save_data["train_xe"] = makeDataGeneral("train_xe", opt.train_xe_src, 152 | opt.train_xe_tgt, dicts) 153 | save_data["train_pg"] = makeDataGeneral("train_pg", opt.train_pg_src, 154 | opt.train_pg_tgt, dicts) 155 | save_data["valid"] = makeDataGeneral("valid", opt.valid_src, opt.valid_tgt, 156 | dicts) 157 | save_data["test"] = makeDataGeneral("test", opt.test_src, opt.test_tgt, 158 | dicts) 159 | 160 | print("Saving data to \"" + opt.save_data + "-train.pt\"...") 161 | torch.save(save_data, opt.save_data + "-train.pt") 162 | 163 | 164 | if __name__ == "__main__": 165 | main() 166 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==0.1.10 2 | backports.functools-lru-cache==1.5 3 | backports.weakref==1.0.post1 4 | bleach==1.5.0 5 | certifi==2018.1.18 6 | chardet==3.0.4 7 | cycler==0.10.0 8 | enum34==1.1.6 9 | funcsigs==1.0.2 10 | future==0.16.0 11 | futures==3.2.0 12 | html5lib==0.9999999 13 | idna==2.6 14 | Markdown==2.6.11 15 | matplotlib==2.1.2 16 | mock==2.0.0 17 | nltk==3.2.5 18 | numpy==1.14.1 19 | pbr==3.1.1 20 | Pillow==5.0.0 21 | protobuf==3.5.1 22 | pyparsing==2.2.0 23 | pyrouge==0.1.3 24 | python-dateutil==2.6.1 25 | pytz==2018.3 26 | PyYAML==3.12 27 | requests>=2.20.0 28 | six==1.11.0 29 | subprocess32==3.2.7 30 | tensorflow-gpu==1.5.0 31 | tensorflow-tensorboard==1.5.1 32 | torch==0.3.1 33 | torchtext==0.2.1 34 | torchvision==0.2.0 35 | tqdm==4.19.5 36 | urllib3>=1.23 37 | Werkzeug==0.14.1 38 | 
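Before moving on to the helper scripts: preprocess.py above bundles everything train.py consumes into a single `$save_data-train.pt` file — the two pruned vocabularies plus the indexed `train_xe`, `train_pg`, `valid` and `test` splits. A minimal sketch of inspecting that file (the name `processed_all-train.pt` is just the example used in scripts/train.sh; the snippet is illustrative and not part of the pipeline):

```python
# Illustrative only: inspect the dataset file written by preprocess.py.
# Run from the repository root so the pickled lib.Dict objects can be rebuilt;
# "processed_all-train.pt" is the -save_data name used in scripts/train.sh.
import torch
import lib  # needed so torch.load can unpickle the lib.Dict vocabularies

data = torch.load("processed_all-train.pt")

src_dict, tgt_dict = data["dicts"]["src"], data["dicts"]["tgt"]
print("source vocab: %d, target vocab: %d" % (src_dict.size(), tgt_dict.size()))

# Each split holds parallel lists of index tensors plus the original positions.
for split in ("train_xe", "train_pg", "valid", "test"):
    print("%s: %d sentence pairs" % (split, len(data[split]["src"])))
```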
-------------------------------------------------------------------------------- /scripts/extract_parallel.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | import sys,os 3 | 4 | input_dir=sys.argv[1] 5 | files=os.listdir(input_dir) 6 | 7 | 8 | fields = set(["TOP", "COL", "UID", "PID", "ACQ", "DUR", "VID", "TTL", "URL", "TTS", "SRC", "CMT", "LAN", "TTP", "HED", "OBT", "LBT", "END", "CC1", "CC2"]) 9 | 10 | f1=open("french.txt",'a') 11 | f2=open("english.txt",'a') 12 | 13 | for input_file in files: 14 | with open(os.path.join(input_dir,input_file)) as f: 15 | content=f.readlines() 16 | content=[x.strip() for x in content] 17 | f.close() 18 | c1=0 19 | c2=0 20 | for i in range(len(content)-1): 21 | line1=content[i] 22 | l1=line1.split('|') 23 | line2=content[i+1] 24 | l2=line2.split('|') 25 | if (l1[0] not in fields and (l1[2]=="CC1" or l1[2]=="CC2")) and (l2[0] not in fields and (l2[2]=="CC1" or l2[2]=="CC2")): 26 | if l1[2]=="CC1" and l2[2]=="CC2": 27 | print(l1[3],file=f1) 28 | print(l2[3],file=f2) 29 | c1+=1 30 | c2+=1 31 | i+=1 32 | if c1!=c2: 33 | print(input_file) 34 | 35 | f1.close() 36 | f2.close() 37 | 38 | 39 | -------------------------------------------------------------------------------- /scripts/lowercase.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 5 | 6 | use warnings; 7 | use strict; 8 | 9 | binmode(STDIN, ":utf8"); 10 | binmode(STDOUT, ":utf8"); 11 | 12 | while() { 13 | print lc($_); 14 | } 15 | -------------------------------------------------------------------------------- /scripts/multi-bleu.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 
5 | 6 | # $Id$ 7 | use warnings; 8 | use strict; 9 | 10 | my $lowercase = 0; 11 | if ($ARGV[0] eq "-lc") { 12 | $lowercase = 1; 13 | shift; 14 | } 15 | 16 | my $stem = $ARGV[0]; 17 | if (!defined $stem) { 18 | print STDERR "usage: multi-bleu.pl [-lc] reference < hypothesis\n"; 19 | print STDERR "Reads the references from reference or reference0, reference1, ...\n"; 20 | exit(1); 21 | } 22 | 23 | $stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0"; 24 | 25 | my @REF; 26 | my $ref=0; 27 | while(-e "$stem$ref") { 28 | &add_to_ref("$stem$ref",\@REF); 29 | $ref++; 30 | } 31 | &add_to_ref($stem,\@REF) if -e $stem; 32 | die("ERROR: could not find reference file $stem") unless scalar @REF; 33 | 34 | # add additional references explicitly specified on the command line 35 | shift; 36 | foreach my $stem (@ARGV) { 37 | &add_to_ref($stem,\@REF) if -e $stem; 38 | } 39 | 40 | 41 | 42 | sub add_to_ref { 43 | my ($file,$REF) = @_; 44 | my $s=0; 45 | if ($file =~ /.gz$/) { 46 | open(REF,"gzip -dc $file|") or die "Can't read $file"; 47 | } else { 48 | open(REF,$file) or die "Can't read $file"; 49 | } 50 | while() { 51 | chop; 52 | push @{$$REF[$s++]}, $_; 53 | } 54 | close(REF); 55 | } 56 | 57 | my(@CORRECT,@TOTAL,$length_translation,$length_reference); 58 | my $s=0; 59 | while() { 60 | chop; 61 | $_ = lc if $lowercase; 62 | my @WORD = split; 63 | my %REF_NGRAM = (); 64 | my $length_translation_this_sentence = scalar(@WORD); 65 | my ($closest_diff,$closest_length) = (9999,9999); 66 | foreach my $reference (@{$REF[$s]}) { 67 | # print "$s $_ <=> $reference\n"; 68 | $reference = lc($reference) if $lowercase; 69 | my @WORD = split(' ',$reference); 70 | my $length = scalar(@WORD); 71 | my $diff = abs($length_translation_this_sentence-$length); 72 | if ($diff < $closest_diff) { 73 | $closest_diff = $diff; 74 | $closest_length = $length; 75 | # print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." = abs($length_translation_this_sentence-$length), setting len: $closest_length\n"; 76 | } elsif ($diff == $closest_diff) { 77 | $closest_length = $length if $length < $closest_length; 78 | # from two references with the same closeness to me 79 | # take the *shorter* into account, not the "first" one. 80 | } 81 | for(my $n=1;$n<=4;$n++) { 82 | my %REF_NGRAM_N = (); 83 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 84 | my $ngram = "$n"; 85 | for(my $w=0;$w<$n;$w++) { 86 | $ngram .= " ".$WORD[$start+$w]; 87 | } 88 | $REF_NGRAM_N{$ngram}++; 89 | } 90 | foreach my $ngram (keys %REF_NGRAM_N) { 91 | if (!defined($REF_NGRAM{$ngram}) || 92 | $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) { 93 | $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram}; 94 | # print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram}
\n"; 95 | } 96 | } 97 | } 98 | } 99 | $length_translation += $length_translation_this_sentence; 100 | $length_reference += $closest_length; 101 | for(my $n=1;$n<=4;$n++) { 102 | my %T_NGRAM = (); 103 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 104 | my $ngram = "$n"; 105 | for(my $w=0;$w<$n;$w++) { 106 | $ngram .= " ".$WORD[$start+$w]; 107 | } 108 | $T_NGRAM{$ngram}++; 109 | } 110 | foreach my $ngram (keys %T_NGRAM) { 111 | $ngram =~ /^(\d+) /; 112 | my $n = $1; 113 | # my $corr = 0; 114 | # print "$i e $ngram $T_NGRAM{$ngram}
\n"; 115 | $TOTAL[$n] += $T_NGRAM{$ngram}; 116 | if (defined($REF_NGRAM{$ngram})) { 117 | if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) { 118 | $CORRECT[$n] += $T_NGRAM{$ngram}; 119 | # $corr = $T_NGRAM{$ngram}; 120 | # print "$i e correct1 $T_NGRAM{$ngram}
\n"; 121 | } 122 | else { 123 | $CORRECT[$n] += $REF_NGRAM{$ngram}; 124 | # $corr = $REF_NGRAM{$ngram}; 125 | # print "$i e correct2 $REF_NGRAM{$ngram}
\n"; 126 | } 127 | } 128 | # $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram}; 129 | # print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n" 130 | } 131 | } 132 | $s++; 133 | } 134 | my $brevity_penalty = 1; 135 | my $bleu = 0; 136 | 137 | my @bleu=(); 138 | 139 | for(my $n=1;$n<=4;$n++) { 140 | if (defined ($TOTAL[$n])){ 141 | $bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0; 142 | # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n"; 143 | }else{ 144 | $bleu[$n]=0; 145 | } 146 | } 147 | 148 | if ($length_reference==0){ 149 | printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n"; 150 | exit(1); 151 | } 152 | 153 | if ($length_translation<$length_reference) { 154 | $brevity_penalty = exp(1-$length_reference/$length_translation); 155 | } 156 | $bleu = $brevity_penalty * exp((my_log( $bleu[1] ) + 157 | my_log( $bleu[2] ) + 158 | my_log( $bleu[3] ) + 159 | my_log( $bleu[4] ) ) / 4) ; 160 | printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n", 161 | 100*$bleu, 162 | 100*$bleu[1], 163 | 100*$bleu[2], 164 | 100*$bleu[3], 165 | 100*$bleu[4], 166 | $brevity_penalty, 167 | $length_translation / $length_reference, 168 | $length_translation, 169 | $length_reference; 170 | 171 | 172 | print STDERR "It is in-advisable to publish scores from multi-bleu.perl. The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups. Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization. Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.\n"; 173 | 174 | sub my_log { 175 | return -9999999999 unless $_[0]; 176 | return log($_[0]); 177 | } 178 | -------------------------------------------------------------------------------- /scripts/output.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | import sys,os 3 | import datetime 4 | 5 | input_file=sys.argv[1] 6 | fields = set(["TOP", "COL", "UID", "PID", "ACQ", "DUR", "VID", "TTL", "URL", "TTS", "SRC", "CMT", "LAN", "TTP", "HED", "OBT", "LBT", "END", "CC1"]) 7 | 8 | with open(input_file) as f: 9 | content=f.readlines() 10 | content=[x.strip() for x in content] 11 | f.close() 12 | 13 | with open("tmp.txt.pred") as f: 14 | pred_content=f.readlines() 15 | pred_content=[x.strip() for x in pred_content] 16 | f.close() 17 | 18 | f1=open(input_file+".pred",'a') 19 | sent_index=0 20 | credit_flag=0 21 | lang="" 22 | for line in content: 23 | l=line.split('|') 24 | if l[0] in fields: 25 | print(line,file=f1) 26 | if l[0]=="LAN": 27 | lang=l[1] 28 | elif l[0] not in fields: 29 | if not credit_flag: 30 | timestamp=datetime.datetime.now().strftime("%Y-%m-%d %H:%M") 31 | source_program="Neural Machine Translation 1.0, translate.sh" 32 | source_person="Vikrant Goyal" 33 | print(lang+"_01" + '|' + timestamp + '|' + "Source_Program=" + source_program + '|' + "Source_Person=" + source_person ,file=f1) 34 | credit_flag=1 35 | l[2]=lang+"_01" 36 | l[3]=pred_content[sent_index] 37 | print(l[0]+'|'+l[1]+'|'+l[2]+'|'+l[3],file=f1) 38 | sent_index+=1 39 | f1.close() 40 | -------------------------------------------------------------------------------- /scripts/parse.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | import sys,os 3 | 4 | input_file=sys.argv[1] 5 | 6 | with open(input_file) as f: 7 | 
content=f.readlines() 8 | content=[x.strip() for x in content] 9 | f.close() 10 | 11 | fields = set(["TOP", "COL", "UID", "PID", "ACQ", "DUR", "VID", "TTL", "URL", "TTS", "SRC", "CMT", "LAN", "TTP", "HED", "OBT", "LBT", "END", "CC1"]) 12 | 13 | f1=open("tmp.txt",'a') 14 | for line in content: 15 | l=line.split('|') 16 | if l[0] not in fields: 17 | sent=l[3] 18 | print(sent,file=f1) 19 | f1.close() 20 | 21 | -------------------------------------------------------------------------------- /scripts/prepare_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | src=$1 4 | tgt=$2 5 | lang=$1-$2 6 | script=../../Neural-Machine-Translation/scripts/ 7 | 8 | python $script/strip.py train.$lang.$src train.$lang.$tgt 9 | perl $script/lowercase.perl < train.$lang.$src.cleaned > train.$lang.$src.cleaned.low 10 | perl $script/lowercase.perl < train.$lang.$tgt.cleaned > train.$lang.$tgt.cleaned.low 11 | perl $script/tokenizer.perl -l $src < train.$lang.$src.cleaned.low > train.$lang.$src.cleaned.low.tok 12 | perl $script/tokenizer.perl -l $tgt < train.$lang.$tgt.cleaned.low > train.$lang.$tgt.cleaned.low.tok 13 | cat train.$lang.$src.cleaned.low.tok train.$lang.$tgt.cleaned.low.tok | ~/Neural-Machine-Translation/subword-nmt/learn_bpe.py -s 32000 > ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 14 | ~/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < train.$lang.$src.cleaned.low.tok > train.$lang.$src.cleaned.low.tok.bpe 15 | ~/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < train.$lang.$tgt.cleaned.low.tok > train.$lang.$tgt.cleaned.low.tok.bpe 16 | mv train.$lang.$src.cleaned.low.tok.bpe train.$lang.$src.processed 17 | mv train.$lang.$tgt.cleaned.low.tok.bpe train.$lang.$tgt.processed 18 | 19 | 20 | python $script/strip.py valid.$lang.$src valid.$lang.$tgt 21 | perl $script/lowercase.perl < valid.$lang.$src.cleaned > valid.$lang.$src.cleaned.low 22 | perl $script/lowercase.perl < valid.$lang.$tgt.cleaned > valid.$lang.$tgt.cleaned.low 23 | perl $script/tokenizer.perl -l $src < valid.$lang.$src.cleaned.low > valid.$lang.$src.cleaned.low.tok 24 | perl $script/tokenizer.perl -l $tgt < valid.$lang.$tgt.cleaned.low > valid.$lang.$tgt.cleaned.low.tok 25 | ~/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < valid.$lang.$src.cleaned.low.tok > valid.$lang.$src.cleaned.low.tok.bpe 26 | ~/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < valid.$lang.$tgt.cleaned.low.tok > valid.$lang.$tgt.cleaned.low.tok.bpe 27 | mv valid.$lang.$src.cleaned.low.tok.bpe valid.$lang.$src.processed 28 | mv valid.$lang.$tgt.cleaned.low.tok.bpe valid.$lang.$tgt.processed 29 | 30 | 31 | python $script/strip.py test.$lang.$src test.$lang.$tgt 32 | perl $script/lowercase.perl < test.$lang.$src.cleaned > test.$lang.$src.cleaned.low 33 | perl $script/lowercase.perl < test.$lang.$tgt.cleaned > test.$lang.$tgt.cleaned.low 34 | perl $script/tokenizer.perl -l $src < test.$lang.$src.cleaned.low > test.$lang.$src.cleaned.low.tok 35 | perl $script/tokenizer.perl -l $tgt < test.$lang.$tgt.cleaned.low > test.$lang.$tgt.cleaned.low.tok 36 | ~/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < test.$lang.$src.cleaned.low.tok > test.$lang.$src.cleaned.low.tok.bpe 37 | 
~/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c ~/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < test.$lang.$tgt.cleaned.low.tok > test.$lang.$tgt.cleaned.low.tok.bpe 38 | mv test.$lang.$src.cleaned.low.tok.bpe test.$lang.$src.processed 39 | mv test.$lang.$tgt.cleaned.low.tok.bpe test.$lang.$tgt.processed 40 | 41 | rm *tok 42 | rm *cleaned 43 | rm *low 44 | 45 | -------------------------------------------------------------------------------- /scripts/preprocess.py: -------------------------------------------------------------------------------- 1 | ##python code to clean and tokenize a file 2 | from __future__ import print_function 3 | import string 4 | import re,sys,os,codecs 5 | from unicodedata import normalize 6 | from mosestokenizer import * 7 | # load document into memory 8 | def load_doc(filename): 9 | # open the file as read only 10 | file = codecs.open(filename, 'r', encoding='utf-8') 11 | # read all text 12 | text = file.read() 13 | # close the file 14 | file.close() 15 | return text 16 | 17 | # split a loaded document into sentences 18 | def to_sentences(doc): 19 | return doc.strip().split('\n') 20 | 21 | # clean a list of lines 22 | def clean_lines(lines,lang): 23 | cleaned = list() 24 | tokenize=MosesTokenizer(lang) 25 | for line in lines: 26 | # tokenize usig moses tokenizer 27 | line=tokenize(line) 28 | # convert to lower case 29 | line = [word.lower() for word in line] 30 | # store it as a string 31 | line = ' '.join(line) 32 | cleaned.append(line) 33 | return cleaned 34 | 35 | # save a list of clean sentences to file 36 | def save_clean_sentences(sentences, filename): 37 | fout=open(filename,'a') 38 | for line in sentences: 39 | print(line, file=fout) 40 | fout.close() 41 | print('Saved: %s' % filename) 42 | 43 | if __name__=="__main__": 44 | # load data for cleaning 45 | filename = sys.argv[1] 46 | lang=sys.argv[2] 47 | doc = load_doc(filename) 48 | sentences = to_sentences(doc) 49 | print(sentences[-1]) 50 | sentences = clean_lines(sentences,lang) 51 | print(sentences[-1]) 52 | save_clean_sentences(sentences, filename+'.processed') 53 | -------------------------------------------------------------------------------- /scripts/sgm.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 
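# (Added note) This appears to be Moses' input-from-sgm.perl (see the usage message
# below): it reads a WMT-style .sgm test-set file on STDIN, joins <seg> elements that
# run across several physical lines, and prints the plain text of each segment, one
# sentence per line.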
5 | 6 | use warnings; 7 | use strict; 8 | 9 | die("ERROR syntax: input-from-sgm.perl < in.sgm > in.txt") 10 | unless scalar @ARGV == 0; 11 | 12 | while(my $line = ) { 13 | chop($line); 14 | while ($line =~ /]+>\s*$/i) { 15 | my $next_line = ; 16 | $line .= $next_line; 17 | chop($line); 18 | } 19 | while ($line =~ /]+>\s*(.*)\s*$/i && 20 | $line !~ /]+>\s*(.*)\s*<\/seg>/i) { 21 | my $next_line = ; 22 | $line .= $next_line; 23 | chop($line); 24 | } 25 | if ($line =~ /]+>\s*(.*)\s*<\/seg>/i) { 26 | my $input = $1; 27 | $input =~ s/\s+/ /g; 28 | $input =~ s/^ //g; 29 | $input =~ s/ $//g; 30 | print $input."\n"; 31 | } 32 | } 33 | -------------------------------------------------------------------------------- /scripts/strip.py: -------------------------------------------------------------------------------- 1 | ##python code to tokenize a file 2 | from __future__ import print_function 3 | from itertools import izip 4 | import string 5 | import re,sys,os,codecs 6 | file1=sys.argv[1] 7 | file2=sys.argv[2] 8 | fout1=open(file1+".cleaned",'a') 9 | fout2=open(file2+".cleaned",'a') 10 | with open(file1) as f1, open(file2) as f2: 11 | text1=f1.read().split('\n') 12 | text2=f2.read().split('\n') 13 | for x,y in izip(text1,text2): 14 | x=x.strip() 15 | y=y.strip() 16 | #filtrate = re.compile(u'[^\u4E00-\u9FA5]') 17 | #y=y.decode("utf-8") 18 | #y=filtrate.sub(r'',y) 19 | #y=y.encode("utf-8") 20 | x=x.decode("utf-8").replace(u"\uFDD3",'').encode("utf-8") 21 | y=y.decode("utf-8").replace(u"\uFDD3",'').encode("utf-8") 22 | if len(x)>=1 and len(y)>=1: 23 | print(x,file=fout1) 24 | print(y,file=fout2) 25 | f1.close() 26 | f2.close() 27 | -------------------------------------------------------------------------------- /scripts/tokenizer.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 
5 | 6 | use warnings; 7 | 8 | # Sample Tokenizer 9 | ### Version 1.1 10 | # written by Pidong Wang, based on the code written by Josh Schroeder and Philipp Koehn 11 | # Version 1.1 updates: 12 | # (1) add multithreading option "-threads NUM_THREADS" (default is 1); 13 | # (2) add a timing option "-time" to calculate the average speed of this tokenizer; 14 | # (3) add an option "-lines NUM_SENTENCES_PER_THREAD" to set the number of lines for each thread (default is 2000), and this option controls the memory amount needed: the larger this number is, the larger memory is required (the higher tokenization speed); 15 | ### Version 1.0 16 | # $Id: tokenizer.perl 915 2009-08-10 08:15:49Z philipp $ 17 | # written by Josh Schroeder, based on code by Philipp Koehn 18 | 19 | binmode(STDIN, ":utf8"); 20 | binmode(STDOUT, ":utf8"); 21 | 22 | use warnings; 23 | use FindBin qw($RealBin); 24 | use strict; 25 | use Time::HiRes; 26 | 27 | if (eval {require Thread;1;}) { 28 | #module loaded 29 | Thread->import(); 30 | } 31 | 32 | my $mydir = "$RealBin/../share/nonbreaking_prefixes"; 33 | 34 | my %NONBREAKING_PREFIX = (); 35 | my @protected_patterns = (); 36 | my $protected_patterns_file = ""; 37 | my $language = "en"; 38 | my $QUIET = 0; 39 | my $HELP = 0; 40 | my $AGGRESSIVE = 0; 41 | my $SKIP_XML = 0; 42 | my $TIMING = 0; 43 | my $NUM_THREADS = 1; 44 | my $NUM_SENTENCES_PER_THREAD = 2000; 45 | my $PENN = 0; 46 | my $NO_ESCAPING = 0; 47 | while (@ARGV) 48 | { 49 | $_ = shift; 50 | /^-b$/ && ($| = 1, next); 51 | /^-l$/ && ($language = shift, next); 52 | /^-q$/ && ($QUIET = 1, next); 53 | /^-h$/ && ($HELP = 1, next); 54 | /^-x$/ && ($SKIP_XML = 1, next); 55 | /^-a$/ && ($AGGRESSIVE = 1, next); 56 | /^-time$/ && ($TIMING = 1, next); 57 | # Option to add list of regexps to be protected 58 | /^-protected/ && ($protected_patterns_file = shift, next); 59 | /^-threads$/ && ($NUM_THREADS = int(shift), next); 60 | /^-lines$/ && ($NUM_SENTENCES_PER_THREAD = int(shift), next); 61 | /^-penn$/ && ($PENN = 1, next); 62 | /^-no-escape/ && ($NO_ESCAPING = 1, next); 63 | } 64 | 65 | # for time calculation 66 | my $start_time; 67 | if ($TIMING) 68 | { 69 | $start_time = [ Time::HiRes::gettimeofday( ) ]; 70 | } 71 | 72 | # print help message 73 | if ($HELP) 74 | { 75 | print "Usage ./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile\n"; 76 | print "Options:\n"; 77 | print " -q ... quiet.\n"; 78 | print " -a ... aggressive hyphen splitting.\n"; 79 | print " -b ... disable Perl buffering.\n"; 80 | print " -time ... enable processing time calculation.\n"; 81 | print " -penn ... use Penn treebank-like tokenization.\n"; 82 | print " -protected FILE ... specify file with patters to be protected in tokenisation.\n"; 83 | print " -no-escape ... 
don't perform HTML escaping on apostrophy, quotes, etc.\n"; 84 | exit; 85 | } 86 | 87 | if (!$QUIET) 88 | { 89 | print STDERR "Tokenizer Version 1.1\n"; 90 | print STDERR "Language: $language\n"; 91 | print STDERR "Number of threads: $NUM_THREADS\n"; 92 | } 93 | 94 | # load the language-specific non-breaking prefix info from files in the directory nonbreaking_prefixes 95 | load_prefixes($language,\%NONBREAKING_PREFIX); 96 | 97 | if (scalar(%NONBREAKING_PREFIX) eq 0) 98 | { 99 | print STDERR "Warning: No known abbreviations for language '$language'\n"; 100 | } 101 | 102 | # Load protected patterns 103 | if ($protected_patterns_file) 104 | { 105 | open(PP,$protected_patterns_file) || die "Unable to open $protected_patterns_file"; 106 | while() { 107 | chomp; 108 | push @protected_patterns, $_; 109 | } 110 | } 111 | 112 | my @batch_sentences = (); 113 | my @thread_list = (); 114 | my $count_sentences = 0; 115 | 116 | if ($NUM_THREADS > 1) 117 | {# multi-threading tokenization 118 | while() 119 | { 120 | $count_sentences = $count_sentences + 1; 121 | push(@batch_sentences, $_); 122 | if (scalar(@batch_sentences)>=($NUM_SENTENCES_PER_THREAD*$NUM_THREADS)) 123 | { 124 | # assign each thread work 125 | for (my $i=0; $i<$NUM_THREADS; $i++) 126 | { 127 | my $start_index = $i*$NUM_SENTENCES_PER_THREAD; 128 | my $end_index = $start_index+$NUM_SENTENCES_PER_THREAD-1; 129 | my @subbatch_sentences = @batch_sentences[$start_index..$end_index]; 130 | my $new_thread = new Thread \&tokenize_batch, @subbatch_sentences; 131 | push(@thread_list, $new_thread); 132 | } 133 | foreach (@thread_list) 134 | { 135 | my $tokenized_list = $_->join; 136 | foreach (@$tokenized_list) 137 | { 138 | print $_; 139 | } 140 | } 141 | # reset for the new run 142 | @thread_list = (); 143 | @batch_sentences = (); 144 | } 145 | } 146 | # the last batch 147 | if (scalar(@batch_sentences)>0) 148 | { 149 | # assign each thread work 150 | for (my $i=0; $i<$NUM_THREADS; $i++) 151 | { 152 | my $start_index = $i*$NUM_SENTENCES_PER_THREAD; 153 | if ($start_index >= scalar(@batch_sentences)) 154 | { 155 | last; 156 | } 157 | my $end_index = $start_index+$NUM_SENTENCES_PER_THREAD-1; 158 | if ($end_index >= scalar(@batch_sentences)) 159 | { 160 | $end_index = scalar(@batch_sentences)-1; 161 | } 162 | my @subbatch_sentences = @batch_sentences[$start_index..$end_index]; 163 | my $new_thread = new Thread \&tokenize_batch, @subbatch_sentences; 164 | push(@thread_list, $new_thread); 165 | } 166 | foreach (@thread_list) 167 | { 168 | my $tokenized_list = $_->join; 169 | foreach (@$tokenized_list) 170 | { 171 | print $_; 172 | } 173 | } 174 | } 175 | } 176 | else 177 | {# single thread only 178 | while() 179 | { 180 | if (($SKIP_XML && /^<.+>$/) || /^\s*$/) 181 | { 182 | #don't try to tokenize XML/HTML tag lines 183 | print $_; 184 | } 185 | else 186 | { 187 | print &tokenize($_); 188 | } 189 | } 190 | } 191 | 192 | if ($TIMING) 193 | { 194 | my $duration = Time::HiRes::tv_interval( $start_time ); 195 | print STDERR ("TOTAL EXECUTION TIME: ".$duration."\n"); 196 | print STDERR ("TOKENIZATION SPEED: ".($duration/$count_sentences*1000)." 
milliseconds/line\n"); 197 | } 198 | 199 | ##################################################################################### 200 | # subroutines afterward 201 | 202 | # tokenize a batch of texts saved in an array 203 | # input: an array containing a batch of texts 204 | # return: another array containing a batch of tokenized texts for the input array 205 | sub tokenize_batch 206 | { 207 | my(@text_list) = @_; 208 | my(@tokenized_list) = (); 209 | foreach (@text_list) 210 | { 211 | if (($SKIP_XML && /^<.+>$/) || /^\s*$/) 212 | { 213 | #don't try to tokenize XML/HTML tag lines 214 | push(@tokenized_list, $_); 215 | } 216 | else 217 | { 218 | push(@tokenized_list, &tokenize($_)); 219 | } 220 | } 221 | return \@tokenized_list; 222 | } 223 | 224 | # the actual tokenize function which tokenizes one input string 225 | # input: one string 226 | # return: the tokenized string for the input string 227 | sub tokenize 228 | { 229 | my($text) = @_; 230 | 231 | if ($PENN) { 232 | return tokenize_penn($text); 233 | } 234 | 235 | chomp($text); 236 | $text = " $text "; 237 | 238 | # remove ASCII junk 239 | $text =~ s/\s+/ /g; 240 | $text =~ s/[\000-\037]//g; 241 | 242 | # Find protected patterns 243 | my @protected = (); 244 | foreach my $protected_pattern (@protected_patterns) { 245 | my $t = $text; 246 | while ($t =~ /(?$protected_pattern)(?.*)$/) { 247 | push @protected, $+{PATTERN}; 248 | $t = $+{TAIL}; 249 | } 250 | } 251 | 252 | for (my $i = 0; $i < scalar(@protected); ++$i) { 253 | my $subst = sprintf("THISISPROTECTED%.3d", $i); 254 | $text =~ s,\Q$protected[$i], $subst ,g; 255 | } 256 | $text =~ s/ +/ /g; 257 | $text =~ s/^ //g; 258 | $text =~ s/ $//g; 259 | 260 | # separate out all "other" special characters 261 | if (($language eq "fi") or ($language eq "sv")) { 262 | # in Finnish and Swedish, the colon can be used inside words as an apostrophe-like character: 263 | # USA:n, 20:een, EU:ssa, USA:s, S:t 264 | $text =~ s/([^\p{IsAlnum}\s\.\:\'\`\,\-])/ $1 /g; 265 | # if a colon is not immediately followed by lower-case characters, separate it out anyway 266 | $text =~ s/(:)(?=$|[^\p{Ll}])/ $1 /g; 267 | } 268 | else { 269 | $text =~ s/([^\p{IsAlnum}\s\.\'\`\,\-])/ $1 /g; 270 | } 271 | 272 | # aggressive hyphen splitting 273 | if ($AGGRESSIVE) 274 | { 275 | $text =~ s/([\p{IsAlnum}])\-(?=[\p{IsAlnum}])/$1 \@-\@ /g; 276 | } 277 | 278 | #multi-dots stay together 279 | $text =~ s/\.([\.]+)/ DOTMULTI$1/g; 280 | while($text =~ /DOTMULTI\./) 281 | { 282 | $text =~ s/DOTMULTI\.([^\.])/DOTDOTMULTI $1/g; 283 | $text =~ s/DOTMULTI\./DOTDOTMULTI/g; 284 | } 285 | 286 | # seperate out "," except if within numbers (5,300) 287 | #$text =~ s/([^\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g; 288 | 289 | # separate out "," except if within numbers (5,300) 290 | # previous "global" application skips some: A,B,C,D,E > A , B,C , D,E 291 | # first application uses up B so rule can't see B,C 292 | # two-step version here may create extra spaces but these are removed later 293 | # will also space digit,letter or letter,digit forms (redundant with next section) 294 | $text =~ s/([^\p{IsN}])[,]/$1 , /g; 295 | $text =~ s/[,]([^\p{IsN}])/ , $1/g; 296 | 297 | # separate "," after a number if it's the end of a sentence 298 | $text =~ s/([\p{IsN}])[,]$/$1 ,/g; 299 | 300 | # separate , pre and post number 301 | #$text =~ s/([\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g; 302 | #$text =~ s/([^\p{IsN}])[,]([\p{IsN}])/$1 , $2/g; 303 | 304 | # turn `into ' 305 | #$text =~ s/\`/\'/g; 306 | 307 | #turn '' into " 308 | #$text =~ s/\'\'/ \" /g; 309 | 310 | 
if ($language eq "en") 311 | { 312 | #split contractions right 313 | $text =~ s/([^\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g; 314 | $text =~ s/([^\p{IsAlpha}\p{IsN}])[']([\p{IsAlpha}])/$1 ' $2/g; 315 | $text =~ s/([\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g; 316 | $text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 '$2/g; 317 | #special case for "1990's" 318 | $text =~ s/([\p{IsN}])[']([s])/$1 '$2/g; 319 | } 320 | elsif (($language eq "fr") or ($language eq "it") or ($language eq "ga")) 321 | { 322 | #split contractions left 323 | $text =~ s/([^\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g; 324 | $text =~ s/([^\p{IsAlpha}])[']([\p{IsAlpha}])/$1 ' $2/g; 325 | $text =~ s/([\p{IsAlpha}])[']([^\p{IsAlpha}])/$1 ' $2/g; 326 | $text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1' $2/g; 327 | } 328 | else 329 | { 330 | $text =~ s/\'/ \' /g; 331 | } 332 | 333 | #word token method 334 | my @words = split(/\s/,$text); 335 | $text = ""; 336 | for (my $i=0;$i<(scalar(@words));$i++) 337 | { 338 | my $word = $words[$i]; 339 | if ( $word =~ /^(\S+)\.$/) 340 | { 341 | my $pre = $1; 342 | if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i/\>/g; # xml 386 | $text =~ s/\'/\'/g; # xml 387 | $text =~ s/\"/\"/g; # xml 388 | $text =~ s/\[/\[/g; # syntax non-terminal 389 | $text =~ s/\]/\]/g; # syntax non-terminal 390 | } 391 | 392 | #ensure final line break 393 | $text .= "\n" unless $text =~ /\n$/; 394 | 395 | return $text; 396 | } 397 | 398 | sub tokenize_penn 399 | { 400 | # Improved compatibility with Penn Treebank tokenization. Useful if 401 | # the text is to later be parsed with a PTB-trained parser. 402 | # 403 | # Adapted from Robert MacIntyre's sed script: 404 | # http://www.cis.upenn.edu/~treebank/tokenizer.sed 405 | 406 | my($text) = @_; 407 | chomp($text); 408 | 409 | # remove ASCII junk 410 | $text =~ s/\s+/ /g; 411 | $text =~ s/[\000-\037]//g; 412 | 413 | # attempt to get correct directional quotes 414 | $text =~ s/^``/`` /g; 415 | $text =~ s/^"/`` /g; 416 | $text =~ s/^`([^`])/` $1/g; 417 | $text =~ s/^'/` /g; 418 | $text =~ s/([ ([{<])"/$1 `` /g; 419 | $text =~ s/([ ([{<])``/$1 `` /g; 420 | $text =~ s/([ ([{<])`([^`])/$1 ` $2/g; 421 | $text =~ s/([ ([{<])'/$1 ` /g; 422 | # close quotes handled at end 423 | 424 | $text =~ s=\.\.\.= _ELLIPSIS_ =g; 425 | 426 | # separate out "," except if within numbers (5,300) 427 | $text =~ s/([^\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g; 428 | # separate , pre and post number 429 | $text =~ s/([\p{IsN}])[,]([^\p{IsN}])/$1 , $2/g; 430 | $text =~ s/([^\p{IsN}])[,]([\p{IsN}])/$1 , $2/g; 431 | 432 | #$text =~ s=([;:@#\$%&\p{IsSc}])= $1 =g; 433 | $text =~ s=([;:@#\$%&\p{IsSc}\p{IsSo}])= $1 =g; 434 | 435 | # Separate out intra-token slashes. PTB tokenization doesn't do this, so 436 | # the tokens should be merged prior to parsing with a PTB-trained parser 437 | # (see syntax-hyphen-splitting.perl). 438 | $text =~ s/([\p{IsAlnum}])\/([\p{IsAlnum}])/$1 \@\/\@ $2/g; 439 | 440 | # Assume sentence tokenization has been done first, so split FINAL periods 441 | # only. 442 | $text =~ s=([^.])([.])([\]\)}>"']*) ?$=$1 $2$3 =g; 443 | # however, we may as well split ALL question marks and exclamation points, 444 | # since they shouldn't have the abbrev.-marker ambiguity problem 445 | $text =~ s=([?!])= $1 =g; 446 | 447 | # parentheses, brackets, etc. 
448 | $text =~ s=([\]\[\(\){}<>])= $1 =g; 449 | $text =~ s/\(/-LRB-/g; 450 | $text =~ s/\)/-RRB-/g; 451 | $text =~ s/\[/-LSB-/g; 452 | $text =~ s/\]/-RSB-/g; 453 | $text =~ s/{/-LCB-/g; 454 | $text =~ s/}/-RCB-/g; 455 | 456 | $text =~ s=--= -- =g; 457 | 458 | # First off, add a space to the beginning and end of each line, to reduce 459 | # necessary number of regexps. 460 | $text =~ s=$= =; 461 | $text =~ s=^= =; 462 | 463 | $text =~ s="= '' =g; 464 | # possessive or close-single-quote 465 | $text =~ s=([^'])' =$1 ' =g; 466 | # as in it's, I'm, we'd 467 | $text =~ s='([sSmMdD]) = '$1 =g; 468 | $text =~ s='ll = 'll =g; 469 | $text =~ s='re = 're =g; 470 | $text =~ s='ve = 've =g; 471 | $text =~ s=n't = n't =g; 472 | $text =~ s='LL = 'LL =g; 473 | $text =~ s='RE = 'RE =g; 474 | $text =~ s='VE = 'VE =g; 475 | $text =~ s=N'T = N'T =g; 476 | 477 | $text =~ s= ([Cc])annot = $1an not =g; 478 | $text =~ s= ([Dd])'ye = $1' ye =g; 479 | $text =~ s= ([Gg])imme = $1im me =g; 480 | $text =~ s= ([Gg])onna = $1on na =g; 481 | $text =~ s= ([Gg])otta = $1ot ta =g; 482 | $text =~ s= ([Ll])emme = $1em me =g; 483 | $text =~ s= ([Mm])ore'n = $1ore 'n =g; 484 | $text =~ s= '([Tt])is = '$1 is =g; 485 | $text =~ s= '([Tt])was = '$1 was =g; 486 | $text =~ s= ([Ww])anna = $1an na =g; 487 | 488 | #word token method 489 | my @words = split(/\s/,$text); 490 | $text = ""; 491 | for (my $i=0;$i<(scalar(@words));$i++) 492 | { 493 | my $word = $words[$i]; 494 | if ( $word =~ /^(\S+)\.$/) 495 | { 496 | my $pre = $1; 497 | if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i/\>/g; # xml 526 | $text =~ s/\'/\'/g; # xml 527 | $text =~ s/\"/\"/g; # xml 528 | $text =~ s/\[/\[/g; # syntax non-terminal 529 | $text =~ s/\]/\]/g; # syntax non-terminal 530 | 531 | #ensure final line break 532 | $text .= "\n" unless $text =~ /\n$/; 533 | 534 | return $text; 535 | } 536 | 537 | sub load_prefixes 538 | { 539 | my ($language, $PREFIX_REF) = @_; 540 | 541 | my $prefixfile = "$mydir/nonbreaking_prefix.$language"; 542 | 543 | #default back to English if we don't have a language-specific prefix file 544 | if (!(-e $prefixfile)) 545 | { 546 | $prefixfile = "$mydir/nonbreaking_prefix.en"; 547 | print STDERR "WARNING: No known abbreviations for language '$language', attempting fall-back to English version...\n"; 548 | die ("ERROR: No abbreviations files found in $mydir\n") unless (-e $prefixfile); 549 | } 550 | 551 | if (-e "$prefixfile") 552 | { 553 | open(PREFIX, "<:utf8", "$prefixfile"); 554 | while () 555 | { 556 | my $item = $_; 557 | chomp($item); 558 | if (($item) && (substr($item,0,1) ne "#")) 559 | { 560 | if ($item =~ /(.*)[\s]+(\#NUMERIC_ONLY\#)/) 561 | { 562 | $PREFIX_REF->{$1} = 2; 563 | } 564 | else 565 | { 566 | $PREFIX_REF->{$item} = 1; 567 | } 568 | } 569 | } 570 | close(PREFIX); 571 | } 572 | } 573 | -------------------------------------------------------------------------------- /scripts/train.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH -N 1 3 | #SBATCH -c 12 4 | #SBATCH --mem-per-cpu=6G 5 | #SBATCH -p gpu -C gpuk40 --gres=gpu:1 6 | #SBATCH --time=10-00:30:00 7 | #SBATCH --mail-type=ALL 8 | #SBATCH --output=slurm-train.out 9 | #SBATCH --job-name="nmt-train" 10 | 11 | if [[ $# != 2 ]] ; then 12 | echo 'Error, command should be: ' 13 | exit 1 14 | fi 15 | 16 | src=$1 17 | tgt=$2 18 | lang=${1}-${2} 19 | 20 | export HOME=$(pwd)/../.. 
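# (Added note) The usage message in the argument check above appears to have lost its
# placeholders; the expected call is presumably: sbatch (or bash) train.sh <src> <tgt>
# e.g. "train.sh de en", launched from Neural-Machine-Translation/scripts so that
# HOME above resolves to the nmt directory two levels up.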
21 | export DATA=$HOME/data 22 | export DATA_PREP=$DATA/$lang 23 | export MODELS=$HOME/models/$lang 24 | export SCRIPT=$HOME/Neural-Machine-Translation/scripts 25 | 26 | if [ ! -d "$HOME/models" ]; then 27 | mkdir $HOME/models 28 | fi 29 | 30 | module load singularity/2.5.1 31 | cd $HOME/singularity 32 | singularity shell -w --nv rh_xenial_20180308.img 33 | 34 | cd $SCRIPT 35 | source $HOME/myenv/bin/activate 36 | 37 | ##Creates data in a format required by train.py 38 | python ../preprocess.py \ 39 | -train_src $DATA_PREP/train.$lang.$src.processed \ 40 | -train_tgt $DATA_PREP/train.$lang.$tgt.processed \ 41 | -train_xe_src $DATA_PREP/train.$lang.$src.processed \ 42 | -train_xe_tgt $DATA_PREP/train.$lang.$tgt.processed \ 43 | -train_pg_src $DATA_PREP/train.$lang.$src.processed \ 44 | -train_pg_tgt $DATA_PREP/train.$lang.$tgt.processed \ 45 | -valid_src $DATA_PREP/valid.$lang.$src.processed \ 46 | -valid_tgt $DATA_PREP/valid.$lang.$tgt.processed \ 47 | -test_src $DATA_PREP/test.$lang.$src.processed \ 48 | -test_tgt $DATA_PREP/test.$lang.$tgt.processed \ 49 | -save_data $DATA_PREP/processed_all 50 | 51 | ##Train a model(might take days for training) 52 | python $HOME/Neural-Machine-Translation/train.py -data $DATA_PREP/processed_all-train.pt -layers 4 -word_vec_size 512 -brnn -batch_size 128 -dropout 0.3 -save_dir $MODELS -end_epoch 15 53 | -------------------------------------------------------------------------------- /scripts/translate.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH -N 1 3 | #SBATCH -c 12 4 | #SBATCH --mem-per-cpu=4G 5 | #SBATCH -p gpu -C gpuk40 --gres=gpu:1 6 | #SBATCH --time=10-00:30:00 7 | #SBATCH --mail-type=ALL 8 | #SBATCH --output=slurm-translate.out 9 | #SBATCH --job-name="nmt-translate" 10 | 11 | if [[ $# != 4 ]] ; then 12 | echo 'Error, command should be: ' 13 | exit 1 14 | fi 15 | 16 | src=$1 17 | tgt=$2 18 | input_file=$3 19 | lang=${1}-${2} 20 | toggle=$4 21 | 22 | export HOME=$(pwd)/../.. 
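# (Added note) As with train.sh, the usage message above appears to have lost its
# placeholders; the expected call is presumably:
#   sbatch (or bash) translate.sh <src> <tgt> <input_file> <toggle>
# where toggle=0 translates a Red Hen news-transcript file (parse.py / output.py strip
# and re-insert the metadata fields) and toggle=1 translates a plain
# one-sentence-per-line text file.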
23 | export DATA=$HOME/data 24 | export DATA_PREP=$DATA/$lang 25 | export MODELS=$HOME/models/$lang 26 | export SCRIPT=$HOME/Neural-Machine-Translation/scripts 27 | 28 | module load singularity/2.5.1 29 | cd $HOME/singularity 30 | singularity shell -w --nv rh_xenial_20180308.img 31 | 32 | cd $SCRIPT 33 | source $HOME/myenv/bin/activate 34 | 35 | if [ $toggle -eq 0 ] 36 | then 37 | python $SCRIPT/parse.py $input_file 38 | if [[ $src = "zh" ]] 39 | then 40 | bash $HOME/stanford-segmenter-2018-02-27/segment.sh pku tmp.txt UTF-8 0 > seg.txt 41 | mv seg.txt tmp.txt 42 | fi 43 | perl $SCRIPT/tokenizer.perl -l $src < tmp.txt > tmp.txt.tok 44 | perl $SCRIPT/lowercase.perl < tmp.txt.tok > tmp.txt.tok.low 45 | $HOME/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c $HOME/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < tmp.txt.tok.low > tmp.txt.tok.low.bpe 46 | mv tmp.txt.tok.low.bpe tmp.txt 47 | python $HOME/Neural-Machine-Translation/translate.py -data $DATA_PREP/processed_all-train.pt -load_from $MODELS/model*_best.pt -test_src $SCRIPT/tmp.txt 48 | sed -r -i 's/(@@ )|(@@ ?$)//g' tmp.txt.pred 49 | python $SCRIPT/output.py $input_file 50 | rm $SCRIPT/tmp.txt* 51 | fi 52 | 53 | #### To translate a simple file not in the news transcript format: 54 | if [ $toggle -eq 1 ] 55 | then 56 | if [[ $src = "zh" ]] 57 | then 58 | bash $HOME/stanford-segmenter-2018-02-27/segment.sh ctb $input_file UTF-8 0 > tmp.txt 59 | #cp $input_file tmp.txt 60 | else 61 | cp $input_file tmp.txt 62 | fi 63 | perl $SCRIPT/tokenizer.perl -l $src < tmp.txt > tmp.txt.tok 64 | perl $SCRIPT/lowercase.perl < tmp.txt.tok > tmp.txt.tok.low 65 | $HOME/Neural-Machine-Translation/subword-nmt/apply_bpe.py -c $HOME/Neural-Machine-Translation/subword-nmt/$lang/bpe.32000 < tmp.txt.tok.low > tmp.txt.tok.low.bpe 66 | mv tmp.txt.tok.low.bpe tmp.txt 67 | python $HOME/Neural-Machine-Translation/translate.py -data $DATA_PREP/processed_all-train.pt -load_from $MODELS/model*_best.pt -test_src tmp.txt 68 | sed -r -i 's/(@@ )|(@@ ?$)//g' tmp.txt.pred 69 | cp tmp.txt.pred "$input_file.pred" 70 | rm $SCRIPT/tmp.txt* 71 | fi 72 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/README.txt: -------------------------------------------------------------------------------- 1 | The language suffix can be found here: 2 | 3 | http://www.loc.gov/standards/iso639-2/php/code_list.php 4 | 5 | This code includes data from Daniel Naber's Language Tools (czech abbreviations). 6 | This code includes data from czech wiktionary (also czech abbreviations). 
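The prefix lists that follow are consumed by load_prefixes() in scripts/tokenizer.perl: one prefix per line, lines starting with "#" are comments, and a prefix followed by #NUMERIC_ONLY# is treated as non-breaking only when the next token starts with a digit. A rough Python equivalent of that parsing, for illustration only (the file path is just an example):

```python
# Rough Python equivalent of load_prefixes() in scripts/tokenizer.perl
# (illustrative only; the pipeline itself uses the Perl tokenizer above).
import codecs
import re

def load_prefixes(path):
    prefixes = {}
    with codecs.open(path, "r", "utf-8") as f:
        for line in f:
            item = line.strip()
            if not item or item.startswith("#"):
                continue  # skip blank lines and comments
            m = re.match(r"(.*)\s+#NUMERIC_ONLY#", item)
            if m:
                prefixes[m.group(1)] = 2  # non-breaking only before digits, e.g. "No. 3"
            else:
                prefixes[item] = 1        # always non-breaking, e.g. "Dr."
    return prefixes

# e.g. prefixes = load_prefixes("share/nonbreaking_prefixes/nonbreaking_prefix.de")
```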
7 | 8 | 9 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.ca: -------------------------------------------------------------------------------- 1 | Dr 2 | Dra 3 | pàg 4 | p 5 | c 6 | av 7 | Sr 8 | Sra 9 | adm 10 | esq 11 | Prof 12 | S.A 13 | S.L 14 | p.e 15 | ptes 16 | Sta 17 | St 18 | pl 19 | màx 20 | cast 21 | dir 22 | nre 23 | fra 24 | admdora 25 | Emm 26 | Excma 27 | espf 28 | dc 29 | admdor 30 | tel 31 | angl 32 | aprox 33 | ca 34 | dept 35 | dj 36 | dl 37 | dt 38 | ds 39 | dg 40 | dv 41 | ed 42 | entl 43 | al 44 | i.e 45 | maj 46 | smin 47 | n 48 | núm 49 | pta 50 | A 51 | B 52 | C 53 | D 54 | E 55 | F 56 | G 57 | H 58 | I 59 | J 60 | K 61 | L 62 | M 63 | N 64 | O 65 | P 66 | Q 67 | R 68 | S 69 | T 70 | U 71 | V 72 | W 73 | X 74 | Y 75 | Z 76 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.cs: -------------------------------------------------------------------------------- 1 | Bc 2 | BcA 3 | Ing 4 | Ing.arch 5 | MUDr 6 | MVDr 7 | MgA 8 | Mgr 9 | JUDr 10 | PhDr 11 | RNDr 12 | PharmDr 13 | ThLic 14 | ThDr 15 | Ph.D 16 | Th.D 17 | prof 18 | doc 19 | CSc 20 | DrSc 21 | dr. h. c 22 | PaedDr 23 | Dr 24 | PhMr 25 | DiS 26 | abt 27 | ad 28 | a.i 29 | aj 30 | angl 31 | anon 32 | apod 33 | atd 34 | atp 35 | aut 36 | bd 37 | biogr 38 | b.m 39 | b.p 40 | b.r 41 | cca 42 | cit 43 | cizojaz 44 | c.k 45 | col 46 | čes 47 | čín 48 | čj 49 | ed 50 | facs 51 | fasc 52 | fol 53 | fot 54 | franc 55 | h.c 56 | hist 57 | hl 58 | hrsg 59 | ibid 60 | il 61 | ind 62 | inv.č 63 | jap 64 | jhdt 65 | jv 66 | koed 67 | kol 68 | korej 69 | kl 70 | krit 71 | lat 72 | lit 73 | m.a 74 | maď 75 | mj 76 | mp 77 | násl 78 | např 79 | nepubl 80 | něm 81 | no 82 | nr 83 | n.s 84 | okr 85 | odd 86 | odp 87 | obr 88 | opr 89 | orig 90 | phil 91 | pl 92 | pokrač 93 | pol 94 | port 95 | pozn 96 | př.kr 97 | př.n.l 98 | přel 99 | přeprac 100 | příl 101 | pseud 102 | pt 103 | red 104 | repr 105 | resp 106 | revid 107 | rkp 108 | roč 109 | roz 110 | rozš 111 | samost 112 | sect 113 | sest 114 | seš 115 | sign 116 | sl 117 | srv 118 | stol 119 | sv 120 | šk 121 | šk.ro 122 | špan 123 | tab 124 | t.č 125 | tis 126 | tj 127 | tř 128 | tzv 129 | univ 130 | uspoř 131 | vol 132 | vl.jm 133 | vs 134 | vyd 135 | vyobr 136 | zal 137 | zejm 138 | zkr 139 | zprac 140 | zvl 141 | n.p 142 | např 143 | než 144 | MUDr 145 | abl 146 | absol 147 | adj 148 | adv 149 | ak 150 | ak. 
sl 151 | akt 152 | alch 153 | amer 154 | anat 155 | angl 156 | anglosas 157 | arab 158 | arch 159 | archit 160 | arg 161 | astr 162 | astrol 163 | att 164 | bás 165 | belg 166 | bibl 167 | biol 168 | boh 169 | bot 170 | bulh 171 | círk 172 | csl 173 | č 174 | čas 175 | čes 176 | dat 177 | děj 178 | dep 179 | dět 180 | dial 181 | dór 182 | dopr 183 | dosl 184 | ekon 185 | epic 186 | etnonym 187 | eufem 188 | f 189 | fam 190 | fem 191 | fil 192 | film 193 | form 194 | fot 195 | fr 196 | fut 197 | fyz 198 | gen 199 | geogr 200 | geol 201 | geom 202 | germ 203 | gram 204 | hebr 205 | herald 206 | hist 207 | hl 208 | hovor 209 | hud 210 | hut 211 | chcsl 212 | chem 213 | ie 214 | imp 215 | impf 216 | ind 217 | indoevr 218 | inf 219 | instr 220 | interj 221 | ión 222 | iron 223 | it 224 | kanad 225 | katalán 226 | klas 227 | kniž 228 | komp 229 | konj 230 | 231 | konkr 232 | kř 233 | kuch 234 | lat 235 | lék 236 | les 237 | lid 238 | lit 239 | liturg 240 | lok 241 | log 242 | m 243 | mat 244 | meteor 245 | metr 246 | mod 247 | ms 248 | mysl 249 | n 250 | náb 251 | námoř 252 | neklas 253 | něm 254 | nesklon 255 | nom 256 | ob 257 | obch 258 | obyč 259 | ojed 260 | opt 261 | part 262 | pas 263 | pejor 264 | pers 265 | pf 266 | pl 267 | plpf 268 | 269 | práv 270 | prep 271 | předl 272 | přivl 273 | r 274 | rcsl 275 | refl 276 | reg 277 | rkp 278 | ř 279 | řec 280 | s 281 | samohl 282 | sg 283 | sl 284 | souhl 285 | spec 286 | srov 287 | stfr 288 | střv 289 | stsl 290 | subj 291 | subst 292 | superl 293 | sv 294 | sz 295 | táz 296 | tech 297 | telev 298 | teol 299 | trans 300 | typogr 301 | var 302 | vedl 303 | verb 304 | vl. jm 305 | voj 306 | vok 307 | vůb 308 | vulg 309 | výtv 310 | vztaž 311 | zahr 312 | zájm 313 | zast 314 | zejm 315 | 316 | zeměd 317 | zkr 318 | zř 319 | mj 320 | dl 321 | atp 322 | sport 323 | Mgr 324 | horn 325 | MVDr 326 | JUDr 327 | RSDr 328 | Bc 329 | PhDr 330 | ThDr 331 | Ing 332 | aj 333 | apod 334 | PharmDr 335 | pomn 336 | ev 337 | slang 338 | nprap 339 | odp 340 | dop 341 | pol 342 | st 343 | stol 344 | p. n. l 345 | před n. l 346 | n. l 347 | př. Kr 348 | po Kr 349 | př. n. l 350 | odd 351 | RNDr 352 | tzv 353 | atd 354 | tzn 355 | resp 356 | tj 357 | p 358 | br 359 | č. j 360 | čj 361 | č. p 362 | čp 363 | a. s 364 | s. r. o 365 | spol. s r. o 366 | p. o 367 | s. p 368 | v. o. s 369 | k. s 370 | o. p. s 371 | o. s 372 | v. r 373 | v z 374 | ml 375 | vč 376 | kr 377 | mld 378 | hod 379 | popř 380 | ap 381 | event 382 | rus 383 | slov 384 | rum 385 | švýc 386 | P. T 387 | zvl 388 | hor 389 | dol 390 | S.O.S -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.de: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | #no german words end in single lower-case letters, so we throw those in too. 
7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | 61 | #Roman Numerals. A dot after one of these is not a sentence break in German. 62 | I 63 | II 64 | III 65 | IV 66 | V 67 | VI 68 | VII 69 | VIII 70 | IX 71 | X 72 | XI 73 | XII 74 | XIII 75 | XIV 76 | XV 77 | XVI 78 | XVII 79 | XVIII 80 | XIX 81 | XX 82 | i 83 | ii 84 | iii 85 | iv 86 | v 87 | vi 88 | vii 89 | viii 90 | ix 91 | x 92 | xi 93 | xii 94 | xiii 95 | xiv 96 | xv 97 | xvi 98 | xvii 99 | xviii 100 | xix 101 | xx 102 | 103 | #Titles and Honorifics 104 | Adj 105 | Adm 106 | Adv 107 | Asst 108 | Bart 109 | Bldg 110 | Brig 111 | Bros 112 | Capt 113 | Cmdr 114 | Col 115 | Comdr 116 | Con 117 | Corp 118 | Cpl 119 | DR 120 | Dr 121 | Ens 122 | Gen 123 | Gov 124 | Hon 125 | Hosp 126 | Insp 127 | Lt 128 | MM 129 | MR 130 | MRS 131 | MS 132 | Maj 133 | Messrs 134 | Mlle 135 | Mme 136 | Mr 137 | Mrs 138 | Ms 139 | Msgr 140 | Op 141 | Ord 142 | Pfc 143 | Ph 144 | Prof 145 | Pvt 146 | Rep 147 | Reps 148 | Res 149 | Rev 150 | Rt 151 | Sen 152 | Sens 153 | Sfc 154 | Sgt 155 | Sr 156 | St 157 | Supt 158 | Surg 159 | 160 | #Misc symbols 161 | Mio 162 | Mrd 163 | bzw 164 | v 165 | vs 166 | usw 167 | d.h 168 | z.B 169 | u.a 170 | etc 171 | Mrd 172 | MwSt 173 | ggf 174 | d.J 175 | D.h 176 | m.E 177 | vgl 178 | I.F 179 | z.T 180 | sogen 181 | ff 182 | u.E 183 | g.U 184 | g.g.A 185 | c.-à-d 186 | Buchst 187 | u.s.w 188 | sog 189 | u.ä 190 | Std 191 | evtl 192 | Zt 193 | Chr 194 | u.U 195 | o.ä 196 | Ltd 197 | b.A 198 | z.Zt 199 | spp 200 | sen 201 | SA 202 | k.o 203 | jun 204 | i.H.v 205 | dgl 206 | dergl 207 | Co 208 | zzt 209 | usf 210 | s.p.a 211 | Dkr 212 | Corp 213 | bzgl 214 | BSE 215 | 216 | #Number indicators 217 | # add #NUMERIC_ONLY# after the word if it should ONLY be non-breaking when a 0-9 digit follows it 218 | No 219 | Nos 220 | Art 221 | Nr 222 | pp 223 | ca 224 | Ca 225 | 226 | #Ordinals are done with . in German - "1." 
= "1st" in English 227 | 1 228 | 2 229 | 3 230 | 4 231 | 5 232 | 6 233 | 7 234 | 8 235 | 9 236 | 10 237 | 11 238 | 12 239 | 13 240 | 14 241 | 15 242 | 16 243 | 17 244 | 18 245 | 19 246 | 20 247 | 21 248 | 22 249 | 23 250 | 24 251 | 25 252 | 26 253 | 27 254 | 28 255 | 29 256 | 30 257 | 31 258 | 32 259 | 33 260 | 34 261 | 35 262 | 36 263 | 37 264 | 38 265 | 39 266 | 40 267 | 41 268 | 42 269 | 43 270 | 44 271 | 45 272 | 46 273 | 47 274 | 48 275 | 49 276 | 50 277 | 51 278 | 52 279 | 53 280 | 54 281 | 55 282 | 56 283 | 57 284 | 58 285 | 59 286 | 60 287 | 61 288 | 62 289 | 63 290 | 64 291 | 65 292 | 66 293 | 67 294 | 68 295 | 69 296 | 70 297 | 71 298 | 72 299 | 73 300 | 74 301 | 75 302 | 76 303 | 77 304 | 78 305 | 79 306 | 80 307 | 81 308 | 82 309 | 83 310 | 84 311 | 85 312 | 86 313 | 87 314 | 88 315 | 89 316 | 90 317 | 91 318 | 92 319 | 93 320 | 94 321 | 95 322 | 96 323 | 97 324 | 98 325 | 99 326 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.el: -------------------------------------------------------------------------------- 1 | # Sigle letters in upper-case are usually abbreviations of names 2 | Α 3 | Β 4 | Γ 5 | Δ 6 | Ε 7 | Ζ 8 | Η 9 | Θ 10 | Ι 11 | Κ 12 | Λ 13 | Μ 14 | Ν 15 | Ξ 16 | Ο 17 | Π 18 | Ρ 19 | Σ 20 | Τ 21 | Υ 22 | Φ 23 | Χ 24 | Ψ 25 | Ω 26 | 27 | # Includes abbreviations for the Greek language compiled from various sources (Greek grammar books, Greek language related web content). 28 | Άθαν 29 | Έγχρ 30 | Έκθ 31 | Έσδ 32 | Έφ 33 | Όμ 34 | Α΄Έσδρ 35 | Α΄Έσδ 36 | Α΄Βασ 37 | Α΄Θεσ 38 | Α΄Ιω 39 | Α΄Κορινθ 40 | Α΄Κορ 41 | Α΄Μακκ 42 | Α΄Μακ 43 | Α΄Πέτρ 44 | Α΄Πέτ 45 | Α΄Παραλ 46 | Α΄Πε 47 | Α΄Σαμ 48 | Α΄Τιμ 49 | Α΄Χρον 50 | Α΄Χρ 51 | Α.Β.Α 52 | Α.Β 53 | Α.Ε 54 | Α.Κ.Τ.Ο 55 | Αέθλ 56 | Αέτ 57 | Αίλ.Δ 58 | Αίλ.Τακτ 59 | Αίσ 60 | Αββακ 61 | Αβυδ 62 | Αβ 63 | Αγάκλ 64 | Αγάπ 65 | Αγάπ.Αμαρτ.Σ 66 | Αγάπ.Γεωπ 67 | Αγαθάγγ 68 | Αγαθήμ 69 | Αγαθιν 70 | Αγαθοκλ 71 | Αγαθρχ 72 | Αγαθ 73 | Αγαθ.Ιστ 74 | Αγαλλ 75 | Αγαπητ 76 | Αγγ 77 | Αγησ 78 | Αγλ 79 | Αγορ.Κ 80 | Αγρο.Κωδ 81 | Αγρ.Εξ 82 | Αγρ.Κ 83 | Αγ.Γρ 84 | Αδριαν 85 | Αδρ 86 | Αετ 87 | Αθάν 88 | Αθήν 89 | Αθήν.Επιγρ 90 | Αθήν.Επιτ 91 | Αθήν.Ιατρ 92 | Αθήν.Μηχ 93 | Αθανάσ 94 | Αθαν 95 | Αθηνί 96 | Αθηναγ 97 | Αθηνόδ 98 | Αθ 99 | Αθ.Αρχ 100 | Αιλ 101 | Αιλ.Επιστ 102 | Αιλ.ΖΙ 103 | Αιλ.ΠΙ 104 | Αιλ.απ 105 | Αιμιλ 106 | Αιν.Γαζ 107 | Αιν.Τακτ 108 | Αισχίν 109 | Αισχίν.Επιστ 110 | Αισχ 111 | Αισχ.Αγαμ 112 | Αισχ.Αγ 113 | Αισχ.Αλ 114 | Αισχ.Ελεγ 115 | Αισχ.Επτ.Θ 116 | Αισχ.Ευμ 117 | Αισχ.Ικέτ 118 | Αισχ.Ικ 119 | Αισχ.Περσ 120 | Αισχ.Προμ.Δεσμ 121 | Αισχ.Πρ 122 | Αισχ.Χοηφ 123 | Αισχ.Χο 124 | Αισχ.απ 125 | ΑιτΕ 126 | Αιτ 127 | Αλκ 128 | Αλχιας 129 | Αμ.Π.Ο 130 | Αμβ 131 | Αμμών 132 | Αμ. 
133 | Αν.Πειθ.Συμβ.Δικ 134 | Ανακρ 135 | Ανακ 136 | Αναμν.Τόμ 137 | Αναπλ 138 | Ανδ 139 | Ανθλγος 140 | Ανθστης 141 | Αντισθ 142 | Ανχης 143 | Αν 144 | Αποκ 145 | Απρ 146 | Απόδ 147 | Απόφ 148 | Απόφ.Νομ 149 | Απ 150 | Απ.Δαπ 151 | Απ.Διατ 152 | Απ.Επιστ 153 | Αριθ 154 | Αριστοτ 155 | Αριστοφ 156 | Αριστοφ.Όρν 157 | Αριστοφ.Αχ 158 | Αριστοφ.Βάτρ 159 | Αριστοφ.Ειρ 160 | Αριστοφ.Εκκλ 161 | Αριστοφ.Θεσμ 162 | Αριστοφ.Ιππ 163 | Αριστοφ.Λυσ 164 | Αριστοφ.Νεφ 165 | Αριστοφ.Πλ 166 | Αριστοφ.Σφ 167 | Αριστ 168 | Αριστ.Αθ.Πολ 169 | Αριστ.Αισθ 170 | Αριστ.Αν.Πρ 171 | Αριστ.Ζ.Ι 172 | Αριστ.Ηθ.Ευδ 173 | Αριστ.Ηθ.Νικ 174 | Αριστ.Κατ 175 | Αριστ.Μετ 176 | Αριστ.Πολ 177 | Αριστ.Φυσιογν 178 | Αριστ.Φυσ 179 | Αριστ.Ψυχ 180 | Αριστ.Ρητ 181 | Αρμεν 182 | Αρμ 183 | Αρχ.Εκ.Καν.Δ 184 | Αρχ.Ευβ.Μελ 185 | Αρχ.Ιδ.Δ 186 | Αρχ.Νομ 187 | Αρχ.Ν 188 | Αρχ.Π.Ε 189 | Αρ 190 | Αρ.Φορ.Μητρ 191 | Ασμ 192 | Ασμ.ασμ 193 | Αστ.Δ 194 | Αστ.Χρον 195 | Ασ 196 | Ατομ.Γνωμ 197 | Αυγ 198 | Αφρ 199 | Αχ.Νομ 200 | Α 201 | Α.Εγχ.Π 202 | Α.Κ.΄Υδρας 203 | Β΄Έσδρ 204 | Β΄Έσδ 205 | Β΄Βασ 206 | Β΄Θεσ 207 | Β΄Ιω 208 | Β΄Κορινθ 209 | Β΄Κορ 210 | Β΄Μακκ 211 | Β΄Μακ 212 | Β΄Πέτρ 213 | Β΄Πέτ 214 | Β΄Πέ 215 | Β΄Παραλ 216 | Β΄Σαμ 217 | Β΄Τιμ 218 | Β΄Χρον 219 | Β΄Χρ 220 | Β.Ι.Π.Ε 221 | Β.Κ.Τ 222 | Β.Κ.Ψ.Β 223 | Β.Μ 224 | Β.Ο.Α.Κ 225 | Β.Ο.Α 226 | Β.Ο.Δ 227 | Βίβλ 228 | Βαρ 229 | ΒεΘ 230 | Βι.Περ 231 | Βιπερ 232 | Βιργ 233 | Βλγ 234 | Βούλ 235 | Βρ 236 | Γ΄Βασ 237 | Γ΄Μακκ 238 | ΓΕΝμλ 239 | Γέν 240 | Γαλ 241 | Γεν 242 | Γλ 243 | Γν.Ν.Σ.Κρ 244 | Γνωμ 245 | Γν 246 | Γράμμ 247 | Γρηγ.Ναζ 248 | Γρηγ.Νύσ 249 | Γ Νοσ 250 | Γ' Ογκολ 251 | Γ.Ν 252 | Δ΄Βασ 253 | Δ.Β 254 | Δ.Δίκη 255 | Δ.Δίκ 256 | Δ.Ε.Σ 257 | Δ.Ε.Φ.Α 258 | Δ.Ε.Φ 259 | Δ.Εργ.Ν 260 | Δαμ 261 | Δαμ.μνημ.έργ 262 | Δαν 263 | Δασ.Κ 264 | Δεκ 265 | Δελτ.Δικ.Ε.Τ.Ε 266 | Δελτ.Νομ 267 | Δελτ.Συνδ.Α.Ε 268 | Δερμ 269 | Δευτ 270 | Δεύτ 271 | Δημοσθ 272 | Δημόκρ 273 | Δι.Δικ 274 | Διάτ 275 | Διαιτ.Απ 276 | Διαιτ 277 | Διαρκ.Στρατ 278 | Δικ 279 | Διοίκ.Πρωτ 280 | ΔιοικΔνη 281 | Διοικ.Εφ 282 | Διον.Αρ 283 | Διόρθ.Λαθ 284 | Δ.κ.Π 285 | Δνη 286 | Δν 287 | Δογμ.Όρος 288 | Δρ 289 | Δ.τ.Α 290 | Δτ 291 | ΔωδΝομ 292 | Δ.Περ 293 | Δ.Στρ 294 | ΕΔΠολ 295 | ΕΕυρΚ 296 | ΕΙΣ 297 | ΕΝαυτΔ 298 | ΕΣΑμΕΑ 299 | ΕΣΘ 300 | ΕΣυγκΔ 301 | ΕΤρΑξΧρΔ 302 | Ε.Φ.Ε.Τ 303 | Ε.Φ.Ι 304 | Ε.Φ.Ο.Επ.Α 305 | Εβδ 306 | Εβρ 307 | Εγκύκλ.Επιστ 308 | Εγκ 309 | Εε.Αιγ 310 | Εθν.Κ.Τ 311 | Εθν 312 | Ειδ.Δικ.Αγ.Κακ 313 | Εικ 314 | Ειρ.Αθ 315 | Ειρην.Αθ 316 | Ειρην 317 | Έλεγχ 318 | Ειρ 319 | Εισ.Α.Π 320 | Εισ.Ε 321 | Εισ.Ν.Α.Κ 322 | Εισ.Ν.Κ.Πολ.Δ 323 | Εισ.Πρωτ 324 | Εισηγ.Έκθ 325 | Εισ 326 | Εκκλ 327 | Εκκ 328 | Εκ 329 | Ελλ.Δνη 330 | Εν.Ε 331 | Εξ 332 | Επ.Αν 333 | Επ.Εργ.Δ 334 | Επ.Εφ 335 | Επ.Κυπ.Δ 336 | Επ.Μεσ.Αρχ 337 | Επ.Νομ 338 | Επίκτ 339 | Επίκ 340 | Επι.Δ.Ε 341 | Επιθ.Ναυτ.Δικ 342 | Επικ 343 | Επισκ.Ε.Δ 344 | Επισκ.Εμπ.Δικ 345 | Επιστ.Επετ.Αρμ 346 | Επιστ.Επετ 347 | Επιστ.Ιερ 348 | Επιτρ.Προστ.Συνδ.Στελ 349 | Επιφάν 350 | Επτ.Εφ 351 | Επ.Ιρ 352 | Επ.Ι 353 | Εργ.Ασφ.Νομ 354 | Ερμ.Α.Κ 355 | Ερμη.Σ 356 | Εσθ 357 | Εσπερ 358 | Ετρ.Δ 359 | Ευκλ 360 | Ευρ.Δ.Δ.Α 361 | Ευρ.Σ.Δ.Α 362 | Ευρ.ΣτΕ 363 | Ευρατόμ 364 | Ευρ.Άλκ 365 | Ευρ.Ανδρομ 366 | Ευρ.Βάκχ 367 | Ευρ.Εκ 368 | Ευρ.Ελ 369 | Ευρ.Ηλ 370 | Ευρ.Ηρακ 371 | Ευρ.Ηρ 372 | Ευρ.Ηρ.Μαιν 373 | Ευρ.Ικέτ 374 | Ευρ.Ιππόλ 375 | Ευρ.Ιφ.Α 376 | Ευρ.Ιφ.Τ 377 | Ευρ.Ι.Τ 378 | Ευρ.Κύκλ 379 | Ευρ.Μήδ 380 | Ευρ.Ορ 381 | Ευρ.Ρήσ 382 | Ευρ.Τρωάδ 383 | Ευρ.Φοίν 384 | Εφ.Αθ 385 | Εφ.Εν 386 | Εφ.Επ 387 | Εφ.Θρ 388 | Εφ.Θ 389 | Εφ.Ι 390 | Εφ.Κερ 391 | Εφ.Κρ 392 | Εφ.Λ 393 | Εφ.Ν 394 | Εφ.Πατ 395 | Εφ.Πειρ 396 | 
Εφαρμ.Δ.Δ 397 | Εφαρμ 398 | Εφεσ 399 | Εφημ 400 | Εφ 401 | Ζαχ 402 | Ζιγ 403 | Ζυ 404 | Ζχ 405 | ΗΕ.Δ 406 | Ημερ 407 | Ηράκλ 408 | Ηροδ 409 | Ησίοδ 410 | Ησ 411 | Η.Ε.Γ 412 | ΘΗΣ 413 | ΘΡ 414 | Θαλ 415 | Θεοδ 416 | Θεοφ 417 | Θεσ 418 | Θεόδ.Μοψ 419 | Θεόκρ 420 | Θεόφιλ 421 | Θουκ 422 | Θρ 423 | Θρ.Ε 424 | Θρ.Ιερ 425 | Θρ.Ιρ 426 | Ιακ 427 | Ιαν 428 | Ιβ 429 | Ιδθ 430 | Ιδ 431 | Ιεζ 432 | Ιερ 433 | Ιζ 434 | Ιησ 435 | Ιησ.Ν 436 | Ικ 437 | Ιλ 438 | Ιν 439 | Ιουδ 440 | Ιουστ 441 | Ιούδα 442 | Ιούλ 443 | Ιούν 444 | Ιπποκρ 445 | Ιππόλ 446 | Ιρ 447 | Ισίδ.Πηλ 448 | Ισοκρ 449 | Ισ.Ν 450 | Ιωβ 451 | Ιωλ 452 | Ιων 453 | Ιω 454 | ΚΟΣ 455 | ΚΟ.ΜΕ.ΚΟΝ 456 | ΚΠοινΔ 457 | ΚΠολΔ 458 | ΚαΒ 459 | Καλ 460 | Καλ.Τέχν 461 | ΚανΒ 462 | Καν.Διαδ 463 | Κατάργ 464 | Κλ 465 | ΚοινΔ 466 | Κολσ 467 | Κολ 468 | Κον 469 | Κορ 470 | Κος 471 | ΚριτΕπιθ 472 | ΚριτΕ 473 | Κριτ 474 | Κρ 475 | ΚτΒ 476 | ΚτΕ 477 | ΚτΠ 478 | Κυβ 479 | Κυπρ 480 | Κύριλ.Αλεξ 481 | Κύριλ.Ιερ 482 | Λεβ 483 | Λεξ.Σουίδα 484 | Λευϊτ 485 | Λευ 486 | Λκ 487 | Λογ 488 | ΛουκΑμ 489 | Λουκιαν 490 | Λουκ.Έρωτ 491 | Λουκ.Ενάλ.Διάλ 492 | Λουκ.Ερμ 493 | Λουκ.Εταιρ.Διάλ 494 | Λουκ.Ε.Δ 495 | Λουκ.Θε.Δ 496 | Λουκ.Ικ. 497 | Λουκ.Ιππ 498 | Λουκ.Λεξιφ 499 | Λουκ.Μεν 500 | Λουκ.Μισθ.Συν 501 | Λουκ.Ορχ 502 | Λουκ.Περ 503 | Λουκ.Συρ 504 | Λουκ.Τοξ 505 | Λουκ.Τυρ 506 | Λουκ.Φιλοψ 507 | Λουκ.Φιλ 508 | Λουκ.Χάρ 509 | Λουκ. 510 | Λουκ.Αλ 511 | Λοχ 512 | Λυδ 513 | Λυκ 514 | Λυσ 515 | Λωζ 516 | Λ1 517 | Λ2 518 | ΜΟΕφ 519 | Μάρκ 520 | Μέν 521 | Μαλ 522 | Ματθ 523 | Μα 524 | Μιχ 525 | Μκ 526 | Μλ 527 | Μμ 528 | Μον.Δ.Π 529 | Μον.Πρωτ 530 | Μον 531 | Μρ 532 | Μτ 533 | Μχ 534 | Μ.Βασ 535 | Μ.Πλ 536 | ΝΑ 537 | Ναυτ.Χρον 538 | Να 539 | Νδικ 540 | Νεεμ 541 | Νε 542 | Νικ 543 | ΝκΦ 544 | Νμ 545 | ΝοΒ 546 | Νομ.Δελτ.Τρ.Ελ 547 | Νομ.Δελτ 548 | Νομ.Σ.Κ 549 | Νομ.Χρ 550 | Νομ 551 | Νομ.Διεύθ 552 | Νοσ 553 | Ντ 554 | Νόσων 555 | Ν1 556 | Ν2 557 | Ν3 558 | Ν4 559 | Νtot 560 | Ξενοφ 561 | Ξεν 562 | Ξεν.Ανάβ 563 | Ξεν.Απολ 564 | Ξεν.Απομν 565 | Ξεν.Απομ 566 | Ξεν.Ελλ 567 | Ξεν.Ιέρ 568 | Ξεν.Ιππαρχ 569 | Ξεν.Ιππ 570 | Ξεν.Κυρ.Αν 571 | Ξεν.Κύρ.Παιδ 572 | Ξεν.Κ.Π 573 | Ξεν.Λακ.Πολ 574 | Ξεν.Οικ 575 | Ξεν.Προσ 576 | Ξεν.Συμπόσ 577 | Ξεν.Συμπ 578 | Ο΄ 579 | Οβδ 580 | Οβ 581 | ΟικΕ 582 | Οικ 583 | Οικ.Πατρ 584 | Οικ.Σύν.Βατ 585 | Ολομ 586 | Ολ 587 | Ολ.Α.Π 588 | Ομ.Ιλ 589 | Ομ.Οδ 590 | ΟπΤοιχ 591 | Οράτ 592 | Ορθ 593 | ΠΡΟ.ΠΟ 594 | Πίνδ 595 | Πίνδ.Ι 596 | Πίνδ.Νεμ 597 | Πίνδ.Ν 598 | Πίνδ.Ολ 599 | Πίνδ.Παθ 600 | Πίνδ.Πυθ 601 | Πίνδ.Π 602 | ΠαγΝμλγ 603 | Παν 604 | Παρμ 605 | Παροιμ 606 | Παρ 607 | Παυσ 608 | Πειθ.Συμβ 609 | ΠειρΝ 610 | Πελ 611 | ΠεντΣτρ 612 | Πεντ 613 | Πεντ.Εφ 614 | ΠερΔικ 615 | Περ.Γεν.Νοσ 616 | Πετ 617 | Πλάτ 618 | Πλάτ.Αλκ 619 | Πλάτ.Αντ 620 | Πλάτ.Αξίοχ 621 | Πλάτ.Απόλ 622 | Πλάτ.Γοργ 623 | Πλάτ.Ευθ 624 | Πλάτ.Θεαίτ 625 | Πλάτ.Κρατ 626 | Πλάτ.Κριτ 627 | Πλάτ.Λύσ 628 | Πλάτ.Μεν 629 | Πλάτ.Νόμ 630 | Πλάτ.Πολιτ 631 | Πλάτ.Πολ 632 | Πλάτ.Πρωτ 633 | Πλάτ.Σοφ. 
634 | Πλάτ.Συμπ 635 | Πλάτ.Τίμ 636 | Πλάτ.Φαίδρ 637 | Πλάτ.Φιλ 638 | Πλημ 639 | Πλούτ 640 | Πλούτ.Άρατ 641 | Πλούτ.Αιμ 642 | Πλούτ.Αλέξ 643 | Πλούτ.Αλκ 644 | Πλούτ.Αντ 645 | Πλούτ.Αρτ 646 | Πλούτ.Ηθ 647 | Πλούτ.Θεμ 648 | Πλούτ.Κάμ 649 | Πλούτ.Καίσ 650 | Πλούτ.Κικ 651 | Πλούτ.Κράσ 652 | Πλούτ.Κ 653 | Πλούτ.Λυκ 654 | Πλούτ.Μάρκ 655 | Πλούτ.Μάρ 656 | Πλούτ.Περ 657 | Πλούτ.Ρωμ 658 | Πλούτ.Σύλλ 659 | Πλούτ.Φλαμ 660 | Πλ 661 | Ποιν.Δικ 662 | Ποιν.Δ 663 | Ποιν.Ν 664 | Ποιν.Χρον 665 | Ποιν.Χρ 666 | Πολ.Δ 667 | Πολ.Πρωτ 668 | Πολ 669 | Πολ.Μηχ 670 | Πολ.Μ 671 | Πρακτ.Αναθ 672 | Πρακτ.Ολ 673 | Πραξ 674 | Πρμ 675 | Πρξ 676 | Πρωτ 677 | Πρ 678 | Πρ.Αν 679 | Πρ.Λογ 680 | Πταισμ 681 | Πυρ.Καλ 682 | Πόλη 683 | Π.Δ 684 | Π.Δ.Άσμ 685 | ΡΜ.Ε 686 | Ρθ 687 | Ρμ 688 | Ρωμ 689 | ΣΠλημ 690 | Σαπφ 691 | Σειρ 692 | Σολ 693 | Σοφ 694 | Σοφ.Αντιγ 695 | Σοφ.Αντ 696 | Σοφ.Αποσ 697 | Σοφ.Απ 698 | Σοφ.Ηλέκ 699 | Σοφ.Ηλ 700 | Σοφ.Οιδ.Κολ 701 | Σοφ.Οιδ.Τύρ 702 | Σοφ.Ο.Τ 703 | Σοφ.Σειρ 704 | Σοφ.Σολ 705 | Σοφ.Τραχ 706 | Σοφ.Φιλοκτ 707 | Σρ 708 | Σ.τ.Ε 709 | Σ.τ.Π 710 | Στρ.Π.Κ 711 | Στ.Ευρ 712 | Συζήτ 713 | Συλλ.Νομολ 714 | Συλ.Νομ 715 | ΣυμβΕπιθ 716 | Συμπ.Ν 717 | Συνθ.Αμ 718 | Συνθ.Ε.Ε 719 | Συνθ.Ε.Κ 720 | Συνθ.Ν 721 | Σφν 722 | Σφ 723 | Σφ.Σλ 724 | Σχ.Πολ.Δ 725 | Σχ.Συντ.Ε 726 | Σωσ 727 | Σύντ 728 | Σ.Πληρ 729 | ΤΘ 730 | ΤΣ.Δ 731 | Τίτ 732 | Τβ 733 | Τελ.Ενημ 734 | Τελ.Κ 735 | Τερτυλ 736 | Τιμ 737 | Τοπ.Α 738 | Τρ.Ο 739 | Τριμ 740 | Τριμ.Πλ 741 | Τρ.Πλημ 742 | Τρ.Π.Δ 743 | Τ.τ.Ε 744 | Ττ 745 | Τωβ 746 | Υγ 747 | Υπερ 748 | Υπ 749 | Υ.Γ 750 | Φιλήμ 751 | Φιλιπ 752 | Φιλ 753 | Φλμ 754 | Φλ 755 | Φορ.Β 756 | Φορ.Δ.Ε 757 | Φορ.Δνη 758 | Φορ.Δ 759 | Φορ.Επ 760 | Φώτ 761 | Χρ.Ι.Δ 762 | Χρ.Ιδ.Δ 763 | Χρ.Ο 764 | Χρυσ 765 | Ψήφ 766 | Ψαλμ 767 | Ψαλ 768 | Ψλ 769 | Ωριγ 770 | Ωσ 771 | Ω.Ρ.Λ 772 | άγν 773 | άγν.ετυμολ 774 | άγ 775 | άκλ 776 | άνθρ 777 | άπ 778 | άρθρ 779 | άρν 780 | άρ 781 | άτ 782 | άψ 783 | ά 784 | έκδ 785 | έκφρ 786 | έμψ 787 | ένθ.αν 788 | έτ 789 | έ.α 790 | ίδ 791 | αβεστ 792 | αβησσ 793 | αγγλ 794 | αγγ 795 | αδημ 796 | αεροναυτ 797 | αερον 798 | αεροπ 799 | αθλητ 800 | αθλ 801 | αθροιστ 802 | αιγυπτ 803 | αιγ 804 | αιτιολ 805 | αιτ 806 | αι 807 | ακαδ 808 | ακκαδ 809 | αλβ 810 | αλλ 811 | αλφαβητ 812 | αμα 813 | αμερικ 814 | αμερ 815 | αμετάβ 816 | αμτβ 817 | αμφιβ 818 | αμφισβ 819 | αμφ 820 | αμ 821 | ανάλ 822 | ανάπτ 823 | ανάτ 824 | αναβ 825 | αναδαν 826 | αναδιπλασ 827 | αναδιπλ 828 | αναδρ 829 | αναλ 830 | αναν 831 | ανασυλλ 832 | ανατολ 833 | ανατομ 834 | ανατυπ 835 | ανατ 836 | αναφορ 837 | αναφ 838 | ανα.ε 839 | ανδρων 840 | ανθρωπολ 841 | ανθρωπ 842 | ανθ 843 | ανομ 844 | αντίτ 845 | αντδ 846 | αντιγρ 847 | αντιθ 848 | αντικ 849 | αντιμετάθ 850 | αντων 851 | αντ 852 | ανωτ 853 | ανόργ 854 | ανών 855 | αορ 856 | απαρέμφ 857 | απαρφ 858 | απαρχ 859 | απαρ 860 | απλολ 861 | απλοπ 862 | αποβ 863 | αποηχηροπ 864 | αποθ 865 | αποκρυφ 866 | αποφ 867 | απρμφ 868 | απρφ 869 | απρόσ 870 | απόδ 871 | απόλ 872 | απόσπ 873 | απόφ 874 | αραβοτουρκ 875 | αραβ 876 | αραμ 877 | αρβαν 878 | αργκ 879 | αριθμτ 880 | αριθμ 881 | αριθ 882 | αρκτικόλ 883 | αρκ 884 | αρμεν 885 | αρμ 886 | αρνητ 887 | αρσ 888 | αρχαιολ 889 | αρχιτεκτ 890 | αρχιτ 891 | αρχκ 892 | αρχ 893 | αρωμουν 894 | αρωμ 895 | αρ 896 | αρ.μετρ 897 | αρ.φ 898 | ασσυρ 899 | αστρολ 900 | αστροναυτ 901 | αστρον 902 | αττ 903 | αυστραλ 904 | αυτοπ 905 | αυτ 906 | αφγαν 907 | αφηρ 908 | αφομ 909 | αφρικ 910 | αχώρ 911 | αόρ 912 | α.α 913 | α/α 914 | α0 915 | βαθμ 916 | βαθ 917 | βαπτ 918 | βασκ 919 | βεβαιωτ 920 | βεβ 921 | βεδ 922 | βενετ 923 | βεν 924 | 
βερβερ 925 | βιβλγρ 926 | βιολ 927 | βιομ 928 | βιοχημ 929 | βιοχ 930 | βλάχ 931 | βλ 932 | βλ.λ 933 | βοταν 934 | βοτ 935 | βουλγαρ 936 | βουλγ 937 | βούλ 938 | βραζιλ 939 | βρετον 940 | βόρ 941 | γαλλ 942 | γενικότ 943 | γενοβ 944 | γεν 945 | γερμαν 946 | γερμ 947 | γεωγρ 948 | γεωλ 949 | γεωμετρ 950 | γεωμ 951 | γεωπ 952 | γεωργ 953 | γλυπτ 954 | γλωσσολ 955 | γλωσσ 956 | γλ 957 | γνμδ 958 | γνμ 959 | γνωμ 960 | γοτθ 961 | γραμμ 962 | γραμ 963 | γρμ 964 | γρ 965 | γυμν 966 | δίδες 967 | δίκ 968 | δίφθ 969 | δαν 970 | δεικτ 971 | δεκατ 972 | δηλ 973 | δημογρ 974 | δημοτ 975 | δημώδ 976 | δημ 977 | διάγρ 978 | διάκρ 979 | διάλεξ 980 | διάλ 981 | διάσπ 982 | διαλεκτ 983 | διατρ 984 | διαφ 985 | διαχ 986 | διδα 987 | διεθν 988 | διεθ 989 | δικον 990 | διστ 991 | δισύλλ 992 | δισ 993 | διφθογγοπ 994 | δογμ 995 | δολ 996 | δοτ 997 | δρμ 998 | δρχ 999 | δρ(α) 1000 | δωρ 1001 | δ 1002 | εβρ 1003 | εγκλπ 1004 | εδ 1005 | εθνολ 1006 | εθν 1007 | ειδικότ 1008 | ειδ 1009 | ειδ.β 1010 | εικ 1011 | ειρ 1012 | εισ 1013 | εκατοστμ 1014 | εκατοστ 1015 | εκατστ.2 1016 | εκατστ.3 1017 | εκατ 1018 | εκδ 1019 | εκκλησ 1020 | εκκλ 1021 | εκ 1022 | ελλην 1023 | ελλ 1024 | ελνστ 1025 | ελπ 1026 | εμβ 1027 | εμφ 1028 | εναλλ 1029 | ενδ 1030 | ενεργ 1031 | ενεστ 1032 | ενικ 1033 | ενν 1034 | εν 1035 | εξέλ 1036 | εξακολ 1037 | εξομάλ 1038 | εξ 1039 | εο 1040 | επέκτ 1041 | επίδρ 1042 | επίθ 1043 | επίρρ 1044 | επίσ 1045 | επαγγελμ 1046 | επανάλ 1047 | επανέκδ 1048 | επιθ 1049 | επικ 1050 | επιμ 1051 | επιρρ 1052 | επιστ 1053 | επιτατ 1054 | επιφ 1055 | επών 1056 | επ 1057 | εργ 1058 | ερμ 1059 | ερρινοπ 1060 | ερωτ 1061 | ετρουσκ 1062 | ετυμ 1063 | ετ 1064 | ευφ 1065 | ευχετ 1066 | εφ 1067 | εύχρ 1068 | ε.α 1069 | ε/υ 1070 | ε0 1071 | ζωγρ 1072 | ζωολ 1073 | ηθικ 1074 | ηθ 1075 | ηλεκτρολ 1076 | ηλεκτρον 1077 | ηλεκτρ 1078 | ημίτ 1079 | ημίφ 1080 | ημιφ 1081 | ηχηροπ 1082 | ηχηρ 1083 | ηχομιμ 1084 | ηχ 1085 | η 1086 | θέατρ 1087 | θεολ 1088 | θετ 1089 | θηλ 1090 | θρακ 1091 | θρησκειολ 1092 | θρησκ 1093 | θ 1094 | ιαπων 1095 | ιατρ 1096 | ιδιωμ 1097 | ιδ 1098 | ινδ 1099 | ιραν 1100 | ισπαν 1101 | ιστορ 1102 | ιστ 1103 | ισχυροπ 1104 | ιταλ 1105 | ιχθυολ 1106 | ιων 1107 | κάτ 1108 | καθ 1109 | κακοσ 1110 | καν 1111 | καρ 1112 | κατάλ 1113 | κατατ 1114 | κατωτ 1115 | κατ 1116 | κα 1117 | κελτ 1118 | κεφ 1119 | κινεζ 1120 | κινημ 1121 | κλητ 1122 | κλιτ 1123 | κλπ 1124 | κλ 1125 | κν 1126 | κοινωνιολ 1127 | κοινων 1128 | κοπτ 1129 | κουτσοβλαχ 1130 | κουτσοβλ 1131 | κπ 1132 | κρ.γν 1133 | κτγ 1134 | κτην 1135 | κτητ 1136 | κτλ 1137 | κτ 1138 | κυριολ 1139 | κυρ 1140 | κύρ 1141 | κ 1142 | κ.ά 1143 | κ.ά.π 1144 | κ.α 1145 | κ.εξ 1146 | κ.επ 1147 | κ.ε 1148 | κ.λπ 1149 | κ.λ.π 1150 | κ.ού.κ 1151 | κ.ο.κ 1152 | κ.τ.λ 1153 | κ.τ.τ 1154 | κ.τ.ό 1155 | λέξ 1156 | λαογρ 1157 | λαπ 1158 | λατιν 1159 | λατ 1160 | λαϊκότρ 1161 | λαϊκ 1162 | λετ 1163 | λιθ 1164 | λογιστ 1165 | λογοτ 1166 | λογ 1167 | λουβ 1168 | λυδ 1169 | λόγ 1170 | λ 1171 | λ.χ 1172 | μέλλ 1173 | μέσ 1174 | μαθημ 1175 | μαθ 1176 | μαιευτ 1177 | μαλαισ 1178 | μαλτ 1179 | μαμμων 1180 | μεγεθ 1181 | μεε 1182 | μειωτ 1183 | μελ 1184 | μεξ 1185 | μεσν 1186 | μεσογ 1187 | μεσοπαθ 1188 | μεσοφ 1189 | μετάθ 1190 | μεταβτ 1191 | μεταβ 1192 | μετακ 1193 | μεταπλ 1194 | μεταπτωτ 1195 | μεταρ 1196 | μεταφορ 1197 | μετβ 1198 | μετεπιθ 1199 | μετεπιρρ 1200 | μετεωρολ 1201 | μετεωρ 1202 | μετον 1203 | μετουσ 1204 | μετοχ 1205 | μετρ 1206 | μετ 1207 | μητρων 1208 | μηχανολ 1209 | μηχ 1210 | μικροβιολ 1211 | μογγολ 1212 | μορφολ 1213 | μουσ 1214 | μπενελούξ 1215 | μσνλατ 
1216 | μσν 1217 | μτβ 1218 | μτγν 1219 | μτγ 1220 | μτφρδ 1221 | μτφρ 1222 | μτφ 1223 | μτχ 1224 | μυθ 1225 | μυκην 1226 | μυκ 1227 | μφ 1228 | μ 1229 | μ.ε 1230 | μ.μ 1231 | μ.π.ε 1232 | μ.π.π 1233 | μ0 1234 | ναυτ 1235 | νεοελλ 1236 | νεολατιν 1237 | νεολατ 1238 | νεολ 1239 | νεότ 1240 | νλατ 1241 | νομ 1242 | νορβ 1243 | νοσ 1244 | νότ 1245 | ν 1246 | ξ.λ 1247 | οικοδ 1248 | οικολ 1249 | οικον 1250 | οικ 1251 | ολλανδ 1252 | ολλ 1253 | ομηρ 1254 | ομόρρ 1255 | ονομ 1256 | ον 1257 | οπτ 1258 | ορθογρ 1259 | ορθ 1260 | οριστ 1261 | ορυκτολ 1262 | ορυκτ 1263 | ορ 1264 | οσετ 1265 | οσκ 1266 | ουαλ 1267 | ουγγρ 1268 | ουδ 1269 | ουσιαστικοπ 1270 | ουσιαστ 1271 | ουσ 1272 | πίν 1273 | παθητ 1274 | παθολ 1275 | παθ 1276 | παιδ 1277 | παλαιοντ 1278 | παλαιότ 1279 | παλ 1280 | παππων 1281 | παράγρ 1282 | παράγ 1283 | παράλλ 1284 | παράλ 1285 | παραγ 1286 | παρακ 1287 | παραλ 1288 | παραπ 1289 | παρατ 1290 | παρβ 1291 | παρετυμ 1292 | παροξ 1293 | παρων 1294 | παρωχ 1295 | παρ 1296 | παρ.φρ 1297 | πατριδων 1298 | πατρων 1299 | πβ 1300 | περιθ 1301 | περιλ 1302 | περιφρ 1303 | περσ 1304 | περ 1305 | πιθ 1306 | πληθ 1307 | πληροφ 1308 | ποδ 1309 | ποιητ 1310 | πολιτ 1311 | πολλαπλ 1312 | πολ 1313 | πορτογαλ 1314 | πορτ 1315 | ποσ 1316 | πρακριτ 1317 | πρβλ 1318 | πρβ 1319 | πργ 1320 | πρκμ 1321 | πρκ 1322 | πρλ 1323 | προέλ 1324 | προβηγκ 1325 | προελλ 1326 | προηγ 1327 | προθεμ 1328 | προπαραλ 1329 | προπαροξ 1330 | προπερισπ 1331 | προσαρμ 1332 | προσηγορ 1333 | προσταχτ 1334 | προστ 1335 | προσφών 1336 | προσ 1337 | προτακτ 1338 | προτ.Εισ 1339 | προφ 1340 | προχωρ 1341 | πρτ 1342 | πρόθ 1343 | πρόσθ 1344 | πρόσ 1345 | πρότ 1346 | πρ 1347 | πρ.Εφ 1348 | πτ 1349 | πυ 1350 | π 1351 | π.Χ 1352 | π.μ 1353 | π.χ 1354 | ρήμ 1355 | ρίζ 1356 | ρηματ 1357 | ρητορ 1358 | ριν 1359 | ρουμ 1360 | ρωμ 1361 | ρωσ 1362 | ρ 1363 | σανσκρ 1364 | σαξ 1365 | σελ 1366 | σερβοκρ 1367 | σερβ 1368 | σημασιολ 1369 | σημδ 1370 | σημειολ 1371 | σημερ 1372 | σημιτ 1373 | σημ 1374 | σκανδ 1375 | σκυθ 1376 | σκωπτ 1377 | σλαβ 1378 | σλοβ 1379 | σουηδ 1380 | σουμερ 1381 | σουπ 1382 | σπάν 1383 | σπανιότ 1384 | σπ 1385 | σσ 1386 | στατ 1387 | στερ 1388 | στιγμ 1389 | στιχ 1390 | στρέμ 1391 | στρατιωτ 1392 | στρατ 1393 | στ 1394 | συγγ 1395 | συγκρ 1396 | συγκ 1397 | συμπερ 1398 | συμπλεκτ 1399 | συμπλ 1400 | συμπροφ 1401 | συμφυρ 1402 | συμφ 1403 | συνήθ 1404 | συνίζ 1405 | συναίρ 1406 | συναισθ 1407 | συνδετ 1408 | συνδ 1409 | συνεκδ 1410 | συνηρ 1411 | συνθετ 1412 | συνθ 1413 | συνοπτ 1414 | συντελ 1415 | συντομογρ 1416 | συντ 1417 | συν 1418 | συρ 1419 | σχημ 1420 | σχ 1421 | σύγκρ 1422 | σύμπλ 1423 | σύμφ 1424 | σύνδ 1425 | σύνθ 1426 | σύντμ 1427 | σύντ 1428 | σ 1429 | σ.π 1430 | σ/β 1431 | τακτ 1432 | τελ 1433 | τετρ 1434 | τετρ.μ 1435 | τεχνλ 1436 | τεχνολ 1437 | τεχν 1438 | τεύχ 1439 | τηλεπικ 1440 | τηλεόρ 1441 | τιμ 1442 | τιμ.τομ 1443 | τοΣ 1444 | τον 1445 | τοπογρ 1446 | τοπων 1447 | τοπ 1448 | τοσκ 1449 | τουρκ 1450 | τοχ 1451 | τριτοπρόσ 1452 | τροποπ 1453 | τροπ 1454 | τσεχ 1455 | τσιγγ 1456 | ττ 1457 | τυπ 1458 | τόμ 1459 | τόνν 1460 | τ 1461 | τ.μ 1462 | τ.χλμ 1463 | υβρ 1464 | υπερθ 1465 | υπερσ 1466 | υπερ 1467 | υπεύθ 1468 | υποθ 1469 | υποκορ 1470 | υποκ 1471 | υποσημ 1472 | υποτ 1473 | υποφ 1474 | υποχωρ 1475 | υπόλ 1476 | υπόχρ 1477 | υπ 1478 | υστλατ 1479 | υψόμ 1480 | υψ 1481 | φάκ 1482 | φαρμακολ 1483 | φαρμ 1484 | φιλολ 1485 | φιλοσ 1486 | φιλοτ 1487 | φινλ 1488 | φοινικ 1489 | φράγκ 1490 | φρανκον 1491 | φριζ 1492 | φρ 1493 | φυλλ 1494 | φυσιολ 1495 | φυσ 1496 | φωνηεντ 1497 | φωνητ 1498 | φωνολ 
1499 | φων 1500 | φωτογρ 1501 | φ 1502 | φ.τ.μ 1503 | χαμιτ 1504 | χαρτόσ 1505 | χαρτ 1506 | χασμ 1507 | χαϊδ 1508 | χγφ 1509 | χειλ 1510 | χεττ 1511 | χημ 1512 | χιλ 1513 | χλγρ 1514 | χλγ 1515 | χλμ 1516 | χλμ.2 1517 | χλμ.3 1518 | χλσγρ 1519 | χλστγρ 1520 | χλστμ 1521 | χλστμ.2 1522 | χλστμ.3 1523 | χλ 1524 | χργρ 1525 | χρημ 1526 | χρον 1527 | χρ 1528 | χφ 1529 | χ.ε 1530 | χ.κ 1531 | χ.ο 1532 | χ.σ 1533 | χ.τ 1534 | χ.χ 1535 | ψευδ 1536 | ψυχαν 1537 | ψυχιατρ 1538 | ψυχολ 1539 | ψυχ 1540 | ωκεαν 1541 | όμ 1542 | όν 1543 | όπ.παρ 1544 | όπ.π 1545 | ό.π 1546 | ύψ 1547 | 1Βσ 1548 | 1Εσ 1549 | 1Θσ 1550 | 1Ιν 1551 | 1Κρ 1552 | 1Μκ 1553 | 1Πρ 1554 | 1Πτ 1555 | 1Τμ 1556 | 2Βσ 1557 | 2Εσ 1558 | 2Θσ 1559 | 2Ιν 1560 | 2Κρ 1561 | 2Μκ 1562 | 2Πρ 1563 | 2Πτ 1564 | 2Τμ 1565 | 3Βσ 1566 | 3Ιν 1567 | 3Μκ 1568 | 4Βσ 1569 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.en: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 34 | Adj 35 | Adm 36 | Adv 37 | Asst 38 | Bart 39 | Bldg 40 | Brig 41 | Bros 42 | Capt 43 | Cmdr 44 | Col 45 | Comdr 46 | Con 47 | Corp 48 | Cpl 49 | DR 50 | Dr 51 | Drs 52 | Ens 53 | Gen 54 | Gov 55 | Hon 56 | Hr 57 | Hosp 58 | Insp 59 | Lt 60 | MM 61 | MR 62 | MRS 63 | MS 64 | Maj 65 | Messrs 66 | Mlle 67 | Mme 68 | Mr 69 | Mrs 70 | Ms 71 | Msgr 72 | Op 73 | Ord 74 | Pfc 75 | Ph 76 | Prof 77 | Pvt 78 | Rep 79 | Reps 80 | Res 81 | Rev 82 | Rt 83 | Sen 84 | Sens 85 | Sfc 86 | Sgt 87 | Sr 88 | St 89 | Supt 90 | Surg 91 | 92 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 93 | v 94 | vs 95 | i.e 96 | rev 97 | e.g 98 | 99 | #Numbers only. These should only induce breaks when followed by a numeric sequence 100 | # add NUMERIC_ONLY after the word for this function 101 | #This case is mostly for the english "No." which can either be a sentence of its own, or 102 | #if followed by a number, a non-breaking prefix 103 | No #NUMERIC_ONLY# 104 | Nos 105 | Art #NUMERIC_ONLY# 106 | Nr 107 | pp #NUMERIC_ONLY# 108 | 109 | #month abbreviations 110 | Jan 111 | Feb 112 | Mar 113 | Apr 114 | #May is a full word 115 | Jun 116 | Jul 117 | Aug 118 | Sep 119 | Oct 120 | Nov 121 | Dec 122 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.es: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 
3 | 4 | #any single upper case letter followed by a period is not a sentence ender 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | # Period-final abbreviation list from http://www.ctspanish.com/words/abbreviations.htm 34 | 35 | A.C 36 | Apdo 37 | Av 38 | Bco 39 | CC.AA 40 | Da 41 | Dep 42 | Dn 43 | Dr 44 | Dra 45 | EE.UU 46 | Excmo 47 | FF.CC 48 | Fil 49 | Gral 50 | J.C 51 | Let 52 | Lic 53 | N.B 54 | P.D 55 | P.V.P 56 | Prof 57 | Pts 58 | Rte 59 | S.A 60 | S.A.R 61 | S.E 62 | S.L 63 | S.R.C 64 | Sr 65 | Sra 66 | Srta 67 | Sta 68 | Sto 69 | T.V.E 70 | Tel 71 | Ud 72 | Uds 73 | V.B 74 | V.E 75 | Vd 76 | Vds 77 | a/c 78 | adj 79 | admón 80 | afmo 81 | apdo 82 | av 83 | c 84 | c.f 85 | c.g 86 | cap 87 | cm 88 | cta 89 | dcha 90 | doc 91 | ej 92 | entlo 93 | esq 94 | etc 95 | f.c 96 | gr 97 | grs 98 | izq 99 | kg 100 | km 101 | mg 102 | mm 103 | núm 104 | núm 105 | p 106 | p.a 107 | p.ej 108 | ptas 109 | pág 110 | págs 111 | pág 112 | págs 113 | q.e.g.e 114 | q.e.s.m 115 | s 116 | s.s.s 117 | vid 118 | vol 119 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.fi: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT 2 | #indicate an end-of-sentence marker. Special cases are included for prefixes 3 | #that ONLY appear before 0-9 numbers. 4 | 5 | #This list is compiled from omorfi database 6 | #by Tommi A Pirinen. 7 | 8 | 9 | #any single upper case letter followed by a period is not a sentence ender 10 | A 11 | B 12 | C 13 | D 14 | E 15 | F 16 | G 17 | H 18 | I 19 | J 20 | K 21 | L 22 | M 23 | N 24 | O 25 | P 26 | Q 27 | R 28 | S 29 | T 30 | U 31 | V 32 | W 33 | X 34 | Y 35 | Z 36 | Å 37 | Ä 38 | Ö 39 | 40 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 41 | alik 42 | alil 43 | amir 44 | apul 45 | apul.prof 46 | arkkit 47 | ass 48 | assist 49 | dipl 50 | dipl.arkkit 51 | dipl.ekon 52 | dipl.ins 53 | dipl.kielenk 54 | dipl.kirjeenv 55 | dipl.kosm 56 | dipl.urk 57 | dos 58 | erikoiseläinl 59 | erikoishammasl 60 | erikoisl 61 | erikoist 62 | ev.luutn 63 | evp 64 | fil 65 | ft 66 | hallinton 67 | hallintot 68 | hammaslääket 69 | jatk 70 | jääk 71 | kansaned 72 | kapt 73 | kapt.luutn 74 | kenr 75 | kenr.luutn 76 | kenr.maj 77 | kers 78 | kirjeenv 79 | kom 80 | kom.kapt 81 | komm 82 | konst 83 | korpr 84 | luutn 85 | maist 86 | maj 87 | Mr 88 | Mrs 89 | Ms 90 | M.Sc 91 | neuv 92 | nimim 93 | Ph.D 94 | prof 95 | puh.joht 96 | pääll 97 | res 98 | san 99 | siht 100 | suom 101 | sähköp 102 | säv 103 | toht 104 | toim 105 | toim.apul 106 | toim.joht 107 | toim.siht 108 | tuom 109 | ups 110 | vänr 111 | vääp 112 | ye.ups 113 | ylik 114 | ylil 115 | ylim 116 | ylimatr 117 | yliop 118 | yliopp 119 | ylip 120 | yliv 121 | 122 | #misc - odd period-ending items that NEVER indicate breaks (p.m. 
does NOT fall 123 | #into this category - it sometimes ends a sentence) 124 | e.g 125 | ent 126 | esim 127 | huom 128 | i.e 129 | ilm 130 | l 131 | mm 132 | myöh 133 | nk 134 | nyk 135 | par 136 | po 137 | t 138 | v 139 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.fr: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | # 4 | #any single upper case letter followed by a period is not a sentence ender 5 | #usually upper case letters are initials in a name 6 | #no French words end in single lower-case letters, so we throw those in too? 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | #a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | # Period-final abbreviation list for French 61 | A.C.N 62 | A.M 63 | art 64 | ann 65 | apr 66 | av 67 | auj 68 | lib 69 | B.P 70 | boul 71 | ca 72 | c.-à-d 73 | cf 74 | ch.-l 75 | chap 76 | contr 77 | C.P.I 78 | C.Q.F.D 79 | C.N 80 | C.N.S 81 | C.S 82 | dir 83 | éd 84 | e.g 85 | env 86 | al 87 | etc 88 | E.V 89 | ex 90 | fasc 91 | fém 92 | fig 93 | fr 94 | hab 95 | ibid 96 | id 97 | i.e 98 | inf 99 | LL.AA 100 | LL.AA.II 101 | LL.AA.RR 102 | LL.AA.SS 103 | L.D 104 | LL.EE 105 | LL.MM 106 | LL.MM.II.RR 107 | loc.cit 108 | masc 109 | MM 110 | ms 111 | N.B 112 | N.D.A 113 | N.D.L.R 114 | N.D.T 115 | n/réf 116 | NN.SS 117 | N.S 118 | N.D 119 | N.P.A.I 120 | p.c.c 121 | pl 122 | pp 123 | p.ex 124 | p.j 125 | P.S 126 | R.A.S 127 | R.-V 128 | R.P 129 | R.I.P 130 | SS 131 | S.S 132 | S.A 133 | S.A.I 134 | S.A.R 135 | S.A.S 136 | S.E 137 | sec 138 | sect 139 | sing 140 | S.M 141 | S.M.I.R 142 | sq 143 | sqq 144 | suiv 145 | sup 146 | suppl 147 | tél 148 | T.S.V.P 149 | vb 150 | vol 151 | vs 152 | X.O 153 | Z.I 154 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.ga: -------------------------------------------------------------------------------- 1 | 2 | A 3 | B 4 | C 5 | D 6 | E 7 | F 8 | G 9 | H 10 | I 11 | J 12 | K 13 | L 14 | M 15 | N 16 | O 17 | P 18 | Q 19 | R 20 | S 21 | T 22 | U 23 | V 24 | W 25 | X 26 | Y 27 | Z 28 | Á 29 | É 30 | Í 31 | Ó 32 | Ú 33 | 34 | Uacht 35 | Dr 36 | B.Arch 37 | 38 | m.sh 39 | .i 40 | Co 41 | Cf 42 | cf 43 | i.e 44 | r 45 | Chr 46 | lch #NUMERIC_ONLY# 47 | lgh #NUMERIC_ONLY# 48 | uimh #NUMERIC_ONLY# 49 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.hu: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 
3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | Á 33 | É 34 | Í 35 | Ó 36 | Ö 37 | Ő 38 | Ú 39 | Ü 40 | Ű 41 | 42 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 43 | Dr 44 | dr 45 | kb 46 | Kb 47 | vö 48 | Vö 49 | pl 50 | Pl 51 | ca 52 | Ca 53 | min 54 | Min 55 | max 56 | Max 57 | ún 58 | Ún 59 | prof 60 | Prof 61 | de 62 | De 63 | du 64 | Du 65 | Szt 66 | St 67 | 68 | #Numbers only. These should only induce breaks when followed by a numeric sequence 69 | # add NUMERIC_ONLY after the word for this function 70 | #This case is mostly for the english "No." which can either be a sentence of its own, or 71 | #if followed by a number, a non-breaking prefix 72 | 73 | # Month name abbreviations 74 | jan #NUMERIC_ONLY# 75 | Jan #NUMERIC_ONLY# 76 | Feb #NUMERIC_ONLY# 77 | feb #NUMERIC_ONLY# 78 | márc #NUMERIC_ONLY# 79 | Márc #NUMERIC_ONLY# 80 | ápr #NUMERIC_ONLY# 81 | Ápr #NUMERIC_ONLY# 82 | máj #NUMERIC_ONLY# 83 | Máj #NUMERIC_ONLY# 84 | jún #NUMERIC_ONLY# 85 | Jún #NUMERIC_ONLY# 86 | Júl #NUMERIC_ONLY# 87 | júl #NUMERIC_ONLY# 88 | aug #NUMERIC_ONLY# 89 | Aug #NUMERIC_ONLY# 90 | Szept #NUMERIC_ONLY# 91 | szept #NUMERIC_ONLY# 92 | okt #NUMERIC_ONLY# 93 | Okt #NUMERIC_ONLY# 94 | nov #NUMERIC_ONLY# 95 | Nov #NUMERIC_ONLY# 96 | dec #NUMERIC_ONLY# 97 | Dec #NUMERIC_ONLY# 98 | 99 | # Other abbreviations 100 | tel #NUMERIC_ONLY# 101 | Tel #NUMERIC_ONLY# 102 | Fax #NUMERIC_ONLY# 103 | fax #NUMERIC_ONLY# 104 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.is: -------------------------------------------------------------------------------- 1 | no #NUMERIC_ONLY# 2 | No #NUMERIC_ONLY# 3 | nr #NUMERIC_ONLY# 4 | Nr #NUMERIC_ONLY# 5 | nR #NUMERIC_ONLY# 6 | NR #NUMERIC_ONLY# 7 | a 8 | b 9 | c 10 | d 11 | e 12 | f 13 | g 14 | h 15 | i 16 | j 17 | k 18 | l 19 | m 20 | n 21 | o 22 | p 23 | q 24 | r 25 | s 26 | t 27 | u 28 | v 29 | w 30 | x 31 | y 32 | z 33 | ^ 34 | í 35 | á 36 | ó 37 | æ 38 | A 39 | B 40 | C 41 | D 42 | E 43 | F 44 | G 45 | H 46 | I 47 | J 48 | K 49 | L 50 | M 51 | N 52 | O 53 | P 54 | Q 55 | R 56 | S 57 | T 58 | U 59 | V 60 | W 61 | X 62 | Y 63 | Z 64 | ab.fn 65 | a.fn 66 | afs 67 | al 68 | alm 69 | alg 70 | andh 71 | ath 72 | aths 73 | atr 74 | ao 75 | au 76 | aukaf 77 | áfn 78 | áhrl.s 79 | áhrs 80 | ákv.gr 81 | ákv 82 | bh 83 | bls 84 | dr 85 | e.Kr 86 | et 87 | ef 88 | efn 89 | ennfr 90 | eink 91 | end 92 | e.st 93 | erl 94 | fél 95 | fskj 96 | fh 97 | f.hl 98 | físl 99 | fl 100 | fn 101 | fo 102 | forl 103 | frb 104 | frl 105 | frh 106 | frt 107 | fsl 108 | fsh 109 | fs 110 | fsk 111 | fst 112 | f.Kr 113 | ft 114 | fv 115 | fyrrn 116 | fyrrv 117 | germ 118 | gm 119 | gr 120 | hdl 121 | hdr 122 | hf 123 | hl 124 | hlsk 125 | hljsk 126 | hljv 127 | hljóðv 128 | hr 129 | hv 130 | hvk 131 | holl 132 | Hos 133 | höf 134 | hk 135 | hrl 136 | ísl 137 | kaf 138 | kap 139 | Khöfn 140 | kk 141 | kg 142 | kk 143 | km 144 | kl 145 | klst 146 | kr 147 | kt 148 | kgúrsk 149 | kvk 150 | leturbr 151 | lh 152 | lh.nt 153 | lh.þt 154 | lo 155 | ltr 156 | mlja 157 | mljó 158 | millj 159 | mm 160 | mms 161 | m.fl 162 | miðm 163 | mgr 164 | mst 165 | mín 166 | nf 167 
| nh 168 | nhm 169 | nl 170 | nk 171 | nmgr 172 | no 173 | núv 174 | nt 175 | o.áfr 176 | o.m.fl 177 | ohf 178 | o.fl 179 | o.s.frv 180 | ófn 181 | ób 182 | óákv.gr 183 | óákv 184 | pfn 185 | PR 186 | pr 187 | Ritstj 188 | Rvík 189 | Rvk 190 | samb 191 | samhlj 192 | samn 193 | samn 194 | sbr 195 | sek 196 | sérn 197 | sf 198 | sfn 199 | sh 200 | sfn 201 | sh 202 | s.hl 203 | sk 204 | skv 205 | sl 206 | sn 207 | so 208 | ss.us 209 | s.st 210 | samþ 211 | sbr 212 | shlj 213 | sign 214 | skál 215 | st 216 | st.s 217 | stk 218 | sþ 219 | teg 220 | tbl 221 | tfn 222 | tl 223 | tvíhlj 224 | tvt 225 | till 226 | to 227 | umr 228 | uh 229 | us 230 | uppl 231 | útg 232 | vb 233 | Vf 234 | vh 235 | vkf 236 | Vl 237 | vl 238 | vlf 239 | vmf 240 | 8vo 241 | vsk 242 | vth 243 | þt 244 | þf 245 | þjs 246 | þgf 247 | þlt 248 | þolm 249 | þm 250 | þml 251 | þýð 252 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.it: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 34 | Adj 35 | Adm 36 | Adv 37 | Amn 38 | Arch 39 | Asst 40 | Avv 41 | Bart 42 | Bcc 43 | Bldg 44 | Brig 45 | Bros 46 | C.A.P 47 | C.P 48 | Capt 49 | Cc 50 | Cmdr 51 | Co 52 | Col 53 | Comdr 54 | Con 55 | Corp 56 | Cpl 57 | DR 58 | Dott 59 | Dr 60 | Drs 61 | Egr 62 | Ens 63 | Gen 64 | Geom 65 | Gov 66 | Hon 67 | Hosp 68 | Hr 69 | Id 70 | Ing 71 | Insp 72 | Lt 73 | MM 74 | MR 75 | MRS 76 | MS 77 | Maj 78 | Messrs 79 | Mlle 80 | Mme 81 | Mo 82 | Mons 83 | Mr 84 | Mrs 85 | Ms 86 | Msgr 87 | N.B 88 | Op 89 | Ord 90 | P.S 91 | P.T 92 | Pfc 93 | Ph 94 | Prof 95 | Pvt 96 | RP 97 | RSVP 98 | Rag 99 | Rep 100 | Reps 101 | Res 102 | Rev 103 | Rif 104 | Rt 105 | S.A 106 | S.B.F 107 | S.P.M 108 | S.p.A 109 | S.r.l 110 | Sen 111 | Sens 112 | Sfc 113 | Sgt 114 | Sig 115 | Sigg 116 | Soc 117 | Spett 118 | Sr 119 | St 120 | Supt 121 | Surg 122 | V.P 123 | 124 | # other 125 | a.c 126 | acc 127 | all 128 | banc 129 | c.a 130 | c.c.p 131 | c.m 132 | c.p 133 | c.s 134 | c.v 135 | corr 136 | dott 137 | e.p.c 138 | ecc 139 | es 140 | fatt 141 | gg 142 | int 143 | lett 144 | ogg 145 | on 146 | p.c 147 | p.c.c 148 | p.es 149 | p.f 150 | p.r 151 | p.v 152 | post 153 | pp 154 | racc 155 | ric 156 | s.n.c 157 | seg 158 | sgg 159 | ss 160 | tel 161 | u.s 162 | v.r 163 | v.s 164 | 165 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 166 | v 167 | vs 168 | i.e 169 | rev 170 | e.g 171 | 172 | #Numbers only. These should only induce breaks when followed by a numeric sequence 173 | # add NUMERIC_ONLY after the word for this function 174 | #This case is mostly for the english "No." 
which can either be a sentence of its own, or 175 | #if followed by a number, a non-breaking prefix 176 | No #NUMERIC_ONLY# 177 | Nos 178 | Art #NUMERIC_ONLY# 179 | Nr 180 | pp #NUMERIC_ONLY# 181 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.lt: -------------------------------------------------------------------------------- 1 | # Anything in this file, followed by a period (and an upper-case word), 2 | # does NOT indicate an end-of-sentence marker. 3 | # Special cases are included for prefixes that ONLY appear before 0-9 numbers. 4 | 5 | # Any single upper case letter followed by a period is not a sentence ender 6 | # (excluding I occasionally, but we leave it in) 7 | # usually upper case letters are initials in a name 8 | A 9 | Ā 10 | B 11 | C 12 | Č 13 | D 14 | E 15 | Ē 16 | F 17 | G 18 | Ģ 19 | H 20 | I 21 | Ī 22 | J 23 | K 24 | Ķ 25 | L 26 | Ļ 27 | M 28 | N 29 | Ņ 30 | O 31 | P 32 | Q 33 | R 34 | S 35 | Š 36 | T 37 | U 38 | Ū 39 | V 40 | W 41 | X 42 | Y 43 | Z 44 | Ž 45 | 46 | # Initialis -- Džonas 47 | Dz 48 | Dž 49 | Just 50 | 51 | # Day and month abbreviations 52 | # m. menesis d. diena g. gimes 53 | m 54 | mėn 55 | d 56 | g 57 | gim 58 | # Pirmadienis Penktadienis 59 | Pr 60 | Pn 61 | Pirm 62 | Antr 63 | Treč 64 | Ketv 65 | Penkt 66 | Šešt 67 | Sekm 68 | Saus 69 | Vas 70 | Kov 71 | Bal 72 | Geg 73 | Birž 74 | Liep 75 | Rugpj 76 | Rugs 77 | Spal 78 | Lapkr 79 | Gruod 80 | 81 | # Business, governmental, geographical terms 82 | a 83 | # aikštė 84 | adv 85 | # advokatas 86 | akad 87 | # akademikas 88 | aklg 89 | # akligatvis 90 | akt 91 | # aktorius 92 | al 93 | # alėja 94 | A.V 95 | # antspaudo vieta 96 | aps 97 | apskr 98 | # apskritis 99 | apyg 100 | # apygarda 101 | aps 102 | apskr 103 | # apskritis 104 | asist 105 | # asistentas 106 | asmv 107 | avd 108 | # asmenvardis 109 | a.k 110 | asm 111 | asm.k 112 | # asmens kodas 113 | atsak 114 | # atsakingasis 115 | atsisk 116 | sąsk 117 | # atsiskaitomoji sąskaita 118 | aut 119 | # autorius 120 | b 121 | k 122 | b.k 123 | # banko kodas 124 | bkl 125 | # bakalauras 126 | bt 127 | # butas 128 | buv 129 | # buvęs, -usi 130 | dail 131 | # dailininkas 132 | dek 133 | # dekanas 134 | dėst 135 | # dėstytojas 136 | dir 137 | # direktorius 138 | dirig 139 | # dirigentas 140 | doc 141 | # docentas 142 | drp 143 | # durpynas 144 | dš 145 | # dešinysis 146 | egz 147 | # egzempliorius 148 | eil 149 | # eilutė 150 | ekon 151 | # ekonomika 152 | el 153 | # elektroninis 154 | etc 155 | ež 156 | # ežeras 157 | faks 158 | # faksas 159 | fak 160 | # fakultetas 161 | gen 162 | # generolas 163 | gyd 164 | # gydytojas 165 | gv 166 | # gyvenvietė 167 | įl 168 | # įlanka 169 | Įn 170 | # įnagininkas 171 | insp 172 | # inspektorius 173 | pan 174 | # ir panašiai 175 | t.t 176 | # ir taip toliau 177 | k.a 178 | # kaip antai 179 | kand 180 | # kandidatas 181 | kat 182 | # katedra 183 | kyš 184 | # kyšulys 185 | kl 186 | # klasė 187 | kln 188 | # kalnas 189 | kn 190 | # knyga 191 | koresp 192 | # korespondentas 193 | kpt 194 | # kapitonas 195 | kr 196 | # kairysis 197 | kt 198 | # kitas 199 | kun 200 | # kunigas 201 | l 202 | e 203 | p 204 | l.e.p 205 | # laikinai einantis pareigas 206 | ltn 207 | # leitenantas 208 | m 209 | mst 210 | # miestas 211 | m.e 212 | # mūsų eros 213 | m.m 214 | # mokslo metai 215 | mot 216 | # moteris 217 | mstl 218 | # miestelis 219 | mgr 220 | # magistras 221 | mgnt 222 | # magistrantas 223 | mjr 224 | # majoras 225 | mln 226 | # milijonas 227 | mlrd 
228 | # milijardas 229 | mok 230 | # mokinys 231 | mokyt 232 | # mokytojas 233 | moksl 234 | # mokslinis 235 | nkt 236 | # nekaitomas 237 | ntk 238 | # neteiktinas 239 | Nr 240 | nr 241 | # numeris 242 | p 243 | # ponas 244 | p.d 245 | a.d 246 | # pašto dėžutė, abonentinė dėžutė 247 | p.m.e 248 | # prieš mūsų erą 249 | pan 250 | # ir panašiai 251 | pav 252 | # paveikslas 253 | pavad 254 | # pavaduotojas 255 | pirm 256 | # pirmininkas 257 | pl 258 | # plentas 259 | plg 260 | # palygink 261 | plk 262 | # pulkininkas; pelkė 263 | pr 264 | # prospektas 265 | Kr 266 | pr.Kr 267 | # prieš Kristų 268 | prok 269 | # prokuroras 270 | prot 271 | # protokolas 272 | pss 273 | # pusiasalis 274 | pšt 275 | # paštas 276 | pvz 277 | # pavyzdžiui 278 | r 279 | # rajonas 280 | red 281 | # redaktorius 282 | rš 283 | # raštų kalbos 284 | sąs 285 | # sąsiuvinis 286 | saviv 287 | sav 288 | # savivaldybė 289 | sekr 290 | # sekretorius 291 | sen 292 | # seniūnija, seniūnas 293 | sk 294 | # skaityk; skyrius 295 | skg 296 | # skersgatvis 297 | skyr 298 | sk 299 | # skyrius 300 | skv 301 | # skveras 302 | sp 303 | # spauda; spaustuvė 304 | spec 305 | # specialistas 306 | sr 307 | # sritis 308 | st 309 | # stotis 310 | str 311 | # straipsnis 312 | stud 313 | # studentas 314 | š 315 | š.m 316 | # šių metų 317 | šnek 318 | # šnekamosios 319 | tir 320 | # tiražas 321 | tūkst 322 | # tūkstantis 323 | up 324 | # upė 325 | upl 326 | # upelis 327 | vad 328 | # vadinamasis, -oji 329 | vlsč 330 | # valsčius 331 | ved 332 | # vedėjas 333 | vet 334 | # veterinarija 335 | virš 336 | # viršininkas, viršaitis 337 | vyr 338 | # vyriausiasis, -ioji; vyras 339 | vyresn 340 | # vyresnysis 341 | vlsč 342 | # valsčius 343 | vs 344 | # viensėdis 345 | Vt 346 | vt 347 | # vietininkas 348 | vtv 349 | vv 350 | # vietovardis 351 | žml 352 | # žemėlapis 353 | 354 | # Technical terms, abbreviations used in guidebooks, advertisments, etc. 355 | # Generally lower-case. 356 | air 357 | # airiškai 358 | amer 359 | # amerikanizmas 360 | anat 361 | # anatomija 362 | angl 363 | # angl. angliskai 364 | arab 365 | # arabų 366 | archeol 367 | archit 368 | asm 369 | # asmuo 370 | astr 371 | # astronomija 372 | austral 373 | # australiškai 374 | aut 375 | # automobilis 376 | av 377 | # aviacija 378 | bažn 379 | bdv 380 | # būdvardis 381 | bibl 382 | # Biblija 383 | biol 384 | # biologija 385 | bot 386 | # botanika 387 | brt 388 | # burtai, burtažodis. 389 | brus 390 | # baltarusių 391 | buh 392 | # buhalterija 393 | chem 394 | # chemija 395 | col 396 | # collectivum 397 | con 398 | conj 399 | # conjunctivus, jungtukas 400 | dab 401 | # dab. 
dabartine 402 | dgs 403 | # daugiskaita 404 | dial 405 | # dialektizmas 406 | dipl 407 | dktv 408 | # daiktavardis 409 | džn 410 | # dažnai 411 | ekon 412 | el 413 | # elektra 414 | esam 415 | # esamasis laikas 416 | euf 417 | # eufemizmas 418 | fam 419 | # familiariai 420 | farm 421 | # farmacija 422 | filol 423 | # filologija 424 | filos 425 | # filosofija 426 | fin 427 | # finansai 428 | fiz 429 | # fizika 430 | fiziol 431 | # fiziologija 432 | flk 433 | # folkloras 434 | fon 435 | # fonetika 436 | fot 437 | # fotografija 438 | geod 439 | # geodezija 440 | geogr 441 | geol 442 | # geologija 443 | geom 444 | # geometrija 445 | glžk 446 | gr 447 | # graikų 448 | gram 449 | her 450 | # heraldika 451 | hidr 452 | # hidrotechnika 453 | ind 454 | # Indų 455 | iron 456 | # ironiškai 457 | isp 458 | # ispanų 459 | ist 460 | istor 461 | # istorija 462 | it 463 | # italų 464 | įv 465 | reikšm 466 | įv.reikšm 467 | # įvairiomis reikšmėmis 468 | jap 469 | # japonų 470 | juok 471 | # juokaujamai 472 | jūr 473 | # jūrininkystė 474 | kalb 475 | # kalbotyra 476 | kar 477 | # karyba 478 | kas 479 | # kasyba 480 | kin 481 | # kinematografija 482 | klaus 483 | # klausiamasis 484 | knyg 485 | # knyginis 486 | kom 487 | # komercija 488 | komp 489 | # kompiuteris 490 | kosm 491 | # kosmonautika 492 | kt 493 | # kitas 494 | kul 495 | # kulinarija 496 | kuop 497 | # kuopine 498 | l 499 | # laikas 500 | lit 501 | # literatūrinis 502 | lingv 503 | # lingvistika 504 | log 505 | # logika 506 | lot 507 | # lotynų 508 | mat 509 | # matematika 510 | maž 511 | # mažybinis 512 | med 513 | # medicina 514 | medž 515 | # medžioklė 516 | men 517 | # menas 518 | menk 519 | # menkinamai 520 | metal 521 | # metalurgija 522 | meteor 523 | min 524 | # mineralogija 525 | mit 526 | # mitologija 527 | mok 528 | # mokyklinis 529 | ms 530 | # mįslė 531 | muz 532 | # muzikinis 533 | n 534 | # naujasis 535 | neig 536 | # neigiamasis 537 | neol 538 | # neologizmas 539 | niek 540 | # niekinamai 541 | ofic 542 | # oficialus 543 | opt 544 | # optika 545 | orig 546 | # original 547 | p 548 | # pietūs 549 | pan 550 | # panašiai 551 | parl 552 | # parlamentas 553 | pat 554 | # patarlė 555 | paž 556 | # pažodžiui 557 | plg 558 | # palygink 559 | poet 560 | # poetizmas 561 | poez 562 | # poezija 563 | poligr 564 | # poligrafija 565 | polit 566 | # politika 567 | ppr 568 | # paprastai 569 | pranc 570 | pr 571 | # prancūzų, prūsų 572 | priet 573 | # prietaras 574 | prek 575 | # prekyba 576 | prk 577 | # perkeltine 578 | prs 579 | # persona, asmuo 580 | psn 581 | # pasenęs žodis 582 | psich 583 | # psichologija 584 | pvz 585 | # pavyzdžiui 586 | r 587 | # rytai 588 | rad 589 | # radiotechnika 590 | rel 591 | # religija 592 | ret 593 | # retai 594 | rus 595 | # rusų 596 | sen 597 | # senasis 598 | sl 599 | # slengas, slavų 600 | sov 601 | # sovietinis 602 | spec 603 | # specialus 604 | sport 605 | stat 606 | # statyba 607 | sudurt 608 | # sudurtinis 609 | sutr 610 | # sutrumpintas 611 | suv 612 | # suvalkiečių 613 | š 614 | # šiaurė 615 | šach 616 | # šachmatai 617 | šiaur 618 | škot 619 | # škotiškai 620 | šnek 621 | # šnekamoji 622 | teatr 623 | tech 624 | techn 625 | # technika 626 | teig 627 | # teigiamas 628 | teis 629 | # teisė 630 | tekst 631 | # tekstilė 632 | tel 633 | # telefonas 634 | teol 635 | # teologija 636 | v 637 | # tik vyriškosios, vakarai 638 | t.p 639 | t 640 | p 641 | # ir taip pat 642 | t.t 643 | # ir taip toliau 644 | t.y 645 | # tai yra 646 | vaik 647 | # vaikų 648 | vart 649 | # vartojama 650 | vet 651 | # veterinarija 
652 | vid 653 | # vidurinis 654 | vksm 655 | # veiksmažodis 656 | vns 657 | # vienaskaita 658 | vok 659 | # vokiečių 660 | vulg 661 | # vulgariai 662 | zool 663 | # zoologija 664 | žr 665 | # žiūrėk 666 | ž.ū 667 | ž 668 | ū 669 | # žemės ūkis 670 | 671 | # List of titles. These are often followed by upper-case names, but do 672 | # not indicate sentence breaks 673 | # 674 | # Jo Eminencija 675 | Em. 676 | # Gerbiamasis 677 | Gerb 678 | gerb 679 | # malonus 680 | malon 681 | # profesorius 682 | Prof 683 | prof 684 | # daktaras (mokslų) 685 | Dr 686 | dr 687 | habil 688 | med 689 | # inž inžinierius 690 | inž 691 | Inž 692 | 693 | 694 | #Numbers only. These should only induce breaks when followed by a numeric sequence 695 | # add NUMERIC_ONLY after the word for this function 696 | #This case is mostly for the english "No." which can either be a sentence of its own, or 697 | #if followed by a number, a non-breaking prefix 698 | No #NUMERIC_ONLY# 699 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.lv: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | Ā 8 | B 9 | C 10 | Č 11 | D 12 | E 13 | Ē 14 | F 15 | G 16 | Ģ 17 | H 18 | I 19 | Ī 20 | J 21 | K 22 | Ķ 23 | L 24 | Ļ 25 | M 26 | N 27 | Ņ 28 | O 29 | P 30 | Q 31 | R 32 | S 33 | Š 34 | T 35 | U 36 | Ū 37 | V 38 | W 39 | X 40 | Y 41 | Z 42 | Ž 43 | 44 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 45 | dr 46 | Dr 47 | med 48 | prof 49 | Prof 50 | inž 51 | Inž 52 | ist.loc 53 | Ist.loc 54 | kor.loc 55 | Kor.loc 56 | v.i 57 | vietn 58 | Vietn 59 | 60 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 61 | a.l 62 | t.p 63 | pārb 64 | Pārb 65 | vec 66 | Vec 67 | inv 68 | Inv 69 | sk 70 | Sk 71 | spec 72 | Spec 73 | vienk 74 | Vienk 75 | virz 76 | Virz 77 | māksl 78 | Māksl 79 | mūz 80 | Mūz 81 | akad 82 | Akad 83 | soc 84 | Soc 85 | galv 86 | Galv 87 | vad 88 | Vad 89 | sertif 90 | Sertif 91 | folkl 92 | Folkl 93 | hum 94 | Hum 95 | 96 | #Numbers only. These should only induce breaks when followed by a numeric sequence 97 | # add NUMERIC_ONLY after the word for this function 98 | #This case is mostly for the english "No." which can either be a sentence of its own, or 99 | #if followed by a number, a non-breaking prefix 100 | Nr #NUMERIC_ONLY# 101 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.nl: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 
3 | #Sources: http://nl.wikipedia.org/wiki/Lijst_van_afkortingen 4 | # http://nl.wikipedia.org/wiki/Aanspreekvorm 5 | # http://nl.wikipedia.org/wiki/Titulatuur_in_het_Nederlands_hoger_onderwijs 6 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 7 | #usually upper case letters are initials in a name 8 | A 9 | B 10 | C 11 | D 12 | E 13 | F 14 | G 15 | H 16 | I 17 | J 18 | K 19 | L 20 | M 21 | N 22 | O 23 | P 24 | Q 25 | R 26 | S 27 | T 28 | U 29 | V 30 | W 31 | X 32 | Y 33 | Z 34 | 35 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 36 | bacc 37 | bc 38 | bgen 39 | c.i 40 | dhr 41 | dr 42 | dr.h.c 43 | drs 44 | drs 45 | ds 46 | eint 47 | fa 48 | Fa 49 | fam 50 | gen 51 | genm 52 | ing 53 | ir 54 | jhr 55 | jkvr 56 | jr 57 | kand 58 | kol 59 | lgen 60 | lkol 61 | Lt 62 | maj 63 | Mej 64 | mevr 65 | Mme 66 | mr 67 | mr 68 | Mw 69 | o.b.s 70 | plv 71 | prof 72 | ritm 73 | tint 74 | Vz 75 | Z.D 76 | Z.D.H 77 | Z.E 78 | Z.Em 79 | Z.H 80 | Z.K.H 81 | Z.K.M 82 | Z.M 83 | z.v 84 | 85 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 86 | #we seem to have a lot of these in dutch i.e.: i.p.v - in plaats van (in stead of) never ends a sentence 87 | a.g.v 88 | bijv 89 | bijz 90 | bv 91 | d.w.z 92 | e.c 93 | e.g 94 | e.k 95 | ev 96 | i.p.v 97 | i.s.m 98 | i.t.t 99 | i.v.m 100 | m.a.w 101 | m.b.t 102 | m.b.v 103 | m.h.o 104 | m.i 105 | m.i.v 106 | v.w.t 107 | 108 | #Numbers only. These should only induce breaks when followed by a numeric sequence 109 | # add NUMERIC_ONLY after the word for this function 110 | #This case is mostly for the english "No." which can either be a sentence of its own, or 111 | #if followed by a number, a non-breaking prefix 112 | Nr #NUMERIC_ONLY# 113 | Nrs 114 | nrs 115 | nr #NUMERIC_ONLY# 116 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.pl: -------------------------------------------------------------------------------- 1 | adw 2 | afr 3 | akad 4 | al 5 | Al 6 | am 7 | amer 8 | arch 9 | art 10 | Art 11 | artyst 12 | astr 13 | austr 14 | bałt 15 | bdb 16 | bł 17 | bm 18 | br 19 | bryg 20 | bryt 21 | centr 22 | ces 23 | chem 24 | chiń 25 | chir 26 | c.k 27 | c.o 28 | cyg 29 | cyw 30 | cyt 31 | czes 32 | czw 33 | cd 34 | Cd 35 | czyt 36 | ćw 37 | ćwicz 38 | daw 39 | dcn 40 | dekl 41 | demokr 42 | det 43 | diec 44 | dł 45 | dn 46 | dot 47 | dol 48 | dop 49 | dost 50 | dosł 51 | h.c 52 | ds 53 | dst 54 | duszp 55 | dypl 56 | egz 57 | ekol 58 | ekon 59 | elektr 60 | em 61 | ew 62 | fab 63 | farm 64 | fot 65 | fr 66 | gat 67 | gastr 68 | geogr 69 | geol 70 | gimn 71 | głęb 72 | gm 73 | godz 74 | górn 75 | gosp 76 | gr 77 | gram 78 | hist 79 | hiszp 80 | hr 81 | Hr 82 | hot 83 | id 84 | in 85 | im 86 | iron 87 | jn 88 | kard 89 | kat 90 | katol 91 | k.k 92 | kk 93 | kol 94 | kl 95 | k.p.a 96 | kpc 97 | k.p.c 98 | kpt 99 | kr 100 | k.r 101 | krak 102 | k.r.o 103 | kryt 104 | kult 105 | laic 106 | łac 107 | niem 108 | woj 109 | nb 110 | np 111 | Nb 112 | Np 113 | pol 114 | pow 115 | m.in 116 | pt 117 | ps 118 | Pt 119 | Ps 120 | cdn 121 | jw 122 | ryc 123 | rys 124 | Ryc 125 | Rys 126 | tj 127 | tzw 128 | Tzw 129 | tzn 130 | zob 131 | ang 132 | ub 133 | ul 134 | pw 135 | pn 136 | pl 137 | al 138 | k 139 | n 140 | nr #NUMERIC_ONLY# 141 | Nr #NUMERIC_ONLY# 142 | ww 143 | wł 144 | ur 145 | zm 146 | żyd 
147 | żarg 148 | żyw 149 | wył 150 | bp 151 | bp 152 | wyst 153 | tow 154 | Tow 155 | o 156 | sp 157 | Sp 158 | st 159 | spółdz 160 | Spółdz 161 | społ 162 | spółgł 163 | stoł 164 | stow 165 | Stoł 166 | Stow 167 | zn 168 | zew 169 | zewn 170 | zdr 171 | zazw 172 | zast 173 | zaw 174 | zał 175 | zal 176 | zam 177 | zak 178 | zakł 179 | zagr 180 | zach 181 | adw 182 | Adw 183 | lek 184 | Lek 185 | med 186 | mec 187 | Mec 188 | doc 189 | Doc 190 | dyw 191 | dyr 192 | Dyw 193 | Dyr 194 | inż 195 | Inż 196 | mgr 197 | Mgr 198 | dh 199 | dr 200 | Dh 201 | Dr 202 | p 203 | P 204 | red 205 | Red 206 | prof 207 | prok 208 | Prof 209 | Prok 210 | hab 211 | płk 212 | Płk 213 | nadkom 214 | Nadkom 215 | podkom 216 | Podkom 217 | ks 218 | Ks 219 | gen 220 | Gen 221 | por 222 | Por 223 | reż 224 | Reż 225 | przyp 226 | Przyp 227 | śp 228 | św 229 | śW 230 | Śp 231 | Św 232 | ŚW 233 | szer 234 | Szer 235 | pkt #NUMERIC_ONLY# 236 | str #NUMERIC_ONLY# 237 | tab #NUMERIC_ONLY# 238 | Tab #NUMERIC_ONLY# 239 | tel 240 | ust #NUMERIC_ONLY# 241 | par #NUMERIC_ONLY# 242 | poz 243 | pok 244 | oo 245 | oO 246 | Oo 247 | OO 248 | r #NUMERIC_ONLY# 249 | l #NUMERIC_ONLY# 250 | s #NUMERIC_ONLY# 251 | najśw 252 | Najśw 253 | A 254 | B 255 | C 256 | D 257 | E 258 | F 259 | G 260 | H 261 | I 262 | J 263 | K 264 | L 265 | M 266 | N 267 | O 268 | P 269 | Q 270 | R 271 | S 272 | T 273 | U 274 | V 275 | W 276 | X 277 | Y 278 | Z 279 | Ś 280 | Ć 281 | Ż 282 | Ź 283 | Dz 284 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.pt: -------------------------------------------------------------------------------- 1 | #File adapted for PT by H. Leal Fontes from the EN & DE versions published with moses-2009-04-13. Last update: 10.11.2009. 2 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 3 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 4 | 5 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 6 | #usually upper case letters are initials in a name 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | 61 | #Roman Numerals. A dot after one of these is not a sentence break in Portuguese. 62 | I 63 | II 64 | III 65 | IV 66 | V 67 | VI 68 | VII 69 | VIII 70 | IX 71 | X 72 | XI 73 | XII 74 | XIII 75 | XIV 76 | XV 77 | XVI 78 | XVII 79 | XVIII 80 | XIX 81 | XX 82 | i 83 | ii 84 | iii 85 | iv 86 | v 87 | vi 88 | vii 89 | viii 90 | ix 91 | x 92 | xi 93 | xii 94 | xiii 95 | xiv 96 | xv 97 | xvi 98 | xvii 99 | xviii 100 | xix 101 | xx 102 | 103 | #List of titles. 
These are often followed by upper-case names, but do not indicate sentence breaks 104 | Adj 105 | Adm 106 | Adv 107 | Art 108 | Ca 109 | Capt 110 | Cmdr 111 | Col 112 | Comdr 113 | Con 114 | Corp 115 | Cpl 116 | DR 117 | DRA 118 | Dr 119 | Dra 120 | Dras 121 | Drs 122 | Eng 123 | Enga 124 | Engas 125 | Engos 126 | Ex 127 | Exo 128 | Exmo 129 | Fig 130 | Gen 131 | Hosp 132 | Insp 133 | Lda 134 | MM 135 | MR 136 | MRS 137 | MS 138 | Maj 139 | Mrs 140 | Ms 141 | Msgr 142 | Op 143 | Ord 144 | Pfc 145 | Ph 146 | Prof 147 | Pvt 148 | Rep 149 | Reps 150 | Res 151 | Rev 152 | Rt 153 | Sen 154 | Sens 155 | Sfc 156 | Sgt 157 | Sr 158 | Sra 159 | Sras 160 | Srs 161 | Sto 162 | Supt 163 | Surg 164 | adj 165 | adm 166 | adv 167 | art 168 | cit 169 | col 170 | con 171 | corp 172 | cpl 173 | dr 174 | dra 175 | dras 176 | drs 177 | eng 178 | enga 179 | engas 180 | engos 181 | ex 182 | exo 183 | exmo 184 | fig 185 | op 186 | prof 187 | sr 188 | sra 189 | sras 190 | srs 191 | sto 192 | 193 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 194 | v 195 | vs 196 | i.e 197 | rev 198 | e.g 199 | 200 | #Numbers only. These should only induce breaks when followed by a numeric sequence 201 | # add NUMERIC_ONLY after the word for this function 202 | #This case is mostly for the english "No." which can either be a sentence of its own, or 203 | #if followed by a number, a non-breaking prefix 204 | No #NUMERIC_ONLY# 205 | Nos 206 | Art #NUMERIC_ONLY# 207 | Nr 208 | p #NUMERIC_ONLY# 209 | pp #NUMERIC_ONLY# 210 | 211 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.ro: -------------------------------------------------------------------------------- 1 | A 2 | B 3 | C 4 | D 5 | E 6 | F 7 | G 8 | H 9 | I 10 | J 11 | K 12 | L 13 | M 14 | N 15 | O 16 | P 17 | Q 18 | R 19 | S 20 | T 21 | U 22 | V 23 | W 24 | X 25 | Y 26 | Z 27 | dpdv 28 | etc 29 | șamd 30 | M.Ap.N 31 | dl 32 | Dl 33 | d-na 34 | D-na 35 | dvs 36 | Dvs 37 | pt 38 | Pt 39 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.ru: -------------------------------------------------------------------------------- 1 | # added Cyrillic uppercase letters [А-Я] 2 | # removed 000D carriage return (this is not removed by chomp in tokenizer.perl, and prevents recognition of the prefixes) 3 | # edited by Kate Young (nspaceanalysis@earthlink.net) 21 May 2013 4 | А 5 | Б 6 | В 7 | Г 8 | Д 9 | Е 10 | Ж 11 | З 12 | И 13 | Й 14 | К 15 | Л 16 | М 17 | Н 18 | О 19 | П 20 | Р 21 | С 22 | Т 23 | У 24 | Ф 25 | Х 26 | Ц 27 | Ч 28 | Ш 29 | Щ 30 | Ъ 31 | Ы 32 | Ь 33 | Э 34 | Ю 35 | Я 36 | A 37 | B 38 | C 39 | D 40 | E 41 | F 42 | G 43 | H 44 | I 45 | J 46 | K 47 | L 48 | M 49 | N 50 | O 51 | P 52 | Q 53 | R 54 | S 55 | T 56 | U 57 | V 58 | W 59 | X 60 | Y 61 | Z 62 | 0гг 63 | 1гг 64 | 2гг 65 | 3гг 66 | 4гг 67 | 5гг 68 | 6гг 69 | 7гг 70 | 8гг 71 | 9гг 72 | 0г 73 | 1г 74 | 2г 75 | 3г 76 | 4г 77 | 5г 78 | 6г 79 | 7г 80 | 8г 81 | 9г 82 | Xвв 83 | Vвв 84 | Iвв 85 | Lвв 86 | Mвв 87 | Cвв 88 | Xв 89 | Vв 90 | Iв 91 | Lв 92 | Mв 93 | Cв 94 | 0м 95 | 1м 96 | 2м 97 | 3м 98 | 4м 99 | 5м 100 | 6м 101 | 7м 102 | 8м 103 | 9м 104 | 0мм 105 | 1мм 106 | 2мм 107 | 3мм 108 | 4мм 109 | 5мм 110 | 6мм 111 | 7мм 112 | 8мм 113 | 9мм 114 | 0см 115 | 1см 116 | 2см 117 | 3см 118 | 4см 119 | 5см 120 | 6см 121 | 7см 122 | 8см 123 | 9см 124 | 0дм 125 | 1дм 126 | 2дм 127 | 3дм 128 | 
4дм 129 | 5дм 130 | 6дм 131 | 7дм 132 | 8дм 133 | 9дм 134 | 0л 135 | 1л 136 | 2л 137 | 3л 138 | 4л 139 | 5л 140 | 6л 141 | 7л 142 | 8л 143 | 9л 144 | 0км 145 | 1км 146 | 2км 147 | 3км 148 | 4км 149 | 5км 150 | 6км 151 | 7км 152 | 8км 153 | 9км 154 | 0га 155 | 1га 156 | 2га 157 | 3га 158 | 4га 159 | 5га 160 | 6га 161 | 7га 162 | 8га 163 | 9га 164 | 0кг 165 | 1кг 166 | 2кг 167 | 3кг 168 | 4кг 169 | 5кг 170 | 6кг 171 | 7кг 172 | 8кг 173 | 9кг 174 | 0т 175 | 1т 176 | 2т 177 | 3т 178 | 4т 179 | 5т 180 | 6т 181 | 7т 182 | 8т 183 | 9т 184 | 0г 185 | 1г 186 | 2г 187 | 3г 188 | 4г 189 | 5г 190 | 6г 191 | 7г 192 | 8г 193 | 9г 194 | 0мг 195 | 1мг 196 | 2мг 197 | 3мг 198 | 4мг 199 | 5мг 200 | 6мг 201 | 7мг 202 | 8мг 203 | 9мг 204 | бульв 205 | в 206 | вв 207 | г 208 | га 209 | гг 210 | гл 211 | гос 212 | д 213 | дм 214 | доп 215 | др 216 | е 217 | ед 218 | ед 219 | зам 220 | и 221 | инд 222 | исп 223 | Исп 224 | к 225 | кап 226 | кг 227 | кв 228 | кл 229 | км 230 | кол 231 | комн 232 | коп 233 | куб 234 | л 235 | лиц 236 | лл 237 | м 238 | макс 239 | мг 240 | мин 241 | мл 242 | млн 243 | млрд 244 | мм 245 | н 246 | наб 247 | нач 248 | неуд 249 | ном 250 | о 251 | обл 252 | обр 253 | общ 254 | ок 255 | ост 256 | отл 257 | п 258 | пер 259 | перераб 260 | пл 261 | пос 262 | пр 263 | просп 264 | проф 265 | р 266 | ред 267 | руб 268 | с 269 | сб 270 | св 271 | см 272 | соч 273 | ср 274 | ст 275 | стр 276 | т 277 | тел 278 | Тел 279 | тех 280 | тт 281 | туп 282 | тыс 283 | уд 284 | ул 285 | уч 286 | физ 287 | х 288 | хор 289 | ч 290 | чел 291 | шт 292 | экз 293 | э 294 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.sk: -------------------------------------------------------------------------------- 1 | Bc 2 | Mgr 3 | RNDr 4 | PharmDr 5 | PhDr 6 | JUDr 7 | PaedDr 8 | ThDr 9 | Ing 10 | MUDr 11 | MDDr 12 | MVDr 13 | Dr 14 | ThLic 15 | PhD 16 | ArtD 17 | ThDr 18 | Dr 19 | DrSc 20 | CSs 21 | prof 22 | obr 23 | Obr 24 | Č 25 | č 26 | absol 27 | adj 28 | admin 29 | adr 30 | Adr 31 | adv 32 | advok 33 | afr 34 | ak 35 | akad 36 | akc 37 | akuz 38 | et 39 | al 40 | alch 41 | amer 42 | anat 43 | angl 44 | Angl 45 | anglosas 46 | anorg 47 | ap 48 | apod 49 | arch 50 | archeol 51 | archit 52 | arg 53 | art 54 | astr 55 | astrol 56 | astron 57 | atp 58 | atď 59 | austr 60 | Austr 61 | aut 62 | belg 63 | Belg 64 | bibl 65 | Bibl 66 | biol 67 | bot 68 | bud 69 | bás 70 | býv 71 | cest 72 | chem 73 | cirk 74 | csl 75 | čs 76 | Čs 77 | dat 78 | dep 79 | det 80 | dial 81 | diaľ 82 | dipl 83 | distrib 84 | dokl 85 | dosl 86 | dopr 87 | dram 88 | duš 89 | dv 90 | dvojčl 91 | dór 92 | ekol 93 | ekon 94 | el 95 | elektr 96 | elektrotech 97 | energet 98 | epic 99 | est 100 | etc 101 | etonym 102 | eufem 103 | európ 104 | Európ 105 | ev 106 | evid 107 | expr 108 | fa 109 | fam 110 | farm 111 | fem 112 | feud 113 | fil 114 | filat 115 | filoz 116 | fi 117 | fon 118 | form 119 | fot 120 | fr 121 | Fr 122 | franc 123 | Franc 124 | fraz 125 | fut 126 | fyz 127 | fyziol 128 | garb 129 | gen 130 | genet 131 | genpor 132 | geod 133 | geogr 134 | geol 135 | geom 136 | germ 137 | gr 138 | Gr 139 | gréc 140 | Gréc 141 | gréckokat 142 | hebr 143 | herald 144 | hist 145 | hlav 146 | hosp 147 | hromad 148 | hud 149 | hypok 150 | ident 151 | i.e 152 | ident 153 | imp 154 | impf 155 | indoeur 156 | inf 157 | inform 158 | instr 159 | int 160 | interj 161 | inšt 162 | inštr 163 | iron 164 | jap 165 | Jap 166 | jaz 167 | jedn 168 | juhoamer 169 | juhových 170 | 
juhozáp 171 | juž 172 | kanad 173 | Kanad 174 | kanc 175 | kapit 176 | kpt 177 | kart 178 | katastr 179 | knih 180 | kniž 181 | komp 182 | konj 183 | konkr 184 | kozmet 185 | krajč 186 | kresť 187 | kt 188 | kuch 189 | lat 190 | latinskoamer 191 | lek 192 | lex 193 | lingv 194 | lit 195 | litur 196 | log 197 | lok 198 | max 199 | Max 200 | maď 201 | Maď 202 | medzinár 203 | mest 204 | metr 205 | mil 206 | Mil 207 | min 208 | Min 209 | miner 210 | ml 211 | mld 212 | mn 213 | mod 214 | mytol 215 | napr 216 | nar 217 | Nar 218 | nasl 219 | nedok 220 | neg 221 | negat 222 | neklas 223 | nem 224 | Nem 225 | neodb 226 | neos 227 | neskl 228 | nesklon 229 | nespis 230 | nespráv 231 | neved 232 | než 233 | niekt 234 | niž 235 | nom 236 | náb 237 | nákl 238 | námor 239 | nár 240 | obch 241 | obj 242 | obv 243 | obyč 244 | obč 245 | občian 246 | odb 247 | odd 248 | ods 249 | ojed 250 | okr 251 | Okr 252 | opt 253 | opyt 254 | org 255 | os 256 | osob 257 | ot 258 | ovoc 259 | par 260 | part 261 | pejor 262 | pers 263 | pf 264 | Pf 265 | P.f 266 | p.f 267 | pl 268 | Plk 269 | pod 270 | podst 271 | pokl 272 | polit 273 | politol 274 | polygr 275 | pomn 276 | popl 277 | por 278 | porad 279 | porov 280 | posch 281 | potrav 282 | použ 283 | poz 284 | pozit 285 | poľ 286 | poľno 287 | poľnohosp 288 | poľov 289 | pošt 290 | pož 291 | prac 292 | predl 293 | pren 294 | prep 295 | preuk 296 | priezv 297 | Priezv 298 | privl 299 | prof 300 | práv 301 | príd 302 | príj 303 | prík 304 | príp 305 | prír 306 | prísl 307 | príslov 308 | príč 309 | psych 310 | publ 311 | pís 312 | písm 313 | pôv 314 | refl 315 | reg 316 | rep 317 | resp 318 | rozk 319 | rozlič 320 | rozpráv 321 | roč 322 | Roč 323 | ryb 324 | rádiotech 325 | rím 326 | samohl 327 | semest 328 | sev 329 | severoamer 330 | severových 331 | severozáp 332 | sg 333 | skr 334 | skup 335 | sl 336 | Sloven 337 | soc 338 | soch 339 | sociol 340 | sp 341 | spol 342 | Spol 343 | spoloč 344 | spoluhl 345 | správ 346 | spôs 347 | st 348 | star 349 | starogréc 350 | starorím 351 | s.r.o 352 | stol 353 | stor 354 | str 355 | stredoamer 356 | stredoškol 357 | subj 358 | subst 359 | superl 360 | sv 361 | sz 362 | súkr 363 | súp 364 | súvzť 365 | tal 366 | Tal 367 | tech 368 | tel 369 | Tel 370 | telef 371 | teles 372 | telev 373 | teol 374 | trans 375 | turist 376 | tuzem 377 | typogr 378 | tzn 379 | tzv 380 | ukaz 381 | ul 382 | Ul 383 | umel 384 | univ 385 | ust 386 | ved 387 | vedľ 388 | verb 389 | veter 390 | vin 391 | viď 392 | vl 393 | vod 394 | vodohosp 395 | pnl 396 | vulg 397 | vyj 398 | vys 399 | vysokoškol 400 | vzťaž 401 | vôb 402 | vých 403 | výd 404 | výrob 405 | výsk 406 | výsl 407 | výtv 408 | výtvar 409 | význ 410 | včel 411 | vš 412 | všeob 413 | zahr 414 | zar 415 | zariad 416 | zast 417 | zastar 418 | zastaráv 419 | zb 420 | zdravot 421 | združ 422 | zjemn 423 | zlat 424 | zn 425 | Zn 426 | zool 427 | zr 428 | zried 429 | zv 430 | záhr 431 | zák 432 | zákl 433 | zám 434 | záp 435 | západoeur 436 | zázn 437 | územ 438 | účt 439 | čast 440 | čes 441 | Čes 442 | čl 443 | čísl 444 | živ 445 | pr 446 | fak 447 | Kr 448 | p.n.l 449 | A 450 | B 451 | C 452 | D 453 | E 454 | F 455 | G 456 | H 457 | I 458 | J 459 | K 460 | L 461 | M 462 | N 463 | O 464 | P 465 | Q 466 | R 467 | S 468 | T 469 | U 470 | V 471 | W 472 | X 473 | Y 474 | Z 475 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.sl: -------------------------------------------------------------------------------- 1 | dr 
2 | Dr 3 | itd 4 | itn 5 | št #NUMERIC_ONLY# 6 | Št #NUMERIC_ONLY# 7 | d 8 | jan 9 | Jan 10 | feb 11 | Feb 12 | mar 13 | Mar 14 | apr 15 | Apr 16 | jun 17 | Jun 18 | jul 19 | Jul 20 | avg 21 | Avg 22 | sept 23 | Sept 24 | sep 25 | Sep 26 | okt 27 | Okt 28 | nov 29 | Nov 30 | dec 31 | Dec 32 | tj 33 | Tj 34 | npr 35 | Npr 36 | sl 37 | Sl 38 | op 39 | Op 40 | gl 41 | Gl 42 | oz 43 | Oz 44 | prev 45 | dipl 46 | ing 47 | prim 48 | Prim 49 | cf 50 | Cf 51 | gl 52 | Gl 53 | A 54 | B 55 | C 56 | D 57 | E 58 | F 59 | G 60 | H 61 | I 62 | J 63 | K 64 | L 65 | M 66 | N 67 | O 68 | P 69 | Q 70 | R 71 | S 72 | T 73 | U 74 | V 75 | W 76 | X 77 | Y 78 | Z 79 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.sv: -------------------------------------------------------------------------------- 1 | #single upper case letter are usually initials 2 | A 3 | B 4 | C 5 | D 6 | E 7 | F 8 | G 9 | H 10 | I 11 | J 12 | K 13 | L 14 | M 15 | N 16 | O 17 | P 18 | Q 19 | R 20 | S 21 | T 22 | U 23 | V 24 | W 25 | X 26 | Y 27 | Z 28 | #misc abbreviations 29 | AB 30 | G 31 | VG 32 | dvs 33 | etc 34 | from 35 | iaf 36 | jfr 37 | kl 38 | kr 39 | mao 40 | mfl 41 | mm 42 | osv 43 | pga 44 | tex 45 | tom 46 | vs 47 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.ta: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | அ 7 | ஆ 8 | இ 9 | ஈ 10 | உ 11 | ஊ 12 | எ 13 | ஏ 14 | ஐ 15 | ஒ 16 | ஓ 17 | ஔ 18 | ஃ 19 | க 20 | கா 21 | கி 22 | கீ 23 | கு 24 | கூ 25 | கெ 26 | கே 27 | கை 28 | கொ 29 | கோ 30 | கௌ 31 | க் 32 | ச 33 | சா 34 | சி 35 | சீ 36 | சு 37 | சூ 38 | செ 39 | சே 40 | சை 41 | சொ 42 | சோ 43 | சௌ 44 | ச் 45 | ட 46 | டா 47 | டி 48 | டீ 49 | டு 50 | டூ 51 | டெ 52 | டே 53 | டை 54 | டொ 55 | டோ 56 | டௌ 57 | ட் 58 | த 59 | தா 60 | தி 61 | தீ 62 | து 63 | தூ 64 | தெ 65 | தே 66 | தை 67 | தொ 68 | தோ 69 | தௌ 70 | த் 71 | ப 72 | பா 73 | பி 74 | பீ 75 | பு 76 | பூ 77 | பெ 78 | பே 79 | பை 80 | பொ 81 | போ 82 | பௌ 83 | ப் 84 | ற 85 | றா 86 | றி 87 | றீ 88 | று 89 | றூ 90 | றெ 91 | றே 92 | றை 93 | றொ 94 | றோ 95 | றௌ 96 | ற் 97 | ய 98 | யா 99 | யி 100 | யீ 101 | யு 102 | யூ 103 | யெ 104 | யே 105 | யை 106 | யொ 107 | யோ 108 | யௌ 109 | ய் 110 | ர 111 | ரா 112 | ரி 113 | ரீ 114 | ரு 115 | ரூ 116 | ரெ 117 | ரே 118 | ரை 119 | ரொ 120 | ரோ 121 | ரௌ 122 | ர் 123 | ல 124 | லா 125 | லி 126 | லீ 127 | லு 128 | லூ 129 | லெ 130 | லே 131 | லை 132 | லொ 133 | லோ 134 | லௌ 135 | ல் 136 | வ 137 | வா 138 | வி 139 | வீ 140 | வு 141 | வூ 142 | வெ 143 | வே 144 | வை 145 | வொ 146 | வோ 147 | வௌ 148 | வ் 149 | ள 150 | ளா 151 | ளி 152 | ளீ 153 | ளு 154 | ளூ 155 | ளெ 156 | ளே 157 | ளை 158 | ளொ 159 | ளோ 160 | ளௌ 161 | ள் 162 | ழ 163 | ழா 164 | ழி 165 | ழீ 166 | ழு 167 | ழூ 168 | ழெ 169 | ழே 170 | ழை 171 | ழொ 172 | ழோ 173 | ழௌ 174 | ழ் 175 | ங 176 | ஙா 177 | ஙி 178 | ஙீ 179 | ஙு 180 | ஙூ 181 | ஙெ 182 | ஙே 183 | ஙை 184 | ஙொ 185 | ஙோ 186 | ஙௌ 187 | ங் 188 | ஞ 189 | ஞா 190 | ஞி 191 | ஞீ 192 | ஞு 193 | ஞூ 194 | ஞெ 195 | ஞே 196 | ஞை 197 | ஞொ 198 | ஞோ 199 | ஞௌ 200 | ஞ் 201 | ண 202 | ணா 203 | ணி 204 | ணீ 205 | ணு 
206 | ணூ 207 | ணெ 208 | ணே 209 | ணை 210 | ணொ 211 | ணோ 212 | ணௌ 213 | ண் 214 | ந 215 | நா 216 | நி 217 | நீ 218 | நு 219 | நூ 220 | நெ 221 | நே 222 | நை 223 | நொ 224 | நோ 225 | நௌ 226 | ந் 227 | ம 228 | மா 229 | மி 230 | மீ 231 | மு 232 | மூ 233 | மெ 234 | மே 235 | மை 236 | மொ 237 | மோ 238 | மௌ 239 | ம் 240 | ன 241 | னா 242 | னி 243 | னீ 244 | னு 245 | னூ 246 | னெ 247 | னே 248 | னை 249 | னொ 250 | னோ 251 | னௌ 252 | ன் 253 | 254 | 255 | #List of titles. These are often followed by upper-case names, but do not indicate sentence breaks 256 | திரு 257 | திருமதி 258 | வண 259 | கௌரவ 260 | 261 | 262 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 263 | உ.ம் 264 | #கா.ம் 265 | #எ.ம் 266 | 267 | 268 | #Numbers only. These should only induce breaks when followed by a numeric sequence 269 | # add NUMERIC_ONLY after the word for this function 270 | #This case is mostly for the english "No." which can either be a sentence of its own, or 271 | #if followed by a number, a non-breaking prefix 272 | No #NUMERIC_ONLY# 273 | Nos 274 | Art #NUMERIC_ONLY# 275 | Nr 276 | pp #NUMERIC_ONLY# 277 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.yue: -------------------------------------------------------------------------------- 1 | # 2 | # Cantonese (Chinese) 3 | # 4 | # Anything in this file, followed by a period, 5 | # does NOT indicate an end-of-sentence marker. 6 | # 7 | # English/Euro-language given-name initials (appearing in 8 | # news, periodicals, etc.) 9 | A 10 | Ā 11 | B 12 | C 13 | Č 14 | D 15 | E 16 | Ē 17 | F 18 | G 19 | Ģ 20 | H 21 | I 22 | Ī 23 | J 24 | K 25 | Ķ 26 | L 27 | Ļ 28 | M 29 | N 30 | Ņ 31 | O 32 | P 33 | Q 34 | R 35 | S 36 | Š 37 | T 38 | U 39 | Ū 40 | V 41 | W 42 | X 43 | Y 44 | Z 45 | Ž 46 | 47 | # Numbers only. These should only induce breaks when followed by 48 | # a numeric sequence. 49 | # Add NUMERIC_ONLY after the word for this function. This case is 50 | # mostly for the english "No." which can either be a sentence of its 51 | # own, or if followed by a number, a non-breaking prefix. 52 | No #NUMERIC_ONLY# 53 | Nr #NUMERIC_ONLY# 54 | -------------------------------------------------------------------------------- /share/nonbreaking_prefixes/nonbreaking_prefix.zh: -------------------------------------------------------------------------------- 1 | # 2 | # Mandarin (Chinese) 3 | # 4 | # Anything in this file, followed by a period, 5 | # does NOT indicate an end-of-sentence marker. 6 | # 7 | # English/Euro-language given-name initials (appearing in 8 | # news, periodicals, etc.) 9 | A 10 | Ā 11 | B 12 | C 13 | Č 14 | D 15 | E 16 | Ē 17 | F 18 | G 19 | Ģ 20 | H 21 | I 22 | Ī 23 | J 24 | K 25 | Ķ 26 | L 27 | Ļ 28 | M 29 | N 30 | Ņ 31 | O 32 | P 33 | Q 34 | R 35 | S 36 | Š 37 | T 38 | U 39 | Ū 40 | V 41 | W 42 | X 43 | Y 44 | Z 45 | Ž 46 | 47 | # Numbers only. These should only induce breaks when followed by 48 | # a numeric sequence. 49 | # Add NUMERIC_ONLY after the word for this function. This case is 50 | # mostly for the english "No." which can either be a sentence of its 51 | # own, or if followed by a number, a non-breaking prefix. 
52 | No #NUMERIC_ONLY# 53 | Nr #NUMERIC_ONLY# 54 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import numpy as np 4 | import random 5 | import time 6 | 7 | import torch 8 | import torch.nn as nn 9 | from torch import cuda 10 | from torch.autograd import Variable 11 | 12 | import lib 13 | 14 | parser = argparse.ArgumentParser(description="train.py") 15 | 16 | ## Data options 17 | parser.add_argument("-data", required=True, 18 | help="Path to the *-train.pt file from preprocess.py") 19 | parser.add_argument("-save_dir", required=True, 20 | help="Directory to save models") 21 | parser.add_argument("-load_from", help="Path to load a pretrained model.") 22 | 23 | ## Model options 24 | 25 | parser.add_argument("-layers", type=int, default=1, 26 | help="Number of layers in the LSTM encoder/decoder") 27 | parser.add_argument("-rnn_size", type=int, default=500, 28 | help="Size of LSTM hidden states") 29 | parser.add_argument("-word_vec_size", type=int, default=500, 30 | help="Size of word embeddings") 31 | parser.add_argument("-input_feed", type=int, default=1, 32 | help="""Feed the context vector at each time step as 33 | additional input (via concatenation with the word 34 | embeddings) to the decoder.""") 35 | parser.add_argument("-brnn", action="store_true", 36 | help="Use a bidirectional encoder") 37 | parser.add_argument("-brnn_merge", default="concat", 38 | help="""Merge action for the bidirectional hidden states: 39 | [concat|sum]""") 40 | 41 | ## Optimization options 42 | 43 | parser.add_argument("-batch_size", type=int, default=64, 44 | help="Maximum batch size") 45 | parser.add_argument("-max_generator_batches", type=int, default=32, 46 | help="""Split softmax input into small batches for memory efficiency. 47 | Higher is faster, but uses more memory.""") 48 | parser.add_argument("-end_epoch", type=int, default=50, 49 | help="Epoch to stop training.") 50 | parser.add_argument("-start_epoch", type=int, default=1, 51 | help="Epoch to start training.") 52 | parser.add_argument("-param_init", type=float, default=0.1, 53 | help="""Parameters are initialized over uniform distribution 54 | with support (-param_init, param_init)""") 55 | parser.add_argument("-optim", default="adam", 56 | help="Optimization method. 
[sgd|adagrad|adadelta|adam]") 57 | parser.add_argument("-lr", type=float, default=1e-3, 58 | help="Initial learning rate") 59 | parser.add_argument("-max_grad_norm", type=float, default=5, 60 | help="""If the norm of the gradient vector exceeds this, 61 | renormalize it to have the norm equal to max_grad_norm""") 62 | parser.add_argument("-dropout", type=float, default=0, 63 | help="Dropout probability; applied between LSTM stacks.") 64 | parser.add_argument("-learning_rate_decay", type=float, default=0.5, 65 | help="""Decay learning rate by this much if (i) perplexity 66 | does not decrease on the validation set or (ii) epoch has 67 | gone past the start_decay_at_limit""") 68 | parser.add_argument("-start_decay_at", type=int, default=5, 69 | help="Start decay after this epoch") 70 | 71 | # GPU 72 | parser.add_argument("-gpus", default=[0], nargs="+", type=int, 73 | help="Use CUDA") 74 | parser.add_argument("-log_interval", type=int, default=100, 75 | help="Print stats at this interval.") 76 | parser.add_argument("-seed", type=int, default=3435, 77 | help="Seed for random initialization") 78 | 79 | # Critic 80 | parser.add_argument("-start_reinforce", type=int, default=None, 81 | help="""Epoch to start reinforcement training. 82 | Use -1 to start immediately.""") 83 | parser.add_argument("-critic_pretrain_epochs", type=int, default=0, 84 | help="Number of epochs to pretrain critic (actor fixed).") 85 | parser.add_argument("-reinforce_lr", type=float, default=1e-4, 86 | help="""Learning rate for reinforcement training.""") 87 | 88 | # Evaluation 89 | parser.add_argument("-eval", action="store_true", help="Evaluate model only") 90 | parser.add_argument("-eval_sample", action="store_true", default=False, 91 | help="Eval by sampling") 92 | parser.add_argument("-max_predict_length", type=int, default=80, 93 | help="Maximum length of predictions.") 94 | 95 | 96 | # Reward shaping 97 | parser.add_argument("-pert_func", type=str, default=None, 98 | help="Reward-shaping function.") 99 | parser.add_argument("-pert_param", type=float, default=None, 100 | help="Reward-shaping parameter.") 101 | 102 | # Others 103 | parser.add_argument("-no_update", action="store_true", default=False, 104 | help="No update round. 
Use to evaluate model samples.") 105 | parser.add_argument("-sup_train_on_bandit", action="store_true", default=False, 106 | help="Supervised learning update round.") 107 | 108 | opt = parser.parse_args() 109 | print(opt) 110 | 111 | # Set seed 112 | torch.manual_seed(opt.seed) 113 | np.random.seed(opt.seed) 114 | random.seed(opt.seed) 115 | 116 | opt.cuda = len(opt.gpus) 117 | 118 | if opt.save_dir and not os.path.exists(opt.save_dir): 119 | os.makedirs(opt.save_dir) 120 | 121 | if torch.cuda.is_available() and not opt.cuda: 122 | print("WARNING: You have a CUDA device, so you should probably run with -gpus 1") 123 | 124 | if opt.cuda: 125 | cuda.set_device(opt.gpus[0]) 126 | torch.cuda.manual_seed(opt.seed) 127 | 128 | def init(model): 129 | for p in model.parameters(): 130 | p.data.uniform_(-opt.param_init, opt.param_init) 131 | 132 | def create_optim(model): 133 | optim = lib.Optim( 134 | model.parameters(), opt.optim, opt.lr, opt.max_grad_norm, 135 | lr_decay=opt.learning_rate_decay, start_decay_at=opt.start_decay_at 136 | ) 137 | return optim 138 | 139 | def create_model(model_class, dicts, gen_out_size): 140 | encoder = lib.Encoder(opt, dicts["src"]) 141 | decoder = lib.Decoder(opt, dicts["tgt"]) 142 | # Use memory efficient generator when output size is large and 143 | # max_generator_batches is smaller than batch_size. 144 | if opt.max_generator_batches < opt.batch_size and gen_out_size > 1: 145 | generator = lib.MemEfficientGenerator(nn.Linear(opt.rnn_size, gen_out_size), opt) 146 | else: 147 | generator = lib.BaseGenerator(nn.Linear(opt.rnn_size, gen_out_size), opt) 148 | model = model_class(encoder, decoder, generator, opt) 149 | init(model) 150 | optim = create_optim(model) 151 | return model, optim 152 | 153 | def create_critic(checkpoint, dicts, opt): 154 | if opt.load_from is not None and "critic" in checkpoint: 155 | critic = checkpoint["critic"] 156 | critic_optim = checkpoint["critic_optim"] 157 | else: 158 | critic, critic_optim = create_model(lib.NMTModel, dicts, 1) 159 | if opt.cuda: 160 | critic.cuda(opt.gpus[0]) 161 | return critic, critic_optim 162 | 163 | def main(): 164 | 165 | print('Loading data from "%s"' % opt.data) 166 | 167 | dataset = torch.load(opt.data) 168 | 169 | supervised_data = lib.Dataset(dataset["train_xe"], opt.batch_size, opt.cuda, eval=False) 170 | bandit_data = lib.Dataset(dataset["train_pg"], opt.batch_size, opt.cuda, eval=False) 171 | valid_data = lib.Dataset(dataset["valid"], opt.batch_size, opt.cuda, eval=True) 172 | test_data = lib.Dataset(dataset["test"], opt.batch_size, opt.cuda, eval=True) 173 | 174 | dicts = dataset["dicts"] 175 | print(" * vocabulary size. source = %d; target = %d" % 176 | (dicts["src"].size(), dicts["tgt"].size())) 177 | print(" * number of XENT training sentences. %d" % 178 | len(dataset["train_xe"]["src"])) 179 | print(" * number of PG training sentences. %d" % 180 | len(dataset["train_pg"]["src"])) 181 | print(" * maximum batch size. %d" % opt.batch_size) 182 | print("Building model...") 183 | 184 | use_critic = opt.start_reinforce is not None 185 | 186 | if opt.load_from is None: 187 | model, optim = create_model(lib.NMTModel, dicts, dicts["tgt"].size()) 188 | checkpoint = None 189 | else: 190 | print("Loading from checkpoint at %s" % opt.load_from) 191 | checkpoint = torch.load(opt.load_from) 192 | model = checkpoint["model"] 193 | optim = checkpoint["optim"] 194 | opt.start_epoch = checkpoint["epoch"] + 1 195 | 196 | # GPU. 
197 | if opt.cuda: 198 | model.cuda(opt.gpus[0]) 199 | 200 | # Start reinforce training immediately. 201 | if opt.start_reinforce == -1: 202 | opt.start_decay_at = opt.start_epoch 203 | opt.start_reinforce = opt.start_epoch 204 | 205 | # Check if end_epoch is large enough. 206 | if use_critic: 207 | assert opt.start_epoch + opt.critic_pretrain_epochs - 1 <= \ 208 | opt.end_epoch, "Please increase -end_epoch to perform pretraining!" 209 | 210 | nParams = sum([p.nelement() for p in model.parameters()]) 211 | print("* number of parameters: %d" % nParams) 212 | 213 | # Metrics. 214 | metrics = {} 215 | metrics["nmt_loss"] = lib.Loss.weighted_xent_loss 216 | metrics["critic_loss"] = lib.Loss.weighted_mse 217 | metrics["sent_reward"] = lib.Reward.sentence_bleu 218 | metrics["corp_reward"] = lib.Reward.corpus_bleu 219 | if opt.pert_func is not None: 220 | opt.pert_func = lib.PertFunction(opt.pert_func, opt.pert_param) 221 | 222 | 223 | # Evaluate model on heldout dataset. 224 | if opt.eval: 225 | evaluator = lib.Evaluator(model, metrics, dicts, opt) 226 | # On validation set. 227 | pred_file = opt.load_from.replace(".pt", ".valid.pred") 228 | evaluator.eval(valid_data, pred_file) 229 | # On test set. 230 | pred_file = opt.load_from.replace(".pt", ".test.pred") 231 | evaluator.eval(test_data, pred_file) 232 | elif opt.eval_sample: 233 | opt.no_update = True 234 | critic, critic_optim = create_critic(checkpoint, dicts, opt) 235 | reinforce_trainer = lib.ReinforceTrainer(model, critic, bandit_data, test_data, 236 | metrics, dicts, optim, critic_optim, opt) 237 | reinforce_trainer.train(opt.start_epoch, opt.start_epoch, False) 238 | elif opt.sup_train_on_bandit: 239 | optim.set_lr(opt.reinforce_lr) 240 | xent_trainer = lib.Trainer(model, bandit_data, test_data, metrics, dicts, optim, opt) 241 | xent_trainer.train(opt.start_epoch, opt.start_epoch) 242 | else: 243 | print("Starting supervised training...") 244 | xent_trainer = lib.Trainer(model, supervised_data, valid_data, metrics, dicts, optim, opt) 245 | if use_critic: 246 | start_time = time.time() 247 | # Supervised training. 248 | xent_trainer.train(opt.start_epoch, opt.start_reinforce - 1, start_time) 249 | # Create critic here to not affect random seed. 250 | critic, critic_optim = create_critic(checkpoint, dicts, opt) 251 | # Pretrain critic. 252 | if opt.critic_pretrain_epochs > 0: 253 | reinforce_trainer = lib.ReinforceTrainer(model, critic, supervised_data, test_data, 254 | metrics, dicts, optim, critic_optim, opt) 255 | reinforce_trainer.train(opt.start_reinforce, 256 | opt.start_reinforce + opt.critic_pretrain_epochs - 1, True, start_time) 257 | # Reinforce training. 258 | reinforce_trainer = lib.ReinforceTrainer(model, critic, bandit_data, test_data, 259 | metrics, dicts, optim, critic_optim, opt) 260 | reinforce_trainer.train(opt.start_reinforce + opt.critic_pretrain_epochs, opt.end_epoch, 261 | False, start_time) 262 | # Supervised training only. 
263 | else: 264 | xent_trainer.train(opt.start_epoch, opt.end_epoch) 265 | 266 | 267 | if __name__ == "__main__": 268 | main() 269 | -------------------------------------------------------------------------------- /translate.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import numpy as np 4 | import random 5 | import time 6 | 7 | import torch 8 | import torch.nn as nn 9 | import torch.nn.parallel 10 | from torch import cuda 11 | from torch.autograd import Variable 12 | 13 | import lib 14 | 15 | parser = argparse.ArgumentParser() 16 | 17 | ## Data options 18 | parser.add_argument("-data", required=True, 19 | help="Path to the *-train.pt file from preprocess.py") 20 | parser.add_argument("-batch_size", default=32, help="Batch Size") 21 | parser.add_argument("-save_dir", help="Directory to save predictions") 22 | parser.add_argument("-load_from", required=True, help="Path to load a trained model.") 23 | parser.add_argument("-test_src", required=True, help="Path to the file to be translated.") 24 | 25 | # GPU 26 | parser.add_argument("-gpus", default=[0], nargs="+", type=int, 27 | help="Use CUDA") 28 | parser.add_argument("-log_interval", type=int, default=100, 29 | help="Print stats at this interval.") 30 | parser.add_argument("-seed", type=int, default=3435, 31 | help="Seed for random initialization") 32 | 33 | opt = parser.parse_args() 34 | print(opt) 35 | 36 | # Set seed 37 | torch.manual_seed(opt.seed) 38 | np.random.seed(opt.seed) 39 | random.seed(opt.seed) 40 | 41 | opt.cuda = len(opt.gpus) 42 | 43 | if opt.save_dir and not os.path.exists(opt.save_dir): 44 | os.makedirs(opt.save_dir) 45 | 46 | if torch.cuda.is_available() and not opt.cuda: 47 | print("WARNING: You have a CUDA device, so you should probably run with -gpus 1") 48 | 49 | #device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 50 | 51 | if opt.cuda: 52 | cuda.set_device(opt.gpus[0]) 53 | torch.cuda.manual_seed(opt.seed) 54 | 55 | def makeTestData(srcFile,dicts): 56 | print("Processing %s ..." % srcFile) 57 | srcF = open(srcFile,'r') 58 | text = srcF.read() 59 | srcF.close() 60 | lines=text.strip().split('\n') 61 | src=[] 62 | tgt=[] 63 | srcDicts = dicts["src"] 64 | count=0 65 | for line in lines: 66 | srcWords = line.split() 67 | src += [srcDicts.convertToIdx(srcWords, 68 | lib.Constants.UNK_WORD)] 69 | count += 1 70 | print("... 
%d sentences prepared for testing" % count) 71 | tgt=src # no reference translations at test time; reuse the source as a placeholder target 72 | return src,tgt,range(len(src)) 73 | 74 | def predict(model,dicts,data,pred_file): 75 | model.eval() 76 | all_preds=[] 77 | max_length=50 78 | for i in range(len(data)): 79 | batch=data[i] 80 | targets=batch[1] 81 | attention_mask=batch[0][0].data.eq(lib.Constants.PAD).t() 82 | model.decoder.attn.applyMask(attention_mask) 83 | preds = model.translate(batch, max_length) 84 | preds = preds.t().tolist() 85 | targets=targets.data.t().tolist() 86 | # Hack: restore the original sentence order using the indices stored in the batch 87 | indices=batch[2] 88 | new_batch=zip(preds,targets) 89 | new_batch,indices=zip(*sorted(zip(new_batch,indices),key=lambda x: x[1])) 90 | preds,targets=zip(*new_batch) 91 | ### 92 | all_preds.extend(preds) 93 | 94 | with open(pred_file, "w") as f: 95 | for sent in all_preds: 96 | sent = lib.Reward.clean_up_sentence(sent, remove_unk=False, remove_eos=True) 97 | sent = [dicts["tgt"].getLabel(w) for w in sent] 98 | x=" ".join(sent)+'\n' 99 | f.write(x) 100 | 101 | 102 | def main(): 103 | print('Loading train data from "%s"' % opt.data) 104 | 105 | dataset = torch.load(opt.data) 106 | dicts = dataset["dicts"] 107 | 108 | if opt.load_from is None: 109 | print("REQUIRES PATH TO THE TRAINED MODEL\n") 110 | else: 111 | print("Loading from checkpoint at %s" % opt.load_from) 112 | checkpoint = torch.load(opt.load_from) 113 | model = checkpoint["model"] 114 | optim = checkpoint["optim"] 115 | 116 | # GPU. 117 | if opt.cuda: 118 | model.cuda(opt.gpus[0]) 119 | #model=torch.nn.DataParallel(model) 120 | #torch.distributed.init_process_group(backend='tcp',rank=0,world_size=2) 121 | #model = torch.nn.parallel.DistributedDataParallel(model) 122 | 123 | 124 | # Generating Translations for test set 125 | print('Creating test data\n') 126 | src,tgt,pos=makeTestData(opt.test_src,dicts) 127 | res={} 128 | res["src"]=src 129 | res["tgt"]=tgt 130 | res["pos"]=pos 131 | test_data = lib.Dataset(res, opt.batch_size, opt.cuda, eval=False) 132 | pred_file = opt.test_src+".pred" 133 | predict(model,dicts,test_data,pred_file) 134 | print('Generated translations successfully\n') 135 | 136 | if __name__ == "__main__": 137 | main() 138 | --------------------------------------------------------------------------------
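For quick reference, a training run is driven entirely by the flags defined in train.py above: supervised cross-entropy training runs up to -start_reinforce, the critic is then pretrained for -critic_pretrain_epochs epochs, and actor-critic (REINFORCE) training continues until -end_epoch. The sketch below is a hypothetical invocation; the dataset and model paths are placeholders and the epoch numbers are illustrative, not recommended settings.

```bash
# Hypothetical invocation; the paths are placeholders, not files shipped with the repo.
# Epochs 1-10:  supervised (cross-entropy) training of the actor
# Epoch  11:    critic pretraining (actor fixed)
# Epochs 12-20: advantage actor-critic (REINFORCE) training
python train.py \
    -data data/de-en/de-en-train.pt \
    -save_dir models/de-en \
    -brnn \
    -end_epoch 20 \
    -start_reinforce 11 \
    -critic_pretrain_epochs 1 \
    -gpus 0
```

Passing -start_reinforce -1 starts actor-critic training immediately from the first (or loaded) epoch, and -eval switches the script into evaluation-only mode on the validation and test sets of the preprocessed dataset.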
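translate.py follows the same conventions: it loads the preprocessed *-train.pt file only for its dictionaries, takes a trained checkpoint and a tokenized source file, and writes the predictions next to the input as <test_src>.pred. The command below is a sketch with placeholder paths; the checkpoint name is hypothetical.

```bash
# Hypothetical invocation; the checkpoint and test file names are placeholders.
python translate.py \
    -data data/de-en/de-en-train.pt \
    -load_from models/de-en/model_20.pt \
    -test_src data/de-en/test.de-en.de.processed \
    -gpus 0
# Predictions are written to data/de-en/test.de-en.de.processed.pred
```

Note that -batch_size is declared without type=int, so a value passed on the command line arrives as a string; the default of 32 is the safe choice unless that flag is adjusted in the script.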