├── README.md ├── batch_filtering ├── README.md ├── install_external_tools.sh ├── install_models.sh ├── scoring_pipeline.py └── source │ ├── embed.py │ ├── lib │ ├── indexing.py │ ├── romanize_lc.py │ └── text_processing.py │ └── mine_bitexts.py ├── segmentation ├── LICENSE.txt ├── README.md ├── __init__.py ├── segmenter.py └── setup.py ├── training ├── README.md ├── preprocessing │ ├── README.md │ ├── preprocessor.py │ ├── remove_evaluation_pairs.py │ └── replacePatterns.txt └── seq2seq │ ├── .gitignore │ ├── dataProcessor.py │ ├── multi-bleu-detok.perl │ ├── pipeline.py │ ├── requirements.txt │ └── sample_input_dir │ ├── data │ ├── RisingNews.test.bn │ ├── RisingNews.test.en │ ├── RisingNews.valid.bn │ └── RisingNews.valid.en │ └── vocab │ ├── bn.model │ └── en.model └── vocab.tar.bz2 /README.md: -------------------------------------------------------------------------------- 1 | # Bangla-NMT 2 | 3 | This repository contains the code and data of the paper titled [**"Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation"**](https://www.aclweb.org/anthology/2020.emnlp-main.207/) published in *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.* 4 | 5 | ## Updates 6 | 7 | * The base translation models are now available for download. 8 | * The training code has been refactored to support [OpenNMT-py 2.2.0](https://github.com/OpenNMT/OpenNMT-py). 9 | * [Colab Notebook](https://colab.research.google.com/drive/1TPkYXEWrf_dUjq-1qpapkreLc7JOug9E?usp=sharing) added for the inference module. 10 | 11 | ## Table of Contents 12 | 13 | - [Bangla-NMT](#bangla-nmt) 14 | - [Updates](#updates) 15 | - [Table of Contents](#table-of-contents) 16 | - [Datasets](#datasets) 17 | - [Models](#models) 18 | - [Dependencies](#dependencies) 19 | - [Segmentation](#segmentation) 20 | - [Batch-filtering](#batch-filtering) 21 | - [Training & Evaluation](#training--evaluation) 22 | - [License](#license) 23 | - [Citation](#citation) 24 | 25 | 26 | ## Datasets 27 | Download the dataset from [here](https://docs.google.com/uc?export=download&id=1FLlC0NNXFKVGaVM3-cYW-XEx8p8eV3Wm). This includes: 28 | * Our original 2.75M training corpus (`2.75M/`) 29 | * [Preprocessed](training/preprocessing) training corpus (`data/`) 30 | * RisingNews dev/test sets (`data/`) 31 | * Preprocessed sipc dev/test sets (`data/`) 32 | * Sentencepiece vocabulary models for Bengali and English (`vocab/`) 33 | 34 | ## Models 35 | 36 | The base-sized transformer model (6 layers, 8 attention heads) checkpoints can be found below: 37 | 38 | * [Bengali to English](https://docs.google.com/uc?export=download&id=1nYKua6_q7W-WK-Xwng_DjoLoZ0k1HgjB) 39 | * [English to Bengali](https://docs.google.com/uc?export=download&id=1uX8nL3yeosmK3YVCRHNJolv861-fCCbi) 40 | * [Sentencepiece vocabulary files](vocab.tar.bz2) 41 | 42 | To evaluate these models on new datasets, please refer to [here](https://github.com/csebuetnlp/banglanmt/tree/master/training). You can also use the [Colab Notebook](https://colab.research.google.com/drive/1TPkYXEWrf_dUjq-1qpapkreLc7JOug9E?usp=sharing) for direct inference. 
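For a quick local test outside the Colab notebook, the checkpoints can be run with the OpenNMT-py 2.x CLI. The snippet below is only a rough sketch: the checkpoint name, input/output file names, and vocabulary paths are placeholders, and the exact preprocessing pipeline is the one documented in the [training module](training/).

```bash
# Hedged sketch (Bengali -> English); file names below are placeholders.
# Encode the raw Bengali input with the provided sentencepiece model,
# translate with the downloaded checkpoint, then decode the output.
spm_encode --model=vocab/bn.model --output_format=piece < input.bn > input.bn.sp
onmt_translate -model bn2en.pt -src input.bn.sp -output pred.en.sp
spm_decode --model=vocab/en.model --input_format=piece < pred.en.sp > pred.en
```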
43 | 44 | ## Dependencies 45 | * Python 3.7.3 46 | * [PyTorch 1.2](http://pytorch.org/) 47 | * [Cython](https://pypi.org/project/Cython/) 48 | * [Faiss](https://github.com/facebookresearch/faiss) 49 | * [FastBPE](https://github.com/glample/fastBPE) 50 | * [sentencepiece](https://github.com/google/sentencepiece) (`Install CLI`) 51 | * [transliterate](https://pypi.org/project/transliterate) 52 | * [regex](https://pypi.org/project/regex/) 53 | * [torchtext](https://pypi.org/project/torchtext) (`pip install torchtext==0.4.0`) 54 | * [sacrebleu](https://pypi.org/project/sacrebleu) 55 | * [aksharamukha](https://pypi.org/project/aksharamukha) 56 | 57 | 58 | ## Segmentation 59 | * See [segmentation module.](segmentation/) 60 | 61 | ## Batch-filtering 62 | * See [batch-filtering module.](batch_filtering/) 63 | 64 | ## Training & Evaluation 65 | * See [training and evaluation module.](training/) 66 | * Try out the models in [Google Colaboratory.](https://colab.research.google.com/drive/1TPkYXEWrf_dUjq-1qpapkreLc7JOug9E?usp=sharing) 67 | 68 | ## License 69 | Contents of this repository are licensed under [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). 70 | 71 | ## Citation 72 | If you use this dataset or code modules, please cite the following paper: 73 | ``` 74 | @inproceedings{hasan-etal-2020-low, 75 | title = "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for {B}engali-{E}nglish Machine Translation", 76 | author = "Hasan, Tahmid and 77 | Bhattacharjee, Abhik and 78 | Samin, Kazi and 79 | Hasan, Masum and 80 | Basak, Madhusudan and 81 | Rahman, M. Sohel and 82 | Shahriyar, Rifat", 83 | booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", 84 | month = nov, 85 | year = "2020", 86 | address = "Online", 87 | publisher = "Association for Computational Linguistics", 88 | url = "https://www.aclweb.org/anthology/2020.emnlp-main.207", 89 | doi = "10.18653/v1/2020.emnlp-main.207", 90 | pages = "2612--2623", 91 | abstract = "Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in machine translation literature due to being low in resources. Most publicly available parallel corpora for Bengali are not large enough; and have rather poor quality, mostly because of incorrect sentence alignments resulting from erroneous sentence segmentation, and also because of a high volume of noise present in them. In this work, we build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation on low-resource setups: aligner ensembling and batch filtering. With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising of 2.75 million sentence pairs, more than 2 million of which were not available before. Training on neural models, we achieve an improvement of more than 9 BLEU score over previous approaches to Bengali-English machine translation. We also evaluate on a new test set of 1000 pairs made with extensive quality control. We release the segmenter, parallel corpus, and the evaluation set, thus elevating Bengali from its low-resource status. To the best of our knowledge, this is the first ever large scale study on Bengali-English machine translation. 
We believe our study will pave the way for future research on Bengali-English machine translation as well as other low-resource languages. Our data and code are available at https://github.com/csebuetnlp/banglanmt.", 92 | } 93 | ``` 94 | -------------------------------------------------------------------------------- /batch_filtering/README.md: -------------------------------------------------------------------------------- 1 | ## Setup 2 | * Install all dependencies mentioned [here](https://github.com/csebuetnlp/banglanmt). 3 | * Download the models: `bash ./install_models.sh` 4 | * Set up the necessary tools: `bash ./install_external_tools.sh` 5 | 6 | ## Usage 7 | * Set up the `LASER` environment variable before running: 8 | ```bash 9 | # inside this directory 10 | $ export LASER=$(pwd) 11 | ``` 12 | * Batch filtering options: 13 | ```bash 14 | $ python3 scoring_pipeline.py -h 15 | usage: scoring_pipeline.py [-h] --input_dir PATH --output_dir PATH --src_lang 16 | SRC_LANG --tgt_lang TGT_LANG [--thresh THRESH] 17 | [--batch_size BATCH_SIZE] [--cpu] 18 | 19 | optional arguments: 20 | -h, --help show this help message and exit 21 | --input_dir PATH, -i PATH 22 | Input directory 23 | --output_dir PATH, -o PATH 24 | Output directory 25 | --src_lang SRC_LANG Source language 26 | --tgt_lang TGT_LANG Target language 27 | --thresh THRESH threshold 28 | --batch_size BATCH_SIZE 29 | batch size 30 | 31 | ``` 32 | * ***The script will recursively look for all file pairs `(X.src_lang, X.tgt_lang)` inside `input_dir`, where `X` is any common file prefix, and produce the following output files within the corresponding subdirectories of `output_dir`:*** 33 | 34 | * `X.merged.tsv`: Output line pairs with their similarity scores 35 | * `X.passed.src_lang` / `X.passed.tgt_lang`: Line pairs with similarity scores greater than the given `thresh` 36 | * `X.failed.src_lang` / `X.failed.tgt_lang`: Line pairs with similarity scores less than or equal to the given `thresh` 37 | -------------------------------------------------------------------------------- /batch_filtering/install_external_tools.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | #------------------------------------------------------- 14 | # 15 | # This bash script installs third party software 16 | # 17 | 18 | if [ -z ${LASER} ] ; then 19 | echo "Please set the environment variable 'LASER'" 20 | exit 21 | fi 22 | 23 | bdir="${LASER}" 24 | tools_ext="${bdir}/tools-external" 25 | 26 | 27 | ################################################################### 28 | # 29 | # Generic helper functions 30 | # 31 | ################################################################### 32 | 33 | MKDIR () { 34 | dname=$1 35 | if [ ! -d ${dname} ] ; then 36 | echo " - creating directory ${dname}" 37 | mkdir -p ${dname} 38 | fi 39 | } 40 | 41 | 42 | ################################################################### 43 | # 44 | # Tokenization tools from Moses 45 | # It is important to use the official release V4 and not the current one 46 | # to obtain the same results as the published ones.
47 | # (the behavior of the tokenizer for end-of-sentence abbreviations has changed) 48 | # 49 | ################################################################### 50 | 51 | InstallMosesTools () { 52 | moses_git="https://raw.githubusercontent.com/moses-smt/mosesdecoder/RELEASE-4.0/scripts" 53 | moses_files=("tokenizer/tokenizer.perl" "tokenizer/detokenizer.perl" \ 54 | "tokenizer/normalize-punctuation.perl" \ 55 | "tokenizer/remove-non-printing-char.perl" \ 56 | "tokenizer/deescape-special-chars.perl" \ 57 | "tokenizer/lowercase.perl" \ 58 | "tokenizer/basic-protected-patterns" \ 59 | ) 60 | 61 | wdir="${tools_ext}/moses-tokenizer/tokenizer" 62 | MKDIR ${wdir} 63 | cd ${wdir} 64 | 65 | for f in ${moses_files[@]} ; do 66 | if [ ! -f `basename ${f}` ] ; then 67 | echo " - download ${f}" 68 | wget -q ${moses_git}/${f} 69 | fi 70 | done 71 | chmod 755 *perl 72 | 73 | # download non-breaking prefixes per language 74 | moses_non_breakings="share/nonbreaking_prefixes/nonbreaking_prefix" 75 | moses_non_breaking_langs=( \ 76 | "ca" "cs" "de" "el" "en" "es" "fi" "fr" "ga" "hu" "is" \ 77 | "it" "lt" "lv" "nl" "pl" "pt" "ro" "ru" "sk" "sl" "sv" \ 78 | "ta" "yue" "zh" ) 79 | wdir="${tools_ext}/moses-tokenizer/share/nonbreaking_prefixes" 80 | MKDIR ${wdir} 81 | cd ${wdir} 82 | 83 | for l in ${moses_non_breaking_langs[@]} ; do 84 | f="${moses_non_breakings}.${l}" 85 | if [ ! -f `basename ${f}` ] ; then 86 | echo " - download ${f}" 87 | wget -q ${moses_git}/${f} 88 | fi 89 | done 90 | } 91 | 92 | 93 | ################################################################### 94 | # 95 | # FAST BPE 96 | # 97 | ################################################################### 98 | 99 | InstallFastBPE () { 100 | cd ${tools_ext} 101 | if [ ! -x fastBPE/fast ] ; then 102 | echo " - download fastBPE software from github" 103 | wget https://github.com/glample/fastBPE/archive/master.zip 104 | unzip master.zip 105 | /bin/rm master.zip 106 | mv fastBPE-master fastBPE 107 | cd fastBPE 108 | echo " - compiling" 109 | g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast 110 | if [ $? -eq 1 ] ; then 111 | echo "ERROR: compilation failed, please install manually"; exit 112 | fi 113 | python setup.py install 114 | fi 115 | } 116 | 117 | 118 | ################################################################### 119 | # 120 | # Install Japanese tokenizer Mecab 121 | # We do not use automatic installation with "pip" but directly add the source directory 122 | # 123 | ################################################################### 124 | 125 | InstallMecab () { 126 | cd ${tools_ext} 127 | if [ ! -x mecab/mecab/bin/mecab ] ; then 128 | echo " - download mecab from github" 129 | wget https://github.com/taku910/mecab/archive/master.zip 130 | unzip master.zip 131 | #/bin/rm master.zip 132 | if [ ! -s mecab/bin/mecab ] ; then 133 | mkdir mecab 134 | cd mecab-master/mecab 135 | echo " - installing code" 136 | ./configure --prefix ${tools_ext}/mecab && make && make install 137 | if [ $? -eq 1 ] ; then 138 | echo "ERROR: installation failed, please install manually"; exit 139 | fi 140 | fi 141 | if [ ! -d mecab/lib/mecab/dic/ipadic ] ; then 142 | cd ${tools_ext}/mecab-master/mecab-ipadic 143 | echo " - installing dictionaries" 144 | ./configure --prefix ${tools_ext}/mecab --with-mecab-config=${tools_ext}/mecab/bin/mecab-config \ 145 | && make && make install 146 | if [ $?
-eq 1 ] ; then 147 | echo "ERROR: compilation failed, please install manually"; exit 148 | fi 149 | fi 150 | fi 151 | } 152 | 153 | 154 | ################################################################### 155 | # 156 | # main 157 | # 158 | ################################################################### 159 | 160 | echo "Installing external tools" 161 | 162 | InstallMosesTools 163 | InstallFastBPE 164 | 165 | #InstallMecab 166 | echo "" 167 | echo "automatic installation of the Japanese tokenizer mecab may be tricky" 168 | echo "Please install it manually from https://github.com/taku910/mecab" 169 | echo "" 170 | echo "The installation directory should be ${LASER}/tools-external/mecab" 171 | echo "" 172 | -------------------------------------------------------------------------------- /batch_filtering/install_models.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | #------------------------------------------------------- 14 | # 15 | # This bash script installs sentence encoders from Amazon s3 16 | # 17 | # if [ -z ${LASER} ] ; then 18 | # echo "Please set the environment variable 'LASER'" 19 | # exit 20 | # fi 21 | 22 | LASER='.' 23 | 24 | mdir="${LASER}/models" 25 | 26 | # available encoders 27 | s3="https://dl.fbaipublicfiles.com/laser/models" 28 | networks=("bilstm.eparl21.2018-11-19.pt" \ 29 | "eparl21.fcodes" "eparl21.fvocab" \ 30 | "bilstm.93langs.2018-12-26.pt" \ 31 | "93langs.fcodes" "93langs.fvocab") 32 | 33 | 34 | echo "Downloading networks" 35 | 36 | if [ ! 
-d ${mdir} ] ; then 37 | echo " - creating directory ${mdir}" 38 | mkdir -p ${mdir} 39 | fi 40 | 41 | cd ${mdir} 42 | for f in ${networks[@]} ; do 43 | if [ -f ${f} ] ; then 44 | echo " - ${mdir}/${f} already downloaded" 45 | else 46 | echo " - ${f}" 47 | wget -q ${s3}/${f} 48 | fi 49 | done 50 | -------------------------------------------------------------------------------- /batch_filtering/scoring_pipeline.py: -------------------------------------------------------------------------------- 1 | from source.embed import * 2 | from multiprocessing import Pool 3 | import multiprocessing as mp 4 | import time 5 | import random 6 | import shutil 7 | import argparse 8 | import glob 9 | import math 10 | 11 | random.seed(3435) 12 | 13 | 14 | def loadEncoder(cpu=False): 15 | model_loc = os.path.join(os.environ["LASER"], "models", "bilstm.93langs.2018-12-26.pt") 16 | print(' - Encoder: loading {}'.format(model_loc)) 17 | global ENCODER 18 | 19 | ENCODER = SentenceEncoder(model_loc, 20 | max_sentences=None, 21 | max_tokens=12000, 22 | sort_kind='quicksort', 23 | cpu=cpu) 24 | 25 | def encode(ifname, ofname, language): 26 | with tempfile.TemporaryDirectory() as tmpdir: 27 | 28 | tok_fname = os.path.join(tmpdir, 'tok') 29 | Token(ifname, 30 | tok_fname, 31 | lang=language, 32 | romanize=True if language == 'el' else False, 33 | lower_case=True, gzip=False, 34 | verbose=True, over_write=False) 35 | ifname = tok_fname 36 | 37 | bpe_fname = os.path.join(tmpdir, 'bpe') 38 | BPEfastApply(ifname, 39 | bpe_fname, 40 | os.path.join(os.environ["LASER"], "models", "93langs.fcodes"), 41 | verbose=True, over_write=False) 42 | ifname = bpe_fname 43 | 44 | EncodeFile(ENCODER, 45 | ifname, 46 | ofname, 47 | verbose=True, over_write=False, 48 | buffer_size=10000) 49 | 50 | def getLines(filename): 51 | lines = [] 52 | with open(filename) as f: 53 | for line in f: 54 | assert line.strip(), "Empty line found" 55 | lines.append(line.strip()) 56 | return lines 57 | 58 | def writeValidLinePairs(file1, file2): 59 | f1Lines, f2Lines = [], [] 60 | 61 | with open(file1) as f1, open(file2) as f2: 62 | for line1, line2 in zip(f1, f2): 63 | if line1.strip() == "" or line2.strip() == "": 64 | continue 65 | 66 | f1Lines.append(line1.replace('\t', ' ').strip()) 67 | f2Lines.append(line2.replace('\t', ' ').strip()) 68 | 69 | 70 | linePairList = list(dict.fromkeys(zip(f1Lines, f2Lines))) 71 | 72 | with open(file1, 'w') as f1, open(file2, 'w') as f2: 73 | for linePair in linePairList: 74 | print(linePair[0].strip(), file=f1) 75 | print(linePair[1].strip(), file=f2) 76 | 77 | def score(prefix, args): 78 | writeValidLinePairs(f'{prefix}.{args.src_lang}', f'{prefix}.{args.tgt_lang}') 79 | 80 | s = f''' 81 | python3 \"{os.path.join(os.environ["LASER"], "source", "mine_bitexts.py")}\" \ 82 | \"{prefix}.{args.src_lang}\" \"{prefix}.{args.tgt_lang}\" \ 83 | --src-lang {args.src_lang} --trg-lang {args.tgt_lang} \ 84 | --src-embeddings \"{prefix}.enc.{args.src_lang}\" --trg-embeddings \"{prefix}.enc.{args.tgt_lang}\" \ 85 | --mode score --retrieval max -k 4 \ 86 | --output \"{prefix}.tsv\" \ 87 | --verbose {'--gpu' if not args.cpu else ''} 88 | ''' 89 | os.system(s) 90 | 91 | os.remove(f'{prefix}.enc.{args.src_lang}') 92 | os.remove(f'{prefix}.enc.{args.tgt_lang}') 93 | 94 | def mergeScores(input_dir, output_file): 95 | output_lines = [] 96 | for input_file in glob.glob(os.path.join(input_dir, "*tsv")): 97 | output_lines.extend(getLines(input_file)) 98 | 99 | _create(output_lines, output_file) 100 | 101 | def scoreDir(dirname, out_prefix, 
args): 102 | prefixes = [f[:-len(args.tgt_lang) - 1] for f in glob.glob(os.path.join(dirname, f"*{args.tgt_lang}"))] 103 | 104 | for prefix in prefixes: 105 | encode(f'{prefix}.{args.src_lang}', f'{prefix}.enc.{args.src_lang}', args.src_lang) 106 | encode(f'{prefix}.{args.tgt_lang}', f'{prefix}.enc.{args.tgt_lang}', args.tgt_lang) 107 | 108 | if args.cpu: 109 | with Pool() as pool: 110 | pool.starmap(score, [(prefix, args) for prefix in prefixes]) 111 | else: 112 | for prefix in prefixes: 113 | score(prefix, args) 114 | 115 | for filename in glob.glob(os.path.join(dirname, "*.enc.*")): 116 | os.remove(filename) 117 | 118 | mergeScores(dirname, os.path.join(os.path.dirname(dirname), out_prefix)) 119 | shutil.rmtree(dirname) 120 | 121 | def shufflePairs(srcFile, tgtFile): 122 | with open(f'{srcFile}.shuffled', 'w') as srcF, open(f'{tgtFile}.shuffled', 'w') as tgtF: 123 | srcLines, tgtLines = [], [] 124 | 125 | with open(srcFile) as f: 126 | srcLines.extend(f.readlines()) 127 | 128 | with open(tgtFile) as f: 129 | tgtLines.extend(f.readlines()) 130 | 131 | assert len(srcLines) == len(tgtLines), "src and tgt line counts dont match" 132 | 133 | indices = list(range(len(srcLines))) 134 | random.shuffle(indices) 135 | 136 | for i in indices: 137 | print(srcLines[i].strip(), file=srcF) 138 | print(tgtLines[i].strip(), file=tgtF) 139 | 140 | shutil.move(f'{srcFile}.shuffled', srcFile) 141 | shutil.move(f'{tgtFile}.shuffled', tgtFile) 142 | 143 | def _create(lines, output_file): 144 | with open(output_file, 'w') as outf: 145 | for line in lines: 146 | print(line.strip(), file=outf) 147 | 148 | def createChunks(input_file, output_dir, suffix, chunk_size): 149 | os.makedirs(output_dir, exist_ok=True) 150 | input_lines = getLines(input_file) 151 | no_chunks = math.ceil(len(input_lines) / chunk_size) 152 | 153 | for i in range(no_chunks): 154 | output_file = os.path.join(output_dir, f"{i}.{suffix}") 155 | lines = input_lines[i * chunk_size: (i + 1) * chunk_size] 156 | _create(lines, output_file) 157 | 158 | def chunkFiles(prefix, dirname, args): 159 | if os.path.isdir(os.path.join(dirname, "original")): 160 | shutil.rmtree(os.path.join(dirname, "original")) 161 | 162 | shutil.copy(f'{prefix}.{args.src_lang}', f'{prefix}.{args.src_lang}.backup') 163 | shutil.copy(f'{prefix}.{args.tgt_lang}', f'{prefix}.{args.tgt_lang}.backup') 164 | 165 | shufflePairs(f'{prefix}.{args.src_lang}', f'{prefix}.{args.tgt_lang}') 166 | 167 | createChunks(f"{prefix}.{args.src_lang}", os.path.join(dirname, "original"), args.src_lang, args.batch_size) 168 | createChunks(f"{prefix}.{args.tgt_lang}", os.path.join(dirname, "original"), args.tgt_lang, args.batch_size) 169 | 170 | shutil.move(f'{prefix}.{args.src_lang}.backup', f'{prefix}.{args.src_lang}') 171 | shutil.move(f'{prefix}.{args.tgt_lang}.backup', f'{prefix}.{args.tgt_lang}') 172 | 173 | def batchFilterDir(args): 174 | for tgtFile in glob.glob(os.path.join(args.input_dir, "**", f"*{args.tgt_lang}"), recursive=True): 175 | dirname = os.path.dirname(tgtFile) 176 | prefix = tgtFile[:-len(args.tgt_lang) - 1] 177 | if not os.path.isfile(f'{prefix}.{args.src_lang}'): 178 | continue 179 | 180 | chunkFiles(prefix, dirname, args) 181 | out_prefix = os.path.basename(prefix) 182 | tsv_name = out_prefix + ".merged.tsv" 183 | scoreDir(os.path.join(dirname, "original"), tsv_name, args) 184 | 185 | outDir = dirname.replace(os.path.normpath(args.input_dir), os.path.normpath(args.output_dir), 1) 186 | os.makedirs(outDir, exist_ok=True) 187 | passed = failed = 0 188 | 189 | 
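# Move the merged score file into the output tree, then split each scored pair into passed/failed files: pairs whose score exceeds --thresh go to *.passed.*, the rest to *.failed.*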
shutil.move(os.path.join(dirname, tsv_name), os.path.join(outDir, tsv_name)) 190 | 191 | with open(os.path.join(outDir, f'{out_prefix}.passed.{args.src_lang}'), 'w') as psrc, \ 192 | open(os.path.join(outDir, f'{out_prefix}.passed.{args.tgt_lang}'), 'w') as ptgt, \ 193 | open(os.path.join(outDir, f'{out_prefix}.failed.{args.src_lang}'), 'w') as fsrc, \ 194 | open(os.path.join(outDir, f'{out_prefix}.failed.{args.tgt_lang}'), 'w') as ftgt: 195 | 196 | with open(os.path.join(outDir, tsv_name)) as f: 197 | for line in f: 198 | score, srcLine, tgtLine = line.split('\t') 199 | 200 | if float(score) > args.thresh: 201 | print(srcLine.strip(), file=psrc) 202 | print(tgtLine.strip(), file=ptgt) 203 | passed += 1 204 | else: 205 | print(srcLine.strip(), file=fsrc) 206 | print(tgtLine.strip(), file=ftgt) 207 | failed += 1 208 | 209 | print(f'Passed Sentences: {passed}') 210 | print(f'Failed Sentences: {failed}') 211 | 212 | 213 | 214 | if __name__ == "__main__": 215 | parser = argparse.ArgumentParser() 216 | parser.add_argument( 217 | '--input_dir', '-i', type=str, 218 | required=True, 219 | metavar='PATH', 220 | help="Input directory") 221 | 222 | parser.add_argument( 223 | '--output_dir', '-o', type=str, 224 | required=True, 225 | metavar='PATH', 226 | help="Output directory") 227 | 228 | parser.add_argument( 229 | '--src_lang', type=str, 230 | required=True, 231 | help="Source language") 232 | 233 | parser.add_argument( 234 | '--tgt_lang', type=str, 235 | required=True, 236 | help="Target language") 237 | 238 | parser.add_argument('--thresh', type=float, default=.95, help='threshold') 239 | parser.add_argument('--batch_size', type=int, default=1000, help='batch size') 240 | 241 | parser.add_argument('--cpu', action='store_true', 242 | help='Run on cpu') 243 | 244 | args = parser.parse_args() 245 | assert args.input_dir != args.output_dir, "input and output directories cant be the same." 246 | loadEncoder(args.cpu) 247 | batchFilterDir(args) 248 | -------------------------------------------------------------------------------- /batch_filtering/source/embed.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 
7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | # -------------------------------------------------------- 14 | # 15 | # Tool to calculate to embed a text file 16 | # The functions can be also imported into another Python code 17 | 18 | 19 | import re 20 | import os 21 | import tempfile 22 | import sys 23 | import time 24 | import argparse 25 | import numpy as np 26 | from collections import namedtuple 27 | 28 | import torch 29 | import torch.nn as nn 30 | 31 | # get environment 32 | assert os.environ.get('LASER'), 'Please set the enviornment variable LASER' 33 | LASER = os.environ['LASER'] 34 | 35 | sys.path.append(LASER + '/source/lib') 36 | from text_processing import Token, BPEfastApply 37 | 38 | SPACE_NORMALIZER = re.compile("\s+") 39 | Batch = namedtuple('Batch', 'srcs tokens lengths') 40 | 41 | 42 | def buffered_read(fp, buffer_size): 43 | buffer = [] 44 | for src_str in fp: 45 | buffer.append(src_str.strip()) 46 | if len(buffer) >= buffer_size: 47 | yield buffer 48 | buffer = [] 49 | 50 | if len(buffer) > 0: 51 | yield buffer 52 | 53 | 54 | def buffered_arange(max): 55 | if not hasattr(buffered_arange, 'buf'): 56 | buffered_arange.buf = torch.LongTensor() 57 | if max > buffered_arange.buf.numel(): 58 | torch.arange(max, out=buffered_arange.buf) 59 | return buffered_arange.buf[:max] 60 | 61 | 62 | # TODO Do proper padding from the beginning 63 | def convert_padding_direction(src_tokens, padding_idx, right_to_left=False, left_to_right=False): 64 | assert right_to_left ^ left_to_right 65 | pad_mask = src_tokens.eq(padding_idx) 66 | if not pad_mask.any(): 67 | # no padding, return early 68 | return src_tokens 69 | if left_to_right and not pad_mask[:, 0].any(): 70 | # already right padded 71 | return src_tokens 72 | if right_to_left and not pad_mask[:, -1].any(): 73 | # already left padded 74 | return src_tokens 75 | max_len = src_tokens.size(1) 76 | range = buffered_arange(max_len).type_as(src_tokens).expand_as(src_tokens) 77 | num_pads = pad_mask.long().sum(dim=1, keepdim=True) 78 | if right_to_left: 79 | index = torch.remainder(range - num_pads, max_len) 80 | else: 81 | index = torch.remainder(range + num_pads, max_len) 82 | return src_tokens.gather(1, index) 83 | 84 | 85 | class SentenceEncoder: 86 | 87 | def __init__(self, model_path, max_sentences=None, max_tokens=None, cpu=False, fp16=False, verbose=False, 88 | sort_kind='quicksort'): 89 | self.use_cuda = torch.cuda.is_available() and not cpu 90 | self.max_sentences = max_sentences 91 | self.max_tokens = max_tokens 92 | if self.max_tokens is None and self.max_sentences is None: 93 | self.max_sentences = 1 94 | 95 | state_dict = torch.load(model_path) 96 | self.encoder = Encoder(**state_dict['params']) 97 | self.encoder.load_state_dict(state_dict['model']) 98 | self.dictionary = state_dict['dictionary'] 99 | self.pad_index = self.dictionary[''] 100 | self.eos_index = self.dictionary[''] 101 | self.unk_index = self.dictionary[''] 102 | if fp16: 103 | self.encoder.half() 104 | if self.use_cuda: 105 | if verbose: 106 | print(' - transfer encoder to GPU') 107 | self.encoder.cuda() 108 | self.sort_kind = sort_kind 109 | 110 | def _process_batch(self, batch): 111 | tokens = batch.tokens 112 | lengths = batch.lengths 113 | if self.use_cuda: 114 | tokens = tokens.cuda() 115 | lengths = lengths.cuda() 116 | self.encoder.eval() 117 | embeddings = self.encoder(tokens, 
lengths)['sentemb'] 118 | return embeddings.detach().cpu().numpy() 119 | 120 | def _tokenize(self, line): 121 | tokens = SPACE_NORMALIZER.sub(" ", line).strip().split() 122 | ntokens = len(tokens) 123 | ids = torch.LongTensor(ntokens + 1) 124 | for i, token in enumerate(tokens): 125 | ids[i] = self.dictionary.get(token, self.unk_index) 126 | ids[ntokens] = self.eos_index 127 | return ids 128 | 129 | def _make_batches(self, lines): 130 | tokens = [self._tokenize(line) for line in lines] 131 | lengths = np.array([t.numel() for t in tokens]) 132 | indices = np.argsort(-lengths, kind=self.sort_kind) 133 | 134 | def batch(tokens, lengths, indices): 135 | toks = tokens[0].new_full((len(tokens), tokens[0].shape[0]), self.pad_index) 136 | for i in range(len(tokens)): 137 | toks[i, -tokens[i].shape[0]:] = tokens[i] 138 | return Batch( 139 | srcs=None, 140 | tokens=toks, 141 | lengths=torch.LongTensor(lengths) 142 | ), indices 143 | 144 | batch_tokens, batch_lengths, batch_indices = [], [], [] 145 | ntokens = nsentences = 0 146 | for i in indices: 147 | if nsentences > 0 and ((self.max_tokens is not None and ntokens + lengths[i] > self.max_tokens) or 148 | (self.max_sentences is not None and nsentences == self.max_sentences)): 149 | yield batch(batch_tokens, batch_lengths, batch_indices) 150 | ntokens = nsentences = 0 151 | batch_tokens, batch_lengths, batch_indices = [], [], [] 152 | batch_tokens.append(tokens[i]) 153 | batch_lengths.append(lengths[i]) 154 | batch_indices.append(i) 155 | ntokens += tokens[i].shape[0] 156 | nsentences += 1 157 | if nsentences > 0: 158 | yield batch(batch_tokens, batch_lengths, batch_indices) 159 | 160 | def encode_sentences(self, sentences): 161 | indices = [] 162 | results = [] 163 | for batch, batch_indices in self._make_batches(sentences): 164 | indices.extend(batch_indices) 165 | results.append(self._process_batch(batch)) 166 | return np.vstack(results)[np.argsort(indices, kind=self.sort_kind)] 167 | 168 | 169 | class Encoder(nn.Module): 170 | def __init__( 171 | self, num_embeddings, padding_idx, embed_dim=320, hidden_size=512, num_layers=1, bidirectional=False, 172 | left_pad=True, padding_value=0. 
173 | ): 174 | super().__init__() 175 | 176 | self.num_layers = num_layers 177 | self.bidirectional = bidirectional 178 | self.hidden_size = hidden_size 179 | 180 | self.padding_idx = padding_idx 181 | self.embed_tokens = nn.Embedding(num_embeddings, embed_dim, padding_idx=self.padding_idx) 182 | 183 | self.lstm = nn.LSTM( 184 | input_size=embed_dim, 185 | hidden_size=hidden_size, 186 | num_layers=num_layers, 187 | bidirectional=bidirectional, 188 | ) 189 | self.left_pad = left_pad 190 | self.padding_value = padding_value 191 | 192 | self.output_units = hidden_size 193 | if bidirectional: 194 | self.output_units *= 2 195 | 196 | def forward(self, src_tokens, src_lengths): 197 | if self.left_pad: 198 | # convert left-padding to right-padding 199 | src_tokens = convert_padding_direction( 200 | src_tokens, 201 | self.padding_idx, 202 | left_to_right=True, 203 | ) 204 | 205 | bsz, seqlen = src_tokens.size() 206 | 207 | # embed tokens 208 | x = self.embed_tokens(src_tokens) 209 | 210 | # B x T x C -> T x B x C 211 | x = x.transpose(0, 1) 212 | 213 | # pack embedded source tokens into a PackedSequence 214 | packed_x = nn.utils.rnn.pack_padded_sequence(x, src_lengths.data.tolist()) 215 | 216 | # apply LSTM 217 | if self.bidirectional: 218 | state_size = 2 * self.num_layers, bsz, self.hidden_size 219 | else: 220 | state_size = self.num_layers, bsz, self.hidden_size 221 | h0 = x.data.new(*state_size).zero_() 222 | c0 = x.data.new(*state_size).zero_() 223 | packed_outs, (final_hiddens, final_cells) = self.lstm(packed_x, (h0, c0)) 224 | 225 | # unpack outputs and apply dropout 226 | x, _ = nn.utils.rnn.pad_packed_sequence(packed_outs, padding_value=self.padding_value) 227 | assert list(x.size()) == [seqlen, bsz, self.output_units] 228 | 229 | if self.bidirectional: 230 | def combine_bidir(outs): 231 | return torch.cat([ 232 | torch.cat([outs[2 * i], outs[2 * i + 1]], dim=0).view(1, bsz, self.output_units) 233 | for i in range(self.num_layers) 234 | ], dim=0) 235 | 236 | final_hiddens = combine_bidir(final_hiddens) 237 | final_cells = combine_bidir(final_cells) 238 | 239 | encoder_padding_mask = src_tokens.eq(self.padding_idx).t() 240 | 241 | # Set padded outputs to -inf so they are not selected by max-pooling 242 | padding_mask = src_tokens.eq(self.padding_idx).t().unsqueeze(-1) 243 | if padding_mask.any(): 244 | x = x.float().masked_fill_(padding_mask, float('-inf')).type_as(x) 245 | 246 | # Build the sentence embedding by max-pooling over the encoder outputs 247 | sentemb = x.max(dim=0)[0] 248 | 249 | return { 250 | 'sentemb': sentemb, 251 | 'encoder_out': (x, final_hiddens, final_cells), 252 | 'encoder_padding_mask': encoder_padding_mask if encoder_padding_mask.any() else None 253 | } 254 | 255 | 256 | def EncodeLoad(args): 257 | args.buffer_size = max(args.buffer_size, 1) 258 | assert not args.max_sentences or args.max_sentences <= args.buffer_size, \ 259 | '--max-sentences/--batch-size cannot be larger than --buffer-size' 260 | 261 | print(' - loading encoder', args.encoder) 262 | return SentenceEncoder(args.encoder, 263 | max_sentences=args.max_sentences, 264 | max_tokens=args.max_tokens, 265 | cpu=args.cpu, 266 | verbose=args.verbose) 267 | 268 | 269 | def EncodeTime(t): 270 | t = int(time.time() - t) 271 | if t < 1000: 272 | print(' in {:d}s'.format(t)) 273 | else: 274 | print(' in {:d}m{:d}s'.format(t // 60, t % 60)) 275 | 276 | 277 | # Encode sentences (existing file pointers) 278 | def EncodeFilep(encoder, inp_file, out_file, buffer_size=10000, verbose=False): 279 | n = 0 280 | t = 
time.time() 281 | for sentences in buffered_read(inp_file, buffer_size): 282 | encoder.encode_sentences(sentences).tofile(out_file) 283 | n += len(sentences) 284 | if verbose and n % 10000 == 0: 285 | print('\r - Encoder: {:d} sentences'.format(n), end='') 286 | if verbose: 287 | print('\r - Encoder: {:d} sentences'.format(n), end='') 288 | EncodeTime(t) 289 | 290 | 291 | # Encode sentences (file names) 292 | def EncodeFile(encoder, inp_fname, out_fname, 293 | buffer_size=10000, verbose=False, over_write=False, 294 | inp_encoding='utf-8'): 295 | # TODO :handle over write 296 | if not os.path.isfile(out_fname): 297 | if verbose: 298 | print(' - Encoder: {} to {}'. 299 | format(os.path.basename(inp_fname) if len(inp_fname) > 0 else 'stdin', 300 | os.path.basename(out_fname))) 301 | fin = open(inp_fname, 'r', encoding=inp_encoding, errors='surrogateescape') if len(inp_fname) > 0 else sys.stdin 302 | fout = open(out_fname, mode='wb') 303 | EncodeFilep(encoder, fin, fout, buffer_size=buffer_size, verbose=verbose) 304 | fin.close() 305 | fout.close() 306 | elif not over_write and verbose: 307 | print(' - Encoder: {} exists already'.format(os.path.basename(out_fname))) 308 | 309 | 310 | # Load existing embeddings 311 | def EmbedLoad(fname, dim=1024, verbose=False): 312 | x = np.fromfile(fname, dtype=np.float32, count=-1) 313 | x.resize(x.shape[0] // dim, dim) 314 | if verbose: 315 | print(' - Embeddings: {:s}, {:d}x{:d}'.format(fname, x.shape[0], dim)) 316 | return x 317 | 318 | 319 | # Get memory mapped embeddings 320 | def EmbedMmap(fname, dim=1024, dtype=np.float32, verbose=False): 321 | nbex = int(os.path.getsize(fname) / dim / np.dtype(dtype).itemsize) 322 | E = np.memmap(fname, mode='r', dtype=dtype, shape=(nbex, dim)) 323 | if verbose: 324 | print(' - embeddings on disk: {:s} {:d} x {:d}'.format(fname, nbex, dim)) 325 | return E 326 | 327 | 328 | if __name__ == '__main__': 329 | parser = argparse.ArgumentParser(description='LASER: Embed sentences') 330 | parser.add_argument('--encoder', type=str, required=True, 331 | help='encoder to be used') 332 | parser.add_argument('--token-lang', type=str, default='--', 333 | help="Perform tokenization with given language ('--' for no tokenization)") 334 | parser.add_argument('--bpe-codes', type=str, default=None, 335 | help='Apply BPE using specified codes') 336 | parser.add_argument('-v', '--verbose', action='store_true', 337 | help='Detailed output') 338 | 339 | parser.add_argument('-o', '--output', required=True, 340 | help='Output sentence embeddings') 341 | parser.add_argument('--buffer-size', type=int, default=10000, 342 | help='Buffer size (sentences)') 343 | parser.add_argument('--max-tokens', type=int, default=12000, 344 | help='Maximum number of tokens to process in a batch') 345 | parser.add_argument('--max-sentences', type=int, default=None, 346 | help='Maximum number of sentences to process in a batch') 347 | parser.add_argument('--cpu', action='store_true', 348 | help='Use CPU instead of GPU') 349 | parser.add_argument('--stable', action='store_true', 350 | help='Use stable merge sort instead of quick sort') 351 | args = parser.parse_args() 352 | 353 | args.buffer_size = max(args.buffer_size, 1) 354 | assert not args.max_sentences or args.max_sentences <= args.buffer_size, \ 355 | '--max-sentences/--batch-size cannot be larger than --buffer-size' 356 | 357 | if args.verbose: 358 | print(' - Encoder: loading {}'.format(args.encoder)) 359 | encoder = SentenceEncoder(args.encoder, 360 | max_sentences=args.max_sentences, 361 | 
max_tokens=args.max_tokens, 362 | sort_kind='mergesort' if args.stable else 'quicksort', 363 | cpu=args.cpu) 364 | 365 | with tempfile.TemporaryDirectory() as tmpdir: 366 | ifname = '' # stdin will be used 367 | if args.token_lang != '--': 368 | tok_fname = os.path.join(tmpdir, 'tok') 369 | Token(ifname, 370 | tok_fname, 371 | lang=args.token_lang, 372 | romanize=True if args.token_lang == 'el' else False, 373 | lower_case=True, gzip=False, 374 | verbose=args.verbose, over_write=False) 375 | ifname = tok_fname 376 | 377 | if args.bpe_codes: 378 | bpe_fname = os.path.join(tmpdir, 'bpe') 379 | BPEfastApply(ifname, 380 | bpe_fname, 381 | args.bpe_codes, 382 | verbose=args.verbose, over_write=False) 383 | ifname = bpe_fname 384 | 385 | EncodeFile(encoder, 386 | ifname, 387 | args.output, 388 | verbose=args.verbose, over_write=False, 389 | buffer_size=args.buffer_size) 390 | -------------------------------------------------------------------------------- /batch_filtering/source/lib/indexing.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | # -------------------------------------------------------- 14 | # 15 | # tools for indexing and search with FAISS 16 | 17 | import faiss 18 | import os.path 19 | import sys 20 | import numpy as np 21 | 22 | #------------------------------------------------------------- 23 | # Get list of fnames: 24 | # - we loop over the list of given languages 25 | # - for each language, we also check if there are splitted files .%03d 26 | 27 | def SplitFnames(par_fname, langs): 28 | fnames = [] 29 | for l in langs: 30 | fname = par_fname + '.' + l 31 | if os.path.isfile(fname): 32 | fnames.append(fname) 33 | for i in range(1000): 34 | fname = par_fname + '.' 
+ l + '.{:03d}'.format(i) 35 | if os.path.isfile(fname): 36 | fnames.append(fname) 37 | if len(fnames) == 0: 38 | print("ERROR: no embeddings found in {:s}*".format(par_fname)) 39 | sys.exit(1) 40 | return fnames 41 | 42 | def SplitOpen(par_fname, langs, dim, dtype, verbose=False): 43 | M = [] 44 | nf = 0 45 | nc = 0 46 | print('Reading sentence embeddings') 47 | print(' - memory mapped files {:s}'.format(par_fname)) 48 | for fname in SplitFnames(par_fname, langs): 49 | n = int(os.path.getsize(fname) / dim / np.dtype(dtype).itemsize) 50 | if verbose: 51 | print(' - {:s}: {:d} x {:d}'.format(fname, n, dim)) 52 | Mi = np.memmap(fname, mode='r', dtype=dtype, shape=(n, dim)) 53 | nc += n 54 | nf += 1 55 | M.append(Mi) 56 | print(' - total of {:d} files: {:d} x {:d}'.format(nf, nc, dim)) 57 | return M 58 | 59 | def SplitAccess(M, idx): 60 | i = idx 61 | for Mi in M: 62 | n = Mi.shape[0] 63 | if i < n: 64 | return Mi[i,:] 65 | i -= n 66 | print('ERROR: index {:d} is too large form memory mapped files'.format(idx)) 67 | sys.exit(1) 68 | 69 | 70 | ############################################################################### 71 | # create an FAISS index on the given data 72 | 73 | def IndexCreate(dname, idx_type, 74 | verbose=False, normalize=True, save_index=False, dim=1024): 75 | 76 | assert idx_type == 'FlatL2', 'only FlatL2 index is currently supported' 77 | x = np.fromfile(dname, dtype=np.float32, count=-1) 78 | nbex = x.shape[0] // dim 79 | print(' - embedding: {:s} {:d} examples of dim {:d}' 80 | .format(dname, nbex, dim)) 81 | x.resize(nbex, dim) 82 | print(' - creating FAISS index') 83 | idx = faiss.IndexFlatL2(dim) 84 | if normalize: 85 | faiss.normalize_L2(x) 86 | idx.add(x) 87 | if save_index: 88 | iname = 'TODO' 89 | print(' - saving index into ' + iname) 90 | faiss.write_index(idx, iname) 91 | return x, idx 92 | 93 | 94 | ############################################################################### 95 | # search closest vector for all languages pairs and calculate error rate 96 | 97 | def IndexSearchMultiple(data, idx, verbose=False, texts=None, print_errors=False): 98 | nl = len(data) 99 | nbex = data[0].shape[0] 100 | err = np.zeros((nl, nl)).astype(float) 101 | ref = np.linspace(0, nbex-1, nbex).astype(int) # [0, nbex) 102 | if verbose: 103 | if texts is None: 104 | print('Calculating similarity error (indices):') 105 | else: 106 | print('Calculating similarity error (textual):') 107 | for i1 in range(nl): 108 | for i2 in range(nl): 109 | if i1 != i2: 110 | D, I = idx[i2].search(data[i1], 1) 111 | if texts: # do textual comparison 112 | e1 = 0 113 | for p in range(I.shape[0]): 114 | if texts[i2][p] != texts[i2][I[p,0]]: 115 | e1 += 1 116 | if print_errors: 117 | print('Error {:s}\n {:s}' 118 | .format(texts[i2][p].strip(), texts[i2][I[p,0]].strip())) 119 | err[i1, i2] = e1 / nbex 120 | else: # do index based comparision 121 | err[i1, i2] \ 122 | = (nbex - np.equal(I.reshape(nbex), ref) 123 | .astype(int).sum()) / nbex 124 | if verbose: 125 | print(' - similarity error {:s}/{:s}: {:5d}={:5.2f}%' 126 | .format(args.langs[i1], args.langs[i2], 127 | err[i1, i2], 100.0 * err[i1, i2])) 128 | return err 129 | 130 | 131 | ############################################################################### 132 | # print confusion matrix 133 | 134 | def IndexPrintConfusionMatrix(err, langs): 135 | nl = len(langs) 136 | assert nl == err.shape[0], 'size of errror matrix doesn not match' 137 | print('Confusion matrix:') 138 | print('{:8s}'.format('langs'), end='') 139 | for i2 in range(nl): 140 
| print('{:8s} '.format(langs[i2]), end='') 141 | print('{:8s}'.format('avg')) 142 | for i1 in range(nl): 143 | print('{:3s}'.format(langs[i1]), end='') 144 | for i2 in range(nl): 145 | print('{:8.2f}%'.format(100 * err[i1, i2]), end='') 146 | print('{:8.2f}%'.format(100 * err[i1, :].sum() / (nl-1))) 147 | 148 | print('avg', end='') 149 | for i2 in range(nl): 150 | print('{:8.2f}%'.format(100 * err[:, i2].sum() / (nl-1)), end='') 151 | 152 | # global average 153 | print('{:8.2f}%'.format(100 * err.sum() / (nl-1) / nl)) 154 | 155 | 156 | ############################################################################### 157 | # Load an FAISS index 158 | 159 | def IndexLoad(idx_name, nprobe, gpu=False): 160 | print('Reading FAISS index') 161 | print(' - index: {:s}'.format(idx_name)) 162 | index = faiss.read_index(idx_name) 163 | print(' - found {:d} sentences of dim {:d}'.format(index.ntotal, index.d)) 164 | print(' - setting nbprobe to {:d}'.format(nprobe)) 165 | if gpu: 166 | print(' - transfer index to %d GPUs ' % faiss.get_num_gpus()) 167 | #co = faiss.GpuMultipleClonerOptions() 168 | #co.shard = True 169 | index = faiss.index_cpu_to_all_gpus(index) # co=co 170 | faiss.GpuParameterSpace().set_index_parameter(index, 'nprobe', nprobe) 171 | return index 172 | 173 | 174 | ############################################################################### 175 | # Opens a text file with the sentences corresponding to the indices used 176 | # by an FAISS index 177 | # We also need the reference files with the byte offsets to the beginning 178 | # of each sentence 179 | # optionnally: array with number of words per sentence 180 | # All arrays are memory mapped 181 | 182 | def IndexTextOpen(txt_fname): 183 | print('Reading text corpus') 184 | print(' - texts: {:s}'.format(txt_fname)) 185 | txt_mmap = np.memmap(txt_fname, mode='r', dtype=np.uint8) 186 | fname = txt_fname.replace('.txt', '.ref.bin32') 187 | if os.path.isfile(fname): 188 | print(' - sentence start offsets (32 bit): {}'.format(fname)) 189 | ref_mmap = np.memmap(fname, mode='r', dtype=np.uint32) 190 | else: 191 | fname = txt_fname.replace('.txt', '.ref.bin64') 192 | if os.path.isfile(fname): 193 | print(' - sentence start offsets (64 bit): {}'.format(fname)) 194 | ref_mmap = np.memmap(fname, mode='r', dtype=np.uint64) 195 | else: 196 | print('ERROR: no file with sentence start offsets found') 197 | sys.exit(1) 198 | print(' - found {:d} sentences'.format(ref_mmap.shape[0])) 199 | 200 | nbw_mmap = None 201 | fname = txt_fname.replace('.txt', '.nw.bin8') 202 | if os.path.isfile(fname): 203 | print(' - word counts: {:s}'.format(fname)) 204 | nbw_mmap = np.memmap(fname, mode='r', dtype=np.uint8) 205 | 206 | M = None 207 | fname = txt_fname.replace('.txt', '.meta') 208 | if os.path.isfile(fname): 209 | M = [] 210 | n = 0 211 | print(' - metafile: {:s}'.format(fname)) 212 | with open(fname, 'r') as fp: 213 | for line in fp: 214 | fields = line.strip().split() 215 | if len(fields) != 2: 216 | print('ERROR: format error in meta file') 217 | sys.exit(1) 218 | n += int(fields[1]) 219 | M.append({'lang': fields[0], 'n': n}) 220 | print(' - found {:d} languages:'.format(len(M)), end='') 221 | for L in M: 222 | print(' {:s}'.format(L['lang']), end='') 223 | print('') 224 | 225 | return txt_mmap, ref_mmap, nbw_mmap, M 226 | 227 | 228 | ############################################################################### 229 | # Return the text for the given index 230 | 231 | def IndexTextQuery(txt_mmap, ref_mmap, idx): 232 | p = int(ref_mmap[idx]) # get 
starting byte position 233 | i = 0 234 | dim = 10000 # max sentence length in bytes 235 | b = bytearray(dim) 236 | # find EOL 237 | while txt_mmap[p+i] != 10 and i < dim: 238 | b[i] = txt_mmap[p+i] 239 | i += 1 240 | 241 | return b[0:i].decode('utf-8') 242 | 243 | 244 | ############################################################################### 245 | # Search the [k] nearest vectors of [x] in the given index 246 | # and return the text lines 247 | 248 | def IndexSearchKNN(index, x, T, R, kmax=1, Dmax=1.0, dedup=True): 249 | D, I = index.search(x, kmax) 250 | prev = {} # for depuplication 251 | res = [] 252 | for n in range(x.shape[0]): 253 | for i in range(kmax): 254 | txt = IndexTextQuery(T, R, I[n, i]) 255 | if (dedup and txt not in prev) and D[n, i] <= Dmax: 256 | prev[txt] = 1 257 | res.append([txt, D[n, i]]) 258 | return res 259 | -------------------------------------------------------------------------------- /batch_filtering/source/lib/romanize_lc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | # -------------------------------------------------------- 14 | # 15 | # Romanize and lower case text 16 | 17 | import os 18 | import sys 19 | import argparse 20 | from transliterate import translit, get_available_language_codes 21 | 22 | parser = argparse.ArgumentParser( 23 | formatter_class=argparse.RawDescriptionHelpFormatter, 24 | description="Calculate multilingual sentence encodings") 25 | parser.add_argument( 26 | '--input', '-i', type=argparse.FileType('r', encoding='UTF-8'), 27 | default=sys.stdin, 28 | metavar='PATH', 29 | help="Input text file (default: standard input).") 30 | parser.add_argument( 31 | '--output', '-o', type=argparse.FileType('w', encoding='UTF-8'), 32 | default=sys.stdout, 33 | metavar='PATH', 34 | help="Output text file (default: standard output).") 35 | parser.add_argument( 36 | '--language', '-l', type=str, 37 | metavar='STR', default="none", 38 | help="perform transliteration into Roman characters" 39 | " from the specified language (default none)") 40 | parser.add_argument( 41 | '--preserve-case', '-C', action='store_true', 42 | help="Preserve case of input texts (default is all lower case)") 43 | 44 | args = parser.parse_args() 45 | 46 | for line in args.input: 47 | if args.language != "none": 48 | line = translit(line, args.language, reversed=True) 49 | if not args.preserve_case: 50 | line = line.lower() 51 | args.output.write(line) 52 | -------------------------------------------------------------------------------- /batch_filtering/source/lib/text_processing.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 
7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | # -------------------------------------------------------- 14 | # 15 | # Helper functions for tokenization and BPE 16 | 17 | import os 18 | import sys 19 | import tempfile 20 | import fastBPE 21 | import numpy as np 22 | from subprocess import run, check_output, DEVNULL 23 | 24 | # get environment 25 | assert os.environ.get('LASER'), 'Please set the enviornment variable LASER' 26 | LASER = os.environ['LASER'] 27 | 28 | FASTBPE = LASER + '/tools-external/fastBPE/fast' 29 | MOSES_BDIR = LASER + '/tools-external/moses-tokenizer/tokenizer/' 30 | MOSES_TOKENIZER = MOSES_BDIR + 'tokenizer.perl -q -no-escape -threads 20 -l ' 31 | MOSES_LC = MOSES_BDIR + 'lowercase.perl' 32 | NORM_PUNC = MOSES_BDIR + 'normalize-punctuation.perl -l ' 33 | DESCAPE = MOSES_BDIR + 'deescape-special-chars.perl' 34 | REM_NON_PRINT_CHAR = MOSES_BDIR + 'remove-non-printing-char.perl' 35 | 36 | # Romanization (Greek only) 37 | ROMAN_LC = 'python3 ' + LASER + '/source/lib/romanize_lc.py -l ' 38 | 39 | # Mecab tokenizer for Japanese 40 | MECAB = LASER + '/tools-external/mecab' 41 | 42 | 43 | ############################################################################### 44 | # 45 | # Tokenize a line of text 46 | # 47 | ############################################################################### 48 | 49 | def TokenLine(line, lang='en', lower_case=True, romanize=False): 50 | assert lower_case, 'lower case is needed by all the models' 51 | roman = lang if romanize else 'none' 52 | tok = check_output( 53 | REM_NON_PRINT_CHAR 54 | + '|' + NORM_PUNC + lang 55 | + '|' + DESCAPE 56 | + '|' + MOSES_TOKENIZER + lang 57 | + ('| python3 -m jieba -d ' if lang == 'zh' else '') 58 | + ('|' + MECAB + '/bin/mecab -O wakati -b 50000 ' if lang == 'ja' else '') 59 | + '|' + ROMAN_LC + roman, 60 | input=line, 61 | encoding='UTF-8', 62 | shell=True) 63 | return tok.strip() 64 | 65 | 66 | ############################################################################### 67 | # 68 | # Tokenize a file 69 | # 70 | ############################################################################### 71 | 72 | def Token(inp_fname, out_fname, lang='en', 73 | lower_case=True, romanize=False, descape=False, 74 | verbose=False, over_write=False, gzip=False): 75 | assert lower_case, 'lower case is needed by all the models' 76 | assert not over_write, 'over-write is not yet implemented' 77 | if not os.path.isfile(out_fname): 78 | cat = 'zcat ' if gzip else 'cat ' 79 | roman = lang if romanize else 'none' 80 | # handle some iso3 langauge codes 81 | if lang in ('cmn', 'wuu', 'yue'): 82 | lang = 'zh' 83 | if lang in ('jpn'): 84 | lang = 'ja' 85 | if verbose: 86 | print(' - Tokenizer: {} in language {} {} {}' 87 | .format(os.path.basename(inp_fname), lang, 88 | '(gzip)' if gzip else '', 89 | '(de-escaped)' if descape else '', 90 | '(romanized)' if romanize else '')) 91 | run(cat + inp_fname 92 | + '|' + REM_NON_PRINT_CHAR 93 | + '|' + NORM_PUNC + lang 94 | + ('|' + DESCAPE if descape else '') 95 | + '|' + MOSES_TOKENIZER + lang 96 | + ('| python3 -m jieba -d ' if lang == 'zh' else '') 97 | + ('|' + MECAB + '/bin/mecab -O wakati -b 50000 ' if lang == 'ja' else '') 98 | + '|' + ROMAN_LC + roman 99 | + '>' + out_fname, 100 | env=dict(os.environ, LD_LIBRARY_PATH=MECAB + '/lib'), 101 | shell=True) 102 | elif not over_write and verbose: 103 | print(' - 
Tokenizer: {} exists already' 104 | .format(os.path.basename(out_fname), lang)) 105 | 106 | 107 | ############################################################################### 108 | # 109 | # Apply FastBPE on one line of text 110 | # 111 | ############################################################################### 112 | 113 | def BPEfastLoad(line, bpe_codes): 114 | bpe_vocab = bpe_codes.replace('fcodes', 'fvocab') 115 | return fastBPE.fastBPE(bpe_codes, bpe_vocab) 116 | 117 | def BPEfastApplyLine(line, bpe): 118 | return bpe.apply([line])[0] 119 | 120 | 121 | ############################################################################### 122 | # 123 | # Apply FastBPE on a whole file 124 | # 125 | ############################################################################### 126 | 127 | def BPEfastApply(inp_fname, out_fname, bpe_codes, 128 | verbose=False, over_write=False): 129 | if not os.path.isfile(out_fname): 130 | if verbose: 131 | print(' - fast BPE: processing {}' 132 | .format(os.path.basename(inp_fname))) 133 | bpe_vocab = bpe_codes.replace('fcodes', 'fvocab') 134 | if not os.path.isfile(bpe_vocab): 135 | print(' - fast BPE: focab file not found {}'.format(bpe_vocab)) 136 | bpe_vocab = '' 137 | run(FASTBPE + ' applybpe ' 138 | + out_fname + ' ' + inp_fname 139 | + ' ' + bpe_codes 140 | + ' ' + bpe_vocab, shell=True, stderr=DEVNULL) 141 | elif not over_write and verbose: 142 | print(' - fast BPE: {} exists already' 143 | .format(os.path.basename(out_fname))) 144 | 145 | 146 | ############################################################################### 147 | # 148 | # Split long lines into multiple sentences at "." 149 | # 150 | ############################################################################### 151 | 152 | def SplitLines(ifname, of_txt, of_sid): 153 | if os.path.isfile(of_txt): 154 | print(' - SplitLines: {} already exists'.format(of_txt)) 155 | return 156 | nl = 0 157 | nl_sp = 0 158 | maxw = 0 159 | maxw_sp = 0 160 | fp_sid = open(of_sid, 'w') 161 | fp_txt = open(of_txt, 'w') 162 | with open(ifname, 'r') as ifp: 163 | for line in ifp: 164 | print('{:d}'.format(nl), file=fp_sid) # store current sentence ID 165 | nw = 0 166 | words = line.strip().split() 167 | maxw = max(maxw, len(words)) 168 | for i, word in enumerate(words): 169 | if word == '.' 
and i != len(words)-1: 170 | if nw > 0: 171 | print(' {}'.format(word), file=fp_txt) 172 | else: 173 | print('{}'.format(word), file=fp_txt) 174 | # store current sentence ID 175 | print('{:d}'.format(nl), file=fp_sid) 176 | nl_sp += 1 177 | maxw_sp = max(maxw_sp, nw+1) 178 | nw = 0 179 | else: 180 | if nw > 0: 181 | print(' {}'.format(word), end='', file=fp_txt) 182 | else: 183 | print('{}'.format(word), end='', file=fp_txt) 184 | nw += 1 185 | if nw > 0: 186 | # handle remainder of sentence 187 | print('', file=fp_txt) 188 | nl_sp += 1 189 | maxw_sp = max(maxw_sp, nw+1) 190 | nl += 1 191 | print(' - Split sentences: {}'.format(ifname)) 192 | print(' - lines/max words: {:d}/{:d} -> {:d}/{:d}' 193 | .format(nl, maxw, nl_sp, maxw_sp)) 194 | fp_sid.close() 195 | fp_txt.close() 196 | 197 | 198 | ############################################################################### 199 | # 200 | # Join embeddings of previously split lines (average) 201 | # 202 | ############################################################################### 203 | 204 | def JoinEmbed(if_embed, sid_fname, of_embed, dim=1024): 205 | if os.path.isfile(of_embed): 206 | print(' - JoinEmbed: {} already exists'.format(of_embed)) 207 | return 208 | # read the input embeddings 209 | em_in = np.fromfile(if_embed, dtype=np.float32, count=-1).reshape(-1, dim) 210 | ninp = em_in.shape[0] 211 | print(' - Combine embeddings:') 212 | print(' input: {:s} {:d} sentences'.format(if_embed, ninp)) 213 | 214 | # get all sentence IDs 215 | sid = np.empty(ninp, dtype=np.int32) 216 | i = 0 217 | with open(sid_fname, 'r') as fp_sid: 218 | for line in fp_sid: 219 | sid[i] = int(line) 220 | i += 1 221 | nout = sid.max() + 1 222 | print(' IDs: {:s}, {:d} sentences'.format(sid_fname, nout)) 223 | 224 | # combining 225 | em_out = np.zeros((nout, dim), dtype=np.float32) 226 | cnt = np.zeros(nout, dtype=np.int32) 227 | for i in range(ninp): 228 | idx = sid[i] 229 | em_out[idx] += em_in[i] # cumulate sentence vectors 230 | cnt[idx] += 1 231 | 232 | if (cnt == 0).astype(int).sum() > 0: 233 | print('ERROR: missing lines') 234 | sys.exit(1) 235 | 236 | # normalize 237 | for i in range(nout): 238 | em_out[i] /= cnt[i] 239 | 240 | print(' output: {:s}'.format(of_embed)) 241 | em_out.tofile(of_embed) 242 | -------------------------------------------------------------------------------- /batch_filtering/source/mine_bitexts.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 
7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | # -------------------------------------------------------- 14 | # 15 | # Tool to calculate to embed a text file 16 | # The functions can be also imported into another Python code 17 | 18 | import os 19 | import sys 20 | import faiss 21 | import argparse 22 | import tempfile 23 | import numpy as np 24 | 25 | # get environment 26 | assert os.environ.get('LASER'), 'Please set the enviornment variable LASER' 27 | LASER = os.environ['LASER'] 28 | 29 | sys.path.append(LASER + '/source') 30 | sys.path.append(LASER + '/source/tools') 31 | from embed import SentenceEncoder, EncodeLoad, EncodeFile, EmbedLoad 32 | from text_processing import Token, BPEfastApply 33 | 34 | 35 | ############################################################################### 36 | # 37 | # Load texts and remove duplicates 38 | # 39 | ############################################################################### 40 | 41 | def TextLoadUnify(fname, args): 42 | if args.verbose: 43 | print(' - loading texts {:s}: '.format(fname), end='') 44 | 45 | fin = open(fname, encoding=args.encoding, errors='surrogateescape') 46 | inds = [] 47 | sents = [] 48 | 49 | if args.mode == 'score': 50 | for i, line in enumerate(fin): 51 | inds.append(i) 52 | sents.append(line[:-1]) 53 | 54 | return inds, sents 55 | 56 | sent2ind = {} 57 | n = 0 58 | nu = 0 59 | for line in fin: 60 | new_ind = len(sent2ind) 61 | inds.append(sent2ind.setdefault(line, new_ind)) 62 | if args.unify: 63 | if inds[-1] == new_ind: 64 | sents.append(line[:-1]) 65 | nu += 1 66 | else: 67 | sents.append(line[:-1]) 68 | nu += 1 69 | n += 1 70 | if args.verbose: 71 | print('{:d} lines, {:d} unique'.format(n, nu)) 72 | del sent2ind 73 | return inds, sents 74 | 75 | 76 | ############################################################################### 77 | # 78 | # Wrapper for knn on CPU/GPU 79 | # 80 | ############################################################################### 81 | 82 | def knn(x, y, k, use_gpu): 83 | return knnGPU(x, y, k) if use_gpu else knnCPU(x, y, k) 84 | 85 | 86 | ############################################################################### 87 | # 88 | # Perform knn on GPU 89 | # 90 | ############################################################################### 91 | 92 | def knnGPU(x, y, k, mem=5*1024*1024*1024): 93 | dim = x.shape[1] 94 | batch_size = mem // (dim*4) 95 | sim = np.zeros((x.shape[0], k), dtype=np.float32) 96 | ind = np.zeros((x.shape[0], k), dtype=np.int64) 97 | for xfrom in range(0, x.shape[0], batch_size): 98 | xto = min(xfrom + batch_size, x.shape[0]) 99 | bsims, binds = [], [] 100 | for yfrom in range(0, y.shape[0], batch_size): 101 | yto = min(yfrom + batch_size, y.shape[0]) 102 | # print('{}-{} -> {}-{}'.format(xfrom, xto, yfrom, yto)) 103 | idx = faiss.IndexFlatIP(dim) 104 | idx = faiss.index_cpu_to_all_gpus(idx) 105 | idx.add(y[yfrom:yto]) 106 | bsim, bind = idx.search(x[xfrom:xto], min(k, yto-yfrom)) 107 | bsims.append(bsim) 108 | binds.append(bind + yfrom) 109 | del idx 110 | bsims = np.concatenate(bsims, axis=1) 111 | binds = np.concatenate(binds, axis=1) 112 | aux = np.argsort(-bsims, axis=1) 113 | for i in range(xfrom, xto): 114 | for j in range(k): 115 | sim[i, j] = bsims[i-xfrom, aux[i-xfrom, j]] 116 | ind[i, j] = binds[i-xfrom, aux[i-xfrom, j]] 117 | return sim, ind 118 | 119 | 120 | 
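# Illustrative usage sketch (added note; not part of the original LASER source): both
# knnGPU() and knnCPU() expect row-wise L2-normalized float32 matrices, so the inner
# products returned by faiss.IndexFlatIP are cosine similarities. For example:
#
#   x = np.random.rand(100, 1024).astype(np.float32); faiss.normalize_L2(x)
#   y = np.random.rand(500, 1024).astype(np.float32); faiss.normalize_L2(y)
#   sim, ind = knn(x, y, 4, use_gpu=False)   # sim and ind both have shape (100, 4)
#
# knnGPU() additionally processes both matrices in batches sized by the `mem` budget
# (in bytes) and merges the per-batch results, so corpora larger than GPU memory can
# still be searched.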
############################################################################### 121 | # 122 | # Perform knn on CPU 123 | # 124 | ############################################################################### 125 | 126 | def knnCPU(x, y, k): 127 | dim = x.shape[1] 128 | idx = faiss.IndexFlatIP(dim) 129 | idx.add(y) 130 | sim, ind = idx.search(x, k) 131 | return sim, ind 132 | 133 | 134 | ############################################################################### 135 | # 136 | # Scoring 137 | # 138 | ############################################################################### 139 | 140 | def score(x, y, fwd_mean, bwd_mean, margin): 141 | return margin(x.dot(y), (fwd_mean + bwd_mean) / 2) 142 | 143 | def score_with_trans(x, y, xTrans, fwd_mean, bwd_mean, xTrans_fwd_mean, xTrans_bwd_mean, margin, verbose=False): 144 | return margin(x.dot(y), (fwd_mean + bwd_mean) / 2, xTrans.dot(y), (xTrans_fwd_mean + xTrans_bwd_mean) / 2) 145 | 146 | def score_candidates(x, y, candidate_inds, fwd_mean, bwd_mean, margin, verbose=False): 147 | if verbose: 148 | print(' - scoring {:d} candidates'.format(x.shape[0])) 149 | scores = np.zeros(candidate_inds.shape) 150 | for i in range(scores.shape[0]): 151 | for j in range(scores.shape[1]): 152 | k = candidate_inds[i, j] 153 | scores[i, j] = score(x[i], y[k], fwd_mean[i], bwd_mean[k], margin) 154 | return scores 155 | 156 | 157 | def score_candidates_with_trans(x, y, xTrans, candidate_inds, fwd_mean, bwd_mean, xTrans_fwd_mean, xTrans_bwd_mean, margin, verbose=False): 158 | if verbose: 159 | print(' - scoring {:d} candidates'.format(x.shape[0])) 160 | scores = np.zeros(candidate_inds.shape) 161 | for i in range(scores.shape[0]): 162 | for j in range(scores.shape[1]): 163 | k = candidate_inds[i, j] 164 | scores[i, j] = score_with_trans(x[i], y[k], xTrans[i], fwd_mean[i], bwd_mean[k], xTrans_fwd_mean[i], xTrans_bwd_mean[k], margin) 165 | return scores 166 | 167 | 168 | ############################################################################### 169 | # 170 | # Main 171 | # 172 | ############################################################################### 173 | 174 | if __name__ == '__main__': 175 | parser = argparse.ArgumentParser(description='LASER: Mine bitext') 176 | parser.add_argument('src', 177 | help='Source language corpus') 178 | parser.add_argument('trg', 179 | help='Target language corpus') 180 | parser.add_argument('--encoding', default='utf-8', 181 | help='Character encoding for input/output') 182 | parser.add_argument('--src-lang', required=True, 183 | help='Source language id') 184 | parser.add_argument('--trg-lang', required=True, 185 | help='Target language id') 186 | parser.add_argument('--output', required=True, 187 | help='Output file') 188 | parser.add_argument('--threshold', type=float, default=0, 189 | help='Threshold on extracted bitexts') 190 | 191 | # mining params 192 | parser.add_argument('--mode', 193 | choices=['search', 'score', 'mine'], required=True, 194 | help='Execution mode') 195 | parser.add_argument('-k', '--neighborhood', 196 | type=int, default=4, 197 | help='Neighborhood size') 198 | parser.add_argument('--margin', 199 | choices=['absolute', 'distance', 'ratio'], default='ratio', 200 | help='Margin function') 201 | parser.add_argument('--retrieval', 202 | choices=['fwd', 'bwd', 'max', 'intersect'], default='max', 203 | help='Retrieval strategy') 204 | parser.add_argument('--unify', action='store_true', 205 | help='Unify texts') 206 | parser.add_argument('--gpu', action='store_true', 207 | help='Run knn on all 
available GPUs') 208 | parser.add_argument('--verbose', action='store_true', 209 | help='Detailed output') 210 | 211 | # embeddings 212 | parser.add_argument('--src-embeddings', required=True, 213 | help='Precomputed source sentence embeddings') 214 | parser.add_argument('--trg-embeddings', required=True, 215 | help='Precomputed target sentence embeddings') 216 | parser.add_argument('--dim', type=int, default=1024, 217 | help='Embedding dimensionality') 218 | 219 | 220 | parser.add_argument('--trans', action='store_true', 221 | help='Use translations for scoring') 222 | 223 | 224 | args = parser.parse_args() 225 | 226 | 227 | print('LASER: tool to search, score or mine bitexts') 228 | if args.gpu: 229 | print(' - knn will run on all available GPUs (recommended)') 230 | else: 231 | print(' - knn will run on CPU (slow)') 232 | 233 | src_inds, src_sents = TextLoadUnify(args.src, args) 234 | trg_inds, trg_sents = TextLoadUnify(args.trg, args) 235 | 236 | if args.trans: 237 | srcTransFile = ".enTranslated".join(args.src.rsplit('.bn', 1)) 238 | srcTrans_inds, srcTrans_sents = TextLoadUnify(srcTransFile, args) 239 | 240 | def unique_embeddings(emb, ind, verbose=False): 241 | aux = {j: i for i, j in enumerate(ind)} 242 | if verbose: 243 | print(' - unify embeddings: {:d} -> {:d}'.format(len(emb), len(aux))) 244 | return emb[[aux[i] for i in range(len(aux))]] 245 | 246 | # load the embeddings 247 | x = EmbedLoad(args.src_embeddings, args.dim, verbose=args.verbose) 248 | 249 | if args.trans: 250 | xTrans = EmbedLoad(".enTranslated".join(args.src_embeddings.rsplit('.bn', 1)), args.dim, verbose=args.verbose) 251 | 252 | if args.unify: 253 | x = unique_embeddings(x, src_inds, args.verbose) 254 | if args.trans: 255 | xTrans = unique_embeddings(xTrans, src_inds, args.verbose) 256 | 257 | if args.trans: 258 | faiss.normalize_L2(xTrans) 259 | 260 | faiss.normalize_L2(x) 261 | y = EmbedLoad(args.trg_embeddings, args.dim, verbose=args.verbose) 262 | if args.unify: 263 | y = unique_embeddings(y, trg_inds, args.verbose) 264 | faiss.normalize_L2(y) 265 | 266 | # calculate knn in both directions 267 | if args.retrieval is not 'bwd': 268 | if args.verbose: 269 | print(' - perform {:d}-nn source against target'.format(args.neighborhood)) 270 | x2y_sim, x2y_ind = knn(x, y, min(y.shape[0], args.neighborhood), args.gpu) 271 | x2y_mean = x2y_sim.mean(axis=1) 272 | if args.trans: 273 | xTrans2y_sim, xTrans2y_ind = knn(xTrans, y, min(y.shape[0], args.neighborhood), args.gpu) 274 | xTrans2y_mean = xTrans2y_sim.mean(axis=1) 275 | 276 | if args.retrieval is not 'fwd': 277 | if args.verbose: 278 | print(' - perform {:d}-nn target against source'.format(args.neighborhood)) 279 | y2x_sim, y2x_ind = knn(y, x, min(x.shape[0], args.neighborhood), args.gpu) 280 | y2x_mean = y2x_sim.mean(axis=1) 281 | if args.trans: 282 | y2xTrans_sim, y2xTrans_ind = knn(y, xTrans, min(xTrans.shape[0], args.neighborhood), args.gpu) 283 | y2xTrans_mean = y2xTrans_sim.mean(axis=1) 284 | 285 | # margin function 286 | if args.trans: 287 | if args.margin == 'absolute': 288 | margin = lambda a, b, c, d: c / d 289 | elif args.margin == 'distance': 290 | margin = lambda a, b, c, d: 2 / (1 / (a - b) + 1 / (c - d)) 291 | else: # args.margin == 'ratio': 292 | margin = lambda a, b, c, d: 2 / (1 / (a / b) + 1 / (c / d)) 293 | else: 294 | if args.margin == 'absolute': 295 | margin = lambda a, b: a 296 | elif args.margin == 'distance': 297 | margin = lambda a, b: a - b 298 | else: # args.margin == 'ratio': 299 | margin = lambda a, b: a / b 300 | 301 | fout = 
open(args.output, mode='w', encoding=args.encoding, errors='surrogateescape') 302 | 303 | if args.mode == 'search': 304 | if args.verbose: 305 | print(' - Searching for closest sentences in target') 306 | print(' - writing alignments to {:s}'.format(args.output)) 307 | scores = score_candidates(x, y, x2y_ind, x2y_mean, y2x_mean, margin, args.verbose) 308 | best = x2y_ind[np.arange(x.shape[0]), scores.argmax(axis=1)] 309 | 310 | nbex = x.shape[0] 311 | ref = np.linspace(0, nbex-1, nbex).astype(int) # [0, nbex) 312 | err = nbex - np.equal(best.reshape(nbex), ref).astype(int).sum() 313 | print(' - errors: {:d}={:.2f}%'.format(err, 100*err/nbex)) 314 | for i in src_inds: 315 | print(trg_sents[best[i]], file=fout) 316 | 317 | elif args.mode == 'score': 318 | for i, j in zip(src_inds, trg_inds): 319 | s = score(x[i], y[j], x2y_mean[i], y2x_mean[j], margin) 320 | print(s, src_sents[i], trg_sents[j], sep='\t', file=fout) 321 | 322 | elif args.mode == 'mine': 323 | if args.verbose: 324 | print(' - mining for parallel data') 325 | if args.trans: 326 | fwd_scores = score_candidates_with_trans(x, y, xTrans, x2y_ind, x2y_mean, y2x_mean, xTrans2y_mean, y2xTrans_mean, margin, args.verbose) 327 | bwd_scores = score_candidates_with_trans(y, x, xTrans, y2x_ind, y2x_mean, x2y_mean, y2xTrans_mean, xTrans2y_mean, margin, args.verbose) 328 | else: 329 | fwd_scores = score_candidates(x, y, x2y_ind, x2y_mean, y2x_mean, margin, args.verbose) 330 | bwd_scores = score_candidates(y, x, y2x_ind, y2x_mean, x2y_mean, margin, args.verbose) 331 | 332 | fwd_best = x2y_ind[np.arange(x.shape[0]), fwd_scores.argmax(axis=1)] 333 | bwd_best = y2x_ind[np.arange(y.shape[0]), bwd_scores.argmax(axis=1)] 334 | if args.verbose: 335 | print(' - writing alignments to {:s}'.format(args.output)) 336 | if args.threshold > 0: 337 | print(' - with threshold of {:f}'.format(args.threshold)) 338 | if args.retrieval == 'fwd': 339 | for i, j in enumerate(fwd_best): 340 | print(fwd_scores[i].max(), src_sents[i], trg_sents[j], sep='\t', file=fout) 341 | if args.retrieval == 'bwd': 342 | for j, i in enumerate(bwd_best): 343 | print(bwd_scores[j].max(), src_sents[i], trg_sents[j], sep='\t', file=fout) 344 | if args.retrieval == 'intersect': 345 | for i, j in enumerate(fwd_best): 346 | if bwd_best[j] == i: 347 | print(fwd_scores[i].max(), src_sents[i], trg_sents[j], sep='\t', file=fout) 348 | if args.retrieval == 'max': 349 | indices = np.stack((np.concatenate((np.arange(x.shape[0]), bwd_best)), 350 | np.concatenate((fwd_best, np.arange(y.shape[0])))), axis=1) 351 | scores = np.concatenate((fwd_scores.max(axis=1), bwd_scores.max(axis=1))) 352 | seen_src, seen_trg = set(), set() 353 | for i in np.argsort(-scores): 354 | src_ind, trg_ind = indices[i] 355 | if not src_ind in seen_src and not trg_ind in seen_trg: 356 | seen_src.add(src_ind) 357 | seen_trg.add(trg_ind) 358 | if scores[i] > args.threshold: 359 | print(scores[i], src_sents[src_ind], trg_sents[trg_ind], sep='\t', file=fout) 360 | 361 | fout.close() 362 | 363 | -------------------------------------------------------------------------------- /segmentation/LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright 2020 Florian Leitner 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 
copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 6 | 7 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 8 | -------------------------------------------------------------------------------- /segmentation/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## Setup 3 | 4 | ``` python setup.py install``` or ``` pip install .``` 5 | 6 | ## Usage 7 | 8 | * ### From python scripts 9 | ```python 10 | >>> from segmentation import segmenter 11 | >>> input_text = ''' 12 | কাজী মুহম্মদ ওয়াজেদের একমাত্র পুত্র ছিলেন এ. কে. ফজলুক হক। A. K. Fazlul Huq (Sher-E-Bangla) was born into a middle class Bengali Muslim family in Bakerganj, Barisal, Bangladesh in 1873. 13 | ''' 14 | >>> segmenter.segment_text(input_text) 15 | ['কাজী মুহম্মদ ওয়াজেদের একমাত্র পুত্র ছিলেন এ. কে. ফজলুক হক।', 16 | 'A. K. Fazlul Huq (Sher-E-Bangla) was born into a middle class Bengali Muslim family in Bakerganj, Barisal, Bangladesh in 1873.'] 17 | ``` 18 | *If you don't want a linebreak to be an explicit new line marker, use the following* 19 | ```python 20 | >>> segmenter.segment_text(input_text, mode='multi') 21 | ``` 22 | 23 | 24 | * ***Note: the above snippets run with most of the default options, for more advanced options, refer to the terminal script.*** 25 | 26 | * ### From terminal 27 | ```bash 28 | segmenter --help 29 | ``` -------------------------------------------------------------------------------- /segmentation/__init__.py: -------------------------------------------------------------------------------- 1 | from . import segmenter -------------------------------------------------------------------------------- /segmentation/segmenter.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | # Adapted from: https://github.com/fnl/segtok.git 5 | """ 6 | A pattern-based sentence segmentation strategy; 7 | Primarily written for indo-european languages and extended specifically 8 | for bengali. Could be extended for other languages by introducing new rules. 9 | 10 | Known limitations: 11 | 1. The sentence must use a known sentence terminal followed by space(s), 12 | skipping one optional, intervening quote and/or bracket. 13 | 2. The next sentence must start with an upper-case letter or a number, 14 | ignoring one optional quote and/or bracket before it. 15 | Alternatively, it may start with a camel-cased word, like "gene-A". 16 | 3. If the sentence ends with a single upper-case letter followed by a dot, 17 | a split is made (splits names like "A. Dent"), unless there is an easy 18 | to deduce reason that it is a human name. 19 | 20 | The decision for requiring an "syntactically correct" terminal sequence with upper-case letters or 21 | numbers as start symbol is based on the preference to under-split rather than over-split sentences. 
22 | 23 | Special care is taken not to split at common abbreviations like "i.e." or "etc.", 24 | to not split at first or middle name initials "... F. M. Last ...", 25 | to not split before a comma, colon, or semi-colon, 26 | and to avoid single letters or digits as sentences ("A. This sentence..."). 27 | 28 | Sentence splits will always be enforced at [consecutive] line separators. 29 | 30 | Important: Windows text files use ``\\r\\n`` as linebreaks and Mac files use ``\\r``; 31 | Convert the text to Unix linebreaks if the case. 32 | """ 33 | from __future__ import absolute_import, unicode_literals 34 | import codecs 35 | from regex import compile, DOTALL, UNICODE, VERBOSE 36 | from itertools import chain 37 | import re 38 | import string 39 | 40 | 41 | SENTENCE_TERMINALS = '.!?\u203C\u203D\u2047\u2048\u2049\u3002' \ 42 | '\uFE52\uFE57\uFF01\uFF0E\uFF1F\uFF61\u09F7\u0964' 43 | "The list of valid Unicode sentence terminal characters." 44 | 45 | # Note that Unicode the category Pd is NOT a good set for valid word-breaking hyphens, 46 | # because it contains many dashes that should not be considered part of a word. 47 | HYPHENS = '\u00AD\u058A\u05BE\u0F0C\u1400\u1806\u2010-\u2012\u2e17\u30A0-' 48 | "Any valid word-breaking hyphen, including ASCII hyphen minus." 49 | 50 | # Use upper-case for abbreviations that always are capitalized: 51 | # Lower-case abbreviations may occur capitalized or not. 52 | # Only abbreviations that should never occur at the end of a sentence 53 | # (such as "etc.") 54 | BENGALISINGLECHARS = "অ আ ই ঈ উ ঊ ঋ এ ঐ ও ঔ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড ঢ ণ ত থ দ ধ ন প ফ ব ভ ম য র ল শ ষ স হ ড় ঢ় য়".split() 55 | ABBREVIATIONS = """ 56 | Adj Adm Adv Asst Bart Bldg Brig Bros Capt Cant Cmdr Col Comdr 57 | Con Corp Cpl Dr Drs Ens Gen Gov Hon Hr Hop Inc Insp Lt MM Maj 58 | Messrs Mlle Mme Op Ord Pfc Ph Pvt Rep Reps Res Rev Rt Sen Sens 59 | Sfc Sgt Sr St Supt Surg approx Capt cf Col Dr f\.?e figs? 
Gen 60 | e\.?g i\.?e i\.?v Mag med Mr Mrs Mt nat No nr p\.e phil prof rer 61 | sci Sgt Sr Sra Srta St univ vol vs z\.B Jän Jan Ene Feb Mär Mar 62 | Apr Abr May Jun Jul Aug Sep Sept Oct Okt Nov Dic Dez Dec Prof 63 | E\.U U\.K U\.S viz ltd co est rs md Ms tk TK Ps PS Ex""".split() 64 | 65 | BENGALIABBREVIATIONS = """এ বি সি ডি ই এফ জি এইচ আই জে কে এল এম এন ও পি কিউ আর আস টি ইউ ভি আর এস টি ইউ ভি ডব্লিউ এক্স ওআই জেড মি 66 | মো মু কো কৌ মুহ মি মিস প্রফ ফিল গভ অপ ভল ডা লে জনাব মিজ মিসেস ডে যে মি লি সা ডঃ ডেপ্ট ডেপট অধ্যাপক গে অর্গ ডাব্লিউ সেন্ট ওয়াই এম\.ডি ঢা\.বি লিট ডি\.লিট 67 | সং ইস মিস্টার মি গ্রা মিগ্রা মি\.গ্রা রেভ প্র প্রা ইঙ্ক গভ বিদ্র বি\.দ্র দ্র মোহা কিমি কি\.মি কি রেভা মুদ্রা আনু খ্রি খি ক্যান্ট সে সে\.মি সেমি মে জন মি\.লি মিলি লি মি অনু মৃত্যু পূ পৃ ডব্লু 68 | """.split() 69 | 70 | ABBREVIATIONS.extend(a.capitalize() for a in ABBREVIATIONS if a[0].islower()) 71 | ABBREVIATIONS.extend(BENGALISINGLECHARS) 72 | ABBREVIATIONS.extend(BENGALIABBREVIATIONS) 73 | 74 | ABBREVIATIONS.extend(list(string.ascii_uppercase)) 75 | 76 | JWSPECAILS = """Aux\.Pios Par chap pars Pubs ftn Jas Rom ROM PROV Mic 77 | TIM স\.অগ্র, বি\.অগ্র তীম Tim গীত Ps যিশা Isa গালা Gal পিতর Pet মথি Matt করি Cor 78 | রোমীয় Rom ইব্রীয় Heb প্রকা Rev যিহি Ezek বিচার Judg আদি Gen দানি Dan রাজা Ki শমূ Sam 79 | মালাখি Mal ইফি Eph হিতো Prov যিহো Josh দ্বিতী Deut দ্বিতীয় Deut গণনা Num সফ Zeph হোশেও 80 | Hos ফিলি Phil যির Jer কল Col উপ ECCL উপ Eccl পরম Sol থিষল Thess থিষ Thess লেবীয় 81 | Lev যাত্রা Ex বংশা Chron নহি Neh হবক্ Hab অগ্র Pios সখ Zech প্রেরিত Acts ফিলী Philem সা\.কা 82 | লেবী Lev রূৎ Ruth পাদ ftn জানু Jan ফেব্রু Feb সেপ্ট Sept সেপ্টে Sept অক্টো Oct নভে Nov ডিসে Dec পরি pp""".split() 83 | # ABBREVIATIONS.extend(JWSPECAILS) 84 | 85 | ABBREVIATIONS = '|'.join(sorted(list(set(ABBREVIATIONS)))) 86 | ABBREVIATIONS = compile(r""" 87 | (?: \b(?:%s) # 1. known abbreviations, 88 | | ^\S # 2. a single, non-space character "sentence" (only), 89 | | ^\d+ # 3. a series of digits "sentence" (only), or 90 | | (?: \b # 4. terminal letters A.-A, A.A, or A, if prefixed with: 91 | # 4.a. something that makes them most likely a human first name initial 92 | (?: [Bb]y 93 | | [Cc](?:aptain|ommander) 94 | | [Dd]o[ck]tor 95 | | [Gg]eneral 96 | | [Mm](?:ag)?is(?:ter|s) 97 | | [Pp]rofessor 98 | | [Ss]e\u00F1or(?:it)?a? 99 | ) \s 100 | # 4.b. if they are most likely part of an author list: (avoiding "...A and B") 101 | | (?: (?10%): 122 | # after, though, upon, while, yet 123 | # 124 | # Words hardly used after abbrevs vs. SSs (poor continuations, <2%): 125 | # [after], as, at, but, during, for, in, nor, on, to, [though], [upon], 126 | # whereas, [while], within, [yet] 127 | # 128 | # Words hardly ever used as SSs (excellent continuations, <2%): 129 | # and, are, between, by, from, has, into, is, of, or, that, than, through, 130 | # via, was, were, with 131 | # 132 | # Words frequently used after abbrevs (excellent continuations, >10%): 133 | # [and, are, has, into, is, of, or, than, via, was, were] 134 | # 135 | # Grey zone: undecidable words -> leave in to bias towards under-splitting 136 | # whether 137 | 138 | ENDS_IN_DATE_DIGITS = compile(r"\b[0123]?[0-9]$") 139 | MONTH = compile(r"(J[äa]n|Ene|Feb|M[äa]r|A[pb]r|May|Jun|Jul|Aug|Sep|O[ck]t|Nov|D[ei][cz]|0?[1-9]|1[012])") 140 | """ 141 | Special facilities to detect European-style dates. 
142 | """ 143 | 144 | CONTINUATIONS = compile(r""" ^ # at string start only 145 | (?: a(?: nd|re ) 146 | | b(?: etween|y ) 147 | | from 148 | | has 149 | | i(?: nto|s ) 150 | | o[fr] 151 | | t(?: han|hat|hrough ) 152 | | via 153 | | w(?: as|ere|hether|ith ) 154 | )\b""", UNICODE | VERBOSE) 155 | "Lower-case words that in the given form usually don't start a sentence." 156 | 157 | BEFORE_LOWER = compile(r""" .*? 158 | (?: [%s]"[\)\]]* # ."]) .") ." 159 | | [%s] [\)\]]+ # .]) .) 160 | | \b spp \. # spp. (species pluralis) 161 | | \b \p{L} \p{Ll}? \. # Ll. L. 162 | ) \s+ $""" % (SENTENCE_TERMINALS, SENTENCE_TERMINALS), DOTALL | UNICODE | VERBOSE 163 | ) 164 | """ 165 | Endings that, if followed by a lower-case word, are not sentence terminals: 166 | - Quotations and brackets ("Hello!" said the man.) 167 | - dotted abbreviations (U.S.A. was) 168 | - genus-species-like (m. musculus) 169 | """ 170 | LOWER_WORD = compile(r'^\p{Ll}+[%s]?\p{Ll}*\b' % HYPHENS, UNICODE) 171 | "Lower-case words are not sentence starters (after an abbreviation)." 172 | 173 | MIDDLE_INITIAL_END = compile(r'\b\p{Lu}\p{Ll}+\W+\p{Lu}$', UNICODE) 174 | "Upper-case initial after upper-case word at the end of a string." 175 | 176 | UPPER_WORD_START = compile(r'^\p{Lu}\p{Ll}+\b', UNICODE) 177 | "Upper-case word at the beginning of a string." 178 | 179 | LONE_WORD = compile(r'^\p{Ll}+[\p{Ll}\p{Nd}%s]*$' % HYPHENS, UNICODE) 180 | "Any 'lone' lower-case word [with hyphens or digits inside] is a continuation." 181 | 182 | UPPER_CASE_END = compile(r'\b[\p{Lu}\p{Lt}]\p{L}*\.\s+$', UNICODE) 183 | "Inside brackets, 'Words' that can be part of a proper noun abbreviation, like a journal name." 184 | UPPER_CASE_START = compile(r'^(?:(?:\(\d{4}\)\s)?[\p{Lu}\p{Lt}]\p{L}*|\d+)[\.,:]\s+', UNICODE) 185 | "Inside brackets, 'Words' that can be part of a large abbreviation, like a journal name." 186 | 187 | SHORT_SENTENCE_LENGTH = 55 188 | "Length of either sentence fragment inside brackets to assume the fragment is not its own sentence." 189 | # This can be increased/decreased to heighten/lower the likelihood of splits inside brackets. 190 | 191 | NON_UNIX_LINEBREAK = compile(r'(?:\r\n|\r|\u2028)', UNICODE) 192 | "All linebreak sequence variants except the Unix newline (only)." 193 | 194 | SEGMENTER_REGEX = r""" 195 | ( # A sentence ends at one of two sequences: 196 | [%s] # Either, a sequence starting with a sentence terminal, 197 | [\'\u2019\"\u201D]? # an optional right quote, 198 | [\]\)]* # optional closing brackets and 199 | \s+ # a sequence of required spaces. 200 | | # Otherwise, 201 | \n{{{},}} # a sentence also terminates at [consecutive] newlines. 202 | | 203 | [\u0964]+ 204 | [\'\u2019\"\u201D]? # an optional right quote, 205 | [\]\)]* # optional closing brackets and 206 | \s* # a sequence of optional spaces. 207 | 208 | )""" % SENTENCE_TERMINALS 209 | 210 | """ 211 | Sentence end a sentence terminal, followed by spaces. 212 | Optionally, a right quote and any number of closing brackets may succeed the terminal marker. 213 | Alternatively, an yet undefined number of line-breaks also may terminate sentences. 214 | """ 215 | 216 | _compile = lambda count: compile(SEGMENTER_REGEX.format(count), UNICODE | VERBOSE) 217 | 218 | # Define that one or more line-breaks split sentences: 219 | DO_NOT_CROSS_LINES = _compile(1) 220 | "A segmentation pattern where any newline char also terminates a sentence." 
221 | 222 | # Define that two or more line-breaks split sentences: 223 | MAY_CROSS_ONE_LINE = _compile(2) 224 | "A segmentation pattern where two or more newline chars also terminate sentences." 225 | 226 | # some normalization primitives 227 | REPLACE_UNICODE_PUNCTUATION = [ 228 | (u"\u09F7", u"\u0964"), 229 | (u",", u","), 230 | (u"、", u","), 231 | (u"”", u'"'), 232 | (u"“", u'"'), 233 | (u"∶", u":"), 234 | (u":", u":"), 235 | (u"?", u"?"), 236 | (u"《", u'"'), 237 | (u"》", u'"'), 238 | (u")", u")"), 239 | (u"!", u"!"), 240 | (u"(", u"("), 241 | (u";", u";"), 242 | (u"」", u'"'), 243 | (u"「", u'"'), 244 | (u"0", u"0"), 245 | (u"1", u'1'), 246 | (u"2", u"2"), 247 | (u"3", u"3"), 248 | (u"4", u"4"), 249 | (u"5", u"5"), 250 | (u"6", u"6"), 251 | (u"7", u"7"), 252 | (u"8", u"8"), 253 | (u"9", u"9"), 254 | (u"~", u"~"), 255 | (u"’", u"'"), 256 | (u"…", u"..."), 257 | (u"━", u"-"), 258 | (u"〈", u"<"), 259 | (u"〉", u">"), 260 | (u"【", u"["), 261 | (u"】", u"]"), 262 | (u"%", u"%"), 263 | ] 264 | 265 | NORMALIZE_UNICODE = [ 266 | ('\u00AD', ''), 267 | ('\u09AF\u09BC', '\u09DF'), 268 | ('\u09A2\u09BC', '\u09DD'), 269 | ('\u09A1\u09BC', '\u09DC'), 270 | ('\u09AC\u09BC', '\u09B0'), 271 | ('\u09C7\u09BE', '\u09CB'), 272 | ('\u09C7\u09D7', '\u09CC'), 273 | ('\u0985\u09BE', '\u0986'), 274 | ('\u09C7\u0981\u09D7', '\u09CC\u0981'), 275 | ('\u09C7\u0981\u09BE', '\u09CB\u0981'), 276 | ('\u09C7([^\u09D7])\u09D7', "\g<1>\u09CC"), 277 | ('\\xa0', ' '), 278 | ('\u200B', u''), 279 | ('\u2060', u''), 280 | (u'„', r'"'), 281 | (u'“', r'"'), 282 | (u'”', r'"'), 283 | (u'–', r'-'), 284 | (u'—', r' - '), 285 | (r' +', r' '), 286 | (u'´', r"'"), 287 | (u'([a-zA-Z])‘([a-zA-Z])', r"\g<1>'\g<2>"), 288 | (u'([a-zA-Z])’([a-zA-Z])', r"\g<1>'\g<2>"), 289 | (u'‘', r"'"), 290 | (u'‚', r"'"), 291 | (u'’', r"'"), 292 | (u'´´', r'"'), 293 | (u'…', r'...'), 294 | ] 295 | 296 | FRENCH_QUOTES = [ 297 | (u'\u00A0«\u00A0', r'"'), 298 | (u'«\u00A0', r'"'), 299 | (u'«', r'"'), 300 | (u'\u00A0»\u00A0', r'"'), 301 | (u'\u00A0»', r'"'), 302 | (u'»', r'"'), 303 | ] 304 | 305 | SUBSTITUTIONS = [NORMALIZE_UNICODE, FRENCH_QUOTES, REPLACE_UNICODE_PUNCTUATION] 306 | SUBSTITUTIONS = list(chain(*SUBSTITUTIONS)) 307 | 308 | def normalize_punctuation(text): 309 | """Normalize common punctuations for the splitter to work better""" 310 | for regexp, replacement in SUBSTITUTIONS: 311 | text = re.sub(regexp, replacement, text, flags=re.UNICODE) 312 | 313 | for block in re.findall(r'[\s\.]{2,}', text, flags=re.UNICODE): 314 | block = block.strip() 315 | if len(re.findall(r'[\.]', block, flags=re.UNICODE)) > 1: 316 | newBlock = re.sub(r'[^\S\r\n]', '', block, flags=re.UNICODE) 317 | text = text.replace(block, newBlock, 1) 318 | 319 | return text 320 | 321 | # added punctuation normalization in here 322 | def split_single(text, join_on_lowercase=False, short_sentence_length=SHORT_SENTENCE_LENGTH): 323 | """ 324 | Default: split `text` at sentence terminals and at newline chars. 325 | """ 326 | text = normalize_punctuation(text) 327 | sentences = _sentences(DO_NOT_CROSS_LINES.split(text), join_on_lowercase, short_sentence_length) 328 | return [s for ss in sentences for s in ss.split('\n')] 329 | 330 | 331 | def split_multi(text, join_on_lowercase=False, short_sentence_length=SHORT_SENTENCE_LENGTH): 332 | """ 333 | Sentences may contain non-consecutive (single) newline chars, while consecutive newline chars 334 | ("paragraph separators") always split sentences. 
335 | """ 336 | text = normalize_punctuation(text) 337 | return _sentences(MAY_CROSS_ONE_LINE.split(text), join_on_lowercase, short_sentence_length) 338 | 339 | 340 | def split_newline(text): 341 | """ 342 | Split the `text` at newlines (``\\n'') and strip the lines, 343 | but only return lines with content. 344 | """ 345 | for line in text.split('\n'): 346 | line = line.strip() 347 | 348 | if line: 349 | yield line 350 | 351 | 352 | def rewrite_line_separators(text, pattern, join_on_lowercase=False, 353 | short_sentence_length=SHORT_SENTENCE_LENGTH): 354 | """ 355 | Remove line separator chars inside sentences and ensure there is a ``\\n`` at their end. 356 | 357 | :param text: input plain-text 358 | :param pattern: for the initial sentence splitting 359 | :param join_on_lowercase: always join sentences that start with lower-case 360 | :param short_sentence_length: the upper boundary for text spans that are not split 361 | into sentences inside brackets 362 | :return: a generator yielding the spans of text 363 | """ 364 | offset = 0 365 | 366 | for sentence in _sentences(pattern.split(text), join_on_lowercase, short_sentence_length): 367 | start = text.index(sentence, offset) 368 | intervening = text[offset:start] 369 | 370 | if offset != 0 and '\n' not in intervening: 371 | yield '\n' 372 | intervening = intervening[1:] 373 | 374 | yield intervening 375 | yield sentence.replace('\n', ' ') 376 | offset = start + len(sentence) 377 | 378 | if offset < len(text): 379 | yield text[offset:] 380 | 381 | 382 | def to_unix_linebreaks(text): 383 | """Replace non-Unix linebreak sequences (Windows, Mac, Unicode) with newlines (\\n).""" 384 | return NON_UNIX_LINEBREAK.sub('\n', text) 385 | 386 | 387 | def _sentences(spans, join_on_lowercase, short_sentence_length): 388 | """Join spans back together into sentences as necessary.""" 389 | last = None 390 | shorterThanATypicalSentence = lambda c, l: c < short_sentence_length or l < short_sentence_length 391 | 392 | for current in _abbreviation_joiner(spans): 393 | if last is not None: 394 | 395 | if (join_on_lowercase or BEFORE_LOWER.match(last)) and LOWER_WORD.match(current): 396 | last = '%s%s' % (last, current) 397 | elif shorterThanATypicalSentence(len(current), len(last)) and _is_open(last) and ( 398 | _is_not_opened(current) or last.endswith(' et al. ') or ( 399 | UPPER_CASE_END.search(last) and UPPER_CASE_START.match(current) 400 | ) 401 | ): 402 | last = '%s%s' % (last, current) 403 | elif shorterThanATypicalSentence(len(current), len(last)) and _is_open(last, '[]') and ( 404 | _is_not_opened(current, '[]') or last.endswith(' et al. 
') or ( 405 | UPPER_CASE_END.search(last) and UPPER_CASE_START.match(current) 406 | ) 407 | ): 408 | last = '%s%s' % (last, current) 409 | elif CONTINUATIONS.match(current): 410 | last = '%s%s' % (last, current) 411 | elif re.search(r'^[\"\']+$|^[\"\']+[ \t]*\n+.+', current.strip(), flags=re.UNICODE): 412 | last = '%s%s' % (last.strip(), current.strip()) 413 | elif current.strip().startswith('-') or re.search(r'^[\"\']\s*[\-]', current.strip(), flags=re.UNICODE): 414 | last = '%s%s' % (last.strip(), current.strip()) 415 | else: 416 | yield last.strip() 417 | last = current 418 | else: 419 | last = current 420 | 421 | if last is not None: 422 | yield last.strip() 423 | 424 | 425 | def _abbreviation_joiner(spans): 426 | """Join spans that match the ABBREVIATIONS pattern.""" 427 | segment = None 428 | makeSentence = lambda start, end: ''.join(spans[start:end]) 429 | total = len(spans) 430 | 431 | for pos in range(total): 432 | if pos and pos % 2: # even => segment, uneven => (potential) terminal 433 | prev_s = spans[pos - 1] 434 | marker = spans[pos] 435 | next_s = spans[pos+1] if pos + 1 < total else None 436 | 437 | if prev_s[-1:].isspace() and marker[0] != '\u0964': 438 | pass # join 439 | elif marker[0] == '.' and ABBREVIATIONS.search(prev_s): 440 | pass # join 441 | elif marker[0] == '.' and next_s and ( 442 | LONE_WORD.match(next_s) or 443 | (ENDS_IN_DATE_DIGITS.search(prev_s) and MONTH.match(next_s)) or 444 | (MIDDLE_INITIAL_END.search(prev_s) and UPPER_WORD_START.match(next_s)) 445 | ): 446 | pass # join 447 | else: 448 | yield makeSentence(segment, pos + 1) 449 | segment = None 450 | elif segment is None: 451 | segment = pos 452 | 453 | if segment is not None: 454 | yield makeSentence(segment, total) 455 | 456 | 457 | def _is_open(span_str, brackets='()'): 458 | """Check if the span ends with an unclosed `bracket`.""" 459 | offset = span_str.find(brackets[0]) 460 | nesting = 0 if offset == -1 else 1 461 | 462 | while offset != -1: 463 | opener = span_str.find(brackets[0], offset + 1) 464 | closer = span_str.find(brackets[1], offset + 1) 465 | 466 | if opener == -1: 467 | if closer == -1: 468 | offset = -1 469 | else: 470 | offset = closer 471 | nesting -= 1 472 | elif closer == -1: 473 | offset = opener 474 | nesting += 1 475 | elif opener < closer: 476 | offset = opener 477 | nesting += 1 478 | elif closer < opener: 479 | offset = closer 480 | nesting -= 1 481 | else: 482 | msg = 'at offset={}: closer={}, opener={}' 483 | raise RuntimeError(msg.format(offset, closer, opener)) 484 | 485 | return nesting > 0 486 | 487 | 488 | def _is_not_opened(span_str, brackets='()'): 489 | """Check if the span starts with an unopened `bracket`.""" 490 | offset = span_str.rfind(brackets[1]) 491 | nesting = 0 if offset == -1 else 1 492 | 493 | while offset != -1: 494 | opener = span_str.rfind(brackets[0], 0, offset) 495 | closer = span_str.rfind(brackets[1], 0, offset) 496 | 497 | if opener == -1: 498 | if closer == -1: 499 | offset = -1 500 | else: 501 | offset = closer 502 | nesting += 1 503 | elif closer == -1: 504 | offset = opener 505 | nesting -= 1 506 | elif closer < opener: 507 | offset = opener 508 | nesting -= 1 509 | elif opener < closer: 510 | offset = closer 511 | nesting += 1 512 | else: 513 | msg = 'at offset={}: closer={}, opener={}' 514 | raise RuntimeError(msg.format(offset, closer, opener)) 515 | 516 | return nesting > 0 517 | 518 | def segment_text(input_text, mode='single'): 519 | """Simple api to segment text with most default values""" 520 | normal = to_unix_linebreaks 521 | if 
mode == 'single': 522 | sentences = split_single(normal(input_text), short_sentence_length=SHORT_SENTENCE_LENGTH) 523 | text_spans = [i for s in sentences for i in (s, '\n')] 524 | elif mode == 'multi': 525 | text_spans = rewrite_line_separators(normal(input_text), MAY_CROSS_ONE_LINE, short_sentence_length=SHORT_SENTENCE_LENGTH) 526 | 527 | segments = [span.strip() for span in text_spans if span.strip()] 528 | return segments 529 | 530 | 531 | 532 | def main(): 533 | # print one sentence per line 534 | from argparse import ArgumentParser 535 | from sys import argv, stdout, stdin, stderr, getdefaultencoding, version_info 536 | from os import path, linesep 537 | 538 | single, multi = 0, 1 539 | 540 | parser = ArgumentParser(usage='%(prog)s [--mode] [FILE ...]', 541 | description=__doc__, prog=path.basename(argv[0]), 542 | epilog='default encoding: ' + getdefaultencoding()) 543 | parser.add_argument('files', metavar='FILE', nargs='*', 544 | help='UTF-8 plain-text file(s); if absent, read from STDIN') 545 | parser.add_argument('--with-ids', action='store_true', 546 | help='STDIN (only!) input is ID-tab-TEXT; the ID is ' 547 | 'preserved in the output as ID-tab-N-tab-SENTENCE ' 548 | 'where N is the incremental sentence number for that ' 549 | 'text ID') 550 | parser.add_argument('--normal-breaks', '-n', action='store_true', 551 | help=to_unix_linebreaks.__doc__) 552 | parser.add_argument('--bracket-spans', '-b', metavar="INT", type=int, 553 | default=SHORT_SENTENCE_LENGTH, 554 | help="upper boundary for text spans that are not split " 555 | "into sentences inside brackets [%(default)d]") 556 | parser.add_argument('--encoding', '-e', help='force another encoding to use') 557 | mode = parser.add_mutually_exclusive_group() 558 | parser.set_defaults(mode=single) 559 | mode.add_argument('--single', '-s', action='store_const', dest='mode', const=single, 560 | help=split_single.__doc__) 561 | mode.add_argument('--multi', '-m', action='store_const', dest='mode', const=multi, 562 | help=split_multi.__doc__) 563 | 564 | args = parser.parse_args() 565 | pattern = [DO_NOT_CROSS_LINES, MAY_CROSS_ONE_LINE, ][args.mode] 566 | normal = to_unix_linebreaks if args.normal_breaks else lambda t: t 567 | 568 | # fix broken Unicode handling in Python 2.x 569 | # see http://www.macfreek.nl/memory/Encoding_of_Python_stdout 570 | if args.encoding or version_info < (3, 0): 571 | if version_info >= (3, 0): 572 | stdout = stdout.buffer 573 | stdin = stdin.buffer 574 | 575 | stdout = codecs.getwriter( 576 | args.encoding or 'utf-8' 577 | )(stdout, 'xmlcharrefreplace') 578 | 579 | stdin = codecs.getreader( 580 | args.encoding or 'utf-8' 581 | )(stdin, 'xmlcharrefreplace') 582 | 583 | if not args.encoding: 584 | stderr.write('wrapped segmenter stdio with UTF-8 de/encoders') 585 | stderr.write(linesep) 586 | 587 | if not args.files and args.mode != single: 588 | parser.error('only single line splitting mode allowed ' 589 | 'when reading from STDIN') 590 | 591 | def segment(text): 592 | if not args.files and args.with_ids: 593 | tid, text = text.split('\t', 1) 594 | else: 595 | tid = None 596 | 597 | if args.mode == single: 598 | sentences = split_single(normal(text), short_sentence_length=args.bracket_spans) 599 | text_spans = [i for s in sentences for i in (s, '\n')] 600 | else: 601 | text_spans = rewrite_line_separators( 602 | normal(text), pattern, short_sentence_length=args.bracket_spans 603 | ) 604 | 605 | if tid is not None: 606 | def write_ids(tid, sid): 607 | stdout.write(tid) 608 | stdout.write('\t') 609 | 
stdout.write(str(sid)) 610 | stdout.write('\t') 611 | 612 | last = '\n' 613 | sid = 1 614 | 615 | for span in text_spans: 616 | if last == '\n' and span not in ('', '\n'): 617 | write_ids(tid, sid) 618 | sid += 1 619 | 620 | stdout.write(span) 621 | 622 | if span: 623 | last = span 624 | else: 625 | for span in text_spans: 626 | if span.strip() == "": 627 | continue 628 | stdout.write(f'{span.strip()}\n') 629 | 630 | if args.files: 631 | for txt_file_path in args.files: 632 | with codecs.open( 633 | txt_file_path, 'r', encoding=(args.encoding or 'utf-8') 634 | ) as fp: 635 | segment(fp.read()) 636 | else: 637 | for line in stdin: 638 | segment(line) 639 | 640 | 641 | if __name__ == '__main__': 642 | main() 643 | -------------------------------------------------------------------------------- /segmentation/setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from setuptools import setup 3 | 4 | try: 5 | with open('README.md') as file: 6 | long_description = file.read() 7 | except IOError: 8 | long_description = "missing" 9 | 10 | 11 | setup( 12 | name='segmentation', 13 | data_files = [("", ["LICENSE.txt"])], 14 | packages = ['segmentation'], 15 | package_dir = {'segmentation':''}, 16 | install_requires=['regex'], 17 | long_description=long_description, 18 | entry_points={ 19 | 'console_scripts': [ 20 | 'segmenter = segmentation.segmenter:main' 21 | ] 22 | } 23 | ) 24 | -------------------------------------------------------------------------------- /training/README.md: -------------------------------------------------------------------------------- 1 | # Preprocessing 2 | 3 | If you want to, 4 | * build a new bn-en training dataset from a noisy parallel corpora (by filtering / cleaning some pairs based on our heuristics) with corresponding vocabulary models or 5 | * normalize a new dataset before evaluating on the model or 6 | * remove all evaluation pairs from training pairs for a new set of training / test datasets 7 | 8 | refer to [here](preprocessing/). 9 | 10 | # Training & Evaluation 11 | 12 | **Note:** This code has been refactored to support [OpenNMT-py 2.0](https://github.com/OpenNMT/OpenNMT-py) 13 | 14 | ### Setup 15 | 16 | ```bash 17 | $ cd seq2seq/ 18 | $ conda create python==3.7.9 pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch -p ./env 19 | $ conda activate ./env # or source activate ./env (for older versions of anaconda) 20 | $ pip install --upgrade -r requirements.txt 21 | ``` 22 | - **Note**: For newer NVIDIA GPUS such as ***A100*** or ***3090*** use `cudatoolkit=11.0`. 
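
  For example, on such GPUs only the CUDA toolkit pin in the command above changes (an assumed variant of the same setup command):

  ```bash
  $ conda create python==3.7.9 pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch -p ./env
  ```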
23 | 24 | 25 | ### Usage 26 | 27 | ```bash 28 | $ cd seq2seq/ 29 | $ python pipeline.py -h 30 | usage: pipeline.py [-h] --input_dir PATH --output_dir PATH --src_lang SRC_LANG 31 | --tgt_lang TGT_LANG 32 | [--validation_samples VALIDATION_SAMPLES] 33 | [--src_seq_length SRC_SEQ_LENGTH] 34 | [--tgt_seq_length TGT_SEQ_LENGTH] 35 | [--model_prefix MODEL_PREFIX] [--eval_model PATH] 36 | [--train_steps TRAIN_STEPS] 37 | [--train_batch_size TRAIN_BATCH_SIZE] 38 | [--eval_batch_size EVAL_BATCH_SIZE] 39 | [--gradient_accum GRADIENT_ACCUM] 40 | [--warmup_steps WARMUP_STEPS] 41 | [--learning_rate LEARNING_RATE] [--layers LAYERS] 42 | [--rnn_size RNN_SIZE] [--word_vec_size WORD_VEC_SIZE] 43 | [--transformer_ff TRANSFORMER_FF] [--heads HEADS] 44 | [--valid_steps VALID_STEPS] 45 | [--save_checkpoint_steps SAVE_CHECKPOINT_STEPS] 46 | [--average_last AVERAGE_LAST] [--world_size WORLD_SIZE] 47 | [--gpu_ranks [GPU_RANKS [GPU_RANKS ...]]] 48 | [--train_from TRAIN_FROM] [--do_train] [--do_eval] 49 | [--nbest NBEST] [--alpha ALPHA] 50 | 51 | optional arguments: 52 | -h, --help show this help message and exit 53 | --input_dir PATH, -i PATH 54 | Input directory 55 | --output_dir PATH, -o PATH 56 | Output directory 57 | --src_lang SRC_LANG Source language 58 | --tgt_lang TGT_LANG Target language 59 | --validation_samples VALIDATION_SAMPLES 60 | no. of validation samples to take out from train 61 | dataset when no validation data is present 62 | --src_seq_length SRC_SEQ_LENGTH 63 | maximum source sequence length 64 | --tgt_seq_length TGT_SEQ_LENGTH 65 | maximum target sequence length 66 | --model_prefix MODEL_PREFIX 67 | Prefix of the model to save 68 | --eval_model PATH Path to the specific model to evaluate 69 | --train_steps TRAIN_STEPS 70 | no of training steps 71 | --train_batch_size TRAIN_BATCH_SIZE 72 | training batch size (in tokens) 73 | --eval_batch_size EVAL_BATCH_SIZE 74 | evaluation batch size (in sentences) 75 | --gradient_accum GRADIENT_ACCUM 76 | gradient accum 77 | --warmup_steps WARMUP_STEPS 78 | warmup steps 79 | --learning_rate LEARNING_RATE 80 | learning rate 81 | --layers LAYERS layers 82 | --rnn_size RNN_SIZE rnn size 83 | --word_vec_size WORD_VEC_SIZE 84 | word vector size 85 | --transformer_ff TRANSFORMER_FF 86 | transformer feed forward size 87 | --heads HEADS no of heads 88 | --valid_steps VALID_STEPS 89 | validation interval 90 | --save_checkpoint_steps SAVE_CHECKPOINT_STEPS 91 | model saving interval 92 | --average_last AVERAGE_LAST 93 | average last X models 94 | --world_size WORLD_SIZE 95 | world size 96 | --gpu_ranks [GPU_RANKS [GPU_RANKS ...]] 97 | gpu ranks 98 | --train_from TRAIN_FROM 99 | start training from this checkpoint 100 | --do_train Run training 101 | --do_eval Run evaluation 102 | --nbest NBEST sentencepiece nbest size 103 | --alpha ALPHA sentencepiece alpha 104 | ``` 105 | 106 | * ***Sample `input_dir` structure for bn2en training and evaluation:*** 107 | 108 | ```bash 109 | input_dir/ 110 | |---> data/ 111 | | |---> corpus.train.bn 112 | | |---> corpus.train.en 113 | | |---> RisingNews.valid.bn 114 | | |---> RisingNews.valid.en 115 | | |---> RisingNews.test.bn 116 | | |---> RisingNews.test.en 117 | | |---> sipc.test.bn 118 | | |---> sipc.test.en.0 119 | | |---> sipc.test.en.1 120 | | ... 
121 |---> vocab/ 122 | |---> bn.model 123 | |---> en.model 124 | ``` 125 | * Input data files inside the `data/` subdirectory must have the following format: **`X.type.lang(.count)`**, where `X` is any common file prefix, `type` is one of `{train, valid, test}` and `count` is an optional integer (**only applicable for the `target_lang`, when there are multiple reference files**). There can be multiple `.train.`/`.valid.` filepairs. In the absence of `.valid.` files, `validation_samples` example pairs will be randomly sampled from the training files during `training`. 126 | * The `vocab` subdirectory must hold two sentencepiece vocabulary models formatted as `src_lang.model` and `tgt_lang.model` 127 | 128 | * ***After training / evaluation, the `output_dir` will have the following subdirectories with these contents:*** 129 | * `Models`: All the saved models 130 | * `Reports`: **BLEU and SACREBLEU** scores on the validation files for all saved models with the given `model_prefix`, and the scores on the test files for the given `eval_model` (if the corresponding reference files are present) 131 | * `Outputs`: Detokenized model predictions. 132 | * `data`: Merged training files after applying subword regularization. 133 | * `Preprocessed`: Training and validation data shards 134 | 135 | 136 | ***To reproduce our results on an AWS p3.8xlarge EC2 instance equipped with 4 Tesla V100 GPUs, run the script with the default hyperparameters.*** For example, for bn2en training, 137 | ```bash 138 | $ export CUDA_VISIBLE_DEVICES=0,1,2,3 139 | # for training 140 | $ python pipeline.py \ 141 | --src_lang bn --tgt_lang en \ 142 | -i inputFolder/ -o outputFolder/ \ 143 | --model_prefix bn2en --do_train --do_eval 144 | ``` 145 | For single-GPU training, additionally provide the flags `--world_size 1` and `--gpu_ranks 0`, and adjust the effective batch size to the available GPU VRAM using `--train_batch_size X` and `--gradient_accum X`. 146 | 147 | 148 | # Evaluation 149 | 150 | For evaluating trained models on a single GPU on new test files, use the following snippet with appropriate arguments (`<path/to/model>` is a placeholder for the checkpoint to evaluate): 151 | 152 | ```bash 153 | $ python pipeline.py \ 154 | --src_lang bn --tgt_lang en \ 155 | -i inputFolder/ -o outputFolder/ \ 156 | --eval_model <path/to/model> \ 157 | --do_eval 158 | ``` 159 | -------------------------------------------------------------------------------- /training/preprocessing/README.md: -------------------------------------------------------------------------------- 1 | ## Cleaning / Normalizing / Training Vocabularies 2 | ***The purpose of this extra cleaning on top of batch filtering is to maximize the amount of useful information in the bn-en dataset for a bilingual MT system. We do this by employing a variety of heuristics such as removing identical spans of foreign text from both sides, applying transliteration when appropriate, thresholding the allowed amount of foreign text in a sentence pair, etc. For more details, refer to the code. 
Additionally, the script generates sentencepiece vocabulary files required for tokenizing the parallel corpora.*** 3 | 4 | ### Usage 5 | 6 | ```bash 7 | $ python preprocessor.py -h 8 | usage: preprocessor.py [-h] --input_dir PATH --output_dir PATH [--normalize] 9 | [--bn_vocab_size BN_VOCAB_SIZE] 10 | [--en_vocab_size EN_VOCAB_SIZE] 11 | [--bn_model_type BN_MODEL_TYPE] 12 | [--en_model_type EN_MODEL_TYPE] 13 | [--bn_coverage BN_COVERAGE] [--en_coverage EN_COVERAGE] 14 | 15 | optional arguments: 16 | -h, --help show this help message and exit 17 | --input_dir PATH, -i PATH 18 | Input directory 19 | --output_dir PATH, -o PATH 20 | Output directory 21 | --normalize Only normalize the files in input directory 22 | --bn_vocab_size BN_VOCAB_SIZE 23 | bengali vocab size 24 | --en_vocab_size EN_VOCAB_SIZE 25 | english vocab size 26 | --bn_model_type BN_MODEL_TYPE 27 | bengali sentencepiece model type 28 | --en_model_type EN_MODEL_TYPE 29 | english sentencepiece model type 30 | --bn_coverage BN_COVERAGE 31 | bengali character coverage 32 | --en_coverage EN_COVERAGE 33 | english character coverage 34 | ``` 35 | 36 | * If the script is invoked with `--normalize`, it will only produce the normalized version of all .bn / .en files found in the `input_dir` in corresponding subdirectories of `output_dir`. 37 | * Otherwise, the script will recursively look for all filepairs (`X.bn`, `X.en`) inside `input_dir`, where `X` is any common file prefix, and produce the following files inside `output_dir`: 38 | 39 | * `combined.bn` / `combined.en`: filepairs obtained by cleaning all linepairs. 40 | * `bn.model`, `bn.vocab` / `en.model`, `en.vocab`: sentencepiece models 41 | 42 | 43 | ## Removing Evaluation pairs 44 | ***If you are training from scratch with new test / train datasets, you should remove all evaluation pairs (`validation` / `test`) first from the training dataset to prevent data leakage.*** To do so, run `remove_evaluation_pairs.py`. 45 | 46 | **Make sure all datasets are normalized before running the script.** 47 | 48 | ### Usage 49 | ```bash 50 | $ python remove_evaluation_pairs.py -h 51 | usage: remove_evaluation_pairs.py [-h] --input_dir PATH --output_dir PATH 52 | --src_lang SRC_LANG --tgt_lang TGT_LANG 53 | 54 | optional arguments: 55 | -h, --help show this help message and exit 56 | --input_dir PATH, -i PATH 57 | Input directory 58 | --output_dir PATH, -o PATH 59 | Output directory 60 | --src_lang SRC_LANG Source language 61 | --tgt_lang TGT_LANG Target language 62 | ``` 63 | 64 | * The input directory must be structured as mentioned [here](../). This script will remove all evaluation pairs from training pairs and write those to `corpus.train.src_lang` / `corpus.train.tgt_lang` inside `output_dir`. 
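
For example, a minimal invocation might look like the following (`inputFolder/` and `outputFolder/` are placeholder paths):

```bash
$ python remove_evaluation_pairs.py \
    -i inputFolder/ -o outputFolder/ \
    --src_lang bn --tgt_lang en
```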
65 | -------------------------------------------------------------------------------- /training/preprocessing/preprocessor.py: -------------------------------------------------------------------------------- 1 | #%% 2 | """ 3 | Generated files/folders (depending on the flags set on __main__) summary: 4 | `Folders` 5 | - `tmp/Initial` > Contains files generated by initial hardline filtering 6 | - `tmp/pattern` > Contains files generated by character map replacements 7 | - `tmp/patch` > Contains files generated by replacing patches of identical non-bangla texts on both sides 8 | - `tmp/transilaterate`> Contains files generated by transliterating dangling characters on bangla side 9 | - `Final` > Contains files generated by applying previous transformations and some final postprocessing 10 | `Files(prefixes)` 11 | - `saved` > Transformed Lines that pass hardline filtering after applying transformation 12 | - `savedOriginal` > Original Lines that pass hardline filtering after applying transformation 13 | - `filtered` > Lines that don't pass hardline filtering at a stage 14 | - `cleaned` > Lines that pass hardline filtering at a stage (Initial/Previous passing Lines + saved) 15 | """ 16 | 17 | import re 18 | import os 19 | import difflib 20 | import time 21 | from subprocess import check_output 22 | from aksharamukha import transliterate 23 | from itertools import chain 24 | import shutil 25 | import sys 26 | import uuid 27 | import multiprocessing 28 | from multiprocessing import Pool 29 | import argparse 30 | import glob 31 | from tqdm import tqdm 32 | 33 | def globalize(func): 34 | def result(*args, **kwargs): 35 | return func(*args, **kwargs) 36 | result.__name__ = result.__qualname__ = uuid.uuid4().hex 37 | setattr(sys.modules[result.__module__], result.__name__, result) 38 | return result 39 | 40 | REPLACE_UNICODE_PUNCTUATION = [ 41 | (u"\u09F7", u"\u0964"), 42 | (u",", u","), 43 | (u"、", u","), 44 | (u"”", u'"'), 45 | (u"“", u'"'), 46 | (u"∶", u":"), 47 | (u":", u":"), 48 | (u"?", u"?"), 49 | (u"《", u'"'), 50 | (u"》", u'"'), 51 | (u")", u")"), 52 | (u"!", u"!"), 53 | (u"(", u"("), 54 | (u";", u";"), 55 | (u"」", u'"'), 56 | (u"「", u'"'), 57 | (u"0", u"0"), 58 | (u"1", u'1'), 59 | (u"2", u"2"), 60 | (u"3", u"3"), 61 | (u"4", u"4"), 62 | (u"5", u"5"), 63 | (u"6", u"6"), 64 | (u"7", u"7"), 65 | (u"8", u"8"), 66 | (u"9", u"9"), 67 | (u"~", u"~"), 68 | (u"’", u"'"), 69 | (u"…", u"..."), 70 | (u"━", u"-"), 71 | (u"〈", u"<"), 72 | (u"〉", u">"), 73 | (u"【", u"["), 74 | (u"】", u"]"), 75 | (u"%", u"%"), 76 | ] 77 | 78 | NORMALIZE_UNICODE = [ 79 | ('\u00AD', ''), 80 | ('\u09AF\u09BC', '\u09DF'), 81 | ('\u09A2\u09BC', '\u09DD'), 82 | ('\u09A1\u09BC', '\u09DC'), 83 | ('\u09AC\u09BC', '\u09B0'), 84 | ('\u09C7\u09BE', '\u09CB'), 85 | ('\u09C7\u09D7', '\u09CC'), 86 | ('\u0985\u09BE', '\u0986'), 87 | ('\u09C7\u0981\u09D7', '\u09CC\u0981'), 88 | ('\u09C7\u0981\u09BE', '\u09CB\u0981'), 89 | ('\u09C7([^\u09D7])\u09D7', "\g<1>\u09CC"), 90 | ('\\xa0', ' '), 91 | ('\u200B', u''), 92 | ('\u2060', u''), 93 | (u'„', r'"'), 94 | (u'“', r'"'), 95 | (u'”', r'"'), 96 | (u'–', r'-'), 97 | (u'—', r' - '), 98 | (r' +', r' '), 99 | (u'´', r"'"), 100 | (u'([a-zA-Z])‘([a-zA-Z])', r"\g<1>'\g<2>"), 101 | (u'([a-zA-Z])’([a-zA-Z])', r"\g<1>'\g<2>"), 102 | (u'‘', r"'"), 103 | (u'‚', r"'"), 104 | (u'’', r"'"), 105 | (u'´´', r'"'), 106 | (u'…', r'...'), 107 | ] 108 | 109 | FRENCH_QUOTES = [ 110 | (u'\u00A0«\u00A0', r'"'), 111 | (u'«\u00A0', r'"'), 112 | (u'«', r'"'), 113 | (u'\u00A0»\u00A0', r'"'), 114 | (u'\u00A0»', r'"'), 115 | 
(u'»', r'"'), 116 | ] 117 | 118 | SUBSTITUTIONS = [NORMALIZE_UNICODE, FRENCH_QUOTES, REPLACE_UNICODE_PUNCTUATION] 119 | SUBSTITUTIONS = list(chain(*SUBSTITUTIONS)) 120 | 121 | BANGLA_CHARS = ( 122 | r'[' 123 | r'\u0981-\u0983' 124 | r'\u0985-\u098B' 125 | r'\u098F-\u0990' 126 | r'\u0993-\u09A8' 127 | r'\u09AA-\u09B0' 128 | r'\u09B2' 129 | r'\u09B6-\u09B9' 130 | r'\u09BC' 131 | r'\u09BE-\u09C3' 132 | r'\u09C7-\u09C8' 133 | r'\u09CB-\u09CC' 134 | r'\u09CE' 135 | r'\u09D7' 136 | r'\u09DC-\u09DD' 137 | r'\u09DF' 138 | r'\u09E6-\u09EF' 139 | r'\u09F3' 140 | r'\u0964' 141 | r']' 142 | ) 143 | 144 | NEUTRAL_CHARS = ( 145 | r'[' 146 | r'\s' 147 | r'\u09CD' 148 | r'\u0021-\u002F' 149 | r'\u003A-\u0040' 150 | r'\u005B-\u0060' 151 | r'\u007B-\u007E' 152 | r'\u00A0' 153 | r'\u00A3' 154 | r'\u00B0' 155 | r'\u2000-\u2014' 156 | r'\u2018-\u201D' 157 | r'\u2028-\u202F' 158 | r'\u2032-\u2033' 159 | r'\u2035-\u2036' 160 | r'\u2060-\u206F' 161 | r']' 162 | ) 163 | 164 | ENGLISH_CHARS = ( 165 | r'[a-zA-Z0-9]' 166 | ) 167 | 168 | NON_BANGLA_PATCH = ( 169 | r'[' 170 | r'^' 171 | r'\u0981-\u0983' 172 | r'\u0985-\u098B' 173 | r'\u098F-\u0990' 174 | r'\u0993-\u09A8' 175 | r'\u09AA-\u09B0' 176 | r'\u09B2' 177 | r'\u09B6-\u09B9' 178 | r'\u09BC' 179 | r'\u09BE-\u09C3' 180 | r'\u09C7-\u09C8' 181 | r'\u09CB-\u09CC' 182 | r'\u09CE' 183 | r'\u09D7' 184 | r'\u09DC-\u09DD' 185 | r'\u09DF' 186 | r'\u09E6-\u09EF' 187 | r'\u09F3' 188 | r'\u0964' 189 | r'\s' 190 | r'\u09CD' 191 | r'\u0021-\u002F' 192 | r'\u003A-\u0040' 193 | r'\u005B-\u0060' 194 | r'\u007B-\u007E' 195 | r'\u00A0' 196 | r'\u00A3' 197 | r'\u00B0' 198 | r'\u2000-\u2014' 199 | r'\u2018-\u201D' 200 | r'\u2028-\u202F' 201 | r'\u2032-\u2033' 202 | r'\u2035-\u2036' 203 | r'\u2060-\u206F' 204 | r']' 205 | r'+' 206 | ) 207 | 208 | NON_ENGLISH_PATCH = ( 209 | r'[' 210 | r'^' 211 | r'a-z' 212 | r'A-Z' 213 | r'0-9' 214 | r'\s' 215 | r'\u09CD' 216 | r'\u0021-\u002F' 217 | r'\u003A-\u0040' 218 | r'\u005B-\u0060' 219 | r'\u007B-\u007E' 220 | r'\u00A0' 221 | r'\u00A3' 222 | r'\u00B0' 223 | r'\u2000-\u2014' 224 | r'\u2018-\u201D' 225 | r'\u2028-\u202F' 226 | r'\u2032-\u2033' 227 | r'\u2035-\u2036' 228 | r'\u2060-\u206F' 229 | r']' 230 | r'+' 231 | ) 232 | 233 | WHITESPACE_PUNCTATION = ( 234 | r'[\(\[\{' 235 | r'\u0021-\u0027' 236 | r'\u002A-\u002F' 237 | r'\u003A-\u0040' 238 | r'\u005C' 239 | r'\u005E-\u0060' 240 | r'\u007C' 241 | r'\u007E' 242 | r'\u02B9-\u02DD' 243 | r'\u09F7' 244 | r'\u0964' 245 | r'\u0965' 246 | r'\u2010-\u201F' 247 | r'\s\t' 248 | r'\)\]\}]' 249 | r'+' 250 | ) 251 | 252 | INCLUSIVE_NON_BANGLA_PATCH = ( 253 | r'[\(\[\{' 254 | r'a-zA-Z' 255 | r'\u00A1-\u00AC' 256 | r'\u00AE-\u02FF' 257 | r'\u0300-\u07BF' 258 | r'\u0900' 259 | r'\u0904-\u094D' 260 | r'\u094E-\u0950' 261 | r'\u0955-\u0963' 262 | r'\u0966-\u097F' 263 | r'\u0A00-\u1FFF' 264 | r'\u2020-\u2027' 265 | r'\u2030-\u2031' 266 | r'\u203B-\u205E' 267 | r'\u2070-\uFE4F' 268 | r'\uFE70-\uFEFF' 269 | r'\uFF21-\uFF3A' 270 | r'\uFF41-\uFF5A' 271 | r'\uFF5F-\uFFEF' 272 | r'\uFFF9-\uFFFF' 273 | r'\u0021-\u0027' 274 | r'\u002A-\u002F' 275 | r'\u003A-\u0040' 276 | r'\u005C' 277 | r'\u005E-\u0060' 278 | r'\u007C' 279 | r'\u007E' 280 | r'\u02B9-\u02DD' 281 | r'\u2010-\u201F' 282 | r'0-9' 283 | r'\s\t' 284 | r'\)\]\}]' 285 | '+' 286 | ) 287 | 288 | BRACKETED_SPANS = ( 289 | r'[\(\[\{]' 290 | r'[' 291 | r'\u0021-\u0027' 292 | r'\u002A-\u002F' 293 | r'\u003A-\u0040' 294 | r'\u005C' 295 | r'\u005E-\u0060' 296 | r'\u007C' 297 | r'\u007E' 298 | r'\u02B9-\u02DD' 299 | r'\u2010-\u201F' 300 | r'\s\t' 301 | r']' 302 | r'*' 
303 | r'[\)\]\}]' 304 | ) 305 | 306 | def normalize(text): 307 | for regexp, replacement in SUBSTITUTIONS: 308 | text = re.sub(regexp, replacement, text, flags=re.UNICODE) 309 | 310 | text = re.sub(r'\s+', ' ', text) 311 | return text.strip() 312 | 313 | def readFile(filename): 314 | with open(filename) as f: 315 | lines = [line.strip() for line in f.readlines()] 316 | return lines 317 | 318 | def readFilePair(bnFile, enFile): 319 | bnLines = readFile(bnFile) 320 | enLines = readFile(enFile) 321 | 322 | return zip(bnLines, enLines) 323 | 324 | def readReplacePatterns(filename): 325 | """ 326 | Patterns should have the following form in each line: 327 | `Pattern:enReplacement(:optional bnReplacement)` 328 | 329 | Both Pattern and Replacement can contain arbitrary no of spaces. 330 | Be careful not to place unnecessary spaces. 331 | In absence of bnReplacement, bnReplacement = enReplacement 332 | """ 333 | enPatternMap, bnPatternMap = {}, {} 334 | 335 | with open(filename) as f: 336 | for line in f.readlines(): 337 | try: 338 | splitLine = line.rstrip('\n').split(":") 339 | pattern = splitLine[0] 340 | enReplacement = splitLine[1] 341 | if len(splitLine) == 3: 342 | bnPatternMap[pattern] = splitLine[2] 343 | else: 344 | bnPatternMap[pattern] = enReplacement 345 | enPatternMap[pattern] = enReplacement 346 | except: 347 | continue 348 | 349 | return enPatternMap, bnPatternMap 350 | 351 | def hasNonBangla(line): 352 | return len(line) - countBanglaChars(line) - countNeutralChars(line) 353 | 354 | def hasNonEnglish(line): 355 | return len(line) - countEnglishChars(line) - countNeutralChars(line) 356 | 357 | def hasOOV(line): 358 | return len(line) - countBanglaChars(line) - countEnglishChars(line) - countNeutralChars(line) 359 | 360 | def countBanglaChars(line): 361 | chars = re.findall( 362 | BANGLA_CHARS, 363 | line, 364 | flags=re.UNICODE 365 | ) 366 | return len(chars) 367 | 368 | def countEnglishChars(line): 369 | chars = re.findall( 370 | ENGLISH_CHARS, 371 | line, 372 | flags=re.UNICODE 373 | ) 374 | return len(chars) 375 | 376 | def countNeutralChars(line): 377 | chars = re.findall( 378 | NEUTRAL_CHARS, 379 | line, 380 | flags=re.UNICODE 381 | ) 382 | return len(chars) 383 | 384 | def getNonBanglaPatches(line): 385 | return re.findall( 386 | NON_BANGLA_PATCH, 387 | line, 388 | flags=re.UNICODE 389 | ) 390 | 391 | def getNonEnglishPatches(line): 392 | return re.findall( 393 | NON_ENGLISH_PATCH, 394 | line, 395 | flags=re.UNICODE 396 | ) 397 | 398 | def replaceEmojis(*lines): 399 | outputLines = [] 400 | for line in lines: 401 | line = re.sub( 402 | r'\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff]', 403 | "", 404 | line, 405 | flags=re.UNICODE 406 | ) 407 | line = re.sub(r'[\U00010000-\U0010ffff]', "", line, flags=re.UNICODE) 408 | outputLines.append(line) 409 | 410 | return tuple(outputLines) 411 | 412 | def isValidProcessedPair(bnLine, enLine): 413 | if bnLine.strip() == "" or enLine.strip() == "": 414 | return False 415 | 416 | # check if either side contains only punctuations and whitespaces 417 | 418 | if ( 419 | re.sub(WHITESPACE_PUNCTATION, "", bnLine.strip(), flags=re.UNICODE) == "" or 420 | re.sub(WHITESPACE_PUNCTATION, "", enLine.strip(), flags=re.UNICODE) == "" 421 | ): 422 | return False 423 | 424 | return True 425 | 426 | def writeFilePairs(dir, validPairs, foreignPairs, savedPairs=None): 427 | os.makedirs(dir, exist_ok=True) 428 | 429 | with open(os.path.join(dir, "cleaned.bn"), 'w') as bn, \ 430 | open(os.path.join(dir, 
"cleaned.en"), 'w') as en: 431 | for bnLine, enLine in validPairs: 432 | print(bnLine, file=bn) 433 | print(enLine, file=en) 434 | 435 | with open(os.path.join(dir, "filtered.bn"), 'w') as bn, \ 436 | open(os.path.join(dir, "filtered.en"), 'w') as en: 437 | for bnLine, enLine in foreignPairs: 438 | print(bnLine, file=bn) 439 | print(enLine, file=en) 440 | 441 | if savedPairs: 442 | with open(os.path.join(dir, "savedOrignal.bn"), 'w') as bnO, \ 443 | open(os.path.join(dir, "savedOrignal.en"), 'w') as enO, \ 444 | open(os.path.join(dir, "saved.bn"), 'w') as bnS, \ 445 | open(os.path.join(dir, "saved.en"), 'w') as enS: 446 | for bnOriginal, enOriginal, bnSaved, enSaved in savedPairs: 447 | print(bnOriginal, file=bnO) 448 | print(enOriginal, file=enO) 449 | print(bnSaved, file=bnS) 450 | print(enSaved, file=enS) 451 | 452 | def convertNumerals(bnLine, enLine): 453 | newBnLine, newEnLine = bnLine, enLine 454 | 455 | for enNumeral in re.findall(r'[0-9]+', newBnLine, flags=re.UNICODE): 456 | newBnLine = newBnLine.replace(enNumeral, transliterate.process('RomanReadable', 'Bengali', enNumeral), 1) 457 | 458 | for bnNumeral in re.findall(r'[০-৯]+', newEnLine, flags=re.UNICODE): 459 | newEnLine = newEnLine.replace(bnNumeral, transliterate.process('Bengali', 'RomanReadable', bnNumeral), 1) 460 | 461 | return newBnLine, newEnLine 462 | 463 | def applyPatterns(line, patternMap): 464 | """ 465 | Returns: 466 | patternFound(bool) : Whether any of the patterns were found in the line 467 | line(str) : Transformed line using pattern replacements 468 | """ 469 | patternFound = False 470 | orignalLine = line 471 | for pattern, replacement in patternMap.items(): 472 | line = line.replace(pattern, replacement) 473 | if orignalLine != line: 474 | patternFound = True 475 | 476 | return patternFound, line 477 | 478 | def patternHandler(bnLine, enLine, verbose): 479 | _, newEnLine = applyPatterns(enLine, enPatternMap) 480 | _, newBnLine = applyPatterns(bnLine, bnPatternMap) 481 | 482 | if verbose: 483 | # if there is foreign text in the linepair after applying replacements 484 | if hasNonBangla(newBnLine) or hasNonEnglish(newEnLine): 485 | patternHandler.foreignPairs.append((newBnLine, newEnLine)) 486 | else: 487 | # if the original linepair had foreign characters 488 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 489 | patternHandler.savedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 490 | 491 | patternHandler.validPairs.append((newBnLine, newEnLine)) 492 | 493 | return newBnLine, newEnLine 494 | 495 | def patchHandler(bnLine, enLine, verbose): 496 | """ 497 | finds continous patches of non bangla text on bangla side and removes 498 | a found patch from both sides if it is also present in english side 499 | """ 500 | 501 | minPatchLength = 2 # this shouldn't be too small 502 | newBnLine, newEnLine = bnLine, enLine 503 | nonBanglaPatches = re.findall( 504 | INCLUSIVE_NON_BANGLA_PATCH, 505 | newBnLine, 506 | flags=re.UNICODE 507 | ) 508 | 509 | for patch in nonBanglaPatches: 510 | patch = patch.strip() 511 | 512 | # ignore whitespace only patches 513 | if patch == "": 514 | continue 515 | # ignore number only patches 516 | try: 517 | float(patch) 518 | continue 519 | except: 520 | pass 521 | # patch shouldnt end in starting braces 522 | if patch[-1] in ['[', '{', '(']: 523 | patch = patch[:-1].strip() 524 | # patch shouldnt start with ending braces/common punctuations 525 | if patch and patch[0] in [']', '}', ')', ',', '.']: 526 | patch = patch[1:].strip() 527 | 528 | # should the patch length be counted 
with spaces included?? 529 | # should all matching patches be removed in english side? 530 | if ( 531 | len(patch) >= minPatchLength and 532 | ( 533 | patch in newEnLine or 534 | patch.upper() in newEnLine or 535 | patch.lower() in newEnLine or 536 | patch.capitalize() in newEnLine 537 | ) 538 | ): 539 | newBnLine = newBnLine.replace(patch, "").strip() 540 | newEnLine = newEnLine.replace( 541 | patch, "" 542 | ).replace( 543 | patch.lower(), "" 544 | ).replace( 545 | patch.upper(), "" 546 | ).replace( 547 | patch.capitalize(), "" 548 | ).strip() 549 | 550 | 551 | if verbose: 552 | # if there is foreign text in the linepair after applying replacements 553 | if hasNonBangla(newBnLine) or hasNonEnglish(newEnLine): 554 | patchHandler.foreignPairs.append((newBnLine, newEnLine)) 555 | else: 556 | # if the original linepair had foreign characters 557 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 558 | patchHandler.savedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 559 | 560 | patchHandler.validPairs.append((newBnLine, newEnLine)) 561 | 562 | return newBnLine, newEnLine 563 | 564 | def alternatePatchHandler(bnLine, enLine, verbose): 565 | """ 566 | finds common patches in both lines and removes a patch 567 | from bothsides if the found patch is non bangla 568 | """ 569 | 570 | def isValidPatch(patch): 571 | patch = re.sub( 572 | INCLUSIVE_NON_BANGLA_PATCH, 573 | "", 574 | patch, 575 | flags=re.UNICODE 576 | ).strip() 577 | return patch == "" 578 | 579 | minPatchLength = 2 # this shouldn't be too small 580 | newBnLine, newEnLine = bnLine, enLine 581 | 582 | matcher = difflib.SequenceMatcher(None, newBnLine, newEnLine) 583 | potentialPatches = [newBnLine[match.a : match.a + match.size] for match in matcher.get_matching_blocks() if match.size > 0] 584 | 585 | for patch in potentialPatches: 586 | patch = patch.strip() 587 | 588 | # ignore whitespace only patches 589 | if patch == "": 590 | continue 591 | # ignore number only patches 592 | try: 593 | float(patch) 594 | continue 595 | except: 596 | pass 597 | 598 | # patch shouldnt end in starting braces 599 | if patch[-1] in ['[', '{', '(']: 600 | patch = patch[:-1].strip() 601 | # patch shouldnt start with ending braces/common punctuations 602 | if patch and patch[0] in [']', '}', ')', ',', '.']: 603 | patch = patch[1:].strip() 604 | 605 | # should the patch length be counted with spaces included?? 606 | # should all matching patches be removed in english side? 
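# The candidate patch below is removed only when it is entirely non-Bangla
# (isValidPatch), is at least minPatchLength characters long, and occurs in the
# English line in some casing; it is then stripped from the Bangla side, and
# every casing variant of it is stripped from the English side.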
607 | if ( 608 | isValidPatch(patch) and 609 | len(patch) >= minPatchLength and 610 | ( 611 | patch in newEnLine or 612 | patch.upper() in newEnLine or 613 | patch.lower() in newEnLine or 614 | patch.capitalize() in newEnLine 615 | ) 616 | ): 617 | newBnLine = newBnLine.replace(patch, "").strip() 618 | newEnLine = newEnLine.replace( 619 | patch, "" 620 | ).replace( 621 | patch.lower(), "" 622 | ).replace( 623 | patch.upper(), "" 624 | ).replace( 625 | patch.capitalize(), "" 626 | ).strip() 627 | 628 | if verbose: 629 | # if there is foreign text in the linepair after applying replacements 630 | if hasNonBangla(newBnLine) or hasNonEnglish(newEnLine): 631 | alternatePatchHandler.foreignPairs.append((newBnLine, newEnLine)) 632 | else: 633 | # if the original linepair had foreign characters 634 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 635 | alternatePatchHandler.savedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 636 | 637 | alternatePatchHandler.validPairs.append((newBnLine, newEnLine)) 638 | 639 | return newBnLine, newEnLine 640 | 641 | def bpediaPatchHandler(bnLine, enLine, verbose): 642 | """ 643 | removes non bangla patches from first-bracketed spans on bangla side. 644 | Specific to Banglapedia. 645 | """ 646 | newBnLine, newEnLine = bnLine, enLine 647 | bracketedSpans = re.findall(r'\([^\)]*?\)', newBnLine, flags=re.UNICODE) 648 | 649 | for span in bracketedSpans: 650 | if hasNonBangla(span): 651 | # if span contains no bangla text, remove it 652 | if not countBanglaChars(span): 653 | newBnLine = newBnLine.replace(span, "").strip() 654 | else: 655 | nonBanglaPatches = re.findall( 656 | INCLUSIVE_NON_BANGLA_PATCH, 657 | span, 658 | flags=re.UNICODE 659 | ) 660 | newSpan = span 661 | for patch in nonBanglaPatches: 662 | if isValidProcessedPair(patch, patch): 663 | newSpan = newSpan.replace(patch.strip(), "") 664 | 665 | newBnLine = newBnLine.replace(span, newSpan.strip()).strip() 666 | 667 | if verbose: 668 | # if there is foreign text in the linepair after applying replacements 669 | if hasNonBangla(newBnLine) or hasNonEnglish(newEnLine): 670 | bpediaPatchHandler.foreignPairs.append((newBnLine, newEnLine)) 671 | else: 672 | # if the original linepair had foreign characters 673 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 674 | bpediaPatchHandler.savedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 675 | 676 | bpediaPatchHandler.validPairs.append((newBnLine, newEnLine)) 677 | 678 | return newBnLine, newEnLine 679 | 680 | def transliterateHandler(bnLine, enLine, verbose): 681 | newBnLine, newEnLine = bnLine, enLine 682 | # convert numerals to appropriate script 683 | newBnLine, newEnLine = convertNumerals(newBnLine, newEnLine) 684 | 685 | charConvertMap = { 686 | 'a' : 'এ', 687 | 'b' : 'বি', 688 | 'c' : 'সি', 689 | 'd' : 'ডি', 690 | 'e' : 'ই', 691 | 'f' : 'এফ', 692 | 'g' : 'জি', 693 | 'h' : 'এইচ', 694 | 'i' : 'আই', 695 | 'j' : 'জে', 696 | 'k' : 'কে', 697 | 'l' : 'এল', 698 | 'm' : 'এম', 699 | 'n' : 'এন', 700 | 'o' : 'ও', 701 | 'p' : 'পি', 702 | 'q' : 'কিউ', 703 | 'r' : 'আর', 704 | 's' : 'এস', 705 | 't' : 'টি', 706 | 'u' : 'ইউ', 707 | 'v' : 'ভি', 708 | 'w' : 'ডব্লিউ', 709 | 'x' : 'এক্স', 710 | 'y' : 'ওআই', 711 | 'z' : 'জেড', 712 | } 713 | 714 | def customReplace(match): 715 | caughtChar = match.group(1).strip() 716 | return charConvertMap[caughtChar.lower()] 717 | 718 | newBnLine = re.sub(r'(\b[a-zA-Z]\b)', customReplace, newBnLine, flags=re.UNICODE) 719 | 720 | if verbose: 721 | # if there is foreign text in the linepair after applying replacements 722 | if 
hasNonBangla(newBnLine) or hasNonEnglish(newEnLine): 723 | transliterateHandler.foreignPairs.append((newBnLine, newEnLine)) 724 | else: 725 | # if the original linepair had foreign characters 726 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 727 | transliterateHandler.savedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 728 | 729 | transliterateHandler.validPairs.append((newBnLine, newEnLine)) 730 | 731 | return newBnLine, newEnLine 732 | 733 | def ratioHandler(bnLine, enLine, verbose): 734 | newBnLine, newEnLine = bnLine, enLine 735 | 736 | bnLowerThresh = .001 737 | bnUpperThresh = 10.0 738 | 739 | enLowerThresh = .001 740 | enUpperThresh = 10.0 741 | 742 | 743 | bnNativeChars = countBanglaChars(bnLine) 744 | bnForeignChars = len(bnLine) - bnNativeChars - countNeutralChars(bnLine) 745 | bnRatio = bnForeignChars/bnNativeChars 746 | 747 | enNativeChars = countEnglishChars(enLine) 748 | enForeignChars = len(enLine) - enNativeChars - countNeutralChars(enLine) 749 | enRatio = enForeignChars/enNativeChars 750 | 751 | if verbose: 752 | # if there is foreign text in the linepair after applying replacements 753 | if ( 754 | bnRatio >= bnUpperThresh or 755 | enRatio >= enUpperThresh or 756 | bnRatio <= bnLowerThresh or 757 | enRatio <= enLowerThresh or 758 | countBanglaChars(enLine) or countEnglishChars(bnLine) 759 | ): 760 | ratioHandler.foreignPairs.append((newBnLine, newEnLine)) 761 | else: 762 | # if the original linepair had foreign characters 763 | if bnForeignChars or enForeignChars: 764 | ratioHandler.savedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 765 | 766 | ratioHandler.validPairs.append((newBnLine, newEnLine)) 767 | 768 | return newBnLine, newEnLine 769 | 770 | def cleanSentencePairs(linePairs, options, output_dir, verbose=True): 771 | for option, choice in options.items(): 772 | if choice: 773 | globals()[f'{option}Handler'].validPairs = multiprocessing.Manager().list() 774 | globals()[f'{option}Handler'].foreignPairs = multiprocessing.Manager().list() 775 | globals()[f'{option}Handler'].savedPairs = multiprocessing.Manager().list() 776 | 777 | 778 | global finalValidPairs, finalForeignPairs, finalSavedPairs, initialValidPairs, initialForeignPairs 779 | 780 | finalValidPairs, finalForeignPairs, finalSavedPairs = ( 781 | multiprocessing.Manager().list(), 782 | multiprocessing.Manager().list(), 783 | multiprocessing.Manager().list() 784 | ) 785 | initialValidPairs, initialForeignPairs = ( 786 | multiprocessing.Manager().list(), 787 | multiprocessing.Manager().list() 788 | ) 789 | 790 | @globalize 791 | def processPair(bnLine, enLine): 792 | validPair = False 793 | 794 | if bnLine == enLine or (not countBanglaChars(bnLine) or not countEnglishChars(enLine)): 795 | return 796 | 797 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 798 | initialForeignPairs.append((bnLine, enLine)) 799 | else: 800 | initialValidPairs.append((bnLine, enLine)) 801 | validPair = True 802 | 803 | # apply all selected transformations 804 | newBnLine, newEnLine = bnLine, enLine 805 | 806 | newBnLine = normalize(newBnLine).strip() 807 | newEnLine = normalize(newEnLine).strip() 808 | newBnLine, newEnLine = replaceEmojis(newBnLine, newEnLine) 809 | 810 | # remove unnecessary bracketed spans; this needs to be done first for patch removal to work 811 | newBnLine = re.sub( 812 | BRACKETED_SPANS, 813 | "", 814 | newBnLine, 815 | flags=re.UNICODE 816 | ) 817 | newEnLine = re.sub( 818 | BRACKETED_SPANS, 819 | "", 820 | newEnLine, 821 | flags=re.UNICODE 822 | ) 823 | 824 | for option, choice in 
options.items(): 825 | if choice and not validPair: 826 | newBnLine, newEnLine = globals()[f'{option}Handler'](newBnLine, newEnLine, verbose) 827 | 828 | # do some final postprocessing 829 | if not isValidProcessedPair(newBnLine, newEnLine): 830 | return 831 | 832 | # remove unnecessary bracketed spans (after patch deletion) 833 | newBnLine = re.sub( 834 | BRACKETED_SPANS, 835 | "", 836 | newBnLine, 837 | flags=re.UNICODE 838 | ) 839 | newEnLine = re.sub( 840 | BRACKETED_SPANS, 841 | "", 842 | newEnLine, 843 | flags=re.UNICODE 844 | ) 845 | 846 | newBnLine = re.sub(r'\s+', ' ', newBnLine) 847 | newEnLine = re.sub(r'\s+', ' ', newEnLine) 848 | 849 | 850 | # if there is foreign text in the linepair after applying transformations 851 | if hasNonBangla(newBnLine) or hasNonEnglish(newEnLine): 852 | finalForeignPairs.append((bnLine, enLine)) 853 | else: 854 | # if the original linepair had foreign characters 855 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 856 | finalSavedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 857 | 858 | finalValidPairs.append((newBnLine, newEnLine)) 859 | 860 | print('\tStarting processes...') 861 | with Pool() as pool: 862 | pool.starmap(processPair, linePairs) 863 | 864 | print('\tWriting logs and outputs...') 865 | # write the final and temporary filepairs to appropriate directories 866 | writeFilePairs( 867 | os.path.join(output_dir, 'Final'), 868 | finalValidPairs, finalForeignPairs, finalSavedPairs 869 | ) 870 | 871 | if verbose: 872 | writeFilePairs( 873 | os.path.join(output_dir, "tmp", "Initial"), 874 | initialValidPairs, initialForeignPairs) 875 | 876 | for option, choice in options.items(): 877 | if choice: 878 | funcName = globals()[f'{option}Handler'] 879 | writeFilePairs( 880 | os.path.join(output_dir, "tmp", option), 881 | funcName.validPairs, 882 | funcName.foreignPairs, 883 | funcName.savedPairs 884 | ) 885 | 886 | 887 | def col(id): 888 | if id == 1: return "\033[32m" 889 | if id == 2: return "\033[33m" 890 | if id == 3: return "\033[31m" 891 | return "\033[0m" 892 | 893 | def cleanup(dirname): 894 | for sub_dir in ["Final", "tmp"]: 895 | if os.path.isdir(os.path.join(dirname, sub_dir)): 896 | shutil.rmtree(os.path.join(dirname, sub_dir)) 897 | 898 | def _merge(input_files, output_file): 899 | with open(output_file, 'w') as outf: 900 | for input_file in input_files: 901 | with open(input_file) as inpf: 902 | for line in inpf: 903 | print(line.strip(), file=outf) 904 | 905 | def _train(input_file, output_prefix, coverage, vocab_size, model_type): 906 | cmd = [ 907 | f"spm_train --input=\"{input_file}\"", 908 | f"--model_prefix=\"{output_prefix}\"", 909 | f"--vocab_size={vocab_size}", 910 | f"--character_coverage={coverage}", 911 | "--train_extremely_large_corpus" 912 | ] 913 | os.system(" ".join(cmd)) 914 | 915 | def main(args): 916 | global enPatternMap, bnPatternMap 917 | enPatternMap, bnPatternMap = readReplacePatterns( 918 | os.path.join(os.path.dirname(__file__), "replacePatterns.txt") 919 | ) 920 | if os.path.isdir(args.output_dir): 921 | shutil.rmtree(args.output_dir) 922 | 923 | if args.normalize: 924 | iterator = tqdm( 925 | glob.glob(os.path.join(args.input_dir, "**", "*.bn"), recursive=True) + 926 | glob.glob(os.path.join(args.input_dir, "**", "*.en"), recursive=True), 927 | desc="Normalizing files" 928 | ) 929 | for input_file in iterator: 930 | output_file = input_file.replace( 931 | os.path.normpath(args.input_dir), 932 | os.path.normpath(args.output_dir) 933 | ) 934 | os.makedirs(os.path.dirname(output_file), exist_ok=True) 
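# In --normalize mode each input file is simply rewritten line by line into the
# mirrored path under output_dir: unicode normalization and the static
# replacePatterns substitutions are applied, but no pair-wise filtering or
# sentencepiece training is performed on this branch.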
935 | with open(output_file, 'w') as outf: 936 | with open(input_file) as inpf: 937 | for line in inpf: 938 | line = normalize(line) 939 | _, line = applyPatterns(line, enPatternMap) 940 | _, line = applyPatterns(line, bnPatternMap) 941 | print(line.strip(), file=outf) 942 | else: 943 | cleanup(args.output_dir) 944 | os.makedirs(os.path.join(args.output_dir, "data"), exist_ok=True) 945 | 946 | linePair_list = [] 947 | for bnFile in glob.glob(os.path.join(args.input_dir, "**", f"*.bn"), recursive=True): 948 | enFile = bnFile[:-3] + ".en" 949 | if not os.path.isfile(enFile): 950 | continue 951 | linePair_list.append(readFilePair(bnFile, enFile)) 952 | 953 | linePairs = list(chain.from_iterable(linePair_list)) 954 | 955 | print(col(2) + 'Starting Stage 1...' + col(0)) 956 | cleanSentencePairs(linePairs, {'pattern': True}, args.output_dir) 957 | 958 | shutil.copy( 959 | os.path.join(args.output_dir, "Final", "cleaned.bn"), 960 | os.path.join(args.output_dir, "data", "data1.bn") 961 | ) 962 | shutil.copy( 963 | os.path.join(args.output_dir, "Final", "cleaned.en"), 964 | os.path.join(args.output_dir, "data", "data1.en") 965 | ) 966 | shutil.copy( 967 | os.path.join(args.output_dir, "tmp", "pattern", "filtered.bn"), 968 | os.path.join(args.output_dir, "data", "stage1Filtered.bn") 969 | ) 970 | shutil.copy( 971 | os.path.join(args.output_dir, "tmp", "pattern", "filtered.en"), 972 | os.path.join(args.output_dir, "data", "stage1Filtered.en") 973 | ) 974 | 975 | print(col(2) + 'Starting Stage 2...' + col(0)) 976 | cleanup(args.output_dir) 977 | linePairs = readFilePair( 978 | os.path.join(args.output_dir, "data", "stage1Filtered.bn"), 979 | os.path.join(args.output_dir, "data", "stage1Filtered.en") 980 | ) 981 | cleanSentencePairs(linePairs, {'ratio': True}, args.output_dir) 982 | 983 | shutil.copy( 984 | os.path.join(args.output_dir, "tmp", "ratio", "cleaned.bn"), 985 | os.path.join(args.output_dir, "data", "data2.bn") 986 | ) 987 | shutil.copy( 988 | os.path.join(args.output_dir, "tmp", "ratio", "cleaned.en"), 989 | os.path.join(args.output_dir, "data", "data2.en") 990 | ) 991 | shutil.copy( 992 | os.path.join(args.output_dir, "tmp", "ratio", "filtered.bn"), 993 | os.path.join(args.output_dir, "data", "stage2Filtered.bn") 994 | ) 995 | shutil.copy( 996 | os.path.join(args.output_dir, "tmp", "ratio", "filtered.en"), 997 | os.path.join(args.output_dir, "data", "stage2Filtered.en") 998 | ) 999 | 1000 | 1001 | print(col(2) + 'Starting Stage 3...' 
+ col(0)) 1002 | cleanup(args.output_dir) 1003 | linePairs = readFilePair( 1004 | os.path.join(args.output_dir, "data", "stage2Filtered.bn"), 1005 | os.path.join(args.output_dir, "data", "stage2Filtered.en") 1006 | ) 1007 | cleanSentencePairs( 1008 | linePairs, 1009 | { 1010 | 'patch': True, 1011 | 'alternatePatch': True, 1012 | 'transliterate': True 1013 | }, 1014 | args.output_dir 1015 | ) 1016 | 1017 | shutil.copy( 1018 | os.path.join(args.output_dir, "Final", "cleaned.bn"), 1019 | os.path.join(args.output_dir, "data", "data3.bn") 1020 | ) 1021 | shutil.copy( 1022 | os.path.join(args.output_dir, "Final", "cleaned.en"), 1023 | os.path.join(args.output_dir, "data", "data3.en") 1024 | ) 1025 | 1026 | _merge( 1027 | [ 1028 | os.path.join(args.output_dir, "data", "data1.bn"), 1029 | os.path.join(args.output_dir, "data", "data2.bn"), 1030 | os.path.join(args.output_dir, "data", "data3.bn"), 1031 | ], 1032 | os.path.join(args.output_dir, "combined.bn") 1033 | ) 1034 | _merge( 1035 | [ 1036 | os.path.join(args.output_dir, "data", "data1.en"), 1037 | os.path.join(args.output_dir, "data", "data2.en"), 1038 | os.path.join(args.output_dir, "data", "data3.en"), 1039 | ], 1040 | os.path.join(args.output_dir, "combined.en") 1041 | ) 1042 | _merge( 1043 | [ 1044 | os.path.join(args.output_dir, "data", "data1.bn"), 1045 | os.path.join(args.output_dir, "data", "data3.bn"), 1046 | ], 1047 | os.path.join(args.output_dir, "vocab_train.bn") 1048 | ) 1049 | _merge( 1050 | [ 1051 | os.path.join(args.output_dir, "data", "data1.en"), 1052 | os.path.join(args.output_dir, "data", "data3.en"), 1053 | ], 1054 | os.path.join(args.output_dir, "vocab_train.en") 1055 | ) 1056 | 1057 | _train( 1058 | os.path.join(args.output_dir, "vocab_train.bn"), 1059 | os.path.join(args.output_dir, "bn"), 1060 | args.bn_coverage, 1061 | args.bn_vocab_size, 1062 | args.bn_model_type 1063 | ) 1064 | 1065 | _train( 1066 | os.path.join(args.output_dir, "vocab_train.en"), 1067 | os.path.join(args.output_dir, "en"), 1068 | args.en_coverage, 1069 | args.en_vocab_size, 1070 | args.en_model_type 1071 | ) 1072 | 1073 | shutil.rmtree(os.path.join(args.output_dir, "data")) 1074 | os.remove(os.path.join(args.output_dir, "vocab_train.bn")) 1075 | os.remove(os.path.join(args.output_dir, "vocab_train.en")) 1076 | cleanup(args.output_dir) 1077 | 1078 | 1079 | 1080 | if __name__ == "__main__": 1081 | parser = argparse.ArgumentParser() 1082 | parser.add_argument( 1083 | '--input_dir', '-i', type=str, 1084 | required=True, 1085 | metavar='PATH', 1086 | help="Input directory") 1087 | 1088 | parser.add_argument( 1089 | '--output_dir', '-o', type=str, 1090 | required=True, 1091 | metavar='PATH', 1092 | help="Output directory") 1093 | 1094 | parser.add_argument('--normalize', action='store_true', 1095 | help='Only normalize the files in input directory') 1096 | 1097 | parser.add_argument( 1098 | '--bn_vocab_size', type=int, default=32000, 1099 | help='bengali vocab size') 1100 | 1101 | parser.add_argument( 1102 | '--en_vocab_size', type=int, default=32000, 1103 | help='english vocab size') 1104 | 1105 | parser.add_argument( 1106 | '--bn_model_type', type=str, default="unigram", 1107 | help='bengali sentencepiece model type') 1108 | 1109 | parser.add_argument( 1110 | '--en_model_type', type=str, default="unigram", 1111 | help='english sentencepiece model type') 1112 | 1113 | parser.add_argument( 1114 | '--bn_coverage', type=float, default=1.0, 1115 | help='bengali character coverage') 1116 | 1117 | parser.add_argument( 1118 | '--en_coverage', type=float, 
default=1.0, 1119 | help='english character coverage') 1120 | 1121 | args = parser.parse_args() 1122 | main(args) 1123 | 1124 | -------------------------------------------------------------------------------- /training/preprocessing/remove_evaluation_pairs.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import argparse 3 | import sys 4 | import shutil 5 | import pyonmttok 6 | import os 7 | import glob 8 | import math 9 | from tqdm import tqdm 10 | 11 | def get_linepairs(args, data_type): 12 | linepairs = set() 13 | 14 | for src_file in glob.glob(os.path.join(args.input_dir, "data", f"*.{data_type}.{args.src_lang}")): 15 | tgt_file_prefix = src_file.rsplit(f".{data_type}.{args.src_lang}", 1)[0] + f".{data_type}.{args.tgt_lang}" 16 | tgt_files = glob.glob(tgt_file_prefix + "*") 17 | 18 | if tgt_files: 19 | for tgt_file in tgt_files: 20 | with open(src_file) as fs, open(tgt_file) as ft: 21 | for src_line, tgt_line in zip(fs, ft): 22 | linepairs.add( 23 | (src_line.strip(), tgt_line.strip()) 24 | ) 25 | return linepairs 26 | 27 | def main(args): 28 | exclude_linepairs = set() 29 | exclude_linepairs.update( 30 | get_linepairs(args, "valid") 31 | ) 32 | exclude_linepairs.update( 33 | get_linepairs(args, "test") 34 | ) 35 | 36 | os.makedirs(args.output_dir, exist_ok=True) 37 | with open(os.path.join(args.output_dir, f"corpus.train.{args.src_lang}"), 'w') as srcF, \ 38 | open(os.path.join(args.output_dir, f"corpus.train.{args.tgt_lang}"), 'w') as tgtF: 39 | 40 | for src_file in glob.glob(os.path.join(args.input_dir, "data", f"*.train.{args.src_lang}")): 41 | tgt_file_prefix = src_file.rsplit(f".train.{args.src_lang}", 1)[0] + f".train.{args.tgt_lang}" 42 | tgt_files = glob.glob(tgt_file_prefix + "*") 43 | 44 | if tgt_files: 45 | # when multiple references are present, pick the first one 46 | tgt_file = tgt_files[0] 47 | 48 | with open(src_file) as fs, open(tgt_file) as ft: 49 | for src_line, tgt_line in zip(fs, ft): 50 | src_line = src_line.strip() 51 | tgt_line = tgt_line.strip() 52 | 53 | if (src_line, tgt_line) not in exclude_linepairs: 54 | print(src_line, file=srcF) 55 | print(tgt_line, file=tgtF) 56 | 57 | 58 | if __name__ == "__main__": 59 | parser = argparse.ArgumentParser() 60 | parser.add_argument( 61 | '--input_dir', '-i', type=str, 62 | required=True, 63 | metavar='PATH', 64 | help="Input directory") 65 | 66 | parser.add_argument( 67 | '--output_dir', '-o', type=str, 68 | required=True, 69 | metavar='PATH', 70 | help="Output directory") 71 | 72 | parser.add_argument( 73 | '--src_lang', type=str, 74 | required=True, 75 | help="Source language") 76 | 77 | parser.add_argument( 78 | '--tgt_lang', type=str, 79 | required=True, 80 | help="Target language") 81 | 82 | args = parser.parse_args() 83 | main(args) 84 | 85 | 86 | 87 | -------------------------------------------------------------------------------- /training/preprocessing/replacePatterns.txt: -------------------------------------------------------------------------------- 1 | ¹:1 2 | ²:2 3 | ³:3 4 | À:A 5 | Á:A 6 | Â:A 7 | Ã:A 8 | Ä:A 9 | Å:A 10 | Ā:A 11 | Ă:A 12 | Ą:A 13 | Ǎ:A 14 | Ǟ:A 15 | Ǡ:A 16 | Ǻ:A 17 | Ȁ:A 18 | Ȃ:A 19 | Ȧ:A 20 | Ḁ:A 21 | Ạ:A 22 | Ả:A 23 | Ấ:A 24 | Ầ:A 25 | Ẩ:A 26 | Ẫ:A 27 | Ậ:A 28 | Ắ:A 29 | Ằ:A 30 | Ẳ:A 31 | Ẵ:A 32 | Ặ:A 33 | à:a 34 | á:a 35 | â:a 36 | ã:a 37 | ä:a 38 | å:a 39 | ª:a 40 | ā:a 41 | ă:a 42 | ą:a 43 | ǎ:a 44 | ǟ:a 45 | ǡ:a 46 | ǻ:a 47 | ȁ:a 48 | ȃ:a 49 | ȧ:a 50 | ḁ:a 51 | ạ:a 52 | ả:a 53 | ấ:a 54 | ầ:a 55 | ẩ:a 56 | ẫ:a 57 | ậ:a 58 
| ắ:a 59 | ằ:a 60 | ẳ:a 61 | ẵ:a 62 | ặ:a 63 | Ḃ:B 64 | Ḅ:B 65 | Ḇ:B 66 | ḃ:b 67 | ḅ:b 68 | ḇ:b 69 | Ç:C 70 | Ć:C 71 | Ĉ:C 72 | Ċ:C 73 | Č:C 74 | Ḉ:C 75 | ç:c 76 | ć:c 77 | ĉ:c 78 | ċ:c 79 | č:c 80 | ḉ:c 81 | Ð:D 82 | Ď:D 83 | Đ:D 84 | Ḋ:D 85 | Ḍ:D 86 | Ḏ:D 87 | Ḑ:D 88 | Ḓ:D 89 | ď:d 90 | đ:d 91 | ḋ:d 92 | ḍ:d 93 | ḏ:d 94 | ḑ:d 95 | ḓ:d 96 | È:E 97 | É:E 98 | Ê:E 99 | Ë:E 100 | Ē:E 101 | Ĕ:E 102 | Ė:E 103 | Ę:E 104 | Ě:E 105 | Ȅ:E 106 | Ȇ:E 107 | Ȩ:E 108 | Ḕ:E 109 | Ḗ:E 110 | Ḙ:E 111 | Ḛ:E 112 | Ḝ:E 113 | Ẹ:E 114 | Ẻ:E 115 | Ẽ:E 116 | Ế:E 117 | Ề:E 118 | Ể:E 119 | Ễ:E 120 | Ệ:E 121 | è:e 122 | é:e 123 | ê:e 124 | ë:e 125 | ē:e 126 | ĕ:e 127 | ė:e 128 | ę:e 129 | ě:e 130 | ȅ:e 131 | ȇ:e 132 | ȩ:e 133 | ḕ:e 134 | ḗ:e 135 | ḙ:e 136 | ḛ:e 137 | ḝ:e 138 | ẹ:e 139 | ẻ:e 140 | ẽ:e 141 | ế:e 142 | ề:e 143 | ể:e 144 | ễ:e 145 | ệ:e 146 | Ḟ:F 147 | ḟ:f 148 | Ĝ:G 149 | Ğ:G 150 | Ġ:G 151 | Ģ:G 152 | Ǧ:G 153 | Ǵ:G 154 | Ḡ:G 155 | ĝ:g 156 | ğ:g 157 | ġ:g 158 | ģ:g 159 | ǧ:g 160 | ǵ:g 161 | ḡ:g 162 | Ĥ:H 163 | Ħ:H 164 | Ȟ:H 165 | Ḣ:H 166 | Ḥ:H 167 | Ḧ:H 168 | Ḩ:H 169 | Ḫ:H 170 | ĥ:h 171 | ħ:h 172 | ȟ:h 173 | ḣ:h 174 | ḥ:h 175 | ḧ:h 176 | ḩ:h 177 | ḫ:h 178 | ẖ:h 179 | Ì:I 180 | Í:I 181 | Î:I 182 | Ï:I 183 | Ĩ:I 184 | Ī:I 185 | Ĭ:I 186 | Į:I 187 | İ:I 188 | Ǐ:I 189 | Ȉ:I 190 | Ȋ:I 191 | Ḭ:I 192 | Ḯ:I 193 | Ỉ:I 194 | Ị:I 195 | ì:i 196 | í:i 197 | î:i 198 | ï:i 199 | ĩ:i 200 | ī:i 201 | ĭ:i 202 | į:i 203 | ı:i 204 | ǐ:i 205 | ȉ:i 206 | ȋ:i 207 | ḭ:i 208 | ḯ:i 209 | ỉ:i 210 | ị:i 211 | Ĵ:J 212 | ĵ:j 213 | Ķ:K 214 | Ǩ:K 215 | Ḱ:K 216 | Ḳ:K 217 | Ḵ:K 218 | ķ:k 219 | ǩ:k 220 | ḱ:k 221 | ḳ:k 222 | ḵ:k 223 | Ĺ:L 224 | Ļ:L 225 | Ľ:L 226 | Ŀ:L 227 | Ł:L 228 | Ḷ:L 229 | Ḹ:L 230 | Ḻ:L 231 | Ḽ:L 232 | ĺ:l 233 | ļ:l 234 | ľ:l 235 | ŀ:l 236 | ł:l 237 | ḷ:l 238 | ḹ:l 239 | ḻ:l 240 | ḽ:l 241 | Ḿ:M 242 | Ṁ:M 243 | Ṃ:M 244 | ḿ:m 245 | ṁ:m 246 | ṃ:m 247 | Ñ:N 248 | Ń:N 249 | Ņ:N 250 | Ň:N 251 | Ǹ:N 252 | Ṅ:N 253 | Ṇ:N 254 | Ṉ:N 255 | Ṋ:N 256 | ñ:n 257 | ń:n 258 | ņ:n 259 | ň:n 260 | ǹ:n 261 | ṅ:n 262 | ṇ:n 263 | ṉ:n 264 | ṋ:n 265 | Ò:O 266 | Ó:O 267 | Ô:O 268 | Õ:O 269 | Ö:O 270 | Ō:O 271 | Ŏ:O 272 | Ő:O 273 | Ơ:O 274 | Ǒ:O 275 | Ǫ:O 276 | Ǭ:O 277 | Ȍ:O 278 | Ȏ:O 279 | Ȫ:O 280 | Ȭ:O 281 | Ȯ:O 282 | Ȱ:O 283 | Ṍ:O 284 | Ṏ:O 285 | Ṑ:O 286 | Ṓ:O 287 | Ọ:O 288 | Ỏ:O 289 | Ố:O 290 | Ồ:O 291 | Ổ:O 292 | Ỗ:O 293 | Ộ:O 294 | Ớ:O 295 | Ờ:O 296 | Ở:O 297 | Ỡ:O 298 | Ợ:O 299 | ò:o 300 | ó:o 301 | ô:o 302 | õ:o 303 | ö:o 304 | ō:o 305 | ŏ:o 306 | ő:o 307 | ơ:o 308 | ǒ:o 309 | ǫ:o 310 | ǭ:o 311 | ȍ:o 312 | ȏ:o 313 | ȫ:o 314 | ȭ:o 315 | ȯ:o 316 | ȱ:o 317 | ṍ:o 318 | ṏ:o 319 | ṑ:o 320 | ṓ:o 321 | ọ:o 322 | ỏ:o 323 | ố:o 324 | ồ:o 325 | ổ:o 326 | ỗ:o 327 | ộ:o 328 | ớ:o 329 | ờ:o 330 | ở:o 331 | ỡ:o 332 | ợ:o 333 | Ṕ:P 334 | Ṗ:P 335 | ṕ:p 336 | ṗ:p 337 | Ŕ:R 338 | Ŗ:R 339 | Ř:R 340 | Ȑ:R 341 | Ȓ:R 342 | Ṙ:R 343 | Ṛ:R 344 | Ṝ:R 345 | Ṟ:R 346 | ŕ:r 347 | ŗ:r 348 | ř:r 349 | ȑ:r 350 | ȓ:r 351 | ṙ:r 352 | ṛ:r 353 | ṝ:r 354 | ṟ:r 355 | Ś:S 356 | Ŝ:S 357 | Ş:S 358 | Š:S 359 | Ș:S 360 | Ṡ:S 361 | Ṣ:S 362 | Ṥ:S 363 | Ṧ:S 364 | Ṩ:S 365 | ś:s 366 | ŝ:s 367 | ş:s 368 | š:s 369 | ș:s 370 | ṡ:s 371 | ṣ:s 372 | ṥ:s 373 | ṧ:s 374 | ṩ:s 375 | Ţ:T 376 | Ť:T 377 | Ŧ:T 378 | Ț:T 379 | Ṫ:T 380 | Ṭ:T 381 | Ṯ:T 382 | Ṱ:T 383 | ţ:t 384 | ť:t 385 | ŧ:t 386 | ț:t 387 | ṫ:t 388 | ṭ:t 389 | ṯ:t 390 | ṱ:t 391 | ẗ:t 392 | Ù:U 393 | Ú:U 394 | Û:U 395 | Ü:U 396 | Ũ:U 397 | Ū:U 398 | Ŭ:U 399 | Ů:U 400 | Ű:U 401 | Ų:U 402 | Ư:U 403 | Ǔ:U 404 | Ǖ:U 405 | Ǘ:U 406 | Ǚ:U 407 | Ǜ:U 408 | Ȕ:U 409 | Ȗ:U 410 | Ṳ:U 411 | Ṵ:U 412 | Ṷ:U 413 | Ṹ:U 414 | Ṻ:U 415 | Ụ:U 416 | Ủ:U 417 | Ứ:U 
418 | Ừ:U 419 | Ử:U 420 | Ữ:U 421 | Ự:U 422 | ù:u 423 | ú:u 424 | û:u 425 | ü:u 426 | ũ:u 427 | ū:u 428 | ŭ:u 429 | ů:u 430 | ű:u 431 | ų:u 432 | ư:u 433 | ǔ:u 434 | ǖ:u 435 | ǘ:u 436 | ǚ:u 437 | ǜ:u 438 | ȕ:u 439 | ȗ:u 440 | ṳ:u 441 | ṵ:u 442 | ṷ:u 443 | ṹ:u 444 | ṻ:u 445 | ụ:u 446 | ủ:u 447 | ứ:u 448 | ừ:u 449 | ử:u 450 | ữ:u 451 | ự:u 452 | Ṽ:V 453 | Ṿ:V 454 | ṽ:v 455 | ṿ:v 456 | Ŵ:W 457 | Ẁ:W 458 | Ẃ:W 459 | Ẅ:W 460 | Ẇ:W 461 | Ẉ:W 462 | ŵ:w 463 | ẁ:w 464 | ẃ:w 465 | ẅ:w 466 | ẇ:w 467 | ẉ:w 468 | ẘ:w 469 | Ẋ:X 470 | Ẍ:X 471 | ẋ:x 472 | ẍ:x 473 | Ý:Y 474 | Ŷ:Y 475 | Ÿ:Y 476 | Ȳ:Y 477 | Ẏ:Y 478 | Ỳ:Y 479 | Ỵ:Y 480 | Ỷ:Y 481 | Ỹ:Y 482 | ý:y 483 | ÿ:y 484 | ŷ:y 485 | ȳ:y 486 | ẏ:y 487 | ỳ:y 488 | ỵ:y 489 | ỷ:y 490 | ỹ:y 491 | ẙ:y 492 | Ź:Z 493 | Ż:Z 494 | Ž:Z 495 | Ẑ:Z 496 | Ẓ:Z 497 | Ẕ:Z 498 | ź:z 499 | ż:z 500 | ž:z 501 | ẑ:z 502 | ẓ:z 503 | ẕ:z 504 | IJ:IJ 505 | ij:ij 506 | 507 | 508 | 509 | 510 | 511 | 512 | ú:u 513 | ó:o 514 | é:e 515 | á:a 516 | à:a 517 | ñ:n 518 | è:e 519 | ê:e 520 | û:u 521 | ç:c 522 | â:a 523 | ô:o 524 | ē:e 525 | ā:a 526 | ṇ:n 527 | ḍ:d 528 | ṛ:r 529 | ū:u 530 | Ā:A 531 | ṣ:s 532 | ō:o 533 | ý:y 534 | ü:u 535 | Ý:y 536 | Ł:L 537 | ã:a 538 | ś:s 539 | ä:a 540 | ö:o 541 | ń:n 542 | ł:l 543 | ę:e 544 | Š:s 545 | š:s 546 | ć:c 547 | đ:d 548 | ž:z 549 | ả:a 550 | Ớ:O 551 | ớ:o 552 | ấ:a 553 | ö:o 554 | ő:o 555 | 556 | ộ:o 557 | ạ:a 558 | 559 | ò:o 560 | ş:s 561 | 562 | ğ:g 563 | ą:a 564 | ø:o 565 | Ø:O 566 | ơ:o 567 | Đ:D 568 | ứ:u 569 | ế:e 570 | Č:C 571 | ň:n 572 | 573 | ë:e 574 | å:a 575 | č:c 576 | ริ:y 577 | 578 | 579 | ř:r 580 | ÿ:y 581 | ȓ:r 582 | ầ:a 583 | ũ:u 584 | 585 | ă:a 586 | ţ:t 587 | ễ:e 588 | ệ:e 589 | 590 | ù:u 591 | ọ:o 592 | ừ:u 593 | ỳ:y 594 | ồ:o 595 | ề:e 596 | ư:u 597 | ự:u 598 | ậ:a 599 | ắ:a 600 | 601 | 602 | 603 | 604 | í:i 605 | ī:i 606 | ì:i 607 | î:i 608 | ï:i 609 | ị:i 610 | ĩ:i 611 | ɨ:i 612 | 613 | Ð:D 614 | ð:d -------------------------------------------------------------------------------- /training/seq2seq/.gitignore: -------------------------------------------------------------------------------- 1 | # repo-specific stuff 2 | pred.txt 3 | *.pt 4 | \#*# 5 | .idea 6 | *.sublime-* 7 | .DS_Store 8 | data/ 9 | thesis/Models 10 | thesis/glove_dir 11 | thesis/data 12 | thesis/Preprocessed_files 13 | 14 | # Byte-compiled / optimized / DLL files 15 | __pycache__/ 16 | *.py[cod] 17 | *$py.class 18 | 19 | # C extensions 20 | *.so 21 | 22 | # Distribution / packaging 23 | .Python 24 | build/ 25 | develop-eggs/ 26 | dist/ 27 | downloads/ 28 | eggs/ 29 | .eggs/ 30 | lib/ 31 | lib64/ 32 | parts/ 33 | sdist/ 34 | var/ 35 | wheels/ 36 | *.egg-info/ 37 | .installed.cfg 38 | *.egg 39 | 40 | # PyInstaller 41 | # Usually these files are written by a python script from a template 42 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
43 | *.manifest 44 | *.spec 45 | 46 | # Installer logs 47 | pip-log.txt 48 | pip-delete-this-directory.txt 49 | 50 | # Unit test / coverage reports 51 | htmlcov/ 52 | .tox/ 53 | .coverage 54 | .coverage.* 55 | .cache 56 | nosetests.xml 57 | coverage.xml 58 | *.cover 59 | .hypothesis/ 60 | 61 | # Translations 62 | *.mo 63 | *.pot 64 | 65 | # Django stuff: 66 | *.log 67 | local_settings.py 68 | 69 | # Flask stuff: 70 | instance/ 71 | .webassets-cache 72 | 73 | # Scrapy stuff: 74 | .scrapy 75 | 76 | # Sphinx documentation 77 | docs/_build/ 78 | 79 | # PyBuilder 80 | target/ 81 | 82 | # Jupyter Notebook 83 | .ipynb_checkpoints 84 | 85 | # pyenv 86 | .python-version 87 | 88 | # celery beat schedule file 89 | celerybeat-schedule 90 | 91 | # SageMath parsed files 92 | *.sage.py 93 | 94 | # Environments 95 | .env 96 | .venv 97 | env/ 98 | venv/ 99 | ENV/ 100 | 101 | # Spyder project settings 102 | .spyderproject 103 | .spyproject 104 | 105 | # Rope project settings 106 | .ropeproject 107 | 108 | # mkdocs documentation 109 | /site 110 | 111 | # mypy 112 | .mypy_cache/ 113 | 114 | # Tensorboard 115 | runs/ 116 | -------------------------------------------------------------------------------- /training/seq2seq/dataProcessor.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import argparse 3 | import sys 4 | import shutil 5 | import pyonmttok 6 | import os 7 | import glob 8 | import math 9 | from tqdm import tqdm 10 | 11 | def createFolders(args): 12 | required_dirnames = [ 13 | "data", 14 | "Outputs", 15 | "temp", 16 | "Preprocessed", 17 | "Reports", 18 | "Models" 19 | ] 20 | 21 | # do cleanup first 22 | for dirname in required_dirnames[:4]: 23 | if os.path.isdir(os.path.join(args.output_dir, dirname)): 24 | shutil.rmtree(os.path.join(args.output_dir, dirname)) 25 | 26 | for dirname in required_dirnames: 27 | os.makedirs(os.path.join(args.output_dir, dirname), exist_ok=True) 28 | 29 | def _merge(args, data_type): 30 | with open(os.path.join(args.output_dir, "data", f"src-{data_type}.txt"), 'w') as srcF, \ 31 | open(os.path.join(args.output_dir, "data", f"tgt-{data_type}.txt"), 'w') as tgtF: 32 | 33 | for src_file in glob.glob(os.path.join(args.input_dir, "data", f"*.{data_type}.{args.src_lang}")): 34 | tgt_file_prefix = src_file.rsplit(f".{data_type}.{args.src_lang}", 1)[0] + f".{data_type}.{args.tgt_lang}" 35 | tgt_files = glob.glob(tgt_file_prefix + "*") 36 | 37 | if tgt_files: 38 | # when multiple references are present, pick the first one 39 | tgt_file = tgt_files[0] 40 | 41 | with open(src_file) as f: 42 | for line in f: 43 | print(line.strip(), file=srcF) 44 | 45 | with open(tgt_file) as f: 46 | for line in f: 47 | print(line.strip(), file=tgtF) 48 | 49 | def _move(args, dataset_category): 50 | for src_file in glob.glob(os.path.join(args.input_dir, "data", f"*.{dataset_category}.{args.src_lang}")): 51 | tgt_file_prefix = src_file.rsplit(f".{dataset_category}.{args.src_lang}", 1)[0] + f".{dataset_category}.{args.tgt_lang}" 52 | tgt_files = glob.glob(tgt_file_prefix + "*") 53 | 54 | shutil.copy( 55 | src_file, 56 | os.path.join( 57 | args.output_dir, 58 | "Outputs", 59 | f".src-{dataset_category}.txt".join( 60 | os.path.basename(src_file).rsplit(f".{dataset_category}.{args.src_lang}", 1) 61 | ) 62 | ) 63 | ) 64 | 65 | for tgt_file in tgt_files: 66 | shutil.copy( 67 | tgt_file, 68 | os.path.join( 69 | args.output_dir, 70 | "Outputs", 71 | f".tgt-{dataset_category}.txt".join( 72 | 
os.path.basename(tgt_file).rsplit(f".{dataset_category}.{args.tgt_lang}", 1) 73 | ) 74 | ) 75 | ) 76 | 77 | def moveRawData(args): 78 | # move vocab models 79 | shutil.copy( 80 | os.path.join(args.input_dir, "vocab", f"{args.src_lang}.model"), 81 | os.path.join(args.output_dir, "Preprocessed", "srcSPM.model") 82 | ) 83 | shutil.copy( 84 | os.path.join(args.input_dir, "vocab", f"{args.tgt_lang}.model"), 85 | os.path.join(args.output_dir, "Preprocessed", "tgtSPM.model") 86 | ) 87 | 88 | vocab_cmd = [ 89 | "spm_export_vocab --model", 90 | os.path.join(args.output_dir, "Preprocessed", "srcSPM.model"), 91 | "| tail -n +4 >", 92 | os.path.join(args.output_dir, "Preprocessed", "srcSPM.vocab") 93 | ] 94 | os.system(" ".join(vocab_cmd)) 95 | 96 | vocab_cmd = [ 97 | "spm_export_vocab --model", 98 | os.path.join(args.output_dir, "Preprocessed", "tgtSPM.model"), 99 | "| tail -n +4 >", 100 | os.path.join(args.output_dir, "Preprocessed", "tgtSPM.vocab") 101 | ] 102 | os.system(" ".join(vocab_cmd)) 103 | 104 | if args.do_train: 105 | _merge(args, "train") 106 | _merge(args, "valid") 107 | 108 | if not glob.glob(os.path.join(args.input_dir, "data", f"*.valid.{args.src_lang}")): 109 | np.random.seed(3435) 110 | sampledCount = 0 111 | 112 | with open(os.path.join(args.output_dir, "data", "src-train.txt.backup"), 'w') as srcT, \ 113 | open(os.path.join(args.output_dir, "data", "tgt-train.txt.backup"), 'w') as tgtT, \ 114 | open(os.path.join(args.output_dir, "data", "src-valid.txt"), 'w') as srcV, \ 115 | open(os.path.join(args.output_dir, "data", "tgt-valid.txt"), 'w') as tgtV, \ 116 | open(os.path.join(args.output_dir, "data", "src-train.txt")) as srcO, \ 117 | open(os.path.join(args.output_dir, "data", "tgt-train.txt")) as tgtO: 118 | 119 | for srcLine, tgtLine in zip(srcO, tgtO): 120 | if sampledCount < args.validation_samples: 121 | if np.random.random() > .5: 122 | print(srcLine.strip(), file=srcV) 123 | print(tgtLine.strip(), file=tgtV) 124 | sampledCount += 1 125 | continue 126 | 127 | print(srcLine.strip(), file=srcT) 128 | print(tgtLine.strip(), file=tgtT) 129 | 130 | shutil.move( 131 | os.path.join(args.output_dir, "data", "src-train.txt.backup"), 132 | os.path.join(args.output_dir, "data", "src-train.txt") 133 | ) 134 | shutil.move( 135 | os.path.join(args.output_dir, "data", "tgt-train.txt.backup"), 136 | os.path.join(args.output_dir, "data", "tgt-train.txt") 137 | ) 138 | 139 | 140 | if args.do_eval: 141 | _move(args, "valid") 142 | _move(args, "test") 143 | 144 | def _lc(input_file): 145 | lc = 0 146 | with open(input_file) as f: 147 | for _ in f: 148 | lc += 1 149 | return lc 150 | 151 | 152 | def spmOperate(args, fileType, tokenize): 153 | if tokenize: 154 | modelName = os.path.join(args.output_dir, "Preprocessed", f"{fileType}SPM.model") 155 | input_files = glob.glob(os.path.join(args.output_dir, "Outputs", f'*{fileType}-*')) 156 | 157 | for input_file in input_files: 158 | spm_cmd = [ 159 | f"spm_encode --model=\"{modelName}\"", 160 | f"--output_format=piece", 161 | f"< \"{input_file}\" > \"{input_file}.tok\"" 162 | ] 163 | os.system(" ".join(spm_cmd)) 164 | os.remove(input_file) 165 | 166 | else: 167 | modelName = os.path.join(args.output_dir, "Preprocessed", f"tgtSPM.model") 168 | for input_file in glob.glob(os.path.join(args.output_dir, "Outputs", f'*{fileType}-*.tok')): 169 | spm_cmd = [ 170 | f"spm_decode --model=\"{modelName}\"", 171 | f"< \"{input_file}\" > \"{'.detok'.join(input_file.rsplit('.tok', 1))}\"" 172 | ] 173 | os.system(" ".join(spm_cmd)) 174 | os.remove(input_file) 175 | 
post_cmd = f"""sed 's/▁/ /g;s/ */ /g' -i \"{'.detok'.join(input_file.rsplit('.tok', 1))}\"""" 176 | os.system(post_cmd) 177 | 178 | 179 | def tokenize(args): 180 | spmOperate(args, 'src', tokenize=True) 181 | spmOperate(args, 'tgt', tokenize=True) 182 | 183 | def detokenize(args): 184 | spmOperate(args, 'tgt', tokenize=False) 185 | spmOperate(args, 'pred', tokenize=False) 186 | 187 | def processData(args, tokenization): 188 | if tokenization: 189 | createFolders(args) 190 | moveRawData(args) 191 | tokenize(args) 192 | else: 193 | detokenize(args) 194 | 195 | -------------------------------------------------------------------------------- /training/seq2seq/multi-bleu-detok.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 5 | 6 | # This file uses the internal tokenization of mteval-v13a.pl, 7 | # giving the exact same (case-sensitive) results on untokenized text. 8 | # Using this script with detokenized output and untokenized references is 9 | # preferrable over multi-bleu.perl, since scores aren't affected by tokenization differences. 10 | # 11 | # like multi-bleu.perl , it supports plain text input and multiple references. 12 | 13 | # This file is retrieved from Moses Decoder :: https://github.com/moses-smt/mosesdecoder 14 | # $Id$ 15 | use warnings; 16 | use strict; 17 | 18 | my $lowercase = 0; 19 | if ($ARGV[0] eq "-lc") { 20 | $lowercase = 1; 21 | shift; 22 | } 23 | 24 | my $stem = $ARGV[0]; 25 | if (!defined $stem) { 26 | print STDERR "usage: multi-bleu-detok.pl [-lc] reference < hypothesis\n"; 27 | print STDERR "Reads the references from reference or reference0, reference1, ...\n"; 28 | exit(1); 29 | } 30 | 31 | $stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0"; 32 | 33 | my @REF; 34 | my $ref=0; 35 | while(-e "$stem$ref") { 36 | &add_to_ref("$stem$ref",\@REF); 37 | $ref++; 38 | } 39 | &add_to_ref($stem,\@REF) if -e $stem; 40 | die("ERROR: could not find reference file $stem") unless scalar @REF; 41 | 42 | # add additional references explicitly specified on the command line 43 | shift; 44 | foreach my $stem (@ARGV) { 45 | &add_to_ref($stem,\@REF) if -e $stem; 46 | } 47 | 48 | 49 | 50 | sub add_to_ref { 51 | my ($file,$REF) = @_; 52 | my $s=0; 53 | if ($file =~ /.gz$/) { 54 | open(REF,"gzip -dc $file|") or die "Can't read $file"; 55 | } else { 56 | open(REF,$file) or die "Can't read $file"; 57 | } 58 | while() { 59 | chop; 60 | $_ = tokenization($_); 61 | push @{$$REF[$s++]}, $_; 62 | } 63 | close(REF); 64 | } 65 | 66 | my(@CORRECT,@TOTAL,$length_translation,$length_reference); 67 | my $s=0; 68 | while() { 69 | chop; 70 | $_ = lc if $lowercase; 71 | $_ = tokenization($_); 72 | my @WORD = split; 73 | my %REF_NGRAM = (); 74 | my $length_translation_this_sentence = scalar(@WORD); 75 | my ($closest_diff,$closest_length) = (9999,9999); 76 | foreach my $reference (@{$REF[$s]}) { 77 | # print "$s $_ <=> $reference\n"; 78 | $reference = lc($reference) if $lowercase; 79 | my @WORD = split(' ',$reference); 80 | my $length = scalar(@WORD); 81 | my $diff = abs($length_translation_this_sentence-$length); 82 | if ($diff < $closest_diff) { 83 | $closest_diff = $diff; 84 | $closest_length = $length; 85 | # print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." 
= abs($length_translation_this_sentence-$length), setting len: $closest_length\n"; 86 | } elsif ($diff == $closest_diff) { 87 | $closest_length = $length if $length < $closest_length; 88 | # from two references with the same closeness to me 89 | # take the *shorter* into account, not the "first" one. 90 | } 91 | for(my $n=1;$n<=4;$n++) { 92 | my %REF_NGRAM_N = (); 93 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 94 | my $ngram = "$n"; 95 | for(my $w=0;$w<$n;$w++) { 96 | $ngram .= " ".$WORD[$start+$w]; 97 | } 98 | $REF_NGRAM_N{$ngram}++; 99 | } 100 | foreach my $ngram (keys %REF_NGRAM_N) { 101 | if (!defined($REF_NGRAM{$ngram}) || 102 | $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) { 103 | $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram}; 104 | # print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram}
\n"; 105 | } 106 | } 107 | } 108 | } 109 | $length_translation += $length_translation_this_sentence; 110 | $length_reference += $closest_length; 111 | for(my $n=1;$n<=4;$n++) { 112 | my %T_NGRAM = (); 113 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 114 | my $ngram = "$n"; 115 | for(my $w=0;$w<$n;$w++) { 116 | $ngram .= " ".$WORD[$start+$w]; 117 | } 118 | $T_NGRAM{$ngram}++; 119 | } 120 | foreach my $ngram (keys %T_NGRAM) { 121 | $ngram =~ /^(\d+) /; 122 | my $n = $1; 123 | # my $corr = 0; 124 | # print "$i e $ngram $T_NGRAM{$ngram}
\n"; 125 | $TOTAL[$n] += $T_NGRAM{$ngram}; 126 | if (defined($REF_NGRAM{$ngram})) { 127 | if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) { 128 | $CORRECT[$n] += $T_NGRAM{$ngram}; 129 | # $corr = $T_NGRAM{$ngram}; 130 | # print "$i e correct1 $T_NGRAM{$ngram}
\n"; 131 | } 132 | else { 133 | $CORRECT[$n] += $REF_NGRAM{$ngram}; 134 | # $corr = $REF_NGRAM{$ngram}; 135 | # print "$i e correct2 $REF_NGRAM{$ngram}
\n"; 136 | } 137 | } 138 | # $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram}; 139 | # print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n" 140 | } 141 | } 142 | $s++; 143 | } 144 | my $brevity_penalty = 1; 145 | my $bleu = 0; 146 | 147 | my @bleu=(); 148 | 149 | for(my $n=1;$n<=4;$n++) { 150 | if (defined ($TOTAL[$n])){ 151 | $bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0; 152 | # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n"; 153 | }else{ 154 | $bleu[$n]=0; 155 | } 156 | } 157 | 158 | if ($length_reference==0){ 159 | printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n"; 160 | exit(1); 161 | } 162 | 163 | if ($length_translation<$length_reference) { 164 | $brevity_penalty = exp(1-$length_reference/$length_translation); 165 | } 166 | $bleu = $brevity_penalty * exp((my_log( $bleu[1] ) + 167 | my_log( $bleu[2] ) + 168 | my_log( $bleu[3] ) + 169 | my_log( $bleu[4] ) ) / 4) ; 170 | printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n", 171 | 100*$bleu, 172 | 100*$bleu[1], 173 | 100*$bleu[2], 174 | 100*$bleu[3], 175 | 100*$bleu[4], 176 | $brevity_penalty, 177 | $length_translation / $length_reference, 178 | $length_translation, 179 | $length_reference; 180 | 181 | sub my_log { 182 | return -9999999999 unless $_[0]; 183 | return log($_[0]); 184 | } 185 | 186 | 187 | 188 | sub tokenization 189 | { 190 | my ($norm_text) = @_; 191 | 192 | # language-independent part: 193 | $norm_text =~ s///g; # strip "skipped" tags 194 | $norm_text =~ s/-\n//g; # strip end-of-line hyphenation and join lines 195 | $norm_text =~ s/\n/ /g; # join lines 196 | $norm_text =~ s/"/"/g; # convert SGML tag for quote to " 197 | $norm_text =~ s/&/&/g; # convert SGML tag for ampersand to & 198 | $norm_text =~ s/</ 199 | $norm_text =~ s/>/>/g; # convert SGML tag for greater-than to < 200 | 201 | # language-dependent part (assuming Western languages): 202 | $norm_text = " $norm_text "; 203 | $norm_text =~ s/([\{-\~\[-\` -\&\(-\+\:-\@\/])/ $1 /g; # tokenize punctuation 204 | $norm_text =~ s/([^0-9])([\.,])/$1 $2 /g; # tokenize period and comma unless preceded by a digit 205 | $norm_text =~ s/([\.,])([^0-9])/ $1 $2/g; # tokenize period and comma unless followed by a digit 206 | $norm_text =~ s/([0-9])(-)/$1 $2 /g; # tokenize dash when preceded by a digit 207 | $norm_text =~ s/\s+/ /g; # one space only between words 208 | $norm_text =~ s/^\s+//; # no leading space 209 | $norm_text =~ s/\s+$//; # no trailing space 210 | 211 | return $norm_text; 212 | } 213 | -------------------------------------------------------------------------------- /training/seq2seq/pipeline.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import subprocess 4 | import traceback 5 | import time 6 | import shutil 7 | import argparse 8 | import glob 9 | import json 10 | from dataProcessor import processData 11 | 12 | FILEDIR = os.path.dirname(__file__) 13 | 14 | def train(args): 15 | data_map = { 16 | "train": { 17 | "path_src": os.path.join(args.output_dir, "data", "src-train.txt"), 18 | "path_tgt": os.path.join(args.output_dir, "data", "tgt-train.txt"), 19 | "transforms": ["sentencepiece", "filtertoolong"], 20 | "weight": 1 21 | }, 22 | "valid": { 23 | "path_src": os.path.join(args.output_dir, "data", "src-valid.txt"), 24 | "path_tgt": os.path.join(args.output_dir, "data", "tgt-valid.txt"), 25 | "transforms": ["sentencepiece", "filtertoolong"] 26 | } 27 | } 28 | cmd = f''' 29 | onmt_train \ 30 | 
-data \"{json.dumps(data_map)}\" \ 31 | -src_vocab \"{os.path.join(args.output_dir, "Preprocessed", "srcSPM.vocab")}\" \ 32 | -tgt_vocab \"{os.path.join(args.output_dir, "Preprocessed", "tgtSPM.vocab")}\" \ 33 | -src_subword_type sentencepiece \ 34 | -tgt_subword_type sentencepiece \ 35 | -src_subword_model \"{os.path.join(args.output_dir, "Preprocessed", "srcSPM.model")}\" \ 36 | -tgt_subword_model \"{os.path.join(args.output_dir, "Preprocessed", "tgtSPM.model")}\" \ 37 | -src_subword_nbest {args.nbest} \ 38 | -src_subword_alpha {args.alpha} \ 39 | -tgt_subword_nbest {args.nbest} \ 40 | -tgt_subword_alpha {args.alpha} \ 41 | -src_seq_length {args.src_seq_length} \ 42 | -tgt_seq_length {args.tgt_seq_length} \ 43 | -save_model \"{os.path.join(args.output_dir, "Models", args.model_prefix)}\" \ 44 | -layers {args.layers} -rnn_size {args.rnn_size} -word_vec_size {args.word_vec_size} -transformer_ff {args.transformer_ff} -heads {args.heads} \ 45 | -encoder_type transformer -decoder_type transformer -position_encoding \ 46 | -train_steps {args.train_steps} -max_generator_batches 2 -dropout 0.1 \ 47 | -batch_size {args.train_batch_size} -batch_type tokens -normalization tokens -accum_count {args.gradient_accum} \ 48 | -queue_size 10000 -bucket_size 32768 \ 49 | -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps {args.warmup_steps} -learning_rate {args.learning_rate} \ 50 | -max_grad_norm 0 -param_init 0 -param_init_glorot \ 51 | -share_decoder_embeddings \ 52 | -label_smoothing 0.1 -valid_steps {args.valid_steps} -save_checkpoint_steps {args.save_checkpoint_steps} \ 53 | -world_size {args.world_size} -gpu_ranks {" ".join(args.gpu_ranks)} {"-train_from " + args.train_from if args.train_from else ""} 54 | ''' 55 | os.system(cmd) 56 | 57 | def average_models(args): 58 | step_count = lambda p: int(re.search(r"_step_(\d+)", p).group(1)) 59 | model_paths = sorted( 60 | glob.glob(os.path.join(args.output_dir, "Models", f"{args.model_prefix}*.pt")), 61 | key=step_count 62 | ) 63 | if len(model_paths) > args.average_last: 64 | model_paths = model_paths[-args.average_last:] 65 | output_path = ( 66 | model_paths[0].rsplit("_step_")[0] + 67 | f"_step_{step_count(model_paths[0])}-{step_count(model_paths[-1])}-{args.average_last}.pt" 68 | ) 69 | 70 | model_paths = [f"\"{k}\"" for k in model_paths] 71 | cmd = [ 72 | f"onmt_average_models", 73 | f"-models {' '.join(model_paths)}", 74 | f"-output \"{output_path}\"" 75 | ] 76 | os.system(" ".join(cmd)) 77 | 78 | def _translate(args, modelName, inputFile, outputFile): 79 | cmd = f''' 80 | onmt_translate \ 81 | -model \"{modelName}\" \ 82 | -src \"{inputFile}\" \ 83 | -output \"{outputFile}\" \ 84 | -replace_unk copy -verbose -max_length {args.tgt_seq_length} -batch_size {args.eval_batch_size} -gpu 0 85 | ''' 86 | os.system(cmd) 87 | 88 | def translate(model_path, dataset_category, args): 89 | src_lines, src_map = [], {} 90 | for src_file in glob.glob(os.path.join(args.output_dir, "Outputs", f'*src-{dataset_category}.txt.tok')): 91 | with open(src_file) as f: 92 | lines = f.readlines() 93 | src_map[src_file] = len(lines) 94 | src_lines.extend(lines) 95 | 96 | merged_src_file = os.path.join(args.output_dir, "temp", "merged.src") 97 | merged_tgt_file = os.path.join(args.output_dir, "temp", "merged.tgt") 98 | 99 | with open(merged_src_file, 'w') as f: 100 | for line in src_lines: 101 | print(line.strip(), file=f) 102 | 103 | _translate(args, model_path, merged_src_file, merged_tgt_file) 104 | 105 | with open(merged_tgt_file) as inpf: 106 | idx = 0 107 | 
lines = inpf.readlines() 108 | 109 | for src_file in src_map: 110 | pred_file = f"pred-{dataset_category}.txt.tok".join( 111 | src_file.rsplit( 112 | f"src-{dataset_category}.txt.tok", 1 113 | ) 114 | ) 115 | 116 | with open(pred_file, 'w') as outf: 117 | for _ in range(src_map[src_file]): 118 | print(lines[idx].strip(), file=outf) 119 | idx += 1 120 | 121 | os.remove(merged_src_file) 122 | os.remove(merged_tgt_file) 123 | 124 | def calculate_scores(args, dataset_category): 125 | scores = [] 126 | for pred_file in glob.glob(os.path.join(args.output_dir, "Outputs", f'*pred-{dataset_category}.txt.detok')): 127 | dataset_name = os.path.basename(pred_file).rsplit( 128 | f".pred-{dataset_category}.txt.detok", 1 129 | )[0] 130 | 131 | tgt_file_prefix = f".tgt-{dataset_category}.txt.*detok".join( 132 | pred_file.rsplit( 133 | f".pred-{dataset_category}.txt.detok", 1 134 | ) 135 | ) 136 | tgt_files = glob.glob(tgt_file_prefix) 137 | if tgt_files: 138 | bleu_cmd = [ 139 | f"perl \"{os.path.join(FILEDIR, 'multi-bleu-detok.perl')}\"", 140 | f"-lc {' '.join(tgt_files)} < \"{pred_file}\"" 141 | ] 142 | sacre_cmd = [ 143 | f"cat \"{pred_file}\"", 144 | "|", 145 | f"sacrebleu {' '.join(tgt_files)}" 146 | ] 147 | 148 | try: 149 | bleu_output = str(subprocess.check_output(" ".join(bleu_cmd), shell=True)).strip() 150 | bleu_score = bleu_output.splitlines()[-1].split(",")[0].split("=")[1] 151 | except: 152 | bleu_score = -1 153 | 154 | try: 155 | sacre_output = str(subprocess.check_output(" ".join(sacre_cmd), shell=True)).strip() 156 | sacre_score = sacre_output.splitlines()[-1].split("=")[1].split()[0] 157 | except: 158 | sacre_score = -1 159 | 160 | scores.append( 161 | { 162 | "dataset": dataset_name, 163 | "bleu": bleu_score, 164 | "sacrebleu": sacre_score 165 | } 166 | ) 167 | 168 | return scores 169 | 170 | def write_scores(scores, output_path): 171 | with open(output_path, 'w') as f: 172 | for model_name in scores: 173 | print(model_name, ":", file=f) 174 | for dataset_score in scores[model_name]: 175 | print( 176 | "", 177 | f"Dataset: {dataset_score['dataset']},", 178 | f"BLEU: {dataset_score['bleu']},", 179 | f"SACREBLEU: {dataset_score['sacrebleu']},", 180 | sep="\t", 181 | file=f 182 | ) 183 | 184 | def evaluate(args): 185 | if args.model_prefix: 186 | model_paths = sorted( 187 | glob.glob(os.path.join(args.output_dir, "Models", f"{args.model_prefix}*.pt")), 188 | key=lambda p: int(re.search(r"_step_(\d+)", p).group(1)) 189 | ) 190 | model_scores = {} 191 | for model_path in model_paths: 192 | translate(model_path, "valid", args) 193 | processData(args, False) 194 | scores = calculate_scores(args, "valid") 195 | model_scores[os.path.basename(model_path)] = scores 196 | 197 | write_scores( 198 | model_scores, 199 | os.path.join( 200 | args.output_dir, 201 | "Reports", f"{args.model_prefix}.valid.{args.src_lang}2{args.tgt_lang}.log" 202 | ) 203 | ) 204 | 205 | if args.eval_model: 206 | model_scores = {} 207 | translate(args.eval_model, "test", args) 208 | processData(args, False) 209 | scores = calculate_scores(args, "test") 210 | model_scores[os.path.basename(args.eval_model)] = scores 211 | 212 | write_scores( 213 | model_scores, 214 | os.path.join( 215 | args.output_dir, "Reports", f"{os.path.basename(args.eval_model)}.test.{args.src_lang}2{args.tgt_lang}.log" 216 | ) 217 | ) 218 | 219 | 220 | def main(args): 221 | processData(args, True) 222 | if args.do_train: 223 | train(args) 224 | if args.model_prefix and args.average_last: 225 | average_models(args) 226 | if args.do_eval: 227 | 
219 |
220 | def main(args):
221 |     processData(args, True)
222 |     if args.do_train:
223 |         train(args)
224 |     if args.model_prefix and args.average_last:
225 |         average_models(args)
226 |     if args.do_eval:
227 |         evaluate(args)
228 |
229 |
230 | if __name__ == "__main__":
231 |     parser = argparse.ArgumentParser()
232 |     parser.add_argument(
233 |         '--input_dir', '-i', type=str,
234 |         required=True,
235 |         metavar='PATH',
236 |         help="Input directory")
237 |
238 |     parser.add_argument(
239 |         '--output_dir', '-o', type=str,
240 |         required=True,
241 |         metavar='PATH',
242 |         help="Output directory")
243 |
244 |     parser.add_argument(
245 |         '--src_lang', type=str,
246 |         required=True,
247 |         help="Source language")
248 |
249 |     parser.add_argument(
250 |         '--tgt_lang', type=str,
251 |         required=True,
252 |         help="Target language")
253 |
254 |     parser.add_argument(
255 |         '--validation_samples', type=int, default=5000,
256 |         help='number of validation samples to hold out from the training data when no validation set is present')
257 |
258 |     parser.add_argument(
259 |         '--src_seq_length', type=int, default=200,
260 |         help='maximum source sequence length')
261 |
262 |     parser.add_argument(
263 |         '--tgt_seq_length', type=int, default=200,
264 |         help='maximum target sequence length')
265 |
266 |     parser.add_argument(
267 |         '--model_prefix', type=str,
268 |         help='Prefix of the model to save')
269 |
270 |     parser.add_argument(
271 |         '--eval_model', type=str, metavar="PATH",
272 |         help='Path to the specific model to evaluate')
273 |
274 |     parser.add_argument(
275 |         '--train_steps', type=int, default=120000,
276 |         help='number of training steps')
277 |
278 |     parser.add_argument(
279 |         '--train_batch_size', type=int, default=12288,
280 |         help='training batch size (in tokens)')
281 |
282 |     parser.add_argument(
283 |         '--eval_batch_size', type=int, default=8,
284 |         help='evaluation batch size (in sentences)')
285 |
286 |     parser.add_argument(
287 |         '--gradient_accum', type=int, default=2,
288 |         help='gradient accumulation count')
289 |
290 |     parser.add_argument(
291 |         '--warmup_steps', type=int, default=4000,
292 |         help='learning rate warmup steps')
293 |
294 |     parser.add_argument(
295 |         '--learning_rate', type=int, default=2,
296 |         help='learning rate')
297 |
298 |     parser.add_argument(
299 |         '--layers', type=int, default=6,
300 |         help='number of encoder/decoder layers')
301 |
302 |     parser.add_argument(
303 |         '--rnn_size', type=int, default=512,
304 |         help='model hidden size')
305 |
306 |     parser.add_argument(
307 |         '--word_vec_size', type=int, default=512,
308 |         help='word vector size')
309 |
310 |     parser.add_argument(
311 |         '--transformer_ff', type=int, default=2048,
312 |         help='transformer feed-forward size')
313 |
314 |     parser.add_argument(
315 |         '--heads', type=int, default=8,
316 |         help='number of attention heads')
317 |
318 |     parser.add_argument(
319 |         '--valid_steps', type=int, default=2000,
320 |         help='validation interval (in steps)')
321 |
322 |     parser.add_argument(
323 |         '--save_checkpoint_steps', type=int, default=1000,
324 |         help='checkpoint saving interval (in steps)')
325 |
326 |     parser.add_argument(
327 |         '--average_last', type=int, default=20,
328 |         help='number of last checkpoints to average')
329 |
330 |     parser.add_argument(
331 |         '--world_size', type=int, default=4,
332 |         help='total number of distributed processes (world size)')
333 |
334 |     parser.add_argument(
335 |         '--gpu_ranks', type=str, nargs="*", default=["0", "1", "2", "3"],
336 |         help='GPU ranks to train on')
337 |
338 |     parser.add_argument(
339 |         '--train_from', type=str, default="",
340 |         help='start training from this checkpoint')
341 |
342 |     parser.add_argument('--do_train', action='store_true',
343 |         help='Run training')
344 |     parser.add_argument('--do_eval', action='store_true',
345 |         help='Run evaluation')
346 |
347 |     parser.add_argument(
348 |         '--nbest', type=int, default=32,
349 |         help='sentencepiece nbest size')
350 |     parser.add_argument(
351 |         '--alpha', type=float, default=0.1,
352 |         help='sentencepiece sampling smoothing alpha')
353 |
354 |     args = parser.parse_args()
355 |     main(args)
356 |
357 |
358 |
359 |
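For orientation, a hypothetical end-to-end invocation assembled only from the flags defined above is sketched below. The directory names, language codes, and model prefix are placeholders, and the expected layout of the input directory (parallel data plus sentencepiece vocab files) follows the sample under `sample_input_dir/`; this is an illustrative sketch, not a prescribed command.

```python
# Build an example command line for pipeline.py from the argparse flags above.
# All paths and the "bn2en" prefix are illustrative placeholders.
cmd = " ".join([
    "python pipeline.py",
    "-i path/to/input_dir -o path/to/output_dir",
    "--src_lang bn --tgt_lang en",
    "--model_prefix bn2en",
    "--world_size 1 --gpu_ranks 0",
    "--do_train --do_eval",
])
print(cmd)
```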
--------------------------------------------------------------------------------
/training/seq2seq/requirements.txt:
--------------------------------------------------------------------------------
1 | git+https://github.com/abhik1505040/OpenNMT-py
2 | pyrouge
3 | git+https://github.com/NVIDIA/apex.git@700d6825e205732c1d6be511306ca4e595297070
4 | sentencepiece>=0.1.94
5 | subword-nmt>=0.3.7
6 | sacrebleu==1.4.2
--------------------------------------------------------------------------------
/training/seq2seq/sample_input_dir/vocab/bn.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/csebuetnlp/banglanmt/361801040950e5a50ddf51d4c02d36269fe5dd91/training/seq2seq/sample_input_dir/vocab/bn.model
--------------------------------------------------------------------------------
/training/seq2seq/sample_input_dir/vocab/en.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/csebuetnlp/banglanmt/361801040950e5a50ddf51d4c02d36269fe5dd91/training/seq2seq/sample_input_dir/vocab/en.model
--------------------------------------------------------------------------------
/vocab.tar.bz2:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/csebuetnlp/banglanmt/361801040950e5a50ddf51d4c02d36269fe5dd91/vocab.tar.bz2
--------------------------------------------------------------------------------