├── README.md ├── batch_filtering ├── README.md ├── install_external_tools.sh ├── install_models.sh ├── scoring_pipeline.py └── source │ ├── embed.py │ ├── lib │ ├── indexing.py │ ├── romanize_lc.py │ └── text_processing.py │ └── mine_bitexts.py ├── segmentation ├── LICENSE.txt ├── README.md ├── __init__.py ├── segmenter.py └── setup.py ├── training ├── README.md ├── preprocessing │ ├── README.md │ ├── preprocessor.py │ ├── remove_evaluation_pairs.py │ └── replacePatterns.txt └── seq2seq │ ├── .gitignore │ ├── dataProcessor.py │ ├── multi-bleu-detok.perl │ ├── pipeline.py │ ├── requirements.txt │ └── sample_input_dir │ ├── data │ ├── RisingNews.test.bn │ ├── RisingNews.test.en │ ├── RisingNews.valid.bn │ └── RisingNews.valid.en │ └── vocab │ ├── bn.model │ └── en.model └── vocab.tar.bz2 /README.md: -------------------------------------------------------------------------------- 1 | # Bangla-NMT 2 | 3 | This repository contains the code and data of the paper titled [**"Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation"**](https://www.aclweb.org/anthology/2020.emnlp-main.207/) published in *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.* 4 | 5 | ## Updates 6 | 7 | * The base translation models are now available for download. 8 | * The training code has been refactored to support [OpenNMT-py 2.2.0](https://github.com/OpenNMT/OpenNMT-py). 9 | * [Colab Notebook](https://colab.research.google.com/drive/1TPkYXEWrf_dUjq-1qpapkreLc7JOug9E?usp=sharing) added for the inference module. 10 | 11 | ## Table of Contents 12 | 13 | - [Bangla-NMT](#bangla-nmt) 14 | - [Updates](#updates) 15 | - [Table of Contents](#table-of-contents) 16 | - [Datasets](#datasets) 17 | - [Models](#models) 18 | - [Dependencies](#dependencies) 19 | - [Segmentation](#segmentation) 20 | - [Batch-filtering](#batch-filtering) 21 | - [Training & Evaluation](#training--evaluation) 22 | - [License](#license) 23 | - [Citation](#citation) 24 | 25 | 26 | ## Datasets 27 | Download the dataset from [here](https://docs.google.com/uc?export=download&id=1FLlC0NNXFKVGaVM3-cYW-XEx8p8eV3Wm). This includes: 28 | * Our original 2.75M training corpus (`2.75M/`) 29 | * [Preprocessed](training/preprocessing) training corpus (`data/`) 30 | * RisingNews dev/test sets (`data/`) 31 | * Preprocessed sipc dev/test sets (`data/`) 32 | * Sentencepiece vocabulary models for Bengali and English (`vocab/`) 33 | 34 | ## Models 35 | 36 | The base-sized transformer model (6 layers, 8 attention heads) checkpoints can be found below: 37 | 38 | * [Bengali to English](https://docs.google.com/uc?export=download&id=1nYKua6_q7W-WK-Xwng_DjoLoZ0k1HgjB) 39 | * [English to Bengali](https://docs.google.com/uc?export=download&id=1uX8nL3yeosmK3YVCRHNJolv861-fCCbi) 40 | * [Sentencepiece vocabulary files](vocab.tar.bz2) 41 | 42 | To evaluate these models on new datasets, please refer to [here](https://github.com/csebuetnlp/banglanmt/tree/master/training). You can also use the [Colab Notebook](https://colab.research.google.com/drive/1TPkYXEWrf_dUjq-1qpapkreLc7JOug9E?usp=sharing) for direct inference. 
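For a quick local test outside the Colab notebook, the checkpoints can be run with the OpenNMT-py 2.x CLI. The snippet below is only a rough sketch: the checkpoint name, input/output file names, and vocabulary paths are placeholders, and the exact preprocessing pipeline is the one documented in the [training module](training/).

```bash
# Hedged sketch (Bengali -> English); file names below are placeholders.
# Encode the raw Bengali input with the provided sentencepiece model,
# translate with the downloaded checkpoint, then decode the output.
spm_encode --model=vocab/bn.model --output_format=piece < input.bn > input.bn.sp
onmt_translate -model bn2en.pt -src input.bn.sp -output pred.en.sp
spm_decode --model=vocab/en.model --input_format=piece < pred.en.sp > pred.en
```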
43 | 44 | ## Dependencies 45 | * Python 3.7.3 46 | * [PyTorch 1.2](http://pytorch.org/) 47 | * [Cython](https://pypi.org/project/Cython/) 48 | * [Faiss](https://github.com/facebookresearch/faiss) 49 | * [FastBPE](https://github.com/glample/fastBPE) 50 | * [sentencepiece](https://github.com/google/sentencepiece) (`Install CLI`) 51 | * [transliterate](https://pypi.org/project/transliterate) 52 | * [regex](https://pypi.org/project/regex/) 53 | * [torchtext](https://pypi.org/project/torchtext) (`pip install torchtext==0.4.0`) 54 | * [sacrebleu](https://pypi.org/project/sacrebleu) 55 | * [aksharamukha](https://pypi.org/project/aksharamukha) 56 | 57 | 58 | ## Segmentation 59 | * See [segmentation module.](segmentation/) 60 | 61 | ## Batch-filtering 62 | * See [batch-filtering module.](batch_filtering/) 63 | 64 | ## Training & Evaluation 65 | * See [training and evaluation module.](training/) 66 | * Try out the models in [Google Colaboratory.](https://colab.research.google.com/drive/1TPkYXEWrf_dUjq-1qpapkreLc7JOug9E?usp=sharing) 67 | 68 | ## License 69 | Contents of this repository are licensed under [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). 70 | 71 | ## Citation 72 | If you use this dataset or code modules, please cite the following paper: 73 | ``` 74 | @inproceedings{hasan-etal-2020-low, 75 | title = "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for {B}engali-{E}nglish Machine Translation", 76 | author = "Hasan, Tahmid and 77 | Bhattacharjee, Abhik and 78 | Samin, Kazi and 79 | Hasan, Masum and 80 | Basak, Madhusudan and 81 | Rahman, M. Sohel and 82 | Shahriyar, Rifat", 83 | booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", 84 | month = nov, 85 | year = "2020", 86 | address = "Online", 87 | publisher = "Association for Computational Linguistics", 88 | url = "https://www.aclweb.org/anthology/2020.emnlp-main.207", 89 | doi = "10.18653/v1/2020.emnlp-main.207", 90 | pages = "2612--2623", 91 | abstract = "Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in machine translation literature due to being low in resources. Most publicly available parallel corpora for Bengali are not large enough; and have rather poor quality, mostly because of incorrect sentence alignments resulting from erroneous sentence segmentation, and also because of a high volume of noise present in them. In this work, we build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation on low-resource setups: aligner ensembling and batch filtering. With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising of 2.75 million sentence pairs, more than 2 million of which were not available before. Training on neural models, we achieve an improvement of more than 9 BLEU score over previous approaches to Bengali-English machine translation. We also evaluate on a new test set of 1000 pairs made with extensive quality control. We release the segmenter, parallel corpus, and the evaluation set, thus elevating Bengali from its low-resource status. To the best of our knowledge, this is the first ever large scale study on Bengali-English machine translation. 
We believe our study will pave the way for future research on Bengali-English machine translation as well as other low-resource languages. Our data and code are available at https://github.com/csebuetnlp/banglanmt.", 92 | } 93 | ``` 94 | -------------------------------------------------------------------------------- /batch_filtering/README.md: -------------------------------------------------------------------------------- 1 | ## Setup 2 | * Install all dependencies mentioned [here](https://github.com/csebuetnlp/banglanmt). 3 | * Download the models: `bash ./install_models.sh` 4 | * Set up the necessary tools: `bash ./install_external_tools.sh` 5 | 6 | ## Usage 7 | * Set up the `LASER` environment variable before running: 8 | ```bash 9 | # inside this directory 10 | $ export LASER=$(pwd) 11 | ``` 12 | * Batch filtering options: 13 | ```bash 14 | $ python3 scoring_pipeline.py -h 15 | usage: scoring_pipeline.py [-h] --input_dir PATH --output_dir PATH --src_lang 16 | SRC_LANG --tgt_lang TGT_LANG [--thresh THRESH] 17 | [--batch_size BATCH_SIZE] [--cpu] 18 | 19 | optional arguments: 20 | -h, --help show this help message and exit 21 | --input_dir PATH, -i PATH 22 | Input directory 23 | --output_dir PATH, -o PATH 24 | Output directory 25 | --src_lang SRC_LANG Source language 26 | --tgt_lang TGT_LANG Target language 27 | --thresh THRESH threshold 28 | --batch_size BATCH_SIZE 29 | batch size 30 | 31 | ``` 32 | * ***The script will recursively look for all file pairs `(X.src_lang, X.tgt_lang)` inside `input_dir`, where `X` is any common file prefix, and produce the following output files within the corresponding subdirectories of `output_dir`:*** 33 | 34 | * `X.merged.tsv`: Output line pairs with their similarity scores 35 | * `X.passed.src_lang` / `X.passed.tgt_lang`: Line pairs with similarity scores greater than the given `thresh` 36 | * `X.failed.src_lang` / `X.failed.tgt_lang`: Line pairs with similarity scores less than or equal to the given `thresh` 37 | -------------------------------------------------------------------------------- /batch_filtering/install_external_tools.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | #------------------------------------------------------- 14 | # 15 | # This bash script installs third party software 16 | # 17 | 18 | if [ -z ${LASER} ] ; then 19 | echo "Please set the environment variable 'LASER'" 20 | exit 21 | fi 22 | 23 | bdir="${LASER}" 24 | tools_ext="${bdir}/tools-external" 25 | 26 | 27 | ################################################################### 28 | # 29 | # Generic helper functions 30 | # 31 | ################################################################### 32 | 33 | MKDIR () { 34 | dname=$1 35 | if [ ! -d ${dname} ] ; then 36 | echo " - creating directory ${dname}" 37 | mkdir -p ${dname} 38 | fi 39 | } 40 | 41 | 42 | ################################################################### 43 | # 44 | # Tokenization tools from Moses 45 | # It is important to use the official release V4 and not the current one 46 | # to obtain the same results as the published ones.
47 | # (the behavior of the tokenizer for end-of-sentence abbreviations has changed) 48 | # 49 | ################################################################### 50 | 51 | InstallMosesTools () { 52 | moses_git="https://raw.githubusercontent.com/moses-smt/mosesdecoder/RELEASE-4.0/scripts" 53 | moses_files=("tokenizer/tokenizer.perl" "tokenizer/detokenizer.perl" \ 54 | "tokenizer/normalize-punctuation.perl" \ 55 | "tokenizer/remove-non-printing-char.perl" \ 56 | "tokenizer/deescape-special-chars.perl" \ 57 | "tokenizer/lowercase.perl" \ 58 | "tokenizer/basic-protected-patterns" \ 59 | ) 60 | 61 | wdir="${tools_ext}/moses-tokenizer/tokenizer" 62 | MKDIR ${wdir} 63 | cd ${wdir} 64 | 65 | for f in ${moses_files[@]} ; do 66 | if [ ! -f `basename ${f}` ] ; then 67 | echo " - download ${f}" 68 | wget -q ${moses_git}/${f} 69 | fi 70 | done 71 | chmod 755 *perl 72 | 73 | # download non-breaking prefixes per language 74 | moses_non_breakings="share/nonbreaking_prefixes/nonbreaking_prefix" 75 | moses_non_breaking_langs=( \ 76 | "ca" "cs" "de" "el" "en" "es" "fi" "fr" "ga" "hu" "is" \ 77 | "it" "lt" "lv" "nl" "pl" "pt" "ro" "ru" "sk" "sl" "sv" \ 78 | "ta" "yue" "zh" ) 79 | wdir="${tools_ext}/moses-tokenizer/share/nonbreaking_prefixes" 80 | MKDIR ${wdir} 81 | cd ${wdir} 82 | 83 | for l in ${moses_non_breaking_langs[@]} ; do 84 | f="${moses_non_breakings}.${l}" 85 | if [ ! -f `basename ${f}` ] ; then 86 | echo " - download ${f}" 87 | wget -q ${moses_git}/${f} 88 | fi 89 | done 90 | } 91 | 92 | 93 | ################################################################### 94 | # 95 | # FAST BPE 96 | # 97 | ################################################################### 98 | 99 | InstallFastBPE () { 100 | cd ${tools_ext} 101 | if [ ! -x fastBPE/fast ] ; then 102 | echo " - download fastBPE software from github" 103 | wget https://github.com/glample/fastBPE/archive/master.zip 104 | unzip master.zip 105 | /bin/rm master.zip 106 | mv fastBPE-master fastBPE 107 | cd fastBPE 108 | echo " - compiling" 109 | g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast 110 | if [ $? -eq 1 ] ; then 111 | echo "ERROR: compilation failed, please install manually"; exit 112 | fi 113 | python setup.py install 114 | fi 115 | } 116 | 117 | 118 | ################################################################### 119 | # 120 | # Install Japanese tokenizer Mecab 121 | # We do not use automatic installation with "pip" but directly add the source directory 122 | # 123 | ################################################################### 124 | 125 | InstallMecab () { 126 | cd ${tools_ext} 127 | if [ ! -x mecab/mecab/bin/mecab ] ; then 128 | echo " - download mecab from github" 129 | wget https://github.com/taku910/mecab/archive/master.zip 130 | unzip master.zip 131 | #/bin/rm master.zip 132 | if [ ! -s mecab/bin/mecab ] ; then 133 | mkdir mecab 134 | cd mecab-master/mecab 135 | echo " - installing code" 136 | ./configure --prefix ${tools_ext}/mecab && make && make install 137 | if [ $? -eq 1 ] ; then 138 | echo "ERROR: installation failed, please install manually"; exit 139 | fi 140 | fi 141 | if [ ! -d mecab/lib/mecab/dic/ipadic ] ; then 142 | cd ${tools_ext}/mecab-master/mecab-ipadic 143 | echo " - installing dictionaries" 144 | ./configure --prefix ${tools_ext}/mecab --with-mecab-config=${tools_ext}/mecab/bin/mecab-config \ 145 | && make && make install 146 | if [ $?
-eq 1 ] ; then 147 | echo "ERROR: compilation failed, please install manually"; exit 148 | fi 149 | fi 150 | fi 151 | } 152 | 153 | 154 | ################################################################### 155 | # 156 | # main 157 | # 158 | ################################################################### 159 | 160 | echo "Installing external tools" 161 | 162 | InstallMosesTools 163 | InstallFastBPE 164 | 165 | #InstallMecab 166 | echo "" 167 | echo "automatic installation of the Japanese tokenizer mecab may be tricky" 168 | echo "Please install it manually from https://github.com/taku910/mecab" 169 | echo "" 170 | echo "The installation directory should be ${LASER}/tools-external/mecab" 171 | echo "" 172 | -------------------------------------------------------------------------------- /batch_filtering/install_models.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | #------------------------------------------------------- 14 | # 15 | # This bash script installs sentence encoders from Amazon s3 16 | # 17 | # if [ -z ${LASER} ] ; then 18 | # echo "Please set the environment variable 'LASER'" 19 | # exit 20 | # fi 21 | 22 | LASER='.' 23 | 24 | mdir="${LASER}/models" 25 | 26 | # available encoders 27 | s3="https://dl.fbaipublicfiles.com/laser/models" 28 | networks=("bilstm.eparl21.2018-11-19.pt" \ 29 | "eparl21.fcodes" "eparl21.fvocab" \ 30 | "bilstm.93langs.2018-12-26.pt" \ 31 | "93langs.fcodes" "93langs.fvocab") 32 | 33 | 34 | echo "Downloading networks" 35 | 36 | if [ ! 
-d ${mdir} ] ; then 37 | echo " - creating directory ${mdir}" 38 | mkdir -p ${mdir} 39 | fi 40 | 41 | cd ${mdir} 42 | for f in ${networks[@]} ; do 43 | if [ -f ${f} ] ; then 44 | echo " - ${mdir}/${f} already downloaded" 45 | else 46 | echo " - ${f}" 47 | wget -q ${s3}/${f} 48 | fi 49 | done 50 | -------------------------------------------------------------------------------- /batch_filtering/scoring_pipeline.py: -------------------------------------------------------------------------------- 1 | from source.embed import * 2 | from multiprocessing import Pool 3 | import multiprocessing as mp 4 | import time 5 | import random 6 | import shutil 7 | import argparse 8 | import glob 9 | import math 10 | 11 | random.seed(3435) 12 | 13 | 14 | def loadEncoder(cpu=False): 15 | model_loc = os.path.join(os.environ["LASER"], "models", "bilstm.93langs.2018-12-26.pt") 16 | print(' - Encoder: loading {}'.format(model_loc)) 17 | global ENCODER 18 | 19 | ENCODER = SentenceEncoder(model_loc, 20 | max_sentences=None, 21 | max_tokens=12000, 22 | sort_kind='quicksort', 23 | cpu=cpu) 24 | 25 | def encode(ifname, ofname, language): 26 | with tempfile.TemporaryDirectory() as tmpdir: 27 | 28 | tok_fname = os.path.join(tmpdir, 'tok') 29 | Token(ifname, 30 | tok_fname, 31 | lang=language, 32 | romanize=True if language == 'el' else False, 33 | lower_case=True, gzip=False, 34 | verbose=True, over_write=False) 35 | ifname = tok_fname 36 | 37 | bpe_fname = os.path.join(tmpdir, 'bpe') 38 | BPEfastApply(ifname, 39 | bpe_fname, 40 | os.path.join(os.environ["LASER"], "models", "93langs.fcodes"), 41 | verbose=True, over_write=False) 42 | ifname = bpe_fname 43 | 44 | EncodeFile(ENCODER, 45 | ifname, 46 | ofname, 47 | verbose=True, over_write=False, 48 | buffer_size=10000) 49 | 50 | def getLines(filename): 51 | lines = [] 52 | with open(filename) as f: 53 | for line in f: 54 | assert line.strip(), "Empty line found" 55 | lines.append(line.strip()) 56 | return lines 57 | 58 | def writeValidLinePairs(file1, file2): 59 | f1Lines, f2Lines = [], [] 60 | 61 | with open(file1) as f1, open(file2) as f2: 62 | for line1, line2 in zip(f1, f2): 63 | if line1.strip() == "" or line2.strip() == "": 64 | continue 65 | 66 | f1Lines.append(line1.replace('\t', ' ').strip()) 67 | f2Lines.append(line2.replace('\t', ' ').strip()) 68 | 69 | 70 | linePairList = list(dict.fromkeys(zip(f1Lines, f2Lines))) 71 | 72 | with open(file1, 'w') as f1, open(file2, 'w') as f2: 73 | for linePair in linePairList: 74 | print(linePair[0].strip(), file=f1) 75 | print(linePair[1].strip(), file=f2) 76 | 77 | def score(prefix, args): 78 | writeValidLinePairs(f'{prefix}.{args.src_lang}', f'{prefix}.{args.tgt_lang}') 79 | 80 | s = f''' 81 | python3 \"{os.path.join(os.environ["LASER"], "source", "mine_bitexts.py")}\" \ 82 | \"{prefix}.{args.src_lang}\" \"{prefix}.{args.tgt_lang}\" \ 83 | --src-lang {args.src_lang} --trg-lang {args.tgt_lang} \ 84 | --src-embeddings \"{prefix}.enc.{args.src_lang}\" --trg-embeddings \"{prefix}.enc.{args.tgt_lang}\" \ 85 | --mode score --retrieval max -k 4 \ 86 | --output \"{prefix}.tsv\" \ 87 | --verbose {'--gpu' if not args.cpu else ''} 88 | ''' 89 | os.system(s) 90 | 91 | os.remove(f'{prefix}.enc.{args.src_lang}') 92 | os.remove(f'{prefix}.enc.{args.tgt_lang}') 93 | 94 | def mergeScores(input_dir, output_file): 95 | output_lines = [] 96 | for input_file in glob.glob(os.path.join(input_dir, "*tsv")): 97 | output_lines.extend(getLines(input_file)) 98 | 99 | _create(output_lines, output_file) 100 | 101 | def scoreDir(dirname, out_prefix, 
args): 102 | prefixes = [f[:-len(args.tgt_lang) - 1] for f in glob.glob(os.path.join(dirname, f"*{args.tgt_lang}"))] 103 | 104 | for prefix in prefixes: 105 | encode(f'{prefix}.{args.src_lang}', f'{prefix}.enc.{args.src_lang}', args.src_lang) 106 | encode(f'{prefix}.{args.tgt_lang}', f'{prefix}.enc.{args.tgt_lang}', args.tgt_lang) 107 | 108 | if args.cpu: 109 | with Pool() as pool: 110 | pool.starmap(score, [(prefix, args) for prefix in prefixes]) 111 | else: 112 | for prefix in prefixes: 113 | score(prefix, args) 114 | 115 | for filename in glob.glob(os.path.join(dirname, "*.enc.*")): 116 | os.remove(filename) 117 | 118 | mergeScores(dirname, os.path.join(os.path.dirname(dirname), out_prefix)) 119 | shutil.rmtree(dirname) 120 | 121 | def shufflePairs(srcFile, tgtFile): 122 | with open(f'{srcFile}.shuffled', 'w') as srcF, open(f'{tgtFile}.shuffled', 'w') as tgtF: 123 | srcLines, tgtLines = [], [] 124 | 125 | with open(srcFile) as f: 126 | srcLines.extend(f.readlines()) 127 | 128 | with open(tgtFile) as f: 129 | tgtLines.extend(f.readlines()) 130 | 131 | assert len(srcLines) == len(tgtLines), "src and tgt line counts dont match" 132 | 133 | indices = list(range(len(srcLines))) 134 | random.shuffle(indices) 135 | 136 | for i in indices: 137 | print(srcLines[i].strip(), file=srcF) 138 | print(tgtLines[i].strip(), file=tgtF) 139 | 140 | shutil.move(f'{srcFile}.shuffled', srcFile) 141 | shutil.move(f'{tgtFile}.shuffled', tgtFile) 142 | 143 | def _create(lines, output_file): 144 | with open(output_file, 'w') as outf: 145 | for line in lines: 146 | print(line.strip(), file=outf) 147 | 148 | def createChunks(input_file, output_dir, suffix, chunk_size): 149 | os.makedirs(output_dir, exist_ok=True) 150 | input_lines = getLines(input_file) 151 | no_chunks = math.ceil(len(input_lines) / chunk_size) 152 | 153 | for i in range(no_chunks): 154 | output_file = os.path.join(output_dir, f"{i}.{suffix}") 155 | lines = input_lines[i * chunk_size: (i + 1) * chunk_size] 156 | _create(lines, output_file) 157 | 158 | def chunkFiles(prefix, dirname, args): 159 | if os.path.isdir(os.path.join(dirname, "original")): 160 | shutil.rmtree(os.path.join(dirname, "original")) 161 | 162 | shutil.copy(f'{prefix}.{args.src_lang}', f'{prefix}.{args.src_lang}.backup') 163 | shutil.copy(f'{prefix}.{args.tgt_lang}', f'{prefix}.{args.tgt_lang}.backup') 164 | 165 | shufflePairs(f'{prefix}.{args.src_lang}', f'{prefix}.{args.tgt_lang}') 166 | 167 | createChunks(f"{prefix}.{args.src_lang}", os.path.join(dirname, "original"), args.src_lang, args.batch_size) 168 | createChunks(f"{prefix}.{args.tgt_lang}", os.path.join(dirname, "original"), args.tgt_lang, args.batch_size) 169 | 170 | shutil.move(f'{prefix}.{args.src_lang}.backup', f'{prefix}.{args.src_lang}') 171 | shutil.move(f'{prefix}.{args.tgt_lang}.backup', f'{prefix}.{args.tgt_lang}') 172 | 173 | def batchFilterDir(args): 174 | for tgtFile in glob.glob(os.path.join(args.input_dir, "**", f"*{args.tgt_lang}"), recursive=True): 175 | dirname = os.path.dirname(tgtFile) 176 | prefix = tgtFile[:-len(args.tgt_lang) - 1] 177 | if not os.path.isfile(f'{prefix}.{args.src_lang}'): 178 | continue 179 | 180 | chunkFiles(prefix, dirname, args) 181 | out_prefix = os.path.basename(prefix) 182 | tsv_name = out_prefix + ".merged.tsv" 183 | scoreDir(os.path.join(dirname, "original"), tsv_name, args) 184 | 185 | outDir = dirname.replace(os.path.normpath(args.input_dir), os.path.normpath(args.output_dir), 1) 186 | os.makedirs(outDir, exist_ok=True) 187 | passed = failed = 0 188 | 189 | 
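# Move the merged score file into the output tree, then split each scored pair into passed/failed files: pairs whose score exceeds --thresh go to *.passed.*, the rest to *.failed.*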
shutil.move(os.path.join(dirname, tsv_name), os.path.join(outDir, tsv_name)) 190 | 191 | with open(os.path.join(outDir, f'{out_prefix}.passed.{args.src_lang}'), 'w') as psrc, \ 192 | open(os.path.join(outDir, f'{out_prefix}.passed.{args.tgt_lang}'), 'w') as ptgt, \ 193 | open(os.path.join(outDir, f'{out_prefix}.failed.{args.src_lang}'), 'w') as fsrc, \ 194 | open(os.path.join(outDir, f'{out_prefix}.failed.{args.tgt_lang}'), 'w') as ftgt: 195 | 196 | with open(os.path.join(outDir, tsv_name)) as f: 197 | for line in f: 198 | score, srcLine, tgtLine = line.split('\t') 199 | 200 | if float(score) > args.thresh: 201 | print(srcLine.strip(), file=psrc) 202 | print(tgtLine.strip(), file=ptgt) 203 | passed += 1 204 | else: 205 | print(srcLine.strip(), file=fsrc) 206 | print(tgtLine.strip(), file=ftgt) 207 | failed += 1 208 | 209 | print(f'Passed Sentences: {passed}') 210 | print(f'Failed Sentences: {failed}') 211 | 212 | 213 | 214 | if __name__ == "__main__": 215 | parser = argparse.ArgumentParser() 216 | parser.add_argument( 217 | '--input_dir', '-i', type=str, 218 | required=True, 219 | metavar='PATH', 220 | help="Input directory") 221 | 222 | parser.add_argument( 223 | '--output_dir', '-o', type=str, 224 | required=True, 225 | metavar='PATH', 226 | help="Output directory") 227 | 228 | parser.add_argument( 229 | '--src_lang', type=str, 230 | required=True, 231 | help="Source language") 232 | 233 | parser.add_argument( 234 | '--tgt_lang', type=str, 235 | required=True, 236 | help="Target language") 237 | 238 | parser.add_argument('--thresh', type=float, default=.95, help='threshold') 239 | parser.add_argument('--batch_size', type=int, default=1000, help='batch size') 240 | 241 | parser.add_argument('--cpu', action='store_true', 242 | help='Run on cpu') 243 | 244 | args = parser.parse_args() 245 | assert args.input_dir != args.output_dir, "input and output directories cant be the same." 246 | loadEncoder(args.cpu) 247 | batchFilterDir(args) 248 | -------------------------------------------------------------------------------- /batch_filtering/source/embed.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 
7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | # -------------------------------------------------------- 14 | # 15 | # Tool to calculate to embed a text file 16 | # The functions can be also imported into another Python code 17 | 18 | 19 | import re 20 | import os 21 | import tempfile 22 | import sys 23 | import time 24 | import argparse 25 | import numpy as np 26 | from collections import namedtuple 27 | 28 | import torch 29 | import torch.nn as nn 30 | 31 | # get environment 32 | assert os.environ.get('LASER'), 'Please set the enviornment variable LASER' 33 | LASER = os.environ['LASER'] 34 | 35 | sys.path.append(LASER + '/source/lib') 36 | from text_processing import Token, BPEfastApply 37 | 38 | SPACE_NORMALIZER = re.compile("\s+") 39 | Batch = namedtuple('Batch', 'srcs tokens lengths') 40 | 41 | 42 | def buffered_read(fp, buffer_size): 43 | buffer = [] 44 | for src_str in fp: 45 | buffer.append(src_str.strip()) 46 | if len(buffer) >= buffer_size: 47 | yield buffer 48 | buffer = [] 49 | 50 | if len(buffer) > 0: 51 | yield buffer 52 | 53 | 54 | def buffered_arange(max): 55 | if not hasattr(buffered_arange, 'buf'): 56 | buffered_arange.buf = torch.LongTensor() 57 | if max > buffered_arange.buf.numel(): 58 | torch.arange(max, out=buffered_arange.buf) 59 | return buffered_arange.buf[:max] 60 | 61 | 62 | # TODO Do proper padding from the beginning 63 | def convert_padding_direction(src_tokens, padding_idx, right_to_left=False, left_to_right=False): 64 | assert right_to_left ^ left_to_right 65 | pad_mask = src_tokens.eq(padding_idx) 66 | if not pad_mask.any(): 67 | # no padding, return early 68 | return src_tokens 69 | if left_to_right and not pad_mask[:, 0].any(): 70 | # already right padded 71 | return src_tokens 72 | if right_to_left and not pad_mask[:, -1].any(): 73 | # already left padded 74 | return src_tokens 75 | max_len = src_tokens.size(1) 76 | range = buffered_arange(max_len).type_as(src_tokens).expand_as(src_tokens) 77 | num_pads = pad_mask.long().sum(dim=1, keepdim=True) 78 | if right_to_left: 79 | index = torch.remainder(range - num_pads, max_len) 80 | else: 81 | index = torch.remainder(range + num_pads, max_len) 82 | return src_tokens.gather(1, index) 83 | 84 | 85 | class SentenceEncoder: 86 | 87 | def __init__(self, model_path, max_sentences=None, max_tokens=None, cpu=False, fp16=False, verbose=False, 88 | sort_kind='quicksort'): 89 | self.use_cuda = torch.cuda.is_available() and not cpu 90 | self.max_sentences = max_sentences 91 | self.max_tokens = max_tokens 92 | if self.max_tokens is None and self.max_sentences is None: 93 | self.max_sentences = 1 94 | 95 | state_dict = torch.load(model_path) 96 | self.encoder = Encoder(**state_dict['params']) 97 | self.encoder.load_state_dict(state_dict['model']) 98 | self.dictionary = state_dict['dictionary'] 99 | self.pad_index = self.dictionary[''] 100 | self.eos_index = self.dictionary[''] 101 | self.unk_index = self.dictionary[''] 102 | if fp16: 103 | self.encoder.half() 104 | if self.use_cuda: 105 | if verbose: 106 | print(' - transfer encoder to GPU') 107 | self.encoder.cuda() 108 | self.sort_kind = sort_kind 109 | 110 | def _process_batch(self, batch): 111 | tokens = batch.tokens 112 | lengths = batch.lengths 113 | if self.use_cuda: 114 | tokens = tokens.cuda() 115 | lengths = lengths.cuda() 116 | self.encoder.eval() 117 | embeddings = self.encoder(tokens, 
lengths)['sentemb'] 118 | return embeddings.detach().cpu().numpy() 119 | 120 | def _tokenize(self, line): 121 | tokens = SPACE_NORMALIZER.sub(" ", line).strip().split() 122 | ntokens = len(tokens) 123 | ids = torch.LongTensor(ntokens + 1) 124 | for i, token in enumerate(tokens): 125 | ids[i] = self.dictionary.get(token, self.unk_index) 126 | ids[ntokens] = self.eos_index 127 | return ids 128 | 129 | def _make_batches(self, lines): 130 | tokens = [self._tokenize(line) for line in lines] 131 | lengths = np.array([t.numel() for t in tokens]) 132 | indices = np.argsort(-lengths, kind=self.sort_kind) 133 | 134 | def batch(tokens, lengths, indices): 135 | toks = tokens[0].new_full((len(tokens), tokens[0].shape[0]), self.pad_index) 136 | for i in range(len(tokens)): 137 | toks[i, -tokens[i].shape[0]:] = tokens[i] 138 | return Batch( 139 | srcs=None, 140 | tokens=toks, 141 | lengths=torch.LongTensor(lengths) 142 | ), indices 143 | 144 | batch_tokens, batch_lengths, batch_indices = [], [], [] 145 | ntokens = nsentences = 0 146 | for i in indices: 147 | if nsentences > 0 and ((self.max_tokens is not None and ntokens + lengths[i] > self.max_tokens) or 148 | (self.max_sentences is not None and nsentences == self.max_sentences)): 149 | yield batch(batch_tokens, batch_lengths, batch_indices) 150 | ntokens = nsentences = 0 151 | batch_tokens, batch_lengths, batch_indices = [], [], [] 152 | batch_tokens.append(tokens[i]) 153 | batch_lengths.append(lengths[i]) 154 | batch_indices.append(i) 155 | ntokens += tokens[i].shape[0] 156 | nsentences += 1 157 | if nsentences > 0: 158 | yield batch(batch_tokens, batch_lengths, batch_indices) 159 | 160 | def encode_sentences(self, sentences): 161 | indices = [] 162 | results = [] 163 | for batch, batch_indices in self._make_batches(sentences): 164 | indices.extend(batch_indices) 165 | results.append(self._process_batch(batch)) 166 | return np.vstack(results)[np.argsort(indices, kind=self.sort_kind)] 167 | 168 | 169 | class Encoder(nn.Module): 170 | def __init__( 171 | self, num_embeddings, padding_idx, embed_dim=320, hidden_size=512, num_layers=1, bidirectional=False, 172 | left_pad=True, padding_value=0. 
173 | ): 174 | super().__init__() 175 | 176 | self.num_layers = num_layers 177 | self.bidirectional = bidirectional 178 | self.hidden_size = hidden_size 179 | 180 | self.padding_idx = padding_idx 181 | self.embed_tokens = nn.Embedding(num_embeddings, embed_dim, padding_idx=self.padding_idx) 182 | 183 | self.lstm = nn.LSTM( 184 | input_size=embed_dim, 185 | hidden_size=hidden_size, 186 | num_layers=num_layers, 187 | bidirectional=bidirectional, 188 | ) 189 | self.left_pad = left_pad 190 | self.padding_value = padding_value 191 | 192 | self.output_units = hidden_size 193 | if bidirectional: 194 | self.output_units *= 2 195 | 196 | def forward(self, src_tokens, src_lengths): 197 | if self.left_pad: 198 | # convert left-padding to right-padding 199 | src_tokens = convert_padding_direction( 200 | src_tokens, 201 | self.padding_idx, 202 | left_to_right=True, 203 | ) 204 | 205 | bsz, seqlen = src_tokens.size() 206 | 207 | # embed tokens 208 | x = self.embed_tokens(src_tokens) 209 | 210 | # B x T x C -> T x B x C 211 | x = x.transpose(0, 1) 212 | 213 | # pack embedded source tokens into a PackedSequence 214 | packed_x = nn.utils.rnn.pack_padded_sequence(x, src_lengths.data.tolist()) 215 | 216 | # apply LSTM 217 | if self.bidirectional: 218 | state_size = 2 * self.num_layers, bsz, self.hidden_size 219 | else: 220 | state_size = self.num_layers, bsz, self.hidden_size 221 | h0 = x.data.new(*state_size).zero_() 222 | c0 = x.data.new(*state_size).zero_() 223 | packed_outs, (final_hiddens, final_cells) = self.lstm(packed_x, (h0, c0)) 224 | 225 | # unpack outputs and apply dropout 226 | x, _ = nn.utils.rnn.pad_packed_sequence(packed_outs, padding_value=self.padding_value) 227 | assert list(x.size()) == [seqlen, bsz, self.output_units] 228 | 229 | if self.bidirectional: 230 | def combine_bidir(outs): 231 | return torch.cat([ 232 | torch.cat([outs[2 * i], outs[2 * i + 1]], dim=0).view(1, bsz, self.output_units) 233 | for i in range(self.num_layers) 234 | ], dim=0) 235 | 236 | final_hiddens = combine_bidir(final_hiddens) 237 | final_cells = combine_bidir(final_cells) 238 | 239 | encoder_padding_mask = src_tokens.eq(self.padding_idx).t() 240 | 241 | # Set padded outputs to -inf so they are not selected by max-pooling 242 | padding_mask = src_tokens.eq(self.padding_idx).t().unsqueeze(-1) 243 | if padding_mask.any(): 244 | x = x.float().masked_fill_(padding_mask, float('-inf')).type_as(x) 245 | 246 | # Build the sentence embedding by max-pooling over the encoder outputs 247 | sentemb = x.max(dim=0)[0] 248 | 249 | return { 250 | 'sentemb': sentemb, 251 | 'encoder_out': (x, final_hiddens, final_cells), 252 | 'encoder_padding_mask': encoder_padding_mask if encoder_padding_mask.any() else None 253 | } 254 | 255 | 256 | def EncodeLoad(args): 257 | args.buffer_size = max(args.buffer_size, 1) 258 | assert not args.max_sentences or args.max_sentences <= args.buffer_size, \ 259 | '--max-sentences/--batch-size cannot be larger than --buffer-size' 260 | 261 | print(' - loading encoder', args.encoder) 262 | return SentenceEncoder(args.encoder, 263 | max_sentences=args.max_sentences, 264 | max_tokens=args.max_tokens, 265 | cpu=args.cpu, 266 | verbose=args.verbose) 267 | 268 | 269 | def EncodeTime(t): 270 | t = int(time.time() - t) 271 | if t < 1000: 272 | print(' in {:d}s'.format(t)) 273 | else: 274 | print(' in {:d}m{:d}s'.format(t // 60, t % 60)) 275 | 276 | 277 | # Encode sentences (existing file pointers) 278 | def EncodeFilep(encoder, inp_file, out_file, buffer_size=10000, verbose=False): 279 | n = 0 280 | t = 
time.time() 281 | for sentences in buffered_read(inp_file, buffer_size): 282 | encoder.encode_sentences(sentences).tofile(out_file) 283 | n += len(sentences) 284 | if verbose and n % 10000 == 0: 285 | print('\r - Encoder: {:d} sentences'.format(n), end='') 286 | if verbose: 287 | print('\r - Encoder: {:d} sentences'.format(n), end='') 288 | EncodeTime(t) 289 | 290 | 291 | # Encode sentences (file names) 292 | def EncodeFile(encoder, inp_fname, out_fname, 293 | buffer_size=10000, verbose=False, over_write=False, 294 | inp_encoding='utf-8'): 295 | # TODO :handle over write 296 | if not os.path.isfile(out_fname): 297 | if verbose: 298 | print(' - Encoder: {} to {}'. 299 | format(os.path.basename(inp_fname) if len(inp_fname) > 0 else 'stdin', 300 | os.path.basename(out_fname))) 301 | fin = open(inp_fname, 'r', encoding=inp_encoding, errors='surrogateescape') if len(inp_fname) > 0 else sys.stdin 302 | fout = open(out_fname, mode='wb') 303 | EncodeFilep(encoder, fin, fout, buffer_size=buffer_size, verbose=verbose) 304 | fin.close() 305 | fout.close() 306 | elif not over_write and verbose: 307 | print(' - Encoder: {} exists already'.format(os.path.basename(out_fname))) 308 | 309 | 310 | # Load existing embeddings 311 | def EmbedLoad(fname, dim=1024, verbose=False): 312 | x = np.fromfile(fname, dtype=np.float32, count=-1) 313 | x.resize(x.shape[0] // dim, dim) 314 | if verbose: 315 | print(' - Embeddings: {:s}, {:d}x{:d}'.format(fname, x.shape[0], dim)) 316 | return x 317 | 318 | 319 | # Get memory mapped embeddings 320 | def EmbedMmap(fname, dim=1024, dtype=np.float32, verbose=False): 321 | nbex = int(os.path.getsize(fname) / dim / np.dtype(dtype).itemsize) 322 | E = np.memmap(fname, mode='r', dtype=dtype, shape=(nbex, dim)) 323 | if verbose: 324 | print(' - embeddings on disk: {:s} {:d} x {:d}'.format(fname, nbex, dim)) 325 | return E 326 | 327 | 328 | if __name__ == '__main__': 329 | parser = argparse.ArgumentParser(description='LASER: Embed sentences') 330 | parser.add_argument('--encoder', type=str, required=True, 331 | help='encoder to be used') 332 | parser.add_argument('--token-lang', type=str, default='--', 333 | help="Perform tokenization with given language ('--' for no tokenization)") 334 | parser.add_argument('--bpe-codes', type=str, default=None, 335 | help='Apply BPE using specified codes') 336 | parser.add_argument('-v', '--verbose', action='store_true', 337 | help='Detailed output') 338 | 339 | parser.add_argument('-o', '--output', required=True, 340 | help='Output sentence embeddings') 341 | parser.add_argument('--buffer-size', type=int, default=10000, 342 | help='Buffer size (sentences)') 343 | parser.add_argument('--max-tokens', type=int, default=12000, 344 | help='Maximum number of tokens to process in a batch') 345 | parser.add_argument('--max-sentences', type=int, default=None, 346 | help='Maximum number of sentences to process in a batch') 347 | parser.add_argument('--cpu', action='store_true', 348 | help='Use CPU instead of GPU') 349 | parser.add_argument('--stable', action='store_true', 350 | help='Use stable merge sort instead of quick sort') 351 | args = parser.parse_args() 352 | 353 | args.buffer_size = max(args.buffer_size, 1) 354 | assert not args.max_sentences or args.max_sentences <= args.buffer_size, \ 355 | '--max-sentences/--batch-size cannot be larger than --buffer-size' 356 | 357 | if args.verbose: 358 | print(' - Encoder: loading {}'.format(args.encoder)) 359 | encoder = SentenceEncoder(args.encoder, 360 | max_sentences=args.max_sentences, 361 | 
max_tokens=args.max_tokens, 362 | sort_kind='mergesort' if args.stable else 'quicksort', 363 | cpu=args.cpu) 364 | 365 | with tempfile.TemporaryDirectory() as tmpdir: 366 | ifname = '' # stdin will be used 367 | if args.token_lang != '--': 368 | tok_fname = os.path.join(tmpdir, 'tok') 369 | Token(ifname, 370 | tok_fname, 371 | lang=args.token_lang, 372 | romanize=True if args.token_lang == 'el' else False, 373 | lower_case=True, gzip=False, 374 | verbose=args.verbose, over_write=False) 375 | ifname = tok_fname 376 | 377 | if args.bpe_codes: 378 | bpe_fname = os.path.join(tmpdir, 'bpe') 379 | BPEfastApply(ifname, 380 | bpe_fname, 381 | args.bpe_codes, 382 | verbose=args.verbose, over_write=False) 383 | ifname = bpe_fname 384 | 385 | EncodeFile(encoder, 386 | ifname, 387 | args.output, 388 | verbose=args.verbose, over_write=False, 389 | buffer_size=args.buffer_size) 390 | -------------------------------------------------------------------------------- /batch_filtering/source/lib/indexing.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | # -------------------------------------------------------- 14 | # 15 | # tools for indexing and search with FAISS 16 | 17 | import faiss 18 | import os.path 19 | import sys 20 | import numpy as np 21 | 22 | #------------------------------------------------------------- 23 | # Get list of fnames: 24 | # - we loop over the list of given languages 25 | # - for each language, we also check if there are splitted files .%03d 26 | 27 | def SplitFnames(par_fname, langs): 28 | fnames = [] 29 | for l in langs: 30 | fname = par_fname + '.' + l 31 | if os.path.isfile(fname): 32 | fnames.append(fname) 33 | for i in range(1000): 34 | fname = par_fname + '.' 
+ l + '.{:03d}'.format(i) 35 | if os.path.isfile(fname): 36 | fnames.append(fname) 37 | if len(fnames) == 0: 38 | print("ERROR: no embeddings found in {:s}*".format(par_fname)) 39 | sys.exit(1) 40 | return fnames 41 | 42 | def SplitOpen(par_fname, langs, dim, dtype, verbose=False): 43 | M = [] 44 | nf = 0 45 | nc = 0 46 | print('Reading sentence embeddings') 47 | print(' - memory mapped files {:s}'.format(par_fname)) 48 | for fname in SplitFnames(par_fname, langs): 49 | n = int(os.path.getsize(fname) / dim / np.dtype(dtype).itemsize) 50 | if verbose: 51 | print(' - {:s}: {:d} x {:d}'.format(fname, n, dim)) 52 | Mi = np.memmap(fname, mode='r', dtype=dtype, shape=(n, dim)) 53 | nc += n 54 | nf += 1 55 | M.append(Mi) 56 | print(' - total of {:d} files: {:d} x {:d}'.format(nf, nc, dim)) 57 | return M 58 | 59 | def SplitAccess(M, idx): 60 | i = idx 61 | for Mi in M: 62 | n = Mi.shape[0] 63 | if i < n: 64 | return Mi[i,:] 65 | i -= n 66 | print('ERROR: index {:d} is too large form memory mapped files'.format(idx)) 67 | sys.exit(1) 68 | 69 | 70 | ############################################################################### 71 | # create an FAISS index on the given data 72 | 73 | def IndexCreate(dname, idx_type, 74 | verbose=False, normalize=True, save_index=False, dim=1024): 75 | 76 | assert idx_type == 'FlatL2', 'only FlatL2 index is currently supported' 77 | x = np.fromfile(dname, dtype=np.float32, count=-1) 78 | nbex = x.shape[0] // dim 79 | print(' - embedding: {:s} {:d} examples of dim {:d}' 80 | .format(dname, nbex, dim)) 81 | x.resize(nbex, dim) 82 | print(' - creating FAISS index') 83 | idx = faiss.IndexFlatL2(dim) 84 | if normalize: 85 | faiss.normalize_L2(x) 86 | idx.add(x) 87 | if save_index: 88 | iname = 'TODO' 89 | print(' - saving index into ' + iname) 90 | faiss.write_index(idx, iname) 91 | return x, idx 92 | 93 | 94 | ############################################################################### 95 | # search closest vector for all languages pairs and calculate error rate 96 | 97 | def IndexSearchMultiple(data, idx, verbose=False, texts=None, print_errors=False): 98 | nl = len(data) 99 | nbex = data[0].shape[0] 100 | err = np.zeros((nl, nl)).astype(float) 101 | ref = np.linspace(0, nbex-1, nbex).astype(int) # [0, nbex) 102 | if verbose: 103 | if texts is None: 104 | print('Calculating similarity error (indices):') 105 | else: 106 | print('Calculating similarity error (textual):') 107 | for i1 in range(nl): 108 | for i2 in range(nl): 109 | if i1 != i2: 110 | D, I = idx[i2].search(data[i1], 1) 111 | if texts: # do textual comparison 112 | e1 = 0 113 | for p in range(I.shape[0]): 114 | if texts[i2][p] != texts[i2][I[p,0]]: 115 | e1 += 1 116 | if print_errors: 117 | print('Error {:s}\n {:s}' 118 | .format(texts[i2][p].strip(), texts[i2][I[p,0]].strip())) 119 | err[i1, i2] = e1 / nbex 120 | else: # do index based comparision 121 | err[i1, i2] \ 122 | = (nbex - np.equal(I.reshape(nbex), ref) 123 | .astype(int).sum()) / nbex 124 | if verbose: 125 | print(' - similarity error {:s}/{:s}: {:5d}={:5.2f}%' 126 | .format(args.langs[i1], args.langs[i2], 127 | err[i1, i2], 100.0 * err[i1, i2])) 128 | return err 129 | 130 | 131 | ############################################################################### 132 | # print confusion matrix 133 | 134 | def IndexPrintConfusionMatrix(err, langs): 135 | nl = len(langs) 136 | assert nl == err.shape[0], 'size of errror matrix doesn not match' 137 | print('Confusion matrix:') 138 | print('{:8s}'.format('langs'), end='') 139 | for i2 in range(nl): 140 
| print('{:8s} '.format(langs[i2]), end='') 141 | print('{:8s}'.format('avg')) 142 | for i1 in range(nl): 143 | print('{:3s}'.format(langs[i1]), end='') 144 | for i2 in range(nl): 145 | print('{:8.2f}%'.format(100 * err[i1, i2]), end='') 146 | print('{:8.2f}%'.format(100 * err[i1, :].sum() / (nl-1))) 147 | 148 | print('avg', end='') 149 | for i2 in range(nl): 150 | print('{:8.2f}%'.format(100 * err[:, i2].sum() / (nl-1)), end='') 151 | 152 | # global average 153 | print('{:8.2f}%'.format(100 * err.sum() / (nl-1) / nl)) 154 | 155 | 156 | ############################################################################### 157 | # Load an FAISS index 158 | 159 | def IndexLoad(idx_name, nprobe, gpu=False): 160 | print('Reading FAISS index') 161 | print(' - index: {:s}'.format(idx_name)) 162 | index = faiss.read_index(idx_name) 163 | print(' - found {:d} sentences of dim {:d}'.format(index.ntotal, index.d)) 164 | print(' - setting nbprobe to {:d}'.format(nprobe)) 165 | if gpu: 166 | print(' - transfer index to %d GPUs ' % faiss.get_num_gpus()) 167 | #co = faiss.GpuMultipleClonerOptions() 168 | #co.shard = True 169 | index = faiss.index_cpu_to_all_gpus(index) # co=co 170 | faiss.GpuParameterSpace().set_index_parameter(index, 'nprobe', nprobe) 171 | return index 172 | 173 | 174 | ############################################################################### 175 | # Opens a text file with the sentences corresponding to the indices used 176 | # by an FAISS index 177 | # We also need the reference files with the byte offsets to the beginning 178 | # of each sentence 179 | # optionnally: array with number of words per sentence 180 | # All arrays are memory mapped 181 | 182 | def IndexTextOpen(txt_fname): 183 | print('Reading text corpus') 184 | print(' - texts: {:s}'.format(txt_fname)) 185 | txt_mmap = np.memmap(txt_fname, mode='r', dtype=np.uint8) 186 | fname = txt_fname.replace('.txt', '.ref.bin32') 187 | if os.path.isfile(fname): 188 | print(' - sentence start offsets (32 bit): {}'.format(fname)) 189 | ref_mmap = np.memmap(fname, mode='r', dtype=np.uint32) 190 | else: 191 | fname = txt_fname.replace('.txt', '.ref.bin64') 192 | if os.path.isfile(fname): 193 | print(' - sentence start offsets (64 bit): {}'.format(fname)) 194 | ref_mmap = np.memmap(fname, mode='r', dtype=np.uint64) 195 | else: 196 | print('ERROR: no file with sentence start offsets found') 197 | sys.exit(1) 198 | print(' - found {:d} sentences'.format(ref_mmap.shape[0])) 199 | 200 | nbw_mmap = None 201 | fname = txt_fname.replace('.txt', '.nw.bin8') 202 | if os.path.isfile(fname): 203 | print(' - word counts: {:s}'.format(fname)) 204 | nbw_mmap = np.memmap(fname, mode='r', dtype=np.uint8) 205 | 206 | M = None 207 | fname = txt_fname.replace('.txt', '.meta') 208 | if os.path.isfile(fname): 209 | M = [] 210 | n = 0 211 | print(' - metafile: {:s}'.format(fname)) 212 | with open(fname, 'r') as fp: 213 | for line in fp: 214 | fields = line.strip().split() 215 | if len(fields) != 2: 216 | print('ERROR: format error in meta file') 217 | sys.exit(1) 218 | n += int(fields[1]) 219 | M.append({'lang': fields[0], 'n': n}) 220 | print(' - found {:d} languages:'.format(len(M)), end='') 221 | for L in M: 222 | print(' {:s}'.format(L['lang']), end='') 223 | print('') 224 | 225 | return txt_mmap, ref_mmap, nbw_mmap, M 226 | 227 | 228 | ############################################################################### 229 | # Return the text for the given index 230 | 231 | def IndexTextQuery(txt_mmap, ref_mmap, idx): 232 | p = int(ref_mmap[idx]) # get 
starting byte position 233 | i = 0 234 | dim = 10000 # max sentence length in bytes 235 | b = bytearray(dim) 236 | # find EOL 237 | while txt_mmap[p+i] != 10 and i < dim: 238 | b[i] = txt_mmap[p+i] 239 | i += 1 240 | 241 | return b[0:i].decode('utf-8') 242 | 243 | 244 | ############################################################################### 245 | # Search the [k] nearest vectors of [x] in the given index 246 | # and return the text lines 247 | 248 | def IndexSearchKNN(index, x, T, R, kmax=1, Dmax=1.0, dedup=True): 249 | D, I = index.search(x, kmax) 250 | prev = {} # for depuplication 251 | res = [] 252 | for n in range(x.shape[0]): 253 | for i in range(kmax): 254 | txt = IndexTextQuery(T, R, I[n, i]) 255 | if (dedup and txt not in prev) and D[n, i] <= Dmax: 256 | prev[txt] = 1 257 | res.append([txt, D[n, i]]) 258 | return res 259 | -------------------------------------------------------------------------------- /batch_filtering/source/lib/romanize_lc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | # -------------------------------------------------------- 14 | # 15 | # Romanize and lower case text 16 | 17 | import os 18 | import sys 19 | import argparse 20 | from transliterate import translit, get_available_language_codes 21 | 22 | parser = argparse.ArgumentParser( 23 | formatter_class=argparse.RawDescriptionHelpFormatter, 24 | description="Calculate multilingual sentence encodings") 25 | parser.add_argument( 26 | '--input', '-i', type=argparse.FileType('r', encoding='UTF-8'), 27 | default=sys.stdin, 28 | metavar='PATH', 29 | help="Input text file (default: standard input).") 30 | parser.add_argument( 31 | '--output', '-o', type=argparse.FileType('w', encoding='UTF-8'), 32 | default=sys.stdout, 33 | metavar='PATH', 34 | help="Output text file (default: standard output).") 35 | parser.add_argument( 36 | '--language', '-l', type=str, 37 | metavar='STR', default="none", 38 | help="perform transliteration into Roman characters" 39 | " from the specified language (default none)") 40 | parser.add_argument( 41 | '--preserve-case', '-C', action='store_true', 42 | help="Preserve case of input texts (default is all lower case)") 43 | 44 | args = parser.parse_args() 45 | 46 | for line in args.input: 47 | if args.language != "none": 48 | line = translit(line, args.language, reversed=True) 49 | if not args.preserve_case: 50 | line = line.lower() 51 | args.output.write(line) 52 | -------------------------------------------------------------------------------- /batch_filtering/source/lib/text_processing.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 
7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | # -------------------------------------------------------- 14 | # 15 | # Helper functions for tokenization and BPE 16 | 17 | import os 18 | import sys 19 | import tempfile 20 | import fastBPE 21 | import numpy as np 22 | from subprocess import run, check_output, DEVNULL 23 | 24 | # get environment 25 | assert os.environ.get('LASER'), 'Please set the enviornment variable LASER' 26 | LASER = os.environ['LASER'] 27 | 28 | FASTBPE = LASER + '/tools-external/fastBPE/fast' 29 | MOSES_BDIR = LASER + '/tools-external/moses-tokenizer/tokenizer/' 30 | MOSES_TOKENIZER = MOSES_BDIR + 'tokenizer.perl -q -no-escape -threads 20 -l ' 31 | MOSES_LC = MOSES_BDIR + 'lowercase.perl' 32 | NORM_PUNC = MOSES_BDIR + 'normalize-punctuation.perl -l ' 33 | DESCAPE = MOSES_BDIR + 'deescape-special-chars.perl' 34 | REM_NON_PRINT_CHAR = MOSES_BDIR + 'remove-non-printing-char.perl' 35 | 36 | # Romanization (Greek only) 37 | ROMAN_LC = 'python3 ' + LASER + '/source/lib/romanize_lc.py -l ' 38 | 39 | # Mecab tokenizer for Japanese 40 | MECAB = LASER + '/tools-external/mecab' 41 | 42 | 43 | ############################################################################### 44 | # 45 | # Tokenize a line of text 46 | # 47 | ############################################################################### 48 | 49 | def TokenLine(line, lang='en', lower_case=True, romanize=False): 50 | assert lower_case, 'lower case is needed by all the models' 51 | roman = lang if romanize else 'none' 52 | tok = check_output( 53 | REM_NON_PRINT_CHAR 54 | + '|' + NORM_PUNC + lang 55 | + '|' + DESCAPE 56 | + '|' + MOSES_TOKENIZER + lang 57 | + ('| python3 -m jieba -d ' if lang == 'zh' else '') 58 | + ('|' + MECAB + '/bin/mecab -O wakati -b 50000 ' if lang == 'ja' else '') 59 | + '|' + ROMAN_LC + roman, 60 | input=line, 61 | encoding='UTF-8', 62 | shell=True) 63 | return tok.strip() 64 | 65 | 66 | ############################################################################### 67 | # 68 | # Tokenize a file 69 | # 70 | ############################################################################### 71 | 72 | def Token(inp_fname, out_fname, lang='en', 73 | lower_case=True, romanize=False, descape=False, 74 | verbose=False, over_write=False, gzip=False): 75 | assert lower_case, 'lower case is needed by all the models' 76 | assert not over_write, 'over-write is not yet implemented' 77 | if not os.path.isfile(out_fname): 78 | cat = 'zcat ' if gzip else 'cat ' 79 | roman = lang if romanize else 'none' 80 | # handle some iso3 langauge codes 81 | if lang in ('cmn', 'wuu', 'yue'): 82 | lang = 'zh' 83 | if lang in ('jpn'): 84 | lang = 'ja' 85 | if verbose: 86 | print(' - Tokenizer: {} in language {} {} {}' 87 | .format(os.path.basename(inp_fname), lang, 88 | '(gzip)' if gzip else '', 89 | '(de-escaped)' if descape else '', 90 | '(romanized)' if romanize else '')) 91 | run(cat + inp_fname 92 | + '|' + REM_NON_PRINT_CHAR 93 | + '|' + NORM_PUNC + lang 94 | + ('|' + DESCAPE if descape else '') 95 | + '|' + MOSES_TOKENIZER + lang 96 | + ('| python3 -m jieba -d ' if lang == 'zh' else '') 97 | + ('|' + MECAB + '/bin/mecab -O wakati -b 50000 ' if lang == 'ja' else '') 98 | + '|' + ROMAN_LC + roman 99 | + '>' + out_fname, 100 | env=dict(os.environ, LD_LIBRARY_PATH=MECAB + '/lib'), 101 | shell=True) 102 | elif not over_write and verbose: 103 | print(' - 
Tokenizer: {} exists already' 104 | .format(os.path.basename(out_fname), lang)) 105 | 106 | 107 | ############################################################################### 108 | # 109 | # Apply FastBPE on one line of text 110 | # 111 | ############################################################################### 112 | 113 | def BPEfastLoad(line, bpe_codes): 114 | bpe_vocab = bpe_codes.replace('fcodes', 'fvocab') 115 | return fastBPE.fastBPE(bpe_codes, bpe_vocab) 116 | 117 | def BPEfastApplyLine(line, bpe): 118 | return bpe.apply([line])[0] 119 | 120 | 121 | ############################################################################### 122 | # 123 | # Apply FastBPE on a whole file 124 | # 125 | ############################################################################### 126 | 127 | def BPEfastApply(inp_fname, out_fname, bpe_codes, 128 | verbose=False, over_write=False): 129 | if not os.path.isfile(out_fname): 130 | if verbose: 131 | print(' - fast BPE: processing {}' 132 | .format(os.path.basename(inp_fname))) 133 | bpe_vocab = bpe_codes.replace('fcodes', 'fvocab') 134 | if not os.path.isfile(bpe_vocab): 135 | print(' - fast BPE: focab file not found {}'.format(bpe_vocab)) 136 | bpe_vocab = '' 137 | run(FASTBPE + ' applybpe ' 138 | + out_fname + ' ' + inp_fname 139 | + ' ' + bpe_codes 140 | + ' ' + bpe_vocab, shell=True, stderr=DEVNULL) 141 | elif not over_write and verbose: 142 | print(' - fast BPE: {} exists already' 143 | .format(os.path.basename(out_fname))) 144 | 145 | 146 | ############################################################################### 147 | # 148 | # Split long lines into multiple sentences at "." 149 | # 150 | ############################################################################### 151 | 152 | def SplitLines(ifname, of_txt, of_sid): 153 | if os.path.isfile(of_txt): 154 | print(' - SplitLines: {} already exists'.format(of_txt)) 155 | return 156 | nl = 0 157 | nl_sp = 0 158 | maxw = 0 159 | maxw_sp = 0 160 | fp_sid = open(of_sid, 'w') 161 | fp_txt = open(of_txt, 'w') 162 | with open(ifname, 'r') as ifp: 163 | for line in ifp: 164 | print('{:d}'.format(nl), file=fp_sid) # store current sentence ID 165 | nw = 0 166 | words = line.strip().split() 167 | maxw = max(maxw, len(words)) 168 | for i, word in enumerate(words): 169 | if word == '.' 
and i != len(words)-1: 170 | if nw > 0: 171 | print(' {}'.format(word), file=fp_txt) 172 | else: 173 | print('{}'.format(word), file=fp_txt) 174 | # store current sentence ID 175 | print('{:d}'.format(nl), file=fp_sid) 176 | nl_sp += 1 177 | maxw_sp = max(maxw_sp, nw+1) 178 | nw = 0 179 | else: 180 | if nw > 0: 181 | print(' {}'.format(word), end='', file=fp_txt) 182 | else: 183 | print('{}'.format(word), end='', file=fp_txt) 184 | nw += 1 185 | if nw > 0: 186 | # handle remainder of sentence 187 | print('', file=fp_txt) 188 | nl_sp += 1 189 | maxw_sp = max(maxw_sp, nw+1) 190 | nl += 1 191 | print(' - Split sentences: {}'.format(ifname)) 192 | print(' - lines/max words: {:d}/{:d} -> {:d}/{:d}' 193 | .format(nl, maxw, nl_sp, maxw_sp)) 194 | fp_sid.close() 195 | fp_txt.close() 196 | 197 | 198 | ############################################################################### 199 | # 200 | # Join embeddings of previously split lines (average) 201 | # 202 | ############################################################################### 203 | 204 | def JoinEmbed(if_embed, sid_fname, of_embed, dim=1024): 205 | if os.path.isfile(of_embed): 206 | print(' - JoinEmbed: {} already exists'.format(of_embed)) 207 | return 208 | # read the input embeddings 209 | em_in = np.fromfile(if_embed, dtype=np.float32, count=-1).reshape(-1, dim) 210 | ninp = em_in.shape[0] 211 | print(' - Combine embeddings:') 212 | print(' input: {:s} {:d} sentences'.format(if_embed, ninp)) 213 | 214 | # get all sentence IDs 215 | sid = np.empty(ninp, dtype=np.int32) 216 | i = 0 217 | with open(sid_fname, 'r') as fp_sid: 218 | for line in fp_sid: 219 | sid[i] = int(line) 220 | i += 1 221 | nout = sid.max() + 1 222 | print(' IDs: {:s}, {:d} sentences'.format(sid_fname, nout)) 223 | 224 | # combining 225 | em_out = np.zeros((nout, dim), dtype=np.float32) 226 | cnt = np.zeros(nout, dtype=np.int32) 227 | for i in range(ninp): 228 | idx = sid[i] 229 | em_out[idx] += em_in[i] # cumulate sentence vectors 230 | cnt[idx] += 1 231 | 232 | if (cnt == 0).astype(int).sum() > 0: 233 | print('ERROR: missing lines') 234 | sys.exit(1) 235 | 236 | # normalize 237 | for i in range(nout): 238 | em_out[i] /= cnt[i] 239 | 240 | print(' output: {:s}'.format(of_embed)) 241 | em_out.tofile(of_embed) 242 | -------------------------------------------------------------------------------- /batch_filtering/source/mine_bitexts.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # All rights reserved. 4 | # 5 | # This source code is licensed under the BSD-style license found in the 6 | # LICENSE file in the root directory of this source tree. 
7 | # 8 | # LASER Language-Agnostic SEntence Representations 9 | # is a toolkit to calculate multilingual sentence embeddings 10 | # and to use them for document classification, bitext filtering 11 | # and mining 12 | # 13 | # -------------------------------------------------------- 14 | # 15 | # Tool to calculate to embed a text file 16 | # The functions can be also imported into another Python code 17 | 18 | import os 19 | import sys 20 | import faiss 21 | import argparse 22 | import tempfile 23 | import numpy as np 24 | 25 | # get environment 26 | assert os.environ.get('LASER'), 'Please set the enviornment variable LASER' 27 | LASER = os.environ['LASER'] 28 | 29 | sys.path.append(LASER + '/source') 30 | sys.path.append(LASER + '/source/tools') 31 | from embed import SentenceEncoder, EncodeLoad, EncodeFile, EmbedLoad 32 | from text_processing import Token, BPEfastApply 33 | 34 | 35 | ############################################################################### 36 | # 37 | # Load texts and remove duplicates 38 | # 39 | ############################################################################### 40 | 41 | def TextLoadUnify(fname, args): 42 | if args.verbose: 43 | print(' - loading texts {:s}: '.format(fname), end='') 44 | 45 | fin = open(fname, encoding=args.encoding, errors='surrogateescape') 46 | inds = [] 47 | sents = [] 48 | 49 | if args.mode == 'score': 50 | for i, line in enumerate(fin): 51 | inds.append(i) 52 | sents.append(line[:-1]) 53 | 54 | return inds, sents 55 | 56 | sent2ind = {} 57 | n = 0 58 | nu = 0 59 | for line in fin: 60 | new_ind = len(sent2ind) 61 | inds.append(sent2ind.setdefault(line, new_ind)) 62 | if args.unify: 63 | if inds[-1] == new_ind: 64 | sents.append(line[:-1]) 65 | nu += 1 66 | else: 67 | sents.append(line[:-1]) 68 | nu += 1 69 | n += 1 70 | if args.verbose: 71 | print('{:d} lines, {:d} unique'.format(n, nu)) 72 | del sent2ind 73 | return inds, sents 74 | 75 | 76 | ############################################################################### 77 | # 78 | # Wrapper for knn on CPU/GPU 79 | # 80 | ############################################################################### 81 | 82 | def knn(x, y, k, use_gpu): 83 | return knnGPU(x, y, k) if use_gpu else knnCPU(x, y, k) 84 | 85 | 86 | ############################################################################### 87 | # 88 | # Perform knn on GPU 89 | # 90 | ############################################################################### 91 | 92 | def knnGPU(x, y, k, mem=5*1024*1024*1024): 93 | dim = x.shape[1] 94 | batch_size = mem // (dim*4) 95 | sim = np.zeros((x.shape[0], k), dtype=np.float32) 96 | ind = np.zeros((x.shape[0], k), dtype=np.int64) 97 | for xfrom in range(0, x.shape[0], batch_size): 98 | xto = min(xfrom + batch_size, x.shape[0]) 99 | bsims, binds = [], [] 100 | for yfrom in range(0, y.shape[0], batch_size): 101 | yto = min(yfrom + batch_size, y.shape[0]) 102 | # print('{}-{} -> {}-{}'.format(xfrom, xto, yfrom, yto)) 103 | idx = faiss.IndexFlatIP(dim) 104 | idx = faiss.index_cpu_to_all_gpus(idx) 105 | idx.add(y[yfrom:yto]) 106 | bsim, bind = idx.search(x[xfrom:xto], min(k, yto-yfrom)) 107 | bsims.append(bsim) 108 | binds.append(bind + yfrom) 109 | del idx 110 | bsims = np.concatenate(bsims, axis=1) 111 | binds = np.concatenate(binds, axis=1) 112 | aux = np.argsort(-bsims, axis=1) 113 | for i in range(xfrom, xto): 114 | for j in range(k): 115 | sim[i, j] = bsims[i-xfrom, aux[i-xfrom, j]] 116 | ind[i, j] = binds[i-xfrom, aux[i-xfrom, j]] 117 | return sim, ind 118 | 119 | 120 | 
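# Illustrative usage sketch (added note; not part of the original LASER source): both
# knnGPU() and knnCPU() expect row-wise L2-normalized float32 matrices, so the inner
# products returned by faiss.IndexFlatIP are cosine similarities. For example:
#
#   x = np.random.rand(100, 1024).astype(np.float32); faiss.normalize_L2(x)
#   y = np.random.rand(500, 1024).astype(np.float32); faiss.normalize_L2(y)
#   sim, ind = knn(x, y, 4, use_gpu=False)   # sim and ind both have shape (100, 4)
#
# knnGPU() additionally processes both matrices in batches sized by the `mem` budget
# (in bytes) and merges the per-batch results, so corpora larger than GPU memory can
# still be searched.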
############################################################################### 121 | # 122 | # Perform knn on CPU 123 | # 124 | ############################################################################### 125 | 126 | def knnCPU(x, y, k): 127 | dim = x.shape[1] 128 | idx = faiss.IndexFlatIP(dim) 129 | idx.add(y) 130 | sim, ind = idx.search(x, k) 131 | return sim, ind 132 | 133 | 134 | ############################################################################### 135 | # 136 | # Scoring 137 | # 138 | ############################################################################### 139 | 140 | def score(x, y, fwd_mean, bwd_mean, margin): 141 | return margin(x.dot(y), (fwd_mean + bwd_mean) / 2) 142 | 143 | def score_with_trans(x, y, xTrans, fwd_mean, bwd_mean, xTrans_fwd_mean, xTrans_bwd_mean, margin, verbose=False): 144 | return margin(x.dot(y), (fwd_mean + bwd_mean) / 2, xTrans.dot(y), (xTrans_fwd_mean + xTrans_bwd_mean) / 2) 145 | 146 | def score_candidates(x, y, candidate_inds, fwd_mean, bwd_mean, margin, verbose=False): 147 | if verbose: 148 | print(' - scoring {:d} candidates'.format(x.shape[0])) 149 | scores = np.zeros(candidate_inds.shape) 150 | for i in range(scores.shape[0]): 151 | for j in range(scores.shape[1]): 152 | k = candidate_inds[i, j] 153 | scores[i, j] = score(x[i], y[k], fwd_mean[i], bwd_mean[k], margin) 154 | return scores 155 | 156 | 157 | def score_candidates_with_trans(x, y, xTrans, candidate_inds, fwd_mean, bwd_mean, xTrans_fwd_mean, xTrans_bwd_mean, margin, verbose=False): 158 | if verbose: 159 | print(' - scoring {:d} candidates'.format(x.shape[0])) 160 | scores = np.zeros(candidate_inds.shape) 161 | for i in range(scores.shape[0]): 162 | for j in range(scores.shape[1]): 163 | k = candidate_inds[i, j] 164 | scores[i, j] = score_with_trans(x[i], y[k], xTrans[i], fwd_mean[i], bwd_mean[k], xTrans_fwd_mean[i], xTrans_bwd_mean[k], margin) 165 | return scores 166 | 167 | 168 | ############################################################################### 169 | # 170 | # Main 171 | # 172 | ############################################################################### 173 | 174 | if __name__ == '__main__': 175 | parser = argparse.ArgumentParser(description='LASER: Mine bitext') 176 | parser.add_argument('src', 177 | help='Source language corpus') 178 | parser.add_argument('trg', 179 | help='Target language corpus') 180 | parser.add_argument('--encoding', default='utf-8', 181 | help='Character encoding for input/output') 182 | parser.add_argument('--src-lang', required=True, 183 | help='Source language id') 184 | parser.add_argument('--trg-lang', required=True, 185 | help='Target language id') 186 | parser.add_argument('--output', required=True, 187 | help='Output file') 188 | parser.add_argument('--threshold', type=float, default=0, 189 | help='Threshold on extracted bitexts') 190 | 191 | # mining params 192 | parser.add_argument('--mode', 193 | choices=['search', 'score', 'mine'], required=True, 194 | help='Execution mode') 195 | parser.add_argument('-k', '--neighborhood', 196 | type=int, default=4, 197 | help='Neighborhood size') 198 | parser.add_argument('--margin', 199 | choices=['absolute', 'distance', 'ratio'], default='ratio', 200 | help='Margin function') 201 | parser.add_argument('--retrieval', 202 | choices=['fwd', 'bwd', 'max', 'intersect'], default='max', 203 | help='Retrieval strategy') 204 | parser.add_argument('--unify', action='store_true', 205 | help='Unify texts') 206 | parser.add_argument('--gpu', action='store_true', 207 | help='Run knn on all 
available GPUs') 208 | parser.add_argument('--verbose', action='store_true', 209 | help='Detailed output') 210 | 211 | # embeddings 212 | parser.add_argument('--src-embeddings', required=True, 213 | help='Precomputed source sentence embeddings') 214 | parser.add_argument('--trg-embeddings', required=True, 215 | help='Precomputed target sentence embeddings') 216 | parser.add_argument('--dim', type=int, default=1024, 217 | help='Embedding dimensionality') 218 | 219 | 220 | parser.add_argument('--trans', action='store_true', 221 | help='Use translations for scoring') 222 | 223 | 224 | args = parser.parse_args() 225 | 226 | 227 | print('LASER: tool to search, score or mine bitexts') 228 | if args.gpu: 229 | print(' - knn will run on all available GPUs (recommended)') 230 | else: 231 | print(' - knn will run on CPU (slow)') 232 | 233 | src_inds, src_sents = TextLoadUnify(args.src, args) 234 | trg_inds, trg_sents = TextLoadUnify(args.trg, args) 235 | 236 | if args.trans: 237 | srcTransFile = ".enTranslated".join(args.src.rsplit('.bn', 1)) 238 | srcTrans_inds, srcTrans_sents = TextLoadUnify(srcTransFile, args) 239 | 240 | def unique_embeddings(emb, ind, verbose=False): 241 | aux = {j: i for i, j in enumerate(ind)} 242 | if verbose: 243 | print(' - unify embeddings: {:d} -> {:d}'.format(len(emb), len(aux))) 244 | return emb[[aux[i] for i in range(len(aux))]] 245 | 246 | # load the embeddings 247 | x = EmbedLoad(args.src_embeddings, args.dim, verbose=args.verbose) 248 | 249 | if args.trans: 250 | xTrans = EmbedLoad(".enTranslated".join(args.src_embeddings.rsplit('.bn', 1)), args.dim, verbose=args.verbose) 251 | 252 | if args.unify: 253 | x = unique_embeddings(x, src_inds, args.verbose) 254 | if args.trans: 255 | xTrans = unique_embeddings(xTrans, src_inds, args.verbose) 256 | 257 | if args.trans: 258 | faiss.normalize_L2(xTrans) 259 | 260 | faiss.normalize_L2(x) 261 | y = EmbedLoad(args.trg_embeddings, args.dim, verbose=args.verbose) 262 | if args.unify: 263 | y = unique_embeddings(y, trg_inds, args.verbose) 264 | faiss.normalize_L2(y) 265 | 266 | # calculate knn in both directions 267 | if args.retrieval is not 'bwd': 268 | if args.verbose: 269 | print(' - perform {:d}-nn source against target'.format(args.neighborhood)) 270 | x2y_sim, x2y_ind = knn(x, y, min(y.shape[0], args.neighborhood), args.gpu) 271 | x2y_mean = x2y_sim.mean(axis=1) 272 | if args.trans: 273 | xTrans2y_sim, xTrans2y_ind = knn(xTrans, y, min(y.shape[0], args.neighborhood), args.gpu) 274 | xTrans2y_mean = xTrans2y_sim.mean(axis=1) 275 | 276 | if args.retrieval is not 'fwd': 277 | if args.verbose: 278 | print(' - perform {:d}-nn target against source'.format(args.neighborhood)) 279 | y2x_sim, y2x_ind = knn(y, x, min(x.shape[0], args.neighborhood), args.gpu) 280 | y2x_mean = y2x_sim.mean(axis=1) 281 | if args.trans: 282 | y2xTrans_sim, y2xTrans_ind = knn(y, xTrans, min(xTrans.shape[0], args.neighborhood), args.gpu) 283 | y2xTrans_mean = y2xTrans_sim.mean(axis=1) 284 | 285 | # margin function 286 | if args.trans: 287 | if args.margin == 'absolute': 288 | margin = lambda a, b, c, d: c / d 289 | elif args.margin == 'distance': 290 | margin = lambda a, b, c, d: 2 / (1 / (a - b) + 1 / (c - d)) 291 | else: # args.margin == 'ratio': 292 | margin = lambda a, b, c, d: 2 / (1 / (a / b) + 1 / (c / d)) 293 | else: 294 | if args.margin == 'absolute': 295 | margin = lambda a, b: a 296 | elif args.margin == 'distance': 297 | margin = lambda a, b: a - b 298 | else: # args.margin == 'ratio': 299 | margin = lambda a, b: a / b 300 | 301 | fout = 
open(args.output, mode='w', encoding=args.encoding, errors='surrogateescape') 302 | 303 | if args.mode == 'search': 304 | if args.verbose: 305 | print(' - Searching for closest sentences in target') 306 | print(' - writing alignments to {:s}'.format(args.output)) 307 | scores = score_candidates(x, y, x2y_ind, x2y_mean, y2x_mean, margin, args.verbose) 308 | best = x2y_ind[np.arange(x.shape[0]), scores.argmax(axis=1)] 309 | 310 | nbex = x.shape[0] 311 | ref = np.linspace(0, nbex-1, nbex).astype(int) # [0, nbex) 312 | err = nbex - np.equal(best.reshape(nbex), ref).astype(int).sum() 313 | print(' - errors: {:d}={:.2f}%'.format(err, 100*err/nbex)) 314 | for i in src_inds: 315 | print(trg_sents[best[i]], file=fout) 316 | 317 | elif args.mode == 'score': 318 | for i, j in zip(src_inds, trg_inds): 319 | s = score(x[i], y[j], x2y_mean[i], y2x_mean[j], margin) 320 | print(s, src_sents[i], trg_sents[j], sep='\t', file=fout) 321 | 322 | elif args.mode == 'mine': 323 | if args.verbose: 324 | print(' - mining for parallel data') 325 | if args.trans: 326 | fwd_scores = score_candidates_with_trans(x, y, xTrans, x2y_ind, x2y_mean, y2x_mean, xTrans2y_mean, y2xTrans_mean, margin, args.verbose) 327 | bwd_scores = score_candidates_with_trans(y, x, xTrans, y2x_ind, y2x_mean, x2y_mean, y2xTrans_mean, xTrans2y_mean, margin, args.verbose) 328 | else: 329 | fwd_scores = score_candidates(x, y, x2y_ind, x2y_mean, y2x_mean, margin, args.verbose) 330 | bwd_scores = score_candidates(y, x, y2x_ind, y2x_mean, x2y_mean, margin, args.verbose) 331 | 332 | fwd_best = x2y_ind[np.arange(x.shape[0]), fwd_scores.argmax(axis=1)] 333 | bwd_best = y2x_ind[np.arange(y.shape[0]), bwd_scores.argmax(axis=1)] 334 | if args.verbose: 335 | print(' - writing alignments to {:s}'.format(args.output)) 336 | if args.threshold > 0: 337 | print(' - with threshold of {:f}'.format(args.threshold)) 338 | if args.retrieval == 'fwd': 339 | for i, j in enumerate(fwd_best): 340 | print(fwd_scores[i].max(), src_sents[i], trg_sents[j], sep='\t', file=fout) 341 | if args.retrieval == 'bwd': 342 | for j, i in enumerate(bwd_best): 343 | print(bwd_scores[j].max(), src_sents[i], trg_sents[j], sep='\t', file=fout) 344 | if args.retrieval == 'intersect': 345 | for i, j in enumerate(fwd_best): 346 | if bwd_best[j] == i: 347 | print(fwd_scores[i].max(), src_sents[i], trg_sents[j], sep='\t', file=fout) 348 | if args.retrieval == 'max': 349 | indices = np.stack((np.concatenate((np.arange(x.shape[0]), bwd_best)), 350 | np.concatenate((fwd_best, np.arange(y.shape[0])))), axis=1) 351 | scores = np.concatenate((fwd_scores.max(axis=1), bwd_scores.max(axis=1))) 352 | seen_src, seen_trg = set(), set() 353 | for i in np.argsort(-scores): 354 | src_ind, trg_ind = indices[i] 355 | if not src_ind in seen_src and not trg_ind in seen_trg: 356 | seen_src.add(src_ind) 357 | seen_trg.add(trg_ind) 358 | if scores[i] > args.threshold: 359 | print(scores[i], src_sents[src_ind], trg_sents[trg_ind], sep='\t', file=fout) 360 | 361 | fout.close() 362 | 363 | -------------------------------------------------------------------------------- /segmentation/LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright 2020 Florian Leitner 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 
copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 6 | 7 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 8 | -------------------------------------------------------------------------------- /segmentation/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## Setup 3 | 4 | ``` python setup.py install``` or ``` pip install .``` 5 | 6 | ## Usage 7 | 8 | * ### From python scripts 9 | ```python 10 | >>> from segmentation import segmenter 11 | >>> input_text = ''' 12 | কাজী মুহম্মদ ওয়াজেদের একমাত্র পুত্র ছিলেন এ. কে. ফজলুক হক। A. K. Fazlul Huq (Sher-E-Bangla) was born into a middle class Bengali Muslim family in Bakerganj, Barisal, Bangladesh in 1873. 13 | ''' 14 | >>> segmenter.segment_text(input_text) 15 | ['কাজী মুহম্মদ ওয়াজেদের একমাত্র পুত্র ছিলেন এ. কে. ফজলুক হক।', 16 | 'A. K. Fazlul Huq (Sher-E-Bangla) was born into a middle class Bengali Muslim family in Bakerganj, Barisal, Bangladesh in 1873.'] 17 | ``` 18 | *If you don't want a linebreak to be an explicit new line marker, use the following* 19 | ```python 20 | >>> segmenter.segment_text(input_text, mode='multi') 21 | ``` 22 | 23 | 24 | * ***Note: the above snippets run with most of the default options, for more advanced options, refer to the terminal script.*** 25 | 26 | * ### From terminal 27 | ```bash 28 | segmenter --help 29 | ``` -------------------------------------------------------------------------------- /segmentation/__init__.py: -------------------------------------------------------------------------------- 1 | from . import segmenter -------------------------------------------------------------------------------- /segmentation/segmenter.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | # Adapted from: https://github.com/fnl/segtok.git 5 | """ 6 | A pattern-based sentence segmentation strategy; 7 | Primarily written for indo-european languages and extended specifically 8 | for bengali. Could be extended for other languages by introducing new rules. 9 | 10 | Known limitations: 11 | 1. The sentence must use a known sentence terminal followed by space(s), 12 | skipping one optional, intervening quote and/or bracket. 13 | 2. The next sentence must start with an upper-case letter or a number, 14 | ignoring one optional quote and/or bracket before it. 15 | Alternatively, it may start with a camel-cased word, like "gene-A". 16 | 3. If the sentence ends with a single upper-case letter followed by a dot, 17 | a split is made (splits names like "A. Dent"), unless there is an easy 18 | to deduce reason that it is a human name. 19 | 20 | The decision for requiring an "syntactically correct" terminal sequence with upper-case letters or 21 | numbers as start symbol is based on the preference to under-split rather than over-split sentences. 
22 | 23 | Special care is taken not to split at common abbreviations like "i.e." or "etc.", 24 | to not split at first or middle name initials "... F. M. Last ...", 25 | to not split before a comma, colon, or semi-colon, 26 | and to avoid single letters or digits as sentences ("A. This sentence..."). 27 | 28 | Sentence splits will always be enforced at [consecutive] line separators. 29 | 30 | Important: Windows text files use ``\\r\\n`` as linebreaks and Mac files use ``\\r``; 31 | Convert the text to Unix linebreaks if the case. 32 | """ 33 | from __future__ import absolute_import, unicode_literals 34 | import codecs 35 | from regex import compile, DOTALL, UNICODE, VERBOSE 36 | from itertools import chain 37 | import re 38 | import string 39 | 40 | 41 | SENTENCE_TERMINALS = '.!?\u203C\u203D\u2047\u2048\u2049\u3002' \ 42 | '\uFE52\uFE57\uFF01\uFF0E\uFF1F\uFF61\u09F7\u0964' 43 | "The list of valid Unicode sentence terminal characters." 44 | 45 | # Note that Unicode the category Pd is NOT a good set for valid word-breaking hyphens, 46 | # because it contains many dashes that should not be considered part of a word. 47 | HYPHENS = '\u00AD\u058A\u05BE\u0F0C\u1400\u1806\u2010-\u2012\u2e17\u30A0-' 48 | "Any valid word-breaking hyphen, including ASCII hyphen minus." 49 | 50 | # Use upper-case for abbreviations that always are capitalized: 51 | # Lower-case abbreviations may occur capitalized or not. 52 | # Only abbreviations that should never occur at the end of a sentence 53 | # (such as "etc.") 54 | BENGALISINGLECHARS = "অ আ ই ঈ উ ঊ ঋ এ ঐ ও ঔ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড ঢ ণ ত থ দ ধ ন প ফ ব ভ ম য র ল শ ষ স হ ড় ঢ় য়".split() 55 | ABBREVIATIONS = """ 56 | Adj Adm Adv Asst Bart Bldg Brig Bros Capt Cant Cmdr Col Comdr 57 | Con Corp Cpl Dr Drs Ens Gen Gov Hon Hr Hop Inc Insp Lt MM Maj 58 | Messrs Mlle Mme Op Ord Pfc Ph Pvt Rep Reps Res Rev Rt Sen Sens 59 | Sfc Sgt Sr St Supt Surg approx Capt cf Col Dr f\.?e figs? 
Gen 60 | e\.?g i\.?e i\.?v Mag med Mr Mrs Mt nat No nr p\.e phil prof rer 61 | sci Sgt Sr Sra Srta St univ vol vs z\.B Jän Jan Ene Feb Mär Mar 62 | Apr Abr May Jun Jul Aug Sep Sept Oct Okt Nov Dic Dez Dec Prof 63 | E\.U U\.K U\.S viz ltd co est rs md Ms tk TK Ps PS Ex""".split() 64 | 65 | BENGALIABBREVIATIONS = """এ বি সি ডি ই এফ জি এইচ আই জে কে এল এম এন ও পি কিউ আর আস টি ইউ ভি আর এস টি ইউ ভি ডব্লিউ এক্স ওআই জেড মি 66 | মো মু কো কৌ মুহ মি মিস প্রফ ফিল গভ অপ ভল ডা লে জনাব মিজ মিসেস ডে যে মি লি সা ডঃ ডেপ্ট ডেপট অধ্যাপক গে অর্গ ডাব্লিউ সেন্ট ওয়াই এম\.ডি ঢা\.বি লিট ডি\.লিট 67 | সং ইস মিস্টার মি গ্রা মিগ্রা মি\.গ্রা রেভ প্র প্রা ইঙ্ক গভ বিদ্র বি\.দ্র দ্র মোহা কিমি কি\.মি কি রেভা মুদ্রা আনু খ্রি খি ক্যান্ট সে সে\.মি সেমি মে জন মি\.লি মিলি লি মি অনু মৃত্যু পূ পৃ ডব্লু 68 | """.split() 69 | 70 | ABBREVIATIONS.extend(a.capitalize() for a in ABBREVIATIONS if a[0].islower()) 71 | ABBREVIATIONS.extend(BENGALISINGLECHARS) 72 | ABBREVIATIONS.extend(BENGALIABBREVIATIONS) 73 | 74 | ABBREVIATIONS.extend(list(string.ascii_uppercase)) 75 | 76 | JWSPECAILS = """Aux\.Pios Par chap pars Pubs ftn Jas Rom ROM PROV Mic 77 | TIM স\.অগ্র, বি\.অগ্র তীম Tim গীত Ps যিশা Isa গালা Gal পিতর Pet মথি Matt করি Cor 78 | রোমীয় Rom ইব্রীয় Heb প্রকা Rev যিহি Ezek বিচার Judg আদি Gen দানি Dan রাজা Ki শমূ Sam 79 | মালাখি Mal ইফি Eph হিতো Prov যিহো Josh দ্বিতী Deut দ্বিতীয় Deut গণনা Num সফ Zeph হোশেও 80 | Hos ফিলি Phil যির Jer কল Col উপ ECCL উপ Eccl পরম Sol থিষল Thess থিষ Thess লেবীয় 81 | Lev যাত্রা Ex বংশা Chron নহি Neh হবক্ Hab অগ্র Pios সখ Zech প্রেরিত Acts ফিলী Philem সা\.কা 82 | লেবী Lev রূৎ Ruth পাদ ftn জানু Jan ফেব্রু Feb সেপ্ট Sept সেপ্টে Sept অক্টো Oct নভে Nov ডিসে Dec পরি pp""".split() 83 | # ABBREVIATIONS.extend(JWSPECAILS) 84 | 85 | ABBREVIATIONS = '|'.join(sorted(list(set(ABBREVIATIONS)))) 86 | ABBREVIATIONS = compile(r""" 87 | (?: \b(?:%s) # 1. known abbreviations, 88 | | ^\S # 2. a single, non-space character "sentence" (only), 89 | | ^\d+ # 3. a series of digits "sentence" (only), or 90 | | (?: \b # 4. terminal letters A.-A, A.A, or A, if prefixed with: 91 | # 4.a. something that makes them most likely a human first name initial 92 | (?: [Bb]y 93 | | [Cc](?:aptain|ommander) 94 | | [Dd]o[ck]tor 95 | | [Gg]eneral 96 | | [Mm](?:ag)?is(?:ter|s) 97 | | [Pp]rofessor 98 | | [Ss]e\u00F1or(?:it)?a? 99 | ) \s 100 | # 4.b. if they are most likely part of an author list: (avoiding "...A and B") 101 | | (?: (?10%): 122 | # after, though, upon, while, yet 123 | # 124 | # Words hardly used after abbrevs vs. SSs (poor continuations, <2%): 125 | # [after], as, at, but, during, for, in, nor, on, to, [though], [upon], 126 | # whereas, [while], within, [yet] 127 | # 128 | # Words hardly ever used as SSs (excellent continuations, <2%): 129 | # and, are, between, by, from, has, into, is, of, or, that, than, through, 130 | # via, was, were, with 131 | # 132 | # Words frequently used after abbrevs (excellent continuations, >10%): 133 | # [and, are, has, into, is, of, or, than, via, was, were] 134 | # 135 | # Grey zone: undecidable words -> leave in to bias towards under-splitting 136 | # whether 137 | 138 | ENDS_IN_DATE_DIGITS = compile(r"\b[0123]?[0-9]$") 139 | MONTH = compile(r"(J[äa]n|Ene|Feb|M[äa]r|A[pb]r|May|Jun|Jul|Aug|Sep|O[ck]t|Nov|D[ei][cz]|0?[1-9]|1[012])") 140 | """ 141 | Special facilities to detect European-style dates. 
142 | """ 143 | 144 | CONTINUATIONS = compile(r""" ^ # at string start only 145 | (?: a(?: nd|re ) 146 | | b(?: etween|y ) 147 | | from 148 | | has 149 | | i(?: nto|s ) 150 | | o[fr] 151 | | t(?: han|hat|hrough ) 152 | | via 153 | | w(?: as|ere|hether|ith ) 154 | )\b""", UNICODE | VERBOSE) 155 | "Lower-case words that in the given form usually don't start a sentence." 156 | 157 | BEFORE_LOWER = compile(r""" .*? 158 | (?: [%s]"[\)\]]* # ."]) .") ." 159 | | [%s] [\)\]]+ # .]) .) 160 | | \b spp \. # spp. (species pluralis) 161 | | \b \p{L} \p{Ll}? \. # Ll. L. 162 | ) \s+ $""" % (SENTENCE_TERMINALS, SENTENCE_TERMINALS), DOTALL | UNICODE | VERBOSE 163 | ) 164 | """ 165 | Endings that, if followed by a lower-case word, are not sentence terminals: 166 | - Quotations and brackets ("Hello!" said the man.) 167 | - dotted abbreviations (U.S.A. was) 168 | - genus-species-like (m. musculus) 169 | """ 170 | LOWER_WORD = compile(r'^\p{Ll}+[%s]?\p{Ll}*\b' % HYPHENS, UNICODE) 171 | "Lower-case words are not sentence starters (after an abbreviation)." 172 | 173 | MIDDLE_INITIAL_END = compile(r'\b\p{Lu}\p{Ll}+\W+\p{Lu}$', UNICODE) 174 | "Upper-case initial after upper-case word at the end of a string." 175 | 176 | UPPER_WORD_START = compile(r'^\p{Lu}\p{Ll}+\b', UNICODE) 177 | "Upper-case word at the beginning of a string." 178 | 179 | LONE_WORD = compile(r'^\p{Ll}+[\p{Ll}\p{Nd}%s]*$' % HYPHENS, UNICODE) 180 | "Any 'lone' lower-case word [with hyphens or digits inside] is a continuation." 181 | 182 | UPPER_CASE_END = compile(r'\b[\p{Lu}\p{Lt}]\p{L}*\.\s+$', UNICODE) 183 | "Inside brackets, 'Words' that can be part of a proper noun abbreviation, like a journal name." 184 | UPPER_CASE_START = compile(r'^(?:(?:\(\d{4}\)\s)?[\p{Lu}\p{Lt}]\p{L}*|\d+)[\.,:]\s+', UNICODE) 185 | "Inside brackets, 'Words' that can be part of a large abbreviation, like a journal name." 186 | 187 | SHORT_SENTENCE_LENGTH = 55 188 | "Length of either sentence fragment inside brackets to assume the fragment is not its own sentence." 189 | # This can be increased/decreased to heighten/lower the likelihood of splits inside brackets. 190 | 191 | NON_UNIX_LINEBREAK = compile(r'(?:\r\n|\r|\u2028)', UNICODE) 192 | "All linebreak sequence variants except the Unix newline (only)." 193 | 194 | SEGMENTER_REGEX = r""" 195 | ( # A sentence ends at one of two sequences: 196 | [%s] # Either, a sequence starting with a sentence terminal, 197 | [\'\u2019\"\u201D]? # an optional right quote, 198 | [\]\)]* # optional closing brackets and 199 | \s+ # a sequence of required spaces. 200 | | # Otherwise, 201 | \n{{{},}} # a sentence also terminates at [consecutive] newlines. 202 | | 203 | [\u0964]+ 204 | [\'\u2019\"\u201D]? # an optional right quote, 205 | [\]\)]* # optional closing brackets and 206 | \s* # a sequence of optional spaces. 207 | 208 | )""" % SENTENCE_TERMINALS 209 | 210 | """ 211 | Sentence end a sentence terminal, followed by spaces. 212 | Optionally, a right quote and any number of closing brackets may succeed the terminal marker. 213 | Alternatively, an yet undefined number of line-breaks also may terminate sentences. 214 | """ 215 | 216 | _compile = lambda count: compile(SEGMENTER_REGEX.format(count), UNICODE | VERBOSE) 217 | 218 | # Define that one or more line-breaks split sentences: 219 | DO_NOT_CROSS_LINES = _compile(1) 220 | "A segmentation pattern where any newline char also terminates a sentence." 
221 | 222 | # Define that two or more line-breaks split sentences: 223 | MAY_CROSS_ONE_LINE = _compile(2) 224 | "A segmentation pattern where two or more newline chars also terminate sentences." 225 | 226 | # some normalization primitives 227 | REPLACE_UNICODE_PUNCTUATION = [ 228 | (u"\u09F7", u"\u0964"), 229 | (u",", u","), 230 | (u"、", u","), 231 | (u"”", u'"'), 232 | (u"“", u'"'), 233 | (u"∶", u":"), 234 | (u":", u":"), 235 | (u"?", u"?"), 236 | (u"《", u'"'), 237 | (u"》", u'"'), 238 | (u")", u")"), 239 | (u"!", u"!"), 240 | (u"(", u"("), 241 | (u";", u";"), 242 | (u"」", u'"'), 243 | (u"「", u'"'), 244 | (u"0", u"0"), 245 | (u"1", u'1'), 246 | (u"2", u"2"), 247 | (u"3", u"3"), 248 | (u"4", u"4"), 249 | (u"5", u"5"), 250 | (u"6", u"6"), 251 | (u"7", u"7"), 252 | (u"8", u"8"), 253 | (u"9", u"9"), 254 | (u"~", u"~"), 255 | (u"’", u"'"), 256 | (u"…", u"..."), 257 | (u"━", u"-"), 258 | (u"〈", u"<"), 259 | (u"〉", u">"), 260 | (u"【", u"["), 261 | (u"】", u"]"), 262 | (u"%", u"%"), 263 | ] 264 | 265 | NORMALIZE_UNICODE = [ 266 | ('\u00AD', ''), 267 | ('\u09AF\u09BC', '\u09DF'), 268 | ('\u09A2\u09BC', '\u09DD'), 269 | ('\u09A1\u09BC', '\u09DC'), 270 | ('\u09AC\u09BC', '\u09B0'), 271 | ('\u09C7\u09BE', '\u09CB'), 272 | ('\u09C7\u09D7', '\u09CC'), 273 | ('\u0985\u09BE', '\u0986'), 274 | ('\u09C7\u0981\u09D7', '\u09CC\u0981'), 275 | ('\u09C7\u0981\u09BE', '\u09CB\u0981'), 276 | ('\u09C7([^\u09D7])\u09D7', "\g<1>\u09CC"), 277 | ('\\xa0', ' '), 278 | ('\u200B', u''), 279 | ('\u2060', u''), 280 | (u'„', r'"'), 281 | (u'“', r'"'), 282 | (u'”', r'"'), 283 | (u'–', r'-'), 284 | (u'—', r' - '), 285 | (r' +', r' '), 286 | (u'´', r"'"), 287 | (u'([a-zA-Z])‘([a-zA-Z])', r"\g<1>'\g<2>"), 288 | (u'([a-zA-Z])’([a-zA-Z])', r"\g<1>'\g<2>"), 289 | (u'‘', r"'"), 290 | (u'‚', r"'"), 291 | (u'’', r"'"), 292 | (u'´´', r'"'), 293 | (u'…', r'...'), 294 | ] 295 | 296 | FRENCH_QUOTES = [ 297 | (u'\u00A0«\u00A0', r'"'), 298 | (u'«\u00A0', r'"'), 299 | (u'«', r'"'), 300 | (u'\u00A0»\u00A0', r'"'), 301 | (u'\u00A0»', r'"'), 302 | (u'»', r'"'), 303 | ] 304 | 305 | SUBSTITUTIONS = [NORMALIZE_UNICODE, FRENCH_QUOTES, REPLACE_UNICODE_PUNCTUATION] 306 | SUBSTITUTIONS = list(chain(*SUBSTITUTIONS)) 307 | 308 | def normalize_punctuation(text): 309 | """Normalize common punctuations for the splitter to work better""" 310 | for regexp, replacement in SUBSTITUTIONS: 311 | text = re.sub(regexp, replacement, text, flags=re.UNICODE) 312 | 313 | for block in re.findall(r'[\s\.]{2,}', text, flags=re.UNICODE): 314 | block = block.strip() 315 | if len(re.findall(r'[\.]', block, flags=re.UNICODE)) > 1: 316 | newBlock = re.sub(r'[^\S\r\n]', '', block, flags=re.UNICODE) 317 | text = text.replace(block, newBlock, 1) 318 | 319 | return text 320 | 321 | # added punctuation normalization in here 322 | def split_single(text, join_on_lowercase=False, short_sentence_length=SHORT_SENTENCE_LENGTH): 323 | """ 324 | Default: split `text` at sentence terminals and at newline chars. 325 | """ 326 | text = normalize_punctuation(text) 327 | sentences = _sentences(DO_NOT_CROSS_LINES.split(text), join_on_lowercase, short_sentence_length) 328 | return [s for ss in sentences for s in ss.split('\n')] 329 | 330 | 331 | def split_multi(text, join_on_lowercase=False, short_sentence_length=SHORT_SENTENCE_LENGTH): 332 | """ 333 | Sentences may contain non-consecutive (single) newline chars, while consecutive newline chars 334 | ("paragraph separators") always split sentences. 
335 | """ 336 | text = normalize_punctuation(text) 337 | return _sentences(MAY_CROSS_ONE_LINE.split(text), join_on_lowercase, short_sentence_length) 338 | 339 | 340 | def split_newline(text): 341 | """ 342 | Split the `text` at newlines (``\\n'') and strip the lines, 343 | but only return lines with content. 344 | """ 345 | for line in text.split('\n'): 346 | line = line.strip() 347 | 348 | if line: 349 | yield line 350 | 351 | 352 | def rewrite_line_separators(text, pattern, join_on_lowercase=False, 353 | short_sentence_length=SHORT_SENTENCE_LENGTH): 354 | """ 355 | Remove line separator chars inside sentences and ensure there is a ``\\n`` at their end. 356 | 357 | :param text: input plain-text 358 | :param pattern: for the initial sentence splitting 359 | :param join_on_lowercase: always join sentences that start with lower-case 360 | :param short_sentence_length: the upper boundary for text spans that are not split 361 | into sentences inside brackets 362 | :return: a generator yielding the spans of text 363 | """ 364 | offset = 0 365 | 366 | for sentence in _sentences(pattern.split(text), join_on_lowercase, short_sentence_length): 367 | start = text.index(sentence, offset) 368 | intervening = text[offset:start] 369 | 370 | if offset != 0 and '\n' not in intervening: 371 | yield '\n' 372 | intervening = intervening[1:] 373 | 374 | yield intervening 375 | yield sentence.replace('\n', ' ') 376 | offset = start + len(sentence) 377 | 378 | if offset < len(text): 379 | yield text[offset:] 380 | 381 | 382 | def to_unix_linebreaks(text): 383 | """Replace non-Unix linebreak sequences (Windows, Mac, Unicode) with newlines (\\n).""" 384 | return NON_UNIX_LINEBREAK.sub('\n', text) 385 | 386 | 387 | def _sentences(spans, join_on_lowercase, short_sentence_length): 388 | """Join spans back together into sentences as necessary.""" 389 | last = None 390 | shorterThanATypicalSentence = lambda c, l: c < short_sentence_length or l < short_sentence_length 391 | 392 | for current in _abbreviation_joiner(spans): 393 | if last is not None: 394 | 395 | if (join_on_lowercase or BEFORE_LOWER.match(last)) and LOWER_WORD.match(current): 396 | last = '%s%s' % (last, current) 397 | elif shorterThanATypicalSentence(len(current), len(last)) and _is_open(last) and ( 398 | _is_not_opened(current) or last.endswith(' et al. ') or ( 399 | UPPER_CASE_END.search(last) and UPPER_CASE_START.match(current) 400 | ) 401 | ): 402 | last = '%s%s' % (last, current) 403 | elif shorterThanATypicalSentence(len(current), len(last)) and _is_open(last, '[]') and ( 404 | _is_not_opened(current, '[]') or last.endswith(' et al. 
') or ( 405 | UPPER_CASE_END.search(last) and UPPER_CASE_START.match(current) 406 | ) 407 | ): 408 | last = '%s%s' % (last, current) 409 | elif CONTINUATIONS.match(current): 410 | last = '%s%s' % (last, current) 411 | elif re.search(r'^[\"\']+$|^[\"\']+[ \t]*\n+.+', current.strip(), flags=re.UNICODE): 412 | last = '%s%s' % (last.strip(), current.strip()) 413 | elif current.strip().startswith('-') or re.search(r'^[\"\']\s*[\-]', current.strip(), flags=re.UNICODE): 414 | last = '%s%s' % (last.strip(), current.strip()) 415 | else: 416 | yield last.strip() 417 | last = current 418 | else: 419 | last = current 420 | 421 | if last is not None: 422 | yield last.strip() 423 | 424 | 425 | def _abbreviation_joiner(spans): 426 | """Join spans that match the ABBREVIATIONS pattern.""" 427 | segment = None 428 | makeSentence = lambda start, end: ''.join(spans[start:end]) 429 | total = len(spans) 430 | 431 | for pos in range(total): 432 | if pos and pos % 2: # even => segment, uneven => (potential) terminal 433 | prev_s = spans[pos - 1] 434 | marker = spans[pos] 435 | next_s = spans[pos+1] if pos + 1 < total else None 436 | 437 | if prev_s[-1:].isspace() and marker[0] != '\u0964': 438 | pass # join 439 | elif marker[0] == '.' and ABBREVIATIONS.search(prev_s): 440 | pass # join 441 | elif marker[0] == '.' and next_s and ( 442 | LONE_WORD.match(next_s) or 443 | (ENDS_IN_DATE_DIGITS.search(prev_s) and MONTH.match(next_s)) or 444 | (MIDDLE_INITIAL_END.search(prev_s) and UPPER_WORD_START.match(next_s)) 445 | ): 446 | pass # join 447 | else: 448 | yield makeSentence(segment, pos + 1) 449 | segment = None 450 | elif segment is None: 451 | segment = pos 452 | 453 | if segment is not None: 454 | yield makeSentence(segment, total) 455 | 456 | 457 | def _is_open(span_str, brackets='()'): 458 | """Check if the span ends with an unclosed `bracket`.""" 459 | offset = span_str.find(brackets[0]) 460 | nesting = 0 if offset == -1 else 1 461 | 462 | while offset != -1: 463 | opener = span_str.find(brackets[0], offset + 1) 464 | closer = span_str.find(brackets[1], offset + 1) 465 | 466 | if opener == -1: 467 | if closer == -1: 468 | offset = -1 469 | else: 470 | offset = closer 471 | nesting -= 1 472 | elif closer == -1: 473 | offset = opener 474 | nesting += 1 475 | elif opener < closer: 476 | offset = opener 477 | nesting += 1 478 | elif closer < opener: 479 | offset = closer 480 | nesting -= 1 481 | else: 482 | msg = 'at offset={}: closer={}, opener={}' 483 | raise RuntimeError(msg.format(offset, closer, opener)) 484 | 485 | return nesting > 0 486 | 487 | 488 | def _is_not_opened(span_str, brackets='()'): 489 | """Check if the span starts with an unopened `bracket`.""" 490 | offset = span_str.rfind(brackets[1]) 491 | nesting = 0 if offset == -1 else 1 492 | 493 | while offset != -1: 494 | opener = span_str.rfind(brackets[0], 0, offset) 495 | closer = span_str.rfind(brackets[1], 0, offset) 496 | 497 | if opener == -1: 498 | if closer == -1: 499 | offset = -1 500 | else: 501 | offset = closer 502 | nesting += 1 503 | elif closer == -1: 504 | offset = opener 505 | nesting -= 1 506 | elif closer < opener: 507 | offset = opener 508 | nesting -= 1 509 | elif opener < closer: 510 | offset = closer 511 | nesting += 1 512 | else: 513 | msg = 'at offset={}: closer={}, opener={}' 514 | raise RuntimeError(msg.format(offset, closer, opener)) 515 | 516 | return nesting > 0 517 | 518 | def segment_text(input_text, mode='single'): 519 | """Simple api to segment text with most default values""" 520 | normal = to_unix_linebreaks 521 | if 
mode == 'single': 522 | sentences = split_single(normal(input_text), short_sentence_length=SHORT_SENTENCE_LENGTH) 523 | text_spans = [i for s in sentences for i in (s, '\n')] 524 | elif mode == 'multi': 525 | text_spans = rewrite_line_separators(normal(input_text), MAY_CROSS_ONE_LINE, short_sentence_length=SHORT_SENTENCE_LENGTH) 526 | 527 | segments = [span.strip() for span in text_spans if span.strip()] 528 | return segments 529 | 530 | 531 | 532 | def main(): 533 | # print one sentence per line 534 | from argparse import ArgumentParser 535 | from sys import argv, stdout, stdin, stderr, getdefaultencoding, version_info 536 | from os import path, linesep 537 | 538 | single, multi = 0, 1 539 | 540 | parser = ArgumentParser(usage='%(prog)s [--mode] [FILE ...]', 541 | description=__doc__, prog=path.basename(argv[0]), 542 | epilog='default encoding: ' + getdefaultencoding()) 543 | parser.add_argument('files', metavar='FILE', nargs='*', 544 | help='UTF-8 plain-text file(s); if absent, read from STDIN') 545 | parser.add_argument('--with-ids', action='store_true', 546 | help='STDIN (only!) input is ID-tab-TEXT; the ID is ' 547 | 'preserved in the output as ID-tab-N-tab-SENTENCE ' 548 | 'where N is the incremental sentence number for that ' 549 | 'text ID') 550 | parser.add_argument('--normal-breaks', '-n', action='store_true', 551 | help=to_unix_linebreaks.__doc__) 552 | parser.add_argument('--bracket-spans', '-b', metavar="INT", type=int, 553 | default=SHORT_SENTENCE_LENGTH, 554 | help="upper boundary for text spans that are not split " 555 | "into sentences inside brackets [%(default)d]") 556 | parser.add_argument('--encoding', '-e', help='force another encoding to use') 557 | mode = parser.add_mutually_exclusive_group() 558 | parser.set_defaults(mode=single) 559 | mode.add_argument('--single', '-s', action='store_const', dest='mode', const=single, 560 | help=split_single.__doc__) 561 | mode.add_argument('--multi', '-m', action='store_const', dest='mode', const=multi, 562 | help=split_multi.__doc__) 563 | 564 | args = parser.parse_args() 565 | pattern = [DO_NOT_CROSS_LINES, MAY_CROSS_ONE_LINE, ][args.mode] 566 | normal = to_unix_linebreaks if args.normal_breaks else lambda t: t 567 | 568 | # fix broken Unicode handling in Python 2.x 569 | # see http://www.macfreek.nl/memory/Encoding_of_Python_stdout 570 | if args.encoding or version_info < (3, 0): 571 | if version_info >= (3, 0): 572 | stdout = stdout.buffer 573 | stdin = stdin.buffer 574 | 575 | stdout = codecs.getwriter( 576 | args.encoding or 'utf-8' 577 | )(stdout, 'xmlcharrefreplace') 578 | 579 | stdin = codecs.getreader( 580 | args.encoding or 'utf-8' 581 | )(stdin, 'xmlcharrefreplace') 582 | 583 | if not args.encoding: 584 | stderr.write('wrapped segmenter stdio with UTF-8 de/encoders') 585 | stderr.write(linesep) 586 | 587 | if not args.files and args.mode != single: 588 | parser.error('only single line splitting mode allowed ' 589 | 'when reading from STDIN') 590 | 591 | def segment(text): 592 | if not args.files and args.with_ids: 593 | tid, text = text.split('\t', 1) 594 | else: 595 | tid = None 596 | 597 | if args.mode == single: 598 | sentences = split_single(normal(text), short_sentence_length=args.bracket_spans) 599 | text_spans = [i for s in sentences for i in (s, '\n')] 600 | else: 601 | text_spans = rewrite_line_separators( 602 | normal(text), pattern, short_sentence_length=args.bracket_spans 603 | ) 604 | 605 | if tid is not None: 606 | def write_ids(tid, sid): 607 | stdout.write(tid) 608 | stdout.write('\t') 609 | 
stdout.write(str(sid)) 610 | stdout.write('\t') 611 | 612 | last = '\n' 613 | sid = 1 614 | 615 | for span in text_spans: 616 | if last == '\n' and span not in ('', '\n'): 617 | write_ids(tid, sid) 618 | sid += 1 619 | 620 | stdout.write(span) 621 | 622 | if span: 623 | last = span 624 | else: 625 | for span in text_spans: 626 | if span.strip() == "": 627 | continue 628 | stdout.write(f'{span.strip()}\n') 629 | 630 | if args.files: 631 | for txt_file_path in args.files: 632 | with codecs.open( 633 | txt_file_path, 'r', encoding=(args.encoding or 'utf-8') 634 | ) as fp: 635 | segment(fp.read()) 636 | else: 637 | for line in stdin: 638 | segment(line) 639 | 640 | 641 | if __name__ == '__main__': 642 | main() 643 | -------------------------------------------------------------------------------- /segmentation/setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from setuptools import setup 3 | 4 | try: 5 | with open('README.md') as file: 6 | long_description = file.read() 7 | except IOError: 8 | long_description = "missing" 9 | 10 | 11 | setup( 12 | name='segmentation', 13 | data_files = [("", ["LICENSE.txt"])], 14 | packages = ['segmentation'], 15 | package_dir = {'segmentation':''}, 16 | install_requires=['regex'], 17 | long_description=long_description, 18 | entry_points={ 19 | 'console_scripts': [ 20 | 'segmenter = segmentation.segmenter:main' 21 | ] 22 | } 23 | ) 24 | -------------------------------------------------------------------------------- /training/README.md: -------------------------------------------------------------------------------- 1 | # Preprocessing 2 | 3 | If you want to, 4 | * build a new bn-en training dataset from a noisy parallel corpora (by filtering / cleaning some pairs based on our heuristics) with corresponding vocabulary models or 5 | * normalize a new dataset before evaluating on the model or 6 | * remove all evaluation pairs from training pairs for a new set of training / test datasets 7 | 8 | refer to [here](preprocessing/). 9 | 10 | # Training & Evaluation 11 | 12 | **Note:** This code has been refactored to support [OpenNMT-py 2.0](https://github.com/OpenNMT/OpenNMT-py) 13 | 14 | ### Setup 15 | 16 | ```bash 17 | $ cd seq2seq/ 18 | $ conda create python==3.7.9 pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch -p ./env 19 | $ conda activate ./env # or source activate ./env (for older versions of anaconda) 20 | $ pip install --upgrade -r requirements.txt 21 | ``` 22 | - **Note**: For newer NVIDIA GPUS such as ***A100*** or ***3090*** use `cudatoolkit=11.0`. 
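
  For example, on such GPUs only the CUDA toolkit pin in the command above changes (an assumed variant of the same setup command):

  ```bash
  $ conda create python==3.7.9 pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch -p ./env
  ```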
23 | 24 | 25 | ### Usage 26 | 27 | ```bash 28 | $ cd seq2seq/ 29 | $ python pipeline.py -h 30 | usage: pipeline.py [-h] --input_dir PATH --output_dir PATH --src_lang SRC_LANG 31 | --tgt_lang TGT_LANG 32 | [--validation_samples VALIDATION_SAMPLES] 33 | [--src_seq_length SRC_SEQ_LENGTH] 34 | [--tgt_seq_length TGT_SEQ_LENGTH] 35 | [--model_prefix MODEL_PREFIX] [--eval_model PATH] 36 | [--train_steps TRAIN_STEPS] 37 | [--train_batch_size TRAIN_BATCH_SIZE] 38 | [--eval_batch_size EVAL_BATCH_SIZE] 39 | [--gradient_accum GRADIENT_ACCUM] 40 | [--warmup_steps WARMUP_STEPS] 41 | [--learning_rate LEARNING_RATE] [--layers LAYERS] 42 | [--rnn_size RNN_SIZE] [--word_vec_size WORD_VEC_SIZE] 43 | [--transformer_ff TRANSFORMER_FF] [--heads HEADS] 44 | [--valid_steps VALID_STEPS] 45 | [--save_checkpoint_steps SAVE_CHECKPOINT_STEPS] 46 | [--average_last AVERAGE_LAST] [--world_size WORLD_SIZE] 47 | [--gpu_ranks [GPU_RANKS [GPU_RANKS ...]]] 48 | [--train_from TRAIN_FROM] [--do_train] [--do_eval] 49 | [--nbest NBEST] [--alpha ALPHA] 50 | 51 | optional arguments: 52 | -h, --help show this help message and exit 53 | --input_dir PATH, -i PATH 54 | Input directory 55 | --output_dir PATH, -o PATH 56 | Output directory 57 | --src_lang SRC_LANG Source language 58 | --tgt_lang TGT_LANG Target language 59 | --validation_samples VALIDATION_SAMPLES 60 | no. of validation samples to take out from train 61 | dataset when no validation data is present 62 | --src_seq_length SRC_SEQ_LENGTH 63 | maximum source sequence length 64 | --tgt_seq_length TGT_SEQ_LENGTH 65 | maximum target sequence length 66 | --model_prefix MODEL_PREFIX 67 | Prefix of the model to save 68 | --eval_model PATH Path to the specific model to evaluate 69 | --train_steps TRAIN_STEPS 70 | no of training steps 71 | --train_batch_size TRAIN_BATCH_SIZE 72 | training batch size (in tokens) 73 | --eval_batch_size EVAL_BATCH_SIZE 74 | evaluation batch size (in sentences) 75 | --gradient_accum GRADIENT_ACCUM 76 | gradient accum 77 | --warmup_steps WARMUP_STEPS 78 | warmup steps 79 | --learning_rate LEARNING_RATE 80 | learning rate 81 | --layers LAYERS layers 82 | --rnn_size RNN_SIZE rnn size 83 | --word_vec_size WORD_VEC_SIZE 84 | word vector size 85 | --transformer_ff TRANSFORMER_FF 86 | transformer feed forward size 87 | --heads HEADS no of heads 88 | --valid_steps VALID_STEPS 89 | validation interval 90 | --save_checkpoint_steps SAVE_CHECKPOINT_STEPS 91 | model saving interval 92 | --average_last AVERAGE_LAST 93 | average last X models 94 | --world_size WORLD_SIZE 95 | world size 96 | --gpu_ranks [GPU_RANKS [GPU_RANKS ...]] 97 | gpu ranks 98 | --train_from TRAIN_FROM 99 | start training from this checkpoint 100 | --do_train Run training 101 | --do_eval Run evaluation 102 | --nbest NBEST sentencepiece nbest size 103 | --alpha ALPHA sentencepiece alpha 104 | ``` 105 | 106 | * ***Sample `input_dir` structure for bn2en training and evaluation:*** 107 | 108 | ```bash 109 | input_dir/ 110 | |---> data/ 111 | | |---> corpus.train.bn 112 | | |---> corpus.train.en 113 | | |---> RisingNews.valid.bn 114 | | |---> RisingNews.valid.en 115 | | |---> RisingNews.test.bn 116 | | |---> RisingNews.test.en 117 | | |---> sipc.test.bn 118 | | |---> sipc.test.en.0 119 | | |---> sipc.test.en.1 120 | | ... 
121 |---> vocab/ 122 | |---> bn.model 123 | |---> en.model 124 | ``` 125 | * Input data files inside the `data/` subdirectory must have the following format: **`X.type.lang(.count)`**, where `X` is any common file prefix, `type` is one of `{train, valid, test}` and `count` is an optional integer (**only applicable for the `target_lang`, when there are multiple reference files**). There can be multiple `.train.`/`.valid.` filepairs. In the absence of `.valid.` files, `validation_samples` example pairs will be randomly sampled from the training files during `training`. 126 | * The `vocab` subdirectory must hold two sentencepiece vocabulary models formatted as `src_lang.model` and `tgt_lang.model` 127 | 128 | * ***After training / evaluation, the `output_dir` will have the following subdirectories with these contents:*** 129 | * `Models`: All the saved models 130 | * `Reports`: **BLEU and SACREBLEU** scores on the validation files for all saved models with the given `model_prefix`, and the scores on the test files for the given `eval_model` (if the corresponding reference files are present) 131 | * `Outputs`: Detokenized model predictions. 132 | * `data`: Merged training files after applying subword regularization. 133 | * `Preprocessed`: Training and validation data shards 134 | 135 | 136 | ***To reproduce our results on an AWS p3.8xlarge EC2 instance equipped with 4 Tesla V100 GPUs, run the script with the default hyperparameters.*** For example, for bn2en training, 137 | ```bash 138 | $ export CUDA_VISIBLE_DEVICES=0,1,2,3 139 | # for training 140 | $ python pipeline.py \ 141 | --src_lang bn --tgt_lang en \ 142 | -i inputFolder/ -o outputFolder/ \ 143 | --model_prefix bn2en --do_train --do_eval 144 | ``` 145 | For single-GPU training, additionally provide the flags `--world_size 1` and `--gpu_ranks 0`, and adjust the effective batch size to the available GPU VRAM using `--train_batch_size X` and `--gradient_accum X`. 146 | 147 | 148 | # Evaluation 149 | 150 | For evaluating trained models on a single GPU on new test files, use the following snippet with appropriate arguments (`<path/to/model>` is a placeholder for the checkpoint to evaluate): 151 | 152 | ```bash 153 | $ python pipeline.py \ 154 | --src_lang bn --tgt_lang en \ 155 | -i inputFolder/ -o outputFolder/ \ 156 | --eval_model <path/to/model> \ 157 | --do_eval 158 | ``` 159 | -------------------------------------------------------------------------------- /training/preprocessing/README.md: -------------------------------------------------------------------------------- 1 | ## Cleaning / Normalizing / Training Vocabularies 2 | ***The purpose of this extra cleaning on top of batch filtering is to maximize the amount of useful information in the bn-en dataset for a bilingual MT system. We do this by employing a variety of heuristics such as removing identical spans of foreign text from both sides, applying transliteration when appropriate, thresholding the allowed amount of foreign text in a sentence pair, etc. For more details, refer to the code. 
Additionally, the script generates sentencepiece vocabulary files required for tokenizing the parallel corpora.*** 3 | 4 | ### Usage 5 | 6 | ```bash 7 | $ python preprocessor.py -h 8 | usage: preprocessor.py [-h] --input_dir PATH --output_dir PATH [--normalize] 9 | [--bn_vocab_size BN_VOCAB_SIZE] 10 | [--en_vocab_size EN_VOCAB_SIZE] 11 | [--bn_model_type BN_MODEL_TYPE] 12 | [--en_model_type EN_MODEL_TYPE] 13 | [--bn_coverage BN_COVERAGE] [--en_coverage EN_COVERAGE] 14 | 15 | optional arguments: 16 | -h, --help show this help message and exit 17 | --input_dir PATH, -i PATH 18 | Input directory 19 | --output_dir PATH, -o PATH 20 | Output directory 21 | --normalize Only normalize the files in input directory 22 | --bn_vocab_size BN_VOCAB_SIZE 23 | bengali vocab size 24 | --en_vocab_size EN_VOCAB_SIZE 25 | english vocab size 26 | --bn_model_type BN_MODEL_TYPE 27 | bengali sentencepiece model type 28 | --en_model_type EN_MODEL_TYPE 29 | english sentencepiece model type 30 | --bn_coverage BN_COVERAGE 31 | bengali character coverage 32 | --en_coverage EN_COVERAGE 33 | english character coverage 34 | ``` 35 | 36 | * If the script is invoked with `--normalize`, it will only produce the normalized version of all .bn / .en files found in the `input_dir` in corresponding subdirectories of `output_dir`. 37 | * Otherwise, the script will recursively look for all filepairs (`X.bn`, `X.en`) inside `input_dir`, where `X` is any common file prefix, and produce the following files inside `output_dir`: 38 | 39 | * `combined.bn` / `combined.en`: filepairs obtained by cleaning all linepairs. 40 | * `bn.model`, `bn.vocab` / `en.model`, `en.vocab`: sentencepiece models 41 | 42 | 43 | ## Removing Evaluation pairs 44 | ***If you are training from scratch with new test / train datasets, you should remove all evaluation pairs (`validation` / `test`) first from the training dataset to prevent data leakage.*** To do so, run `remove_evaluation_pairs.py`. 45 | 46 | **Make sure all datasets are normalized before running the script.** 47 | 48 | ### Usage 49 | ```bash 50 | $ python remove_evaluation_pairs.py -h 51 | usage: remove_evaluation_pairs.py [-h] --input_dir PATH --output_dir PATH 52 | --src_lang SRC_LANG --tgt_lang TGT_LANG 53 | 54 | optional arguments: 55 | -h, --help show this help message and exit 56 | --input_dir PATH, -i PATH 57 | Input directory 58 | --output_dir PATH, -o PATH 59 | Output directory 60 | --src_lang SRC_LANG Source language 61 | --tgt_lang TGT_LANG Target language 62 | ``` 63 | 64 | * The input directory must be structured as mentioned [here](../). This script will remove all evaluation pairs from training pairs and write those to `corpus.train.src_lang` / `corpus.train.tgt_lang` inside `output_dir`. 
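
For example, a minimal invocation might look like the following (`inputFolder/` and `outputFolder/` are placeholder paths):

```bash
$ python remove_evaluation_pairs.py \
    -i inputFolder/ -o outputFolder/ \
    --src_lang bn --tgt_lang en
```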
65 | -------------------------------------------------------------------------------- /training/preprocessing/preprocessor.py: -------------------------------------------------------------------------------- 1 | #%% 2 | """ 3 | Generated files/folders (depending on the flags set on __main__) summary: 4 | `Folders` 5 | - `tmp/Initial` > Contains files generated by initial hardline filtering 6 | - `tmp/pattern` > Contains files generated by character map replacements 7 | - `tmp/patch` > Contains files generated by replacing patches of identical non-bangla texts on both sides 8 | - `tmp/transilaterate`> Contains files generated by transliterating dangling characters on bangla side 9 | - `Final` > Contains files generated by applying previous transformations and some final postprocessing 10 | `Files(prefixes)` 11 | - `saved` > Transformed Lines that pass hardline filtering after applying transformation 12 | - `savedOriginal` > Original Lines that pass hardline filtering after applying transformation 13 | - `filtered` > Lines that don't pass hardline filtering at a stage 14 | - `cleaned` > Lines that pass hardline filtering at a stage (Initial/Previous passing Lines + saved) 15 | """ 16 | 17 | import re 18 | import os 19 | import difflib 20 | import time 21 | from subprocess import check_output 22 | from aksharamukha import transliterate 23 | from itertools import chain 24 | import shutil 25 | import sys 26 | import uuid 27 | import multiprocessing 28 | from multiprocessing import Pool 29 | import argparse 30 | import glob 31 | from tqdm import tqdm 32 | 33 | def globalize(func): 34 | def result(*args, **kwargs): 35 | return func(*args, **kwargs) 36 | result.__name__ = result.__qualname__ = uuid.uuid4().hex 37 | setattr(sys.modules[result.__module__], result.__name__, result) 38 | return result 39 | 40 | REPLACE_UNICODE_PUNCTUATION = [ 41 | (u"\u09F7", u"\u0964"), 42 | (u",", u","), 43 | (u"、", u","), 44 | (u"”", u'"'), 45 | (u"“", u'"'), 46 | (u"∶", u":"), 47 | (u":", u":"), 48 | (u"?", u"?"), 49 | (u"《", u'"'), 50 | (u"》", u'"'), 51 | (u")", u")"), 52 | (u"!", u"!"), 53 | (u"(", u"("), 54 | (u";", u";"), 55 | (u"」", u'"'), 56 | (u"「", u'"'), 57 | (u"0", u"0"), 58 | (u"1", u'1'), 59 | (u"2", u"2"), 60 | (u"3", u"3"), 61 | (u"4", u"4"), 62 | (u"5", u"5"), 63 | (u"6", u"6"), 64 | (u"7", u"7"), 65 | (u"8", u"8"), 66 | (u"9", u"9"), 67 | (u"~", u"~"), 68 | (u"’", u"'"), 69 | (u"…", u"..."), 70 | (u"━", u"-"), 71 | (u"〈", u"<"), 72 | (u"〉", u">"), 73 | (u"【", u"["), 74 | (u"】", u"]"), 75 | (u"%", u"%"), 76 | ] 77 | 78 | NORMALIZE_UNICODE = [ 79 | ('\u00AD', ''), 80 | ('\u09AF\u09BC', '\u09DF'), 81 | ('\u09A2\u09BC', '\u09DD'), 82 | ('\u09A1\u09BC', '\u09DC'), 83 | ('\u09AC\u09BC', '\u09B0'), 84 | ('\u09C7\u09BE', '\u09CB'), 85 | ('\u09C7\u09D7', '\u09CC'), 86 | ('\u0985\u09BE', '\u0986'), 87 | ('\u09C7\u0981\u09D7', '\u09CC\u0981'), 88 | ('\u09C7\u0981\u09BE', '\u09CB\u0981'), 89 | ('\u09C7([^\u09D7])\u09D7', "\g<1>\u09CC"), 90 | ('\\xa0', ' '), 91 | ('\u200B', u''), 92 | ('\u2060', u''), 93 | (u'„', r'"'), 94 | (u'“', r'"'), 95 | (u'”', r'"'), 96 | (u'–', r'-'), 97 | (u'—', r' - '), 98 | (r' +', r' '), 99 | (u'´', r"'"), 100 | (u'([a-zA-Z])‘([a-zA-Z])', r"\g<1>'\g<2>"), 101 | (u'([a-zA-Z])’([a-zA-Z])', r"\g<1>'\g<2>"), 102 | (u'‘', r"'"), 103 | (u'‚', r"'"), 104 | (u'’', r"'"), 105 | (u'´´', r'"'), 106 | (u'…', r'...'), 107 | ] 108 | 109 | FRENCH_QUOTES = [ 110 | (u'\u00A0«\u00A0', r'"'), 111 | (u'«\u00A0', r'"'), 112 | (u'«', r'"'), 113 | (u'\u00A0»\u00A0', r'"'), 114 | (u'\u00A0»', r'"'), 115 | 
(u'»', r'"'), 116 | ] 117 | 118 | SUBSTITUTIONS = [NORMALIZE_UNICODE, FRENCH_QUOTES, REPLACE_UNICODE_PUNCTUATION] 119 | SUBSTITUTIONS = list(chain(*SUBSTITUTIONS)) 120 | 121 | BANGLA_CHARS = ( 122 | r'[' 123 | r'\u0981-\u0983' 124 | r'\u0985-\u098B' 125 | r'\u098F-\u0990' 126 | r'\u0993-\u09A8' 127 | r'\u09AA-\u09B0' 128 | r'\u09B2' 129 | r'\u09B6-\u09B9' 130 | r'\u09BC' 131 | r'\u09BE-\u09C3' 132 | r'\u09C7-\u09C8' 133 | r'\u09CB-\u09CC' 134 | r'\u09CE' 135 | r'\u09D7' 136 | r'\u09DC-\u09DD' 137 | r'\u09DF' 138 | r'\u09E6-\u09EF' 139 | r'\u09F3' 140 | r'\u0964' 141 | r']' 142 | ) 143 | 144 | NEUTRAL_CHARS = ( 145 | r'[' 146 | r'\s' 147 | r'\u09CD' 148 | r'\u0021-\u002F' 149 | r'\u003A-\u0040' 150 | r'\u005B-\u0060' 151 | r'\u007B-\u007E' 152 | r'\u00A0' 153 | r'\u00A3' 154 | r'\u00B0' 155 | r'\u2000-\u2014' 156 | r'\u2018-\u201D' 157 | r'\u2028-\u202F' 158 | r'\u2032-\u2033' 159 | r'\u2035-\u2036' 160 | r'\u2060-\u206F' 161 | r']' 162 | ) 163 | 164 | ENGLISH_CHARS = ( 165 | r'[a-zA-Z0-9]' 166 | ) 167 | 168 | NON_BANGLA_PATCH = ( 169 | r'[' 170 | r'^' 171 | r'\u0981-\u0983' 172 | r'\u0985-\u098B' 173 | r'\u098F-\u0990' 174 | r'\u0993-\u09A8' 175 | r'\u09AA-\u09B0' 176 | r'\u09B2' 177 | r'\u09B6-\u09B9' 178 | r'\u09BC' 179 | r'\u09BE-\u09C3' 180 | r'\u09C7-\u09C8' 181 | r'\u09CB-\u09CC' 182 | r'\u09CE' 183 | r'\u09D7' 184 | r'\u09DC-\u09DD' 185 | r'\u09DF' 186 | r'\u09E6-\u09EF' 187 | r'\u09F3' 188 | r'\u0964' 189 | r'\s' 190 | r'\u09CD' 191 | r'\u0021-\u002F' 192 | r'\u003A-\u0040' 193 | r'\u005B-\u0060' 194 | r'\u007B-\u007E' 195 | r'\u00A0' 196 | r'\u00A3' 197 | r'\u00B0' 198 | r'\u2000-\u2014' 199 | r'\u2018-\u201D' 200 | r'\u2028-\u202F' 201 | r'\u2032-\u2033' 202 | r'\u2035-\u2036' 203 | r'\u2060-\u206F' 204 | r']' 205 | r'+' 206 | ) 207 | 208 | NON_ENGLISH_PATCH = ( 209 | r'[' 210 | r'^' 211 | r'a-z' 212 | r'A-Z' 213 | r'0-9' 214 | r'\s' 215 | r'\u09CD' 216 | r'\u0021-\u002F' 217 | r'\u003A-\u0040' 218 | r'\u005B-\u0060' 219 | r'\u007B-\u007E' 220 | r'\u00A0' 221 | r'\u00A3' 222 | r'\u00B0' 223 | r'\u2000-\u2014' 224 | r'\u2018-\u201D' 225 | r'\u2028-\u202F' 226 | r'\u2032-\u2033' 227 | r'\u2035-\u2036' 228 | r'\u2060-\u206F' 229 | r']' 230 | r'+' 231 | ) 232 | 233 | WHITESPACE_PUNCTATION = ( 234 | r'[\(\[\{' 235 | r'\u0021-\u0027' 236 | r'\u002A-\u002F' 237 | r'\u003A-\u0040' 238 | r'\u005C' 239 | r'\u005E-\u0060' 240 | r'\u007C' 241 | r'\u007E' 242 | r'\u02B9-\u02DD' 243 | r'\u09F7' 244 | r'\u0964' 245 | r'\u0965' 246 | r'\u2010-\u201F' 247 | r'\s\t' 248 | r'\)\]\}]' 249 | r'+' 250 | ) 251 | 252 | INCLUSIVE_NON_BANGLA_PATCH = ( 253 | r'[\(\[\{' 254 | r'a-zA-Z' 255 | r'\u00A1-\u00AC' 256 | r'\u00AE-\u02FF' 257 | r'\u0300-\u07BF' 258 | r'\u0900' 259 | r'\u0904-\u094D' 260 | r'\u094E-\u0950' 261 | r'\u0955-\u0963' 262 | r'\u0966-\u097F' 263 | r'\u0A00-\u1FFF' 264 | r'\u2020-\u2027' 265 | r'\u2030-\u2031' 266 | r'\u203B-\u205E' 267 | r'\u2070-\uFE4F' 268 | r'\uFE70-\uFEFF' 269 | r'\uFF21-\uFF3A' 270 | r'\uFF41-\uFF5A' 271 | r'\uFF5F-\uFFEF' 272 | r'\uFFF9-\uFFFF' 273 | r'\u0021-\u0027' 274 | r'\u002A-\u002F' 275 | r'\u003A-\u0040' 276 | r'\u005C' 277 | r'\u005E-\u0060' 278 | r'\u007C' 279 | r'\u007E' 280 | r'\u02B9-\u02DD' 281 | r'\u2010-\u201F' 282 | r'0-9' 283 | r'\s\t' 284 | r'\)\]\}]' 285 | '+' 286 | ) 287 | 288 | BRACKETED_SPANS = ( 289 | r'[\(\[\{]' 290 | r'[' 291 | r'\u0021-\u0027' 292 | r'\u002A-\u002F' 293 | r'\u003A-\u0040' 294 | r'\u005C' 295 | r'\u005E-\u0060' 296 | r'\u007C' 297 | r'\u007E' 298 | r'\u02B9-\u02DD' 299 | r'\u2010-\u201F' 300 | r'\s\t' 301 | r']' 302 | r'*' 
303 | r'[\)\]\}]' 304 | ) 305 | 306 | def normalize(text): 307 | for regexp, replacement in SUBSTITUTIONS: 308 | text = re.sub(regexp, replacement, text, flags=re.UNICODE) 309 | 310 | text = re.sub(r'\s+', ' ', text) 311 | return text.strip() 312 | 313 | def readFile(filename): 314 | with open(filename) as f: 315 | lines = [line.strip() for line in f.readlines()] 316 | return lines 317 | 318 | def readFilePair(bnFile, enFile): 319 | bnLines = readFile(bnFile) 320 | enLines = readFile(enFile) 321 | 322 | return zip(bnLines, enLines) 323 | 324 | def readReplacePatterns(filename): 325 | """ 326 | Patterns should have the following form in each line: 327 | `Pattern:enReplacement(:optional bnReplacement)` 328 | 329 | Both Pattern and Replacement can contain arbitrary no of spaces. 330 | Be careful not to place unnecessary spaces. 331 | In absence of bnReplacement, bnReplacement = enReplacement 332 | """ 333 | enPatternMap, bnPatternMap = {}, {} 334 | 335 | with open(filename) as f: 336 | for line in f.readlines(): 337 | try: 338 | splitLine = line.rstrip('\n').split(":") 339 | pattern = splitLine[0] 340 | enReplacement = splitLine[1] 341 | if len(splitLine) == 3: 342 | bnPatternMap[pattern] = splitLine[2] 343 | else: 344 | bnPatternMap[pattern] = enReplacement 345 | enPatternMap[pattern] = enReplacement 346 | except: 347 | continue 348 | 349 | return enPatternMap, bnPatternMap 350 | 351 | def hasNonBangla(line): 352 | return len(line) - countBanglaChars(line) - countNeutralChars(line) 353 | 354 | def hasNonEnglish(line): 355 | return len(line) - countEnglishChars(line) - countNeutralChars(line) 356 | 357 | def hasOOV(line): 358 | return len(line) - countBanglaChars(line) - countEnglishChars(line) - countNeutralChars(line) 359 | 360 | def countBanglaChars(line): 361 | chars = re.findall( 362 | BANGLA_CHARS, 363 | line, 364 | flags=re.UNICODE 365 | ) 366 | return len(chars) 367 | 368 | def countEnglishChars(line): 369 | chars = re.findall( 370 | ENGLISH_CHARS, 371 | line, 372 | flags=re.UNICODE 373 | ) 374 | return len(chars) 375 | 376 | def countNeutralChars(line): 377 | chars = re.findall( 378 | NEUTRAL_CHARS, 379 | line, 380 | flags=re.UNICODE 381 | ) 382 | return len(chars) 383 | 384 | def getNonBanglaPatches(line): 385 | return re.findall( 386 | NON_BANGLA_PATCH, 387 | line, 388 | flags=re.UNICODE 389 | ) 390 | 391 | def getNonEnglishPatches(line): 392 | return re.findall( 393 | NON_ENGLISH_PATCH, 394 | line, 395 | flags=re.UNICODE 396 | ) 397 | 398 | def replaceEmojis(*lines): 399 | outputLines = [] 400 | for line in lines: 401 | line = re.sub( 402 | r'\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff]', 403 | "", 404 | line, 405 | flags=re.UNICODE 406 | ) 407 | line = re.sub(r'[\U00010000-\U0010ffff]', "", line, flags=re.UNICODE) 408 | outputLines.append(line) 409 | 410 | return tuple(outputLines) 411 | 412 | def isValidProcessedPair(bnLine, enLine): 413 | if bnLine.strip() == "" or enLine.strip() == "": 414 | return False 415 | 416 | # check if either side contains only punctuations and whitespaces 417 | 418 | if ( 419 | re.sub(WHITESPACE_PUNCTATION, "", bnLine.strip(), flags=re.UNICODE) == "" or 420 | re.sub(WHITESPACE_PUNCTATION, "", enLine.strip(), flags=re.UNICODE) == "" 421 | ): 422 | return False 423 | 424 | return True 425 | 426 | def writeFilePairs(dir, validPairs, foreignPairs, savedPairs=None): 427 | os.makedirs(dir, exist_ok=True) 428 | 429 | with open(os.path.join(dir, "cleaned.bn"), 'w') as bn, \ 430 | open(os.path.join(dir, 
"cleaned.en"), 'w') as en: 431 | for bnLine, enLine in validPairs: 432 | print(bnLine, file=bn) 433 | print(enLine, file=en) 434 | 435 | with open(os.path.join(dir, "filtered.bn"), 'w') as bn, \ 436 | open(os.path.join(dir, "filtered.en"), 'w') as en: 437 | for bnLine, enLine in foreignPairs: 438 | print(bnLine, file=bn) 439 | print(enLine, file=en) 440 | 441 | if savedPairs: 442 | with open(os.path.join(dir, "savedOrignal.bn"), 'w') as bnO, \ 443 | open(os.path.join(dir, "savedOrignal.en"), 'w') as enO, \ 444 | open(os.path.join(dir, "saved.bn"), 'w') as bnS, \ 445 | open(os.path.join(dir, "saved.en"), 'w') as enS: 446 | for bnOriginal, enOriginal, bnSaved, enSaved in savedPairs: 447 | print(bnOriginal, file=bnO) 448 | print(enOriginal, file=enO) 449 | print(bnSaved, file=bnS) 450 | print(enSaved, file=enS) 451 | 452 | def convertNumerals(bnLine, enLine): 453 | newBnLine, newEnLine = bnLine, enLine 454 | 455 | for enNumeral in re.findall(r'[0-9]+', newBnLine, flags=re.UNICODE): 456 | newBnLine = newBnLine.replace(enNumeral, transliterate.process('RomanReadable', 'Bengali', enNumeral), 1) 457 | 458 | for bnNumeral in re.findall(r'[০-৯]+', newEnLine, flags=re.UNICODE): 459 | newEnLine = newEnLine.replace(bnNumeral, transliterate.process('Bengali', 'RomanReadable', bnNumeral), 1) 460 | 461 | return newBnLine, newEnLine 462 | 463 | def applyPatterns(line, patternMap): 464 | """ 465 | Returns: 466 | patternFound(bool) : Whether any of the patterns were found in the line 467 | line(str) : Transformed line using pattern replacements 468 | """ 469 | patternFound = False 470 | orignalLine = line 471 | for pattern, replacement in patternMap.items(): 472 | line = line.replace(pattern, replacement) 473 | if orignalLine != line: 474 | patternFound = True 475 | 476 | return patternFound, line 477 | 478 | def patternHandler(bnLine, enLine, verbose): 479 | _, newEnLine = applyPatterns(enLine, enPatternMap) 480 | _, newBnLine = applyPatterns(bnLine, bnPatternMap) 481 | 482 | if verbose: 483 | # if there is foreign text in the linepair after applying replacements 484 | if hasNonBangla(newBnLine) or hasNonEnglish(newEnLine): 485 | patternHandler.foreignPairs.append((newBnLine, newEnLine)) 486 | else: 487 | # if the original linepair had foreign characters 488 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 489 | patternHandler.savedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 490 | 491 | patternHandler.validPairs.append((newBnLine, newEnLine)) 492 | 493 | return newBnLine, newEnLine 494 | 495 | def patchHandler(bnLine, enLine, verbose): 496 | """ 497 | finds continous patches of non bangla text on bangla side and removes 498 | a found patch from both sides if it is also present in english side 499 | """ 500 | 501 | minPatchLength = 2 # this shouldn't be too small 502 | newBnLine, newEnLine = bnLine, enLine 503 | nonBanglaPatches = re.findall( 504 | INCLUSIVE_NON_BANGLA_PATCH, 505 | newBnLine, 506 | flags=re.UNICODE 507 | ) 508 | 509 | for patch in nonBanglaPatches: 510 | patch = patch.strip() 511 | 512 | # ignore whitespace only patches 513 | if patch == "": 514 | continue 515 | # ignore number only patches 516 | try: 517 | float(patch) 518 | continue 519 | except: 520 | pass 521 | # patch shouldnt end in starting braces 522 | if patch[-1] in ['[', '{', '(']: 523 | patch = patch[:-1].strip() 524 | # patch shouldnt start with ending braces/common punctuations 525 | if patch and patch[0] in [']', '}', ')', ',', '.']: 526 | patch = patch[1:].strip() 527 | 528 | # should the patch length be counted 
with spaces included?? 529 | # should all matching patches be removed in english side? 530 | if ( 531 | len(patch) >= minPatchLength and 532 | ( 533 | patch in newEnLine or 534 | patch.upper() in newEnLine or 535 | patch.lower() in newEnLine or 536 | patch.capitalize() in newEnLine 537 | ) 538 | ): 539 | newBnLine = newBnLine.replace(patch, "").strip() 540 | newEnLine = newEnLine.replace( 541 | patch, "" 542 | ).replace( 543 | patch.lower(), "" 544 | ).replace( 545 | patch.upper(), "" 546 | ).replace( 547 | patch.capitalize(), "" 548 | ).strip() 549 | 550 | 551 | if verbose: 552 | # if there is foreign text in the linepair after applying replacements 553 | if hasNonBangla(newBnLine) or hasNonEnglish(newEnLine): 554 | patchHandler.foreignPairs.append((newBnLine, newEnLine)) 555 | else: 556 | # if the original linepair had foreign characters 557 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 558 | patchHandler.savedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 559 | 560 | patchHandler.validPairs.append((newBnLine, newEnLine)) 561 | 562 | return newBnLine, newEnLine 563 | 564 | def alternatePatchHandler(bnLine, enLine, verbose): 565 | """ 566 | finds common patches in both lines and removes a patch 567 | from bothsides if the found patch is non bangla 568 | """ 569 | 570 | def isValidPatch(patch): 571 | patch = re.sub( 572 | INCLUSIVE_NON_BANGLA_PATCH, 573 | "", 574 | patch, 575 | flags=re.UNICODE 576 | ).strip() 577 | return patch == "" 578 | 579 | minPatchLength = 2 # this shouldn't be too small 580 | newBnLine, newEnLine = bnLine, enLine 581 | 582 | matcher = difflib.SequenceMatcher(None, newBnLine, newEnLine) 583 | potentialPatches = [newBnLine[match.a : match.a + match.size] for match in matcher.get_matching_blocks() if match.size > 0] 584 | 585 | for patch in potentialPatches: 586 | patch = patch.strip() 587 | 588 | # ignore whitespace only patches 589 | if patch == "": 590 | continue 591 | # ignore number only patches 592 | try: 593 | float(patch) 594 | continue 595 | except: 596 | pass 597 | 598 | # patch shouldnt end in starting braces 599 | if patch[-1] in ['[', '{', '(']: 600 | patch = patch[:-1].strip() 601 | # patch shouldnt start with ending braces/common punctuations 602 | if patch and patch[0] in [']', '}', ')', ',', '.']: 603 | patch = patch[1:].strip() 604 | 605 | # should the patch length be counted with spaces included?? 606 | # should all matching patches be removed in english side? 
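# The candidate patch below is removed only when it is entirely non-Bangla
# (isValidPatch), is at least minPatchLength characters long, and occurs in the
# English line in some casing; it is then stripped from the Bangla side, and
# every casing variant of it is stripped from the English side.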
607 | if ( 608 | isValidPatch(patch) and 609 | len(patch) >= minPatchLength and 610 | ( 611 | patch in newEnLine or 612 | patch.upper() in newEnLine or 613 | patch.lower() in newEnLine or 614 | patch.capitalize() in newEnLine 615 | ) 616 | ): 617 | newBnLine = newBnLine.replace(patch, "").strip() 618 | newEnLine = newEnLine.replace( 619 | patch, "" 620 | ).replace( 621 | patch.lower(), "" 622 | ).replace( 623 | patch.upper(), "" 624 | ).replace( 625 | patch.capitalize(), "" 626 | ).strip() 627 | 628 | if verbose: 629 | # if there is foreign text in the linepair after applying replacements 630 | if hasNonBangla(newBnLine) or hasNonEnglish(newEnLine): 631 | alternatePatchHandler.foreignPairs.append((newBnLine, newEnLine)) 632 | else: 633 | # if the original linepair had foreign characters 634 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 635 | alternatePatchHandler.savedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 636 | 637 | alternatePatchHandler.validPairs.append((newBnLine, newEnLine)) 638 | 639 | return newBnLine, newEnLine 640 | 641 | def bpediaPatchHandler(bnLine, enLine, verbose): 642 | """ 643 | removes non bangla patches from first-bracketed spans on bangla side. 644 | Specific to Banglapedia. 645 | """ 646 | newBnLine, newEnLine = bnLine, enLine 647 | bracketedSpans = re.findall(r'\([^\)]*?\)', newBnLine, flags=re.UNICODE) 648 | 649 | for span in bracketedSpans: 650 | if hasNonBangla(span): 651 | # if span contains no bangla text, remove it 652 | if not countBanglaChars(span): 653 | newBnLine = newBnLine.replace(span, "").strip() 654 | else: 655 | nonBanglaPatches = re.findall( 656 | INCLUSIVE_NON_BANGLA_PATCH, 657 | span, 658 | flags=re.UNICODE 659 | ) 660 | newSpan = span 661 | for patch in nonBanglaPatches: 662 | if isValidProcessedPair(patch, patch): 663 | newSpan = newSpan.replace(patch.strip(), "") 664 | 665 | newBnLine = newBnLine.replace(span, newSpan.strip()).strip() 666 | 667 | if verbose: 668 | # if there is foreign text in the linepair after applying replacements 669 | if hasNonBangla(newBnLine) or hasNonEnglish(newEnLine): 670 | bpediaPatchHandler.foreignPairs.append((newBnLine, newEnLine)) 671 | else: 672 | # if the original linepair had foreign characters 673 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 674 | bpediaPatchHandler.savedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 675 | 676 | bpediaPatchHandler.validPairs.append((newBnLine, newEnLine)) 677 | 678 | return newBnLine, newEnLine 679 | 680 | def transliterateHandler(bnLine, enLine, verbose): 681 | newBnLine, newEnLine = bnLine, enLine 682 | # convert numerals to appropriate script 683 | newBnLine, newEnLine = convertNumerals(newBnLine, newEnLine) 684 | 685 | charConvertMap = { 686 | 'a' : 'এ', 687 | 'b' : 'বি', 688 | 'c' : 'সি', 689 | 'd' : 'ডি', 690 | 'e' : 'ই', 691 | 'f' : 'এফ', 692 | 'g' : 'জি', 693 | 'h' : 'এইচ', 694 | 'i' : 'আই', 695 | 'j' : 'জে', 696 | 'k' : 'কে', 697 | 'l' : 'এল', 698 | 'm' : 'এম', 699 | 'n' : 'এন', 700 | 'o' : 'ও', 701 | 'p' : 'পি', 702 | 'q' : 'কিউ', 703 | 'r' : 'আর', 704 | 's' : 'এস', 705 | 't' : 'টি', 706 | 'u' : 'ইউ', 707 | 'v' : 'ভি', 708 | 'w' : 'ডব্লিউ', 709 | 'x' : 'এক্স', 710 | 'y' : 'ওআই', 711 | 'z' : 'জেড', 712 | } 713 | 714 | def customReplace(match): 715 | caughtChar = match.group(1).strip() 716 | return charConvertMap[caughtChar.lower()] 717 | 718 | newBnLine = re.sub(r'(\b[a-zA-Z]\b)', customReplace, newBnLine, flags=re.UNICODE) 719 | 720 | if verbose: 721 | # if there is foreign text in the linepair after applying replacements 722 | if 
hasNonBangla(newBnLine) or hasNonEnglish(newEnLine): 723 | transliterateHandler.foreignPairs.append((newBnLine, newEnLine)) 724 | else: 725 | # if the original linepair had foreign characters 726 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 727 | transliterateHandler.savedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 728 | 729 | transliterateHandler.validPairs.append((newBnLine, newEnLine)) 730 | 731 | return newBnLine, newEnLine 732 | 733 | def ratioHandler(bnLine, enLine, verbose): 734 | newBnLine, newEnLine = bnLine, enLine 735 | 736 | bnLowerThresh = .001 737 | bnUpperThresh = 10.0 738 | 739 | enLowerThresh = .001 740 | enUpperThresh = 10.0 741 | 742 | 743 | bnNativeChars = countBanglaChars(bnLine) 744 | bnForeignChars = len(bnLine) - bnNativeChars - countNeutralChars(bnLine) 745 | bnRatio = bnForeignChars/bnNativeChars 746 | 747 | enNativeChars = countEnglishChars(enLine) 748 | enForeignChars = len(enLine) - enNativeChars - countNeutralChars(enLine) 749 | enRatio = enForeignChars/enNativeChars 750 | 751 | if verbose: 752 | # if there is foreign text in the linepair after applying replacements 753 | if ( 754 | bnRatio >= bnUpperThresh or 755 | enRatio >= enUpperThresh or 756 | bnRatio <= bnLowerThresh or 757 | enRatio <= enLowerThresh or 758 | countBanglaChars(enLine) or countEnglishChars(bnLine) 759 | ): 760 | ratioHandler.foreignPairs.append((newBnLine, newEnLine)) 761 | else: 762 | # if the original linepair had foreign characters 763 | if bnForeignChars or enForeignChars: 764 | ratioHandler.savedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 765 | 766 | ratioHandler.validPairs.append((newBnLine, newEnLine)) 767 | 768 | return newBnLine, newEnLine 769 | 770 | def cleanSentencePairs(linePairs, options, output_dir, verbose=True): 771 | for option, choice in options.items(): 772 | if choice: 773 | globals()[f'{option}Handler'].validPairs = multiprocessing.Manager().list() 774 | globals()[f'{option}Handler'].foreignPairs = multiprocessing.Manager().list() 775 | globals()[f'{option}Handler'].savedPairs = multiprocessing.Manager().list() 776 | 777 | 778 | global finalValidPairs, finalForeignPairs, finalSavedPairs, initialValidPairs, initialForeignPairs 779 | 780 | finalValidPairs, finalForeignPairs, finalSavedPairs = ( 781 | multiprocessing.Manager().list(), 782 | multiprocessing.Manager().list(), 783 | multiprocessing.Manager().list() 784 | ) 785 | initialValidPairs, initialForeignPairs = ( 786 | multiprocessing.Manager().list(), 787 | multiprocessing.Manager().list() 788 | ) 789 | 790 | @globalize 791 | def processPair(bnLine, enLine): 792 | validPair = False 793 | 794 | if bnLine == enLine or (not countBanglaChars(bnLine) or not countEnglishChars(enLine)): 795 | return 796 | 797 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 798 | initialForeignPairs.append((bnLine, enLine)) 799 | else: 800 | initialValidPairs.append((bnLine, enLine)) 801 | validPair = True 802 | 803 | # apply all selected transformations 804 | newBnLine, newEnLine = bnLine, enLine 805 | 806 | newBnLine = normalize(newBnLine).strip() 807 | newEnLine = normalize(newEnLine).strip() 808 | newBnLine, newEnLine = replaceEmojis(newBnLine, newEnLine) 809 | 810 | # remove unnecessary bracketed spans; this needs to be done first for patch removal to work 811 | newBnLine = re.sub( 812 | BRACKETED_SPANS, 813 | "", 814 | newBnLine, 815 | flags=re.UNICODE 816 | ) 817 | newEnLine = re.sub( 818 | BRACKETED_SPANS, 819 | "", 820 | newEnLine, 821 | flags=re.UNICODE 822 | ) 823 | 824 | for option, choice in 
options.items(): 825 | if choice and not validPair: 826 | newBnLine, newEnLine = globals()[f'{option}Handler'](newBnLine, newEnLine, verbose) 827 | 828 | # do some final postprocessing 829 | if not isValidProcessedPair(newBnLine, newEnLine): 830 | return 831 | 832 | # remove unnecessary bracketed spans (after patch deletion) 833 | newBnLine = re.sub( 834 | BRACKETED_SPANS, 835 | "", 836 | newBnLine, 837 | flags=re.UNICODE 838 | ) 839 | newEnLine = re.sub( 840 | BRACKETED_SPANS, 841 | "", 842 | newEnLine, 843 | flags=re.UNICODE 844 | ) 845 | 846 | newBnLine = re.sub(r'\s+', ' ', newBnLine) 847 | newEnLine = re.sub(r'\s+', ' ', newEnLine) 848 | 849 | 850 | # if there is foreign text in the linepair after applying transformations 851 | if hasNonBangla(newBnLine) or hasNonEnglish(newEnLine): 852 | finalForeignPairs.append((bnLine, enLine)) 853 | else: 854 | # if the original linepair had foreign characters 855 | if hasNonBangla(bnLine) or hasNonEnglish(enLine): 856 | finalSavedPairs.append((bnLine, enLine, newBnLine, newEnLine)) 857 | 858 | finalValidPairs.append((newBnLine, newEnLine)) 859 | 860 | print('\tStarting processes...') 861 | with Pool() as pool: 862 | pool.starmap(processPair, linePairs) 863 | 864 | print('\tWriting logs and outputs...') 865 | # write the final and temporary filepairs to appropriate directories 866 | writeFilePairs( 867 | os.path.join(output_dir, 'Final'), 868 | finalValidPairs, finalForeignPairs, finalSavedPairs 869 | ) 870 | 871 | if verbose: 872 | writeFilePairs( 873 | os.path.join(output_dir, "tmp", "Initial"), 874 | initialValidPairs, initialForeignPairs) 875 | 876 | for option, choice in options.items(): 877 | if choice: 878 | funcName = globals()[f'{option}Handler'] 879 | writeFilePairs( 880 | os.path.join(output_dir, "tmp", option), 881 | funcName.validPairs, 882 | funcName.foreignPairs, 883 | funcName.savedPairs 884 | ) 885 | 886 | 887 | def col(id): 888 | if id == 1: return "\033[32m" 889 | if id == 2: return "\033[33m" 890 | if id == 3: return "\033[31m" 891 | return "\033[0m" 892 | 893 | def cleanup(dirname): 894 | for sub_dir in ["Final", "tmp"]: 895 | if os.path.isdir(os.path.join(dirname, sub_dir)): 896 | shutil.rmtree(os.path.join(dirname, sub_dir)) 897 | 898 | def _merge(input_files, output_file): 899 | with open(output_file, 'w') as outf: 900 | for input_file in input_files: 901 | with open(input_file) as inpf: 902 | for line in inpf: 903 | print(line.strip(), file=outf) 904 | 905 | def _train(input_file, output_prefix, coverage, vocab_size, model_type): 906 | cmd = [ 907 | f"spm_train --input=\"{input_file}\"", 908 | f"--model_prefix=\"{output_prefix}\"", 909 | f"--vocab_size={vocab_size}", 910 | f"--character_coverage={coverage}", 911 | "--train_extremely_large_corpus" 912 | ] 913 | os.system(" ".join(cmd)) 914 | 915 | def main(args): 916 | global enPatternMap, bnPatternMap 917 | enPatternMap, bnPatternMap = readReplacePatterns( 918 | os.path.join(os.path.dirname(__file__), "replacePatterns.txt") 919 | ) 920 | if os.path.isdir(args.output_dir): 921 | shutil.rmtree(args.output_dir) 922 | 923 | if args.normalize: 924 | iterator = tqdm( 925 | glob.glob(os.path.join(args.input_dir, "**", "*.bn"), recursive=True) + 926 | glob.glob(os.path.join(args.input_dir, "**", "*.en"), recursive=True), 927 | desc="Normalizing files" 928 | ) 929 | for input_file in iterator: 930 | output_file = input_file.replace( 931 | os.path.normpath(args.input_dir), 932 | os.path.normpath(args.output_dir) 933 | ) 934 | os.makedirs(os.path.dirname(output_file), exist_ok=True) 
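# In --normalize mode each input file is simply rewritten line by line into the
# mirrored path under output_dir: unicode normalization and the static
# replacePatterns substitutions are applied, but no pair-wise filtering or
# sentencepiece training is performed on this branch.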
935 | with open(output_file, 'w') as outf: 936 | with open(input_file) as inpf: 937 | for line in inpf: 938 | line = normalize(line) 939 | _, line = applyPatterns(line, enPatternMap) 940 | _, line = applyPatterns(line, bnPatternMap) 941 | print(line.strip(), file=outf) 942 | else: 943 | cleanup(args.output_dir) 944 | os.makedirs(os.path.join(args.output_dir, "data"), exist_ok=True) 945 | 946 | linePair_list = [] 947 | for bnFile in glob.glob(os.path.join(args.input_dir, "**", f"*.bn"), recursive=True): 948 | enFile = bnFile[:-3] + ".en" 949 | if not os.path.isfile(enFile): 950 | continue 951 | linePair_list.append(readFilePair(bnFile, enFile)) 952 | 953 | linePairs = list(chain.from_iterable(linePair_list)) 954 | 955 | print(col(2) + 'Starting Stage 1...' + col(0)) 956 | cleanSentencePairs(linePairs, {'pattern': True}, args.output_dir) 957 | 958 | shutil.copy( 959 | os.path.join(args.output_dir, "Final", "cleaned.bn"), 960 | os.path.join(args.output_dir, "data", "data1.bn") 961 | ) 962 | shutil.copy( 963 | os.path.join(args.output_dir, "Final", "cleaned.en"), 964 | os.path.join(args.output_dir, "data", "data1.en") 965 | ) 966 | shutil.copy( 967 | os.path.join(args.output_dir, "tmp", "pattern", "filtered.bn"), 968 | os.path.join(args.output_dir, "data", "stage1Filtered.bn") 969 | ) 970 | shutil.copy( 971 | os.path.join(args.output_dir, "tmp", "pattern", "filtered.en"), 972 | os.path.join(args.output_dir, "data", "stage1Filtered.en") 973 | ) 974 | 975 | print(col(2) + 'Starting Stage 2...' + col(0)) 976 | cleanup(args.output_dir) 977 | linePairs = readFilePair( 978 | os.path.join(args.output_dir, "data", "stage1Filtered.bn"), 979 | os.path.join(args.output_dir, "data", "stage1Filtered.en") 980 | ) 981 | cleanSentencePairs(linePairs, {'ratio': True}, args.output_dir) 982 | 983 | shutil.copy( 984 | os.path.join(args.output_dir, "tmp", "ratio", "cleaned.bn"), 985 | os.path.join(args.output_dir, "data", "data2.bn") 986 | ) 987 | shutil.copy( 988 | os.path.join(args.output_dir, "tmp", "ratio", "cleaned.en"), 989 | os.path.join(args.output_dir, "data", "data2.en") 990 | ) 991 | shutil.copy( 992 | os.path.join(args.output_dir, "tmp", "ratio", "filtered.bn"), 993 | os.path.join(args.output_dir, "data", "stage2Filtered.bn") 994 | ) 995 | shutil.copy( 996 | os.path.join(args.output_dir, "tmp", "ratio", "filtered.en"), 997 | os.path.join(args.output_dir, "data", "stage2Filtered.en") 998 | ) 999 | 1000 | 1001 | print(col(2) + 'Starting Stage 3...' 
+ col(0)) 1002 | cleanup(args.output_dir) 1003 | linePairs = readFilePair( 1004 | os.path.join(args.output_dir, "data", "stage2Filtered.bn"), 1005 | os.path.join(args.output_dir, "data", "stage2Filtered.en") 1006 | ) 1007 | cleanSentencePairs( 1008 | linePairs, 1009 | { 1010 | 'patch': True, 1011 | 'alternatePatch': True, 1012 | 'transliterate': True 1013 | }, 1014 | args.output_dir 1015 | ) 1016 | 1017 | shutil.copy( 1018 | os.path.join(args.output_dir, "Final", "cleaned.bn"), 1019 | os.path.join(args.output_dir, "data", "data3.bn") 1020 | ) 1021 | shutil.copy( 1022 | os.path.join(args.output_dir, "Final", "cleaned.en"), 1023 | os.path.join(args.output_dir, "data", "data3.en") 1024 | ) 1025 | 1026 | _merge( 1027 | [ 1028 | os.path.join(args.output_dir, "data", "data1.bn"), 1029 | os.path.join(args.output_dir, "data", "data2.bn"), 1030 | os.path.join(args.output_dir, "data", "data3.bn"), 1031 | ], 1032 | os.path.join(args.output_dir, "combined.bn") 1033 | ) 1034 | _merge( 1035 | [ 1036 | os.path.join(args.output_dir, "data", "data1.en"), 1037 | os.path.join(args.output_dir, "data", "data2.en"), 1038 | os.path.join(args.output_dir, "data", "data3.en"), 1039 | ], 1040 | os.path.join(args.output_dir, "combined.en") 1041 | ) 1042 | _merge( 1043 | [ 1044 | os.path.join(args.output_dir, "data", "data1.bn"), 1045 | os.path.join(args.output_dir, "data", "data3.bn"), 1046 | ], 1047 | os.path.join(args.output_dir, "vocab_train.bn") 1048 | ) 1049 | _merge( 1050 | [ 1051 | os.path.join(args.output_dir, "data", "data1.en"), 1052 | os.path.join(args.output_dir, "data", "data3.en"), 1053 | ], 1054 | os.path.join(args.output_dir, "vocab_train.en") 1055 | ) 1056 | 1057 | _train( 1058 | os.path.join(args.output_dir, "vocab_train.bn"), 1059 | os.path.join(args.output_dir, "bn"), 1060 | args.bn_coverage, 1061 | args.bn_vocab_size, 1062 | args.bn_model_type 1063 | ) 1064 | 1065 | _train( 1066 | os.path.join(args.output_dir, "vocab_train.en"), 1067 | os.path.join(args.output_dir, "en"), 1068 | args.en_coverage, 1069 | args.en_vocab_size, 1070 | args.en_model_type 1071 | ) 1072 | 1073 | shutil.rmtree(os.path.join(args.output_dir, "data")) 1074 | os.remove(os.path.join(args.output_dir, "vocab_train.bn")) 1075 | os.remove(os.path.join(args.output_dir, "vocab_train.en")) 1076 | cleanup(args.output_dir) 1077 | 1078 | 1079 | 1080 | if __name__ == "__main__": 1081 | parser = argparse.ArgumentParser() 1082 | parser.add_argument( 1083 | '--input_dir', '-i', type=str, 1084 | required=True, 1085 | metavar='PATH', 1086 | help="Input directory") 1087 | 1088 | parser.add_argument( 1089 | '--output_dir', '-o', type=str, 1090 | required=True, 1091 | metavar='PATH', 1092 | help="Output directory") 1093 | 1094 | parser.add_argument('--normalize', action='store_true', 1095 | help='Only normalize the files in input directory') 1096 | 1097 | parser.add_argument( 1098 | '--bn_vocab_size', type=int, default=32000, 1099 | help='bengali vocab size') 1100 | 1101 | parser.add_argument( 1102 | '--en_vocab_size', type=int, default=32000, 1103 | help='english vocab size') 1104 | 1105 | parser.add_argument( 1106 | '--bn_model_type', type=str, default="unigram", 1107 | help='bengali sentencepiece model type') 1108 | 1109 | parser.add_argument( 1110 | '--en_model_type', type=str, default="unigram", 1111 | help='english sentencepiece model type') 1112 | 1113 | parser.add_argument( 1114 | '--bn_coverage', type=float, default=1.0, 1115 | help='bengali character coverage') 1116 | 1117 | parser.add_argument( 1118 | '--en_coverage', type=float, 
default=1.0, 1119 | help='english character coverage') 1120 | 1121 | args = parser.parse_args() 1122 | main(args) 1123 | 1124 | -------------------------------------------------------------------------------- /training/preprocessing/remove_evaluation_pairs.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import argparse 3 | import sys 4 | import shutil 5 | import pyonmttok 6 | import os 7 | import glob 8 | import math 9 | from tqdm import tqdm 10 | 11 | def get_linepairs(args, data_type): 12 | linepairs = set() 13 | 14 | for src_file in glob.glob(os.path.join(args.input_dir, "data", f"*.{data_type}.{args.src_lang}")): 15 | tgt_file_prefix = src_file.rsplit(f".{data_type}.{args.src_lang}", 1)[0] + f".{data_type}.{args.tgt_lang}" 16 | tgt_files = glob.glob(tgt_file_prefix + "*") 17 | 18 | if tgt_files: 19 | for tgt_file in tgt_files: 20 | with open(src_file) as fs, open(tgt_file) as ft: 21 | for src_line, tgt_line in zip(fs, ft): 22 | linepairs.add( 23 | (src_line.strip(), tgt_line.strip()) 24 | ) 25 | return linepairs 26 | 27 | def main(args): 28 | exclude_linepairs = set() 29 | exclude_linepairs.update( 30 | get_linepairs(args, "valid") 31 | ) 32 | exclude_linepairs.update( 33 | get_linepairs(args, "test") 34 | ) 35 | 36 | os.makedirs(args.output_dir, exist_ok=True) 37 | with open(os.path.join(args.output_dir, f"corpus.train.{args.src_lang}"), 'w') as srcF, \ 38 | open(os.path.join(args.output_dir, f"corpus.train.{args.tgt_lang}"), 'w') as tgtF: 39 | 40 | for src_file in glob.glob(os.path.join(args.input_dir, "data", f"*.train.{args.src_lang}")): 41 | tgt_file_prefix = src_file.rsplit(f".train.{args.src_lang}", 1)[0] + f".train.{args.tgt_lang}" 42 | tgt_files = glob.glob(tgt_file_prefix + "*") 43 | 44 | if tgt_files: 45 | # when multiple references are present, pick the first one 46 | tgt_file = tgt_files[0] 47 | 48 | with open(src_file) as fs, open(tgt_file) as ft: 49 | for src_line, tgt_line in zip(fs, ft): 50 | src_line = src_line.strip() 51 | tgt_line = tgt_line.strip() 52 | 53 | if (src_line, tgt_line) not in exclude_linepairs: 54 | print(src_line, file=srcF) 55 | print(tgt_line, file=tgtF) 56 | 57 | 58 | if __name__ == "__main__": 59 | parser = argparse.ArgumentParser() 60 | parser.add_argument( 61 | '--input_dir', '-i', type=str, 62 | required=True, 63 | metavar='PATH', 64 | help="Input directory") 65 | 66 | parser.add_argument( 67 | '--output_dir', '-o', type=str, 68 | required=True, 69 | metavar='PATH', 70 | help="Output directory") 71 | 72 | parser.add_argument( 73 | '--src_lang', type=str, 74 | required=True, 75 | help="Source language") 76 | 77 | parser.add_argument( 78 | '--tgt_lang', type=str, 79 | required=True, 80 | help="Target language") 81 | 82 | args = parser.parse_args() 83 | main(args) 84 | 85 | 86 | 87 | -------------------------------------------------------------------------------- /training/preprocessing/replacePatterns.txt: -------------------------------------------------------------------------------- 1 | ¹:1 2 | ²:2 3 | ³:3 4 | À:A 5 | Á:A 6 | Â:A 7 | Ã:A 8 | Ä:A 9 | Å:A 10 | Ā:A 11 | Ă:A 12 | Ą:A 13 | Ǎ:A 14 | Ǟ:A 15 | Ǡ:A 16 | Ǻ:A 17 | Ȁ:A 18 | Ȃ:A 19 | Ȧ:A 20 | Ḁ:A 21 | Ạ:A 22 | Ả:A 23 | Ấ:A 24 | Ầ:A 25 | Ẩ:A 26 | Ẫ:A 27 | Ậ:A 28 | Ắ:A 29 | Ằ:A 30 | Ẳ:A 31 | Ẵ:A 32 | Ặ:A 33 | à:a 34 | á:a 35 | â:a 36 | ã:a 37 | ä:a 38 | å:a 39 | ª:a 40 | ā:a 41 | ă:a 42 | ą:a 43 | ǎ:a 44 | ǟ:a 45 | ǡ:a 46 | ǻ:a 47 | ȁ:a 48 | ȃ:a 49 | ȧ:a 50 | ḁ:a 51 | ạ:a 52 | ả:a 53 | ấ:a 54 | ầ:a 55 | ẩ:a 56 | ẫ:a 57 | ậ:a 58 
| ắ:a 59 | ằ:a 60 | ẳ:a 61 | ẵ:a 62 | ặ:a 63 | Ḃ:B 64 | Ḅ:B 65 | Ḇ:B 66 | ḃ:b 67 | ḅ:b 68 | ḇ:b 69 | Ç:C 70 | Ć:C 71 | Ĉ:C 72 | Ċ:C 73 | Č:C 74 | Ḉ:C 75 | ç:c 76 | ć:c 77 | ĉ:c 78 | ċ:c 79 | č:c 80 | ḉ:c 81 | Ð:D 82 | Ď:D 83 | Đ:D 84 | Ḋ:D 85 | Ḍ:D 86 | Ḏ:D 87 | Ḑ:D 88 | Ḓ:D 89 | ď:d 90 | đ:d 91 | ḋ:d 92 | ḍ:d 93 | ḏ:d 94 | ḑ:d 95 | ḓ:d 96 | È:E 97 | É:E 98 | Ê:E 99 | Ë:E 100 | Ē:E 101 | Ĕ:E 102 | Ė:E 103 | Ę:E 104 | Ě:E 105 | Ȅ:E 106 | Ȇ:E 107 | Ȩ:E 108 | Ḕ:E 109 | Ḗ:E 110 | Ḙ:E 111 | Ḛ:E 112 | Ḝ:E 113 | Ẹ:E 114 | Ẻ:E 115 | Ẽ:E 116 | Ế:E 117 | Ề:E 118 | Ể:E 119 | Ễ:E 120 | Ệ:E 121 | è:e 122 | é:e 123 | ê:e 124 | ë:e 125 | ē:e 126 | ĕ:e 127 | ė:e 128 | ę:e 129 | ě:e 130 | ȅ:e 131 | ȇ:e 132 | ȩ:e 133 | ḕ:e 134 | ḗ:e 135 | ḙ:e 136 | ḛ:e 137 | ḝ:e 138 | ẹ:e 139 | ẻ:e 140 | ẽ:e 141 | ế:e 142 | ề:e 143 | ể:e 144 | ễ:e 145 | ệ:e 146 | Ḟ:F 147 | ḟ:f 148 | Ĝ:G 149 | Ğ:G 150 | Ġ:G 151 | Ģ:G 152 | Ǧ:G 153 | Ǵ:G 154 | Ḡ:G 155 | ĝ:g 156 | ğ:g 157 | ġ:g 158 | ģ:g 159 | ǧ:g 160 | ǵ:g 161 | ḡ:g 162 | Ĥ:H 163 | Ħ:H 164 | Ȟ:H 165 | Ḣ:H 166 | Ḥ:H 167 | Ḧ:H 168 | Ḩ:H 169 | Ḫ:H 170 | ĥ:h 171 | ħ:h 172 | ȟ:h 173 | ḣ:h 174 | ḥ:h 175 | ḧ:h 176 | ḩ:h 177 | ḫ:h 178 | ẖ:h 179 | Ì:I 180 | Í:I 181 | Î:I 182 | Ï:I 183 | Ĩ:I 184 | Ī:I 185 | Ĭ:I 186 | Į:I 187 | İ:I 188 | Ǐ:I 189 | Ȉ:I 190 | Ȋ:I 191 | Ḭ:I 192 | Ḯ:I 193 | Ỉ:I 194 | Ị:I 195 | ì:i 196 | í:i 197 | î:i 198 | ï:i 199 | ĩ:i 200 | ī:i 201 | ĭ:i 202 | į:i 203 | ı:i 204 | ǐ:i 205 | ȉ:i 206 | ȋ:i 207 | ḭ:i 208 | ḯ:i 209 | ỉ:i 210 | ị:i 211 | Ĵ:J 212 | ĵ:j 213 | Ķ:K 214 | Ǩ:K 215 | Ḱ:K 216 | Ḳ:K 217 | Ḵ:K 218 | ķ:k 219 | ǩ:k 220 | ḱ:k 221 | ḳ:k 222 | ḵ:k 223 | Ĺ:L 224 | Ļ:L 225 | Ľ:L 226 | Ŀ:L 227 | Ł:L 228 | Ḷ:L 229 | Ḹ:L 230 | Ḻ:L 231 | Ḽ:L 232 | ĺ:l 233 | ļ:l 234 | ľ:l 235 | ŀ:l 236 | ł:l 237 | ḷ:l 238 | ḹ:l 239 | ḻ:l 240 | ḽ:l 241 | Ḿ:M 242 | Ṁ:M 243 | Ṃ:M 244 | ḿ:m 245 | ṁ:m 246 | ṃ:m 247 | Ñ:N 248 | Ń:N 249 | Ņ:N 250 | Ň:N 251 | Ǹ:N 252 | Ṅ:N 253 | Ṇ:N 254 | Ṉ:N 255 | Ṋ:N 256 | ñ:n 257 | ń:n 258 | ņ:n 259 | ň:n 260 | ǹ:n 261 | ṅ:n 262 | ṇ:n 263 | ṉ:n 264 | ṋ:n 265 | Ò:O 266 | Ó:O 267 | Ô:O 268 | Õ:O 269 | Ö:O 270 | Ō:O 271 | Ŏ:O 272 | Ő:O 273 | Ơ:O 274 | Ǒ:O 275 | Ǫ:O 276 | Ǭ:O 277 | Ȍ:O 278 | Ȏ:O 279 | Ȫ:O 280 | Ȭ:O 281 | Ȯ:O 282 | Ȱ:O 283 | Ṍ:O 284 | Ṏ:O 285 | Ṑ:O 286 | Ṓ:O 287 | Ọ:O 288 | Ỏ:O 289 | Ố:O 290 | Ồ:O 291 | Ổ:O 292 | Ỗ:O 293 | Ộ:O 294 | Ớ:O 295 | Ờ:O 296 | Ở:O 297 | Ỡ:O 298 | Ợ:O 299 | ò:o 300 | ó:o 301 | ô:o 302 | õ:o 303 | ö:o 304 | ō:o 305 | ŏ:o 306 | ő:o 307 | ơ:o 308 | ǒ:o 309 | ǫ:o 310 | ǭ:o 311 | ȍ:o 312 | ȏ:o 313 | ȫ:o 314 | ȭ:o 315 | ȯ:o 316 | ȱ:o 317 | ṍ:o 318 | ṏ:o 319 | ṑ:o 320 | ṓ:o 321 | ọ:o 322 | ỏ:o 323 | ố:o 324 | ồ:o 325 | ổ:o 326 | ỗ:o 327 | ộ:o 328 | ớ:o 329 | ờ:o 330 | ở:o 331 | ỡ:o 332 | ợ:o 333 | Ṕ:P 334 | Ṗ:P 335 | ṕ:p 336 | ṗ:p 337 | Ŕ:R 338 | Ŗ:R 339 | Ř:R 340 | Ȑ:R 341 | Ȓ:R 342 | Ṙ:R 343 | Ṛ:R 344 | Ṝ:R 345 | Ṟ:R 346 | ŕ:r 347 | ŗ:r 348 | ř:r 349 | ȑ:r 350 | ȓ:r 351 | ṙ:r 352 | ṛ:r 353 | ṝ:r 354 | ṟ:r 355 | Ś:S 356 | Ŝ:S 357 | Ş:S 358 | Š:S 359 | Ș:S 360 | Ṡ:S 361 | Ṣ:S 362 | Ṥ:S 363 | Ṧ:S 364 | Ṩ:S 365 | ś:s 366 | ŝ:s 367 | ş:s 368 | š:s 369 | ș:s 370 | ṡ:s 371 | ṣ:s 372 | ṥ:s 373 | ṧ:s 374 | ṩ:s 375 | Ţ:T 376 | Ť:T 377 | Ŧ:T 378 | Ț:T 379 | Ṫ:T 380 | Ṭ:T 381 | Ṯ:T 382 | Ṱ:T 383 | ţ:t 384 | ť:t 385 | ŧ:t 386 | ț:t 387 | ṫ:t 388 | ṭ:t 389 | ṯ:t 390 | ṱ:t 391 | ẗ:t 392 | Ù:U 393 | Ú:U 394 | Û:U 395 | Ü:U 396 | Ũ:U 397 | Ū:U 398 | Ŭ:U 399 | Ů:U 400 | Ű:U 401 | Ų:U 402 | Ư:U 403 | Ǔ:U 404 | Ǖ:U 405 | Ǘ:U 406 | Ǚ:U 407 | Ǜ:U 408 | Ȕ:U 409 | Ȗ:U 410 | Ṳ:U 411 | Ṵ:U 412 | Ṷ:U 413 | Ṹ:U 414 | Ṻ:U 415 | Ụ:U 416 | Ủ:U 417 | Ứ:U 
418 | Ừ:U 419 | Ử:U 420 | Ữ:U 421 | Ự:U 422 | ù:u 423 | ú:u 424 | û:u 425 | ü:u 426 | ũ:u 427 | ū:u 428 | ŭ:u 429 | ů:u 430 | ű:u 431 | ų:u 432 | ư:u 433 | ǔ:u 434 | ǖ:u 435 | ǘ:u 436 | ǚ:u 437 | ǜ:u 438 | ȕ:u 439 | ȗ:u 440 | ṳ:u 441 | ṵ:u 442 | ṷ:u 443 | ṹ:u 444 | ṻ:u 445 | ụ:u 446 | ủ:u 447 | ứ:u 448 | ừ:u 449 | ử:u 450 | ữ:u 451 | ự:u 452 | Ṽ:V 453 | Ṿ:V 454 | ṽ:v 455 | ṿ:v 456 | Ŵ:W 457 | Ẁ:W 458 | Ẃ:W 459 | Ẅ:W 460 | Ẇ:W 461 | Ẉ:W 462 | ŵ:w 463 | ẁ:w 464 | ẃ:w 465 | ẅ:w 466 | ẇ:w 467 | ẉ:w 468 | ẘ:w 469 | Ẋ:X 470 | Ẍ:X 471 | ẋ:x 472 | ẍ:x 473 | Ý:Y 474 | Ŷ:Y 475 | Ÿ:Y 476 | Ȳ:Y 477 | Ẏ:Y 478 | Ỳ:Y 479 | Ỵ:Y 480 | Ỷ:Y 481 | Ỹ:Y 482 | ý:y 483 | ÿ:y 484 | ŷ:y 485 | ȳ:y 486 | ẏ:y 487 | ỳ:y 488 | ỵ:y 489 | ỷ:y 490 | ỹ:y 491 | ẙ:y 492 | Ź:Z 493 | Ż:Z 494 | Ž:Z 495 | Ẑ:Z 496 | Ẓ:Z 497 | Ẕ:Z 498 | ź:z 499 | ż:z 500 | ž:z 501 | ẑ:z 502 | ẓ:z 503 | ẕ:z 504 | IJ:IJ 505 | ij:ij 506 | 507 | 508 | 509 | 510 | 511 | 512 | ú:u 513 | ó:o 514 | é:e 515 | á:a 516 | à:a 517 | ñ:n 518 | è:e 519 | ê:e 520 | û:u 521 | ç:c 522 | â:a 523 | ô:o 524 | ē:e 525 | ā:a 526 | ṇ:n 527 | ḍ:d 528 | ṛ:r 529 | ū:u 530 | Ā:A 531 | ṣ:s 532 | ō:o 533 | ý:y 534 | ü:u 535 | Ý:y 536 | Ł:L 537 | ã:a 538 | ś:s 539 | ä:a 540 | ö:o 541 | ń:n 542 | ł:l 543 | ę:e 544 | Š:s 545 | š:s 546 | ć:c 547 | đ:d 548 | ž:z 549 | ả:a 550 | Ớ:O 551 | ớ:o 552 | ấ:a 553 | ö:o 554 | ő:o 555 | 556 | ộ:o 557 | ạ:a 558 | 559 | ò:o 560 | ş:s 561 | 562 | ğ:g 563 | ą:a 564 | ø:o 565 | Ø:O 566 | ơ:o 567 | Đ:D 568 | ứ:u 569 | ế:e 570 | Č:C 571 | ň:n 572 | 573 | ë:e 574 | å:a 575 | č:c 576 | ริ:y 577 | 578 | 579 | ř:r 580 | ÿ:y 581 | ȓ:r 582 | ầ:a 583 | ũ:u 584 | 585 | ă:a 586 | ţ:t 587 | ễ:e 588 | ệ:e 589 | 590 | ù:u 591 | ọ:o 592 | ừ:u 593 | ỳ:y 594 | ồ:o 595 | ề:e 596 | ư:u 597 | ự:u 598 | ậ:a 599 | ắ:a 600 | 601 | 602 | 603 | 604 | í:i 605 | ī:i 606 | ì:i 607 | î:i 608 | ï:i 609 | ị:i 610 | ĩ:i 611 | ɨ:i 612 | 613 | Ð:D 614 | ð:d -------------------------------------------------------------------------------- /training/seq2seq/.gitignore: -------------------------------------------------------------------------------- 1 | # repo-specific stuff 2 | pred.txt 3 | *.pt 4 | \#*# 5 | .idea 6 | *.sublime-* 7 | .DS_Store 8 | data/ 9 | thesis/Models 10 | thesis/glove_dir 11 | thesis/data 12 | thesis/Preprocessed_files 13 | 14 | # Byte-compiled / optimized / DLL files 15 | __pycache__/ 16 | *.py[cod] 17 | *$py.class 18 | 19 | # C extensions 20 | *.so 21 | 22 | # Distribution / packaging 23 | .Python 24 | build/ 25 | develop-eggs/ 26 | dist/ 27 | downloads/ 28 | eggs/ 29 | .eggs/ 30 | lib/ 31 | lib64/ 32 | parts/ 33 | sdist/ 34 | var/ 35 | wheels/ 36 | *.egg-info/ 37 | .installed.cfg 38 | *.egg 39 | 40 | # PyInstaller 41 | # Usually these files are written by a python script from a template 42 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
43 | *.manifest 44 | *.spec 45 | 46 | # Installer logs 47 | pip-log.txt 48 | pip-delete-this-directory.txt 49 | 50 | # Unit test / coverage reports 51 | htmlcov/ 52 | .tox/ 53 | .coverage 54 | .coverage.* 55 | .cache 56 | nosetests.xml 57 | coverage.xml 58 | *.cover 59 | .hypothesis/ 60 | 61 | # Translations 62 | *.mo 63 | *.pot 64 | 65 | # Django stuff: 66 | *.log 67 | local_settings.py 68 | 69 | # Flask stuff: 70 | instance/ 71 | .webassets-cache 72 | 73 | # Scrapy stuff: 74 | .scrapy 75 | 76 | # Sphinx documentation 77 | docs/_build/ 78 | 79 | # PyBuilder 80 | target/ 81 | 82 | # Jupyter Notebook 83 | .ipynb_checkpoints 84 | 85 | # pyenv 86 | .python-version 87 | 88 | # celery beat schedule file 89 | celerybeat-schedule 90 | 91 | # SageMath parsed files 92 | *.sage.py 93 | 94 | # Environments 95 | .env 96 | .venv 97 | env/ 98 | venv/ 99 | ENV/ 100 | 101 | # Spyder project settings 102 | .spyderproject 103 | .spyproject 104 | 105 | # Rope project settings 106 | .ropeproject 107 | 108 | # mkdocs documentation 109 | /site 110 | 111 | # mypy 112 | .mypy_cache/ 113 | 114 | # Tensorboard 115 | runs/ 116 | -------------------------------------------------------------------------------- /training/seq2seq/dataProcessor.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import argparse 3 | import sys 4 | import shutil 5 | import pyonmttok 6 | import os 7 | import glob 8 | import math 9 | from tqdm import tqdm 10 | 11 | def createFolders(args): 12 | required_dirnames = [ 13 | "data", 14 | "Outputs", 15 | "temp", 16 | "Preprocessed", 17 | "Reports", 18 | "Models" 19 | ] 20 | 21 | # do cleanup first 22 | for dirname in required_dirnames[:4]: 23 | if os.path.isdir(os.path.join(args.output_dir, dirname)): 24 | shutil.rmtree(os.path.join(args.output_dir, dirname)) 25 | 26 | for dirname in required_dirnames: 27 | os.makedirs(os.path.join(args.output_dir, dirname), exist_ok=True) 28 | 29 | def _merge(args, data_type): 30 | with open(os.path.join(args.output_dir, "data", f"src-{data_type}.txt"), 'w') as srcF, \ 31 | open(os.path.join(args.output_dir, "data", f"tgt-{data_type}.txt"), 'w') as tgtF: 32 | 33 | for src_file in glob.glob(os.path.join(args.input_dir, "data", f"*.{data_type}.{args.src_lang}")): 34 | tgt_file_prefix = src_file.rsplit(f".{data_type}.{args.src_lang}", 1)[0] + f".{data_type}.{args.tgt_lang}" 35 | tgt_files = glob.glob(tgt_file_prefix + "*") 36 | 37 | if tgt_files: 38 | # when multiple references are present, pick the first one 39 | tgt_file = tgt_files[0] 40 | 41 | with open(src_file) as f: 42 | for line in f: 43 | print(line.strip(), file=srcF) 44 | 45 | with open(tgt_file) as f: 46 | for line in f: 47 | print(line.strip(), file=tgtF) 48 | 49 | def _move(args, dataset_category): 50 | for src_file in glob.glob(os.path.join(args.input_dir, "data", f"*.{dataset_category}.{args.src_lang}")): 51 | tgt_file_prefix = src_file.rsplit(f".{dataset_category}.{args.src_lang}", 1)[0] + f".{dataset_category}.{args.tgt_lang}" 52 | tgt_files = glob.glob(tgt_file_prefix + "*") 53 | 54 | shutil.copy( 55 | src_file, 56 | os.path.join( 57 | args.output_dir, 58 | "Outputs", 59 | f".src-{dataset_category}.txt".join( 60 | os.path.basename(src_file).rsplit(f".{dataset_category}.{args.src_lang}", 1) 61 | ) 62 | ) 63 | ) 64 | 65 | for tgt_file in tgt_files: 66 | shutil.copy( 67 | tgt_file, 68 | os.path.join( 69 | args.output_dir, 70 | "Outputs", 71 | f".tgt-{dataset_category}.txt".join( 72 | 
os.path.basename(tgt_file).rsplit(f".{dataset_category}.{args.tgt_lang}", 1) 73 | ) 74 | ) 75 | ) 76 | 77 | def moveRawData(args): 78 | # move vocab models 79 | shutil.copy( 80 | os.path.join(args.input_dir, "vocab", f"{args.src_lang}.model"), 81 | os.path.join(args.output_dir, "Preprocessed", "srcSPM.model") 82 | ) 83 | shutil.copy( 84 | os.path.join(args.input_dir, "vocab", f"{args.tgt_lang}.model"), 85 | os.path.join(args.output_dir, "Preprocessed", "tgtSPM.model") 86 | ) 87 | 88 | vocab_cmd = [ 89 | "spm_export_vocab --model", 90 | os.path.join(args.output_dir, "Preprocessed", "srcSPM.model"), 91 | "| tail -n +4 >", 92 | os.path.join(args.output_dir, "Preprocessed", "srcSPM.vocab") 93 | ] 94 | os.system(" ".join(vocab_cmd)) 95 | 96 | vocab_cmd = [ 97 | "spm_export_vocab --model", 98 | os.path.join(args.output_dir, "Preprocessed", "tgtSPM.model"), 99 | "| tail -n +4 >", 100 | os.path.join(args.output_dir, "Preprocessed", "tgtSPM.vocab") 101 | ] 102 | os.system(" ".join(vocab_cmd)) 103 | 104 | if args.do_train: 105 | _merge(args, "train") 106 | _merge(args, "valid") 107 | 108 | if not glob.glob(os.path.join(args.input_dir, "data", f"*.valid.{args.src_lang}")): 109 | np.random.seed(3435) 110 | sampledCount = 0 111 | 112 | with open(os.path.join(args.output_dir, "data", "src-train.txt.backup"), 'w') as srcT, \ 113 | open(os.path.join(args.output_dir, "data", "tgt-train.txt.backup"), 'w') as tgtT, \ 114 | open(os.path.join(args.output_dir, "data", "src-valid.txt"), 'w') as srcV, \ 115 | open(os.path.join(args.output_dir, "data", "tgt-valid.txt"), 'w') as tgtV, \ 116 | open(os.path.join(args.output_dir, "data", "src-train.txt")) as srcO, \ 117 | open(os.path.join(args.output_dir, "data", "tgt-train.txt")) as tgtO: 118 | 119 | for srcLine, tgtLine in zip(srcO, tgtO): 120 | if sampledCount < args.validation_samples: 121 | if np.random.random() > .5: 122 | print(srcLine.strip(), file=srcV) 123 | print(tgtLine.strip(), file=tgtV) 124 | sampledCount += 1 125 | continue 126 | 127 | print(srcLine.strip(), file=srcT) 128 | print(tgtLine.strip(), file=tgtT) 129 | 130 | shutil.move( 131 | os.path.join(args.output_dir, "data", "src-train.txt.backup"), 132 | os.path.join(args.output_dir, "data", "src-train.txt") 133 | ) 134 | shutil.move( 135 | os.path.join(args.output_dir, "data", "tgt-train.txt.backup"), 136 | os.path.join(args.output_dir, "data", "tgt-train.txt") 137 | ) 138 | 139 | 140 | if args.do_eval: 141 | _move(args, "valid") 142 | _move(args, "test") 143 | 144 | def _lc(input_file): 145 | lc = 0 146 | with open(input_file) as f: 147 | for _ in f: 148 | lc += 1 149 | return lc 150 | 151 | 152 | def spmOperate(args, fileType, tokenize): 153 | if tokenize: 154 | modelName = os.path.join(args.output_dir, "Preprocessed", f"{fileType}SPM.model") 155 | input_files = glob.glob(os.path.join(args.output_dir, "Outputs", f'*{fileType}-*')) 156 | 157 | for input_file in input_files: 158 | spm_cmd = [ 159 | f"spm_encode --model=\"{modelName}\"", 160 | f"--output_format=piece", 161 | f"< \"{input_file}\" > \"{input_file}.tok\"" 162 | ] 163 | os.system(" ".join(spm_cmd)) 164 | os.remove(input_file) 165 | 166 | else: 167 | modelName = os.path.join(args.output_dir, "Preprocessed", f"tgtSPM.model") 168 | for input_file in glob.glob(os.path.join(args.output_dir, "Outputs", f'*{fileType}-*.tok')): 169 | spm_cmd = [ 170 | f"spm_decode --model=\"{modelName}\"", 171 | f"< \"{input_file}\" > \"{'.detok'.join(input_file.rsplit('.tok', 1))}\"" 172 | ] 173 | os.system(" ".join(spm_cmd)) 174 | os.remove(input_file) 175 | 
post_cmd = f"""sed 's/▁/ /g;s/ */ /g' -i \"{'.detok'.join(input_file.rsplit('.tok', 1))}\"""" 176 | os.system(post_cmd) 177 | 178 | 179 | def tokenize(args): 180 | spmOperate(args, 'src', tokenize=True) 181 | spmOperate(args, 'tgt', tokenize=True) 182 | 183 | def detokenize(args): 184 | spmOperate(args, 'tgt', tokenize=False) 185 | spmOperate(args, 'pred', tokenize=False) 186 | 187 | def processData(args, tokenization): 188 | if tokenization: 189 | createFolders(args) 190 | moveRawData(args) 191 | tokenize(args) 192 | else: 193 | detokenize(args) 194 | 195 | -------------------------------------------------------------------------------- /training/seq2seq/multi-bleu-detok.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 5 | 6 | # This file uses the internal tokenization of mteval-v13a.pl, 7 | # giving the exact same (case-sensitive) results on untokenized text. 8 | # Using this script with detokenized output and untokenized references is 9 | # preferrable over multi-bleu.perl, since scores aren't affected by tokenization differences. 10 | # 11 | # like multi-bleu.perl , it supports plain text input and multiple references. 12 | 13 | # This file is retrieved from Moses Decoder :: https://github.com/moses-smt/mosesdecoder 14 | # $Id$ 15 | use warnings; 16 | use strict; 17 | 18 | my $lowercase = 0; 19 | if ($ARGV[0] eq "-lc") { 20 | $lowercase = 1; 21 | shift; 22 | } 23 | 24 | my $stem = $ARGV[0]; 25 | if (!defined $stem) { 26 | print STDERR "usage: multi-bleu-detok.pl [-lc] reference < hypothesis\n"; 27 | print STDERR "Reads the references from reference or reference0, reference1, ...\n"; 28 | exit(1); 29 | } 30 | 31 | $stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0"; 32 | 33 | my @REF; 34 | my $ref=0; 35 | while(-e "$stem$ref") { 36 | &add_to_ref("$stem$ref",\@REF); 37 | $ref++; 38 | } 39 | &add_to_ref($stem,\@REF) if -e $stem; 40 | die("ERROR: could not find reference file $stem") unless scalar @REF; 41 | 42 | # add additional references explicitly specified on the command line 43 | shift; 44 | foreach my $stem (@ARGV) { 45 | &add_to_ref($stem,\@REF) if -e $stem; 46 | } 47 | 48 | 49 | 50 | sub add_to_ref { 51 | my ($file,$REF) = @_; 52 | my $s=0; 53 | if ($file =~ /.gz$/) { 54 | open(REF,"gzip -dc $file|") or die "Can't read $file"; 55 | } else { 56 | open(REF,$file) or die "Can't read $file"; 57 | } 58 | while() { 59 | chop; 60 | $_ = tokenization($_); 61 | push @{$$REF[$s++]}, $_; 62 | } 63 | close(REF); 64 | } 65 | 66 | my(@CORRECT,@TOTAL,$length_translation,$length_reference); 67 | my $s=0; 68 | while() { 69 | chop; 70 | $_ = lc if $lowercase; 71 | $_ = tokenization($_); 72 | my @WORD = split; 73 | my %REF_NGRAM = (); 74 | my $length_translation_this_sentence = scalar(@WORD); 75 | my ($closest_diff,$closest_length) = (9999,9999); 76 | foreach my $reference (@{$REF[$s]}) { 77 | # print "$s $_ <=> $reference\n"; 78 | $reference = lc($reference) if $lowercase; 79 | my @WORD = split(' ',$reference); 80 | my $length = scalar(@WORD); 81 | my $diff = abs($length_translation_this_sentence-$length); 82 | if ($diff < $closest_diff) { 83 | $closest_diff = $diff; 84 | $closest_length = $length; 85 | # print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." 
= abs($length_translation_this_sentence-$length), setting len: $closest_length\n"; 86 | } elsif ($diff == $closest_diff) { 87 | $closest_length = $length if $length < $closest_length; 88 | # from two references with the same closeness to me 89 | # take the *shorter* into account, not the "first" one. 90 | } 91 | for(my $n=1;$n<=4;$n++) { 92 | my %REF_NGRAM_N = (); 93 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 94 | my $ngram = "$n"; 95 | for(my $w=0;$w<$n;$w++) { 96 | $ngram .= " ".$WORD[$start+$w]; 97 | } 98 | $REF_NGRAM_N{$ngram}++; 99 | } 100 | foreach my $ngram (keys %REF_NGRAM_N) { 101 | if (!defined($REF_NGRAM{$ngram}) || 102 | $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) { 103 | $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram}; 104 | # print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram}
\n"; 105 | } 106 | } 107 | } 108 | } 109 | $length_translation += $length_translation_this_sentence; 110 | $length_reference += $closest_length; 111 | for(my $n=1;$n<=4;$n++) { 112 | my %T_NGRAM = (); 113 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 114 | my $ngram = "$n"; 115 | for(my $w=0;$w<$n;$w++) { 116 | $ngram .= " ".$WORD[$start+$w]; 117 | } 118 | $T_NGRAM{$ngram}++; 119 | } 120 | foreach my $ngram (keys %T_NGRAM) { 121 | $ngram =~ /^(\d+) /; 122 | my $n = $1; 123 | # my $corr = 0; 124 | # print "$i e $ngram $T_NGRAM{$ngram}
\n"; 125 | $TOTAL[$n] += $T_NGRAM{$ngram}; 126 | if (defined($REF_NGRAM{$ngram})) { 127 | if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) { 128 | $CORRECT[$n] += $T_NGRAM{$ngram}; 129 | # $corr = $T_NGRAM{$ngram}; 130 | # print "$i e correct1 $T_NGRAM{$ngram}
\n"; 131 | } 132 | else { 133 | $CORRECT[$n] += $REF_NGRAM{$ngram}; 134 | # $corr = $REF_NGRAM{$ngram}; 135 | # print "$i e correct2 $REF_NGRAM{$ngram}
\n"; 136 | } 137 | } 138 | # $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram}; 139 | # print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n" 140 | } 141 | } 142 | $s++; 143 | } 144 | my $brevity_penalty = 1; 145 | my $bleu = 0; 146 | 147 | my @bleu=(); 148 | 149 | for(my $n=1;$n<=4;$n++) { 150 | if (defined ($TOTAL[$n])){ 151 | $bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0; 152 | # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n"; 153 | }else{ 154 | $bleu[$n]=0; 155 | } 156 | } 157 | 158 | if ($length_reference==0){ 159 | printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n"; 160 | exit(1); 161 | } 162 | 163 | if ($length_translation<$length_reference) { 164 | $brevity_penalty = exp(1-$length_reference/$length_translation); 165 | } 166 | $bleu = $brevity_penalty * exp((my_log( $bleu[1] ) + 167 | my_log( $bleu[2] ) + 168 | my_log( $bleu[3] ) + 169 | my_log( $bleu[4] ) ) / 4) ; 170 | printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n", 171 | 100*$bleu, 172 | 100*$bleu[1], 173 | 100*$bleu[2], 174 | 100*$bleu[3], 175 | 100*$bleu[4], 176 | $brevity_penalty, 177 | $length_translation / $length_reference, 178 | $length_translation, 179 | $length_reference; 180 | 181 | sub my_log { 182 | return -9999999999 unless $_[0]; 183 | return log($_[0]); 184 | } 185 | 186 | 187 | 188 | sub tokenization 189 | { 190 | my ($norm_text) = @_; 191 | 192 | # language-independent part: 193 | $norm_text =~ s///g; # strip "skipped" tags 194 | $norm_text =~ s/-\n//g; # strip end-of-line hyphenation and join lines 195 | $norm_text =~ s/\n/ /g; # join lines 196 | $norm_text =~ s/"/"/g; # convert SGML tag for quote to " 197 | $norm_text =~ s/&/&/g; # convert SGML tag for ampersand to & 198 | $norm_text =~ s/</ 199 | $norm_text =~ s/>/>/g; # convert SGML tag for greater-than to < 200 | 201 | # language-dependent part (assuming Western languages): 202 | $norm_text = " $norm_text "; 203 | $norm_text =~ s/([\{-\~\[-\` -\&\(-\+\:-\@\/])/ $1 /g; # tokenize punctuation 204 | $norm_text =~ s/([^0-9])([\.,])/$1 $2 /g; # tokenize period and comma unless preceded by a digit 205 | $norm_text =~ s/([\.,])([^0-9])/ $1 $2/g; # tokenize period and comma unless followed by a digit 206 | $norm_text =~ s/([0-9])(-)/$1 $2 /g; # tokenize dash when preceded by a digit 207 | $norm_text =~ s/\s+/ /g; # one space only between words 208 | $norm_text =~ s/^\s+//; # no leading space 209 | $norm_text =~ s/\s+$//; # no trailing space 210 | 211 | return $norm_text; 212 | } 213 | -------------------------------------------------------------------------------- /training/seq2seq/pipeline.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import subprocess 4 | import traceback 5 | import time 6 | import shutil 7 | import argparse 8 | import glob 9 | import json 10 | from dataProcessor import processData 11 | 12 | FILEDIR = os.path.dirname(__file__) 13 | 14 | def train(args): 15 | data_map = { 16 | "train": { 17 | "path_src": os.path.join(args.output_dir, "data", "src-train.txt"), 18 | "path_tgt": os.path.join(args.output_dir, "data", "tgt-train.txt"), 19 | "transforms": ["sentencepiece", "filtertoolong"], 20 | "weight": 1 21 | }, 22 | "valid": { 23 | "path_src": os.path.join(args.output_dir, "data", "src-valid.txt"), 24 | "path_tgt": os.path.join(args.output_dir, "data", "tgt-valid.txt"), 25 | "transforms": ["sentencepiece", "filtertoolong"] 26 | } 27 | } 28 | cmd = f''' 29 | onmt_train \ 30 | 
-data \"{json.dumps(data_map)}\" \ 31 | -src_vocab \"{os.path.join(args.output_dir, "Preprocessed", "srcSPM.vocab")}\" \ 32 | -tgt_vocab \"{os.path.join(args.output_dir, "Preprocessed", "tgtSPM.vocab")}\" \ 33 | -src_subword_type sentencepiece \ 34 | -tgt_subword_type sentencepiece \ 35 | -src_subword_model \"{os.path.join(args.output_dir, "Preprocessed", "srcSPM.model")}\" \ 36 | -tgt_subword_model \"{os.path.join(args.output_dir, "Preprocessed", "tgtSPM.model")}\" \ 37 | -src_subword_nbest {args.nbest} \ 38 | -src_subword_alpha {args.alpha} \ 39 | -tgt_subword_nbest {args.nbest} \ 40 | -tgt_subword_alpha {args.alpha} \ 41 | -src_seq_length {args.src_seq_length} \ 42 | -tgt_seq_length {args.tgt_seq_length} \ 43 | -save_model \"{os.path.join(args.output_dir, "Models", args.model_prefix)}\" \ 44 | -layers {args.layers} -rnn_size {args.rnn_size} -word_vec_size {args.word_vec_size} -transformer_ff {args.transformer_ff} -heads {args.heads} \ 45 | -encoder_type transformer -decoder_type transformer -position_encoding \ 46 | -train_steps {args.train_steps} -max_generator_batches 2 -dropout 0.1 \ 47 | -batch_size {args.train_batch_size} -batch_type tokens -normalization tokens -accum_count {args.gradient_accum} \ 48 | -queue_size 10000 -bucket_size 32768 \ 49 | -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps {args.warmup_steps} -learning_rate {args.learning_rate} \ 50 | -max_grad_norm 0 -param_init 0 -param_init_glorot \ 51 | -share_decoder_embeddings \ 52 | -label_smoothing 0.1 -valid_steps {args.valid_steps} -save_checkpoint_steps {args.save_checkpoint_steps} \ 53 | -world_size {args.world_size} -gpu_ranks {" ".join(args.gpu_ranks)} {"-train_from " + args.train_from if args.train_from else ""} 54 | ''' 55 | os.system(cmd) 56 | 57 | def average_models(args): 58 | step_count = lambda p: int(re.search(r"_step_(\d+)", p).group(1)) 59 | model_paths = sorted( 60 | glob.glob(os.path.join(args.output_dir, "Models", f"{args.model_prefix}*.pt")), 61 | key=step_count 62 | ) 63 | if len(model_paths) > args.average_last: 64 | model_paths = model_paths[-args.average_last:] 65 | output_path = ( 66 | model_paths[0].rsplit("_step_")[0] + 67 | f"_step_{step_count(model_paths[0])}-{step_count(model_paths[-1])}-{args.average_last}.pt" 68 | ) 69 | 70 | model_paths = [f"\"{k}\"" for k in model_paths] 71 | cmd = [ 72 | f"onmt_average_models", 73 | f"-models {' '.join(model_paths)}", 74 | f"-output \"{output_path}\"" 75 | ] 76 | os.system(" ".join(cmd)) 77 | 78 | def _translate(args, modelName, inputFile, outputFile): 79 | cmd = f''' 80 | onmt_translate \ 81 | -model \"{modelName}\" \ 82 | -src \"{inputFile}\" \ 83 | -output \"{outputFile}\" \ 84 | -replace_unk copy -verbose -max_length {args.tgt_seq_length} -batch_size {args.eval_batch_size} -gpu 0 85 | ''' 86 | os.system(cmd) 87 | 88 | def translate(model_path, dataset_category, args): 89 | src_lines, src_map = [], {} 90 | for src_file in glob.glob(os.path.join(args.output_dir, "Outputs", f'*src-{dataset_category}.txt.tok')): 91 | with open(src_file) as f: 92 | lines = f.readlines() 93 | src_map[src_file] = len(lines) 94 | src_lines.extend(lines) 95 | 96 | merged_src_file = os.path.join(args.output_dir, "temp", "merged.src") 97 | merged_tgt_file = os.path.join(args.output_dir, "temp", "merged.tgt") 98 | 99 | with open(merged_src_file, 'w') as f: 100 | for line in src_lines: 101 | print(line.strip(), file=f) 102 | 103 | _translate(args, model_path, merged_src_file, merged_tgt_file) 104 | 105 | with open(merged_tgt_file) as inpf: 106 | idx = 0 107 | 
lines = inpf.readlines() 108 | 109 | for src_file in src_map: 110 | pred_file = f"pred-{dataset_category}.txt.tok".join( 111 | src_file.rsplit( 112 | f"src-{dataset_category}.txt.tok", 1 113 | ) 114 | ) 115 | 116 | with open(pred_file, 'w') as outf: 117 | for _ in range(src_map[src_file]): 118 | print(lines[idx].strip(), file=outf) 119 | idx += 1 120 | 121 | os.remove(merged_src_file) 122 | os.remove(merged_tgt_file) 123 | 124 | def calculate_scores(args, dataset_category): 125 | scores = [] 126 | for pred_file in glob.glob(os.path.join(args.output_dir, "Outputs", f'*pred-{dataset_category}.txt.detok')): 127 | dataset_name = os.path.basename(pred_file).rsplit( 128 | f".pred-{dataset_category}.txt.detok", 1 129 | )[0] 130 | 131 | tgt_file_prefix = f".tgt-{dataset_category}.txt.*detok".join( 132 | pred_file.rsplit( 133 | f".pred-{dataset_category}.txt.detok", 1 134 | ) 135 | ) 136 | tgt_files = glob.glob(tgt_file_prefix) 137 | if tgt_files: 138 | bleu_cmd = [ 139 | f"perl \"{os.path.join(FILEDIR, 'multi-bleu-detok.perl')}\"", 140 | f"-lc {' '.join(tgt_files)} < \"{pred_file}\"" 141 | ] 142 | sacre_cmd = [ 143 | f"cat \"{pred_file}\"", 144 | "|", 145 | f"sacrebleu {' '.join(tgt_files)}" 146 | ] 147 | 148 | try: 149 | bleu_output = str(subprocess.check_output(" ".join(bleu_cmd), shell=True)).strip() 150 | bleu_score = bleu_output.splitlines()[-1].split(",")[0].split("=")[1] 151 | except: 152 | bleu_score = -1 153 | 154 | try: 155 | sacre_output = str(subprocess.check_output(" ".join(sacre_cmd), shell=True)).strip() 156 | sacre_score = sacre_output.splitlines()[-1].split("=")[1].split()[0] 157 | except: 158 | sacre_score = -1 159 | 160 | scores.append( 161 | { 162 | "dataset": dataset_name, 163 | "bleu": bleu_score, 164 | "sacrebleu": sacre_score 165 | } 166 | ) 167 | 168 | return scores 169 | 170 | def write_scores(scores, output_path): 171 | with open(output_path, 'w') as f: 172 | for model_name in scores: 173 | print(model_name, ":", file=f) 174 | for dataset_score in scores[model_name]: 175 | print( 176 | "", 177 | f"Dataset: {dataset_score['dataset']},", 178 | f"BLEU: {dataset_score['bleu']},", 179 | f"SACREBLEU: {dataset_score['sacrebleu']},", 180 | sep="\t", 181 | file=f 182 | ) 183 | 184 | def evaluate(args): 185 | if args.model_prefix: 186 | model_paths = sorted( 187 | glob.glob(os.path.join(args.output_dir, "Models", f"{args.model_prefix}*.pt")), 188 | key=lambda p: int(re.search(r"_step_(\d+)", p).group(1)) 189 | ) 190 | model_scores = {} 191 | for model_path in model_paths: 192 | translate(model_path, "valid", args) 193 | processData(args, False) 194 | scores = calculate_scores(args, "valid") 195 | model_scores[os.path.basename(model_path)] = scores 196 | 197 | write_scores( 198 | model_scores, 199 | os.path.join( 200 | args.output_dir, 201 | "Reports", f"{args.model_prefix}.valid.{args.src_lang}2{args.tgt_lang}.log" 202 | ) 203 | ) 204 | 205 | if args.eval_model: 206 | model_scores = {} 207 | translate(args.eval_model, "test", args) 208 | processData(args, False) 209 | scores = calculate_scores(args, "test") 210 | model_scores[os.path.basename(args.eval_model)] = scores 211 | 212 | write_scores( 213 | model_scores, 214 | os.path.join( 215 | args.output_dir, "Reports", f"{os.path.basename(args.eval_model)}.test.{args.src_lang}2{args.tgt_lang}.log" 216 | ) 217 | ) 218 | 219 | 220 | def main(args): 221 | processData(args, True) 222 | if args.do_train: 223 | train(args) 224 | if args.model_prefix and args.average_last: 225 | average_models(args) 226 | if args.do_eval: 227 | 
219 |
220 | def main(args):
221 |     processData(args, True)
222 |     if args.do_train:
223 |         train(args)
224 |     if args.model_prefix and args.average_last:
225 |         average_models(args)
226 |     if args.do_eval:
227 |         evaluate(args)
228 |
229 |
230 | if __name__ == "__main__":
231 |     parser = argparse.ArgumentParser()
232 |     parser.add_argument(
233 |         '--input_dir', '-i', type=str,
234 |         required=True,
235 |         metavar='PATH',
236 |         help="Input directory")
237 |
238 |     parser.add_argument(
239 |         '--output_dir', '-o', type=str,
240 |         required=True,
241 |         metavar='PATH',
242 |         help="Output directory")
243 |
244 |     parser.add_argument(
245 |         '--src_lang', type=str,
246 |         required=True,
247 |         help="Source language")
248 |
249 |     parser.add_argument(
250 |         '--tgt_lang', type=str,
251 |         required=True,
252 |         help="Target language")
253 |
254 |     parser.add_argument(
255 |         '--validation_samples', type=int, default=5000,
256 |         help='number of validation samples to hold out from the training data when no validation set is present')
257 |
258 |     parser.add_argument(
259 |         '--src_seq_length', type=int, default=200,
260 |         help='maximum source sequence length')
261 |
262 |     parser.add_argument(
263 |         '--tgt_seq_length', type=int, default=200,
264 |         help='maximum target sequence length')
265 |
266 |     parser.add_argument(
267 |         '--model_prefix', type=str,
268 |         help='Prefix of the model to save')
269 |
270 |     parser.add_argument(
271 |         '--eval_model', type=str, metavar="PATH",
272 |         help='Path to the specific model to evaluate')
273 |
274 |     parser.add_argument(
275 |         '--train_steps', type=int, default=120000,
276 |         help='number of training steps')
277 |
278 |     parser.add_argument(
279 |         '--train_batch_size', type=int, default=12288,
280 |         help='training batch size (in tokens)')
281 |
282 |     parser.add_argument(
283 |         '--eval_batch_size', type=int, default=8,
284 |         help='evaluation batch size (in sentences)')
285 |
286 |     parser.add_argument(
287 |         '--gradient_accum', type=int, default=2,
288 |         help='gradient accumulation count')
289 |
290 |     parser.add_argument(
291 |         '--warmup_steps', type=int, default=4000,
292 |         help='learning rate warmup steps')
293 |
294 |     parser.add_argument(
295 |         '--learning_rate', type=int, default=2,
296 |         help='learning rate')
297 |
298 |     parser.add_argument(
299 |         '--layers', type=int, default=6,
300 |         help='number of encoder/decoder layers')
301 |
302 |     parser.add_argument(
303 |         '--rnn_size', type=int, default=512,
304 |         help='model hidden size')
305 |
306 |     parser.add_argument(
307 |         '--word_vec_size', type=int, default=512,
308 |         help='word vector size')
309 |
310 |     parser.add_argument(
311 |         '--transformer_ff', type=int, default=2048,
312 |         help='transformer feed-forward size')
313 |
314 |     parser.add_argument(
315 |         '--heads', type=int, default=8,
316 |         help='number of attention heads')
317 |
318 |     parser.add_argument(
319 |         '--valid_steps', type=int, default=2000,
320 |         help='validation interval (in steps)')
321 |
322 |     parser.add_argument(
323 |         '--save_checkpoint_steps', type=int, default=1000,
324 |         help='checkpoint saving interval (in steps)')
325 |
326 |     parser.add_argument(
327 |         '--average_last', type=int, default=20,
328 |         help='number of last checkpoints to average')
329 |
330 |     parser.add_argument(
331 |         '--world_size', type=int, default=4,
332 |         help='total number of distributed processes (world size)')
333 |
334 |     parser.add_argument(
335 |         '--gpu_ranks', type=str, nargs="*", default=["0", "1", "2", "3"],
336 |         help='GPU ranks to train on')
337 |
338 |     parser.add_argument(
339 |         '--train_from', type=str, default="",
340 |         help='start training from this checkpoint')
341 |
342 |     parser.add_argument('--do_train', action='store_true',
343 |         help='Run training')
344 |     parser.add_argument('--do_eval', action='store_true',
345 |         help='Run evaluation')
346 |
347 |     parser.add_argument(
348 |         '--nbest', type=int, default=32,
349 |         help='sentencepiece nbest size')
350 |     parser.add_argument(
351 |         '--alpha', type=float, default=0.1,
352 |         help='sentencepiece sampling smoothing alpha')
353 |
354 |     args = parser.parse_args()
355 |     main(args)
356 |
357 |
358 |
359 |
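For orientation, a hypothetical end-to-end invocation assembled only from the flags defined above is sketched below. The directory names, language codes, and model prefix are placeholders, and the expected layout of the input directory (parallel data plus sentencepiece vocab files) follows the sample under `sample_input_dir/`; this is an illustrative sketch, not a prescribed command.

```python
# Build an example command line for pipeline.py from the argparse flags above.
# All paths and the "bn2en" prefix are illustrative placeholders.
cmd = " ".join([
    "python pipeline.py",
    "-i path/to/input_dir -o path/to/output_dir",
    "--src_lang bn --tgt_lang en",
    "--model_prefix bn2en",
    "--world_size 1 --gpu_ranks 0",
    "--do_train --do_eval",
])
print(cmd)
```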
--------------------------------------------------------------------------------
/training/seq2seq/requirements.txt:
--------------------------------------------------------------------------------
1 | git+https://github.com/abhik1505040/OpenNMT-py
2 | pyrouge
3 | git+https://github.com/NVIDIA/apex.git@700d6825e205732c1d6be511306ca4e595297070
4 | sentencepiece>=0.1.94
5 | subword-nmt>=0.3.7
6 | sacrebleu==1.4.2
--------------------------------------------------------------------------------
/training/seq2seq/sample_input_dir/vocab/bn.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/csebuetnlp/banglanmt/361801040950e5a50ddf51d4c02d36269fe5dd91/training/seq2seq/sample_input_dir/vocab/bn.model
--------------------------------------------------------------------------------
/training/seq2seq/sample_input_dir/vocab/en.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/csebuetnlp/banglanmt/361801040950e5a50ddf51d4c02d36269fe5dd91/training/seq2seq/sample_input_dir/vocab/en.model
--------------------------------------------------------------------------------
/vocab.tar.bz2:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/csebuetnlp/banglanmt/361801040950e5a50ddf51d4c02d36269fe5dd91/vocab.tar.bz2
--------------------------------------------------------------------------------